VTR file format specification
The VTR file format (with the .vtr extension) contains the transcript text and other information from speech analysis. It is a human-readable JSON text file compressed with Zip. Non-ASCII characters are encoded in UTF-8.
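A minimal Python sketch of reading such a file is shown here. It assumes the Zip archive holds a single JSON member; the member name is not defined in this specification, so the first entry is read. The helper name `load_vtr` is hypothetical.

```python
import json
import zipfile


def load_vtr(path):
    """Return the parsed JSON document stored in a .vtr file.

    Assumption: the archive holds one JSON member. Its name is not
    defined by this specification, so the first entry is read.
    """
    with zipfile.ZipFile(path) as archive:
        member = archive.namelist()[0]
        raw = archive.read(member)
    # Non-ASCII characters are encoded in UTF-8.
    return json.loads(raw.decode("utf-8"))
```

Calling `load_vtr("call.vtr")` would then return a dictionary whose keys are the top-level objects of the document.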
Objects
"provider": the engine used to create the transcript (e.g. "intelligentvoice", "speechmatics")
"language": comma-separated list of languages defined or detected in the transcript
"speakers": list of speakers with id, name and sentiment values
"topics": list of topics identified in the transcript with score, positions and other information
"words": list of words in chronological order with position, time, speaker, alternatives and other information
"tcus": list of turn construction units with text, speaker, time and sentiment value
speakers
The speakers property is an array of speaker objects.
speaker object example:
{ "id": "Speaker 1", "iv_id": 1231, "iv_label": "Channel 1", "sentiment": { "positiveAggregatedSentiment": 26.66919087, "negativeAggregatedSentiment": -75.60949249, "sentimentGradient": 25.35896311, "normalisedSentimentGradient": 13.05775275, "sentimentIntercept": -13.36524594, "sentimentOutcome": 12.75025956 } }
property | description |
---|---|
id | String. An identifier that is local to the current conversation. A speaker is referred to by this id throughout the rest of the document. |
iv_id | Number. Specific to Intelligent Voice. A unique global id within the IV database that identifies the speaker. |
iv_label | String. Specific to Intelligent Voice. A label given to the speaker during the diarisation process. |
sentiment | Object. Specific to Intelligent Voice. If sentiment processing was enabled for the transcript, the sentiment object contains the sentiment values calculated for this speaker in this conversation. See /wiki/spaces/VID/pages/950433 for an explanation of sentiment values. Sentiment properties, as shown in the example: positiveAggregatedSentiment, negativeAggregatedSentiment, sentimentGradient, normalisedSentimentGradient, sentimentIntercept, sentimentOutcome. |
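Because other objects refer to a speaker by its conversation-local id, a lookup table keyed on that id can be convenient. The following is a sketch only; `index_speakers` is a hypothetical helper, and treating a missing `sentiment` key as "sentiment processing not enabled" is an assumption.

```python
def index_speakers(doc):
    """Map each speaker's conversation-local id to its speaker object."""
    return {speaker["id"]: speaker for speaker in doc.get("speakers", [])}


doc = {"speakers": [{"id": "Speaker 1", "iv_id": 1231, "iv_label": "Channel 1"}]}
speakers = index_speakers(doc)
# Assumption: the sentiment object is absent when sentiment
# processing was not enabled, so .get() is used instead of [].
sentiment = speakers["Speaker 1"].get("sentiment")
```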
topics
The topics property is an array of topic objects.
topic object example:
{ "topic": "Government Press Office", "score": 0.08300000, "positionInView": 1, "length": 0, "id": 14562, "rawscore": 1.00000000, "seektime": 0, "status": 0, "tagID": 973, "position": [{ "order": 1, "wordIndex": 376, "timestamp": 107.40000000, "offset": 0 } ] }
property | description |
---|---|
topic | String. A word or phrase identified as a topic by the provider, which is relevant to the conversation. |
score | Real number in the range of 0..100. It shows how relevant the topic is to the conversation, as determined by the transcript provider. The value is normalized; 100.0 is the maximum relevance. |
positionInView | Number. A unique position of the topic within the document. |
length | Number. Always zero. |
id | Number. A global id for this topic in this conversation. |
rawscore | Real number. It shows how relevant the topic is to the conversation, as determined by the transcript provider. |
seektime | Number. Always zero. |
status | Number. Always zero. |
tagID | Number. A global id for this topic across all conversations. |
position | Array of position objects. Properties of the position object, as shown in the example: order, wordIndex, timestamp, offset. |
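As an illustration of how the topic properties might be used together, the sketch below ranks topics by `score` and reports the timestamp of each topic's first position. The helper `top_topics` is hypothetical, and reading `order` as the sequence number of an occurrence is an assumption, since the position properties are not described here.

```python
def top_topics(doc, limit=5):
    """Return (topic, score, first timestamp) tuples, highest score first."""
    ranked = sorted(doc.get("topics", []), key=lambda t: t["score"], reverse=True)
    result = []
    for topic in ranked[:limit]:
        # Assumption: "order" numbers the occurrences of the topic.
        positions = sorted(topic.get("position", []), key=lambda p: p["order"])
        first_time = positions[0]["timestamp"] if positions else None
        result.append((topic["topic"], topic["score"], first_time))
    return result
```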
words
The words property is an array of word objects. It contains the complete transcription of the conversation, split into words in chronological order.
word object example:
{ "word": "to", "confidence": 0.55500000, "speaker": "Speaker 2", "speakerName": "3", "speakerId": 1002294, "time": 12.60000000, "duration": 0.04000000, "alternatives": [{ "word": "the", "confidence": 0.28853413 }, { "word": "they", "confidence": 0.15478531 }, { "word": "<eps>", "confidence": 0.00215280 } ] }
property | description |
---|---|
word | String. The word transcribed by the provider with the highest confidence. It usually contains trailing punctuation marks. |
confidence | Real number in the range of 0..1. It shows how confident the provider is that the word is actually the one spoken in the conversation. 1.0 is the highest confidence. |
speaker | String. Display name of the speaker if diarisation is enabled. It is the |
speakerName | String. |
speakerId | Number. |
time | Real number. Timestamp in seconds where the word is spoken in the conversation. |
duration | Real number. Duration in seconds of the spoken word in the conversation. |
alternatives | Array of alternative words. Properties of each entry, as shown in the example: word, confidence. |
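Since the words are stored in chronological order and usually carry their own trailing punctuation, the transcript text can be approximated by joining them with spaces. This is a sketch under that assumption; `words_to_text` and the bracket notation for low-confidence words are hypothetical choices, not part of the format.

```python
def words_to_text(doc, min_confidence=0.0):
    """Join the words into one string, marking low-confidence words."""
    parts = []
    for entry in doc.get("words", []):
        token = entry["word"]
        if entry["confidence"] < min_confidence:
            # Hypothetical marker for words below the threshold.
            token = "[" + token + "?]"
        parts.append(token)
    return " ".join(parts)
```

With `min_confidence=0.6`, the word "to" from the example above (confidence 0.555) would be rendered as `[to?]`.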
tcus
The tcus property is an array of tcu objects. A TCU (Turn Construction Unit) is a sentence or similar snippet of a conversation, delimited by punctuation, a speaker change or other means.
tcu object example:
{ "text": "How'd you get on today? Today", "speaker": "10", "startTime": 0.16000000, "endTime": 2.00000000, "sentiment": 0.01980386 }
property | description |
---|---|
text | String. Transcription of the tcu. |
speaker | String. |
startTime | Real number. Timestamp in seconds where the tcu begins in the conversation. |
endTime | Real number. Timestamp in seconds where the tcu ends in the conversation. |
sentiment | Real number. Sentiment value of the tcu if sentiment processing was enabled for the transcription. See /wiki/spaces/VID/pages/950433 for details. |
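The tcu properties lend themselves to a simple timestamped rendering of the conversation. The sketch below is one possible layout; `format_tcus` and the line format are hypothetical.

```python
def format_tcus(doc):
    """Render each tcu as '[start-end] speaker: text'."""
    lines = []
    for tcu in doc.get("tcus", []):
        lines.append("[{:.2f}-{:.2f}] {}: {}".format(
            tcu["startTime"], tcu["endTime"], tcu["speaker"], tcu["text"]))
    return lines
```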