Verba transcript file (.vtr) format

VTR file format specification

The VTR file format (with the .vtr extension) contains the transcript text and other information from speech analysis. It is a human-readable, JSON-formatted text file compressed with Zip. Non-ASCII characters are encoded as UTF-8.
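As a minimal sketch of reading such a file in Python: the code below assumes that "compressed with Zip" means a standard Zip archive holding a single JSON member. The member name is not given in this specification, so the first member is used, and the name `transcript.json` in the demonstration is purely hypothetical.

```python
import io
import json
import zipfile


def load_vtr(source):
    """Read a .vtr file (path or binary file-like object) and return parsed JSON.

    Assumption: the Zip archive holds exactly one JSON member. Its name is
    not specified, so the first member in the archive is used.
    """
    with zipfile.ZipFile(source) as archive:
        member = archive.namelist()[0]
        with archive.open(member) as fh:
            return json.loads(fh.read().decode("utf-8"))


# Build a tiny in-memory archive to exercise the loader.
# "transcript.json" is a hypothetical member name.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr(
        "transcript.json",
        json.dumps({"provider": "intelligentvoice", "words": []}),
    )
buffer.seek(0)

transcript = load_vtr(buffer)
print(transcript["provider"])  # prints "intelligentvoice"
```

In practice `load_vtr` would be called with the path of a real .vtr file instead of the in-memory buffer.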

Objects

"provider": the engine used to create the transcript (e.g. "intelligentvoice", "speechmatics")
"language": comma-separated list of languages defined or detected in the transcript
"speakers": list of speakers with id, name and sentiment values
"topics": list of topics identified in the transcript with score, positions and other information
"words": list of words in chronological order with position, time, speaker, alternatives and other information
"tcus": list of turn construction units with text, speaker, time and sentiment value

speakers

The `speakers` value is an array of speaker objects.

speaker object

example:

{
	"id": "Speaker 1",
	"iv_id": 1231,
	"iv_label": "Channel 1",
	"sentiment": {
		"positiveAggregatedSentiment": 26.66919087,
		"negativeAggregatedSentiment": -75.60949249,
		"sentimentGradient": 25.35896311,
		"normalisedSentimentGradient": 13.05775275,
		"sentimentIntercept": -13.36524594,
		"sentimentOutcome": 12.75025956
	}
}

property	description
`id`	String. An identifier that is local to the current conversation. The speaker is referred to by this id throughout the rest of the document.
`iv_id`	Number. This is specific to Intelligent Voice. This is a unique global id within the IV database which identifies the speaker.
`iv_label`	String. This is specific to Intelligent Voice. This is a label given to the speaker during the diarisation process.
`sentiment`	Object. This is specific to Intelligent Voice. If sentiment processing was enabled for the transcript, the sentiment object contains the sentiment values calculated for this speaker in this conversation. See /wiki/spaces/VID/pages/950433 for an explanation of the sentiment values. Sentiment properties: `positiveAggregatedSentiment`, `negativeAggregatedSentiment`, `sentimentGradient`, `normalisedSentimentGradient`, `sentimentIntercept`, `sentimentOutcome`.
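Because other parts of the document refer to speakers by `id`, a consumer will typically want a lookup table keyed on it. The following is a sketch only: the helper names are hypothetical, the second speaker is invented sample data, and the `sentiment` object is treated as optional since it is only present when sentiment processing was enabled.

```python
def index_speakers(speakers):
    """Map each speaker's local `id` to its speaker object."""
    return {speaker["id"]: speaker for speaker in speakers}


def sentiment_outcome(speaker):
    """Return `sentimentOutcome`, or None when no sentiment object exists."""
    return speaker.get("sentiment", {}).get("sentimentOutcome")


speakers = [
    {
        "id": "Speaker 1",
        "iv_id": 1231,
        "iv_label": "Channel 1",
        "sentiment": {"sentimentOutcome": 12.75025956},
    },
    # Hypothetical second speaker without sentiment data.
    {"id": "Speaker 2", "iv_id": 1232, "iv_label": "Channel 2"},
]

by_id = index_speakers(speakers)
print(by_id["Speaker 1"]["iv_label"])          # prints "Channel 1"
print(sentiment_outcome(by_id["Speaker 2"]))   # prints "None"
```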

topics

The `topics` value is an array of topic objects.

topic object

example:

{
	"topic": "Government Press Office",
	"score": 0.08300000,
	"positionInView": 1,
	"length": 0,
	"id": 14562,
	"rawscore": 1.00000000,
	"seektime": 0,
	"status": 0,
	"tagID": 973,
	"position": [{
			"order": 1,
			"wordIndex": 376,
			"timestamp": 107.40000000,
			"offset": 0
		}
	]
}

property	description
`topic`	String. A word or phrase identified as a topic by the provider, which is relevant to the conversation.
`score`	Real number in the range of 0..100. It shows how relevant the topic is to the conversation, determined by the transcript provider. Normalized value, 100.0 is the maximum relevance.
`positionInView`	Number. A unique position of the topic within the document.
`length`	Number. Always zero.
`id`	Number. A global id for this topic in this conversation.
`rawscore`	Real number. It shows how relevant the topic is to the conversation, determined by the transcript provider.
`seektime`	Number. Always zero.
`status`	Number. Always zero.
`tagID`	Number. A global id for this topic across all conversations.
`position`	Array of `position` objects. A position object represents an occurrence of the topic in the conversation. Properties of the position object: `order`: the order of the position within the list; `wordIndex`: index of the starting word of the topic occurrence in the word list of the transcript; `timestamp`: starting timestamp of the topic occurrence in the transcribed media; `offset`: always 0.
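To illustrate how the `position` array can be used, the sketch below flattens every topic occurrence into a timeline sorted by `timestamp`. The function name is hypothetical, and the sample data is the example topic object shown above, reduced to the fields the function reads.

```python
def topic_timeline(topics):
    """Return (timestamp, topic, wordIndex) tuples for every occurrence,
    ordered by the time at which the occurrence starts."""
    rows = []
    for topic in topics:
        for pos in topic.get("position", []):
            rows.append((pos["timestamp"], topic["topic"], pos["wordIndex"]))
    rows.sort(key=lambda row: row[0])
    return rows


topics = [
    {
        "topic": "Government Press Office",
        "score": 0.083,
        "position": [
            {"order": 1, "wordIndex": 376, "timestamp": 107.4, "offset": 0}
        ],
    }
]

for timestamp, name, word_index in topic_timeline(topics):
    print(f"{timestamp:8.2f}s  word #{word_index}  {name}")
```

The `wordIndex` value can then be used to locate the starting word of the occurrence in the `words` array.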

words

The `words` value is an array of word objects. It contains the complete transcription of the conversation, split into words in chronological order.

word object

example:

{
	"word": "to",
	"confidence": 0.55500000,
	"speaker": "Speaker 2",
	"speakerName": "3",
	"speakerId": 1002294,
	"time": 12.60000000,
	"duration": 0.04000000,
	"alternatives": [{
			"word": "the",
			"confidence": 0.28853413
		}, {
			"word": "they",
			"confidence": 0.15478531
		}, {
			"word": "<eps>",
			"confidence": 0.00215280
		}
	]
}

property	description
`word`	String. A word transcribed by the provider with the highest confidence. It usually contains trailing punctuation marks.
`confidence`	Real number in the range of 0..1. It shows how confident the provider is that the word is actually the one spoken in the conversation. 1.0 is the highest confidence.
`speaker`	String. Display name of the speaker if diarisation is enabled. It is the `iv_label` property of the speaker object in the `speakers` array.
`speakerName`	String. `id` property of the speaker object. This should be used to reference the speaker in the `speakers` array.
`speakerId`	Number. `iv_id` property of the speaker object. Global id of the speaker for this conversation in the IV database.
`time`	Real number. Timestamp in seconds where the word is spoken in the conversation.
`duration`	Real number. Duration in seconds of the spoken word in the conversation.
`alternatives`	Array of `alternative` objects. Alternative transcriptions of the spoken word with lower confidence. Properties of the `alternative` object: `word`: string, alternative transcription of the spoken word; `confidence`: confidence value for this alternative.
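As a sketch of working with the word list, the code below rebuilds the plain transcript text and lists words whose `confidence` falls below a threshold, together with their strongest alternative. The function names and the 0.6 threshold are hypothetical choices made for illustration, and the first sample word is invented; the second is taken from the example above.

```python
def transcript_text(words):
    """Join the words in list order (already chronological) into one string."""
    return " ".join(word["word"] for word in words)


def uncertain_words(words, threshold=0.6):
    """Return (word, best alternative or None) for low-confidence words.

    The threshold is an arbitrary illustrative value, not part of the format.
    """
    result = []
    for word in words:
        if word["confidence"] < threshold:
            best = max(
                word.get("alternatives", []),
                key=lambda alt: alt["confidence"],
                default=None,
            )
            result.append((word["word"], best["word"] if best else None))
    return result


words = [
    # Hypothetical word with no alternatives.
    {"word": "go", "confidence": 0.91, "time": 12.4, "duration": 0.2},
    {
        "word": "to",
        "confidence": 0.555,
        "time": 12.6,
        "duration": 0.04,
        "alternatives": [
            {"word": "the", "confidence": 0.28853413},
            {"word": "they", "confidence": 0.15478531},
            {"word": "<eps>", "confidence": 0.0021528},
        ],
    },
]

print(transcript_text(words))   # prints "go to"
print(uncertain_words(words))   # prints "[('to', 'the')]"
```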

tcus

The `tcus` value is an array of tcu objects. A TCU, or Turn Construction Unit, is a sentence or similar snippet of a conversation, delimited by punctuation, a speaker change or other means.

tcu object

example:

{
	"text": "How'd you get on today? Today",
	"speaker": "10",
	"startTime": 0.16000000,
	"endTime": 2.00000000,
	"sentiment": 0.01980386
}

property	description
`text`	String. Transcription of the tcu.
`speaker`	String. `id` property of the speaker object. This should be used to reference the speaker in the `speakers` array.
`startTime`	Real number. Beginning timestamp in seconds where the tcu is spoken in the conversation.
`endTime`	Real number. Ending timestamp in seconds where the tcu is spoken in the conversation.
`sentiment`	Real number. Sentiment value of the tcu if sentiment processing was enabled for the transcription. See /wiki/spaces/VID/pages/950433 for details.
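One possible use of the tcu list is a per-speaker summary. The sketch below totals speaking time from `startTime` and `endTime` and averages the tcu `sentiment` values. The function name is hypothetical, the second tcu is invented sample data, and the plain average is only an illustrative aggregate: it is not the method the provider uses for the speaker-level sentiment values.

```python
from collections import defaultdict


def summarise_tcus(tcus):
    """Return {speaker id: (seconds spoken, mean tcu sentiment or None)}."""
    seconds = defaultdict(float)
    sentiments = defaultdict(list)
    for tcu in tcus:
        speaker = tcu["speaker"]
        seconds[speaker] += tcu["endTime"] - tcu["startTime"]
        # `sentiment` is only present when sentiment processing was enabled.
        if "sentiment" in tcu:
            sentiments[speaker].append(tcu["sentiment"])
    summary = {}
    for speaker, total in seconds.items():
        values = sentiments[speaker]
        mean = sum(values) / len(values) if values else None
        summary[speaker] = (total, mean)
    return summary


tcus = [
    {
        "text": "How'd you get on today? Today",
        "speaker": "10",
        "startTime": 0.16,
        "endTime": 2.0,
        "sentiment": 0.01980386,
    },
    # Hypothetical tcu from another speaker, without a sentiment value.
    {"text": "Fine, thanks.", "speaker": "11", "startTime": 2.1, "endTime": 3.1},
]

for speaker, (total, mean) in summarise_tcus(tcus).items():
    print(speaker, round(total, 2), mean)
```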