VTR file format specification

The VTR file format (file extension .vtr) contains the transcript text and other information from speech analysis. It is a human-readable, JSON-formatted text file compressed with Zip. Non-ASCII characters are encoded in UTF-8.
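As an illustration of the container described above, the following minimal sketch reads a .vtr file with the Python standard library. The specification does not name the JSON member inside the Zip archive, so reading the first entry is an assumption, as are the function name load_vtr and the member name transcript.json used for the demo.

```python
import io
import json
import zipfile


def load_vtr(source):
    """Return the parsed JSON document stored in a .vtr (Zip) archive.

    `source` may be a file path or a binary file-like object.
    Assumption: the archive holds a single JSON member. Its name is not
    given in the specification, so the first entry is read.
    """
    with zipfile.ZipFile(source) as archive:
        member = archive.namelist()[0]
        raw = archive.read(member)
    # The payload is JSON text encoded as UTF-8.
    return json.loads(raw.decode("utf-8"))


# Demo: build a tiny archive in memory (hypothetical content) and read it back.
buffer = io.BytesIO()
payload = json.dumps(
    {"provider": "intelligentvoice", "language": "en", "words": []},
    ensure_ascii=False,
).encode("utf-8")
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("transcript.json", payload)
buffer.seek(0)

demo = load_vtr(buffer)
```

For a file on disk, load_vtr("call.vtr") would be used the same way, where the path is a placeholder.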

Objects

  • "provider": the engine used to create the transcript (e.g. "intelligentvoice", "speechmatics")

  • "language": comma-separated list of languages defined or detected in the transcript

  • "speakers": list of speakers with id, name and sentiment values

  • "topics": list of topics identified in the transcript with score, positions and other information

  • "words": list of words in chronological order with position, time, speaker, alternatives and other information

  • "tcus": list of turn construction units with text, speaker, time and sentiment value
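The top-level layout can be inspected with a few lines of Python. This is a sketch only: the helper name summarise, the sample values and the language codes "en" and "es" are hypothetical, and the code assumes that any of the list properties may be absent from a given file.

```python
def summarise(doc):
    """Return a small overview of a parsed VTR document (a dict)."""
    # "language" is a comma-separated string, so split it into a list.
    languages = [
        code.strip()
        for code in doc.get("language", "").split(",")
        if code.strip()
    ]
    return {
        "provider": doc.get("provider"),
        "languages": languages,
        "speakers": len(doc.get("speakers", [])),
        "topics": len(doc.get("topics", [])),
        "words": len(doc.get("words", [])),
        "tcus": len(doc.get("tcus", [])),
    }


# Hypothetical document containing only the top-level properties.
sample = {
    "provider": "speechmatics",
    "language": "en, es",
    "speakers": [{"id": "Speaker 1"}],
    "words": [],
}
overview = summarise(sample)
```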

speakers

The speakers property is an array of speaker objects.

speaker object

example:

{
	"id": "Speaker 1",
	"iv_id": 1231,
	"iv_label": "Channel 1",
	"sentiment": {
		"positiveAggregatedSentiment": 26.66919087,
		"negativeAggregatedSentiment": -75.60949249,
		"sentimentGradient": 25.35896311,
		"normalisedSentimentGradient": 13.05775275,
		"sentimentIntercept": -13.36524594,
		"sentimentOutcome": 12.75025956
	}
}

| property | description |
| --- | --- |
| id | String. An identifier that is local to the current conversation. The speaker is referred to by this id throughout the rest of the document. |
| iv_id | Number. Specific to Intelligent Voice. A globally unique id within the IV database that identifies the speaker. |
| iv_label | String. Specific to Intelligent Voice. A label given to the speaker during the diarisation process. |
| sentiment | Object. Specific to Intelligent Voice. If sentiment processing was enabled for the transcript, the sentiment object contains the sentiment values calculated for this speaker in this conversation. See /wiki/spaces/VID/pages/950433 for an explanation of the sentiment values. Properties: positiveAggregatedSentiment, negativeAggregatedSentiment, sentimentGradient, normalisedSentimentGradient, sentimentIntercept, sentimentOutcome. |
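Because other parts of the document refer to speakers by id, a lookup table keyed on that property is a natural first step. The sketch below is one way to build it. The helper names are hypothetical, and it assumes the sentiment object is simply missing when sentiment processing was not enabled.

```python
def index_speakers(doc):
    """Map each speaker's local id to its speaker object."""
    return {speaker["id"]: speaker for speaker in doc.get("speakers", [])}


def sentiment_outcome(speaker):
    """Return sentimentOutcome, or None when no sentiment data is present."""
    # Assumption: the "sentiment" key is absent if sentiment processing was off.
    sentiment = speaker.get("sentiment") or {}
    return sentiment.get("sentimentOutcome")


# Speaker object taken from the example above, plus one without sentiment.
doc = {
    "speakers": [
        {
            "id": "Speaker 1",
            "iv_id": 1231,
            "iv_label": "Channel 1",
            "sentiment": {"sentimentOutcome": 12.75025956},
        },
        {"id": "Speaker 2", "iv_id": 1232, "iv_label": "Channel 2"},
    ]
}
by_id = index_speakers(doc)
outcome = sentiment_outcome(by_id["Speaker 1"])
missing = sentiment_outcome(by_id["Speaker 2"])
```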

topics

The topics property is an array of topic objects.

topic object

example:

{
	"topic": "Government Press Office",
	"score": 0.08300000,
	"positionInView": 1,
	"length": 0,
	"id": 14562,
	"rawscore": 1.00000000,
	"seektime": 0,
	"status": 0,
	"tagID": 973,
	"position": [{
			"order": 1,
			"wordIndex": 376,
			"timestamp": 107.40000000,
			"offset": 0
		}
	]
}

| property | description |
| --- | --- |
| topic | String. A word or phrase that the provider identified as a topic relevant to the conversation. |
| score | Real number in the range 0..100. Relevance of the topic to the conversation, as determined by the transcript provider. The value is normalised; 100.0 is the maximum relevance. |
| positionInView | Number. A unique position of the topic within the document. |
| length | Number. Always zero. |
| id | Number. A global id for this topic in this conversation. |
| rawscore | Real number. Relevance of the topic to the conversation, as determined by the transcript provider. |
| seektime | Number. Always zero. |
| status | Number. Always zero. |
| tagID | Number. A global id for this topic across all conversations. |
| position | Array of position objects. Each position object represents one occurrence of the topic in the conversation. Properties: order (the order of the position within the list), wordIndex (index of the starting word of the topic occurrence in the word list of the transcript), timestamp (starting timestamp of the topic occurrence in the transcribed media), offset (always 0). |
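A common use of the topic list is to rank topics by relevance and jump to their occurrences in the media. The following sketch does this with the score and position properties. The function name is hypothetical, and wordIndex is deliberately left out because the specification does not say whether it is zero-based.

```python
def topic_occurrences(doc):
    """Return (topic, score, timestamps) tuples, most relevant topic first."""
    ranked = sorted(doc.get("topics", []), key=lambda t: t["score"], reverse=True)
    result = []
    for topic in ranked:
        # Keep occurrences in the sequence given by their "order" property.
        positions = sorted(topic.get("position", []), key=lambda p: p["order"])
        timestamps = [p["timestamp"] for p in positions]
        result.append((topic["topic"], topic["score"], timestamps))
    return result


# Topic object taken from the example above.
doc = {
    "topics": [
        {
            "topic": "Government Press Office",
            "score": 0.083,
            "position": [
                {"order": 1, "wordIndex": 376, "timestamp": 107.4, "offset": 0}
            ],
        }
    ]
}
occurrences = topic_occurrences(doc)
```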

words

The words property is an array of word objects. It contains the complete transcription of the conversation, split into words in chronological order.

word object

example:

{
	"word": "to",
	"confidence": 0.55500000,
	"speaker": "Speaker 2",
	"speakerName": "3",
	"speakerId": 1002294,
	"time": 12.60000000,
	"duration": 0.04000000,
	"alternatives": [{
			"word": "the",
			"confidence": 0.28853413
		}, {
			"word": "they",
			"confidence": 0.15478531
		}, {
			"word": "<eps>",
			"confidence": 0.00215280
		}
	]
}

| property | description |
| --- | --- |
| word | String. The word transcribed by the provider with the highest confidence. It usually includes any trailing punctuation marks. |
| confidence | Real number in the range 0..1. How confident the provider is that the word is actually the one spoken in the conversation. 1.0 is the highest confidence. |
| speaker | String. Display name of the speaker if diarisation is enabled. It is the iv_label property of the speaker object in the speakers array. |
| speakerName | String. The id property of the speaker object. This should be used to reference the speaker in the speakers array. |
| speakerId | Number. The iv_id property of the speaker object. Global id of the speaker for this conversation in the IV database. |
| time | Real number. Timestamp in seconds at which the word is spoken in the conversation. |
| duration | Real number. Duration in seconds of the spoken word in the conversation. |
| alternatives | Array of alternative objects. Alternative transcriptions of the spoken word with lower confidence. Properties: word (string, alternative transcription of the spoken word), confidence (confidence value for this alternative). |
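The word list can be joined back into running text, and the confidence and alternatives properties can be used to flag words worth reviewing. The sketch below shows both. The threshold of 0.6 and the helper names are hypothetical, and treating "<eps>" as an empty alternative to be skipped is an assumption that the specification does not confirm.

```python
def transcript_text(words):
    """Join the words in list order into a single string."""
    return " ".join(w["word"] for w in words)


def uncertain_words(words, threshold=0.6):
    """Return (index, word, best_alternative) for low-confidence words."""
    flagged = []
    for index, w in enumerate(words):
        if w["confidence"] >= threshold:
            continue
        # Assumption: "<eps>" marks an empty alternative, so it is ignored.
        candidates = [a for a in w.get("alternatives", []) if a["word"] != "<eps>"]
        best = max(candidates, key=lambda a: a["confidence"], default=None)
        flagged.append((index, w["word"], best["word"] if best else None))
    return flagged


words = [
    # Hypothetical first word, added so the text has more than one entry.
    {"word": "going", "confidence": 0.93, "time": 12.2, "duration": 0.4},
    # Word object taken from the example above (speaker fields omitted).
    {
        "word": "to",
        "confidence": 0.555,
        "time": 12.6,
        "duration": 0.04,
        "alternatives": [
            {"word": "the", "confidence": 0.28853413},
            {"word": "they", "confidence": 0.15478531},
            {"word": "<eps>", "confidence": 0.0021528},
        ],
    },
]
text = transcript_text(words)
review = uncertain_words(words)
```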

tcus

The tcus property is an array of tcu objects. A TCU (Turn Construction Unit) is a sentence or similar snippet of the conversation, delimited by punctuation, a speaker change or other means.

tcu object

example:

{
	"text": "How'd you get on today? Today",
	"speaker": "10",
	"startTime": 0.16000000,
	"endTime": 2.00000000,
	"sentiment": 0.01980386
}

| property | description |
| --- | --- |
| text | String. Transcription of the tcu. |
| speaker | String. The id property of the speaker object. This should be used to reference the speaker in the speakers array. |
| startTime | Real number. Timestamp in seconds at which the tcu begins in the conversation. |
| endTime | Real number. Timestamp in seconds at which the tcu ends in the conversation. |
| sentiment | Real number. Sentiment value of the tcu if sentiment processing was enabled for the transcription. See /wiki/spaces/VID/pages/950433 for details. |
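Since each tcu carries a speaker id, start and end times and an optional sentiment value, per-speaker statistics follow directly. The sketch below totals speaking time and averages sentiment per speaker. The function name is hypothetical, the second tcu in the demo is invented for illustration, and the code assumes the sentiment key is missing or null when sentiment processing was not enabled.

```python
from collections import defaultdict


def speaker_stats(tcus):
    """Return speaking time in seconds and mean sentiment per speaker id."""
    totals = defaultdict(lambda: {"seconds": 0.0, "sentiments": []})
    for tcu in tcus:
        entry = totals[tcu["speaker"]]
        entry["seconds"] += tcu["endTime"] - tcu["startTime"]
        # Assumption: no usable sentiment value if processing was disabled.
        if tcu.get("sentiment") is not None:
            entry["sentiments"].append(tcu["sentiment"])
    stats = {}
    for speaker, entry in totals.items():
        values = entry["sentiments"]
        stats[speaker] = {
            "seconds": entry["seconds"],
            "mean_sentiment": sum(values) / len(values) if values else None,
        }
    return stats


tcus = [
    # tcu object taken from the example above.
    {
        "text": "How'd you get on today? Today",
        "speaker": "10",
        "startTime": 0.16,
        "endTime": 2.0,
        "sentiment": 0.01980386,
    },
    # Hypothetical second tcu without a sentiment value.
    {"text": "Fine, thanks.", "speaker": "11", "startTime": 2.1, "endTime": 3.1},
]
stats = speaker_stats(tcus)
```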
