TTML (Timed Text Markup Language)

Timed Text Markup Language (TTML) is a standard for XML captions developed by the World Wide Web Consortium (W3C) in order to unify the increasingly divergent set of existing caption formats. It was designed to incorporate all the functionality of existing formats and thereby become the standard interchange format between applications. It is also the base standard for other XML-based caption formats. For a more detailed overview of XML-based captions, including the history of how the various standards were formed, read our History of XML Captions page.

TTML is growing in popularity for use with web-based applications, including Adobe Flash, Premiere Pro, and Microsoft Silverlight. It is also widely supported among video hosting and streaming services, such as YouTube, Netflix, and Amazon Video. Most video platforms such as Brightcove, Ooyala, and Kaltura (MediaSpace) also support TTML.

TTML files can use the file extension .ttml or .xml.

Structure

Being an XML-based format, TTML is very similar to HTML, in that it consists of a collection of nested structural elements, with tags to mark the beginning and end of each element. For example, the general structure of a TTML document is as follows:

<tt>
  <head>
    ...
  </head>
  <body>
    ...
  </body>
</tt>

The outermost element is the Timed Text, or <tt> element. The other elements are nested between the <tt> and </tt> tags, which mark the beginning and end of the <tt> element. The <head> element is optional. It contains information about styles, layouts, and document metadata. The <body> element contains the actual subtitles/captions. Each of these elements is discussed in more detail below.

The <tt> Element

In addition to being the root element (i.e., the overall container for the document), the <tt> element is also used to specify document-level metadata. This information may include a document title, description, language, namespaces, and copyright information. In addition, arbitrary metadata drawn from other namespaces may be specified. The following example shows attributes that might typically be included in the <tt> element of a TTML file:

<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml"
    xmlns:tts="http://www.w3.org/ns/ttml#styling">
    ...
</tt>

The xml:lang attribute sets the default document language to English. The xmlns attribute defines the default namespace for the document, and the xmlns:tts attribute allows us to use the prefix “tts:” as an alias for the “http://www.w3.org/ns/ttml#styling” namespace.
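To see what these namespace declarations mean to a parser, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The sample document and the describe_root helper are illustrative assumptions, not part of TTML itself. ElementTree expands each prefix into its full namespace URI, so element and attribute names come back in {namespace}name form.

```python
import xml.etree.ElementTree as ET

TTML_NS = "http://www.w3.org/ns/ttml"
TTS_NS = "http://www.w3.org/ns/ttml#styling"
XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace bound to the built-in "xml:" prefix

# Hypothetical sample document, used only for illustration.
DOC = """<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml"
    xmlns:tts="http://www.w3.org/ns/ttml#styling">
  <head/>
  <body tts:color="blue"/>
</tt>"""


def describe_root(ttml_text):
    """Return the expanded root tag, the document language, and a styling attribute."""
    root = ET.fromstring(ttml_text)
    # Child elements inherit the default namespace declared on <tt>.
    body = root.find(f"{{{TTML_NS}}}body")
    return {
        "root_tag": root.tag,
        "language": root.get(f"{{{XML_NS}}}lang"),
        "body_color": body.get(f"{{{TTS_NS}}}color"),
    }


info = describe_root(DOC)
print(info)
```

The prefix itself ("tts") is only shorthand; a tool matches on the namespace URI, which is why the URIs must be written exactly.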

The <head> Element

The <head> element specifies styles, regions, and metadata. Styles are used to indicate the desired look and feel of subtitles/captions. Regions define the size and location of the caption box. Metadata provides information about the document that might be used by editing, processing, or rendering tools. The following example <head> element shows all three sub-elements.

<head>
  <styling>
    <!-- style1 specifies the text color -->
    <style xml:id="style1" tts:color="blue"/>
  </styling>
  <layout>
    <region xml:id="region1" tts:origin="20% 80%" tts:extent="60% 20%"/>
  </layout>
  <metadata>
    <!-- info about the document -->
  </metadata>
</head>

The <style> element inside <styling> specifies that any subtitle marked with the style="style1" attribute will be shown in blue. The <region> element specifies that any subtitle marked with the region="region1" attribute will be shown in the bottom 20% of the screen with a 20% margin on the left and right.

The <body> Element

The <body> element contains the actual subtitles/captions. Each subtitle is wrapped in a <p> element. Each <p> element has a “begin” and “end” attribute specifying the start and end time for the subtitle to be shown on the screen as well as the text to be shown. Other attributes may be specified, the most common being style and region. The <div> element can be used to group <p> elements that all share some common attribute, such as language, region, or font style. The following example shows a complete <body> element illustrating the use of styles, regions, and a <div> container.

<body>
  <div xml:lang="en">
    <p begin="0.76s" end="3.20s" style="s1" region="r1Left">
      I sent a message to the fish:
    </p>
    <p begin="3.20s" end="6.61s" style="s2" region="r2Right">
      I told them "This is what I wish."
    </p>
    <p begin="6.61s" end="9.93s" style="s1" region="r1Left">
      The little fishes of the sea,
    </p>
    <p begin="9.93s" end="12.35s" style="s2" region="r2Right">
      They sent an answer back to me.
    </p>
  </div> 
</body>

There are four subtitles grouped inside of a <div> element, which specifies the language as English. The first and third subtitle will appear on the left (in the “r1Left” region), shown in style “s1”, whereas the second and fourth subtitle will appear on the right (in the “r2Right” region), shown in style “s2”.
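As a rough illustration of how a tool might read these attributes, the following Python sketch walks every <p> element and collects its timing, region, and text. The list_captions helper and the shortened sample document are assumptions made for this example. It uses only the standard library.

```python
import xml.etree.ElementTree as ET

TTML = "{http://www.w3.org/ns/ttml}"

# Shortened, hypothetical version of the <body> example, wrapped in a <tt> root.
DOC = """<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml">
  <body>
    <div xml:lang="en">
      <p begin="0.76s" end="3.20s" style="s1" region="r1Left">
        I sent a message to the fish:
      </p>
      <p begin="3.20s" end="6.61s" style="s2" region="r2Right">
        I told them "This is what I wish."
      </p>
    </div>
  </body>
</tt>"""


def list_captions(ttml_text):
    """Return (begin, end, region, text) for each <p> element, in document order."""
    root = ET.fromstring(ttml_text)
    captions = []
    for p in root.iter(TTML + "p"):
        # Join all text inside the <p> and collapse runs of whitespace.
        text = " ".join("".join(p.itertext()).split())
        captions.append((p.get("begin"), p.get("end"), p.get("region"), text))
    return captions


captions = list_captions(DOC)
for caption in captions:
    print(caption)
```

This sketch ignores inline markup and attributes inherited from a parent <div>; a fuller reader would need to handle both.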

The <br> and <span> Elements

Markup can also be used within the caption text itself. For example, the <br/> element is used to force a line break. The <span> element is used to change the font style of only a portion of the text. For example, suppose a subtitle should appear on the screen as follows:

Twinkle, twinkle, little bat!
How I wonder where you’re at!

To produce the desired styling, the markup could be written as follows:

<p begin="00:00:01.20" end="00:00:07.84" tts:textAlign="center">
  Twinkle, twinkle, little bat!<br/>
  How <span tts:fontStyle="italic">I wonder</span> where you're at!
</p>

The <br/> following the word “bat!” forces a line break to occur at that point, rather than allow the caption box to fill to the edges. The <span> tag begins inline styling in italics, and the </span> ends the inline styling.

Time Formats

Times can be expressed either in clock-time format or offset-time format. In either case, they are offsets that are typically relative to the beginning of the video (time zero). Clock-time format can be expressed in one of the following ways:

  • hours:minutes:seconds.fraction (e.g. “00:07:15.25”)
  • hours:minutes:seconds:frames (e.g. “00:07:15:06”)

Note that each segment is zero-padded to two digits. In the first example, seconds are expressed as fractional decimal. The example time of “00:07:15.25” represents 7 minutes, 15 seconds, and 250 milliseconds from the beginning of the video. In the second example, frames are used instead of fractional seconds. The example time of “00:07:15:06” represents the 6th frame after 7 minutes and 15 seconds have passed.

Offset-time format is expressed as a single decimal number followed by a unit indicator (aka “metric”). The unit indicator can be one of the following: “h” (hours), “m” (minutes), “s” (seconds), “ms” (milliseconds), “f” (frames), “t” (ticks). The most common unit indicator is seconds. For example, the clock time of “00:07:15.25” expressed in offset-time would be “435.25s” (7 × 60 + 15.25 seconds).
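The arithmetic behind the two formats can be sketched in a few lines of Python. The function names are hypothetical. The frame_rate default of 30 is an assumption for illustration; a real document may set its own frame rate. The "f" and "t" units are left out because they depend on document-level settings.

```python
import re

# Seconds per unit, for the offset-time units that do not depend on document settings.
UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def clock_time_to_seconds(value, frame_rate=30):
    """Convert hours:minutes:seconds.fraction or hours:minutes:seconds:frames to seconds.

    frame_rate is an assumed value, used only when a frames field is present.
    """
    parts = value.split(":")
    if len(parts) not in (3, 4):
        raise ValueError(f"not a clock-time value: {value!r}")
    hours, minutes, seconds = int(parts[0]), int(parts[1]), float(parts[2])
    frames = int(parts[3]) if len(parts) == 4 else 0
    return hours * 3600 + minutes * 60 + seconds + frames / frame_rate


def offset_time_to_seconds(value):
    """Convert an offset-time value such as "435.25s" or "250ms" to seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(h|ms|m|s)", value)
    if match is None:
        raise ValueError(f"unsupported offset-time value: {value!r}")
    number, unit = match.groups()
    return float(number) * UNIT_SECONDS[unit]


print(clock_time_to_seconds("00:07:15.25"))
print(offset_time_to_seconds("435.25s"))
```

Both calls describe the same instant: 7 minutes and 15.25 seconds after time zero.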

Features

TTML was designed to incorporate all the features of existing caption formats, and as such, it includes a rich set of functionality, including:

  • positioning
  • alignment
  • styling
  • animation
  • multiple languages
  • metadata
  • multiple captions on the screen simultaneously

Each of the above features is discussed in more detail below.

Positioning

Traditionally, the location in which captions/subtitles were displayed was left up to the device or software rendering them, and they were generally displayed at the bottom of the screen. However, as technology and capabilities evolved, it became possible to specify the location of subtitles on a case-by-case basis. This is useful to avoid the scenario where subtitles are written on top of text at the bottom of the screen that was part of the video. Positioning can also be used to place captions near the corresponding speaker, so that hard of hearing viewers can identify who is speaking.

Positioning is accomplished by defining one or more <region> elements in the header, and applying region attributes in the <p> elements. For example, suppose we want to have three regions as shown below:

[Figure: TTML Vertical Regions]

Note that the origin (0%, 0%) is the top left corner. The following example shows how the three regions depicted above would be defined in the header and referenced in the body:

<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml"
    xmlns:tts="http://www.w3.org/ns/ttml#styling">
<head>
 <layout>
   <region xml:id="rTop"    tts:origin="10% 10%" tts:extent="80% 20%"/>
   <region xml:id="rMiddle" tts:origin="10% 40%" tts:extent="80% 20%"/>
   <region xml:id="rBottom" tts:origin="10% 70%" tts:extent="80% 20%"/>
 </layout>
</head>
<body>
  <div xml:lang="en">
    <p begin="0.76s" end="3.20s" region="rTop">
      I sent a message to the fish:
    </p>
    <p begin="3.20s" end="6.61s" region="rMiddle">
      I told them "This is what I wish."
    </p>
    <p begin="6.61s" end="9.93s" region="rBottom">
      The little fishes of the sea,
    </p>
    <p begin="9.93s" end="12.35s" region="rMiddle">
      They sent an answer back to me.
    </p>
  </div> 
</body>
</tt>

In the above example, the tts:origin attributes are expressed in percentages (x% y%) relative to the top left corner (the origin), while tts:extent attributes are percentages (width% height%) relative to the overall width and height of the video display.
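A region attribute only has an effect when its value matches the xml:id of a region defined in the header, so a mistyped name is easy to miss. The following Python sketch shows one way such references could be checked. The undefined_region_refs helper and the sample document (which contains a deliberate typo, "rBotom") are hypothetical.

```python
import xml.etree.ElementTree as ET

TTML = "{http://www.w3.org/ns/ttml}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical document with one misspelled region reference.
DOC = """<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml"
    xmlns:tts="http://www.w3.org/ns/ttml#styling">
<head>
  <layout>
    <region xml:id="rTop" tts:origin="10% 10%" tts:extent="80% 20%"/>
    <region xml:id="rBottom" tts:origin="10% 70%" tts:extent="80% 20%"/>
  </layout>
</head>
<body>
  <div>
    <p begin="0.76s" end="3.20s" region="rTop">First caption</p>
    <p begin="3.20s" end="6.61s" region="rBotom">Second caption</p>
  </div>
</body>
</tt>"""


def undefined_region_refs(ttml_text):
    """Return region attribute values that do not match any <region xml:id>."""
    root = ET.fromstring(ttml_text)
    defined = {region.get(XML_ID) for region in root.iter(TTML + "region")}
    missing = []
    for element in root.iter():
        ref = element.get("region")
        if ref is not None and ref not in defined:
            missing.append(ref)
    return missing


missing = undefined_region_refs(DOC)
print(missing)
```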

The tts:origin and tts:extent attributes can also be expressed in terms of pixels instead of percentages. In that case, a tts:extent attribute should be added to the <tt> root element to specify the width and height of the video. Doing so will help validation tools flag any regions that extend outside of the video’s dimensions. If the example above were to use pixels as units, it would be written as:

<tt xml:lang="en" tts:extent="1280px 720px"
  xmlns="http://www.w3.org/ns/ttml"
  xmlns:tts="http://www.w3.org/ns/ttml#styling">
<head>
  <layout>
    <region xml:id="rTop" tts:origin="128px 72px" tts:extent="1024px 144px"/>
    <region xml:id="rMiddle" tts:origin="128px 288px" tts:extent="1024px 144px"/>
    <region xml:id="rBottom" tts:origin="128px 504px" tts:extent="1024px 144px"/>
  </layout>
</head>
<body>
...
</body>
</tt>
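Converting between the two unit systems is simple arithmetic: the first value of each pair scales with the frame width and the second with the frame height. Here is a minimal Python sketch of that conversion; the helper name is hypothetical, and rounding to whole pixels is an assumption.

```python
def percent_pair_to_pixels(value, frame_width, frame_height):
    """Convert a pair like "10% 70%" to a pixel pair for the given frame size."""
    x_text, y_text = value.split()
    # Multiply before dividing so that whole-number results stay exact.
    x = float(x_text.rstrip("%")) * frame_width / 100
    y = float(y_text.rstrip("%")) * frame_height / 100
    return f"{round(x)}px {round(y)}px"


# A region with origin "10% 70%" and extent "80% 20%" on a 1280 by 720 frame.
print(percent_pair_to_pixels("10% 70%", 1280, 720))
print(percent_pair_to_pixels("80% 20%", 1280, 720))
```

The same function works for both tts:origin and tts:extent, since both are expressed as a horizontal value followed by a vertical value.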

Alignment

The alignment of subtitle text within regions can be specified using the textAlign and displayAlign style attributes. The textAlign attribute specifies horizontal alignment, whereas the displayAlign attribute specifies vertical alignment. Horizontal alignment can be useful to help indicate which speaker is talking. For example:

Time 0.76s:

Something said by the
person on the left.

Time 2.21s:

Something said by the
person in the center.

Time 4.57s:

Something said by the
person on the right.

To caption the above dialog preserving the positioning shown, we would use the following markup:

<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml"
    xmlns:tts="http://www.w3.org/ns/ttml#styling">
<head>
  <layout>
    <region xml:id="rLeft" tts:origin="10% 70%" tts:extent="80% 20%">
      <style tts:textAlign="left"/>
    </region>
    <region xml:id="rCenter" tts:origin="10% 70%" tts:extent="80% 20%">
      <style tts:textAlign="center"/>
    </region>
    <region xml:id="rRight" tts:origin="10% 70%" tts:extent="80% 20%">
      <style tts:textAlign="right"/>
    </region>
  </layout>
</head>
<body>
  <div xml:lang="en">
    <p begin="0.76s" end="2.21s" region="rLeft">
      Something said by the<br/>
      person on the left.
    </p>
    <p begin="2.21s" end="4.57s" region="rCenter">
      Something said by the<br/>
      person in the center.
    </p>
    <p begin="4.57s" end="6.35s" region="rRight">
      Something said by the<br/>
      person on the right.
    </p>
  </div> 
</body>
</tt>

The above example shows how the horizontal alignment of captions can be specified using the textAlign style attribute. The following example shows how the displayAlign attribute can be used to specify vertical alignment:

    <region xml:id="rLowerLeft">
      <style tts:displayAlign="after" tts:textAlign="left"/>
    </region>
    <region xml:id="rUpperRight">
      <style tts:displayAlign="before" tts:textAlign="right"/>
    </region>

Finally, the overflow attribute can be used to specify whether or not text that overflows the bounding region is to be shown or hidden:

    <region xml:id="rOverflowShown">
      <style tts:overflow="visible" ... />
    </region>
    <region xml:id="rOverflowHidden">
      <style tts:overflow="hidden" ... />
    </region>

Styling

TTML supports a rich set of styling attributes that can be used to specify a variety of aspects of the look and feel of the captions, including:

  • font family, size, weight, emphasis, and color
  • letter spacing and kerning
  • line height
  • borders, padding, background colors, and opacity

Styles can be applied to most elements, including <region>, <body>, <div>, <p>, and <span>. They can be nested (defined inline as a child of an element) or referenced (defined in the header, for example, and later referenced by the elements they affect). Here is an example of nested styling:

<region xml:id="r1">
 <style tts:color="blue"/>
 <style tts:fontFamily="monospaceSerif"/>
</region>

Here is an example of referential styling:

<style xml:id="s1" tts:color="white"/>
<style xml:id="s2" tts:color="yellow"/>
...
<p style="s1">White 1</p>
<p style="s2">Yellow 2</p>

Child elements inherit styles from ancestor elements, for example:

<tt tts:color="yellow">
...
<region xml:id="r1" tts:fontFamily="monospaceSerif"/>
...
<p region="r1">Yellow Monospace</p>
...
</tt>

Styles can also be chained together:

<style xml:id="s1" tts:color="white" tts:fontFamily="monospaceSerif"/>
<style xml:id="s2" style="s1" tts:color="yellow"/>
...
<p style="s1">White Monospace</p>
<p style="s2">Yellow Monospace</p>
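To make the effect of chaining concrete, the following Python sketch resolves a chained style into its final set of attributes: the referenced style is applied first, and the referencing style's own attributes then override it. The resolve_style helper and the wrapping <styling> sample are assumptions for illustration, not a complete implementation of TTML style resolution.

```python
import xml.etree.ElementTree as ET

TTS = "{http://www.w3.org/ns/ttml#styling}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical <styling> block containing two chained styles.
DOC = """<styling xmlns="http://www.w3.org/ns/ttml"
    xmlns:tts="http://www.w3.org/ns/ttml#styling">
  <style xml:id="s1" tts:color="white" tts:fontFamily="monospaceSerif"/>
  <style xml:id="s2" style="s1" tts:color="yellow"/>
</styling>"""


def resolve_style(styles, style_id, _seen=()):
    """Merge a style with the styles it references; its own attributes win."""
    if style_id in _seen:
        raise ValueError(f"style chain loops back to {style_id!r}")
    element = styles[style_id]
    resolved = {}
    # Apply referenced styles first so they can be overridden below.
    for parent_id in element.get("style", "").split():
        resolved.update(resolve_style(styles, parent_id, _seen + (style_id,)))
    for name, value in element.attrib.items():
        if name.startswith(TTS):
            resolved[name[len(TTS):]] = value
    return resolved


styles = {style.get(XML_ID): style for style in ET.fromstring(DOC)}
s1 = resolve_style(styles, "s1")
s2 = resolve_style(styles, "s2")
print(s1)
print(s2)
```

Style s2 keeps the font family it picked up from s1 but replaces the color, which is why the second caption is shown as yellow monospace.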

Animation

TTML includes the capability to animate subtitles. This is accomplished by specifying discrete changes to one or more style parameters, each applied at a particular time and lasting for a finite duration. An example of how this feature could be used is to color the words of karaoke captions in sync with the music to show which words should be sung at which times.

Animation can either be inline or out-of-line. Inline animation is accomplished by adding one or more <animate> elements as the children of a <region>, <body>, <br>, <div>, <p>, or <span> element. Here is an example of inline animation:

<p dur="9s">
  <animate dur="9s" tts:color="red;white;blue" repeatCount="indefinite"/>
  Happy Independence Day!
</p>

In the above example, the font color changes from red, to white, to blue over a nine-second period (three seconds on each color), and then repeats that pattern continuously.
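The timing described here can be worked out with a short Python sketch. It assumes, as in the description above, that the duration is split into equal discrete steps and that the pattern repeats indefinitely. The discrete_value_at helper is hypothetical and does not model other animation modes.

```python
def discrete_value_at(values, duration, t):
    """Return the value in effect at time t for a repeating, equally divided list."""
    steps = values.split(";")
    step_length = duration / len(steps)
    # Wrap the time around to model repeatCount="indefinite".
    position = t % duration
    index = min(int(position // step_length), len(steps) - 1)
    return steps[index]


# Sample the color at a few times, in seconds, for the nine-second pattern.
for t in (0, 3, 6, 8.9, 10):
    print(t, discrete_value_at("red;white;blue", 9, t))
```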

If an inline animation is incorporated into a region, all the captions displayed in that region will animate accordingly. For example:

<region xml:id="rFade" timeContainer="seq" tts:opacity="0">
  <animate dur="1s" tts:opacity="0;1"/>
  <set dur="5s" tts:opacity="1"/>
  <animate dur="1s" tts:opacity="1;0"/>
</region>

<body region="rFade">...</body>

In the above example, all captions in the body will inherit the rFade region. This will cause them to fade in (from transparent to fully opaque) over a one-second interval, remain opaque for five seconds, and then fade out over a one-second interval.

Out-of-line animation is accomplished by defining an <animation> element in the header and then referencing its xml:id from an animate attribute on the element to be animated. This allows one type of animation to be defined once, and used multiple times.

<head>
...
  <animation xml:id="aPulsate" repeatCount="indefinite">
    <animate dur="1s" tts:opacity="0;1"/>
    <set dur="0.5s" tts:opacity="1"/>
    <animate dur="1s" tts:opacity="1;0"/>
  </animation>
...
</head>
<body animate="aPulsate">...</body>

In the above example, the “aPulsate” animation is defined in the header and then referenced as an attribute of the body. All captions will then continuously fade in and out.

Metadata

The <metadata> element is a generic container for adding information to virtually any TTML element. There are seven predefined fields:

  • ttm:title
  • ttm:desc
  • ttm:copyright
  • ttm:agent
  • ttm:name
  • ttm:actor
  • ttm:item

Below are some examples of how the <metadata> element can be used with predefined fields.

At the document level:

<head>
  <metadata xmlns:ttm="http://www.w3.org/ns/ttml#metadata">
    <ttm:title>Hamlet</ttm:title>
    <ttm:desc>The Tragedy of Hamlet, Prince of Denmark, by William Shakespeare.</ttm:desc>
  </metadata>
</head>

Attached to a <div>:

<div>
  <metadata xmlns:ttm="http://www.w3.org/ns/ttml#metadata">
    <ttm:title>Act 1, Scene 2 - A Room of State in the Castle</ttm:title>
    <ttm:desc>King Claudius speaks of the death of his brother, and implores Hamlet to stay in Denmark.</ttm:desc>
  </metadata>
</div>

Note that in the above examples, the ttm namespace is defined as an attribute of the <metadata> element. It could also have been defined at the root level (the <tt> element). The <metadata> element can include externally defined fields, as long as they are preceded by a defined namespace prefix. For example:

<div xmlns:ext="http://example.org/ttml#metadata">
  <metadata ext:ednote="remove this division prior to publishing"/>
</div>

Simultaneous Subtitles

TTML allows multiple subtitles/captions to be displayed simultaneously. This can be useful when several speakers are talking at once. For example:

Time 0.76s:

What are your
favorite desserts?

Time 2.21s:

Apple pie with
vanilla ice cream.

Time 2.77s:

Apple pie with          Strawberry shortcake
vanilla ice cream.      with whip cream.

In the above example, Speaker 1 (on the left) asks a question, Speaker 2 (center) begins to answer first, and Speaker 3 (right) begins to answer a half-second later, while Speaker 2 is still talking. To convey that both speakers were talking at the same time, the captions for the two speakers are shown in sync with when each speaker was talking, resulting in captions being displayed simultaneously. To produce the caption sequence shown above, the following markup is used:

  <div xml:lang="en">
    <p begin="0.76s" end="2.21s" region="rLeft">
      What are your<br/>
      favorite desserts?
    </p>
    <p begin="2.21s" end="4.57s" region="rCenter">
      Apple pie with<br/>
      vanilla ice cream.
    </p>
    <p begin="2.77s" end="5.25s" region="rRight">
      Strawberry shortcake<br/>
      with whip cream.
    </p>
  </div>

It is possible to force sequential display of captions even if the timing overlaps. This can be done by specifying the timeContainer="seq" attribute in the parent <div>. If the timeContainer attribute is left out, its value defaults to “par” (parallel), and simultaneous captions are allowed.
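Whether two captions will actually be on screen together depends only on whether their time intervals overlap. The following Python sketch finds such pairs, assuming the begin and end times have already been converted to seconds. The overlapping_pairs helper and the speaker labels are hypothetical; the timings are taken from the dessert example above.

```python
def overlapping_pairs(captions):
    """Return label pairs for captions whose display intervals overlap.

    captions is a list of (begin_seconds, end_seconds, label) tuples.
    """
    ordered = sorted(captions)
    pairs = []
    for i, (begin_a, end_a, label_a) in enumerate(ordered):
        for begin_b, end_b, label_b in ordered[i + 1:]:
            # Later captions start even later, so stop once one begins
            # at or after the end of the current caption.
            if begin_b >= end_a:
                break
            pairs.append((label_a, label_b))
    return pairs


captions = [
    (0.76, 2.21, "Speaker 1"),
    (2.21, 4.57, "Speaker 2"),
    (2.77, 5.25, "Speaker 3"),
]
pairs = overlapping_pairs(captions)
print(pairs)
```

A caption that begins exactly when another ends is not counted as overlapping, which matches the hand-off from Speaker 1 to Speaker 2 at 2.21 seconds.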

Example

The following example shows the TTML file you would get if you ordered Speechpad’s Standard Captions for a video. TTML is just one of many formats you can download once the captions have been created. You can then use the TTML file to allow various players and video hosting services to present captions with your video (see the compatibility list below).

<?xml version="1.0" encoding="utf-8"?>
<tt xmlns:ttm="http://www.w3.org/ns/ttml#metadata" xmlns:tts="http://www.w3.org/ns/ttml#styling">
	<head>
		<metadata>
			<title/>
			<language>en_US</language>
			<region>US</region>
			<guid/>
			<emailid>support@speechpad.com</emailid>
		</metadata>
		<styling>
			<style xml:id="defaultCaption" tts:fontSize="10" tts:fontFamily="SansSerif"
			tts:fontWeight="normal" tts:fontStyle="normal"
			tts:textDecoration="none" tts:color="white"
			tts:backgroundColor="black" />
		</styling>
	</head>
	<body style="defaultCaption">
		<div>
			<p begin="00:00:03.400" end="00:00:06.177">In this lesson, we're going to<br />be talking about finance. And</p>
			<p begin="00:00:06.177" end="00:00:10.009">one of the most important aspects<br />of finance is interest.</p>
			<p begin="00:00:10.009" end="00:00:13.655">When I go to a bank or some<br />other lending institution</p>
			<p begin="00:00:13.655" end="00:00:17.720">to borrow money, the bank is happy<br />to give me that money. But then I'm</p>
			<p begin="00:00:17.900" end="00:00:21.480">going to be paying the bank for the<br />privilege of using their money. And that</p>
			<p begin="00:00:21.660" end="00:00:26.440">amount of money that I pay the bank is<br />called interest. Likewise, if I put money</p>
			<p begin="00:00:26.620" end="00:00:31.220">in a savings account or I purchase a<br />certificate of deposit, the bank just</p>
			<p begin="00:00:31.300" end="00:00:35.800">doesn't put my money in a little box<br />and leave it there until later. They take</p>
			<p begin="00:00:35.800" end="00:00:40.822">my money and lend it to someone<br />else. So they are using my money.</p>
			<p begin="00:00:40.822" end="00:00:44.400">The bank has to pay me for the privilege<br />of using my money.</p>
			<p begin="00:00:44.400" end="00:00:48.700">Now what makes banks<br />profitable is the rate</p>
			<p begin="00:00:48.700" end="00:00:53.330">that they charge people to use the bank's<br />money is higher than the rate that they</p>
			<p begin="00:00:53.510" end="00:01:00.720">pay people like me to use my money. The<br />amount of interest that a person pays or</p>
			<p begin="00:01:00.800" end="00:01:06.640">earns is dependent on three things. It's<br />dependent on how much money is involved.</p>
			<p begin="00:01:06.820" end="00:01:11.300">It's dependent upon the rate of interest<br />being paid or the rate of interest being</p>
			<p begin="00:01:11.480" end="00:01:17.898">charged. And it's also dependent upon<br />how much time is involved. If I have</p>
			<p begin="00:01:17.898" end="00:01:22.730">a loan and I want to decrease the amount<br />of interest that I'm going to pay, then</p>
			<p begin="00:01:22.800" end="00:01:28.040">I'm either going to have to decrease how<br />much money I borrow, I'm going to have</p>
			<p begin="00:01:28.220" end="00:01:32.420">to borrow the money over a shorter period<br />of time, or I'm going to have to find a</p>
			<p begin="00:01:32.600" end="00:01:37.279">lending institution that charges a lower<br />interest rate. On the other hand, if I</p>
			<p begin="00:01:37.279" end="00:01:41.480">want to earn more interest on my<br />investment, I'm going to have to invest</p>
			<p begin="00:01:41.480" end="00:01:46.860">more money, leave the money in the<br />account for a longer period of time, or</p>
			<p begin="00:01:46.860" end="00:01:49.970">find an institution that will pay<br />me a higher interest rate.</p>
		</div>
	</body>
</tt>

Compatibility

The TTML file format is supported by most video players, streaming platforms, authoring tools, and editing software, including:

  • YouTube
  • Netflix
  • Amazon Video
  • Yahoo
  • AOL
  • Vimeo
  • Dailymotion
  • YouView
  • Metacafe
  • Brightcove
  • Ooyala
  • Kaltura (MediaSpace)
  • Limelight Networks
  • Adobe Media Server
  • Adobe Connect
  • Adobe TV
  • Adobe Premiere Pro
  • Adobe Flash
  • Open Source Media Framework (OSMF)
  • Adobe Presenter
  • Panopto
  • VLC
  • Flowplayer
  • JW Player
  • Subtitle Edit
  • Microsoft PowerPoint 2013 & Office 365 with Office Mix

Speechpad Supports TTML Captions

TTML captions are available with either of Speechpad’s captioning services: Standard Captions or Premium Captions.

If your project requires a specialized version of TTML captions or any other caption format, please contact us, and we’d be happy to assist you.

Learn more about Speechpad’s other XML-based caption formats.