History of XML Captions

XML captions and subtitles conform to one of several formats based on the TTML (Timed Text Markup Language) standard developed by the World Wide Web Consortium (W3C). These include:

  • TTML: The base standard for XML captions defined by the W3C.
  • DFXP: Short for “Distribution Format Exchange Profile,” DFXP is a collection of requirements drawn from the base TTML standard, and the name is often used interchangeably with “TTML.”
  • SMPTE-TT: Short for “Society of Motion Picture and Television Engineers Timed Text,” SMPTE-TT is an extension of TTML that includes all of DFXP as well as new features for bitmaps, binary data, and rendering.
  • SMIL: Short for “Synchronized Multimedia Integration Language,” SMIL is a W3C standard XML markup language designed to support synchronized presentation of various media components such as video, text, images, and audio.

Collectively, these XML-based caption formats have gained significant traction within established industries such as broadcast television and film. Despite the successful adoption within these domains, support for the standard in the web community has been mixed, and XML standards for captions and subtitles are currently competing with WebVTT, a new standard adopted by HTML5.


The best way to understand how we wound up with so many different flavors of XML subtitles/captions is to review the history of how the standards were developed.

A Mess in the Making

Captions for broadcast video have been around since 1970. As media presentation methods have evolved, so have caption formats. The emergence of web video accelerated this evolution exponentially. By the turn of the century, there were dozens of caption formats, and no de facto standard had emerged. It seemed that each video playback or editing application had its own new way of storing captions.

It seems counter-intuitive that the simple task of displaying text on a screen at a particular time would give rise to so many formats. However, the wide-openness of the web fostered unbridled innovation, and this simple requirement quickly evolved as new features were introduced, such as adding positioning and layout information, styling, animation, metadata about the video and its content, hyperlinks, etc. The result—a mess of incompatible formats.

The Cleanup Team

In early 2003, the World Wide Web Consortium (W3C) finally recognized and addressed the problem. The Timed Text Working Group (TTWG) was chartered with the mission of developing an XML-based format for representing streamable text synchronized with other timed media, such as audio and video. The format was designed to incorporate all the functionality of existing formats and thereby become the standard interchange format between applications.

The TTWG realized that the standard needed to address the requirements of three major groups:

  • Web developers like Microsoft (Silverlight), Google (YouTube), and Adobe (Flash)
  • Movie producers who author and distribute captions
  • Live broadcasters like FOX, CNN, ABC, CBS, and NBC, who have lots of video assets with traditional closed captions and would like to convert them to the new web-based standards as they re-purpose content for streaming delivery.

The group foresaw that one standard would not be sufficient to satisfy the needs of all these groups, so they created a framework that allowed multiple standards to be built on top of one base standard. The base was called Timed Text Markup Language (TTML). It defined a set of features needed for captioning. Standards could be defined that incorporated groups of base features. These are called Profiles.

To address the needs of web developers, three profiles were defined: DFXP Presentation, DFXP Transform, and DFXP Full. DFXP stands for “Distribution Format Exchange Profile.” The DFXP Presentation profile is used by video players to render captions, and the DFXP Transform profile is used in video editing to convert to and from other caption formats. The DFXP Full profile includes all the features defined in the base standard.
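To make the idea of a profile concrete, here is a minimal sketch of how a TTML document can declare the profile it targets and how an application might read its captions. The namespace and profile URIs are those published in the W3C TTML1 Recommendation (older DFXP files may use earlier draft namespaces); the caption text and the function name read_captions are illustrative assumptions, not part of any standard.

```python
import xml.etree.ElementTree as ET

# Namespaces from the W3C TTML1 Recommendation.
TT_NS = "http://www.w3.org/ns/ttml"
TTP_NS = "http://www.w3.org/ns/ttml#parameter"

# A small, made-up caption file that declares the DFXP Presentation profile.
SAMPLE = """<tt xmlns="http://www.w3.org/ns/ttml"
    xmlns:ttp="http://www.w3.org/ns/ttml#parameter"
    ttp:profile="http://www.w3.org/ns/ttml/profile/dfxp-presentation"
    xml:lang="en">
  <body>
    <div>
      <p begin="00:00:01.000" end="00:00:03.500">Hello, and welcome.</p>
      <p begin="00:00:04.000" end="00:00:06.000">Captions are timed text.</p>
    </div>
  </body>
</tt>
"""


def read_captions(ttml_text):
    """Return (profile_uri, [(begin, end, text), ...]) for a TTML string.

    Hypothetical helper for illustration only: it reads the ttp:profile
    attribute on the root element and the begin/end attributes on each <p>.
    It does not handle inherited timing, styling, or layout.
    """
    root = ET.fromstring(ttml_text)
    profile = root.get(f"{{{TTP_NS}}}profile")
    cues = [
        (p.get("begin"), p.get("end"), "".join(p.itertext()).strip())
        for p in root.iter(f"{{{TT_NS}}}p")
    ]
    return profile, cues


if __name__ == "__main__":
    profile, cues = read_captions(SAMPLE)
    print("Profile:", profile)
    for begin, end, text in cues:
        print(f"{begin} --> {end}  {text}")
```

A player that only implements the Presentation profile could compare the declared URI against the profiles it supports before attempting to render the file.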

Enter the TV and Film Industry

The Society of Motion Picture and Television Engineers (SMPTE) is an international organization established with the goal of advancing moving-imagery education and engineering across the communications, technology, media, and entertainment industries, i.e., the TV and film industries. Part of its charter is to develop standards for all aspects of motion imaging, ensuring that content is seen and heard in the highest possible quality on any display screen. As such, SMPTE took a keen interest in the new TTML standard produced by the TTWG.

SMPTE concluded that the TTML standard addressed many of the needs of the TV and film industries, but that its feature set was not sufficient to address all of them. In particular, it lacked:

  • Features for bitmap images (needed for certain European caption formats).
  • The ability to carry binary data (needed to support the existing CEA-708 standard for live broadcast).
  • The ability to render the caption in multiple ways. The SMPTE wanted to preserve the look and feel of legacy captions in some cases and, in other cases, to take full advantage of the enhanced presentation features afforded by TTML. Therefore, a new feature was needed to indicate which mode is in use.

Thankfully, rather than create a new standard, the SMPTE decided to extend the TTML standard to meet their requirements. They did so by defining three new features/extensions: #image for bitmaps, #data for binary data, and #information for presentation mode. They also created a new profile called SMPTE-TT, which encompassed the DFXP Full profile as well as the new extensions.


TTML is gaining international traction among large, established players in the broadcast, film, and web industries.

Broadcast and Film

As mentioned above, the Society of Motion Picture and Television Engineers (SMPTE) uses TTML as the basis for SMPTE-TT. Following the enactment of the 21st Century Communications and Video Accessibility Act (CVAA) in October 2010, the U.S. Federal Communications Commission (FCC) designated SMPTE-TT as a “safe harbor interchange and delivery format” for online captioning. The W3C also recently introduced a new profile, Internet Media Subtitles and Captions 1.0 (IMSC1), intended for use as an interchange format across subtitle and caption delivery applications worldwide.

Other broadcasting organizations have followed suit. The European Broadcasting Union (EBU) created the EBU-TT profile. The Japanese Association of Radio Industries and Businesses (ARIB) created the ARIB-TTML profile. The Digital Entertainment Content Ecosystem defined a profile of SMPTE-TT to deliver captions and subtitles in the UltraViolet™ digital media format.

Live broadcasters such as FOX, CNN, ABC, CBS, NBC, and PBS have also adopted TTML-based captions for rebroadcast and simulcast applications over the Internet.

Web and Multimedia Applications

TTML has seen broad adoption among major web companies as well. Microsoft uses DFXP for Silverlight, Expression Studio, and other media streaming technologies and tools. The company has also proposed a “Simple Delivery Profile for Closed Captions (US)” with the goal of establishing a minimum level of interoperability between TTML and legacy caption formats employed in US markets, such as CEA-608 and CEA-708.

Adobe was one of the early adopters of TTML. Since 2007, the company has implemented support for DFXP throughout its product line, including Flash, Premiere Pro, Adobe Connect, Adobe TV, Open Source Media Framework (OSMF), and Adobe Media Server.

In addition, FlowPlayer, Panopto, VLC, JW Player, and Subtitle Edit all support TTML.

Media Hosting and Delivery Platforms

Most major video hosting and streaming delivery platforms support some form of TTML, generally DFXP. These companies include: Brightcove, Limelight Networks, Ooyala, Kaltura (MediaSpace), and Akamai.

Streaming Portals and Subscription Services

Streaming portals such as YouTube, Yahoo, AOL, Vimeo, Dailymotion, and YouView all support the DFXP caption format. Major streaming video services such as Netflix and Amazon Video require captions to be submitted in TTML-based formats as well.

Speechpad Supports XML-Based Captions

When you order Standard Captions or Premium Captions from Speechpad, you’ll receive captions in XML caption formats, including TTML, DFXP, and SMPTE-TT. We can provide SMIL captions upon request.

If your project requires a specialized version of XML-based captions, please contact us, and we’d be happy to discuss it with you.

Learn more about Speechpad’s XML caption formats.