architecting a video encoding strategy designed for growth

Architecting a Video Encoding Strategy Designed for Growth

When building out an online video strategy, there are myriad decisions that will have a direct impact on how viewers engage with video. By architecting the video experience from the beginning to be flexible and dynamic, it’s possible to build a system that is not only a joy for users to watch, but is designed at its core for growth. In this guide, we will discuss simplifying output renditions for multi-device streaming, dynamically generating playlists with HLS and Smooth Streaming protocols and concatenating video using manifest files.

RENDITIONS AND THE MODERN WORLD

In its most basic form, online video consists of transcoding a single source file into a single output file that will play over the Web. Each of these video files is called a rendition, and an array of renditions defines how video will be delivered to end-users.

When YouTube launched in 2005, it delivered a single output rendition through a basic player. Fast forward to 2013 and the world of online video is defined by HTML5/Flash players, ad-insertion, recommendation engines, paywalls, and anywhere from a handful to a boatload of renditions at different bitrates and in various formats.

It may sound like a confusing mess, and it can be, but there are strategies that can simplify your approach to delivering video, shrink costs, and improve the viewer’s experience. It all starts with renditions.

CLIMB THE LADDER

Imagine the world of devices as a wall. At the bottom of the wall are the least capable, most painful-to-use feature phones with 3G connections and tiny screens. At the top of the wall, we have a brand-new HDTV with a fast Internet connection. Between the bottom and the top of the wall is a range of devices, each having different processors, GPUs, network connections and screen sizes.

The height of the wall is determined by average content duration; the longer the duration, the higher the wall. Renditions are like ladders that help us start anywhere along the wall and climb up or down smoothly. If the wall is high, there need to be more rungs on the ladder to ensure users can smoothly climb up and down. If the wall is short, we can get away with only a couple of rungs and still provide a good experience.

Structuring Video Renditions for Simplicity

Encoding strategies to keep costs down and quality high.

Step 1: The First Ladder

The first step is to decide on a base format. The base format should be playable on a wide range of devices. It might not always be the best choice on every device, but it should always be playable. The goal of online video is to get in front of everybody.

Zencoder supports a wide swath of the most important output formats for Web, mobile and connected TVs. Valid use cases exist for each of these formats; but, for the vast majority, MP4 is the best option due to its ubiquity across the widest range of devices. The first ladder we build will be based on the MP4 format.

[Figure: Climbing the Ladder. A ladder climbing through SD, 720p and 1080p as bandwidth, device capability and screen size increase.]

Step 2: Bitrates — Creating the Ladder’s Rungs

Now that we have decided which ladder to create first, we can begin constructing the rungs.

First, decide where on the wall the service should start and end. For example, consider a user-generated content site where the average video duration is one minute. The maximum size of each video is small, so there is no need to worry about buffering or stream disruptions; the player should be able to download the whole stream in a few seconds, which means only a couple of renditions are needed, for example, one HD and one SD.

On the other hand, consider a movie service with an average video length of 120 minutes. The files are large, which means the user’s device won’t be able to download the entire stream. In addition, users generally have higher expectations for the quality of feature films. We need to create a number of renditions so users will be able to watch high-quality videos when they have a strong network connection. If the connection is poor, we still want them to be able to watch a video, and then improve the experience as soon as more bandwidth is available by providing intermediate renditions — stepping up the ladder.

The longer the content and the higher the quality, the more renditions are needed to provide a consistent viewing experience.

[Figure: Climbing the Ladder, Step 2. Rungs at 250 kbps, 500 kbps, 750 kbps, 1.5 Mbps, 2.5 Mbps and 5 Mbps, climbing through SD, 720p and 1080p as bandwidth and screen size increase.]

Step 3: Defining the Rungs

We have created a nice, smooth ladder, but there is room for improvement. Aside from bitrate and resolution, H.264 has two other features that are used to target renditions at subsets of devices: profile and level.

Profile defines how complex a decoder must be to play a given rendition, ranging from low to high complexity. The three most important profiles are baseline, main and high. Level defines a maximum number of pixels and a maximum bitrate that a rendition is guaranteed not to exceed. The three most important levels are 3.0 (SD/legacy mobile), 3.1 (720p/mobile), and 4.1 (1080p/modern devices).

At the bottom rung, we want to provide the widest array of support so that we can always deliver a playable video regardless of the device. That means we should choose either baseline 3.0 or main 3.1, and we should choose a fairly modest resolution, most likely between 480x270 and 640x360. As we move up the ladder, we can gradually increment these values until we reach the top, where we can maximize our video quality with high-profile, level 4.1, 1080p videos.

[Figure: Climbing the Ladder, Step 3. The six rungs with their profile and level, in ascending order: 250 kbps (Baseline 3.0), 500 kbps (Baseline 3.0), 750 kbps (Main 3.1), 1.5 Mbps (Main 3.1), 2.5 Mbps (Main 3.1) and 5 Mbps (High 4.1), climbing through SD, 720p and 1080p as bandwidth and screen size increase.]
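As a rough illustration, the ladder from this step can be written down as plain data. The sketch below uses the six bitrates and the profile and level pairings from the figure; the `Rung` class and the `best_rung` helper are hypothetical names introduced only for this example, and the rule for picking a rung (the highest bitrate that fits the available bandwidth) is an assumption, not a description of how any particular player behaves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rung:
    """One rendition on the ladder (hypothetical helper type)."""
    bitrate_kbps: int
    profile: str
    level: str


# Bitrates and profile/level pairings taken from the ladder in this guide.
LADDER = [
    Rung(250, "baseline", "3.0"),
    Rung(500, "baseline", "3.0"),
    Rung(750, "main", "3.1"),
    Rung(1500, "main", "3.1"),
    Rung(2500, "main", "3.1"),
    Rung(5000, "high", "4.1"),
]


def best_rung(available_kbps: float, ladder=LADDER) -> Rung:
    """Return the highest rung that fits the bandwidth, else the bottom rung."""
    fitting = [r for r in ladder if r.bitrate_kbps <= available_kbps]
    if fitting:
        return max(fitting, key=lambda r: r.bitrate_kbps)
    # Always return something playable, even on a very poor connection.
    return min(ladder, key=lambda r: r.bitrate_kbps)


if __name__ == "__main__":
    for kbps in (100, 1000, 10000):
        print(kbps, "->", best_rung(kbps))
```

Keeping the ladder as data rather than hard-coding it into the encoding jobs makes it easier to revisit the rungs later without touching the rest of the system.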

Step 4: Formats — Duplicating Ladders

Now that our MP4s have been created, we have a stable base format and customers can watch video on a variety of devices; we created a ladder to scale the wall. While MP4 is a strong baseline format, other formats can improve the user’s experience. For example, HLS allows a user’s device to automatically and seamlessly jump up and down the ladder.

Since we have already created MP4s, and because MP4 is a standard format, we can easily repackage them into other formats. This duplication is called transmuxing, and it is such an easy task that Zencoder charges only 25 percent of the cost of a normal job to perform it. It can be done nearly instantly alongside a group of MP4 encodings by using “source,” “copy_video,” and “copy_audio.”

The “source” option tells Zencoder to reuse the file created under a given output “label.” So, if we create a file with “label”: “MP4_250”, all we need to do is use “source”: “MP4_250” to tell Zencoder to reuse this rendition. “copy_video” and “copy_audio” will then extract the elementary audio and video streams and repackage them into an HLS-formatted file.
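To make the relationship between the options concrete, here is a minimal sketch that builds a job description as a Python dictionary, pairing each MP4 output with an HLS output that reuses it. Only “label,” “source,” “copy_video,” and “copy_audio” come from the text above; the remaining keys (`input`, `outputs`, `format`, `type`, `video_bitrate`), the `HLS_` label prefix, and the example input URL are assumptions for illustration and should be checked against the Zencoder API documentation before use. The sketch only prints the JSON; it does not submit anything.

```python
import json


def build_job(input_url: str, bitrates_kbps: list[int]) -> dict:
    """Build a job with one MP4 and one transmuxed HLS output per bitrate."""
    outputs = []
    for kbps in bitrates_kbps:
        label = f"MP4_{kbps}"
        # The encoded MP4 rendition (key names other than "label" are assumed).
        outputs.append({"label": label, "format": "mp4", "video_bitrate": kbps})
        # The HLS rendition reuses the MP4 above instead of re-encoding it.
        outputs.append({
            "label": f"HLS_{kbps}",   # hypothetical naming scheme
            "source": label,          # reuse the output with this label
            "copy_video": True,       # keep the video stream as-is
            "copy_audio": True,       # keep the audio stream as-is
            "type": "segmented",      # assumed key for segmented output
            "format": "ts",           # assumed key for MPEG-TS segments
        })
    return {"input": input_url, "outputs": outputs}


job = build_job("s3://example-bucket/source.mov", [250, 500])

if __name__ == "__main__":
    print(json.dumps(job, indent=2))
```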

We can do the same thing for Smooth Streaming as well. And almost instantly, at a fraction of the cost, we have created two new ladders that let virtually anybody watch great quality video.

[Figure: Duplicating Ladders. The MP4 ladder and the HLS ladder side by side, each with the same six rungs climbing through SD, 720p and 1080p: 250 kbps (Baseline 3.0), 500 kbps (Baseline 3.0), 750 kbps (Main 3.1), 1.5 Mbps (Main 3.1), 2.5 Mbps (Main 3.1) and 5 Mbps (High 4.1).]

Step 5: Refine

The most important thing a video service can do is commit itself to constantly improving, revisiting, and refining its renditions.

With the pace of online video accelerating by the day, what seems terrific today might only be sufficient next year. In a couple of years, it will be downright obsolete. Zencoder helps solve these issues by being a driving force behind the bleeding edge of video encoding technology. We are constantly updating and building our tools to make the encoding platform faster and more stable with higher quality. The next step is up to you.

Constantly testing new variations to find the best set of renditions for your users will result in a more stable and optimized delivery infrastructure and a more engaged user base.

The Dynamic Generation of Playlists

For years, there were two basic models of Internet streaming: server-based streaming built on proprietary technology such as RTMP, and progressive download. Server-based streaming allows the delivery of multi-bitrate streams that can be switched on demand, but it requires licensing expensive software. Progressive download can be done over Apache, but switching bitrates requires playback to stop.

The advent of HTTP-based streaming protocols such as HLS and Smooth Streaming meant that streaming delivery was possible over standard HTTP connections using commodity server technology such as Apache. Seamless bitrate switching became commonplace and delivery over CDNs was simple as it was fundamentally the same as delivering any file over HTTP. HTTP streaming has resulted in nothing short of a revolution in the delivery of streaming media, vastly reducing the cost and complexity of high-quality streaming.

When designing a video platform there are countless things to consider; however, one of the most important and oft-overlooked decisions is how to treat HTTP-based manifest files.

A STATIC MANIFEST FILE

In the physical world, when you purchase a video, you look at the packaging, grab the box, head to the checkout stand, pay the cashier, go home and insert it into your player.

Most video platforms are structured pretty similarly; fundamentally, a group of metadata (the box) is associated with a playable media item (the video). Most video platforms start with the concept of a single URL that connects the metadata to a single MP4 video. As a video platform becomes more complex, there may be multiple URLs connected to the metadata representing multiple bitrates, resolutions, or perhaps other media associated with the main item such as previews or special features.

Things become more complicated when trying to extend the physical model to an online streaming world that includes HTTP-based streaming protocols such as HLS. HLS is based on many fragments of a video file linked together by a text file called a manifest. When implementing HLS, the most straightforward method is to simply add a URL that links to the manifest, or m3u8 file. This has the benefit of being extremely easy and fitting into the existing model.

The drawback is that HLS is not really like a static media item. For example, an MP4 is very much like a video track on a DVD; it is a single video at a single resolution and bitrate. The HLS manifest consists, most likely, of multiple bitrates, resolutions, and thousands of fragmented pieces of video. HLS has the capacity to do so much more than an MP4, so why treat it the same?

THE HLS PLAYLIST

An HLS playlist includes some metadata that describes basic elements of the stream, and an ordered set of links to fragments of the video. By downloading each fragment, or segment, of the video and playing them back in sequence, the user is able to watch what appears to be a single continuous video.

    #EXTM3U
    #EXT-X-PLAYLIST-TYPE:VOD
    #EXT-X-TARGETDURATION:10
    #EXTINF:10,
    file-0001.ts
    #EXTINF:10,
    file-0002.ts
    #EXTINF:10,
    file-0003.ts
    #EXTINF:10,
    file-0004.ts
    #EXT-X-ENDLIST

Above is a basic m3u8 playlist. It links to four video segments. To generate this data programmatically, all that is needed is the filename of the first item, the target duration of the segments (in this case, 10), and the total number of segments.
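As a minimal sketch of that idea, the function below rebuilds a playlist from just those three inputs. The function name and the assumption that filenames follow a `name-NNNN.ts` pattern with a zero-padded counter are illustrative choices, not part of any specification.

```python
def build_playlist(first_filename: str, target_duration: int, segment_count: int) -> str:
    """Generate a VOD playlist from the first segment name, duration and count.

    Assumes segments are named like 'file-0001.ts', 'file-0002.ts', ...
    """
    stem, extension = first_filename.rsplit(".", 1)
    prefix, first_index = stem.rsplit("-", 1)
    width = len(first_index)   # preserve the zero padding of the counter
    start = int(first_index)

    lines = [
        "#EXTM3U",
        "#EXT-X-PLAYLIST-TYPE:VOD",
        f"#EXT-X-TARGETDURATION:{target_duration}",
    ]
    for index in range(start, start + segment_count):
        lines.append(f"#EXTINF:{target_duration},")
        lines.append(f"{prefix}-{index:0{width}d}.{extension}")
    lines.append("#EXT-X-ENDLIST")
    return "\n".join(lines) + "\n"


playlist = build_playlist("file-0001.ts", 10, 4)

if __name__ == "__main__":
    print(playlist)
```

Because the playlist is now the output of a function, nothing about it needs to be stored beyond the three inputs.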

THE HLS MANIFEST

An HLS manifest is an unordered series of links to playlists. There are two reasons for having multiple playlists: to provide various bitrates and to provide backup playlists. Here is a typical manifest, where each .m3u8 is a relative link to another HLS playlist:

    #EXTM3U
    #EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=2040000
    file-2040k.m3u8
    #EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1540000
    file-1540k.m3u8
    #EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1040000
    file-1040k.m3u8
    #EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=640000
    file-640k.m3u8
    #EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=440000
    file-440k.m3u8
    #EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=240000
    file-240k.m3u8
    #EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=64000
    file-64k.m3u8

The playlists are of varying bitrates and resolutions in order to provide smooth playback regardless of the network conditions. All that is needed to generate a manifest are the bitrates of each playlist and their relative paths.
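A manifest like the one above could be produced with a few lines of code. This is a sketch under the assumption that each playlist is described by a bandwidth in bits per second and a relative path; the function name and the `file-<kbps>k.m3u8` naming are taken from the example or invented for illustration.

```python
def build_manifest(variants: list[tuple[int, str]]) -> str:
    """Generate a manifest from (bandwidth_bps, relative_path) pairs, in order."""
    lines = ["#EXTM3U"]
    for bandwidth_bps, path in variants:
        lines.append(f"#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH={bandwidth_bps}")
        lines.append(path)
    return "\n".join(lines) + "\n"


# The seven bitrates (in kbps) used in the example manifest.
BITRATES_KBPS = (2040, 1540, 1040, 640, 440, 240, 64)
variants = [(kbps * 1000, f"file-{kbps}k.m3u8") for kbps in BITRATES_KBPS]
manifest = build_manifest(variants)

if __name__ == "__main__":
    print(manifest)
```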

FILLING IN THE BLANKS

There are many other important pieces of information that an online video platform should be capturing for each encoded video asset: video codec, audio codec, container, and total bitrate are just a few. The data stored for a single video item should be meaningful to the viewer (description, rating, cast), meaningful to the platform (duration, views, engagement), and meaningful for applications (format, resolution, bitrate). With this data, you enable a viewer to decide what to watch, the system to decide how to program, and the application to decide how to play back.

By capturing the data necessary to programmatically generate a playlist, a manifest and the codec information for each of the playlists, it becomes possible to have a system where manifests and playlists are generated per request.

EXAMPLE — THE FIRST PLAYLIST

Under the HLS specification, whichever playlist comes first in the manifest is the first one chosen for playback. In the previous section’s example, the first item in the list was also the highest quality track. That is fine for users with a fast, stable Internet connection, but for people with slower connections it will take some time for playback to start.

It would be better to determine whether the device appears to have a good Internet connection, and then customize the manifest accordingly. Luckily, with dynamic manifest generation, that is exactly what the system is set up to accomplish.

For the purposes of this exercise, assume a request for a manifest is made with an ordered array of bitrates. For example, the request [2040,1540,1040,640,440,240,64] would return a manifest identical to the one in the previous section. On iOS, it is possible to determine if the user is on WiFi or a cellular connection. Since data has been captured about each playlist including bitrate, resolution, and other such parameters, an app can intelligently decide how to order the manifest.

For example, it may be determined that it is best to start between 800-1200kbps if the user is on WiFi and between 200-600kbps if the user is on a cellular connection. If the user were on WiFi, the app would request an array that looks something like: [1040,2040,1540,640,440,240,64]. If the app detected only a cellular connection, it would request [440,2040,1540,1040,640,240,64].
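One way an app might compute these arrays is sketched below. The starting ranges come from the example above; the function name, the dictionary of ranges, and the tie-breaking rule (pick the highest bitrate inside the range, then list the rest in descending order) are assumptions chosen to reproduce the two arrays shown.

```python
# Preferred starting ranges in kbps, taken from the example in the text.
START_RANGES_KBPS = {"wifi": (800, 1200), "cellular": (200, 600)}


def order_for_connection(bitrates_kbps: list[int], connection: str) -> list[int]:
    """Move the best starting bitrate for this connection to the front."""
    low, high = START_RANGES_KBPS[connection]
    descending = sorted(bitrates_kbps, reverse=True)
    candidates = [b for b in descending if low <= b <= high]
    if not candidates:
        # Nothing falls inside the range, so leave the default ordering.
        return descending
    first = candidates[0]  # highest bitrate inside the preferred range
    return [first] + [b for b in descending if b != first]


ALL_BITRATES = [2040, 1540, 1040, 640, 440, 240, 64]

if __name__ == "__main__":
    print(order_for_connection(ALL_BITRATES, "wifi"))
    # [1040, 2040, 1540, 640, 440, 240, 64]
    print(order_for_connection(ALL_BITRATES, "cellular"))
    # [440, 2040, 1540, 1040, 640, 240, 64]
```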

EXAMPLE — THE LEGACY DEVICE

On Android, video support is a bit of a black box. For years, the official Android documentation only listed support for 640x480 baseline H.264 MP4 video, even though certain models were able to handle 1080p. In the case of HLS, support is even more fragmented and difficult to understand.

Luckily, Android is dominated by a handful of marquee devices. With dynamic manifests, the app can not only choose the best playlist to start with, but also exclude playlists that are determined to be incompatible.

Since our media items are also capturing data such as resolution and codec information, support can be targeted at specific devices. An app could decide to send all of the renditions: [2040,1540,1040,640,440,240,64]. Or, an older device that only supports up to 720p could remove the highest rendition: [1540,1040,640,440,240,64]. Furthermore, beyond the world of mobile devices, if the app is a connected TV, it could remove the lowest quality renditions: [2040,1540,1040,640].
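The filtering described above might look like the following sketch. The bitrates are the ones used throughout this section, but the frame heights attached to them are hypothetical values made up for the example, as are the `Rendition` class and the `playable_bitrates` function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rendition:
    bitrate_kbps: int
    height: int  # frame height in pixels (hypothetical values below)


RENDITIONS = [
    Rendition(2040, 1080),
    Rendition(1540, 720),
    Rendition(1040, 540),
    Rendition(640, 360),
    Rendition(440, 360),
    Rendition(240, 270),
    Rendition(64, 180),
]


def playable_bitrates(renditions, max_height: Optional[int] = None,
                      min_bitrate_kbps: int = 0) -> list[int]:
    """Return the bitrates a device should be offered, preserving order."""
    return [
        r.bitrate_kbps
        for r in renditions
        if (max_height is None or r.height <= max_height)
        and r.bitrate_kbps >= min_bitrate_kbps
    ]


if __name__ == "__main__":
    print(playable_bitrates(RENDITIONS))                        # every rendition
    print(playable_bitrates(RENDITIONS, max_height=720))        # older device
    print(playable_bitrates(RENDITIONS, min_bitrate_kbps=640))  # connected TV
```

The resulting array can then be passed to whatever generates the manifest, so device rules live in one place rather than in a growing set of static manifest files.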

Choosing a static manifest model is perfectly fine. Some flexibility is lost, but there is nothing wrong with simplicity. Many use cases, especially in the user-generated content world, do not require the amount of complexity dynamic generation involves; however, dynamic manifest generation opens a lot of doors for those willing to take the plunge.

Video Concatenation Using Manifest Files

CONCATENATION AND THE OLD WAY

Content equals value, so, in the video world, one way to create more value is by taking a single video and mixing it with other videos to create a new piece of content. Many times this is done through concatenation, or the ability to stitch multiple videos together, which represents a basic form of editing. Add to that the creation of clips through edit lists and you have two of the most basic functions of a non-linear editor.

As promising as concatenation appears, it can also introduce a burden on both infrastructure and operations. Imagine a social video portal. Depending on the devices the portal targets, there could be anywhere from a handful to many dozens of output formats per video. Should the portal decide to concatenate multiple videos to extend the value of its library, it will also see a massive increase in storage cost and in the complexity of managing assets. Each time a new combination of videos is created, a series of fixed assets is generated and needs to be stored.

[Figure: A player request is answered with one of the available concatenated videos held in storage.]

Traditional concatenation involves creating new video files that are combinations of multiple existing files, creating a mess of large files.

HLS¹ VIDEO CONCATENATION USING MANIFEST FILES

The introduction of manifest-driven, HTTP-based streaming protocols has created an entirely new paradigm for creating dynamic viewing experiences. Traditionally, the only option for delivering multiple combinations of clips from a single piece of content was through editing, which means the creation of fixed assets. With technology such as HLS – since the playable item is no longer a video file, but a simple text file – making edits to a video is the same as making edits to a document in a word processor.

For a video platform, there are two ways to treat the HLS m3u8 manifest file. Most simply, the m3u8 file can be treated as a discrete, playable asset. In this model, the m3u8 is stored on the origin server alongside the segmented TS files and delivered to devices. The result is simple and quick to implement, but the m3u8 file can only be changed through a manual process.

Instead, by treating the manifest as something that is dynamically generated, it becomes possible to deliver a virtually limitless combination of clips to viewers. In this model, the m3u8 is generated on the fly, so it does not sit on the server but will be created and delivered every time it is requested.

¹ This article is focused on HTTP Live Streaming (HLS), but the basic concepts are valid for other HTTP-based streaming protocols as well.

[Figure: A player request triggers m3u8 generation, and any combination of segmented TS files in storage is sent to the player as a concatenated video.]

By generating HLS manifests on the fly, an unlimited combination of videos can be seamlessly delivered instantly to end-users.

DYNAMIC MANIFEST GENERATION

What is a manifest file? It is a combination of some metadata and links to segments of video:

Exemplary Video A

    #EXTM3U
    #EXT-X-MEDIA-SEQUENCE:0
    #EXT-X-TARGETDURATION:10
    #EXTINF:10,
    Exemplary_A_segment-01.ts
    #EXTINF:10,
    Exemplary_A_segment-02.ts

The above m3u8 has two video segments of 10 seconds each, so Exemplary Video A, which, by the way, is a truly great video, is 20 seconds long. Now let’s imagine we also have:

Exemplary Video B

    #EXTM3U
    #EXT-X-MEDIA-SEQUENCE:0
    #EXT-X-TARGETDURATION:10
    #EXTINF:10,
    Exemplary_B_segment-01.ts
    #EXTINF:10,
    Exemplary_B_segment-02.ts

And let’s also say that we know that a particular viewer would be thrilled to watch a combination of both videos, with Video B running first and Video A running second:

Superb Video

    #EXTM3U
    #EXT-X-MEDIA-SEQUENCE:0
    #EXT-X-TARGETDURATION:10
    #EXTINF:10,
    Exemplary_B_segment-01.ts
    #EXTINF:10,
    Exemplary_B_segment-02.ts
    #EXT-X-DISCONTINUITY
    #EXTINF:10,
    Exemplary_A_segment-01.ts
    #EXTINF:10,
    Exemplary_A_segment-02.ts

Instantly, without creating any permanent assets that need to be stored on origin, and without involving an editor to create a new asset, we have generated a new video for the user that begins with Video B followed by Video A. As if that wasn’t cool enough, the video will play seamlessly as though it was a single video.

You may have noticed a small addition to the m3u8, the “Discontinuity Flag”:

    #EXT-X-DISCONTINUITY

Placing this tag in the m3u8 tells the player to expect the next video segment to differ from the last, for example in resolution or audio profile. If the videos are all encoded with the same resolution, codecs, and profiles, and their timestamps run on continuously from one video to the next, then this tag can be left out.
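The stitching shown in the Superb Video example can be sketched in a few lines. The function below is an illustration, not a complete HLS implementation: its name and its input shape (a list of videos, each a list of duration and filename pairs) are assumptions, and it always inserts the discontinuity tag between videos rather than checking whether the encodings match.

```python
def concat_playlist(videos, target_duration: int = 10) -> str:
    """Join several videos into one playlist.

    `videos` is a list of videos in playback order; each video is a list of
    (duration_seconds, uri) tuples describing its segments.
    """
    lines = [
        "#EXTM3U",
        "#EXT-X-MEDIA-SEQUENCE:0",
        f"#EXT-X-TARGETDURATION:{target_duration}",
    ]
    for position, segments in enumerate(videos):
        if position > 0:
            # Warn the player that the next segment starts a different video.
            lines.append("#EXT-X-DISCONTINUITY")
        for duration, uri in segments:
            lines.append(f"#EXTINF:{duration},")
            lines.append(uri)
    return "\n".join(lines) + "\n"


video_a = [(10, "Exemplary_A_segment-01.ts"), (10, "Exemplary_A_segment-02.ts")]
video_b = [(10, "Exemplary_B_segment-01.ts"), (10, "Exemplary_B_segment-02.ts")]
superb = concat_playlist([video_b, video_a])

if __name__ == "__main__":
    print(superb)
```

Reordering the input list is all it takes to serve Video A first instead, and no new media files are created either way.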

EXTENDING THE NEW MODEL

The heavy lifting for making a video platform capable of delivering on-the-fly, custom playback experiences is to treat the m3u8 manifest not as a fixed asset, but as something that needs to be generated per request. That means that the backend must be aware of the location of every segment of video, the total number of segments per item, and the length of each segment.

There are ways to make this simpler. For example, by naming the files consistently, only the base filename needs to be known for all of the segments, and the segment iteration can be handled programmatically. It can be assumed that all segments except the final segment will be of the same target duration, so only the duration of the final segment needs to be stored separately. So, for a single video file with many video segments, all that needs to be stored is the base path, the base filename, the number of segments, the standard segment length, and the length of the last segment.
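A minimal sketch of that record and of a playlist rendered from it follows. The five fields mirror the list above; the class and function names, the `-01.ts` numbering pattern, and the example CDN URL are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentedVideo:
    """The five stored fields needed to regenerate a playlist on request."""
    base_path: str
    base_filename: str
    segment_count: int
    segment_seconds: float
    last_segment_seconds: float

    def segments(self):
        """Yield (duration, uri) for every segment, assuming '-NN.ts' naming."""
        for number in range(1, self.segment_count + 1):
            is_last = number == self.segment_count
            seconds = self.last_segment_seconds if is_last else self.segment_seconds
            yield seconds, f"{self.base_path}/{self.base_filename}-{number:02d}.ts"


def render_playlist(video: SegmentedVideo) -> str:
    # The target duration must be at least as long as the longest segment.
    target = math.ceil(max(video.segment_seconds, video.last_segment_seconds))
    lines = ["#EXTM3U", "#EXT-X-MEDIA-SEQUENCE:0", f"#EXT-X-TARGETDURATION:{target}"]
    for seconds, uri in video.segments():
        lines.append(f"#EXTINF:{seconds:g},")
        lines.append(uri)
    lines.append("#EXT-X-ENDLIST")
    return "\n".join(lines) + "\n"


video = SegmentedVideo("https://cdn.example.com/videos", "Exemplary_A_segment", 3, 10, 4.5)

if __name__ == "__main__":
    print(render_playlist(video))
```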

By considering even long-form titles to be a combination of scenes, or even further, by considering scenes to be a combination of shots, there is an incredible amount of power that can be unlocked through dynamic manifest generation. If planned for and built early, the architecture of the delivery platform can achieve a great deal of flexibility without subsequent increase in operational or infrastructure costs.

CONTACT

zencoder.com
