Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013


DESCRIPTION

Providing a great media consumption experience to customers is crucial to maximizing audience engagement. To do that, it is important that you make content available for consumption anytime, anywhere, on any device, with a personalized and interactive experience. This session explores the power of big data log analytics (real-time and batched), using technologies like Spark, Shark, Kafka, Amazon Elastic MapReduce, Amazon Redshift and other AWS services. Such analytics are useful for content personalization, recommendations, personalized dynamic ad-insertions, interactivity, and streaming quality. This session also includes a discussion from Netflix, which explores personalized content search and discovery with the power of metadata.

TRANSCRIPT

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Maximizing Audience Engagement in Media Delivery

Usman Shakeel, Amazon Web Services

Shobana Radhakrishnan, Engineering Manager at Netflix

November 14th 2013

Consumers Want …

• To watch content that matters to them
  – From anywhere
  – For free
• Content to be easily accessible
  – Nicely organized according to their taste
  – Instantly accessible from all different devices, anywhere
• Content to be of high quality
  – Without any “irrelevant” interruptions
  – Personalized ads
• To multitask while watching content
  – Interactivity, second screens, social media, share

Ultimate Customer Experience

• Ultimate content discovery

• Ultimate content delivery

Content Discovery

Content Choices Evolution … and More

Content Discovery Evolution … and More
– Personalized row display
– Similarity algorithms
– Unified search

Content Delivery …

User Engagement in Online Video

[Source: Conviva Viewer Experience Report – 2013]

Personalized Ads?

Mountains of raw data …

Data Sources

• Content discovery
  – Metadata
  – Session logs
• Content delivery
  – Video logs
  – Page-click event logs
  – CDN logs
  – Application logs
• Computed along several high-cardinality dimensions
• Very large datasets for a specific time frame

Mountains of Raw Data …

Some numbers from Netflix:
– Over 40 million customers
– More than 50 countries and territories
– Translates to hundreds of billions of events in a short period of time
– Over 100 billion metadata operations a day
– Over 1 billion viewing hours per month

… To Useful Information ASAP

• Historical data – Batch Analysis

• Live data – Real-time Analysis

Historic vs. Real-time Analytics

100% Dynamic
• Always computing on the fly
• Flexible but slow
• Very hard to scale
• Content delivery

100% Precomputed
• Super-fast lookups
• Rigid
• Does not cover all the use cases
• Content discovery

The Challenge

[Pipeline diagram: mountains of raw data → ingest & stream → real-time processing and back-end processing (storage/DWH) → dashboards, user personalization, user experience]

Agenda

• Ultimate Content Discovery – How Netflix creates personalized content and the power of metadata
• Ultimate Content Delivery – The toolset for real-time big data processing

Content Discovery – Personalized Experience

Why Is Personalization Important?

• Key streaming metrics
  – Median viewing hours
  – Net subscribers
• Personalization consistently improves these
• Over 75% of what people watch comes from recommendations

What Is Video Metadata?

• Genre

• Cast

• Rating

• Contract information

• Streaming and deployment information

• Subtitles, dubbing, trailers, stills, actual content

• Hundreds of such attributes

Used For …

• User-specific choices, e.g., language, viewing behavior, taste preferences

• Recommendation algorithms

• Device-specific rendering and playback

• CDN deployment

• Billboards, trailer display, streaming

• Original programming

• Basically everything!

Our Solution

Mountains of video metadata → batch processing → relevant, organized data for consumption

• Data snapshots to Amazon S3 (~10)
• Metadata publishing engine (one EC2 instance per country) generates Amazon S3 facets (~10 per country)
• Metadata cache reads Amazon S3 periodically and serves high-availability apps deployed on EC2 (~2000 m2.4xl and m2.2xl instances) – see the cache-refresh sketch below
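To make the periodic cache-refresh pattern concrete, here is a minimal sketch (not Netflix's actual code) using the AWS SDK for Java; the bucket name, object key, and MetadataSnapshot type are hypothetical placeholders:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;
import java.io.InputStream;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of an in-memory metadata cache that refreshes from Amazon S3 once an hour.
public class MetadataCacheRefresher {

    /** Hypothetical placeholder for the parsed snapshot structure. */
    static class MetadataSnapshot {
        static MetadataSnapshot parse(InputStream in) { return new MetadataSnapshot(); }
    }

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private final AtomicReference<MetadataSnapshot> cache = new AtomicReference<>();

    public void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Hourly refresh, matching the cadence described on the slides.
        scheduler.scheduleAtFixedRate(this::refresh, 0, 1, TimeUnit.HOURS);
    }

    private void refresh() {
        try {
            S3Object object = s3.getObject("example-metadata-bucket", "facets/snapshot-latest.blob");
            try (InputStream in = object.getObjectContent()) {
                // Parse the snapshot and swap it in atomically so readers always see a complete view.
                cache.set(MetadataSnapshot.parse(in));
            }
        } catch (Exception e) {
            // On failure, keep serving the previous snapshot (circuit-breaker style).
        }
    }

    public MetadataSnapshot current() {
        return cache.get();
    }
}

Swapping in a fully parsed snapshot behind an atomic reference is one way to keep 100% of metadata access in memory, which is what the availability goals on the following slides rely on.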

Metadata Platform Architecture

[Architecture diagram: various metadata generation and entry tools put snapshot files to Amazon S3 (com.netflix.mpl.test.coldstart.services), one per source; the offline metadata processing / publishing engine gets the snapshot files and puts facets (10 per country per cycle); playback device, API, and algorithm apps on EC2 instances (m2.2xl or m2.4xl) get blobs (~7 GB files, 10 gets per instance refresh) from Amazon S3 (netflix.vms.blob.file.instancetype.region).]

[Data flow: data entry and encoding tools → persistent storage (metadata entered, hourly data snapshots) → Amazon S3 → publishing engine (checks for snapshots; generates and writes artifacts; efficient resource utilization, quick data propagation) → in-memory cache (periodic cache refresh via Java API calls; low footprint, quick start-up/refresh) → apps]

Initially …

~2000 EC2 instances

Target Application Scale
• File size 2 GB–15 GB
• ~10 files per country (20 total)
• ~2000 instances (m2.2xl or m2.4xl) accessing these files once an hour via cache refresh from Amazon S3
• Availability goal: streaming 99.99%, sign-ups 99.9%
  – 100% of metadata access in memory
  – Autoscaling, so startup time must be managed efficiently

And then …

6000+ EC2 instances

Target Application Scale
• File size 2 GB–15 GB
• ~10 files per country (500 total, up from 20)
• ~6000 instances (m2.2xl or m2.4xl, up from ~2000) accessing via cache refresh from Amazon S3
• 100% of access in memory to achieve high availability
• Autoscaling, so startup time must be managed efficiently

Effects

• Slower file writes

• Longer publish time

• Slower startup and cache refresh

Amazon S3 Tricks That Helped

• Fewer writes
  – Region-based publishing engine instead of per-country
  – Blob images rather than facets
  – 10 Amazon S3 writes per cycle (down from 500)
• Smaller file sizes
  – Deduping moved to prewrite processing
  – Compression: zipped data snapshot files from the source
• Multipart writes (see the sketch below)
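For the multipart-writes bullet, a minimal sketch of what a multipart upload can look like with the AWS SDK for Java's TransferManager, which splits large files into parts and uploads them in parallel; the bucket, key, and file path are placeholders, not values from the talk:

import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;
import java.io.File;

// Sketch: upload a large zipped snapshot file to Amazon S3 as a multipart upload.
public class SnapshotUploader {
    public static void main(String[] args) throws InterruptedException {
        // TransferManager performs multipart uploads under the hood for large files,
        // uploading parts in parallel and retrying failed parts individually.
        TransferManager tm = TransferManagerBuilder.defaultTransferManager();
        Upload upload = tm.upload(
                "example-metadata-bucket",             // placeholder bucket
                "snapshots/snapshot-latest.zip",       // placeholder key
                new File("/tmp/snapshot-latest.zip")); // zipped snapshot file
        upload.waitForCompletion();  // blocks until all parts are uploaded and combined
        tm.shutdownNow();
    }
}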

Results

• Significant reduction in average memory footprint
• Significant reduction in application startup times
• Shorter publish times

What We Learned

• In-memory cache (NetflixGraph) effective for high availability
• Startup time is important when using autoscaling
• Use Amazon S3 best practices
• Circuit breakers

Future Architecture

• Dynamically controlled cache

• Dynamic code inclusion

• Parallelized publishing engine

Content Delivery

Toolset for real-time big-data processing

The Challenge

[Pipeline diagram: mountains of raw data → ingest & stream → real-time processing and back-end processing (storage/DWH) → dashboards, user personalization, user experience]

Ingest and Stream

Ingest and stream options: Amazon S3, Amazon SQS, Amazon DynamoDB, Amazon Kinesis, Kafka

Amazon Kinesis – enabling real-time ingestion and processing of streaming data

[Diagram: user data sources PUT records through an AWS VIP into an Amazon Kinesis stream (managed by a control plane); Amazon Kinesis-enabled applications on EC2 instances GET the records – User App.1 (aggregate & de-duplicate), User App.2 (metric extraction), User App.3 (sliding window) – and write results to Amazon S3, DynamoDB, and Amazon Redshift]

A Quick Intro to Amazon Kinesis

Producers
• Generate a stream of data (see the put-record sketch after this section)
• Data records from producers are Put into a Stream using a developer-supplied Partition Key, which places each record within a specific Shard

Kinesis cluster
• A managed service that captures and transports data streams
• Multiple shards; each shard supports 1 MB/sec
• The developer controls the number of shards; all data is stored for 24 hours
• High availability and durability through 3-way replication (across 3 Availability Zones)
• Each data record gets a Kinesis-assigned sequence number

Workers
• Each shard is processed by a worker running on EC2 instances that the developer owns and controls

[Diagram: producers → Kinesis cluster shards (S0–S4) → workers (W1–W6)]
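To make the producer side concrete, here is a minimal, hypothetical put-record sketch using the AWS SDK for Java; the stream name, partition key, and event payload are illustrative, not from the talk:

import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;
import com.amazonaws.services.kinesis.model.PutRecordResult;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch: put a single playback-quality event into a Kinesis stream.
public class ViewEventProducer {
    public static void main(String[] args) {
        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

        String event = "{\"sessionId\":\"abc-123\",\"type\":\"BUFFER\",\"bitrateKbps\":2350}";

        PutRecordRequest request = new PutRecordRequest()
                .withStreamName("playback-events")   // placeholder stream name
                .withPartitionKey("abc-123")         // e.g., session id keeps a session's records in one shard
                .withData(ByteBuffer.wrap(event.getBytes(StandardCharsets.UTF_8)));

        PutRecordResult result = kinesis.putRecord(request);
        // Kinesis routes the record to a shard based on the partition key
        // and assigns it a sequence number.
        System.out.println("shard=" + result.getShardId() + " seq=" + result.getSequenceNumber());
    }
}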

Processing

Amazon EMR, Amazon Redshift

A Quick Intro to Storm

• Similar to a Hadoop cluster
• Topologies vs. jobs (Storm vs. Hadoop) – a topology runs forever (unless you kill it)

storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2

• Streams – unbounded sequences of tuples, e.g., a stream of tweets into a stream of trending topics
  – Spout: source of streams (e.g., connect to a log API and emit a stream of logs)
  – Bolt: consumes any number of input streams, does some processing, and emits new streams (e.g., filters, unions, computations) – see the sketch after this list

*Source: https://github.com/nathanmarz/storm/wiki/Tutorial
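As a companion to the spout/bolt bullets above, here is a minimal, hypothetical bolt sketch using the era's backtype.storm package names; the class name and the BUFFER filter rule are illustrative only:

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Sketch: a bolt that consumes a stream of log lines and emits only buffering events.
public class BufferEventFilterBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String logLine = tuple.getString(0);
        if (logLine.startsWith("BUFFER")) {
            collector.emit(new Values(logLine));  // pass matching lines to downstream bolts
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("log"));  // name of the single output field
    }
}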

A Quick Intro to Storm

Example: get the count of unique viewers for ads that were clicked on and watched in a stream.

LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("reach");  // create a topology
builder.addBolt(new GetStream(), 3);  // get the stream that showed an ad; transforms [id, ad] into [id, stream]
builder.addBolt(new GetViewers(), 12).shuffleGrouping();  // get the viewers for each stream; transforms [id, stream] into [id, viewer]
builder.addBolt(new PartialUniquer(), 6).fieldsGrouping(new Fields("id", "viewer"));  // group by viewer id; unique count over a subset of viewers
builder.addBolt(new CountAggregator(), 2).fieldsGrouping(new Fields("id"));  // aggregate the unique-viewer counts

*Adapted from: https://github.com/nathanmarz/storm/wiki/Tutorial

Putting it together …

[Diagram: user data sources PUT records through an AWS VIP into Amazon Kinesis (control plane); Amazon Kinesis-enabled applications GET the records – User App.1 (aggregate & de-duplicate), User App.2 (metric extraction), User App.3 (sliding window). A consumer sketch follows.]
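To illustrate the GET side of the diagram, here is a minimal, hypothetical shard-reader sketch using the plain Kinesis API from the AWS SDK for Java (a production consumer would typically use the Kinesis Client Library instead); the stream and shard names are placeholders:

import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.GetRecordsRequest;
import com.amazonaws.services.kinesis.model.GetRecordsResult;
import com.amazonaws.services.kinesis.model.GetShardIteratorRequest;
import com.amazonaws.services.kinesis.model.Record;
import java.nio.charset.StandardCharsets;

// Sketch: read one shard of a Kinesis stream and hand records to processing logic.
public class ShardWorker {
    public static void main(String[] args) throws InterruptedException {
        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

        // Start from the oldest available record in a single shard of the placeholder stream.
        String iterator = kinesis.getShardIterator(new GetShardIteratorRequest()
                .withStreamName("playback-events")
                .withShardId("shardId-000000000000")
                .withShardIteratorType("TRIM_HORIZON"))
                .getShardIterator();

        while (iterator != null) {
            GetRecordsResult batch = kinesis.getRecords(
                    new GetRecordsRequest().withShardIterator(iterator).withLimit(100));
            for (Record record : batch.getRecords()) {
                String payload = StandardCharsets.UTF_8.decode(record.getData()).toString();
                // A real worker would aggregate, de-duplicate, or extract metrics here
                // and write results to S3, DynamoDB, or Redshift.
                System.out.println(record.getPartitionKey() + " -> " + payload);
            }
            iterator = batch.getNextShardIterator();
            Thread.sleep(1000);  // simple pacing between GetRecords calls
        }
    }
}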

A Quick Intro to Spark

• Language-integrated interface in Scala
• General-purpose programming interface that can be used for interactive data mining on clusters
• Example (count buffer events from a streaming log):

lines = spark.textFile("hdfs://...")  // define a data structure
errors = lines.filter(_.startsWith("BUFFER"))
errors.persist()  // persist in memory
errors.count()
errors.filter(_.contains("Stream1")).count()
errors.cache()  // cache datasets in memory to speed up reuse

*Source: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

A Quick Intro to Spark (continued)

Logistic regression performance (30 GB dataset, 80-core cluster):
• Hadoop: 127 s per iteration
• Spark: first iteration 174 s, further iterations 6 s

• Up to 20x faster than Hadoop for interactive jobs
• Scan a 1 TB dataset with 5–7 sec latency

*Source: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Conviva GeoReport

• Aggregations on many keys with the same WHERE clause
• The 40× gain comes from:
  – Not rereading unused columns or filtered records
  – Avoiding repeated decompression
  – In-memory storage of deserialized objects

[Chart: query time in hours]

Back-end Storage

Amazon Redshift, Amazon S3, Amazon DynamoDB, Amazon RDS, HDFS on Amazon EMR

Batch Processing

• EMR
  – Hive on EMR
  – Custom UDFs (user-defined functions) needed for the data warehouse (see the UDF sketch below)
• Redshift
  – More traditional data warehousing workloads
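Since custom UDFs for Hive on EMR come up above, here is a minimal, hypothetical UDF sketch using Hive's old-style UDF API; the class name and the user-agent parsing rule are illustrative only, not from the talk:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Sketch: a Hive UDF that maps a user-agent string from session logs to a coarse device type.
public final class DeviceTypeUDF extends UDF {
    public Text evaluate(Text userAgent) {
        if (userAgent == null) {
            return null;
        }
        String ua = userAgent.toString().toLowerCase();
        if (ua.contains("smarttv") || ua.contains("smart-tv")) {
            return new Text("tv");
        }
        if (ua.contains("mobile")) {
            return new Text("mobile");
        }
        return new Text("desktop");
    }
}

After packaging the class in a JAR, it would be registered with ADD JAR and CREATE TEMPORARY FUNCTION and then called from HiveQL like a built-in function, e.g., to group session logs by device type.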

The Challenge

[Pipeline diagram: mountains of raw data → ingest & stream → real-time processing and back-end processing (storage/DWH) → dashboards, user personalization, user experience]

The Solution

[Diagram: mountains of raw data → ingest & stream (Amazon S3, Amazon SQS, Amazon DynamoDB, Kafka, Amazon Kinesis) → real-time processing (Amazon EMR, Amazon Redshift, Storm, Spark) and back-end processing/storage/DWH (Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon Redshift, Amazon EMR) → audience engagement]

What’s next …

• On-the-fly adaptive bit rate – and, in the future, frame rate and resolution
• Dynamic personalized ad insertions
• Session analytics for more monetization opportunities
• Social media chat

Pick up the remote

Start watching

Ultimate entertainment experience…

Please give us your feedback on this presentation. As a thank you, we will select prize winners daily for completed surveys!

MED303
