Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. Maximizing Audience Engagement in Media Delivery. Usman Shakeel, Amazon Web Services; Shobana Radhakrishnan, Engineering Manager at Netflix. November 14th, 2013.

Upload: amazon-web-services

Post on 27-Jan-2015


DESCRIPTION

Providing a great media consumption experience to customers is crucial to maximizing audience engagement. To do that, it is important that you make content available for consumption anytime, anywhere, on any device, with a personalized and interactive experience. This session explores the power of big data log analytics (real-time and batched), using technologies like Spark, Shark, Kafka, Amazon Elastic MapReduce, Amazon Redshift and other AWS services. Such analytics are useful for content personalization, recommendations, personalized dynamic ad-insertions, interactivity, and streaming quality. This session also includes a discussion from Netflix, which explores personalized content search and discovery with the power of metadata.

TRANSCRIPT

Page 1: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Maximizing Audience Engagement in Media

Delivery

Usman Shakeel, Amazon Web Services

Shobana Radhakrishnan, Engineering Manager at Netflix

November 14th 2013

Page 2: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Consumers Want …

• To watch content that matters to them

– From anywhere

– For free

• Content to be easily accessible

– Nicely organized according to their taste

– Instantly accessible from all different devices anywhere

• Content to be of high quality

– Without any “irrelevant” interruptions

– Personalized ads

• To multitask while watching content

– Interactivity, second screens, social media, share

Page 3: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Ultimate Customer Experience

• Ultimate content discovery

• Ultimate content delivery

Page 4: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Content Discovery

Page 5: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Content Choices Evolution

Page 6: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

… And More

+

Page 7: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Content Discovery Evolution

Page 8: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013
Page 9: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

… And More

Personalized

Row Display

Similarity

Algorithms

Unified Search

Page 10: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Content Delivery …

Page 11: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

User Engagement in Online Video

[Source: Conviva Viewer Experience Report – 2013]

Page 12: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Personalized Ads?

Page 13: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Mountains of raw data …

Page 14: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Data Sources

• Content discovery

– Metadata

– Session logs

• Content delivery

– Video logs

– Page-click event logs

– CDN logs

– Application logs

• Computed along several high-cardinality dimensions

• Very large datasets for a specific time frame

Page 15: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Mountains of Raw Data …

Some numbers from Netflix

– Over 40 million customers

– More than 50 countries and territories

– Translates to hundreds of billions of events in a short period of time

– Over 100 billion metadata operations a day

– Over 1 billion viewing hours per month
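To put these figures in perspective, a rough back-of-the-envelope calculation follows. It only uses the numbers on this slide; because the slide says "over", the results are lower bounds, and the 30-day month is an assumption made for illustration.

```python
# Back-of-the-envelope rates derived from the figures on this slide.
CUSTOMERS = 40_000_000            # "over 40 million customers"
METADATA_OPS_PER_DAY = 100e9      # "over 100 billion metadata operations a day"
VIEWING_HOURS_PER_MONTH = 1e9     # "over 1 billion viewing hours per month"

SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30               # assumption, not from the slide

ops_per_second = METADATA_OPS_PER_DAY / SECONDS_PER_DAY
hours_per_customer_per_month = VIEWING_HOURS_PER_MONTH / CUSTOMERS
hours_per_customer_per_day = hours_per_customer_per_month / DAYS_PER_MONTH

print(f"{ops_per_second:,.0f} metadata operations per second")
print(f"{hours_per_customer_per_month:.0f} viewing hours per customer per month")
print(f"{hours_per_customer_per_day:.2f} viewing hours per customer per day")
```

That is more than a million metadata operations every second, which is why the later slides insist on serving metadata from memory.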

Page 16: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

… To Useful Information ASAP

• Historical data – Batch Analysis

• Live data – Real-time Analysis

Page 17: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Historic vs. Real-time Analytics

100% Dynamic

• Always computing on the fly

• Flexible but slow

• Scale is very hard

• Content Delivery

100% Pre-computed

• Superfast lookups

• Rigid

• Does not cover all the use cases

• Content Discovery
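The two columns describe opposite ends of a spectrum. One possible middle ground, sketched below, answers from a pre-computed table when it can and falls back to computing on the fly otherwise. This is an illustrative assumption rather than anything shown in the session; `HybridLookup`, `count_on_the_fly` and the sample events are hypothetical names and data.

```python
from typing import Callable, Dict, Hashable


class HybridLookup:
    """Serve pre-computed answers when available, compute on the fly otherwise."""

    def __init__(self, precomputed: Dict[Hashable, int],
                 compute: Callable[[Hashable], int]) -> None:
        self._precomputed = dict(precomputed)
        self._compute = compute
        self.hits = 0      # answered from the pre-computed table (fast, rigid)
        self.misses = 0    # answered dynamically (flexible, slow)

    def get(self, key: Hashable) -> int:
        if key in self._precomputed:
            self.hits += 1
            return self._precomputed[key]
        self.misses += 1
        value = self._compute(key)
        self._precomputed[key] = value  # remember the answer for next time
        return value


# Hypothetical (title, device) play events.
events = [("title-1", "tv"), ("title-1", "phone"), ("title-2", "tv")]


def count_on_the_fly(key: Hashable) -> int:
    """Dynamic path: scan every event for each request."""
    return sum(1 for event in events if event == key)


lookup = HybridLookup({("title-1", "tv"): 1}, count_on_the_fly)
first = lookup.get(("title-1", "tv"))    # served from the pre-computed table
second = lookup.get(("title-2", "tv"))   # computed on the fly, then remembered
third = lookup.get(("title-2", "tv"))    # now served from the table
```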

Page 18: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

The Challenge

[Diagram. Recoverable labels: Mountains of Raw Data; Ingest & Stream; Real-time Processing; Back-end Processing Storage/DWH; Dashboards / User Personalization / User Experience]

Page 19: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Agenda

• Ultimate Content Discovery

– How Netflix creates personalized content and the power of metadata

• Ultimate Content Delivery

– The toolset for real-time big data processing

Page 20: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Content Discovery Personalized Experience

Page 21: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Why Is Personalization Important?

• Key streaming metrics

– Median viewing hours

– Net subscribers

• Personalization consistently improves these

• Over 75% of what people watch comes from recommendations

Page 22: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

What Is Video Metadata?

• Genre

• Cast

• Rating

• Contract information

• Streaming and deployment information

• Subtitles, dubbing, trailers, stills, actual content

• Hundreds of such attributes

Page 23: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Used For..

• User-specific choices, e.g., language, viewing behavior, taste preferences

• Recommendation algorithms

• Device-specific rendering and playback

• CDN deployment

• Billboards, trailer display, streaming

• Original programming

• Basically everything!

Page 24: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Our Solution

Mountains of Video Metadata

Batch Processing

Relevant Organized Data for Consumption

• Data snapshots to Amazon S3 (~10)

• Metadata publishing engine (one EC2 instance per country) generates Amazon S3 facets (~10 per country)

• Metadata cache reads Amazon S3 periodically and serves high-availability apps deployed on EC2 (~2000 m2.4xl and m2.2xl)

Page 25: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Metadata Platform Architecture

[Architecture diagram. Recoverable labels:]

• Various Metadata Generation and Entry Tools

• Put Snapshot Files – One per Source

• Amazon S3 (com.netflix.mpl.test.coldstart.services)

• Get Snapshot Files

• Offline Metadata Processing

• Publishing Engine

• Put Facets (10 per Country per Cycle)

• Amazon S3

• Get Blobs (~7 GB files, 10 Gets per Instance Refresh)

• (netflix.vms.blob.file.instancetype.region)

• (EC2 Instances – m2.2xl or m2.4xl)

• Playback, Devices, API, Algorithms, …..

Page 26: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

[Pipeline diagram. Stages: Data Entry and Encoding Tools; Persistent Storage; Amazon S3; Publishing Engine; In-memory Cache; Apps]

Recoverable labels:

• Metadata entered

• Hourly data snapshots

• Check for snapshots

• Generate, write artifacts

• Efficient resource utilization

• Quick data propagation

• Java API calls

• Periodic cache refresh

• Low footprint

• Quick start-up/refresh
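The "periodic cache refresh" step can be sketched as an in-memory cache that loads a complete snapshot and swaps it in atomically, so readers never see a half-loaded view. This is a minimal sketch under stated assumptions, not the Netflix implementation: `SnapshotCache` and `load_snapshot` are hypothetical, and a plain dictionary stands in for the artifact published to Amazon S3.

```python
import threading
from typing import Callable, Dict, Optional


class SnapshotCache:
    """In-memory cache that is replaced wholesale on each refresh."""

    def __init__(self, load_snapshot: Callable[[], Dict[str, dict]]) -> None:
        self._load_snapshot = load_snapshot
        self._lock = threading.Lock()
        self._data: Dict[str, dict] = {}
        self.version = 0

    def refresh(self) -> None:
        # Build the new view off to the side, then swap it in under the lock.
        fresh = dict(self._load_snapshot())
        with self._lock:
            self._data = fresh
            self.version += 1

    def get(self, video_id: str) -> Optional[dict]:
        # All reads are served from memory; no remote call on the request path.
        with self._lock:
            return self._data.get(video_id)


store = {"v1": {"genre": "drama"}}    # hypothetical stand-in for the published artifact
cache = SnapshotCache(lambda: store)
cache.refresh()

store["v2"] = {"genre": "comedy"}     # the publishing engine writes a new artifact
before = cache.get("v2")              # not visible until the next refresh
cache.refresh()
after = cache.get("v2")
```

In a real deployment `refresh` would run on a timer (hourly, per the slides); it is called by hand here to keep the example deterministic.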

Page 27: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Initially..

~2000 EC2 instances

Page 28: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Target Application Scale

• File size 2 GB–15 GB

• ~10 per country (20 total)

• ~2000 instances (m2.2xl or m2.4xl) accessing these files once an hour via cache refresh from Amazon S3

• Availability goal: streaming 99.99%, sign-ups 99.9%

– 100% of metadata access in memory

– Autoscaling to efficiently manage, startup time
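The availability goals translate into concrete downtime budgets, and the instance count into a steady Amazon S3 read load. The arithmetic below is a rough sketch: it assumes a 365-day year and takes the "10 Gets per Instance Refresh" figure from the architecture slide.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a service may be unavailable at the given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR * 60


streaming_budget = allowed_downtime_minutes(99.99)   # roughly 53 minutes per year
signup_budget = allowed_downtime_minutes(99.9)       # roughly 8.8 hours per year

INSTANCES = 2000          # "~2000 instances"
GETS_PER_REFRESH = 10     # "10 Gets per Instance Refresh"
gets_per_hour = INSTANCES * GETS_PER_REFRESH         # one refresh per instance per hour

print(f"Streaming: {streaming_budget:.1f} min/year, sign-ups: {signup_budget:.1f} min/year")
print(f"{gets_per_hour:,} S3 GETs per hour for multi-gigabyte files")
```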

Page 29: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

And Then..

6000+ EC2 instances

Page 30: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Target Application Scale

• File size 2 GB–15 GB

• ~10 per country (500 total, up from 20)

• ~6000 instances, up from ~2000 (m2.2xl or m2.4xl), accessing via cache refresh from Amazon S3

• 100% of access in-memory to achieve high availability

• Autoscaling to efficiently manage, startup time

Page 31: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Effects

• Slower file writes

• Longer publish time

• Slower startup and cache refresh

Page 32: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Amazon S3 Tricks That Helped

• Fewer writes

– Region-based publishing engine instead of per-country

– Blob images rather than facets

– 10 Amazon S3 writes per cycle (down from 500)

• Smaller file sizes

– Deduping moved to prewrite processing

– Compression: zipped data snapshot files from source

• Multipart writes
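The three tricks above can be sketched in a few lines: drop duplicates before writing, compress the payload, and split a large object into parts for a multipart upload. This is an illustrative sketch only. The record format, the 64 MiB part size and the helper names are assumptions; the one external fact relied on is that Amazon S3 multipart parts (other than the last) must be at least 5 MiB. No upload is performed here.

```python
import gzip
import json
import math

MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for every part except the last


def dedupe(records):
    """Drop exact duplicate records while preserving first-seen order."""
    seen = set()
    unique = []
    for record in records:
        key = json.dumps(record, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def compress(records) -> bytes:
    """Serialize records as newline-delimited JSON and gzip the result."""
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in records)
    return gzip.compress(payload.encode("utf-8"))


def plan_parts(total_bytes: int, part_size: int = 64 * 1024 * 1024):
    """Return (offset, length) ranges covering an object of total_bytes."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum")
    count = max(1, math.ceil(total_bytes / part_size))
    return [(i * part_size, min(part_size, total_bytes - i * part_size))
            for i in range(count)]


records = [
    {"id": 1, "genre": "drama"},
    {"id": 1, "genre": "drama"},   # duplicate, removed before the write
    {"id": 2, "genre": "comedy"},
]
unique = dedupe(records)
blob = compress(unique)
parts = plan_parts(7 * 1024 ** 3)  # a ~7 GB blob, as on the architecture slide
```

Each planned range could then be uploaded independently and in parallel, which is what makes multipart writes attractive for files of this size.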

Page 33: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Results

• Significant reduction in average memory footprint

• Significant reduction in application startup times

• Shorter publish times

Page 34: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

What We Learned

• In-memory cache (NetflixGraph) effective for high availability

• Startup time important when using autoscaling

• Use Amazon S3 best practices

• Circuit breakers
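The slide only names circuit breakers, so here is a generic sketch of the pattern rather than the implementation used at Netflix: after a number of consecutive failures the breaker stops calling the failing dependency for a cool-down period and returns a fallback instead. `CircuitBreaker`, its parameters and the fake clock are all hypothetical.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a cool-down period."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                return fallback()      # open: short-circuit without calling func
            self._opened_at = None     # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0             # success closes the breaker
        return result


now = [0.0]                            # fake clock so the example is deterministic
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])
calls = []


def flaky():
    calls.append(now[0])
    raise RuntimeError("metadata source unavailable")


def cached_value():
    return "stale-but-available"


results = [breaker.call(flaky, cached_value) for _ in range(4)]
now[0] = 11.0                          # cool-down has elapsed
recovered = breaker.call(lambda: "fresh", cached_value)
```

Only the first two calls reach the failing dependency; the next two are short-circuited, which protects both the caller and the struggling service.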

Page 35: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Future Architecture

• Dynamically controlled cache

• Dynamic code inclusion

• Parallelized publishing engine

Page 36: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Content Delivery

Toolset for real-time big-data processing

Page 37: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

The Challenge

[Diagram. Recoverable labels: Mountains of Raw Data; Ingest & Stream; Real-time Processing; Back-end Processing Storage/DWH; Dashboards / User Personalization / User Experience]

Page 38: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Ingest and Stream

Amazon S3, Amazon SQS, Amazon DynamoDB, Amazon Kinesis, Kafka

Page 39: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Amazon Kinesis: Enabling real-time ingestion and processing of streaming data

[Diagram. Recoverable labels: User Data Sources (×4); PUT; AWS VIP; Amazon Kinesis; Control Plane; GET; Amazon Kinesis Enabled Application; User App.1 [Aggregate & De-Duplicate]; User App.2 [Metric Extraction]; User App.3 [Sliding Window]; Amazon S3; DynamoDB; Amazon Redshift]

Page 40: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

A quick intro to Amazon Kinesis

Producers

• Generate a Stream of data

• Data records from producers are Put into a Stream using a developer-supplied Partition Key, which places records within a specific Shard

Kinesis Cluster

• A managed service that captures and transports data Streams

• Multiple Shards; each supports 1 MB/sec

• Developer controls the number of shards; all shards stored for 24 hours

• HA & DD by 3-way replication (3X AZs)

• Each data record has a Kinesis-assigned Sequence #

Workers

• Each Shard is processed by a Worker running on EC2 instances that the developer owns and controls

[Diagram. Recoverable labels: Producers; Kinesis Cluster with shards S0–S4; workers W1–W6 on EC2 instances]
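How a partition key ends up in a specific shard can be sketched as follows. Kinesis hashes the partition key with MD5 into a 128-bit integer, and each shard owns a range of that hash space. The equal-sized ranges and the five-shard layout (matching S0–S4 in the diagram) are assumptions for illustration; a real stream's ranges need not be equal.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_ranges(shard_count: int):
    """Split the hash space into contiguous ranges (assumed equal-sized)."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key: str, ranges) -> int:
    """Return the index of the shard whose range contains the key's hash."""
    hashed = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hashed <= end:
            return index
    raise ValueError("hash value outside all shard ranges")


ranges = shard_ranges(5)
keys = [f"viewer-{n}" for n in range(1000)]          # hypothetical partition keys
assignment = {key: shard_for(key, ranges) for key in keys}
```

Because the mapping depends only on the key, all records for one viewer land in the same shard and are seen in order by the same worker.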

Page 41: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Processing

Amazon EMR Amazon Redshift

Page 42: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

A Quick Intro to Storm

• Similar to a Hadoop cluster

• Topology vs. jobs (Storm vs. Hadoop)

– A topology runs forever (unless you kill it)

storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2

• Streams: unbounded sequences of tuples

– Example: transform a stream of tweets into a stream of trending topics

– Spout: source of streams (e.g., connect to a log API and emit a stream of logs)

– Bolt: consumes any number of input streams, does some processing, and emits new streams (e.g., filters, unions, compute)

*Source: https://github.com/nathanmarz/storm/wiki/Tutorial
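The spout and bolt roles can be imitated in plain Python with generators. This is only an analogy to show the data flow, not Storm's API: the function names and the "session,event" log format are hypothetical, and the input here is a finite list, whereas a real topology processes an unbounded stream.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def log_spout(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Spout: emit (session_id, event) tuples from raw log lines."""
    for line in lines:
        session_id, event = line.split(",", 1)
        yield session_id, event


def filter_bolt(stream, wanted: str):
    """Bolt: pass through only tuples whose event matches `wanted`."""
    for session_id, event in stream:
        if event == wanted:
            yield session_id, event


def count_bolt(stream) -> Counter:
    """Bolt: aggregate a count per session id."""
    counts = Counter()
    for session_id, _event in stream:
        counts[session_id] += 1
    return counts


logs = ["s1,BUFFER", "s2,PLAY", "s1,BUFFER", "s2,BUFFER"]
buffer_counts = count_bolt(filter_bolt(log_spout(logs), "BUFFER"))
```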

Page 43: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

A Quick Intro to Storm

Example: Get the count of ads that were clicked on and watched in a stream

import backtype.storm.drpc.LinearDRPCTopologyBuilder;
import backtype.storm.tuple.Fields;

// GetStream, GetViewers, PartialUniquer and CountAggregator are user-defined bolts.
// Create a topology
LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("reach");
// Get the stream that showed an ad. Transforms a stream of [id, ad] to [id, stream]
builder.addBolt(new GetStream(), 3);
// Get the viewers for ads. Transforms [id, stream] to [id, viewer]
builder.addBolt(new GetViewers(), 12).shuffleGrouping();
// Group the viewers stream by viewer id. Unique count of subset viewers
builder.addBolt(new PartialUniquer(), 6).fieldsGrouping(new Fields("id", "viewer"));
// Compute the aggregates for unique viewers
builder.addBolt(new CountAggregator(), 2).fieldsGrouping(new Fields("id"));

*Adapted from source: https://github.com/nathanmarz/storm/wiki/Tutorial
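The same four stages can be traced in plain Python to see why the grouping by viewer matters: if a given viewer always lands in the same partition, each partition can count its own unique viewers and the partial counts can simply be summed. The lookup tables and names below are hypothetical stand-ins for the bolts, and this is a single-process illustration rather than a Storm topology.

```python
import zlib

# Hypothetical lookup tables standing in for the GetStream and GetViewers bolts.
STREAMS_BY_AD = {"ad-1": ["stream-a", "stream-b"]}
VIEWERS_BY_STREAM = {"stream-a": ["v1", "v2"], "stream-b": ["v2", "v3"]}


def unique_viewers(ad: str, partitions: int = 3) -> int:
    """Count distinct viewers of an ad using partitioned partial uniquing."""
    buckets = [set() for _ in range(partitions)]
    for stream in STREAMS_BY_AD.get(ad, []):              # [id, ad] -> [id, stream]
        for viewer in VIEWERS_BY_STREAM.get(stream, []):  # [id, stream] -> [id, viewer]
            # Fields grouping: the same viewer always maps to the same partition.
            index = zlib.crc32(viewer.encode("utf-8")) % partitions
            buckets[index].add(viewer)                    # partial uniquer
    return sum(len(bucket) for bucket in buckets)         # count aggregator


reach = unique_viewers("ad-1")
```

Viewer `v2` appears in both streams but is counted once, because both occurrences are routed to the same bucket.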

Page 44: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Putting it together …

[Diagram. Recoverable labels: User Data Source (×2); PUT; AWS VIP; Amazon Kinesis; Control Plane; GET; Amazon Kinesis Enabled Application; User App.1 [Aggregate & De-Duplicate]; User App.2 [Metric Extraction]; User App.3 [Sliding Window]]
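The "Sliding Window" application in the diagram can be sketched as a counter that keeps only the events from the last N seconds. This is a minimal illustration under the assumption that events arrive in timestamp order; `SlidingWindowCounter` is a hypothetical name and the timestamps are plain numbers in seconds.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events whose timestamps fall within the last `window_seconds`."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add(self, timestamp: float) -> None:
        self._timestamps.append(timestamp)
        self._evict(timestamp)

    def count(self, now: float) -> int:
        self._evict(now)
        return len(self._timestamps)

    def _evict(self, now: float) -> None:
        # Drop events that are a full window (or more) older than `now`.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()


window = SlidingWindowCounter(window_seconds=60)
for ts in (0, 10, 50, 65):           # e.g. buffering events for one stream
    window.add(ts)
recent = window.count(now=65)        # events at 10, 50 and 65 remain
later = window.count(now=111)        # only the event at 65 remains
```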

Page 45: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

A Quick Intro to Spark

• Language-integrated interface in Scala

• General-purpose programming interface that can be used for interactive data mining on clusters

• Example (count buffer events from a streaming log)

val lines = spark.textFile("hdfs://...")            // define a dataset backed by a file
val errors = lines.filter(_.startsWith("BUFFER"))   // keep only buffer events
errors.persist()                                     // persist in memory
errors.count()
errors.filter(_.contains("Stream1")).count()
errors.cache()                                       // cache datasets in memory to speed up reuse

*Source: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
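Why persisting helps can be shown with a toy, single-machine analogue: a lazily evaluated dataset that re-reads its source on every action unless it has been told to keep its results in memory. `Dataset` below is a hypothetical class written for this illustration; it is not Spark's API and does nothing distributed.

```python
class Dataset:
    """Toy lazily evaluated dataset that can keep its results in memory."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function returning an iterable
        self._persist = False
        self._cached = None

    def filter(self, predicate):
        # Nothing is evaluated here; the parent is only read when an action runs.
        return Dataset(lambda: (x for x in self.collect() if predicate(x)))

    def persist(self):
        self._persist = True
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached
        result = list(self._compute())
        if self._persist:
            self._cached = result
        return result

    def count(self):
        return len(self.collect())


scans = []                           # records each full read of the "log"


def read_log():
    scans.append(1)
    return ["BUFFER Stream1", "PLAY Stream1", "BUFFER Stream2"]


lines = Dataset(read_log)
errors = lines.filter(lambda line: line.startswith("BUFFER")).persist()
total = errors.count()                                           # reads the log once
stream1 = errors.filter(lambda line: "Stream1" in line).count()  # served from memory
```

The second query never touches the log again, which is the effect the `persist()` and `cache()` calls on the slide are after.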

Page 46: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

A Quick Intro to

Page 47: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Logistic Regression Performance

[Chart. Recoverable labels: 127 s / iteration; first iteration 174 s; further iterations 6 s; 30 GB dataset; 80-core cluster]

• Up to 20x faster than Hadoop interactive jobs

• Scan 1 TB dataset with 5–7 sec latency

*Source: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
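The chart figures explain where a speedup of around 20x comes from. The sketch below assumes, as in the cited RDD material, that 127 s per iteration is the Hadoop figure and that 174 s (first iteration) and 6 s (later iterations) are the Spark figures; the slide text itself has lost those labels.

```python
HADOOP_PER_ITERATION = 127   # seconds, assumed to be the Hadoop figure
SPARK_FIRST_ITERATION = 174  # seconds, includes loading the data into memory
SPARK_LATER_ITERATION = 6    # seconds, data already in memory


def hadoop_total(iterations: int) -> int:
    return HADOOP_PER_ITERATION * iterations


def spark_total(iterations: int) -> int:
    if iterations <= 0:
        return 0
    return SPARK_FIRST_ITERATION + SPARK_LATER_ITERATION * (iterations - 1)


def speedup(iterations: int) -> float:
    return hadoop_total(iterations) / spark_total(iterations)


for n in (1, 2, 10, 30, 1000):
    print(f"{n:>4} iterations: {speedup(n):.2f}x")
```

A single pass is slower in memory (the first iteration pays the load cost), the second iteration already breaks even, and as the iteration count grows the ratio approaches 127 / 6, a little over 21x.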

Page 48: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Conviva GeoReport

• Aggregations on many keys with the same WHERE clause

• 40× gain comes from:

– Not rereading unused columns or filtered records

– Avoiding repeated decompression

– In-memory storage of deserialized objects

[Chart axis: Time (hours)]
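The access pattern behind this report, many aggregations sharing one WHERE clause, can be sketched in a few lines: apply the filter once, keep the surviving rows in memory as ready-to-use objects, and reuse them for every grouping key. The session records and field names are hypothetical; this shows the pattern only and says nothing about how Conviva implemented it.

```python
from collections import Counter

# Hypothetical session records; field names are assumptions for illustration.
sessions = [
    {"country": "US", "device": "tv", "isp": "isp-a", "buffering": True},
    {"country": "US", "device": "phone", "isp": "isp-b", "buffering": False},
    {"country": "DE", "device": "tv", "isp": "isp-a", "buffering": True},
    {"country": "US", "device": "tv", "isp": "isp-b", "buffering": True},
]

# Apply the shared WHERE clause once and keep the result in memory.
filtered = [s for s in sessions if s["buffering"]]

# Reuse the filtered, already-deserialized rows for every GROUP BY key.
report = {
    key: Counter(s[key] for s in filtered)
    for key in ("country", "device", "isp")
}
```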

Page 49: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Back-end Storage

Amazon Redshift, Amazon S3, Amazon DynamoDB, Amazon RDS, HDFS on Amazon EMR

Page 50: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Batch Processing

• EMR

– Hive on EMR

– Custom UDF (user-defined function) needs for data warehouse

• Redshift

– More traditional data warehousing workload

Page 51: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

The Challenge

[Diagram. Recoverable labels: Mountains of Raw Data; Ingest & Stream; Real-time Processing; Back-end Processing Storage/DWH; Dashboards / User Personalization / User Experience]

Page 52: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

The Solution

[Diagram: the challenge diagram with services filled in, leading from Mountains of Raw Data to Audience Engagement]

• Ingest & Stream: Amazon S3, Amazon SQS, Amazon DynamoDB, Kafka, Amazon Kinesis

• Real-time Processing: Amazon EMR, Amazon Redshift, Storm, Spark

• Back-end Processing Storage/DWH: Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon Redshift, Amazon EMR

Page 53: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

What’s next …

• On-the-fly adaptive bit rate; in the future, frame rate and resolution

• Dynamic personalized ad insertions

• Session analytics for more monetization opportunities

• Social Media Chat

Page 54: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Pick up the remote

Start watching

Ultimate entertainment experience…

Page 55: Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013

Please give us your feedback on this presentation

As a thank you, we will select prize winners daily for completed surveys!

MED303