StatServer-Samza: Near Real-Time Analytics

StatServer-Samza: Near Real-time Analytics. Ranking infrastructure, Tomy Tsai. Special thanks to: David Stein, Clement Fung.

Upload: chang-ming-tsai

Post on 16-Apr-2017


TRANSCRIPT

Page 1: StatServer-Samza: Near Real-Time Analytics

StatServer-Samza: Near Real-time Analytics

Ranking infrastructure, Tomy Tsai

Special thanks to: David Stein, Clement Fung

Page 2: StatServer-Samza: Near Real-Time Analytics

Why We Need Real-Time Counting

Click

Page 3: StatServer-Samza: Near Real-Time Analytics

Impression Discounting

Click

Page 4: StatServer-Samza: Near Real-Time Analytics

Other Use Cases
• News feed
  – Show trending news
• Impression boosting
• Most popular ads
• Real-time CTR = (# clicks / # views)
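The CTR formula above can be written as a small helper. This is only a sketch: the function name `ctr` and the choice to return 0.0 when there are no views yet are assumptions, not something the slides specify.

```python
def ctr(clicks: int, views: int) -> float:
    """Click-through rate as # clicks / # views.

    Assumption: with zero views the rate is reported as 0.0
    instead of raising a division error.
    """
    if views <= 0:
        return 0.0
    return clicks / views
```

In a near real-time setting both inputs would come from windowed counters, so the ratio changes as soon as new click or view events are counted.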

Page 5: StatServer-Samza: Near Real-Time Analytics

StatServer @ LinkedIn
• Near real-time counting for in-session relevance applications
• Widely used in company products
• Recently, we re-designed and implemented StatServer in Samza
  – Why?

Page 6: StatServer-Samza: Near Real-Time Analytics

Outline
• What does StatServer do?
• How did StatServer work?
• Why use Samza instead?
• Re-design Writer in Samza
• Results

Page 7: StatServer-Samza: Near Real-Time Analytics

Query form
• For each platform, how many times has the job titled "Engineer@LinkedIn" been viewed in the last 5 minutes?
• SELECT platform, count(*)
  FROM job_log
  WHERE title = 'Engineer@LinkedIn' AND action = 'view' AND time > now - 5 minutes
  GROUP BY platform
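To make the query semantics concrete, here is a minimal sketch that evaluates the same filter and group-by over an in-memory list of events. The `Event` record, its field names, and `count_by_platform` are hypothetical names chosen for illustration; timestamps are assumed to be seconds.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical tracking event; field names are assumptions."""
    title: str
    action: str
    platform: str
    ts: float  # event time in seconds


def count_by_platform(events, title, action, now, window_s=300):
    """Count matching events per platform inside the trailing window.

    Mirrors: WHERE title = ... AND action = ... AND time > now - 5 minutes
             GROUP BY platform
    """
    cutoff = now - window_s
    counts = Counter(
        e.platform
        for e in events
        if e.title == title and e.action == action and e.ts > cutoff
    )
    return dict(counts)
```

Scanning the raw log like this costs time proportional to the number of events for every query, which is the cost a pre-materialized lookup avoids.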

Page 8: StatServer-Samza: Near Real-Time Analytics

Pre-materialized query
• Store counts in a key-value store
  – Key: job_title, action, time
  – Value: { platform: counts }
• Answer the query with a store lookup
  – Fast
  – High concurrency
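A minimal sketch of the pre-materialized layout, with an in-memory dictionary standing in for the key-value store. The one-minute time bucket (`BUCKET_S`) and the function names are assumptions; the slides only say that the key contains a time component.

```python
from collections import Counter, defaultdict

BUCKET_S = 60  # assumed bucket width in seconds; not stated in the slides


def bucket(ts):
    """Map a timestamp in seconds to its time-bucket index."""
    return int(ts // BUCKET_S)


def record(store, job_title, action, platform, ts):
    """Write path: bump the per-platform count under (job_title, action, bucket)."""
    store[(job_title, action, bucket(ts))][platform] += 1


def lookup(store, job_title, action, now, window_s=300):
    """Read path: sum the buckets that cover the trailing window."""
    total = Counter()
    last = bucket(now)
    first = last - window_s // BUCKET_S + 1
    for b in range(first, last + 1):
        # .get avoids creating empty entries in the defaultdict
        total.update(store.get((job_title, action, b), {}))
    return dict(total)
```

With this layout a five-minute query touches at most five keys regardless of traffic, at the price of window edges being rounded to bucket boundaries.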

Page 9: StatServer-Samza: Near Real-Time Analytics

Query form variation
• Form 1
  – Filter-by: job_title, action, time
  – Group-by: { platform: counts }
• Form 2
  – Filter-by: job_title, platform, time
  – Group-by: { action: counts }
  – Can reuse the extracted feature values from Form 1
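The reuse mentioned for Form 2 can be sketched as follows: one extracted feature record yields the key and group-by field for both forms, so extraction runs once per event. The dictionary layout and the names `materialized_updates` and `apply_feature` are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def materialized_updates(feature):
    """Return (key, group_by_field) pairs for both query forms.

    `feature` is an assumed dict with job_title, action, platform, time.
    """
    title, action = feature["job_title"], feature["action"]
    platform, t = feature["platform"], feature["time"]
    form1 = ((title, action, t), platform)   # value shape: {platform: counts}
    form2 = ((title, platform, t), action)   # value shape: {action: counts}
    return [form1, form2]


def apply_feature(store, feature):
    """Increment the counters of both forms from one extracted feature."""
    for key, field in materialized_updates(feature):
        store[key][field] += 1
```

Both forms share the same three-part key shape here, so a real store would need something extra (a form identifier or separate tables) to keep them apart; that detail is left out of the sketch.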

Page 10: StatServer-Samza: Near Real-Time Analytics

Outline
• What does StatServer do?
• How did StatServer work?
• Why use Samza instead?
• Re-design Writer in Samza
• Results

Page 11: StatServer-Samza: Near Real-Time Analytics

Infrastructure

[Diagram: Kafka tracking data -> StatServer Writer -> materialized count data in Voldemort -> StatServer Reader -> client applications]

Page 12: StatServer-Samza: Near Real-Time Analytics

Writer Workflow

[Diagram, StatServer-Writer: Kafka data (Job Click, Job View) -> Job Click / Job View Feature Extractors -> batch queue -> Batcher (Count Aggregator) -> writer queue -> Writers (four shown) -> materialized count data in Voldemort]
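The Batcher (Count Aggregator) stage in this workflow can be illustrated with a short sketch: it collapses a batch of extracted features into one count per key and field, so the writers issue fewer store updates. The function names and the plain dictionary standing in for Voldemort are assumptions, not the original implementation.

```python
from collections import Counter


def aggregate_batch(features):
    """Collapse (key, field) features into one count per distinct pair."""
    counts = Counter()
    for key, field in features:
        counts[(key, field)] += 1
    return counts


def flush(counts, store):
    """Apply one aggregated update per (key, field); return the number of writes."""
    writes = 0
    for (key, field), n in counts.items():
        row = store.setdefault(key, {})
        row[field] = row.get(field, 0) + n
        writes += 1
    return writes
```

For a batch of three events on the same key, two of them from the same platform, this issues two writes instead of three; the saving grows with how skewed the traffic is.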

Page 13: StatServer-Samza: Near Real-Time Analytics

Query form variation

[Diagram, StatServer-Writer: same pipeline as the Writer Workflow slide (Kafka data -> Feature Extractors -> batch queue -> Batcher (Count Aggregator) -> writer queue -> Writers -> Voldemort), now producing both materialized forms]

{job_title, action, time} -> {platform: counts}
{job_title, platform, time} -> {action: counts}

Page 14: StatServer-Samza: Near Real-Time Analytics

Outline
• What does StatServer do?
• How did StatServer work?
• Why use Samza instead?
• Re-design Writer in Samza
• Results

Page 15: StatServer-Samza: Near Real-Time Analytics

StatServer issues
• Multi-tenant problem
  – When one application used too many system resources, all other applications were impacted
  – Need a clean way to scale out applications

Page 16: StatServer-Samza: Near Real-Time Analytics

Partition vs Concurrency
• All writers may write to the same key
  – Some writers may be bounced and need to retry
  – Cannot cache the previous value locally
• What if we re-partition the data before writing?
  – We can write concurrently
  – We can cache the value on the local machine
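The re-partitioning idea can be sketched like this: if every update for a key is routed to the same partition, only one writer ever touches that key, so writers do not contend and each can keep the running value in a local cache. `partition_for`, `PartitionWriter`, and `route` are hypothetical names; the CRC32 hash and the plain dictionary standing in for the remote store are assumptions.

```python
import zlib


def partition_for(key, num_partitions):
    """Stable hash partitioning: the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class PartitionWriter:
    """Writer that owns every key hashed to its partition."""

    def __init__(self, store):
        self.store = store  # stand-in for the remote key-value store
        self.cache = {}     # local running counts for the keys this writer owns

    def add(self, key, delta):
        if key not in self.cache:
            # Read the store once; afterwards the cache is authoritative
            # because no other writer updates this key.
            self.cache[key] = self.store.get(key, 0)
        self.cache[key] += delta
        self.store[key] = self.cache[key]


def route(updates, num_partitions, store):
    """Send each (key, delta) update to the writer that owns the key."""
    writers = [PartitionWriter(store) for _ in range(num_partitions)]
    for key, delta in updates:
        writers[partition_for(key, num_partitions)].add(key, delta)
    return writers
```

In the Samza design this routing would be done by the partition key of an intermediate Kafka topic; the sketch only shows why single ownership removes write conflicts and makes local caching safe.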

Page 17: StatServer-Samza: Near Real-Time Analytics

Why Samza?
• Resource isolation
  – Avoids the multi-tenant problem
• Easy and flexible re-partitioning
  – Improves concurrency
• Other advantages
  – Local state support, scalability control

Page 18: StatServer-Samza: Near Real-Time Analytics

Outline
• What does StatServer do?
• How did StatServer work?
• Why use Samza instead?
• Re-design Writer in Samza
• Results

Page 19: StatServer-Samza: Near Real-Time Analytics

Original Architecture

[Diagram, StatServer-Writer: Kafka data (Job Click, Job View) -> Job Click / Job View Feature Extractors -> batch queue -> Batcher (Count Aggregator) -> writer queue -> Writers (four shown) -> materialized count data in Voldemort]

Page 20: StatServer-Samza: Near Real-Time Analytics

Use Kafka to Replace Queue

[Diagram, StatServer-Samza-Writer: Kafka data (Job Click, Job View) -> Job Click / Job View Feature Extractors -> Kafka (Extracted Features), replacing the batch queue -> Batcher (Count Aggregator) -> Kafka (Materialized Query), replacing the writer queue -> Writers (four shown) -> materialized count data in Voldemort]

Page 21: StatServer-Samza: Near Real-Time Analytics

Components as Samza Jobs

[Diagram, StatServer-Samza-Writer: Kafka data (Job Click, Job View) -> Job Click / Job View Feature Extractors -> Kafka (Extracted Features) -> Batcher (Count Aggregator) -> Kafka (Materialized Query) -> Writer -> materialized count data in Voldemort]

Page 22: StatServer-Samza: Near Real-Time Analytics

Another Query Form

[Diagram, StatServer-Samza-Writer: Kafka data (Job Click, Job View) -> Job Click / Job View Feature Extractors -> Batcher for Query Form 1 -> Writer for Query Form 1, and Batcher for Query Form 2 -> Writer for Query Form 2 -> materialized count data in Voldemort]

Why not combine batcher and writer?

Page 23: StatServer-Samza: Near Real-Time Analytics

Outline
• What does StatServer do?
• How did StatServer work?
• Why use Samza instead?
• Re-design Writer in Samza
• Results

Page 24: StatServer-Samza: Near Real-Time Analytics

Current Status
• StatServer-Samza is running successfully
  – The computed results have been validated against the original StatServer
• The writer is twice as efficient, because of
  – Improved concurrency
  – Effective local cache

Page 25: StatServer-Samza: Near Real-Time Analytics

Deployment Issues
• System resource tuning is not straightforward
  – A restart causes a burst in system resource usage
• The number of Kafka topic partitions also needs fine-tuning
  – Too many partitions cause system resource overhead
  – Too few partitions cause the system to lag behind

Page 26: StatServer-Samza: Near Real-Time Analytics

More Lessons Learned
• Windowable task: manual commit
• RocksDB (local store) doesn't support "delete all keys"
  – TTL is good for managing caches
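The TTL point can be illustrated with a small in-memory cache in which entries expire on their own, so no "delete all keys" operation is needed to clear stale counts. This is a stand-in sketch, not RocksDB's or Samza's own TTL mechanism; the class name `TTLCache` and the injectable `clock` parameter are assumptions.

```python
import time


class TTLCache:
    """Cache whose entries expire `ttl_s` seconds after they are written."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so expiry can be tested deterministically
        self._data = {}      # key -> (value, expires_at)

    def put(self, key, value):
        self._data[key] = (value, self._clock() + self._ttl_s)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired entries are dropped lazily, on access.
            del self._data[key]
            return default
        return value

    def __len__(self):
        return len(self._data)
```

Because expiry is per entry, old window counts disappear without a bulk delete; a real local store would typically also reclaim expired entries in the background instead of only on access.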

Page 27: StatServer-Samza: Near Real-Time Analytics

Conclusion
• Samza is a clean, powerful framework for stream processing
  – StatServer-Samza is more modular than before
• Samza/Kafka configuration can greatly impact performance

Page 28: StatServer-Samza: Near Real-Time Analytics

Thanks to

StatServer Developers

David Stein

Lance Wall

Tomy Tsai

Clement Fung

Joel Young

Samza Team

Kafka Team

Voldemort Team