StatServer-Samza: Near Real-time Analytics

Ranking infrastructure, Tomy Tsai

Special thanks to: David Stein, Clement Fung

Why Do We Need Real-Time Counting?

Impression Discounting

Other Use Cases
• News feed
– Show trending news
• Impression boosting
• Most popular ads
• Real-time CTR = (# clicks / # views)

StatServer @ LinkedIn
• Near real-time counting for in-session relevance applications
• Widely used in company products
• Recently, we re-designed and implemented StatServer in Samza
– Why?

Outline
• What does StatServer do?
• How did StatServer work?
• Why use Samza instead?
• Re-design Writer in Samza
• Results

Query form
• For each platform, how many times has the job titled 'Engineer@LinkedIn' been viewed in the last 5 minutes?
• SELECT platform, count(*) FROM job_log WHERE title = 'Engineer@LinkedIn' AND action = 'view' AND time > now - 5 minutes GROUP BY platform
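To make the query semantics concrete, here is a minimal Python sketch of the same filter-and-group computation done by scanning raw events. The `job_log` list, its field names, and the sample timestamps are hypothetical stand-ins for the tracking data, not the actual StatServer schema.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical in-memory stand-in for the job_log table.
job_log = [
    {"title": "Engineer@LinkedIn", "action": "view", "platform": "web", "time": datetime(2015, 1, 1, 12, 1)},
    {"title": "Engineer@LinkedIn", "action": "view", "platform": "mobile", "time": datetime(2015, 1, 1, 12, 2)},
    {"title": "Engineer@LinkedIn", "action": "view", "platform": "web", "time": datetime(2015, 1, 1, 12, 3)},
    {"title": "Engineer@LinkedIn", "action": "view", "platform": "web", "time": datetime(2015, 1, 1, 11, 50)},
    {"title": "Engineer@LinkedIn", "action": "click", "platform": "web", "time": datetime(2015, 1, 1, 12, 4)},
]


def count_by_platform(log, title, action, now, window=timedelta(minutes=5)):
    """Count matching events per platform that fall inside the time window."""
    cutoff = now - window
    counts = Counter()
    for event in log:
        if event["title"] == title and event["action"] == action and event["time"] > cutoff:
            counts[event["platform"]] += 1
    return dict(counts)


now = datetime(2015, 1, 1, 12, 5)
result = count_by_platform(job_log, "Engineer@LinkedIn", "view", now)
```

Scanning every event per query is the slow path; it is shown only to pin down what the counts mean.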

Pre-materialized query
• Store counts in a Key-Value store
– Key: job_title, action, time
– Value: { platform: counts }
• Answer query by store lookup
– Fast
– High concurrency
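The key/value layout above can be sketched as follows. This is an illustration under stated assumptions: the `time` part of the key is taken to be a one-minute bucket, and a plain dictionary stands in for the key-value store. The function names are hypothetical.

```python
from collections import Counter, defaultdict

BUCKET_SECONDS = 60  # assumption: the "time" key component is a one-minute bucket

# Stand-in for the key-value store: (job_title, action, bucket) -> {platform: count}
store = defaultdict(Counter)


def bucket_of(timestamp):
    """Map a Unix timestamp (seconds) to its minute bucket."""
    return int(timestamp) // BUCKET_SECONDS


def record(job_title, action, platform, timestamp):
    """Write path: increment the pre-materialized count for one event."""
    store[(job_title, action, bucket_of(timestamp))][platform] += 1


def lookup(job_title, action, now, window_buckets=5):
    """Read path: sum the most recent buckets with plain key lookups."""
    total = Counter()
    current = bucket_of(now)
    for bucket in range(current - window_buckets + 1, current + 1):
        total.update(store.get((job_title, action, bucket), {}))
    return dict(total)


record("Engineer@LinkedIn", "view", "web", 0)       # bucket 0, outside the window below
record("Engineer@LinkedIn", "view", "web", 60)      # bucket 1
record("Engineer@LinkedIn", "view", "mobile", 130)  # bucket 2
record("Engineer@LinkedIn", "view", "web", 300)     # bucket 5
answer = lookup("Engineer@LinkedIn", "view", now=310)
```

With bucketed keys the window is only accurate to the bucket size, which is the usual trade-off for answering a query with a handful of lookups instead of a scan.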

Query form variation
• Form 1:
– Filter-by: job_title, action, time
– Group-by: { platform: counts }
• Form 2:
– Filter-by: job_title, platform, time
– Group-by: { action: counts }
– Can reuse the extracted feature values from form 1
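The reuse noted for Form 2 can be illustrated with a short sketch: features are extracted once per event and then fed into both key layouts. The event field names and the `materialize` helper are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def extract_features(event):
    """Pull out the fields both query forms need (done once per event)."""
    return event["job_title"], event["action"], event["platform"], event["time"]


def materialize(events):
    """Build both materialized forms from the same extracted features."""
    form1 = defaultdict(Counter)  # (job_title, action, time) -> {platform: count}
    form2 = defaultdict(Counter)  # (job_title, platform, time) -> {action: count}
    for event in events:
        job_title, action, platform, time_bucket = extract_features(event)
        form1[(job_title, action, time_bucket)][platform] += 1
        form2[(job_title, platform, time_bucket)][action] += 1
    return form1, form2


events = [
    {"job_title": "Engineer@LinkedIn", "action": "view", "platform": "web", "time": 1},
    {"job_title": "Engineer@LinkedIn", "action": "view", "platform": "mobile", "time": 1},
    {"job_title": "Engineer@LinkedIn", "action": "click", "platform": "web", "time": 1},
]
form1, form2 = materialize(events)
```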

Outline
• What does StatServer do?
• How did StatServer work?
• Why use Samza instead?
• Re-design Writer in Samza
• Results

Infrastructure

[Diagram: Kafka tracking data -> StatServer Writer -> materialized count data in Voldemort -> StatServer Reader -> client applications]

Writer Workflow

[Diagram: StatServer-Writer. Kafka data (Job Click, Job View) -> Job Click Feature Extractor / Job View Feature Extractor -> batch queue -> Batcher (Count Aggregator) -> writer queue -> multiple Writers -> materialized count data in Voldemort]
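The Batcher (Count Aggregator) stage of the writer workflow can be sketched roughly as below: it folds individual events into per-key counts and hands one aggregated update per key to the writer queue. The size-based flush trigger, the class name, and the tuple layout are assumptions for illustration; the slides do not say how batches are delimited.

```python
from collections import Counter
from queue import Queue


class Batcher:
    """Aggregate per-event increments and emit one update per (key, field)."""

    def __init__(self, writer_queue, batch_size=3):
        self.writer_queue = writer_queue
        self.batch_size = batch_size  # assumption: flush after a fixed number of events
        self.pending = Counter()
        self.seen = 0

    def add(self, key, field):
        """Count one extracted feature; flush when the batch is full."""
        self.pending[(key, field)] += 1
        self.seen += 1
        if self.seen >= self.batch_size:
            self.flush()

    def flush(self):
        """Push aggregated counts to the writer queue and reset the batch."""
        for (key, field), count in self.pending.items():
            self.writer_queue.put((key, field, count))
        self.pending.clear()
        self.seen = 0


writer_queue = Queue()
batcher = Batcher(writer_queue, batch_size=3)
key = ("Engineer@LinkedIn", "view", 1)
batcher.add(key, "web")
batcher.add(key, "web")
batcher.add(key, "mobile")  # third event triggers a flush

updates = []
while not writer_queue.empty():
    updates.append(writer_queue.get())
```

Three events become two store updates here, which is the point of aggregating before the writers touch the store.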

Query form variation

[Diagram: StatServer-Writer, same pipeline as the writer workflow (Job Click / Job View -> feature extractors -> batch queue -> Batcher (Count Aggregator) -> writer queue -> Voldemort), with two materialized forms:]

{job_title, action, time} -> {platform: counts}
{job_title, platform, time} -> {action: counts}

StatServer issues
• Multi-tenant problem
– When one application used too many system resources, all other applications were impacted
– Need a clean way to scale out applications

Partition vs Concurrency
• All writers may write to the same key
– Some writers may be bounced and need to retry
– Cannot cache the previous value locally
• What if we re-partition the data before writing?
– We can do concurrent writes
– We can cache the value on the local machine
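A minimal sketch of the re-partitioning idea: route every update for a given key to the same partition, so that exactly one writer owns the key and can keep its current value in a local cache. The CRC32 hash, the partition count, and the class names are assumptions for illustration; this is not Kafka's actual partitioner.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Stable hash so the same key always lands on the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class PartitionWriter:
    """Owns one partition; no other writer touches its keys."""

    def __init__(self):
        self.cache = Counter()  # local cache of the latest count per key

    def apply(self, key, delta):
        """Update the cached count without a read-modify-write race."""
        self.cache[key] += delta
        return self.cache[key]


writers = [PartitionWriter() for _ in range(NUM_PARTITIONS)]


def route(key, delta):
    """Send an update to the single writer that owns this key."""
    return writers[partition_for(key)].apply(key, delta)


route("Engineer@LinkedIn|view", 1)
latest = route("Engineer@LinkedIn|view", 2)
owners = sum(1 for writer in writers if "Engineer@LinkedIn|view" in writer.cache)
```

Because ownership is exclusive, writers for different partitions can run concurrently without bouncing each other's writes.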

Why Samza?
• Resource isolation
– Avoid multi-tenant problem
• Easy and flexible re-partition
– Improve concurrency
• Other advantages
– Local state support, scalability control

Original Architecture

[Diagram: StatServer-Writer. Job Click / Job View -> Job Click Feature Extractor / Job View Feature Extractor -> batch queue -> Batcher (Count Aggregator) -> writer queue -> Voldemort]

Use Kafka to Replace Queue

[Diagram: StatServer-Samza-Writer. Job Click / Job View -> Job Click Feature Extractor / Job View Feature Extractor -> Kafka (Extracted Features), replacing the batch queue -> Batcher (Count Aggregator) -> Kafka (Materialized Query), replacing the writer queue -> Voldemort]

Components as Samza Jobs

[Diagram: StatServer-Samza-Writer. Job Click / Job View -> Job Click Feature Extractor / Job View Feature Extractor -> Kafka (Extracted Features) -> Batcher (Count Aggregator) -> Kafka (Materialized Query) -> Writer -> Voldemort]

Another Query Form

[Diagram: StatServer-Samza-Writer. Kafka data (Job Click, Job View) -> Job Click Feature Extractor / Job View Feature Extractor -> Batcher for Query Form 1 -> Writer for Query Form 1, and Batcher for Query Form 2 -> Writer for Query Form 2 -> materialized count data in Voldemort]

Why not combine batcher and writer?

Current Status
• StatServer-Samza is running successfully
– The computed results have been validated against the original StatServer
• The writer is twice as efficient, because of
– Improved concurrency
– Effective local cache

Deployment Issues
• System resource tuning is not straightforward
– A restart causes a burst in system resource usage
• The Kafka topic partition count also needs fine-tuning
– Too many partitions cause system resource overhead
– Too few partitions cause the system to lag behind

More Lessons Learned
• Windowable task: manual commit
• RocksDB (local store) doesn't support "delete all keys"
– TTL is good for managing caches
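To illustrate why a TTL can stand in for a "delete all keys" operation, here is a small in-memory TTL cache sketch: entries simply age out, so nothing ever has to clear the store in bulk. This is a generic Python illustration with hypothetical names, not the RocksDB or Samza API.

```python
import time


class TTLCache:
    """Key-value cache whose entries expire after a fixed time-to-live."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self._data = {}     # key -> (value, expires_at)

    def put(self, key, value):
        """Store a value that expires ttl_seconds from now."""
        self._data[key] = (value, self.clock() + self.ttl)

    def get(self, key, default=None):
        """Return the value if still fresh; drop it lazily once expired."""
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._data[key]
            return default
        return value


cache = TTLCache(ttl_seconds=300)
cache.put(("Engineer@LinkedIn", "view", 1), {"web": 2})
fresh = cache.get(("Engineer@LinkedIn", "view", 1))
```

A five-minute TTL is used here only because it matches the query window discussed earlier; the right value depends on the application.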

Conclusion
• Samza is a clean, powerful framework for stream processing
– StatServer-Samza is more modular than before
• Samza/Kafka configuration can greatly impact performance

Thanks to

StatServer Developers

David Stein

Lance Wall

Tomy Tsai

Clement Fung

Joel Young

Samza Team

Kafka Team

Voldemort Team
