real time big data applications with hadoop ecosystem
![Page 1: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/1.jpg)
04/10/20231 Confidential | Copyright 2013 TrendMicro Inc.
Chris Huang
Sr. Manager, Core Tech
2014/9/24
Real-time Big Data Applications with Hadoop Ecosystem
![Page 2: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/2.jpg)
About – Chris Huang
• Chris Huang
  – SPN Solution Developer Manager
  – SPN Hadoop Architect
  – Hadoop.TW Active Member
• Believes Cloud, Service, Software, and Big Data are critical factors for Taiwan’s future economic development
![Page 3: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/3.jpg)
Conference Talks
![Page 4: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/4.jpg)
Conference Talks
![Page 5: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/5.jpg)
Hot Keywords in Hadoop Community
Real-time
• Impala, Stinger
Computing Framework
• YARN, Tez
In Memory
• Spark
Streaming
• Kafka, Storm
![Page 6: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/6.jpg)
Big Data Applications
• Operational
  – Real-time
  – Near Real-time
• Analytical
  – Batch
  – Interactive
  – Near Real-time
  – Streaming
![Page 7: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/7.jpg)
An Online Music Example
• Operational
  – Recent N login times (listen duration)
  – Recent N albums/artists the user browsed
  – Recent N keywords the user searched
  – Recent N songs/albums/artists the user listened to (bought)
  – User’s purchase amount over the recent N months
• Analytical
  – Recommend the right song/album/artist to the right user at the right time
  – Correlate similar songs/albums/artists (CDDB or user behavior)
  – Know seasonal music trends (X’mas, Valentine’s Day, New Year)
  – Know regional music trends
  – Calculate regional leaderboards
  – Connect users with social networks
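Most of the operational items on this slide are "recent N" lookups per user. A minimal Python sketch of that access pattern follows, assuming an in-memory store; the class name `RecentN` and the sample user and song IDs are hypothetical, and a production system would keep this state in a store such as HBase rather than in process memory.

```python
from collections import defaultdict, deque


class RecentN:
    """Keep only the most recent N values per (user, kind) pair."""

    def __init__(self, n):
        self.n = n
        # deque(maxlen=n) drops the oldest item automatically once it is full.
        self._events = defaultdict(lambda: deque(maxlen=n))

    def record(self, user_id, kind, value):
        self._events[(user_id, kind)].append(value)

    def recent(self, user_id, kind):
        # Newest first; .get avoids creating empty entries for unknown users.
        return list(reversed(self._events.get((user_id, kind), ())))


store = RecentN(n=3)
for song in ["song-a", "song-b", "song-c", "song-d"]:
    store.record("user-1", "listen", song)

print(store.recent("user-1", "listen"))  # ['song-d', 'song-c', 'song-b']
```

The bounded deque keeps both writes and reads cheap, which is what an operational, low-latency lookup needs.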
![Page 8: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/8.jpg)
An Online Banking Example
• Operational
  – Recent N login times / frequency
  – Recent N items purchased by credit card
  – Balance amount over the recent N months
  – Recent N transfer in/out amounts
  – Recent N investment events
  – Investment balance over the recent N months
• Analytical
  – Know more about the user’s profile (assets/debts/shopping habits/family)
  – Recommend the right product to the right user (investment, credit card, loan)
  – Know seasonal trends (tax month/year end/back to school/X’mas)
  – Know the regional investment product leaderboard (by different age)
  – Recommend products by similar user profile
![Page 9: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/9.jpg)
Building Your Big Data Applications
• Think about your data
  – Entity or Event?
• Think about your use case
  – Operational or Analytic?
• Think about your data user
  – External or Internal?
![Page 10: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/10.jpg)
Slides from “Apache HBase Application Archetypes”, HBaseCon 2014
You can replace HBase with similar alternatives, but the concepts are the same
Think About Your Data
![Page 11: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/11.jpg)
![Page 12: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/12.jpg)
![Page 13: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/13.jpg)
![Page 14: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/14.jpg)
![Page 15: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/15.jpg)
![Page 16: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/16.jpg)
![Page 17: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/17.jpg)
![Page 18: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/18.jpg)
![Page 19: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/19.jpg)
Think About Your Use Case
![Page 20: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/20.jpg)
Operational Use Case 1
(Diagram labels: HDFS; MR / Spark; Batch; Real-time)
![Page 21: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/21.jpg)
HBase: No Secondary Index (yet)
• Search index building (row key)
• Use Solr to make text data searchable
  – Snapshot & clone table
  – Index column qualifier text
  – Record row-key in Solr document
  – Use HBase client to fetch data
• Usually takes less than a few seconds
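The pattern on this slide (index the text, keep the row key in the index entry, then fetch the row by key) can be sketched in plain Python. This is only an illustration under stated assumptions: the dictionary `table` stands in for an HBase table, the inverted index stands in for Solr, and the helper names `build_index` and `search` are hypothetical, not Solr or HBase APIs.

```python
from collections import defaultdict

# Stand-in for an HBase table: row key -> {column qualifier: text}.
table = {
    "row-001": {"title": "hadoop streaming basics"},
    "row-002": {"title": "real-time search with solr"},
    "row-003": {"title": "hadoop and solr together"},
}


def build_index(rows, qualifier):
    """Map each token of the chosen column to the row keys that contain it."""
    index = defaultdict(set)
    for row_key, columns in rows.items():
        for token in columns.get(qualifier, "").lower().split():
            index[token].add(row_key)
    return index


def search(index, rows, term):
    """Find row keys through the index, then fetch the full rows by key."""
    row_keys = sorted(index.get(term.lower(), set()))
    return [(key, rows[key]) for key in row_keys]


title_index = build_index(table, "title")
hits = search(title_index, table, "solr")
print([key for key, _ in hits])  # ['row-002', 'row-003']
```

The index only stores row keys, so the table remains the single source of truth for the data itself.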
![Page 22: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/22.jpg)
Operational Use Case 2 (SPN)
(Diagram labels: Feed App; Flume; HDFS; MapReduce; Pig; Solr Client; Index Query; Get, Scan; low latency; high throughput; Real-time; Batch)
![Page 23: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/23.jpg)
Operational Use Case 3 (Mixed)
(Diagram labels: Feed App; Flume; HDFS; MapReduce; Pig; MR / Spark; Bulk Import; HBase Client (Put, Incr, Append); HBase Client (Gets, Short scan); HBase Replication; Solr; Solr Client; Index Query; Get, Scan; low latency; high throughput; Real-time; Batch)
![Page 24: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/24.jpg)
HBase or HDFS?
• Depends on what your data is
  – Entity or Event?
• Depends on your workload
  – Low latency?
  – Random read/write?
  – Short/full scan?
  – Sequential read/write?
  – Update?
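The workload questions above can be turned into a rough checklist. The sketch below is an assumption-laden heuristic, not a rule from the talk: it maps low latency, random access, updates, and short scans to HBase, and full scans and sequential I/O to HDFS, which matches commonly given guidance. The function name `suggest_store` and the simple vote are hypothetical.

```python
def suggest_store(low_latency=False, random_access=False, updates=False,
                  short_scans=False, full_scans=False, sequential_io=False):
    """Return 'HBase', 'HDFS', or 'either' from a simple vote over workload traits."""
    # Assumed mapping: traits that usually favor a random-access store.
    hbase_votes = sum([low_latency, random_access, updates, short_scans])
    # Assumed mapping: traits that usually favor plain files on HDFS.
    hdfs_votes = sum([full_scans, sequential_io])
    if hbase_votes > hdfs_votes:
        return "HBase"
    if hdfs_votes > hbase_votes:
        return "HDFS"
    return "either"


print(suggest_store(low_latency=True, random_access=True, updates=True))  # HBase
print(suggest_store(full_scans=True, sequential_io=True))                 # HDFS
```

A tie returns "either" on purpose: mixed workloads often end up using both stores side by side.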
![Page 25: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/25.jpg)
Wait… Batch for Operational?
![Page 26: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/26.jpg)
Yes, Why not?
![Page 27: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/27.jpg)
![Page 28: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/28.jpg)
![Page 29: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/29.jpg)
![Page 30: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/30.jpg)
Operational: Batch + Real-time
• Bridge the gap between batch and now
• 80/20 rule
  – HDFS/MapReduce/Spark solves 80% easily
  – The remaining 20% takes 80% of the effort
• Go as close as possible, but don’t overdo it!
![Page 31: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/31.jpg)
What is Real-time?
• Real-time is NOT always “faster than batch”
  – If you have really BIG DATA
• Most of the time, we want Timely Information
• Minimize the gap between scheduled batch jobs
(Timeline: Hourly Job, Hourly Job, Hourly Job)
How to get a result at 1:33?
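One common way to answer the 1:33 question is to add the events seen since the last hourly job to that job's result. The sketch below assumes a cumulative count per hour boundary and a small list of recent event timestamps; the function name `count_at` and all sample values are hypothetical, and this is only one possible approach, not necessarily the one used in the talk.

```python
from datetime import datetime


def count_at(query_time, batch_counts, recent_events):
    """Combine the last finished hourly batch with events seen since then."""
    # The most recent hour boundary at or before the query time.
    last_batch = query_time.replace(minute=0, second=0, microsecond=0)
    base = batch_counts.get(last_batch, 0)
    # Only events after the batch cut-off and up to the query time are added.
    delta = sum(1 for ts in recent_events if last_batch <= ts <= query_time)
    return base + delta


batch_counts = {datetime(2014, 9, 24, 1, 0): 1000}  # cumulative count as of 1:00
recent_events = [
    datetime(2014, 9, 24, 0, 59),  # already included in the 1:00 batch
    datetime(2014, 9, 24, 1, 5),
    datetime(2014, 9, 24, 1, 30),
    datetime(2014, 9, 24, 1, 40),  # after the query time
]

print(count_at(datetime(2014, 9, 24, 1, 33), batch_counts, recent_events))  # 1002
```

The batch job does the heavy lifting; the real-time part only has to cover the gap of at most one hour.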
![Page 32: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/32.jpg)
Analytical Use Case
Batch/streaming compute
Near real-time/interactive delivery
![Page 33: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/33.jpg)
Near Real-time Interactive
![Page 34: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/34.jpg)
Recommendation System
![Page 35: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/35.jpg)
![Page 36: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/36.jpg)
The Online Music Example
• Operational
  – Recent N login times (listen duration)
  – Recent N albums/artists the user browsed
  – Recent N keywords the user searched
  – Recent N songs/albums/artists the user listened to (bought)
  – User’s purchase amount over the recent N months
• Analytical
  – Recommend the right song/album/artist to the right user at the right time
  – Correlate similar songs/albums/artists (CDDB or user behavior)
  – Know seasonal music trends (X’mas, Valentine’s Day, New Year)
  – Know regional music trends
  – Calculate regional leaderboards
  – Connect users with social networks
Do you really want an analytical result (recommendation) EVERY 50 milliseconds?
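The question on this slide suggests splitting the work: compute recommendations in a scheduled batch job, and serve them with a fast lookup. A minimal sketch under that assumption follows; the toy rule (recommend songs the user has not listened to), the function names `batch_recommend` and `serve`, and the sample users are all hypothetical.

```python
import time


def batch_recommend(listen_history):
    """Pretend batch job: recommend songs the user has not listened to yet."""
    catalog = {song for songs in listen_history.values() for song in songs}
    computed_at = time.time()
    return {
        user: {"songs": sorted(catalog - set(songs)), "computed_at": computed_at}
        for user, songs in listen_history.items()
    }


def serve(recommendations, user):
    """Serving path: a plain key lookup, independent of how long the batch took."""
    entry = recommendations.get(user)
    return entry["songs"] if entry else []


history = {"alice": ["song-a", "song-b"], "bob": ["song-b", "song-c"]}
recs = batch_recommend(history)  # would run on a schedule, e.g. hourly
print(serve(recs, "alice"))      # ['song-c']
print(serve(recs, "carol"))      # []
```

The response is instant either way; only the freshness of the stored result depends on how often the batch job runs.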
![Page 37: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/37.jpg)
Analytical Use Case 1
(Diagram labels: HDFS; Batch; Solr Client; Index Query; Real-time)
![Page 38: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/38.jpg)
Analytical Use Case 2 (SPN)
“A Graph Service for Global Web Entities Traversal and Reputation Evaluation Based on HBase”, HBaseCon 2014
![Page 39: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/39.jpg)
![Page 40: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/40.jpg)
![Page 41: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/41.jpg)
You Need an Interactive Analytic Engine
![Page 42: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/42.jpg)
Stinger
![Page 43: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/43.jpg)
![Page 44: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/44.jpg)
Impala Architecture
(Diagram labels: four worker nodes, each showing Datanode, Tasktracker, Regionserver, and impala daemon; NN, JT, HM Active; NN, JT, HM Standby; State store; Catalog; Hive Metastore)
![Page 45: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/45.jpg)
![Page 46: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/46.jpg)
![Page 47: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/47.jpg)
![Page 48: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/48.jpg)
![Page 49: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/49.jpg)
Apache Pig (MapReduce)
• Do an hourly count on the Akamai log
    A = load 'date://2014/07/20/00' using AkamaiRCLoader();
    B = foreach (group A all) generate COUNT_STAR(A);
    dump B;
• Output
    …
    0% complete
    100% complete
    (194202349)
• Elapsed time: 4 mins, 28 sec
Too Slow for Interactive
![Page 50: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/50.jpg)
Using Impala
• No memory cache
    > select count(*) from akafast where day=20140720 and hour=0
    194202349
• With OS cache
• Do a further query:
    select count(*) from akafast where day=20140720 and hour=00 and c='US';
    41118019
Timings shown on the slide, in order: 96.46s, 9.07s, 6.57s
Makes Sense Now
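To put the timings in perspective, the short calculation below compares the Pig run (4 mins, 28 sec) with the Impala timings. It assumes that 96.46s and 9.07s are the same count query without and with the OS cache, which the slide layout suggests but does not state outright; the helper name `speedup` is hypothetical.

```python
def speedup(baseline_seconds, seconds):
    """How many times faster a run is compared with the baseline."""
    return baseline_seconds / seconds


pig_seconds = 4 * 60 + 28  # 4 mins, 28 sec from the Pig (MapReduce) run

# Assumed pairing of the slide's timings with the cache states.
impala_runs = {"Impala, no memory cache": 96.46, "Impala, with OS cache": 9.07}

for label, seconds in impala_runs.items():
    print(f"{label}: {speedup(pig_seconds, seconds):.1f}x faster than Pig")
# Impala, no memory cache: 2.8x faster than Pig
# Impala, with OS cache: 29.5x faster than Pig
```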
![Page 51: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/51.jpg)
Don’t Connect Analytic Engine with Operational Use Case
![Page 52: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/52.jpg)
Analytical Use Case 3
(Diagram labels: Feed App; Flume; HDFS; MR / Spark (Batch); Bulk Import; HBase Client (Put, Incr, Append); HBase Client (Gets, Short scan); Impala/Stinger; low latency; high throughput; Real-time; Interactive; Customer; Analyst)
![Page 53: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/53.jpg)
Streaming Use Cases
![Page 54: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/54.jpg)
![Page 55: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/55.jpg)
![Page 56: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/56.jpg)
![Page 57: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/57.jpg)
![Page 58: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/58.jpg)
TME – Trend Message Exchange
http://trendmicro.github.io/tme/
![Page 59: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/59.jpg)
Streaming Operational Use Case
(Diagram labels: Kafka/Storm; HDFS; HBase Client (Put, Incr, Append); HBase Client (Gets, Short scan); Solr Client; Index Query; low latency; high throughput; Real-time; Streaming)
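In the streaming operational case, each incoming event is typically applied to the store as a small write, such as an increment. The sketch below imitates that flow in plain Python under stated assumptions: the iterator `events` stands in for a Kafka/Storm stream, the `Counter` stands in for a counter table updated with increments, and the function name `process_stream` is hypothetical.

```python
from collections import Counter


def process_stream(events, counters):
    """Apply one increment per event, in arrival order, and yield running totals."""
    for user, action in events:
        key = f"{user}:{action}"
        counters[key] += 1  # plays the role of an increment on a counter cell
        yield key, counters[key]


counters = Counter()  # stand-in for a counter table
events = iter([("u1", "login"), ("u2", "login"), ("u1", "login")])  # stand-in stream

for key, total in process_stream(events, counters):
    print(key, total)
# u1:login 1
# u2:login 1
# u1:login 2
```

Because totals are updated as events arrive, a reader doing a get or short scan sees current values without waiting for a batch job.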
![Page 60: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/60.jpg)
Streaming Analytical Use Case
(Diagram labels: Feed App; Flume; Kafka/Storm; HDFS; HBase Client (Put, Incr, Append); HBase Client (Gets, Short scan); Impala/Stinger; low latency; high throughput; Real-time; Streaming; Interactive; Customer; Analyst)
![Page 61: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/61.jpg)
Think About Your Data User
![Page 62: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/62.jpg)
Data User
• External: Customer, Partner
  – They want an instant response
  – They don’t know (and don’t care) whether the recommendation was computed 1 hour ago or 50 ms ago
• Internal: Business report user, Data researcher, Data analyst, Algorithm developer
  – Interactive or near real-time is enough
  – Sometimes they can even wait for batch (make the data small, then analyze)
  – Of course, everyone wants results faster, but it depends on your investment ($$)
![Page 63: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/63.jpg)
No Silver Bullet for Real-time or Big Data Applications
![Page 64: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/64.jpg)
Q&A
![Page 65: Real time big data applications with hadoop ecosystem](https://reader036.vdocuments.us/reader036/viewer/2022081413/548e7345b4795968148b4784/html5/thumbnails/65.jpg)