In-situ MapReduce for Log Processing
DESCRIPTION
Paper presentation in DB reading
TRANSCRIPT
![Page 1: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/1.jpg)
In-situ MapReduce for Log Processing
Speaker: LIN Qian
http://www.comp.nus.edu.sg/~linqian
![Page 2: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/2.jpg)
Log analytics
• Data centers with 1000s of servers
• Data-intensive computing: store and analyze TBs of logs
Examples:
• Click logs
– ad-targeting, personalization
• Social media feeds
– brand monitoring
• Purchase logs
– fraud detection
• System logs
– anomaly detection, debugging
![Page 3: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/3.jpg)
Log analytics today
• “Store-first-query later”
Problems:
• Scale
– Stress network and disks
• Failures
– Delay analysis or process incomplete data
• Timeliness
– Hinder real-time apps
[Figure labels: Servers, Dedicated cluster, MapReduce; “Store first ...”, “... query later”]
![Page 4: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/4.jpg)
In-situ MapReduce (iMR)
Idea:
• Move analysis to the servers
• MapReduce for continuous data
• Ability to trade fidelity for latency
Optimized for:
• Highly selective workloads
– e.g., up to 80% data filtered or summarized!
• Online analytics
– e.g., ad re-targeting based on most recent clicks
[Figure labels: Servers, MapReduce, Dedicated cluster]
![Page 5: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/5.jpg)
An iMR query
The same:
• MapReduce API
– map(r) → {k,v} : extract/filter data
– reduce({k,v[]}) → v’ : data aggregation
– combine({k,v[]}) → v’ : early, partial aggregation
The new:
• Provides continuous results
– because logs are continuous
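A minimal sketch of the three API functions listed above, written as plain Python rather than against the actual iMR interface. The log format (`<timestamp> <status> <url>`), the error-counting query, and the helper names `run_node` and `run_root` are assumptions made for illustration only.

```python
from collections import defaultdict

def map_fn(record):
    """Extract/filter: emit (status, 1) for server-error lines only.

    Assumed (hypothetical) log format: "<timestamp> <status> <url>".
    """
    parts = record.split()
    if len(parts) == 3 and parts[1].startswith("5"):
        yield parts[1], 1

def combine_fn(key, values):
    """Early, partial aggregation on the node that produced the data."""
    return sum(values)

def reduce_fn(key, values):
    """Final aggregation over the partial results from all nodes."""
    return sum(values)

def run_node(records):
    """Map and combine one node's log records locally."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: combine_fn(key, vals) for key, vals in groups.items()}

def run_root(partials):
    """Reduce the per-node partial results into one result."""
    groups = defaultdict(list)
    for partial in partials:
        for key, value in partial.items():
            groups[key].append(value)
    return {key: reduce_fn(key, vals) for key, vals in groups.items()}

node_a = ["t1 500 /a", "t2 200 /b", "t3 503 /c"]
node_b = ["t4 500 /a", "t5 404 /d"]
result = run_root([run_node(node_a), run_node(node_b)])
```

Because `combine_fn` runs where the logs are produced, only the small per-key counts leave each node, which is the point of the "early, partial aggregation" step.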
![Page 6: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/6.jpg)
Continuous MapReduce
• Input
– An infinite stream of logs
• Bound input with sliding windows
– Range of data (R)
– Update frequency (S)
• Output
– Stream of results, one for each window
[Figure: log entries on a time axis (0’’, 30’’, 60’’, 90’’, ...) passing through Map, Combine, Reduce]
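How a range R and an update frequency S bound an unbounded stream can be sketched as below. The function name, the `(timestamp, payload)` entry format, and the choice of half-open windows `[start, start + R)` are assumptions for illustration, not details taken from the slides.

```python
def sliding_windows(entries, range_s, slide_s, end_time):
    """Yield (start, end, payloads) for each window of length range_s,
    advancing by slide_s seconds, up to end_time.

    entries: list of (timestamp_seconds, payload) pairs.
    """
    start = 0
    while start + range_s <= end_time:
        end = start + range_s
        yield start, end, [p for t, p in entries if start <= t < end]
        start += slide_s

entries = [(5, "a"), (35, "b"), (65, "c"), (95, "d")]
# R = 60 s, S = 30 s: consecutive windows overlap by 30 s.
windows = list(sliding_windows(entries, range_s=60, slide_s=30, end_time=120))
```

With R = 60 and S = 30, entry "b" falls into both the first and the second window, so a naive implementation processes it twice.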
![Page 7: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/7.jpg)
Processing windows in-network
[Figure: time axis (0’’, 30’’, 60’’, 90’’, ...) with Map, Combine, Reduce stages; annotations: “Overlapping data”, “User’s reduce function”, “Aggregation tree for efficiency”]
![Page 8: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/8.jpg)
Efficient processing with panes
• Divide window into panes (sub-windows)
– Each pane is processed and sent only once
– Root combines panes to produce window
• Eliminate redundant work
– Save CPU & network resources, faster analysis
[Figure: time axis (0’’, 30’’, 60’’, 90’’, ...) divided into panes P1 to P5, passing through Map, Combine, Reduce]
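The pane idea above can be sketched as follows: each pane is summarized once, and the root merges consecutive pane summaries into window results. This is an illustrative sketch, not the iMR implementation; it assumes integer timestamps, a counting query, and a pane length equal to the slide S (names are hypothetical).

```python
from collections import Counter

def pane_summaries(entries, pane_s, end_time):
    """Combine each pane exactly once.

    Pane i covers [i * pane_s, (i + 1) * pane_s); entries are
    (integer_timestamp_seconds, key) pairs.
    """
    panes = [Counter() for _ in range(end_time // pane_s)]
    for t, key in entries:
        if 0 <= t < len(panes) * pane_s:
            panes[t // pane_s][key] += 1
    return panes

def windows_from_panes(panes, panes_per_window):
    """Root-side step: merge consecutive pane summaries into windows."""
    results = []
    for first in range(len(panes) - panes_per_window + 1):
        merged = Counter()
        for pane in panes[first:first + panes_per_window]:
            merged.update(pane)  # Counter.update adds counts
        results.append(dict(merged))
    return results

entries = [(5, "x"), (35, "x"), (40, "y"), (65, "x"), (95, "y")]
panes = pane_summaries(entries, pane_s=30, end_time=120)
# A 60 s window with a 30 s slide is two consecutive panes.
windows = windows_from_panes(panes, panes_per_window=2)
```

Each raw entry is touched once, when its pane is summarized; overlapping windows reuse the pane summaries instead of reprocessing the log entries.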
![Page 9: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/9.jpg)
Impact of data loss on analysis
• Servers may get overloaded or fail
Challenges:
• Characterize incomplete results
• Allow users to trade fidelity for latency
[Figure labels: Map, Combine, Reduce; panes P1 to P5; markers “X” and “?”]
![Page 10: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/10.jpg)
Quantifying data fidelity
• Data are naturally distributed
– Space (server nodes)
– Time (processing window)
• C2 metric
– Annotates result windows with a “scoreboard”
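One way to picture the scoreboard is a node-by-pane grid that records which data reached the result window. The slides do not give its representation, so the dictionary layout, the function names, and the "fraction of cells present" summary below are all assumptions for illustration.

```python
def scoreboard(nodes, num_panes, received):
    """Build a node-by-pane grid for one result window.

    received: set of (node, pane_index) pairs that reached the root.
    Returns {node: [bool per pane]}; True means that pane is included.
    """
    return {
        node: [(node, pane) in received for pane in range(num_panes)]
        for node in nodes
    }

def fraction_included(board):
    """Share of (node, pane) cells present in the window (0.0 if empty)."""
    cells = [cell for row in board.values() for cell in row]
    return sum(cells) / len(cells) if cells else 0.0

received = {("n1", 0), ("n1", 1), ("n2", 0)}
board = scoreboard(["n1", "n2"], num_panes=2, received=received)
fidelity = fraction_included(board)
```

The rows cover the spatial dimension (server nodes) and the columns the temporal one (panes within the window), so the grid shows where the missing data lies, not only how much is missing.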
![Page 11: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/11.jpg)
Trading fidelity for latency
• Use C2 to trade fidelity for latency
– Maximum latency requirement
– Minimum fidelity requirement
• Different ways to meet minimum fidelity
– 4 useful classes of C2 specifications
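A hedged sketch of how the two requirements could drive the decision to release a window: emit once the minimum fidelity is reached, or once the maximum latency has elapsed, whichever comes first. The function name, parameters, and this exact rule are assumptions; the slides only name the two requirements.

```python
def should_emit(fraction_included, elapsed_s,
                min_fidelity=None, max_latency_s=None):
    """Decide whether the root should release a result window now.

    fraction_included: share of the window's data received so far (0..1).
    elapsed_s: seconds since the window closed.
    """
    if min_fidelity is not None and fraction_included >= min_fidelity:
        return True   # enough data has arrived
    if max_latency_s is not None and elapsed_s >= max_latency_s:
        return True   # stop waiting; result is annotated as incomplete
    return False

early = should_emit(0.8, 10, min_fidelity=0.75)
waiting = should_emit(0.5, 10, min_fidelity=0.75, max_latency_s=60)
timed_out = should_emit(0.5, 60, min_fidelity=0.75, max_latency_s=60)
```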
![Page 12: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/12.jpg)
Minimizing result latency
• Minimum fidelity with earlier results
• Give freedom to decrease latency
– Return the earliest data available
• Appropriate for uniformly distributed events
![Page 13: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/13.jpg)
Sampling non-uniform events
• Minimum fidelity with random sampling
• Less freedom to decrease latency
– Included data may not be the first available
• Appropriate even for non-uniform data
![Page 14: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/14.jpg)
Correlating events across time and space
• Temporal completeness
– Include all data from a node or no data at all
• Spatial completeness
– Each pane contains data from all nodes
Leverage knowledge about data distribution
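The two completeness conditions can be checked against a node-by-pane grid of booleans. This sketch assumes such a grid (`{node: [bool per pane]}`, a hypothetical layout) and reads spatial completeness as "every pane that appears in the window has data from all nodes"; both are interpretations of the slide, not definitions from it.

```python
def temporally_complete(board):
    """Each node contributes all of its panes or none of them."""
    return all(all(row) or not any(row) for row in board.values())

def spatially_complete(board):
    """Each pane present in the window has data from every node.

    Assumption: a pane missing from all nodes is treated as absent
    from the window rather than as a violation.
    """
    columns = zip(*board.values())  # one tuple of booleans per pane
    return all(all(col) or not any(col) for col in columns)

# n2 contributed nothing, n1 contributed everything.
by_node = {"n1": [True, True], "n2": [False, False]}
# Both nodes contributed pane 0; pane 1 is missing everywhere.
by_pane = {"n1": [True, False], "n2": [True, False]}
```

`by_node` supports queries that correlate events over time on one server, while `by_pane` supports queries that correlate events across servers at one point in time.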
![Page 15: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/15.jpg)
Prototype
• Built upon Mortar
– Sliding windows
– In-network aggregation trees
• Extended to support:
– MapReduce API
– Pane-based processing
– Fault tolerance mechanisms
![Page 16: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/16.jpg)
Processing data in-situ
• Useful when ...
• Goal: use available resources intelligently
• Load shedding mechanism
– Nodes monitor local processing rate
– Shed panes that cannot be processed on time
• Increase result fidelity under time and resource constraints
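A rough sketch of the load-shedding step: given a measured local processing rate, a node keeps the panes it can finish before their deadlines and sheds the rest. The deadline-ordered greedy policy, the tuple layout, and the function name are assumptions for illustration; the slide states only that nodes monitor their rate and shed panes that cannot be processed on time.

```python
def shed_panes(pending, rate_per_s, now_s=0.0):
    """Split pending panes into those to process and those to shed.

    pending: list of (pane_id, num_records, deadline_s).
    rate_per_s: locally observed processing rate in records per second.
    """
    kept, shed = [], []
    clock = now_s
    # Consider panes in deadline order (an assumed policy).
    for pane_id, num_records, deadline_s in sorted(pending, key=lambda p: p[2]):
        finish = clock + num_records / rate_per_s
        if finish <= deadline_s:
            kept.append(pane_id)
            clock = finish        # time is spent only on kept panes
        else:
            shed.append(pane_id)  # would miss its deadline; skip it
    return kept, shed

pending = [("P1", 100, 10), ("P2", 400, 20), ("P3", 100, 30)]
kept, shed = shed_panes(pending, rate_per_s=20)
```

Shedding P2 early frees the time needed to finish P3 on schedule, which is how dropping some work can raise the fidelity of the results that are delivered on time.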
![Page 17: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/17.jpg)
Evaluation
• System scalability
• Usefulness of C2 metric
– Understanding incomplete results
– Trading fidelity for latency
• Processing data in-situ
– Improving fidelity under load with load shedding
– Minimizing impact on services
![Page 18: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/18.jpg)
Scaling
• Synthetic input data, with a word-count reducer
• 3 reducers provide sufficient processing to handle the 30 map tasks
![Page 19: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/19.jpg)
Exploring fidelity-latency trade-offs
Data loss affects accuracy of distribution
• Temporal completeness
• Spatial completeness and random sampling
C2 allows trading fidelity for lower latency
[Figure annotations: “100% accuracy”, “>25% decrease”]
![Page 20: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/20.jpg)
In-situ performance
• iMR side-by-side with a real service (Hadoop)
• Vary CPU allocated to iMR
– Result fidelity
– Hadoop performance (job throughput)
[Figure annotations: “560%”, “<11% overhead”]
![Page 21: In-situ MapReduce for Log Processing](https://reader033.vdocuments.us/reader033/viewer/2022061219/54b92a764a79598b478b46d1/html5/thumbnails/21.jpg)
Conclusion
• In-situ architecture processes logs at the sources, avoids bulk data transfers, and reduces analysis time
• Model allows incomplete data under failures or server load, provides timely analysis
• C2 metric helps understand incomplete data and trade fidelity for latency
• Proactive load shedding improves data fidelity under resource and time constraints