Hadoop @ Facebook - QCon Tokyo 2016
TRANSCRIPT
Hadoop @ Facebook
Usage, Issues, Solutions & Beyond
Joydeep Sen Sarma
About Me
• Facebook: Ran/Managed Hadoop for ~3 years
• Co-Author of Hive
• Mentor/PM for the Hadoop Fair-Scheduler
• Used Hadoop/Hive (as Warehouse/ETL Dev)
• Re-wrote significant chunks of Hadoop Job Scheduling (incl. Corona)
• Qubole: Running the world's largest Hadoop and Presto clusters on AWS
Hadoop @ FB (2007-2014)
2007-2011
Use Cases
• Reporting (SQL)
- Complex ETL (Joins)
- Simple Summaries (Counters)
• Index Generation (Map-Reduce/Java)
• Data Mining (C++)
• Adhoc Queries (SQL)
• Data Driven Applications (PYMK)
Friend Map
Success Factors: Hadoop
• Scales Linearly, Centrally Managed
• Operationally Simple (15 people – 25 PB)
• Can do almost anything
Success Factors: Hive
• SQL is easy (add scripts for map-reduce)
• Browser Based Interface
Success Factors: Scribe
• Log data using Scribe from any application
• Simple to add attributes to user page views
• JSON encoded logging allows schema evolution
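The schema-evolution point above can be sketched in a few lines. This is an illustration of the general JSON property, not Scribe's actual pipeline; the event fields and the `summarize` consumer are made up for the example:

```python
import json

# A page-view log line gains a new attribute ("ab_test_bucket") without
# breaking existing consumers, because JSON readers simply ignore keys
# they do not know about.
old_event = {"user_id": 42, "page": "/home"}
new_event = {"user_id": 42, "page": "/home", "ab_test_bucket": "b"}

def summarize(event_line):
    """A consumer written before the new attribute existed."""
    event = json.loads(event_line)
    return (event["user_id"], event["page"])  # unknown keys are ignored

# Old and new log lines yield the same summary for the old consumer.
assert summarize(json.dumps(old_event)) == summarize(json.dumps(new_event))
```

This is why adding attributes to page-view logging was cheap: producers could ship new fields without coordinating a schema change with every downstream reader.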
Issues
#1: Hadoop Scheduling
• “Have you no Hadoop Etiquette?” (c. 2007) (reducer count capped in response)
• User takes down entire cluster (OOM) (c. 2007-09)
• Bad job slows down entire cluster (c. 2009)
• Steady-state latencies become intolerable (c. 2010-)
• “How do I know I am getting my fair share?” (c. 2011)
• “Too few reducer slots, cluster idle” (c. 2013)
#2: Hive Latency
• Queries take at least 30s
- Users expect MySQL-like latency
• Hive is too slow for BI tools like Tableau
• Difficult to write/test complex SQL applications
• Not possible to do drill downs into detail data
- No indices and latency too high
#3: Real Time Metrics
• Nightly or hourly aggregations not enough
- Need Real-Time
- Driven by Advertiser requirements
• MySQL is not a good store for long-tail summaries
#4: Cluster Too Big
• Single HDFS runs into space/power/network limits
• Hard to do Queries across Data Centers
#5: Not a fit for all applications
• In-memory grid better for Graph applications
Solutions
Hadoop Scheduling - Isolation
• Hadoop Fair-Scheduler
- Fix Preemption and Speculation
• Separate clusters for Production Pipelines
• Platinum, Gold, Silver
• Prevent memory over-consumption
- Slaves, JobTracker, NameNode
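A minimal sketch of the tiered fair-share idea, assuming weight-proportional allocation (the pool names and weights here are illustrative, not Facebook's actual configuration):

```python
# Split cluster slots among pools in proportion to their weights, in the
# spirit of Platinum/Gold/Silver tiers in the Hadoop Fair-Scheduler.
def fair_shares(total_slots, pool_weights):
    total_weight = sum(pool_weights.values())
    return {pool: total_slots * w / total_weight
            for pool, w in pool_weights.items()}

shares = fair_shares(100, {"platinum": 3, "gold": 2, "silver": 1})
# platinum gets half the slots, gold a third, silver a sixth
assert shares["platinum"] == 50.0
```

Preemption then means reclaiming slots from pools running above their computed share when a pool below its share has waiting tasks.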
Hadoop Scheduling - Corona
• Overlaps with Apache YARN
• Separate Cluster Resource Allocation
- from Map-Reduce Task Scheduling
• Push Tasks to Slaves
- Reduced Latency and Idle Time
• Highly Concurrent Design
- Optimistic Locking
- Fast and Scales to thousands of nodes
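The optimistic-locking design can be sketched as a version check at commit time. This is a hedged illustration of the pattern, not Corona's real API; class and method names are invented:

```python
import threading

# Schedulers read cluster state without holding a lock, then commit an
# allocation only if the state version is unchanged since their snapshot.
class ClusterState:
    def __init__(self, free_slots):
        self.free_slots = free_slots
        self.version = 0
        self._commit_lock = threading.Lock()  # held only briefly at commit

    def snapshot(self):
        return self.version, self.free_slots

    def try_allocate(self, seen_version, slots):
        """Commit iff nobody changed the state since our snapshot."""
        with self._commit_lock:
            if self.version != seen_version or self.free_slots < slots:
                return False  # stale or insufficient: retry with fresh snapshot
            self.free_slots -= slots
            self.version += 1
            return True

state = ClusterState(free_slots=10)
version, _ = state.snapshot()
assert state.try_allocate(version, 4)      # first commit wins
assert not state.try_allocate(version, 4)  # stale version must retry
```

Because the lock is held only for the cheap commit step, many scheduling decisions can proceed concurrently, which is how this style of design scales to thousands of nodes.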
Fast Queries – Presto!

Hive                                  | Presto
--------------------------------------|-----------------------------------
Uses Hadoop MR for execution          | Pipelined execution model
Spills intermediate data to FS        | Keeps intermediate data in memory
Can tolerate failures                 | Does not tolerate failures
Automatic join ordering               | User-specified join ordering
Can handle joins of two large tables  | One table needs to fit in memory
Supports grouping sets                | Does not support grouping sets
Can plug in custom code               | Cannot plug in custom code
More data types                       | Limited data types
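The first row of the Hive/Presto comparison above is the core of the latency difference, and can be illustrated with a toy model (an assumption-laden simplification, not either engine's real code):

```python
# A materializing stage writes out all intermediate rows before the next
# stage starts (Hive-on-MR style), while a pipelined stage streams each
# row through the operators as it is produced (Presto style).
def rows():
    for i in range(1, 6):
        yield i

def materialized(source):
    intermediate = list(source)          # "spill" everything first
    return [r * 2 for r in intermediate]

def pipelined(source):
    for r in source:                     # each row flows straight through
        yield r * 2

assert materialized(rows()) == list(pipelined(rows()))  # same answer
```

Both produce identical results; the pipelined version just never pays for materializing the intermediate set, which is where the interactive-latency win comes from.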
Presto vs Hive Performance
• Presto is 2.5-7x faster
• Some queries run out of memory
Other FB Projects
• Calligraphus (Scribe++)
• Puma
- Real-Time Processing
- HBase for storage
• Apache Giraph
- In-Memory graph computations
Scaling Issues – Cloud!
• Easily start large Hadoop clusters
- Isolate large ad-hoc workloads from Production
- Qubole auto-scales Hadoop Up/Down
• AWS S3 is much easier to use than HDFS
- also: GCE and Azure Storage
- Use HDFS as Cache (Qubole Columnar Cache)
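The HDFS-as-cache pattern above amounts to a read-through cache. A minimal sketch, with plain dicts standing in for the real storage systems (nothing here is Qubole's actual code):

```python
# Serve a block from the local (HDFS-like) cache when present; otherwise
# fetch it from remote object storage (S3-like) and cache it for later.
remote_store = {"ads/part-0000": b"remote bytes"}   # S3 stand-in
local_cache = {}                                    # HDFS cache stand-in

def read_block(path):
    if path in local_cache:
        return local_cache[path]        # cache hit: no network round-trip
    data = remote_store[path]           # cache miss: fetch from S3
    local_cache[path] = data            # populate for subsequent readers
    return data

read_block("ads/part-0000")             # first read misses and fills the cache
assert "ads/part-0000" in local_cache   # later reads are served locally
```

Repeated queries over the same tables then hit local disks at HDFS speed while S3 remains the durable source of truth.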
insert overwrite table dest
select … from ads join campaigns on …
group by …;
Auto-Scaling
[Diagram: a JobTracker master tracks Map/Reduce task demand vs. slot supply and reported progress, scaling StarCluster slave nodes up/down on AWS/GCE.]
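The demand-vs-supply loop behind auto-scaling can be sketched as a simple decision function. This is a hedged illustration of the idea, not Qubole's actual algorithm; the constant and thresholds are invented:

```python
# Compare pending tasks (demand) against available slots (supply) and
# decide how many nodes to add or remove.
SLOTS_PER_NODE = 4  # assumed cluster configuration

def scaling_decision(pending_tasks, free_slots, idle_nodes):
    if pending_tasks > free_slots:
        deficit = pending_tasks - free_slots
        return ("add", -(-deficit // SLOTS_PER_NODE))  # ceil division
    if pending_tasks == 0 and idle_nodes > 0:
        return ("remove", idle_nodes)                  # shrink when idle
    return ("hold", 0)

assert scaling_decision(pending_tasks=10, free_slots=2, idle_nodes=0) == ("add", 2)
assert scaling_decision(pending_tasks=0, free_slots=8, idle_nodes=2) == ("remove", 2)
```

A real controller would also account for task progress and cloud billing granularity before removing nodes, but the core loop is this comparison.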
Hive Latency - Qubole
• Works with small files!
  - Faster split computation (8x)
  - Prefetching S3 files (30%)
• Direct writes to S3
  - HIVE-1620
• Multi-tenant Hive Server
• JVM reuse!
  - Fixed re-entrancy issues
  - 1.2-2x speedup
• Columnar Cache
  - Uses HDFS as a cache for S3
  - Up to 5x faster for JSON data
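The prefetching item in the list above hides S3 latency behind compute. A sketch of that overlap using a thread pool (an assumed illustration, not Qubole's implementation; `fetch` stands in for an S3 GET):

```python
from concurrent.futures import ThreadPoolExecutor

# Start downloading upcoming input files in the background while earlier
# ones are being processed, so S3 round-trips overlap with compute.
def fetch(path):
    return f"contents of {path}"        # stand-in for an S3 GET

def process_all(paths, workers=4):
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, p) for p in paths]  # prefetch all files
        for future in futures:
            results.append(future.result())  # consume in order as fetches land
    return results

assert process_all(["s3://b/a", "s3://b/b"]) == [
    "contents of s3://b/a", "contents of s3://b/b"]
```

With real S3 reads, by the time the first file is processed the next ones are typically already in flight, which is where a ~30% speedup of the kind quoted above can come from.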
Appendix
In Production @