the meta of hadoop - comad 2012

The Meta of Hadoop

Joydeep Sen Sarma
Ex-Facebook DI Lead, Founder Qubole

• File/Database Systems developer (ex-NetApp/Oracle)
• Yahoo (2005-07), Facebook (2007-11)
• @Facebook:
  – SysAdmin: operated massive Hadoop/Hive installs
  – Architect: conceived/wrote Apache Hive, made HBase@FB happen
  – Herded cats: first manager of Data Infra team
  – IT engineer/DBA: built ETL tools, warehouse/reporting for FB Virtual Currency

• Founder Qubole Inc. (2011-)

Why Hadoop Succeeded

• Complete Solution and Extensible
  – Useful to Engineers, Data Scientists, Analysts
  – Performance isn’t everything
  – Agile: businesses move much faster than before
• Market Dynamics
  – Captive Super-Reference Customer: Yahoo
  – Had the early market to itself for a long time
• Separation of Compute and Storage
  – Parallel Computing != Database

Why Hadoop Succeeded

• Data Consolidation!
  – Just store everything in HDFS
  – MR/Hive/Pig can chew anything
• Lights Out Architecture
  – Low System Operational Cost
  – Low Data Management Cost
    • Don’t need Data Priests

Meta Takeaways

Adaptive Lights-Out Software

• Successful efforts:
  – Automatic map-join/skew join implementations
  – Automatic local mode, resource cache
• Failed:
  – Statistics: alter table analyze table
  – Pre-Bucketing tables

Learning Frameworks for Systems Software
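To make the "automatic map-join" item concrete, here is a minimal Python sketch of the idea: the engine looks at the observed input sizes and picks a join strategy itself, with no hint from the user. This is an illustration only, not Hive's implementation; `choose_join_strategy`, `map_join` and the row-count threshold are hypothetical names and values.

```python
from collections import defaultdict

# Hypothetical threshold: the largest "small side" (in rows) we are willing
# to hold in memory. A real engine would more likely decide on bytes.
MAP_JOIN_MAX_ROWS = 1000


def choose_join_strategy(left_rows, right_rows, max_rows=MAP_JOIN_MAX_ROWS):
    """Pick a join strategy from the observed input sizes alone."""
    if min(len(left_rows), len(right_rows)) <= max_rows:
        return "map_join"      # small side fits in memory: no shuffle needed
    return "shuffle_join"      # both sides are large: repartition by key


def map_join(small, large, key):
    """Hash the small side in memory, then stream the large side past it."""
    index = defaultdict(list)
    for row in small:
        index[row[key]].append(row)
    for row in large:
        for match in index.get(row[key], []):
            yield {**match, **row}


users = [{"uid": 1, "name": "a"}, {"uid": 2, "name": "b"}]
clicks = [{"uid": 1, "url": "/x"}, {"uid": 1, "url": "/y"}, {"uid": 3, "url": "/z"}]
if choose_join_strategy(users, clicks) == "map_join":
    joined = list(map_join(users, clicks, "uid"))
```

The point of the sketch is the first function: the decision is made by the system from data it already has, which is what keeps the software "lights-out".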

Adaptive Lights-Out Software

• Caching + Prefetching is Adaptive
  – Replication is not
  – Can bridge gap between Compute and Storage
• Page Cache over Disk >> In-memory
  – Degrades gracefully

• Provide APIs – not packages
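One way to picture "caching + prefetching is adaptive" is a read-through block cache that keeps whatever was used recently and reads ahead on the guess that access is sequential. The sketch below is an assumption-laden toy, not code from any Hadoop component: `PrefetchingCache`, `fetch_block`, `capacity` and `readahead` are hypothetical names.

```python
from collections import OrderedDict


class PrefetchingCache:
    """Read-through LRU block cache with simple sequential read-ahead."""

    def __init__(self, fetch_block, capacity=4, readahead=1):
        self.fetch_block = fetch_block  # hypothetical reader for slow storage
        self.capacity = capacity        # maximum number of cached blocks
        self.readahead = readahead      # how many following blocks to prefetch
        self.blocks = OrderedDict()     # block_id -> data, oldest first
        self.hits = 0
        self.misses = 0

    def _load(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)   # mark as recently used
            return
        self.blocks[block_id] = self.fetch_block(block_id)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict least recently used

    def read(self, block_id):
        if block_id in self.blocks:
            self.hits += 1
        else:
            self.misses += 1
        self._load(block_id)
        data = self.blocks[block_id]
        # Prefetch the next blocks, guessing that the access is sequential.
        for nxt in range(block_id + 1, block_id + 1 + self.readahead):
            self._load(nxt)
        return data
```

Nothing here is configured per dataset: the cache contents follow the workload, and when the working set outgrows `capacity` reads simply fall through to storage, which is the "degrades gracefully" property.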

Murphy’s Law

• No Trusted Components

• Defend everything
  – Rate-Limit access to every resource
  – Log and Monitor everything
• Clear and Overwhelming Force
  – Oversize it!

• Think QOS from Day-1
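"Rate-limit access to every resource" can be as small as a token bucket placed in front of each resource. The following is a generic sketch of that well-known technique, not something taken from the talk; `TokenBucket`, `rate` and `burst` are illustrative names, and the injectable `clock` exists only to make the behaviour easy to check.

```python
import time


class TokenBucket:
    """Allow up to `burst` requests at once, refilled at `rate` per second."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.burst = burst          # maximum tokens that can accumulate
        self.clock = clock
        self.tokens = float(burst)  # start full
        self.last = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                # caller should reject, queue, or back off


limiter = TokenBucket(rate=100, burst=20)
if not limiter.allow():
    pass  # reject the request here, and log the rejection
```

Keeping one bucket per client or per job, rather than one global bucket, is one simple way to start on the QOS point as well.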

Open Source

• Small is Beautiful
  – Build small easy to use/understand components
  – Redis!
• Iterative Small Changes
  – Operators HATE large releases
  – Hive (2 weeks) vs. Hadoop (2 years?)

Opportunities

Interesting Problems - I

• Collaborative Analysis
  – Most analysis is Repeat
  – Tracking and Searching historical analysis
• Consistency Aware Querying
  – OLAP: Snapshots instead of live tables
  – OLTP: Lookup stale caches instead of master
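One possible reading of the OLTP item is that the caller states how much staleness it can tolerate, and the system decides whether a cached copy is good enough or the master must be consulted. The slide poses this as an open problem, so the sketch below is only a guess at the shape of such an interface; `StaleTolerantReader`, `read_master` and `max_staleness` are hypothetical names.

```python
import time


class StaleTolerantReader:
    """Serve reads from a cache when the caller's staleness bound allows it."""

    def __init__(self, read_master, clock=time.monotonic):
        self.read_master = read_master  # hypothetical authoritative lookup
        self.clock = clock
        self.cache = {}                 # key -> (value, fetched_at)
        self.master_reads = 0

    def get(self, key, max_staleness):
        """Return a value no older than `max_staleness` seconds."""
        now = self.clock()
        entry = self.cache.get(key)
        if entry is not None and now - entry[1] <= max_staleness:
            return entry[0]             # cached copy is fresh enough
        value = self.read_master(key)   # otherwise pay for a master read
        self.master_reads += 1
        self.cache[key] = (value, now)
        return value
```

A query that passes `max_staleness=0` falls back to the master unless the cached copy was fetched at that same clock reading, while a dashboard that can live with minute-old data almost never touches it.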

Interesting Problems - II

• SQL is Rope
  – Better than procedural – but still Rope
  – Higher Level templates: moving averages
• Data = Mutating + Immutable
  – Immutable data is easy to manage
  – Cheap: One copy per data center (Facebook Haystack)
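The slide does not say what a "higher level template" would look like. As one hedged interpretation, a template can be a small function that takes the analyst's intent (a moving average over a column) and expands it into the SQL they would otherwise write by hand. `moving_average_sql` and the table and column names are hypothetical; the window-function syntax is standard SQL, but support varies by engine and version.

```python
import re

# Accept only plain identifiers so the template cannot be used to inject SQL.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")


def moving_average_sql(table, value_col, order_col, window):
    """Expand a 'moving average' request into a window-function query."""
    for name in (table, value_col, order_col):
        if not _IDENT.match(name):
            raise ValueError(f"not a plain identifier: {name!r}")
    if not isinstance(window, int) or window < 1:
        raise ValueError("window must be a positive integer")
    return (
        f"SELECT {order_col}, {value_col}, "
        f"AVG({value_col}) OVER (ORDER BY {order_col} "
        f"ROWS BETWEEN {window - 1} PRECEDING AND CURRENT ROW) AS moving_avg "
        f"FROM {table}"
    )


# Hypothetical table: one row per day with a revenue amount.
query = moving_average_sql("daily_revenue", "amount", "day", window=7)
```

The user asks for "a 7-row moving average of amount by day" and never sees the frame clause, which is the kind of rope the slide is complaining about.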

Think Services, not Software

• Software is getting less interesting
  – Even Distributed Systems Software
• Run/Operate long-running, hot services
  – Innovate inside this boundary
