Spark
TRANSCRIPT
Agenda for the Day
• How do we mine useful information from massive datasets?
• Overview of problems in the Big Data space.
• Apache Hadoop and its limitations.
• Introduce Apache Spark.
• Switching gears: exploring Spark internals.
• Discuss an active research area, "BlinkDB": queries with bounded response times on very large data.
• Conclusion
Back in 2000, Google had a problem…
• Processing massive amounts of raw data (crawled documents, web request logs).
• Specialized hardware (vertical scaling) was prohibitively expensive.
• The computation thus needed to be distributed.
Solution: Google's MapReduce paradigm, by Jeffrey Dean and Sanjay Ghemawat. Essentially a distributed cluster computing framework.
Why is MapReduce so important?
• Uses commodity hardware for computation.
• Overcomes commodity hardware limitations: provides a solution for an environment where failures are very frequent.
• Pushes computation to the data (rather than the other way round).
• Provides an abstraction that lets developers focus on domain logic: a programming / cluster computing model for distributed computing built on the functional primitives map and reduce.
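As a single-machine sketch, the two primitives can be illustrated in plain Python; the sample documents and the explicit shuffle step are illustrative only (in a real cluster, the map and reduce phases run in parallel across machines and the shuffle moves pairs over the network):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input: each "document" is one line of text.
documents = ["big data on commodity hardware", "map and reduce", "big clusters"]

# Map phase: emit (word, 1) pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the intermediate pairs by key (word).
mapped.sort(key=itemgetter(0))
grouped = {key: [v for _, v in pairs]
           for key, pairs in groupby(mapped, key=itemgetter(0))}

# Reduce phase: sum the counts for each word.
counts = {word: sum(values) for word, values in grouped.items()}

print(counts["big"])   # 2
```

The framework handles the shuffle, scheduling, and retries; the programmer only supplies the map and reduce functions.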
Three major Big Data Scenarios
• Interactive Queries: enable faster decisions. Example: query website logs to diagnose why the website is slow. (Apache Pig, Apache Hive, …)
• Sophisticated Batch Data Processing: enables better decisions. Example: trend analysis, analytics.
• Streaming Data Processing: real-time decision making. Example: fraud detection, detecting DDoS attacks.
MapReduce Limitations
• Iterative Jobs: common algorithms apply a function repeatedly to the same dataset. While each iteration can be expressed as a MapReduce job, each job must write its results to disk and then reload them for the next iteration.
• Interactive Analysis: there is a need to run ad-hoc queries on datasets. We want to be able to load a dataset into memory across machines and query it repeatedly.
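The cost difference can be sketched with a toy single-machine simulation (the file, dataset, and read counter are illustrative; real iterative jobs such as logistic regression rescan gigabytes per pass):

```python
import json
import os
import tempfile

# Write a small "dataset" to disk, standing in for HDFS input.
path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w") as f:
    json.dump(list(range(100)), f)

reads = 0
def load():
    """Load the dataset from disk, counting how often we touch storage."""
    global reads
    reads += 1
    with open(path) as f:
        return json.load(f)

# MapReduce-style iteration: reload from disk on every pass.
for _ in range(3):
    data = load()
    result = sum(data)

# Spark-style iteration: load once, keep the working set in memory.
cached = load()
for _ in range(3):
    result = sum(cached)

print(reads)  # 4 disk reads instead of 6
```

With a cached working set, each extra iteration costs only compute, not I/O, which is exactly the pattern iterative and interactive workloads need.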
Introducing Spark
• Open-source data analytics cluster computing framework.
• Provides primitives for in-memory cluster computing that let users load data into the cluster's memory and query it repeatedly (spilling to HDFS only when needed).
• Allows interactive ad-hoc data exploration (supports pipelining and lazy evaluation).
• Unifies batch, streaming, and interactive computation.
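Lazy, pipelined evaluation can be sketched with a minimal hypothetical class (this is not Spark's real API, just plain Python generators): transformations only record what to do, and nothing executes until a result is actually requested.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated, pipelined dataset."""

    def __init__(self, source):
        self._source = source          # callable producing a fresh iterator

    def map(self, fn):
        # Record the transformation; do not execute it yet.
        return LazyDataset(lambda: (fn(x) for x in self._source()))

    def filter(self, pred):
        return LazyDataset(lambda: (x for x in self._source() if pred(x)))

    def collect(self):
        # Only now does the pipelined chain of generators actually run.
        return list(self._source())

ds = LazyDataset(lambda: iter(range(10)))
pipeline = ds.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# No computation has happened yet; collect() triggers the whole pipeline,
# streaming each element through map and filter without materializing
# intermediate datasets.
print(pipeline.collect())  # [0, 4, 16, 36, 64]
```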
The Scala Programming Language
- Object-oriented / functional
- Runs on the JVM
- Concise syntax
- Aims to support interactive scripting and development
Exploring Spark Internals …
• Spark: Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010, June 2010.
• Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012, April 2012. Best Paper Award and Honorable Mention for Community Award.
Core of Spark: RDDs
• RDDs (Resilient Distributed Datasets) are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
• An RDD is a read-only collection of objects partitioned across a set of machines.
• RDDs provide an interface based on coarse-grained transformations that apply the same operation to many data items. This enables fault tolerance by logging the transformations and building a dataset lineage that can be used to rebuild any partition that is lost.
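Lineage-based recovery can be sketched with a toy class (hypothetical names, not Spark's real API): each dataset remembers only its parent and the coarse-grained function that produced it, so a lost partition is recomputed rather than restored from a replica.

```python
class LineageRDD:
    """Toy dataset that records its lineage: parent + transformation."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent, self.fn = parent, fn
        self.partitions = partitions   # None entries mean "not in memory"

    def map(self, fn):
        # Log the coarse-grained transformation instead of copying data.
        return LineageRDD(parent=self, fn=fn)

    def compute_partition(self, i):
        # Serve from memory if present; otherwise rebuild from lineage.
        if self.partitions is not None and self.partitions[i] is not None:
            return self.partitions[i]
        parent_data = self.parent.compute_partition(i)
        return [self.fn(x) for x in parent_data]

base = LineageRDD(partitions=[[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
doubled.partitions = [doubled.compute_partition(0), doubled.compute_partition(1)]

doubled.partitions[1] = None               # simulate losing one partition
print(doubled.compute_partition(1))        # rebuilt from lineage: [6, 8]
```

Because only the transformation log is replicated, not the data itself, fault tolerance comes nearly for free.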
Shared Variables in Spark:
- Broadcast Variables
- Accumulators
Programmers can create two restricted types of shared variables to support two common, simple usage patterns…
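The two patterns can be sketched in a single-process toy (the class names and the `process` task are illustrative, not Spark's API): a broadcast variable is a read-only value shipped once to each worker, and an accumulator is an add-only variable that workers update and only the driver reads.

```python
class Broadcast:
    """Read-only value sent once per worker instead of with every task."""
    def __init__(self, value):
        self._value = value

    @property
    def value(self):
        return self._value

class Accumulator:
    """Add-only shared variable: workers add; the driver reads the total."""
    def __init__(self, initial=0):
        self._total = initial

    def add(self, amount):
        self._total += amount

    @property
    def value(self):
        return self._total

lookup = Broadcast({"a": 1, "b": 2})   # e.g. a small lookup table
bad_records = Accumulator()

def process(record):
    if record not in lookup.value:
        bad_records.add(1)             # workers only ever add
        return 0
    return lookup.value[record]

results = [process(r) for r in ["a", "b", "x", "a"]]
print(results, bad_records.value)  # [1, 2, 0, 1] 1
```

Restricting the variables this way keeps them cheap to implement on a cluster: broadcasts never need write synchronization, and accumulator updates commute, so they can be merged in any order.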
Initial Experiment Results
• Logistic regression on a 29 GB dataset.
• Interactive data mining on Wikipedia: 1 TB from disk took 170 s.
RDDs: Expressivity and Generality
• RDDs can express a diverse set of programming models, because their restrictions have little impact on parallel applications.
• Many of these programs naturally apply the same operation to many records, making them a good fit for RDDs.
• Previous systems addressed specific problems on top of MapReduce. At the core, however, is the need for a common data-sharing abstraction.
• RDDs capture the major optimizations: keeping specific data in memory, custom partitioning to minimize communication, and recovering efficiently from failures.
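Why custom partitioning minimizes communication can be sketched with a toy hash partitioner (the datasets and helper are illustrative): if two datasets are partitioned by the same key function, all records for a given key land on the same partition, so a join can run partition-by-partition with no data movement.

```python
def partition_by_key(pairs, num_partitions):
    """Assign each (key, value) pair to a partition by hashing its key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

users = [("u1", "Alice"), ("u2", "Bob"), ("u3", "Eve")]
clicks = [("u1", "/home"), ("u3", "/docs"), ("u1", "/about")]

# Co-partition both datasets with the same partitioner.
user_parts = partition_by_key(users, 4)
click_parts = partition_by_key(clicks, 4)

# Each partition can now be joined locally, with no cross-partition traffic.
joined = []
for up, cp in zip(user_parts, click_parts):
    names = dict(up)
    joined += [(k, names[k], v) for k, v in cp if k in names]

print(sorted(joined))
# [('u1', 'Alice', '/about'), ('u1', 'Alice', '/home'), ('u3', 'Eve', '/docs')]
```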
Explore More of BlinkDB
• Visit: http://blinkdb.org/
• Blink and It's Done: Interactive Queries on Very Large Data. Sameer Agarwal, Aurojit Panda, Barzan Mozafari, Anand P. Iyer, Samuel Madden, Ion Stoica. PVLDB 5(12): 1902-1905, 2012, Istanbul, Turkey.
• BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica. ACM EuroSys 2013, Prague, Czech Republic (Best Paper Award).