![Page 1: Open Source Ingredients for Interactive Data Analysis in Spark](https://reader035.vdocuments.us/reader035/viewer/2022062905/586fde811a28ab18428b6bf1/html5/thumbnails/1.jpg)
Open Source Ingredients for Interactive Data Analysis in Spark
Hadoop Summit San Jose – June 2016
Maxim Lukiyanov, Program Manager
Big Data, Microsoft
@maxiluk
![Page 2: Open Source Ingredients for Interactive Data Analysis in Spark](https://reader035.vdocuments.us/reader035/viewer/2022062905/586fde811a28ab18428b6bf1/html5/thumbnails/2.jpg)
Agenda
- How it all fits together
- Ingredients
  - Apache Spark, Notebooks, Job submission server, BI Tools, Developer Tools, Azure Cloud
- Resource management
![Page 3: Open Source Ingredients for Interactive Data Analysis in Spark](https://reader035.vdocuments.us/reader035/viewer/2022062905/586fde811a28ab18428b6bf1/html5/thumbnails/3.jpg)
What is your top concern for big data projects?
![Page 4: Open Source Ingredients for Interactive Data Analysis in Spark](https://reader035.vdocuments.us/reader035/viewer/2022062905/586fde811a28ab18428b6bf1/html5/thumbnails/4.jpg)
#1: Length of Development Cycle
![Page 5: Open Source Ingredients for Interactive Data Analysis in Spark](https://reader035.vdocuments.us/reader035/viewer/2022062905/586fde811a28ab18428b6bf1/html5/thumbnails/5.jpg)
Length of Development Cycle
- Universal metric to track and improve
- Affects productivity and project risk
![Page 6: Open Source Ingredients for Interactive Data Analysis in Spark](https://reader035.vdocuments.us/reader035/viewer/2022062905/586fde811a28ab18428b6bf1/html5/thumbnails/6.jpg)
Development phases
- Data exploration and experimentation
- Data sharing
- Development of production code
- Debugging
![Page 7: Open Source Ingredients for Interactive Data Analysis in Spark](https://reader035.vdocuments.us/reader035/viewer/2022062905/586fde811a28ab18428b6bf1/html5/thumbnails/7.jpg)
Interactive Spark on Azure
[Architecture diagram]
- Clients: Command line, Jupyter notebooks, IntelliJ IDEA, BI Tools
- Access protocols: SSH, REST, ODBC
- Servers: Livy server, Thrift server
- YARN: Spark Applications running in the Default Queue and the Thrift Queue
- Storage: Local HDFS, Blob Storage, Data Lake Store
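The REST path to the Livy server in the architecture above can be sketched as follows. This is a minimal illustration, not the speaker's code: it only builds the two HTTP requests that Livy's REST API expects for an interactive session (`POST /sessions`, then `POST /sessions/{id}/statements`) and does not send them. The host name, port, and session id are hypothetical placeholders.

```python
import json
import urllib.request

# Hypothetical endpoint: substitute the Livy URL of your own cluster.
LIVY_URL = "http://livy.example.com:8998"


def livy_request(path, payload):
    """Build (but do not send) a JSON POST request for the Livy REST API."""
    return urllib.request.Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Step 1: ask Livy to start an interactive PySpark session.
create_session = livy_request("/sessions", {"kind": "pyspark"})

# Step 2: run a statement in that session. The id 0 is a placeholder;
# in practice it comes from the JSON response to step 1.
run_statement = livy_request(
    "/sessions/0/statements",
    {"code": "sc.parallelize(range(1000)).count()"},
)
```

Passing either request to `urllib.request.urlopen` would submit it; a real cluster will usually also need authentication, which is left out of this sketch.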
![Page 8: Open Source Ingredients for Interactive Data Analysis in Spark](https://reader035.vdocuments.us/reader035/viewer/2022062905/586fde811a28ab18428b6bf1/html5/thumbnails/8.jpg)
Ingredients
![Page 9: Open Source Ingredients for Interactive Data Analysis in Spark](https://reader035.vdocuments.us/reader035/viewer/2022062905/586fde811a28ab18428b6bf1/html5/thumbnails/9.jpg)
Apache Spark
- Interactive compute engine
  - Interactive on small datasets
  - Interactive on large datasets on large clusters with in-memory or SSD caching
  - Built-in sampling
- Upcoming in Spark 2.0
  - Tungsten Phase 2 (5-10x speedup)
  - Structured Streaming
- Great momentum
  - Active and large community
  - Supported by all major big data vendors
  - Fast release cadence
![Page 10: Open Source Ingredients for Interactive Data Analysis in Spark](https://reader035.vdocuments.us/reader035/viewer/2022062905/586fde811a28ab18428b6bf1/html5/thumbnails/10.jpg)
Evolution of big data
Data Sources
![Page 11: Open Source Ingredients for Interactive Data Analysis in Spark](https://reader035.vdocuments.us/reader035/viewer/2022062905/586fde811a28ab18428b6bf1/html5/thumbnails/11.jpg)
Spark on Azure Cloud (HDInsight)
- Fully Managed Service
  - 100% open source Apache Spark and Hadoop bits
  - Latest releases of Spark
  - Fully supported by Microsoft and Hortonworks
  - 99.9% Azure Cloud SLA
  - Certifications: PCI, ISO 27018, SOC, HIPAA, EU-MC
- Tools for data exploration, experimentation and development
  - Jupyter Notebooks (Scala, Python, automatic data visualizations)
  - IntelliJ plugin (job submission, remote debugging)
  - ODBC connector for Power BI, Tableau, Qlik, SAP, Excel, etc.
![Page 12: Open Source Ingredients for Interactive Data Analysis in Spark](https://reader035.vdocuments.us/reader035/viewer/2022062905/586fde811a28ab18428b6bf1/html5/thumbnails/12.jpg)
Resource management
![Page 13: Open Source Ingredients for Interactive Data Analysis in Spark](https://reader035.vdocuments.us/reader035/viewer/2022062905/586fde811a28ab18428b6bf1/html5/thumbnails/13.jpg)
Interactive Spark on Azure
[Architecture diagram repeated from page 7: clients, Livy and Thrift servers, YARN queues, storage]
![Page 14: Open Source Ingredients for Interactive Data Analysis in Spark](https://reader035.vdocuments.us/reader035/viewer/2022062905/586fde811a28ab18428b6bf1/html5/thumbnails/14.jpg)
YARN resource management
- Dynamic resource allocation (Thrift)
  - Thrift server adds executors when processing SQL queries
  - After a timeout, it shrinks back
- Resource preemption (between queues)
  - Thrift will take resources from other apps during activity and vice versa
  - When multiple apps are active, the resources are shared fairly
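The two mechanisms on this slide map onto standard Spark and YARN Capacity Scheduler properties. The sketch below collects them in plain Python dictionaries; the property names are the standard ones, but every value and the queue name `thriftsvr` are illustrative assumptions, not settings taken from the talk.

```python
# Spark-side settings for the Thrift server application (values are illustrative).
SPARK_DYNAMIC_ALLOCATION = {
    "spark.dynamicAllocation.enabled": "true",
    # The external shuffle service is needed by dynamic allocation in Spark 1.x and 2.x.
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "20",
    # Idle executors are released after this timeout ("shrinks back").
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
}

# YARN-side settings that enable preemption between Capacity Scheduler queues.
# "thriftsvr" is a hypothetical queue name standing in for the Thrift Queue.
YARN_PREEMPTION = {
    "yarn.resourcemanager.scheduler.monitor.enable": "true",
    "yarn.resourcemanager.scheduler.monitor.policies": (
        "org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity."
        "ProportionalCapacityPreemptionPolicy"
    ),
    "yarn.scheduler.capacity.root.queues": "default,thriftsvr",
    # Guaranteed shares: each queue gets half of the cluster under contention.
    "yarn.scheduler.capacity.root.default.capacity": "50",
    "yarn.scheduler.capacity.root.thriftsvr.capacity": "50",
    # Either queue may grow to the whole cluster while the other is idle.
    "yarn.scheduler.capacity.root.default.maximum-capacity": "100",
    "yarn.scheduler.capacity.root.thriftsvr.maximum-capacity": "100",
}


def to_spark_submit_args(conf):
    """Render a config dict as spark-submit style `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args
```

With equal guaranteed capacities and a maximum capacity of 100, an active queue can borrow the idle queue's share, and preemption returns it once the other queue becomes active.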
![Page 15: Open Source Ingredients for Interactive Data Analysis in Spark](https://reader035.vdocuments.us/reader035/viewer/2022062905/586fde811a28ab18428b6bf1/html5/thumbnails/15.jpg)
YARN resource management: Limitations
- Bugs
  - Capacity Scheduler with the Default resource calculator works
  - Dominant resource calculator breaks preemption logic
- Limitations
  - No resource preemption between applications
  - No application sharing between notebooks in Livy
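A small check for the configuration caveat above might look like the sketch below. It assumes the YARN settings have already been loaded into a dictionary; the function name is hypothetical, and treating a missing key as the stock Hadoop default (Capacity Scheduler, Default resource calculator) is an assumption of this sketch.

```python
DEFAULT_CALCULATOR = "org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator"
DOMINANT_CALCULATOR = "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
CAPACITY_SCHEDULER = (
    "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler"
)


def preemption_config_warnings(conf):
    """Return warnings for combinations the slide reports as problematic."""
    warnings = []
    # Missing keys are assumed to fall back to the stock Hadoop defaults.
    scheduler = conf.get("yarn.resourcemanager.scheduler.class", CAPACITY_SCHEDULER)
    calculator = conf.get(
        "yarn.scheduler.capacity.resource-calculator", DEFAULT_CALCULATOR
    )
    if scheduler != CAPACITY_SCHEDULER:
        warnings.append("scheduler is not the Capacity Scheduler: " + scheduler)
    if calculator == DOMINANT_CALCULATOR:
        warnings.append("Dominant resource calculator is reported to break preemption")
    return warnings
```

An empty result means the configuration matches the combination the slide reports as working.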
![Page 16: Open Source Ingredients for Interactive Data Analysis in Spark](https://reader035.vdocuments.us/reader035/viewer/2022062905/586fde811a28ab18428b6bf1/html5/thumbnails/16.jpg)
Summary: Full list of ingredients
- Components
  - Apache Spark
  - Jupyter + sparkmagic kernel (or Zeppelin)
  - Livy job server
  - Apache YARN resource management using queues and preemption
  - Columnar file formats (Parquet, ORC)
  - IntelliJ IDEA + plugin for HDInsight
  - [Non-OSS] BI Tools: Power BI, Tableau, Qlik, SAP, Excel, etc.
  - Azure Cloud
- Techniques
  - Sample, sample, sample
  - CACHE TABLE (or auto-caching using Alluxio)
  - Scale out on demand using elasticity of the cloud
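The first two techniques can be combined in a single Spark SQL statement: cache a small sample of a large table and explore that interactively. The helper below only builds the statement text, which could then be run from a notebook or through the Thrift server. The function name and the table names are hypothetical, and `TABLESAMPLE` support depends on the Spark version and SQL context in use.

```python
def sample_and_cache_sql(source_table, cached_name, percent):
    """Build a Spark SQL statement that caches a percentage sample of a table."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    return (
        f"CACHE TABLE {cached_name} AS "
        f"SELECT * FROM {source_table} TABLESAMPLE ({percent} PERCENT)"
    )


# Hypothetical table names: cache a 1% sample of `events` for exploration.
statement = sample_and_cache_sql("events", "events_sample", 1)
```

Later queries against `events_sample` are served from the cached sample rather than rescanning the full table.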
![Page 17: Open Source Ingredients for Interactive Data Analysis in Spark](https://reader035.vdocuments.us/reader035/viewer/2022062905/586fde811a28ab18428b6bf1/html5/thumbnails/17.jpg)
Resources
- SparkMagic kernel for Jupyter notebook: https://github.com/jupyter-incubator/sparkmagic
- Livy job server: https://github.com/cloudera/livy
- IntelliJ IDEA plug-in documentation: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-intellij-tool-plugin/
- Azure Spark Documentation: https://azure.microsoft.com/en-us/documentation/services/hdinsight/
![Page 18: Open Source Ingredients for Interactive Data Analysis in Spark](https://reader035.vdocuments.us/reader035/viewer/2022062905/586fde811a28ab18428b6bf1/html5/thumbnails/18.jpg)
© Microsoft Corporation. All rights reserved.