Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
TRANSCRIPT
We Have Seen a Lot
Worked with hundreds of companies to run Spark in production over five years
Collaborate with all major Hadoop and Big Data vendors
Need to process data from
• Multiple sources
• Different data stores and locations
• Different formats
Traditional solutions: ETL data into data warehouse, …
Traditional Data Warehouses
ETL
Slow to access and combine data
Data Warehouse
Process data in place or stream it
• No need to wait for data to be ETLed
JIT Data Warehouse
ETL
Data Warehouse
Process data in place or stream it
• No need to wait for data to be ETLed
Cache data in memory or SSDs
JIT Data Warehouse
Low latency and easy to combine data: value!
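The idea of processing data in place and caching it can be sketched in a few lines. This is a plain-Python stand-in, not Spark code: the class name JitWarehouse, the two sample sources, and the query are all hypothetical and exist only to illustrate the pattern. Each source is parsed the first time a query touches it and is then served from memory.

```python
import csv
import io
import json

# Hypothetical sources left in their original formats (no up-front ETL).
TRANSACTIONS_CSV = "user_id,amount\n1,20.0\n2,5.5\n1,4.5\n"
CLICKS_JSONL = (
    '{"user_id": 1, "page": "home"}\n'
    '{"user_id": 2, "page": "cart"}\n'
    '{"user_id": 1, "page": "cart"}\n'
)


class JitWarehouse:
    """Reads each source only when first queried, then keeps it in memory."""

    def __init__(self):
        self._readers = {}  # source name -> zero-argument function yielding rows
        self._cache = {}    # source name -> parsed rows, filled on first access
        self.loads = 0      # how many times a source was actually parsed

    def register(self, name, reader):
        self._readers[name] = reader

    def table(self, name):
        if name not in self._cache:
            self._cache[name] = list(self._readers[name]())
            self.loads += 1
        return self._cache[name]


def read_transactions():
    for row in csv.DictReader(io.StringIO(TRANSACTIONS_CSV)):
        yield {"user_id": int(row["user_id"]), "amount": float(row["amount"])}


def read_clicks():
    for line in CLICKS_JSONL.splitlines():
        yield json.loads(line)


def spend_and_clicks(wh):
    """Combine both sources per user: total spend and click count."""
    result = {}
    for t in wh.table("transactions"):
        entry = result.setdefault(t["user_id"], {"spend": 0.0, "clicks": 0})
        entry["spend"] += t["amount"]
    for c in wh.table("clicks"):
        entry = result.setdefault(c["user_id"], {"spend": 0.0, "clicks": 0})
        entry["clicks"] += 1
    return result


warehouse = JitWarehouse()
warehouse.register("transactions", read_transactions)
warehouse.register("clicks", read_clicks)

first = spend_and_clicks(warehouse)   # parses both sources on demand
second = spend_and_clicks(warehouse)  # served from the in-memory cache
```

The second query does not parse either source again, which is where the low latency comes from in this toy version. A real deployment would rely on the engine's own data source connectors and caching rather than a hand-written class.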
Analogy
[Diagram: "ETL & Query" (Data Source A, Data Source B, ETL, Data Warehouse) vs. "Stream/Cache + Query" (Data Source A, Data Source B)]
Top-3 Media Company
Data sources
• Traditional data warehouse: Customer transaction and profile data
• S3: Clickstream and historical logs
• Elasticsearch: User-submitted reviews and comments
• Kafka: Streaming online event data
Build Spark-based JIT Data Warehouse to perform real-time analytics
Unified support for
• Batch
• Streaming
• ML/Graphs
• …
Spark: Unified Engine
GraphX
MLlib
Core
Spark Streaming
Spark SQL SparkR
Easy to manage, learn, and combine functionality
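One practical consequence of a unified engine is that the same logic can serve both batch and streaming jobs. The sketch below shows that in plain Python rather than with the Spark API: the function names and the sample events are hypothetical, and the stream is modeled as a sequence of small batches, which is how Spark Streaming treats it.

```python
from collections import Counter


def count_events(events):
    """Shared logic: count events per type."""
    return Counter(e["type"] for e in events)


def run_batch(events):
    """Batch job: apply the shared logic to the full data set at once."""
    return count_events(events)


def run_streaming(micro_batches):
    """Streaming job: apply the same logic to each micro-batch
    and yield the running totals after every batch."""
    totals = Counter()
    for batch in micro_batches:
        totals += count_events(batch)
        yield Counter(totals)  # copy, so earlier results are not mutated


# Hypothetical event data used for both modes.
events = [
    {"type": "view"},
    {"type": "click"},
    {"type": "view"},
    {"type": "purchase"},
]

batch_result = run_batch(events)
stream_results = list(run_streaming([events[:2], events[2:]]))
```

Once the stream has delivered every event, its final running total matches the batch result, because both paths call the same count_events function.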
Analogy
First cellular phones
Unified device (smartphone)
Specialized devices
Better games, better GPS, better phone
Analogy
Batch processing
Unified system
Specialized systems
Real-time analytics
Instant fraud detection
Better apps
Large On-line Service Company
Leverages
• Interactive query processing
• ML
and combines data from S3, Redshift, and HBase to provide
• data analytics for product management team
• advanced predictive analytics to deliver new services (e.g., customized inventory displays tailored to each user)