Not Your Father's Database by Vida Ha
![Page 1: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/1.jpg)
Not Your Father’s Database: How to Use Apache Spark Properly in Your Big Data Architecture
Spark Summit East 2016
![Page 5: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/5.jpg)
About Me
2005 Mobile Web & Voice Search
2012 Reporting & Analytics
2014 Solutions Engineering
![Page 6: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/6.jpg)
Is this your Spark infrastructure?
This system talks like a SQL database…
(Diagram labels: SQL, HDFS)
![Page 7: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/7.jpg)
Is this your Spark infrastructure?
But the performance is very different…
(Diagram labels: SQL, HDFS)
![Page 10: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/10.jpg)
Just in Time Data Warehouse w/ Spark
(Diagram labels: HDFS, and more…)
![Page 11: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/11.jpg)
Today’s Goal
Know when to use other data stores besides file systems.
![Page 12: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/12.jpg)
Good: General Purpose Processing
Types of Data Sets to Store in File Systems:
• Archival Data
• Unstructured Data
• Social Media and other web datasets
• Backup copies of data stores
![Page 13: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/13.jpg)
Good: General Purpose Processing
Types of workloads:
• Batch Workloads
• Ad Hoc Analysis
  – Best Practice: Use in-memory caching
• Multi-step Pipelines
• Iterative Workloads
![Page 14: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/14.jpg)
Good: General Purpose Processing
Benefits:
• Inexpensive Storage
• Incredibly flexible processing
• Speed and Scale
![Page 16: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/16.jpg)
Bad: Random Access
sqlContext.sql("select * from my_large_table where id=2I34823")
Will this command run in Spark? Yes, but it’s not very efficient: Spark may have to go through all your files to find your row.
![Page 17: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/17.jpg)
Bad: Random Access
Solution: If you frequently access your data randomly, use a database.
• For traditional SQL databases, create an index on your key column.
• Key-value NoSQL stores retrieve the value of a key efficiently out of the box.
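As a minimal sketch of the first bullet, the snippet below uses Python's built-in `sqlite3` module as a stand-in for a traditional SQL database. The table name `my_large_table` comes from the slide; the column names, the index name `idx_my_large_table_id`, and the sample rows are assumptions made for illustration. It asks SQLite for its query plan before and after an index is created on the key column.

```python
import sqlite3

# In-memory SQLite database standing in for "a traditional SQL database".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_large_table (id TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO my_large_table VALUES (?, ?)",
    [(f"id{i}", f"row {i}") for i in range(10_000)],
)


def plan_for_lookup(connection):
    """Return SQLite's plan description for a single-key lookup."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM my_large_table WHERE id = ?",
        ("id4242",),
    ).fetchall()
    # The human-readable description is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


# Without an index, the only option is to scan the table.
plan_without_index = plan_for_lookup(conn)

# Hypothetical index name; the point is that the key column is indexed.
conn.execute("CREATE INDEX idx_my_large_table_id ON my_large_table (id)")
plan_with_index = plan_for_lookup(conn)

print(plan_without_index)
print(plan_with_index)
```

The exact wording of the plan text varies between SQLite versions, but the first plan should describe a scan and the second a search using the index.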
![Page 18: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/18.jpg)
Bad: Frequent Inserts
sqlContext.sql("insert into TABLE myTable select fields from my2ndTable")
Each insert creates a new file:
• Inserts are reasonably fast.
• But querying will be slow…
![Page 19: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/19.jpg)
Bad: Frequent Inserts
Solution:
• Option 1: Use a database to support the inserts.
• Option 2: Routinely compact your Spark SQL table files.
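The idea behind Option 2 can be sketched with plain files, without Spark. In the snippet below, a temporary directory stands in for a table's storage location and each small `part-*.txt` file stands in for the file produced by one insert; the file names and the `compact` helper are hypothetical. In Spark itself, compaction would typically mean rewriting the table into fewer, larger files, which is not shown here.

```python
import tempfile
from pathlib import Path


def compact(table_dir: Path, compacted_name: str = "part-compacted.txt") -> Path:
    """Merge every small part file in table_dir into one larger file."""
    small_files = sorted(table_dir.glob("part-*.txt"))
    target = table_dir / compacted_name
    tmp_target = table_dir / "_compacting.tmp"
    # Write to a temporary name first so readers never see a half-written file.
    with tmp_target.open("w") as out:
        for part in small_files:
            out.write(part.read_text())
    # Only after the merged copy is complete are the small files removed.
    for part in small_files:
        part.unlink()
    tmp_target.rename(target)
    return target


# Simulate 50 inserts, each of which leaves behind one small file.
table_dir = Path(tempfile.mkdtemp())
for i in range(50):
    (table_dir / f"part-{i:05d}.txt").write_text(f"row {i}\n")

files_before = len(list(table_dir.iterdir()))
compacted = compact(table_dir)
files_after = len(list(table_dir.iterdir()))
rows_after = compacted.read_text().splitlines()

print(files_before, files_after, len(rows_after))
```

A query over the compacted directory opens one file instead of fifty while reading the same rows, which is the effect routine compaction is after.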
![Page 20: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/20.jpg)
Good: Data Transformation/ETL
Use Spark to slice and dice your data files any way you need.
File storage is cheap, so storing duplicate copies of your data is not an “anti-pattern”.
![Page 21: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/21.jpg)
Bad: Frequent/Incremental Updates
Update statements are not supported yet.
Why not?
• Random Access: Locate the row(s) in the files.
• Delete & Insert: Delete the old row and insert a new one.
• Update: File formats aren’t optimized for updating rows.
Solution: Many databases support efficient update operations.
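The contrast can be sketched in plain Python. The first half updates one row of a simple `key,value` text file, which means locating the row and writing the whole file back; the second half issues the same change as an `UPDATE` against SQLite, used here only as a convenient example of a database. The file layout, the table `t`, and the helper name are assumptions made for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path


def update_row_in_file(path: Path, key: str, new_value: str) -> int:
    """Update one row of a 'key,value' text file; return the lines rewritten."""
    rewritten = []
    for line in path.read_text().splitlines():
        k, _, _old_value = line.partition(",")
        rewritten.append(f"{k},{new_value}" if k == key else line)
    # The whole file is written back even though a single row changed.
    path.write_text("\n".join(rewritten) + "\n")
    return len(rewritten)


# File-based "table" with 1000 rows.
data_file = Path(tempfile.mkdtemp()) / "table.csv"
data_file.write_text("".join(f"k{i},old\n" for i in range(1000)))
lines_rewritten = update_row_in_file(data_file, "k500", "new")

# The same change in a database: the primary key locates the row directly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k TEXT PRIMARY KEY, v TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(f"k{i}", "old") for i in range(1000)])
cursor = conn.execute("UPDATE t SET v = ? WHERE k = ?", ("new", "k500"))
rows_touched = cursor.rowcount

print(lines_rewritten, rows_touched)
```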
![Page 22: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/22.jpg)
Bad: Frequent/Incremental Updates
Use Case: Up-to-date, live views of your SQL tables.
Database Snapshot + Incremental SQL Query
Tip: Use ClusterBy for fast joins or Bucketing with Spark 2.0.
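One way to read this slide is that a periodic snapshot of the database is combined with an incremental query for the rows that changed since the snapshot was taken. The sketch below shows that combination in plain Python under that interpretation; the `Row` fields, the `updated_at` column, and the `live_view` helper are hypothetical and do not come from the talk.

```python
from typing import Dict, List, NamedTuple


class Row(NamedTuple):
    id: int
    value: str
    updated_at: int  # hypothetical version/timestamp column


def live_view(snapshot: List[Row], incremental: List[Row]) -> List[Row]:
    """Combine a snapshot with newer rows, keeping the latest row per id."""
    latest: Dict[int, Row] = {}
    for row in snapshot + incremental:
        current = latest.get(row.id)
        if current is None or row.updated_at >= current.updated_at:
            latest[row.id] = row
    return sorted(latest.values(), key=lambda r: r.id)


# Snapshot taken at time 100, plus the rows changed or added since then.
snapshot = [Row(1, "a", 100), Row(2, "b", 100), Row(3, "c", 100)]
incremental = [Row(2, "b2", 150), Row(4, "d", 160)]

view = live_view(snapshot, incremental)
for row in view:
    print(row)
```

Both inputs are matched on the same key, which is presumably where the tip about clustering or bucketing by that key comes in.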
![Page 23: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/23.jpg)
Good: Connecting BI Tools
Tip: Cache your tables for optimal performance.
(Diagram label: HDFS)
![Page 24: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/24.jpg)
Bad: External Reporting w/ load
Too many concurrent requests will overload Spark.
(Diagram label: HDFS)
![Page 25: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/25.jpg)
Bad: External Reporting w/ load
Solution: Write out to a DB to handle load.
(Diagram labels: HDFS, DB)
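A minimal sketch of this pattern: the batch job computes the report rows once and publishes them to a serving database, and the reporting front end only ever issues small keyed lookups against that database. SQLite stands in for the DB here, and the `daily_sales` table, its columns, and the sample rows are all made up for illustration. In a Spark job the write step would more likely go through Spark's JDBC data source than through `sqlite3`.

```python
import sqlite3

# Hypothetical output of a batch job: (report_date, region, total_sales).
batch_results = [
    ("2016-02-16", "east", 1200.0),
    ("2016-02-16", "west", 950.5),
    ("2016-02-17", "east", 1310.25),
]


def publish(results, db_path=":memory:"):
    """Write precomputed report rows to a serving database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_sales ("
        "report_date TEXT, region TEXT, total_sales REAL, "
        "PRIMARY KEY (report_date, region))"
    )
    # INSERT OR REPLACE lets the batch job be re-run without duplicating rows.
    conn.executemany("INSERT OR REPLACE INTO daily_sales VALUES (?, ?, ?)", results)
    conn.commit()
    return conn


def report_for(conn, report_date):
    """The kind of small, keyed lookup a reporting front end would issue."""
    return conn.execute(
        "SELECT region, total_sales FROM daily_sales "
        "WHERE report_date = ? ORDER BY region",
        (report_date,),
    ).fetchall()


serving_db = publish(batch_results)
east_west = report_for(serving_db, "2016-02-16")
print(east_west)
```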
![Page 26: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/26.jpg)
Good: Machine Learning & Data Science
Use MLlib, GraphX and Spark packages for machine learning and data science.
Benefits:
• Built-in distributed algorithms.
• In-memory capabilities for iterative workloads.
• Data cleansing, featurization, training, testing, etc.
![Page 27: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/27.jpg)
Bad: Searching Content w/ load
sqlContext.sql("select * from mytable where name like '%xyz%'")
Spark will go through each row to find results.
![Page 28: Not Your Father's Database by Vida Ha](https://reader034.vdocuments.us/reader034/viewer/2022042707/586f750b1a28ab10258b5e8b/html5/thumbnails/28.jpg)
Thank you