scalable complex analytics and dbmss · 15 dbms options • array store (scidb, rasdaman, hdf-5)...
TRANSCRIPT
![Page 1: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/1.jpg)
Scalable Complex Analytics and DBMSs
Michael Stonebraker
![Page 2: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/2.jpg)
2
Simple Analytics
• SQL operations — count, sum, max, min, avg — Optional group_by
• Defined on tables
• User interface is Business Intelligence Tools — Cognos, Business Objects, …
• Appropriate for traditional business applications
![Page 3: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/3.jpg)
3
Simple Analytics
• Well served by the data warehouse crowd
• Who are good at this stuff — even on petabytes
![Page 4: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/4.jpg)
4
Complex Analytics
• Machine learning • Data clustering • Predictive models • Recommendation engines • Regressions • Estimators
![Page 5: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/5.jpg)
5
Complex Analytics
• By and large, they are defined on arrays • As collections of linear algebra operations • They are not in SQL! • And often
— Are defined on large amounts of data — And/or in high dimensions
![Page 6: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/6.jpg)
6
Complex Analytics on Array Data – An Accessible Example
• Consider the closing price on all trading days for the
last 20 years for two stocks A and B • What is the covariance between the two time-
series? (1/N) * sum (Ai - mean(A)) * (Bi - mean (B))
![Page 7: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/7.jpg)
7
Now Make It Interesting …
• Do this for all pairs of 15000 stocks — The data is the following 15000 x 4000 matrix
Stock t1 t2 t3 t4 t5 t6 t7 …. t4000
S1
S2
…
S15000
![Page 8: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/8.jpg)
8
Array Answer
• Ignoring the (1/N) and subtracting off the means ….
Stock * StockT
![Page 9: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/9.jpg)
9
Use Case Requirements
• Complex analytics — Covariance is just the start — Defined on arrays — Graphs are just sparse arrays
• Data management — Leave out outliers — Just on securities with a market cap over $10B
• Scalability to many cores, many nodes and out-of-memory data
![Page 10: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/10.jpg)
10
Data Scientist Job Description
• Ignore the 80 - 90% of the time spent cleaning and assembling the data
— Separate talk on data curation
• Until (tired) { Data management operation(s); Complex analytics operations(s); }
![Page 11: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/11.jpg)
11
Solution Options for Data Management
• Hard Code — Separate stack from the bare metal up for each project (LHS is
40M lines of code) — No uniform treatment of meta data (often encoded in the file
name) — Can’t share data easily — Depends on the “cheap PostDoc” model
![Page 12: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/12.jpg)
12
Solution Options for Data Management
• Use a DBMS — Get sharing, indexing, protection, queries, crash recovery, ….
• Please, please, please use a DBMS — If you get nothing else from this talk, please take note of this! — Take a page from the business data processing playbook!
• Yabut – I can code a faster solution — But you are dooming your successor to maintaining it! — And requirements change!!!!
![Page 13: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/13.jpg)
13
DBMS Options
• Traditional row store (Postgres, MySQL, Oracle, Big Table, ...) — Stores the data on disk row-by-row — Not competitive on data intensive queries — For a collection of very good technical reasons
![Page 14: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/14.jpg)
14
DBMS Options
• Column store (Vertica, Red Shift, DB2-Blu, Impala, …)
— Store the data column-by-column — Easier to compress; much faster executor; often read less than
all columns — Generally 50 X row stores on this kind of stuff
• In the data warehouse market — This technology is in the process of completely taking over
![Page 15: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/15.jpg)
15
DBMS Options
• Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL — Store the data in multi-dimensional tiles (chunks)
• Advantages — Same conceptual model as linear algebra — No table to array conversion required (which is very slow) — Dimensions are not stored (space advantage) — Multi-dimensional queries are very very fast, since the storage
structure is “chunked”
![Page 16: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/16.jpg)
Array Query Language (AQL)
SELECT Geo-Mean ( T.B )FROM Test_Array T WHERE T.I BETWEEN :C1 AND :C2 AND T.J BETWEEN :C3 AND :C4AND T.A = 10GROUP BY T.I;
User-defined aggregate on an attribute B in array T Subsample Filter Group-by
![Page 17: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/17.jpg)
17
DBMS Options
• Map-Reduce (open source version is Hadoop) — Good for embarassingly parallel problems only — Which this stuff is not!!! — Abandoned by Google in 2011 (or so) — Cloudera has a DBMS (Impala) – NOT built on Map-Reduce
• This interface is essentially dead
![Page 18: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/18.jpg)
18
Two Things to Keep in Mind (1) (Data Base 101)
• Always send the query to the data (Kbytes) — Minimizes data comm
• Do not bring the data to the query (Tbytes)! — Forward pointer to HPC
![Page 19: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/19.jpg)
19
Two Things to Keep in Mind (2)
• On matrix multiply, there are five orders of magnitude difference between Python and Intel-optimized C++
• Example — One order of magnitude between LaPack/BLAS/MKL and
“smart Russians in C++” — Java is another order of magnitude down (Spark, Mahout, …)
• Very difficult to compete with optimized packages and Intel engineers!!!
![Page 20: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/20.jpg)
20
Analytics Options
• Code in SQL — Matrix multiply is a 3-way self join — If the data is sparse enough, this may be ok — On dense data this will be a disaster (SQL and Python are
likely to have similar performance)
• Madlib is a package that did this — And was quickly recoded in C++
• Bill Howe will probably have a different opinion
— I suspect
![Page 21: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/21.jpg)
21
Analytics Options (Loose Integration)
• Code in a stat package (R, SAS, SPSS, Mahout, …)
— Copy the world from the DBMS to the package (slow) — Learn 2 interfaces — You’re in the plumbing business! — Parallel packages are just coming into existence — Most stat packages are main-memory only
• I don’t like this option at all! — Long term slog through the swamp
![Page 22: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/22.jpg)
22
Analytics Options (Tight Integration)
• Run stat code as a user-defined function — Inside the DBMS — Called through extensions to SQL
![Page 23: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/23.jpg)
Example Query
SELECT A.i * B.jFROM A, BWHERE
A.k > 100 andB.m < 200
![Page 24: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/24.jpg)
24
Analytics Options (Tight Integration)
• Learn one interface • No “copy the world” problem • Run stat code as a user-defined function
— Inside the DBMS — Automatic parallelism (at least in SciDB)
![Page 25: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/25.jpg)
25
(Some of the) Detailed Options
• Loose coupling — {R, SAS, SPSS} + your favorite DBMS
• Tight coupling — SciDB + Scalapack — SciDB + R — Vertica + R
![Page 26: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/26.jpg)
26
A Note on Hadoop/HDFS
• Impala is not coded on top of HDFS — Drills through to underlying Linux files — Looks exactly like a parallel column store (e.g. Vertica,
Redshift, …)
• “Hadoop market” and “data warehouse market” are converging
• Current marketing slogo is “data lakes” — Creates a data swamp by ignoring data curation issues — Or a junk drawer
![Page 27: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/27.jpg)
27
A Note on Spark
• 70+% of Spark access is SparkSQL • However, Spark has
— No persistence — No meta data — No main memory sharing — Java (slow)
• I expect all of this to get fixed over time — And Spark will follow the trajectory of Hadoop to become a
data warehouse market
• Remainder is Scala (slow) — Remains to be seen how Spark will play in the general
distributed computing space….
![Page 28: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/28.jpg)
28
Issues in Using ScalaPack in SciDB
• Block cyclic organization — which DBMS does not support
• MKL — Which DBMSs won’t use for crash recovery issues
• Tile organization — Scalapack is dense-only — SciDB is a single format for dense and sparse
![Page 29: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/29.jpg)
29
The Future
• Co-design of analytics and DBMS storage organization
— To get rid of these issues — Intel-supported project at MIT and UTenn
![Page 30: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/30.jpg)
30
An Exercise at NERSC
• General NERSC architecture is — A compute server — A storage server — A compute-side file cache; scheduled in advance
![Page 31: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/31.jpg)
31
Issues
• DBMS wants to be “always on” service — Incompatible with scheduling the file cache
• Send the data to the query not the other way around
— Every time somebody wants data access, need to move the world
![Page 32: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/32.jpg)
32
At NERSC
• SciDB runs — Managing many, many Tbytes of data — On dedicated nodes
• Could not get Vertica to run at all — Painful aspects of batch job focus (scheduling the file
cache; open file limit)
![Page 33: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/33.jpg)
33
Summary
• Stand on the shoulders of those who went before you, not on their feet
— Please don’t write a complete stack for each new project
• Want to tightly couple DBMSs and linear algebra
— Or you get 2 interfaces — And copy the world
![Page 34: Scalable Complex Analytics and DBMSs · 15 DBMS Options • Array store (SciDB, Rasdaman, HDF-5) — Data model is an array, not a table — Query language is typically array-SQL](https://reader034.vdocuments.us/reader034/viewer/2022042203/5ea4d995f253c54cf95ffecf/html5/thumbnails/34.jpg)
34
Summary
• Array DBMSs are likely to be attractive — Check out SciDB.org
• Hadoop and Spark will probably morph into something that looks like a DBMS
— Turkey performance in the meantime
• HPC needs to become interactive — Or DBMSs probably won’t run there