SciDB: Open Source Data Management System for Data-Intensive Scientific Analytics
Jacek Becla
San Diego Supercomputer Center
05/29/2009
Outline
• Needs, challenges, today’s solutions and emerging trends
• SciDB design and planned features
• SciDB structure and timeline
Size Challenge
• Data set sizes grow dramatically
• Growth rate increases
• Implications
– Failures are routine
– Provenance tracking is a must
– Massive parallelization is a must
– Full automation and self-adjustment are a must
Analytics Complexity
• More data varieties = more ways to analyze it
• Rapid growth in the complexity of analytics
– Time series comparisons
– N² and N³ correlations
– Proximity and grouping-based searches
• Interactive exploration enables most science
• Data uncertainty matters
• Provenance is an integral part of analytics
• User annotations are important
• Ad-hoc integration of derived data with raw data is desired
• True for science and industry
Today’s Technologies
• Existing databases
– Most are too monolithic
– Expensive to scale
– Expensive to provide high availability
– Built for perfect schemas and clean data
– Relational data model far from ideal for most projects
– APIs far from ideal; intuitive interfaces desired
• Most very large systems shy away from databases
Today’s Solutions
• Metadata in a lightweight database plus bulk data in files
– BaBar, LHC, LCLS
• Bulk data stored as unstructured data in a database
– NIF
• Raw data in files, derived data in a database
– PanSTARRS, LSST (future projects)
• Complete (or mostly) home-grown systems
– AT&T, Google, Yahoo, Amazon, Facebook
– The most common solution
• All in a database
– WalMart (very expensive)
– eBay (very expensive, testing a new home-grown solution)
– SDSS, bio, genomics (small-ish, single-server databases)
• Little reuse; a roll-your-own mentality
Future
• Emerging trends
– Shared-nothing parallel databases
– Lightweight, specialized components
– On low-cost commodity hardware
– Aggressive compression
• Several attempts to push the state of the art forward
– Aster Data, Vertica, ParAccel, EnterpriseDB, Greenplum, Netezza
• Some issues not addressed by anyone
– Arrays, provenance, uncertainty, partial results, intuitive interfaces
XLDB Activities
• 2007
– Identify trends, roadblocks
– Bridge the gaps
• 2008
– Complex analytics
– Bridge the gaps
• 2009
– Reach out to non-US communities
– Connect with remaining sciences
Outline
• Needs, challenges, today’s solutions and emerging trends
• SciDB design and planned features
• SciDB structure and timeline
New Open Source Science Database System
• Philosophy
– Address common scientific needs
– Geared for analytics, not OLTP
• Key requirements
– Open source
– Commercial quality
– Peta-scale
Data Model - Types
• Scalars
– Standard base types (int, float, string, date, …)
– Geospatial (3-D points, lines, polygons, boxes)
• Multi-dimensional arrays
– Regular or irregular
– Any number of dimensions
– Nesting allowed
– Dense or sparse
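To make the type model concrete, here is a minimal sketch in Python (hypothetical names throughout; this is not SciDB syntax, which was still being defined at the time of this talk) of a sparse 2-D array whose cells hold scalar attributes:

```python
# Conceptual sketch of the array data model (hypothetical names, not
# SciDB syntax): a sparse 2-D array whose cells hold scalar attributes.

from dataclasses import dataclass

@dataclass
class Cell:
    flux: float       # standard scalar attribute
    flux_err: float   # per-cell error bar (see the Uncertainty slide)

class SparseArray2D:
    """Sparse array: only non-empty cells are stored, keyed by (i, j)."""
    def __init__(self, shape):
        self.shape = shape             # e.g. (ra_bins, dec_bins)
        self.cells = {}                # {(i, j): Cell}

    def set(self, i, j, cell):
        self.cells[(i, j)] = cell

    def get(self, i, j):
        return self.cells.get((i, j))  # None for an empty cell

sky = SparseArray2D(shape=(3600, 1800))
sky.set(120, 45, Cell(flux=17.2, flux_err=0.3))
```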
Data Model - Operators
• Native (built-in)
– Array-SQL (filter, project, group_by, aggregation, …)
– Array operators (pivot, regridding, reshaping, transformations, nest, flatten, …)
• User-defined functions (UDFs)
– Postgres-style
– Coded in C++
• Native operators coded as UDFs
• All UDFs treated equally
– The optimizer might do more with built-in UDFs
• Two kinds: per cell, per array (see the sketch below)
• All UDFs executable in parallel
• Paradigm: primitives for data-heavy compute
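A conceptual Python sketch of the two UDF kinds; the slide states that real UDFs are Postgres-style C++, so the names and shapes here are purely illustrative:

```python
# Hypothetical sketch of the two UDF kinds named on this slide (real SciDB
# UDFs are Postgres-style C++; this only illustrates the shapes involved).

import math

def log_flux(value):
    """Per-cell UDF: applied independently to every cell, trivially parallel."""
    return math.log10(value) if value > 0 else None

def regrid(array, factor):
    """Per-array UDF: consumes a whole (dense, 2-D) array and produces a new
    one, here by averaging factor x factor blocks of cells."""
    n, m = len(array), len(array[0])
    return [
        [sum(array[i + di][j + dj]
             for di in range(factor) for dj in range(factor)) / factor ** 2
         for j in range(0, m - m % factor, factor)]
        for i in range(0, n - n % factor, factor)
    ]

print(log_flux(100.0))              # 2.0
print(regrid([[1, 2], [3, 4]], 2))  # [[2.5]]
```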
Data Model – Match To Science Needs
• Astronomy
• Earth and environmental sciences, including oceanography, remote sensing, seismology
• Bio-medical imaging
• Fusion
• Bio (need sequences)
• Chemistry (need network structures)
Query Language
• “Parse-tree” representation of operations
• “Bindings” to C++, Python, IDL, ... (TBD)
• Tight integration with popular statistical tools like R or MATLAB
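As a purely hypothetical illustration of what a parse-tree query representation might look like through a Python binding (the slide marks the bindings as TBD):

```python
# Hypothetical illustration of a "parse-tree" query representation:
# operators are nodes composed bottom-up, rather than written as SQL text.

class Op:
    def __init__(self, name, *children, **params):
        self.name, self.children, self.params = name, children, params

    def __repr__(self, depth=0):
        pad = "  " * depth
        head = f"{pad}{self.name} {self.params}"
        return "\n".join([head] + [c.__repr__(depth + 1) for c in self.children])

# filter -> regrid -> aggregate, built as an explicit operator tree
query = Op("aggregate",
           Op("regrid",
              Op("filter", Op("scan", array="sky"), predicate="flux > 20"),
              factor=4),
           func="avg", attr="flux")
print(query)
```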
Storage Model
• Granularity
– "Chunked" arrays
• Chunk = unit of storage, buffering, and compression
– Chunks partitioned across nodes (see the sketch below)
• Parallel model
– Shared-nothing parallel DBMS
• Runs on a grid of computers; uniformity not required
– Data exchanged between nodes as needed
• Format
– Loaded or in-situ modes
• In-situ: limited capabilities
– Adaptors to translate popular external formats (like HDF5) on the fly
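A minimal sketch of chunked, partitioned storage, assuming a hash-based assignment of chunks to nodes (the slides do not specify the partitioning function):

```python
# Hypothetical sketch: splitting a 2-D array into fixed-size chunks and
# assigning each chunk to a node by hashing its chunk coordinates.

CHUNK = 1000  # chunk edge length along each dimension
NODES = 8

def chunk_coords(i, j):
    """Map a cell (i, j) to the coordinates of the chunk containing it."""
    return (i // CHUNK, j // CHUNK)

def owner_node(ci, cj):
    """Pick the node responsible for chunk (ci, cj)."""
    return hash((ci, cj)) % NODES

# Which node holds cell (1234567, 89)?
ci, cj = chunk_coords(1234567, 89)
print(f"chunk {(ci, cj)} lives on node {owner_node(ci, cj)}")
```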
Versioning
• No overwrite storage
• Named versions
• Delta compression
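A toy sketch of how no-overwrite storage, named versions, and delta compression might fit together (hypothetical representation):

```python
# Hypothetical sketch of no-overwrite versioning with delta compression:
# each named version stores only the cells that changed vs. its parent.

class VersionedArray:
    def __init__(self, base):
        self.versions = {"v0": (None, dict(base))}  # name -> (parent, delta)

    def commit(self, name, parent, changes):
        """Record a new named version as a delta; nothing is overwritten."""
        self.versions[name] = (parent, dict(changes))

    def materialize(self, name):
        """Reconstruct a version by replaying deltas from the root."""
        parent, delta = self.versions[name]
        state = self.materialize(parent) if parent else {}
        state.update(delta)
        return state

a = VersionedArray({(0, 0): 1.0, (0, 1): 2.0})
a.commit("v1", "v0", {(0, 1): 2.5})  # only the changed cell is stored
print(a.materialize("v1"))           # {(0, 0): 1.0, (0, 1): 2.5}
```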
Provenance
• Need (illustrated in the sketch below):
– What operations led to the creation of a given element
– What operations used this element
– What data elements were used as input to this operation
– What data elements were created as output from this operation
• Natively supported
– Easy if the workflow runs in SciDB
• Loading external provenance
• Efficient querying
• No-overwrite + delta compression helps
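The four provenance questions above amount to traversals of a graph linking operations to the data elements they read and wrote; a toy sketch (hypothetical representation):

```python
# Hypothetical sketch: provenance as a graph linking operations to the
# data elements they read (inputs) and wrote (outputs).

ops = {
    "op1": {"inputs": ["raw_image"], "outputs": ["calibrated"]},
    "op2": {"inputs": ["calibrated"], "outputs": ["source_catalog"]},
}

def created_by(element):
    """Which operations led to the creation of this element?"""
    return [o for o, io in ops.items() if element in io["outputs"]]

def used_by(element):
    """Which operations used this element as input?"""
    return [o for o, io in ops.items() if element in io["inputs"]]

# The other two questions are direct lookups, e.g. ops["op2"]["inputs"].
print(created_by("source_catalog"))  # ['op2']
print(used_by("calibrated"))         # ['op2']
```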
Uncertainty
• Error bars carried along in the computation
• Initial version
– Interval arithmetic
– Uniform error distribution
• More complex models are usually science-specific
– Might implement some in the future if there are enough commonalities
• Approximate results
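Interval arithmetic, the initial uncertainty model named above, carries a [lo, hi] range through every operation; a minimal sketch:

```python
# Minimal interval-arithmetic sketch: each value is a [lo, hi] range, and
# every operation yields a range guaranteed to contain the true result.

class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

flux = Interval(17.0, 17.4)  # measurement with an error bar
gain = Interval(0.98, 1.02)
print(flux * gain)           # [16.66, 17.748] -- error bars propagate
```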
Resource Management
• Query scheduling
– Including shared scans ("train" scheduling; see the sketch below)
• Query progress
• Support for long-running queries (cancel/stop/restart)
• Pre-execution query cost estimates
• Per user/query limits
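A toy sketch of the shared ("train") scan idea referenced above: one pass over the chunks serves all concurrent queries (hypothetical code):

```python
# Hypothetical sketch of a shared ("train") scan: each chunk is read once
# and every concurrent query consumes it as the single scan passes by.

def bright(chunk):
    return [v for v in chunk if v > 20]

def faint(chunk):
    return [v for v in chunk if v < 5]

def shared_scan(chunks, queries):
    """One I/O pass over storage serves all registered queries."""
    results = {q.__name__: [] for q in queries}
    for chunk in chunks:                # the "train" makes a single pass
        for q in queries:               # queries hop on for every chunk
            results[q.__name__].extend(q(chunk))
    return results

chunks = [[1, 25, 3], [30, 4, 18]]
print(shared_scan(chunks, [bright, faint]))
# {'bright': [25, 30], 'faint': [1, 3, 4]}
```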
Other Features
• High availability / automatic failover
• Auto-configuration and self-healing
Green Computing
• Aggressive compression → less disk
• Approximate results → stop computing early
• Shared scans → shared I/O
• Scale out as you go → incremental provisioning
Science / Industry Needs
• Scale
• Complex analytics
– Time series, needle in a haystack, group-based
• Summary statistics at petascale
• Arrays
• Provenance
• Uncertainty
• Integration with statistical tools
… all needed by industry
Outline
• Needs, challenges, today’s solutions and emerging trends
• SciDB design and planned features
• SciDB structure and timeline
Partnership - Roles
• Science and high-end commercial
– Provide input, including use cases
– Provide some resources
– Review design, test the product
• DBMS brain trust
– Design, oversee construction, perform research
• Non-profit company
– Manage the project
– Support the resulting system
Partnership – Current Players
• Science and high-end commercial
– LSST, PNNL, UCSB, LLNL
– eBay, Vertica, Microsoft
– Lighthouse customers: LSST and eBay
• DBMS brain trust
– Michael Stonebraker, David DeWitt, Dave Maier, Jennifer Widom, Stan Zdonik, Sam Madden, Ugur Cetintemel, Magda Balazinska, Jignesh Patel
• Non-profit company
– SciDB, Inc., a 501(c)(3) foundation
• Plus 5 developers working on the first prototype
Have Use Cases From
• Astronomy (LSST)
• Industry (eBay)
• Genomics (LLNL)
• Climate (PNNL/ARM)
• Seismic (Emory Univ)
• Environmental observation & modeling (Oregon Univ)
• Earth remote sensing (UCSB)
• Fusion (LLNL/NIF)
• WE NEED YOUR USE CASES
Timeline
• Mid June '09
– Professional-looking scidb.org
– Start building the user community
• Late August '09
– Planned demo at VLDB
– Reach out to non-US communities through XLDB3
• End of Q1 '10: alpha
• End of Q4 '10: beta
Manpower
• All work so far in-kind
• 4.5 FTEs working on the demo
– SLAC, MIT, UW, RAS
• Good chance of funds being available this FY to hire ~5 full-time developers
• Actively looking for more partners
– GET INVOLVED
Summary
• Many commonalities within science and between science and industry
• Existing off-the-shelf technologies inefficient for very large scale analytics
• SciDB: a new open source science DBMS
– The community realizes shared software infrastructure is good
– Big lighthouse customers
– Strong team
– If successful, will enable unprecedented analyses at extreme scale
Related Links
• http://scidb.org
• http://www-conf.slac.stanford.edu/xldb07
• http://www-conf.slac.stanford.edu/xldb08
• http://www-conf.slac.stanford.edu/xldb09