SciDB: Open Source Data Management System for Data-Intensive Scientific Analytics
Jacek Becla
San Diego Supercomputer Center
05/29/2009
Outline
• Needs, challenges, today’s solutions and emerging trends
• SciDB design and planned features
• SciDB structure and timeline
Size Challenge
• Data set sizes grow dramatically
• Growth rate increases
• Implications
– Failures are routine
– Provenance tracking is a must
– Massive parallelization is a must
– Full automation and self-adjustment are a must
Analytics Complexity
• More data varieties = more ways to analyze it
• Rapid growth in the complexity of analytics
– Time series comparisons
– N² and N³ correlations
– Proximity and grouping-based searches
• Interactive exploration enables most science
• Data uncertainty matters
• Provenance is an integral part of analytics
• User annotations are important
• Ad-hoc integration of derived data with raw data is desired
• True for science and industry
Today’s Technologies
• Existing databases
– Most are too monolithic
– Expensive to scale
– Expensive to provide high availability
– Built for perfect schemas and clean data
– Relational data model far from ideal for most projects
– APIs far from ideal; intuitive interfaces desired
• Most very large systems shy away from databases
Today’s Solutions
• Metadata in a lightweight database plus bulk data in files
– BaBar, LHC, LCLS
• Bulk data stored as unstructured data in a database
– NIF
• Raw data in files, derived data in a database
– PanSTARRS, LSST (future projects)
• Complete (or mostly) home-grown systems
– AT&T, Google, Yahoo, Amazon, Facebook
– The most common solution
• All in a database
– WalMart (very expensive)
– eBay (very expensive, testing a new home-grown solution)
– SDSS, bio, genomics (small-ish, single-server databases)
• Little reuse; a roll-your-own mentality
Future
• Emerging trends
– Shared-nothing parallel databases
– Lightweight, specialized components
– On low-cost commodity hardware
– Aggressive compression
• Several attempts to push the state of the art forward
– Aster Data, Vertica, ParAccel, EnterpriseDB, Greenplum, Netezza
• Some issues not addressed by anyone
– Arrays, provenance, uncertainty, partial results, intuitive interfaces
XLDB Activities
• 2007
– Identify trends, roadblocks
– Bridge the gaps
• 2008
– Complex analytics
– Bridge the gaps
• 2009
– Reach out to non-US communities
– Connect with remaining sciences
Outline
• Needs, challenges, today’s solutions and emerging trends
• SciDB design and planned features
• SciDB structure and timeline
New Open Source Science Database System
• Philosophy
– Address common scientific needs
– Geared for analytics, not OLTP
• Key requirements
– Open source
– Commercial quality
– Peta-scale
Data Model - Types
• Scalars
– Standard base types (int, float, string, date, …)
– Geospatial (3-D points, lines, polygons, boxes)
• Multi-dimensional arrays
– Regular or irregular
– Any number of dimensions
– Nesting allowed
– Dense or sparse
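To make the type model concrete, here is a minimal sketch in Python (hypothetical names throughout; this is not SciDB syntax, which was still being defined at the time of this talk) of a sparse 2-D array whose cells hold scalar attributes:

```python
# Conceptual sketch of the array data model (hypothetical names, not
# SciDB syntax): a sparse 2-D array whose cells hold scalar attributes.

from dataclasses import dataclass

@dataclass
class Cell:
    flux: float       # standard scalar attribute
    flux_err: float   # per-cell error bar (see the Uncertainty slide)

class SparseArray2D:
    """Sparse array: only non-empty cells are stored, keyed by (i, j)."""
    def __init__(self, shape):
        self.shape = shape             # e.g. (ra_bins, dec_bins)
        self.cells = {}                # {(i, j): Cell}

    def set(self, i, j, cell):
        self.cells[(i, j)] = cell

    def get(self, i, j):
        return self.cells.get((i, j))  # None for an empty cell

sky = SparseArray2D(shape=(3600, 1800))
sky.set(120, 45, Cell(flux=17.2, flux_err=0.3))
```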
Data Model - Operators
• Native (built-in)
– Array-SQL (filter, project, group_by, aggregation, …)
– Array operators (pivot, regridding, reshaping, transformations, nest, flatten, …)
• User-defined functions (UDFs)
– Postgres-style
– Coded in C++
• Native operators coded as UDFs
• All UDFs treated equally
– The optimizer might do more with built-in UDFs
• Two kinds: per cell, per array (see the sketch below)
• All UDFs executable in parallel
• Paradigm: primitives for data-heavy compute
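A conceptual Python sketch of the two UDF kinds; the slide states that real UDFs are Postgres-style C++, so the names and shapes here are purely illustrative:

```python
# Hypothetical sketch of the two UDF kinds named on this slide (real SciDB
# UDFs are Postgres-style C++; this only illustrates the shapes involved).

import math

def log_flux(value):
    """Per-cell UDF: applied independently to every cell, trivially parallel."""
    return math.log10(value) if value > 0 else None

def regrid(array, factor):
    """Per-array UDF: consumes a whole (dense, 2-D) array and produces a new
    one, here by averaging factor x factor blocks of cells."""
    n, m = len(array), len(array[0])
    return [
        [sum(array[i + di][j + dj]
             for di in range(factor) for dj in range(factor)) / factor ** 2
         for j in range(0, m - m % factor, factor)]
        for i in range(0, n - n % factor, factor)
    ]

print(log_flux(100.0))              # 2.0
print(regrid([[1, 2], [3, 4]], 2))  # [[2.5]]
```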
Data Model – Match To Science Needs
• Astronomy
• Earth and environmental sciences, including oceanography, remote sensing, seismology
• Bio-medical imaging
• Fusion
• Bio (need sequences)
• Chemistry (need network structures)
Query Language
• “Parse-tree” representation of operations
• “Bindings” to C++, Python, IDL, ... (TBD)
• Tight integration with popular statistical tools like R or MATLAB
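As a purely hypothetical illustration of what a parse-tree query representation might look like through a Python binding (the slide marks the bindings as TBD):

```python
# Hypothetical illustration of a "parse-tree" query representation:
# operators are nodes composed bottom-up, rather than written as SQL text.

class Op:
    def __init__(self, name, *children, **params):
        self.name, self.children, self.params = name, children, params

    def __repr__(self, depth=0):
        pad = "  " * depth
        head = f"{pad}{self.name} {self.params}"
        return "\n".join([head] + [c.__repr__(depth + 1) for c in self.children])

# filter -> regrid -> aggregate, built as an explicit operator tree
query = Op("aggregate",
           Op("regrid",
              Op("filter", Op("scan", array="sky"), predicate="flux > 20"),
              factor=4),
           func="avg", attr="flux")
print(query)
```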
Storage Model
• Granularity
– "Chunked" arrays
• Chunk = unit of storage, buffering, and compression
– Chunks partitioned across nodes (see the sketch below)
• Parallel model
– Shared-nothing parallel DBMS
• Runs on a grid of computers; uniformity not required
– Data exchanged between nodes as needed
• Format
– Loaded or in-situ modes
• In-situ: limited capabilities
– Adaptors to translate popular external formats (like HDF5) on the fly
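A minimal sketch of chunked, partitioned storage, assuming a hash-based assignment of chunks to nodes (the slides do not specify the partitioning function):

```python
# Hypothetical sketch: splitting a 2-D array into fixed-size chunks and
# assigning each chunk to a node by hashing its chunk coordinates.

CHUNK = 1000  # chunk edge length along each dimension
NODES = 8

def chunk_coords(i, j):
    """Map a cell (i, j) to the coordinates of the chunk containing it."""
    return (i // CHUNK, j // CHUNK)

def owner_node(ci, cj):
    """Pick the node responsible for chunk (ci, cj)."""
    return hash((ci, cj)) % NODES

# Which node holds cell (1234567, 89)?
ci, cj = chunk_coords(1234567, 89)
print(f"chunk {(ci, cj)} lives on node {owner_node(ci, cj)}")
```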
Versioning
• No overwrite storage
• Named versions
• Delta compression
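A toy sketch of how no-overwrite storage, named versions, and delta compression might fit together (hypothetical representation):

```python
# Hypothetical sketch of no-overwrite versioning with delta compression:
# each named version stores only the cells that changed vs. its parent.

class VersionedArray:
    def __init__(self, base):
        self.versions = {"v0": (None, dict(base))}  # name -> (parent, delta)

    def commit(self, name, parent, changes):
        """Record a new named version as a delta; nothing is overwritten."""
        self.versions[name] = (parent, dict(changes))

    def materialize(self, name):
        """Reconstruct a version by replaying deltas from the root."""
        parent, delta = self.versions[name]
        state = self.materialize(parent) if parent else {}
        state.update(delta)
        return state

a = VersionedArray({(0, 0): 1.0, (0, 1): 2.0})
a.commit("v1", "v0", {(0, 1): 2.5})  # only the changed cell is stored
print(a.materialize("v1"))           # {(0, 0): 1.0, (0, 1): 2.5}
```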
Provenance
• Need (illustrated in the sketch below):
– What operations led to the creation of a given element
– What operations used this element
– What data elements were used as input to this operation
– What data elements were created as output from this operation
• Natively supported
– Easy if the workflow runs in SciDB
• Loading external provenance
• Efficient querying
• No-overwrite + delta compression helps
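The four provenance questions above amount to traversals of a graph linking operations to the data elements they read and wrote; a toy sketch (hypothetical representation):

```python
# Hypothetical sketch: provenance as a graph linking operations to the
# data elements they read (inputs) and wrote (outputs).

ops = {
    "op1": {"inputs": ["raw_image"], "outputs": ["calibrated"]},
    "op2": {"inputs": ["calibrated"], "outputs": ["source_catalog"]},
}

def created_by(element):
    """Which operations led to the creation of this element?"""
    return [o for o, io in ops.items() if element in io["outputs"]]

def used_by(element):
    """Which operations used this element as input?"""
    return [o for o, io in ops.items() if element in io["inputs"]]

# The other two questions are direct lookups, e.g. ops["op2"]["inputs"].
print(created_by("source_catalog"))  # ['op2']
print(used_by("calibrated"))         # ['op2']
```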
Uncertainty
• Error bars carried along in the computation
• Initial version
– Interval arithmetic
– Uniform error distribution
• More complex models are usually science-specific
– Might implement some in the future if there are enough commonalities
• Approximate results
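Interval arithmetic, the initial uncertainty model named above, carries a [lo, hi] range through every operation; a minimal sketch:

```python
# Minimal interval-arithmetic sketch: each value is a [lo, hi] range, and
# every operation yields a range guaranteed to contain the true result.

class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

flux = Interval(17.0, 17.4)  # measurement with an error bar
gain = Interval(0.98, 1.02)
print(flux * gain)           # [16.66, 17.748] -- error bars propagate
```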
Resource Management
• Query scheduling
– Including shared scans ("train" scheduling; see the sketch below)
• Query progress
• Support for long-running queries (cancel/stop/restart)
• Pre-execution query cost estimates
• Per user/query limits
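A toy sketch of the shared ("train") scan idea referenced above: one pass over the chunks serves all concurrent queries (hypothetical code):

```python
# Hypothetical sketch of a shared ("train") scan: each chunk is read once
# and every concurrent query consumes it as the single scan passes by.

def bright(chunk):
    return [v for v in chunk if v > 20]

def faint(chunk):
    return [v for v in chunk if v < 5]

def shared_scan(chunks, queries):
    """One I/O pass over storage serves all registered queries."""
    results = {q.__name__: [] for q in queries}
    for chunk in chunks:                # the "train" makes a single pass
        for q in queries:               # queries hop on for every chunk
            results[q.__name__].extend(q(chunk))
    return results

chunks = [[1, 25, 3], [30, 4, 18]]
print(shared_scan(chunks, [bright, faint]))
# {'bright': [25, 30], 'faint': [1, 3, 4]}
```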
Other Features
• High availability / automatic failover
• Auto-configuration and self-healing
Green Computing
• Aggressive compression → less disk
• Approximate results → stop computing early
• Shared scans → shared I/O
• Scale out as you go → incremental provisioning
Science / Industry Needs
• Scale
• Complex analytics
– Time series, needle in a haystack, group-based
• Summary statistics at petascale
• Arrays
• Provenance
• Uncertainty
• Integration with statistical tools
… all needed by industry
Outline
• Needs, challenges, today’s solutions and emerging trends
• SciDB design and planned features
• SciDB structure and timeline
Partnership - Roles
• Science and high-end commercial
– Provide input, including use cases
– Provide some resources
– Review design, test the product
• DBMS brain trust
– Design, oversee construction, perform research
• Non-profit company
– Manage the project
– Support the resulting system
Partnership – Current Players
• Science and high-end commercial
– LSST, PNNL, UCSB, LLNL
– eBay, Vertica, Microsoft
– Lighthouse customers: LSST and eBay
• DBMS brain trust
– Michael Stonebraker, David DeWitt, Dave Maier, Jennifer Widom, Stan Zdonik, Sam Madden, Ugur Cetintemel, Magda Balazinska, Jignesh Patel
• Non-profit company
– SciDB, Inc., a 501(c)(3) foundation
• Plus 5 developers working on the first prototype
Have Use Cases From
• Astronomy (LSST)
• Industry (eBay)
• Genomics (LLNL)
• Climate (PNNL/ARM)
• Seismic (Emory Univ)
• Environmental observation & modeling (Oregon Univ)
• Earth remote sensing (UCSB)
• Fusion (LLNL/NIF)
• WE NEED YOUR USE CASES
Timeline
• Mid June '09
– Professional-looking scidb.org
– Start building the user community
• Late August '09
– Planned demo at VLDB
– Reach out to non-US communities through XLDB3
• End of Q1 '10: alpha
• End of Q4 '10: beta
Manpower
• All work so far in-kind
• 4.5 FTEs working on the demo
– SLAC, MIT, UW, RAS
• Good chance of funds being available this FY to hire ~5 full-time developers
• Actively looking for more partners
– GET INVOLVED
Summary
• Many commonalities within science and between science and industry
• Existing off-the-shelf technologies inefficient for very large scale analytics
• SciDB: a new open source science DBMS
– The community realizes shared software infrastructure is good
– Big lighthouse customers
– Strong team
– If successful, will enable unprecedented analyses at extreme scale
Related Links
• http://scidb.org
• http://www-conf.slac.stanford.edu/xldb07
• http://www-conf.slac.stanford.edu/xldb08
• http://www-conf.slac.stanford.edu/xldb09