scidb : open source data management system for data-intensive scientific analytics

Open Source Data Management System for Data-Intensive Scientific Analytics Jacek Becla San Diego Supercomputer Center 05/29/2009

Post on 28-Jan-2015

TRANSCRIPT

Page 1: SciDB : Open Source Data Management System for Data-Intensive Scientific Analytics

Open Source Data Management System for Data-Intensive Scientific Analytics

Jacek Becla

San Diego Supercomputer Center

05/29/2009

Page 2

Outline

• Needs, challenges, today’s solutions and emerging trends

• SciDB design and planned features

• SciDB structure and timeline

Page 3

Size Challenge

• Data set sizes grow dramatically

• Growth rate increases

• Implications
  – Failures are routine
  – Provenance tracking is a must
  – Massive parallelization is a must
  – Full automation, self-adjustment is a must

Page 4

Analytics Complexity

• More data varieties = more ways to analyze it

• Rapid growth of complexity of analytics
  – Time series comparisons
  – N² and N³ correlations
  – Proximity and grouping-based searches

• Interactive exploration enables most science

• Data uncertainty matters

• Provenance is an integral part of analytics

• User annotations are important

• Ad-hoc integration of derived data with raw data desired

• True for science and industry

Page 5

Today’s Technologies

• Existing databases
  – Most too monolithic
  – Expensive to scale
  – Expensive to provide high availability
  – Built for perfect schemas and clean data
  – Relational data model far from ideal for most projects
  – APIs far from ideal
    • Intuitive interfaces desired

• Most very large systems shy away from databases

Page 6

Today’s Solutions

• Metadata in lightweight database plus bulk data in files
  – BaBar, LHC, LCLS

• Bulk data stored as unstructured data in database
  – NIF

• Raw data in files, derived data in database
  – PanSTARRS, LSST (future projects)

• Complete (or mostly) home-grown systems
  – AT&T, Google, Yahoo, Amazon, Facebook
  – Most common solution

• All in database
  – WalMart (very expensive)
  – eBay (very expensive, testing new home-grown solution)
  – SDSS, bio, genomics (small-ish, single-server databases)

• Little reuse, roll-your-own mentality

Page 7

Future

• Emerging trends
  – Shared-nothing parallel database
  – Lightweight, specialized components
  – On low-cost commodity hardware
  – Aggressive compression

• Several attempts to push state of the art forward
  – Aster Data, Vertica, ParAccel, EnterpriseDB, Greenplum, Netezza

• Some issues not addressed by anyone
  – Arrays, provenance, uncertainty, partial results, intuitive interfaces

Page 8

XLDB Activities

• 2007

• Identify trends, roadblocks

• Bridge the gaps

• 2008

• Complex analytics

• Bridge the gaps

• 2009

• Reach out to non-US communities

• Connect with remaining sciences

Page 9

Outline

• Needs, challenges, today’s solutions and emerging trends

• SciDB design and planned features

• SciDB structure and timeline

Page 10

New Open Source Science Database System

• Philosophy
  – address common scientific needs
  – geared for analytics, not OLTP

• Key requirements
  – open source
  – commercial quality
  – peta-scale

Page 11

Data Model - Types

• Scalars
  – standard base types (int, float, string, date, …)
  – geospatial (3-D points, lines, polygons, boxes)

• Multi-d arrays
  – regular or irregular
  – any number of dimensions
  – nesting allowed
  – dense or sparse

Page 12

Data Model - Operators

• Native (built-in)
  – array-sql (filter, project, group_by, aggregation, …)
  – array (pivot, regridding, reshaping, transformations, nest, flatten, …)

• User-defined functions
  – Postgres-style
  – coded in C++

• Native operators coded as UDFs

• All UDFs treated equally
  – optimizer might do more with built-in UDFs

• Two kinds: per cell, per array

• All UDFs executable in parallel

• Paradigm: primitives for data-heavy compute
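
The difference between the two kinds of UDF (per cell, per array) can be illustrated with a minimal Python sketch. This is not SciDB's API: the slides state that UDFs are coded in C++ and that language bindings are still TBD. The helper names `apply_per_cell` and `apply_per_array`, and both example functions, are hypothetical and chosen only for illustration.

```python
from typing import Callable, List

Array2D = List[List[float]]


def apply_per_cell(array: Array2D, udf: Callable[[float], float]) -> Array2D:
    """Apply a per-cell UDF: each cell is transformed independently."""
    return [[udf(cell) for cell in row] for row in array]


def apply_per_array(array: Array2D, udf: Callable[[Array2D], float]) -> float:
    """Apply a per-array UDF: the function sees the whole array at once."""
    return udf(array)


def celsius_to_kelvin(value: float) -> float:
    # Hypothetical per-cell UDF.
    return value + 273.15


def array_mean(array: Array2D) -> float:
    # Hypothetical per-array UDF.
    cells = [cell for row in array for cell in row]
    return sum(cells) / len(cells)


temps_c = [[0.0, 10.0], [20.0, 30.0]]
kelvin = apply_per_cell(temps_c, celsius_to_kelvin)
mean_c = apply_per_array(temps_c, array_mean)
print(mean_c)  # 15.0
```

Because a per-cell function never looks at neighbouring cells, it is the easier of the two kinds to run in parallel; a per-array function generally needs a way to combine partial results.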

Page 13

Data Model – Match To Science Needs

• astronomy

• earth and environmental sciences, including oceanography, remote sensing, seismology

• bio-medical imaging

• fusion

• bio (need sequences)

• chemistry (need network structures)

Page 14

Query Language

• “Parse-tree” representation of operations

• “Bindings” to C++, Python, IDL, ... (TBD)

• Tight integration with popular statistical tools like R or MATLAB

Page 15

Storage Model

• Granularity
  – “Chunked” arrays
    • Chunk = unit of storage, buffering and compression
  – Chunks partitioned across nodes

• Parallel model
  – Shared-nothing parallel DBMS
    • runs on a grid of computers, uniformity not required
  – Data exchanged between nodes as needed

• Format
  – Loaded or in-situ modes
    • in-situ: limited capabilities
  – Adaptors to translate external popular formats (like HDF5) on the fly
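
Chunking and partitioning across nodes can be sketched in a few lines of Python. The slides do not say how SciDB assigns chunks to nodes, so the round-robin placement below is an assumption made purely for illustration, and `chunk_of` and `node_of` are hypothetical helper names.

```python
from typing import Tuple

Coords = Tuple[int, ...]


def chunk_of(cell: Coords, chunk_shape: Coords) -> Coords:
    """Return the coordinates of the chunk that holds the given cell."""
    return tuple(c // s for c, s in zip(cell, chunk_shape))


def node_of(chunk: Coords, chunks_per_dim: Coords, num_nodes: int) -> int:
    """Assign a chunk to a node (assumed policy: row-major number, round-robin)."""
    linear = 0
    for index, extent in zip(chunk, chunks_per_dim):
        linear = linear * extent + index
    return linear % num_nodes


# A 1000 x 1000 array split into 250 x 250 chunks gives a 4 x 4 grid of chunks.
chunk = chunk_of((600, 260), (250, 250))
node = node_of(chunk, (4, 4), num_nodes=4)
print(chunk, node)  # (2, 1) 1
```

Any cell-level operation first resolves the chunk, then the node that owns it; only operations that span chunks on different nodes require data to be exchanged.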

Page 16

Versioning

• No overwrite storage

• Named versions

• Delta compression
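
How the three points above fit together can be shown with a small Python sketch: versions are never overwritten, each has a name, and a new version stores only the cells that differ from its parent. The `VersionedArray` class and its methods are hypothetical, not part of SciDB, and the sketch ignores cell deletion for brevity.

```python
from typing import Dict, Optional, Tuple

Cell = Tuple[int, ...]


class VersionedArray:
    """Sparse array with named, immutable versions stored as deltas."""

    def __init__(self) -> None:
        # version name -> (parent name or None, cells changed relative to parent)
        self._versions: Dict[str, Tuple[Optional[str], Dict[Cell, float]]] = {}

    def commit(self, name: str, cells: Dict[Cell, float],
               parent: Optional[str] = None) -> int:
        """Store a new version; return how many cells were actually written."""
        if name in self._versions:
            raise ValueError(f"version {name!r} already exists (no overwrite)")
        if parent is None:
            delta = dict(cells)
        else:
            old = self.read(parent)
            # Keep only cells whose value differs from the parent version.
            delta = {k: v for k, v in cells.items() if old.get(k) != v}
        self._versions[name] = (parent, delta)
        return len(delta)

    def read(self, name: str) -> Dict[Cell, float]:
        """Reconstruct a version by replaying deltas from the root."""
        parent, delta = self._versions[name]
        cells = {} if parent is None else self.read(parent)
        cells.update(delta)
        return cells


store = VersionedArray()
store.commit("v1", {(0, 0): 1.0, (0, 1): 2.0})
written = store.commit("v2", {(0, 0): 1.0, (0, 1): 5.0}, parent="v1")
print(written)  # 1
```

Reading `v1` after committing `v2` still returns the original values, which is the property the provenance slide relies on.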

Page 17

Provenance

• Need:
  – what operations led to creation of given element
  – what operations used this element
  – what data elements were used as input to this operation
  – what data elements were created as output from this operation

• Natively supported
  – easy if workflow in SciDB

• Loading external provenance

• Efficient querying

• No-overwrite + delta compression helps
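
The four provenance questions listed under "Need" map onto a simple log of operations with their inputs and outputs. The Python sketch below is one possible in-memory representation, assuming each element is created by at most one operation; the `ProvenanceLog` class, its method names, and the element names are hypothetical and say nothing about how SciDB stores provenance.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class ProvenanceLog:
    """Records which operation consumed and produced which data elements."""

    def __init__(self) -> None:
        self._inputs: Dict[str, Set[str]] = {}
        self._outputs: Dict[str, Set[str]] = {}
        self._created_by: Dict[str, str] = {}
        self._used_by: Dict[str, Set[str]] = defaultdict(set)

    def record(self, op: str, inputs: Iterable[str], outputs: Iterable[str]) -> None:
        self._inputs[op] = set(inputs)
        self._outputs[op] = set(outputs)
        for element in self._inputs[op]:
            self._used_by[element].add(op)
        for element in self._outputs[op]:
            self._created_by[element] = op

    def operations_leading_to(self, element: str) -> Set[str]:
        """All operations that contributed, directly or indirectly."""
        ops: Set[str] = set()
        pending = [element]
        while pending:
            op = self._created_by.get(pending.pop())
            if op is not None and op not in ops:
                ops.add(op)
                pending.extend(self._inputs[op])
        return ops

    def operations_using(self, element: str) -> Set[str]:
        return set(self._used_by.get(element, set()))

    def inputs_of(self, op: str) -> Set[str]:
        return set(self._inputs[op])

    def outputs_of(self, op: str) -> Set[str]:
        return set(self._outputs[op])


log = ProvenanceLog()
log.record("calibrate", inputs=["raw"], outputs=["calibrated"])
log.record("coadd", inputs=["calibrated"], outputs=["coadd_image"])
print(sorted(log.operations_leading_to("coadd_image")))  # ['calibrate', 'coadd']
```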

Page 18

Uncertainty

• Error bars carried along in the computation

• Initial version
  – interval arithmetic
  – uniform error distribution

• More complex models usually science-specific
  – might consider implementing some in the future if enough commonalities

• Approximate results
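
Interval arithmetic, named above as the initial uncertainty model, carries a lower and an upper bound through every operation. The Python sketch below shows the standard rules for addition, subtraction and multiplication. The `Interval` class and `from_error` constructor are hypothetical names, and the sketch ignores outward rounding of floating-point bounds, which a rigorous implementation would need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __post_init__(self) -> None:
        if self.lo > self.hi:
            raise ValueError("lower bound exceeds upper bound")

    @classmethod
    def from_error(cls, value: float, error: float) -> "Interval":
        """Build an interval from a measurement and a symmetric error bar."""
        return cls(value - abs(error), value + abs(error))

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other: "Interval") -> "Interval":
        # Smallest result: our low minus their high, and vice versa.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other: "Interval") -> "Interval":
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))


flux = Interval.from_error(10.0, 1.0)        # [9.0, 11.0]
background = Interval.from_error(5.0, 0.5)   # [4.5, 5.5]
net = flux - background
print(net)  # Interval(lo=3.5, hi=6.5)
```

Note that the error bar on `net` (±1.5) is wider than either input, which is the expected behaviour: interval arithmetic gives guaranteed but conservative bounds.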

Page 19

Resource Management

• Query scheduling
  – including shared scans (train scheduling)

• Query progress

• Support for long-running queries (cancel/stop/restart)

• Pre-execution query cost estimates

• Per user/query limits
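
The idea behind shared scans can be sketched as a single pass over the stored chunks that feeds several queries at once, so each chunk is read only once however many queries need it. The Python below is a simplified illustration, not SciDB's scheduler: `shared_scan` is a hypothetical function, and each query is modeled as an initial state plus a step function applied per chunk.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Chunk = List[float]
Query = Tuple[Any, Callable[[Any, Chunk], Any]]  # (initial state, step function)


def shared_scan(chunks: Iterable[Chunk],
                queries: Dict[str, Query]) -> Tuple[Dict[str, Any], int]:
    """Run all queries in one pass; return their results and the chunk reads."""
    state = {name: initial for name, (initial, _) in queries.items()}
    chunks_read = 0
    for chunk in chunks:
        chunks_read += 1  # one read, shared by every query
        for name, (_, step) in queries.items():
            state[name] = step(state[name], chunk)
    return state, chunks_read


stored_chunks = [[1.0, 2.0], [3.0, 4.0]]
results, reads = shared_scan(
    stored_chunks,
    {
        "total": (0.0, lambda acc, chunk: acc + sum(chunk)),
        "maximum": (float("-inf"), lambda acc, chunk: max(acc, max(chunk))),
    },
)
print(results, reads)  # {'total': 10.0, 'maximum': 4.0} 2
```

Running the two queries separately would read every chunk twice; sharing the scan halves the I/O in this example.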

Page 20

Other Features

• High availability / automatic fail over

• Auto config and auto self-healing

Page 21

Green Computing

• Aggressive compression → less disk

• Approximate results → stop computing early

• Shared scans → share I/O

• Scale out as you go → incremental provisioning

Page 22

Science / Industry Needs

• Scale

• Complex analytics
  – time series, needle in haystack, group based

• Summary statistics @petascale

• Arrays

• Provenance

• Uncertainty

• Integration with statistical tools

… all needed by industry

Page 23

Outline

• Needs, challenges, today’s solutions and emerging trends

• SciDB design and planned features

• SciDB structure and timeline

Page 24

Partnership - Roles

• Science and high-end commercial
  – provide input, including use cases
  – provide some resources
  – review design, test the product

• DBMS brain trust
  – design, oversee construction, perform research

• Non-profit company
  – manage the project
  – support resulting system

Page 25

Partnership – Current Players

• Science and high-end commercial
  – LSST, PNNL, UCSB, LLNL
  – eBay, Vertica, Microsoft
  – lighthouse customers: LSST and eBay

• DBMS brain trust
  – Michael Stonebraker, David DeWitt, Dave Maier, Jennifer Widom, Stan Zdonik, Sam Madden, Ugur Cetintemel, Magda Balazinska, Jignesh Patel

• Non-profit company
  – SciDB, Inc. - 501(c)(3) foundation

• Plus… 5 developers working on 1st prototype

Page 26

Have Use Cases From

• astronomy (LSST)

• industry (eBay)

• genomics (LLNL)

• climate (PNNL/ARM)

• seismic (Emory Univ)

• environmental observation & modeling (Oregon Univ)

• earth remote sensing (UCSB)

• fusion (LLNL/NIF)

• WE NEED YOUR USE CASES

Page 27

Timeline

• Mid June ’09
  – professional-looking scidb.org
  – start building user community

• Late August ’09
  – planned demo at VLDB
  – reach out to non-US communities through XLDB3

• End of Q1’10 – alpha

• End of Q4’10 – beta

Page 28

Manpower

• All work so far in-kind

• 4.5 FTEs working on demo
  – SLAC, MIT, UW, RAS

• Good chances to have funds available this FY to hire ~5 full time developers

• Actively looking for more partners
  – GET INVOLVED

Page 29

Summary

• Many commonalities within science and between science and industry

• Existing off-the-shelf technologies inefficient for very large scale analytics

• SciDB – new open source science DBMS
  – Community realizes shared software infrastructure is good
  – Big lighthouse customers
  – Strong team
  – If successful, will enable unprecedented analyses at extreme scale

Page 30

Related Links

• http://scidb.org

• http://www-conf.slac.stanford.edu/xldb07

• http://www-conf.slac.stanford.edu/xldb08

• http://www-conf.slac.stanford.edu/xldb09