scidb @ nersc · scidb, parallel processing without parallel programming everything+in+arrays+ –...

16
Yushu SciDB @ NERSC 1

Upload: others

Post on 14-Jul-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

Yushu

SciDB @ NERSC

-­‐  1  -­‐  

Page 2: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

Array Like Science Data "– More common than you think

-­‐  2  -­‐  

Page 3: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

SciDB, parallel processing without parallel programming

Everything  in  Arrays  –  Locate  an  element  at  O(constant)  

–  Can  be  very  sparse  –  Best  for  machine/simula9on  generated  structure  data  

–  Good  for  metadata  too  •  Query-­‐like  language,  auto-­‐paralleliza:on  

•  Do  Calcula:ons  inside  the  DB  

 -­‐  3  -­‐   Yushu  Yao  

Page 4: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

NERSC SciDB Testbed

•  Partner  up  with  Science  Teams  –  Hold  their  hands  to  load  the  1st  batch  of  data  and  implement  the  1st  major  analysis  opera9on  

•  15+  Science  Projects  •  Complicated  Workflows  and  Algorithms  •  Mul:ple  Science  Domains:  –  Astronomy,  Climate,  Bio-­‐imaging,  Genomic  

•  Mul:ple  Types  of  Data  –  Spectrums,  Images,  Time  Series  

•  Large  amount  of  data  –  Normally  100GB-­‐1TB,  some  has  5+TB  

-­‐  4  -­‐  

Page 5: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

Types of Data Suitable for SciDB

•  Imaging  data:  digital  pictures  from  light  sources  or  telescopes    

•  Time  series  data  collected  from  sensors    •  Spectral  data    •  Graph-­‐like  structures  that  represent  rela:ons  between  en::es  (sparse  matrix)    

-­‐  5  -­‐  

Page 6: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

Examples

-­‐  6  -­‐  

Page 7: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

MetAtlas (LIQUID CHROMATOGRAPHY-MASS SPECTROMETRY)

-­‐  7  -­‐  

Page 8: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

MetAtlas (LIQUID CHROMATOGRAPHY-MASS SPECTROMETRY)

-­‐  8  -­‐  

Page 9: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

Some Primitives

-­‐  9  -­‐  

Aggregate  along  one  dimension   Aggregate  by  re-­‐gridding  

Page 10: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

Benchmark of MetAtlas Workload

-­‐  10  -­‐  

Page 11: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

DustOff Workflow

-­‐  11  -­‐  

Page 12: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

DustOff In SciDB

-­‐  12  -­‐  

Collec9on  of  Spectrums  

Page 13: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

DustOff Scaling

-­‐  13  -­‐  

Page 14: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

Strength/Weaknesses of SciDB

-­‐  14  -­‐  

Analysis  

Management  

Usability  

Sharing  

Good  R/Python  Binding  Good  Build-­‐in  analy9cs  Rela9vely  easy  to  extend  in  C++  

Easy  to  put  behind  a  webpage  Need  manual  access  control  

Good  for  subselec9ng/filtering  Need  to  load  data  in  (duplicate)  

Page 15: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

-­‐  15  -­‐  

Page 16: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

New SciDB Service at NERSC

•  Dedicated  Servers  •  Produc:on  Ready  •  To  Request  SciDB:  –  h\ps://www.nersc.gov/users/science-­‐gateways/science-­‐database-­‐request-­‐form/  

-­‐  16  -­‐