scidb @ nersc · scidb, parallel processing without parallel programming everything+in+arrays+ –...

Yushu SciDB @ NERSC 1

Upload: others

Post on 14-Jul-2020

6 views

Category:

Documents

0 download

Report

Download

Embed Size (px):

TRANSCRIPT

Yushu

SciDB @ NERSC

-‐ 1 -‐

Array Like Science Data "– More common than you think

-‐ 2 -‐

SciDB, parallel processing without parallel programming

Everything in Arrays –  Locate an element at O(constant)

–  Can be very sparse –  Best for machine/simula9on generated structure data

–  Good for metadata too •  Query-‐like language, auto-‐paralleliza:on

•  Do Calcula:ons inside the DB

-‐ 3 -‐ Yushu Yao

Page 4: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

NERSC SciDB Testbed

•  Partner up with Science Teams –  Hold their hands to load the 1st batch of data and implement the 1st major analysis opera9on

•  15+ Science Projects •  Complicated Workflows and Algorithms •  Mul:ple Science Domains: –  Astronomy, Climate, Bio-‐imaging, Genomic

•  Mul:ple Types of Data –  Spectrums, Images, Time Series

•  Large amount of data –  Normally 100GB-‐1TB, some has 5+TB

-‐ 4 -‐

Page 5: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

Types of Data Suitable for SciDB

•  Imaging data: digital pictures from light sources or telescopes

•  Time series data collected from sensors •  Spectral data •  Graph-‐like structures that represent rela:ons between en::es (sparse matrix)

-‐ 5 -‐

Page 6: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

Examples

-‐ 6 -‐

Page 7: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

MetAtlas (LIQUID CHROMATOGRAPHY-MASS SPECTROMETRY)

-‐ 7 -‐

Page 8: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

MetAtlas (LIQUID CHROMATOGRAPHY-MASS SPECTROMETRY)

-‐ 8 -‐

Page 9: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

Some Primitives

-‐ 9 -‐

Aggregate along one dimension Aggregate by re-‐gridding

Page 10: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

Benchmark of MetAtlas Workload

-‐ 10 -‐

Page 11: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

DustOff Workflow

-‐ 11 -‐

Page 12: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

DustOff In SciDB

-‐ 12 -‐

Collec9on of Spectrums

Page 13: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

DustOff Scaling

-‐ 13 -‐

Page 14: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

Strength/Weaknesses of SciDB

-‐ 14 -‐

Analysis

Management

Usability

Sharing

Good R/Python Binding Good Build-‐in analy9cs Rela9vely easy to extend in C++

Easy to put behind a webpage Need manual access control

Good for subselec9ng/filtering Need to load data in (duplicate)

Page 15: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

-‐ 15 -‐

Page 16: SciDB @ NERSC · SciDB, parallel processing without parallel programming Everything+in+Arrays+ – Locate"an"elementat O(constant) – Canbeverysparse – Bestfor"machine/simulaon"

New SciDB Service at NERSC

•  Dedicated Servers •  Produc:on Ready •  To Request SciDB: –  h\ps://www.nersc.gov/users/science-‐gateways/science-‐database-‐request-‐form/

-‐ 16 -‐

Parallel Architecture, Parallel ... - Georgetown University

Parallel Processing and Multiprocessors Why Parallel ...ee565/slides/ch4.pdf · Parallel Processing and Multiprocessors why parallel processing? types of parallel processors ... Flynn

AN OPEN -ACCESS HIGH PERFORMANCE COMPUTING SYSTEM … · • SciDB • Integration • R • Interface • R/Shiny • SciDB Capabilities • CROSS_JOIN: Combine two arrays, aligning

The Scidb Package - An R Interface to SciDB_Lewis_2013_Slides

Large-Scale Linear Algebra with Rillposed.net/boston_r_meetup_2012.pdf · SciDB and R-SciDB is an open-source, array-oriented, parallel, distributed database.-It knows about matrix

Welcome to the first Workshop on Big data Open Source ... · • Apache Flink • Apache Reef • Apache Singa • Apache Spark • Padres • rasdaman • SciDB. Massively Parallel

· SciDB allows to search through 100s ofGB of RAW data and find images features inside it SciDB Testbed at NERSC - Share/Analyze Terabytes of Science Data Easy-to-use, fast, interactive

Parallel Database Systems - NUS Computingtankl/cs5225/2008/parallel...1 CS5225 Parallel DB 1 Parallel Database Systems CS5225 Parallel DB 2 Parallel DBMS • Uniprocessor technology

Notes on Geographic Information Systems, DBMS Technology, and SciDB Dr. Paul G. Brown Paradigm4 / SciDB A Tale of Dirt-Bags (Earth Scientists) and Propeller-Heads

David Malon*, Peter van Gemmeren, Jack Weinstein Argonne National Laboratory [email protected] ACAT 2011 5-9 September 2011 An Exploration of SciDB in the

Aspects of practical parallel programming Parallel programming models Data parallel

Introduction to parallel computers and parallel …...Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming – p

SCIDB BASED FRAMEWORK FOR STORAGE AND ANALYSIS OF …€¦ · SciDB array is created by specifying its dimensions and attributes of the array. For example a 3-dimensional SciDB array

Implementing Connected Component Labeling as a User ... · SciDB SciDB is an all-in-one data management and advanced analytics platform that features: • Complex analytics inside

SFM TG2 0214 zinio Page 15 - STRONG Fitness Magazine · smash your versonal bestfor dead lifts, but when was the lasttimeyou focused on the techniquesvou learned when you first started

Package ‘scidb’ · SciDB attributes representing each column of the data.frame. The functions as.scidband df2scidb are equivalent in this use case. The SciDB array row and column

Parallel Processing & Distributed Systemshungnq/.../ParallelProcessing... · Parallel Processing Terminology Parallel processing Parallel computer – Multi-processor computer capable

SciDB A DBMS for Analytic Applications by Michael Stonebraker · Building SciDB e-Bay (partial FTE) LSST (2 FTE working on project) Persistence Software (committed 3 FTE) M.I.T (1

UK - BestFor · UK. Bistrot. 2 3 Why There is a thing that time cannot change: our aspiration to improve ourselves. ... Delicatessens and shops specialized in roasted foods Catering

SubZero: A Fine-Grained Lineage System for Scientiﬁc Databasessirrice.github.io/files/papers/subzero-icde13.pdf · SciDB native operator while the red dotted rectangles are UDFs

SciDB Tutorial Technical Overview, and Best Practices

Expect the Bestfor Your Child - ERICby Margaret Mahy. Collect ads from newspapers or magazines. Look for words and figures of speech (for example, New! Exclusive!) that are meant to

BioCompute & SciDB a pipeline-in-a-database · SciDB & BioCompute Objects SciDB • Data loaders enforce type & field constraints • Arrays are versioned and time-stamped • Database

Quick Install Guide - Foxtel€¦ · Power Point Power Pack Cable Grey Connection Cable Yellow Tipped Ethernet Cable Laptop iQ3 2 WiFi Modem . 2.4GHz WiFi Name (SSID) bestfor range

SciDB : Open Source Data Management System for Data-Intensive Scientific Analytics

Overview of SciDB: Large Scale Array Storage, Processing ...pavlo/courses/fall2013/static/slides/scidb.pdf · Overview of SciDB: Large Scale Array Storage, Processing and Analysis

UK - BestFor · of ovens for delicatessens that offers specific cooking treatments for red and white meats, fish, shellfish, oven baked first courses and vegetables, baked goods,

SCALABLE EARTH OBSERVATION ANALYTICS WITH SCIDB · SciDB: A database management system for applications with complex analytics. Computing in Science & Engineering, 15(3), 54-62. 9

Massively Scalable Computational Finance with SciDB

Data Visualization in Python - Biotec · Workflow III: construct basic plot • In our case, we want to plot the X againstthe Y values • dots looks bestfor this application •

Addressing the Big-Earth-Data Variety Challenge with the ...vis.unl.edu/~yu/homepage/publications/paper/2016... · SciDB, as mentioned above, also supports tightly-coupled scientific

Efﬁcient Iterative Processing in the SciDB Parallel Array ... · optimization is applicable, it has been shown to signi cantly improve performance in relational and graph systems

Time Travel in a Scientific Array Database - SciDB - University of

SciDB An Open Source Data Base Project by Michael Stonebraker (and others)

Scalable Analysis of Multitemporal Images Using an Array ... · SciDB is an open source multidimensional array based database and supports distributed storage, parallel processing,