VO Course 10: Big data challenges in astronomy
DESCRIPTION
How future astronomy projects will generate enormous amounts of data, and what that means for astronomical data processing. Part of the Virtual Observatory course by Juan de Dios Santander Vela, as taught for the MTAF (Métodos y Técnicas Avanzadas en Física, Advanced Methods and Techniques in Physics) Master at the University of Granada (UGR).
TRANSCRIPT
![Page 1: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/1.jpg)
Astronomy’s Big Data Challenges
Juan de Dios Santander Vela (IAA-CSIC)
![Page 2: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/2.jpg)
Overview
What is, exactly, big data?
What are the dimensions of big data?
What are the big data drivers in astronomy?
How can we deal with big data?
VO tools for dealing with big data
![Page 3: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/3.jpg)
What exactly is Big Data?
Data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.
WIKIPEDIA: “BIG DATA”
![Page 4: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/4.jpg)
What exactly is Big Data?
Big Data is data with at least one Big dimension
Bandwidth
Number of individual assets
Size of individual assets
Response speed
…
![Page 5: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/5.jpg)
Big Data
Size
Storage
Access techniques
Processing techniques
Flow
Real time
Event Processing
Offline
Data mining
Processing level
Raw Data
Processed Data
Statistics
Schemata
Structured
Tagging
Unstructured
Value
Files
Formats
Durability
Parallel Access
Capabilities
Information Extracted
Tech Debt
![Page 6: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/6.jpg)
Next big data projects in astronomy
![Page 7: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/7.jpg)
Large Synoptic Survey Telescope
![Page 8: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/8.jpg)
The Large Synoptic Survey Telescope Camera
Steven M. Kahn
Stanford/SLAC
(for the LSST Consortium)
![Page 9: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/9.jpg)
LSST Data Rates
* 2.3 billion pixels read out in less than 2 sec, every 12 sec
* 1 pixel = 2 Bytes (raw)
* Over 3 GBytes/sec peak raw data from camera
* Real-time processing and transient detection: < 10 sec
* Dynamic range: 4 Bytes / pixel
* > 0.6 GB/sec average in pipeline
* 5000 floating point operations per pixel
* 2 TFlop/s average, 9 TFlop/s peak
* ~ 18 TBytes/night
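The figures above can be cross-checked with some simple arithmetic. The sketch below uses only the numbers on the slide, plus one labelled assumption: the number of observing hours per night, which the slide does not give. The naive estimates land within a small factor of the quoted rates rather than matching them exactly, presumably because the quoted figures include overheads that this sketch does not model.

```python
# Back-of-the-envelope check of the LSST figures quoted on the slide.
PIXELS = 2.3e9                 # pixels per exposure (slide)
RAW_BYTES_PER_PIXEL = 2        # raw sample size (slide)
PIPELINE_BYTES_PER_PIXEL = 4   # dynamic range inside the pipeline (slide)
READOUT_S = 2.0                # upper bound on readout time (slide: "less than 2 sec")
CADENCE_S = 12.0               # one exposure every 12 s (slide)
FLOP_PER_PIXEL = 5000          # floating point operations per pixel (slide)

GB = 1e9
TB = 1e12

# Size of one raw exposure, and the rate needed to read it out in READOUT_S.
# A shorter readout gives a higher peak rate, so this is a lower bound.
raw_exposure_gb = PIXELS * RAW_BYTES_PER_PIXEL / GB
readout_rate_gb_s = raw_exposure_gb / READOUT_S

# Sustained pipeline rate and compute load if one exposure arrives per cadence.
pipeline_rate_gb_s = PIXELS * PIPELINE_BYTES_PER_PIXEL / CADENCE_S / GB
compute_tflop_s = PIXELS * FLOP_PER_PIXEL / CADENCE_S / TB


def nightly_raw_tb(hours_observing):
    """Raw volume per night; hours_observing is an assumption, not a slide value."""
    exposures = hours_observing * 3600 / CADENCE_S
    return exposures * raw_exposure_gb * GB / TB


print(f"raw exposure:   {raw_exposure_gb:.1f} GB")
print(f"readout rate:   {readout_rate_gb_s:.1f} GB/s (lower bound)")
print(f"pipeline rate:  {pipeline_rate_gb_s:.2f} GB/s")
print(f"compute load:   {compute_tflop_s:.2f} TFlop/s")
print(f"10 h night:     {nightly_raw_tb(10):.1f} TB raw (assumed 10 h)")
```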
![Page 10: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/10.jpg)
Relative Survey Power
![Page 11: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/11.jpg)
Square Kilometre Array
![Page 12: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/12.jpg)
Signal Transport & Processing
![Page 13: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/13.jpg)
DESIGN COUNTS!
Signal Transport & Processing
![Page 14: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/14.jpg)
Massive Data Flow, Storage & Processing
18 PB/YEAR
Antenna & Front End Systems
Correlation
Data Product Generation
Long Term Storage
High Availability Storage / DB
On-Demand Processing
STORAGE?
CAN’T STORE IT!
1 DAY STREAM = 150 DAYS OF GLOBAL INTERNET TRAFFIC
800 PB
Temporary Storage
![Page 15: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/15.jpg)
Massive Data Flow, Storage & Processing
PROCESSING NEEDS
10^9 TOP RANGE PCS > 1 EXAFLOP/S
30 PETAFLOP/S
Antenna & Front End Systems
Correlation
Data Product Generation
Long Term Storage
High Availability Storage / DB
On-Demand Processing
Temporary Storage
![Page 16: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/16.jpg)
Massive Data Flow, Storage & Processing
7 PB/S
> 300 GB/S
BANDWIDTH
TYPICAL SURVEY: 5 DAYS READ TIME @ 10 GB/SEC
Antenna & Front End Systems
Correlation
Data Product Generation
Long Term Storage
High Availability Storage / DB
On-Demand Processing
Temporary Storage
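To make the bandwidth figure concrete, the following sketch converts between data volume and read time. It assumes that "10 GB/sec" means gigabytes (not gigabits) per second, and it reuses the 18 PB/year figure from the earlier storage slide. The function names are illustrative only.

```python
# Relating data volume and read time at a fixed, sustained read rate.
GB = 1e9
PB = 1e15
SECONDS_PER_DAY = 86_400


def volume_read_pb(days, rate_bytes_per_s):
    """Volume (in PB) read in `days` at a sustained rate."""
    return days * SECONDS_PER_DAY * rate_bytes_per_s / PB


def read_time_days(volume_bytes, rate_bytes_per_s):
    """Days needed to read `volume_bytes` at a sustained rate."""
    return volume_bytes / rate_bytes_per_s / SECONDS_PER_DAY


# A survey that takes 5 days to read at 10 GB/s (figures from the slide).
survey_pb = volume_read_pb(5, 10 * GB)

# Time to read one year of archived products (18 PB) at the same rate.
archive_days = read_time_days(18 * PB, 10 * GB)

print(f"typical survey size: {survey_pb:.2f} PB")
print(f"reading 18 PB:       {archive_days:.1f} days")
```

Under these assumptions a "typical survey" is a few petabytes, so simply reading it once takes days even on a very fast link.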
![Page 17: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/17.jpg)
[Bar chart: bandwidth in TB/s, ALMA vs. LOFAR; axis from 0 to 40]
MASSIVE DATA FLOW, STORAGE & PROCESSING
Antenna & Front End Systems
Correlation
![Page 18: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/18.jpg)
[Bar chart: bandwidth in TB/s, ALMA vs. LOFAR; axis from 0 to 40]
[Bar chart: bandwidth in TB/s, LOFAR vs. ASKAP; axis from 0 to 70]
MASSIVE DATA FLOW, STORAGE & PROCESSING
Antenna & Front End Systems
Correlation
![Page 19: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/19.jpg)
[Bar chart: bandwidth in TB/s, ALMA vs. LOFAR; axis from 0 to 40]
[Bar chart: bandwidth in TB/s, LOFAR vs. ASKAP; axis from 0 to 70]
MASSIVE DATA FLOW, STORAGE & PROCESSING
Antenna & Front End Systems
Correlation
![Page 20: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/20.jpg)
[Bar chart: processing in TFlop/s, VLA vs. ALMA; axis from 0 to 0.002]
MASSIVE DATA FLOW, STORAGE & PROCESSING
Correlation
![Page 21: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/21.jpg)
[Bar chart: processing in TFlop/s, VLA vs. ALMA; axis from 0 to 0.002]
[Bar chart: processing in TFlop/s, ALMA vs. LOFAR; axis from 0 to 120]
MASSIVE DATA FLOW, STORAGE & PROCESSING
Correlation
![Page 22: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/22.jpg)
[Bar chart: processing in TFlop/s, VLA vs. ALMA; axis from 0 to 0.002]
[Bar chart: processing in TFlop/s, ALMA vs. LOFAR; axis from 0 to 120]
[Bar chart: processing in TFlop/s, LOFAR vs. ASKAP; axis from 0 to 350]
MASSIVE DATA FLOW, STORAGE & PROCESSING
Correlation
![Page 23: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/23.jpg)
[Bar chart: processing in TFlop/s, VLA vs. ALMA; axis from 0 to 0.002]
[Bar chart: processing in TFlop/s, ALMA vs. LOFAR; axis from 0 to 120]
[Bar chart: processing in TFlop/s, LOFAR vs. ASKAP; axis from 0 to 350]
MASSIVE DATA FLOW, STORAGE & PROCESSING
Correlation
![Page 24: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/24.jpg)
Comparison: LHC
![Page 25: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/25.jpg)
CERN/IT/DB
online system
multi-level trigger
filter out background
reduce data volume from 40 TB/s to 100 MB/s
40 MHz (40 TB/sec)
level 1 - special hardware
75 kHz (75 GB/sec)
level 2 - embedded processors
5 kHz (5 GB/sec)
level 3 - PCs
100 Hz (100 MB/sec)
data recording & offline analysis
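The reduction achieved by each trigger level follows directly from the rates quoted on the slide. The sketch below assumes that the four rates are the data rates at the boundaries between stages (detector output, then after each trigger level). That is a reading of the flattened diagram rather than something the slide states explicitly.

```python
# Data rates (bytes/s) and event rates (Hz), copied from the slide.
# The stage labels are an interpretation of the diagram.
STAGES = [
    ("detector output", 40e12, 40e6),
    ("after level 1", 75e9, 75e3),
    ("after level 2", 5e9, 5e3),
    ("after level 3", 100e6, 100.0),
]


def reduction_factors(stages):
    """Ratio of incoming to outgoing data rate for each trigger level."""
    return [
        (name_out, rate_in / rate_out)
        for (_, rate_in, _), (name_out, rate_out, _) in zip(stages, stages[1:])
    ]


factors = reduction_factors(STAGES)
total_reduction = STAGES[0][1] / STAGES[-1][1]

# Bytes per event at each stage: data rate divided by event rate.
event_sizes = [rate / hz for _, rate, hz in STAGES]

for name, factor in factors:
    print(f"{name}: rate reduced by a factor of {factor:.0f}")
print(f"overall reduction: {total_reduction:.0f}x")
print(f"event size at each stage (bytes): {event_sizes}")
```

Under that reading the event size stays constant at every stage, so the whole reduction comes from discarding events rather than from shrinking them.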
![Page 26: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/26.jpg)
CERN/IT/DB
Event Filter & Reconstruction
(figures are for one experiment)
switch
data from detector - event builder
high speed network
computer farm
tape and disk servers
raw data
summary data
input: 5-100 GB/sec
capacity: 50K SI95 (~4K 1999 PCs)
recording rate: 100 MB/sec (Alice – 1 GB/sec)
+ 1-1.25 PetaByte/year
+ 1-500 TB/year
20,000 Redwood cartridges every year (+ copy)
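A quick consistency check on these figures, as a sketch only. It assumes that the 1 PB/year lower bound is the raw data recorded at 100 MB/s and that the 20,000 cartridges hold a single copy of it. Neither assumption is stated on the slide.

```python
# Consistency check on the recording figures, using only slide values.
MB = 1e6
GB = 1e9
PB = 1e15
SECONDS_PER_DAY = 86_400

RECORDING_RATE = 100 * MB      # bytes/s (slide)
YEARLY_VOLUME = 1 * PB         # lower end of the 1-1.25 PB/year range (slide)
CARTRIDGES_PER_YEAR = 20_000   # slide

# Recording time per year implied by the rate and the yearly volume.
implied_live_seconds = YEARLY_VOLUME / RECORDING_RATE
implied_live_days = implied_live_seconds / SECONDS_PER_DAY

# Capacity per cartridge implied by volume and cartridge count.
implied_cartridge_gb = YEARLY_VOLUME / CARTRIDGES_PER_YEAR / GB

print(f"implied recording time: {implied_live_days:.0f} days/year")
print(f"implied cartridge size: {implied_cartridge_gb:.0f} GB")
```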
![Page 27: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/27.jpg)
Dealing with Big Data
![Page 28: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/28.jpg)
Dealing with Big Data
We cannot allow for arbitrary queries
We can have arbitrary processing instead
We cannot allow full data dumps
We can generate data on the fly (see above)
![Page 29: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/29.jpg)
Queries as functions
QUERY = FUNCTION { DATA }
QUERIES NEED TO BE PRECOMPUTED
ARBITRARY QUERIES ONLY POSSIBLE ON THE PRECOMPUTED, SMALLER DATA SETS
![Page 30: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/30.jpg)
Queries as functions
QUERY = FUNCTION { ALL DATA }
QUERIES NEED TO BE PRECOMPUTED
ARBITRARY QUERIES ONLY POSSIBLE ON THE PRECOMPUTED, SMALLER DATA SETS
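The idea of a query as a function of all the data, and of precomputing it into a smaller view, can be illustrated with a toy example. Everything in this sketch (the record layout, the source identifiers, the function names) is hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical master dataset: one (source_id, band) record per detection.
ALL_DATA = [
    ("src-1", "g"),
    ("src-2", "r"),
    ("src-1", "r"),
    ("src-3", "g"),
    ("src-1", "g"),
]


def query_full_scan(all_data, source_id):
    """query = function(all data): scans every record on every call."""
    return sum(1 for sid, _ in all_data if sid == source_id)


def precompute_view(all_data):
    """Batch step: build a small view once so that later queries are lookups."""
    return Counter(sid for sid, _ in all_data)


def query_view(view, source_id):
    """Query against the precomputed view: no scan of the master dataset."""
    return view.get(source_id, 0)


VIEW = precompute_view(ALL_DATA)
print(query_full_scan(ALL_DATA, "src-1"), query_view(VIEW, "src-1"))
```

Both calls return the same answer, but the second never touches the full dataset. The price is that only questions the view was built for can be answered this way.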
![Page 31: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/31.jpg)
Lambda Architecture
Batch Layer
STORE MASTER DATASET
COMPUTE ARBITRARY VIEWS
Serving Layer
RANDOM ACCESS TO VIEWS
UPDATED BY BATCH LAYER
Speed Layer
FAST, INCREMENTAL ALGOS.
QUERIES NOT ON BATCH L.
COMPENSATES FOR LATENCY
![Page 32: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/32.jpg)
Batch Layer
Stores master copy of the dataset
Precomputes batch views on that master dataset
IMMUTABLE, CONSTANTLY GROWING
![Page 33: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/33.jpg)
Batch Layer
All Data Batch Layer
View 1
View 2
View n
…
NEW DATA
UPDATED VIEWS
TYPICALLY, MAP/REDUCE
![Page 34: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/34.jpg)
Serving Layer
Allows for:
batch writes of view updates
random reads on the views
Does not allow random writes
![Page 35: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/35.jpg)
Speed Layer
Allows for:
incremental writes of view updates
short-term temporal queries on the views
Can be discarded!
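How the three layers cooperate can be sketched in a few lines. This is a single-process toy under strong simplifying assumptions (in-memory lists and counters, no concurrency, a count-per-key view), not a description of any particular Lambda Architecture implementation. The class and method names are made up for the example.

```python
from collections import Counter


class LambdaSketch:
    """Toy model of the batch, serving, and speed layers for a count-per-key view."""

    def __init__(self):
        self.master = []              # batch layer: append-only master dataset
        self.batch_view = Counter()   # serving layer: replaced wholesale by batch runs
        self.speed_view = Counter()   # speed layer: incremental and disposable

    def ingest(self, key):
        """New data is appended to the master dataset and counted incrementally."""
        self.master.append(key)       # records are never updated in place
        self.speed_view[key] += 1     # covers data the batch view has not seen yet

    def run_batch(self):
        """Recompute the view from all data, then discard the speed layer."""
        self.batch_view = Counter(self.master)   # batch write of the whole view
        self.speed_view = Counter()              # safe: batch view now covers it

    def query(self, key):
        """Random read: merge the batch view with the recent increments."""
        return self.batch_view[key] + self.speed_view[key]


system = LambdaSketch()
for key in ["a", "a", "b"]:
    system.ingest(key)
before = system.query("a")   # answered by the speed layer alone
system.run_batch()
system.ingest("a")
after = system.query("a")    # batch view plus one fresh increment
print(before, after)
```

The speed layer only ever holds what arrived since the last batch run, which is why it can be thrown away after each run without losing information.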
![Page 36: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/36.jpg)
Figure 2.1 The master dataset in the Lambda Architecture serves as the source of truth of your Big Data system. Errors at the serving and speed layers can be corrected, but corruption at the master dataset is irreparable.
The master dataset is the only part of the Lambda Architecture that absolutely must be safeguarded from corruption. Overloaded machines, failing disks, and power outages all could cause errors, and human error with dynamic data systems is an intrinsic risk and inevitable eventuality. You must carefully engineer the master dataset to prevent corruption in all these cases, as fault tolerance is essential to the health of a long running data system.
There are two components to the master dataset: the data model to use, and how to physically store it. This chapter is about designing a data model for the master dataset and the properties such a data model should have. You will learn about physically storing a master dataset in the next chapter.
To provide a roadmap for your undertaking, you will
* learn the key properties of data
* see how these properties are maintained in the fact-based model
* examine the advantages of the fact-based model for the master dataset
©Manning Publications Co. Please post comments or corrections to the Author Online forum: http://www.manning-sandbox.com/forum.jspa?forumID=787
![Page 37: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/37.jpg)
Computing over Big Data
Batch layer as a computational engine on data
Need to formally specify
Inputs
Processes
Outputs
THAT LOOKS LIKE A WORKFLOW!
OR SQL QUERYING
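One way to picture a formal specification of inputs, processes, and outputs is a small declarative workflow whose steps are ordered by their data dependencies. The sketch below is an illustration only. The `Step` class and `run_workflow` function are hypothetical names, and it relies on `graphlib`, which is in the standard library from Python 3.9 onwards.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from math import sqrt
from typing import Callable


@dataclass(frozen=True)
class Step:
    inputs: tuple       # names of the datasets this step consumes
    output: str         # name of the dataset this step produces
    process: Callable   # function applied to the inputs, in order


def run_workflow(steps, datasets):
    """Run steps in dependency order; `datasets` holds the initial inputs."""
    producers = {step.output: step for step in steps}
    # Map each produced dataset to the produced datasets it depends on.
    graph = {
        step.output: {name for name in step.inputs if name in producers}
        for step in steps
    }
    results = dict(datasets)
    for name in TopologicalSorter(graph).static_order():
        step = producers[name]
        results[name] = step.process(*(results[i] for i in step.inputs))
    return results


# Steps are declared out of order on purpose: the specification, not the
# listing order, determines the execution order.
steps = [
    Step(("res2",), "stdev", lambda r: sqrt(sum(r) / len(r))),
    Step(("values", "mean"), "res2", lambda v, m: [(x - m) ** 2 for x in v]),
    Step(("values",), "mean", lambda v: sum(v) / len(v)),
]
results = run_workflow(steps, {"values": [1.0, 2.0, 3.0, 4.0]})
print(results["mean"], results["stdev"])
```

Because each step names its inputs and output explicitly, an engine could in principle schedule independent steps in parallel or move them next to the data.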
![Page 38: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/38.jpg)
Map/Reduce
![Page 39: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/39.jpg)
Map/Reduce

    from functools import reduce  # reduce is not a builtin in Python 3
    from math import sqrt
    from random import normalvariate

    def res2(x):
        # Squared residual of x with respect to the global mean_v
        return pow(mean_v - x, 2.)

    # Random vector, mean 1, stdev 0.001
    v = [normalvariate(1, 0.001) for x in range(0, 1000000)]
    mean_v = reduce(lambda x, y: x + y, v) / len(v)
    res2_v = map(res2, v)
    stdev = sqrt(reduce(lambda x, y: x + y, res2_v) / len(v))
    print(mean_v, stdev)

PARALLELISABLE!
![Page 40: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/40.jpg)
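Map/Reduce

    from functools import partial, reduce
    from math import sqrt
    from multiprocessing import Pool
    from random import normalvariate

    def res2(mean, x):
        # Squared residual of x with respect to the mean; the mean is passed
        # explicitly so that worker processes do not depend on a global
        return pow(mean - x, 2.)

    if __name__ == "__main__":
        # Random vector, mean 1, stdev 0.001
        v = [normalvariate(1, 0.001) for x in range(0, 1000000)]
        mean_v = reduce(lambda x, y: x + y, v) / len(v)
        # Only the map step is distributed over the 4 worker processes
        pool = Pool(processes=4)
        res2_v = pool.map(partial(res2, mean_v), v)
        pool.close()
        stdev = sqrt(reduce(lambda x, y: x + y, res2_v) / len(v))
        print(mean_v, stdev)

ONLY FOR MAP, BUT REDUCE ALSO PARALLELISABLE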
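The reduce step can be parallelised whenever the combining operation is associative: the input is split into chunks, each chunk is reduced independently, and the partial results are reduced once more. The sketch below shows that structure. It uses a thread pool purely so that the example runs anywhere, and for CPU-bound work a process pool would be the natural substitute. The helper names `chunked` and `parallel_reduce` are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial, reduce
from operator import add


def chunked(values, n_chunks):
    """Split a non-empty list into at most n_chunks contiguous slices."""
    size = -(-len(values) // n_chunks)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_reduce(op, values, n_chunks, executor):
    """Reduce each chunk independently, then reduce the partial results.

    Only valid when `op` is associative (e.g. addition), because the
    grouping of operands differs from a plain left-to-right reduce.
    """
    partials = list(executor.map(partial(reduce, op), chunked(values, n_chunks)))
    return reduce(op, partials)


with ThreadPoolExecutor(max_workers=4) as executor:
    total = parallel_reduce(add, list(range(1, 101)), 4, executor)
print(total)
```

With floating-point data the regrouping can change the result in the last digits, since floating-point addition is only approximately associative.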
![Page 41: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/41.jpg)
[Plot: Dependence of execution time on the number of pool processors. X axis: number of pool processors (1 to 8). Y axis: seconds per million elements (0.4 to 0.8). Series: 1 million, 5 million, 10 million, and 20 million elements]
![Page 42: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/42.jpg)
Conclusions
Big data needs different approaches
Parallelism & data-side processing
Map/Reduce as parallelism engine
Need for ways to formally specify computations
![Page 43: VO Course 10: Big data challenges in astronomy](https://reader035.vdocuments.us/reader035/viewer/2022081404/5579ad35d8b42ac1148b4eeb/html5/thumbnails/43.jpg)
References & Links
“The Fourth Paradigm: Data-Intensive Scientific Discovery”, Jim Gray, Microsoft Research
“MapReduce: Simplified Data Processing on Large Clusters”, Jeffrey Dean and Sanjay Ghemawat, Google
MyExperiment