big data era challenges and opportunities in …...lsst: big data challenges mining in real time a...

51
Big Data Era Challenges and Opportunities in Astronomy: How SOM/LVQ and Related Learning Methods Can Contribute? Prof. Pablo Estévez, Ph.D. Department of Electrical Engineering, Universidad de Chile & Millennium Institute of Astrophysics, Chile Houston, TX, January 8, 2016 WSOM 2016

Upload: others

Post on 09-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Big Data Era Challenges and

Opportunities in Astronomy: How

SOM/LVQ and Related Learning

Methods Can Contribute?Prof. Pablo Estévez, Ph.D.

Department of Electrical Engineering, Universidad de Chile

& Millennium Institute of Astrophysics, Chile

Houston, TX, January 8, 2016WSOM 2016

Page 2: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Contents

� Astronomy Context

� Large Synoptic Survey Telescope

� Millennium Institute of Astrophysics (MAS)

� Big Data

� SOM/LVQ

� Two Examples

� Conclusions

Page 3: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Mirrors of the largest telescopes

SALT

By 2024 Chile will concentrate 70% of the global observingWe have the responsibility to capitalize on this opportunity

Page 4: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Large Synoptic Survey

Telescope (LSST)

Cerro Pachón, Chile, 2022

Page 5: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

3 x3 degrees field of view

All southern hemisphere in 3 days

During 10 years

LSST will produce a 3D video of the Universe

Cosmic Cinematography: Exploration of time domain

In one year it will collect more data than

all previous telescopes as a whole (15 PB/year)

Real time data management

100,000 transients per night

Page 6: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Challenges for ChileAccess to 100% of the data

But LSST will not do the analysis

Avalanche of data

Huge challenges in computational

intelligence and data analysis

New way of doing science:

Data-driven Science

Page 7: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

LSST: Big Data Challenges

� Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years

� Classify more than 50 billion objects and followup many of these events in real time

� Spectrograph to separate light into frequencyspectrum

� Extracting knowledge in real time for ~2 millionevents per night

� Discovering the unknown unknowns(serendipity): the things that we do not evenknow that we don´t know!

Page 8: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Big Data Four V´s

Page 9: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Astronomy Data Production Comparison

Credits: ALMA, Maccarena Gonzalez:

Page 10: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Big Data Analytics

How did

the

Milky Way

form?

Challenge:

Can Data

Mining

discover

new patterns

and

correlations?

What vs.

Why

Challenge:

Providing

good

visualization

tools for

doing

science

High

Performance

Computing,

GPGPUs

Page 11: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Pragmatic Approach

� Knowing what, not why, is good enough

� This is done finding out valuable correlations(including non-linear relationships)

� Correlation allows us analyzing a phenomenonby identifying a good proxy for it

� The grand challenge is the problem of inference: turning data into knowledge throughmodels

� A data-driven approach is used instead of a hypothesis-driven one

Page 12: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

SOM/LVQ Methods

� What is the Best Classifier in the World?

� Source: Fernandez-Delgado et al, JMLR, 2014

� “Including all the relevant classifiers availabletoday”. Comparison of 179 classifiers on 121 data sets

Ranking Classifier

1 Parallel

Random Forest

2 Random Forest

3 SVM-C

77 LVQ

119 Supervised

SOM

Why the GLVQ*classifiers are not inthe list?

• Implementation in R or Python• Easy interface• Automatic parameter tuning

Page 13: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

SOM Journal Papers

0

20

40

60

80

100

120

385 SOM Papers Published in ISI Journals (2010-2015)

Page 14: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Semi-supervised variable star clustering

New paradigm in the field of ML/CI, is semi-supervised learning:

� (unsupervised) find structures, patterns or clusters by

measuring similarities between samples

� (supervised) Incorporate labels if available, this guides the

unsupervised half (label propagation)

Possibility of detecting something novel (patterns not in the

training set) while still discriminating the known classes.

Example: Clustering 10,000 periodic

variable stars from EROS-2 (Sammon

visualization). Only 10% of the data is

labeled.

Purple: EB, blue: CEPH, yellow: RRL,

green: LPV, red: unknown

Page 15: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Active learning with human in the loop

Active learning: The machine can query the expert for labels. In

practice, the number of labels is much less than in the

supervised case.

Query strategy: (1) Ask labels for the most uncertain samples

(boundaries), (2) minimize expected error, (3) minimize output

variance, etc.

Example: AL query

interface for variable

star classification,

show a pair of

samples and choose

if they belong to the

same class.

Page 16: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Millennium Institute of Astrophysics (MAS)

Started in January 2014

Passion for the exploration of the natural world

Page 17: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Millennium Institute of Astrophysics (MAS)Started in January 2014

Milky Way Astroinformatics, Astrostatistics

Exoplanets, Transients Supernovae

Page 18: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Astronomical Time Series: Light

Curves. “LOS PABLOS” Work

� Light Curve: Stellar brightness (magnitude or flux) versus time.

� Variable stars: stars whose luminosity varies over time (3% of the stars in the universe are variables, and 1% are periodic variable stars)

� Light Curve Analysis: Useful for period detection, event detection, stellar classification, extra solar planet discovery, measuring distance to earth, etc.

Page 19: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

An Example of a Light Curve

Page 20: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Variable stars

Eclipsing binary stars Pulsating star

Page 21: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Folded Light Curves

� The transformation “t modulus T” plots successive

cycles atop one another, where T is the period

� Usually all periods within a range are tried to find the

one that maximizes a criterion (sweep).

Folding a light curve Estimating the period

Page 22: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Automated period detection

� Correntropy (generalized correlation) is used to compute similarities between samples

� Go beyond second order statistics, taking into account higher order moments

� Robustness to outliers and noise

� Spectral decomposition of correntropy using advanced signal processing techniques

� Gaussian basis functions are used instead of sinusoids

� Go beyond Fourier representation to get super-resolution, more localized and sparser spectra

Page 23: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Example

Page 24: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

EROS-2 Survey

� Survey of the Magellanic Clouds and the Galactic bulge

� Data taken from ESO Observatory, in La Silla, Chile

� 38.2 million light curves with two channels each (blue and red).

� EROS dataset processed automatically in 18 hours using GPGPU cluster. We found 120,000 periodic variables.

� Near future (within MAS):

� Be able to process a billion light curves per day

Page 25: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

� DEMO Period_finding_demo_Python

Page 26: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Real-time Transient Detection

Pipeline (PANCHO´S Work)

� Quest to complete the luminosity-time diagram for low luminosities and short cadences

� Discovery of new transient phenomena

� New instruments like DECam and LSST will allow us to detect for example the explosion of a supernova in real time.

� A custom real-time transient pipeline has been developed.

Page 27: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion
Page 28: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

High Cadence Transient Survey

(F. Forster et al.)HiTS scientific objective: Find

evidence of shock breakouts (SBO).

SBO: Event that occurs instants after

the explosion of a supernova.

Supernova: Explosion by the end of

the life cycle of massive stars

Dark Energy Camera (DECam)

1. Formation of neutron star (~secs)2. Shock emergence (~hrs)3. Glowing ejecta (~days/weeks)4. Renmant diffusion (~kyrs)

Figures: nasa.gov

Page 29: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

HiTS Image Reduction Pipeline

● At this point, candidates are dominated by artifacts 1:10K

● ML to find the needles in the haystack

Data Capture Preprocessing Alignment

PSF Matching SubtractionCandidate

Selection

Candidate

Filtering

Visual

Inspection

Page 30: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Non-negative Matrix Factorization

Page 31: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Feature Extraction using NMF

In NMF, we aim to decompose V into factors W and H by

solving

where the non-negative constraints are element-wise.

Nonnegativity: Only additive combinations. Sparse and

part based decompositions

The NMF problem is non-convex in W and H at the same

time (ill-posed). Regularization can alleviate this.

Page 32: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Principal Component Analysis

Lee and Seung, Nature 1999

Page 33: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Non-negative Matrix Factorization

Lee and Seung, Nature 1999

Page 34: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Cosmic rays and noise

Current Reference Difference Current Reference Difference

Page 35: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Stars!

Current Reference Difference Current Reference Difference

Page 36: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

2D Visualization of the astronomical images using Self Organizing Maps (SOM)

● The SOM is an unsupervised neural network using for visualization...

● The database contain ~1000 21x21 images of objects labeled as variable by the CMM pipeline.

● We use Non-negative matrix factorization (NMF) to capture the different behaviors in the stamps and reduce dimensionality (441 → 16)

● We train the SOM with the NMF coefficients and obtain a u-matrix visualization

Page 37: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

The whole picture

Page 38: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Proposed method: Offline training

phase

Slicing &

Scaling

Training

data

NMF

Im1

W1

H1

W2

H2

Ws

Hs

Train RF

modelIm2

Ims ● Training set: 500,000 labeled

HiTS candidates

● NMF parameters:K and Lambda

Nx3x441

Nx441 NxKKx441

Page 39: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Results: Classification Performance

NMF outperforms PCA and the raw model

(a) FPR: 0.1%, FNR: 4% (NMF), 10% (PCA), 15% (raw)

(b) FPR: 0.1%, FNR: 15% (NMF), 30% (PCA), 40% (raw)

NMF is less affected when SNR decreases

SNR>6 5<SNR<6

Page 40: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Traditional Pattern Recognition

Model

Page 41: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Inspired in Inception Movie

PANCHO, WE NEED TO GO

DEEPER

Page 42: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Deep Learning

• For many years shallow neural network

architectures were used

• Classical multilayer perceptron trained by error

backpropagation (gradient descent)

• Problem of vanishing gradients for deeper

arquitectures, number of examples,

computational time

Page 43: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

21

21

5

5

17

20

3

20

6

50

2

2

200 200

3

6

3

17

34

4

Convolutional Neural Nets

Applied to HiTS4

convolutionalpooling

Guillermo Cabrera, Ignacio Reyes, Pablo Estevez, Francisco Förster,

Juan-Carlos Maureira

Page 44: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

ConvNet Applied to HiTSDetection error tradeoff curve

455,393 data-set:- 350,393 training set- 5,000 validation set- 100,000 test set

It takes ~12 mins using Theano over a GPU Tesla K20(Graphical Processor Unit)

Page 45: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Movies!

http://www.das.uchile.cl/~fforster/ATEL/summary_das.html

Page 46: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Movies!

http://www.das.uchile.cl/~fforster/ATEL/summary_das.html

Page 47: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion
Page 48: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion
Page 49: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Movies!

20 Young supernovae were detected in real-time

Plus other 70 supernovae

In 6-nights of observation at DECam

Page 50: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

Conclusions

Big Data is here to stay

Deep learning is a computational

intelligence/machine learning technique that

allows us to extract features automatically

The importance of multidisciplinary teams

TO DO List: Larger sets and more heterogeneous

sets, combine features (NMF, PCA, DL,

engineered features), active learning

Page 51: Big Data Era Challenges and Opportunities in …...LSST: Big Data Challenges Mining in real time a massive data stream of ~2 Terabytes per hour for 10 years Classify more than 50 billion

References

� P. Huijse, P. Estevez, P. Protopapas, JC Principe, P. Zegers, “Computational Intelligence Challenges and Applications on Large-Scale Astronomical Time Series Databases”, IEEE Computational Intelligence Magazine, August 2014.

� P. Protopapas, P. Huijse, P. Estevez, P. Zegers, JC Principe, JB Marquette “A novel, fully automated pipeline for period estimation in the EROS 2 data set”, Astrophysical Journal Supplement Series, 216:25, February 2015.

� P.Huijse, P. Estevez, P. Protopapas, P. Zegers, J. Principe, “An Information Theoretic Algorithm for Finding Periodicities in Stellar Light Curves”, IEEE Transactions on Signal Processing, Vol. 60, n°10, pp. 5135-5145, 2012.

� P.Huijse, P. Estevez, P. Zegers, P. Protopapas, J. Principe, “Period Estimation in Astronomical Time Series using Slotted Correntropy”, IEEE Signal Processing Letters, Vol. 18, n°6, pp. 371-374, 2011.