
Facilitating Knowledge Discovery in Large Archives of Astronomical Spectra using Distributed Cloud-based Engine

PETR ŠKODA¹, JAKUB KOZA², ANDREJ PALIČKA², JIŘÍ NÁDVORNÍK² AND TOMÁŠ PETERKA²

¹Astronomical Institute of the Czech Academy of Sciences, Ondřejov, Czech Republic

²Faculty of Informatics of the Czech Technical University, Prague, Czech Republic

Abstract

Current spectroscopic surveys containing millions of spectra (such as SDSS or LAMOST) are an excellent source of homogenised, pre-processed data for applications of Astroinformatics.

Advanced statistics and machine learning techniques may help in the identification of new candidates for interesting astronomical objects (e.g. emission line stars, cataclysmic variables, or blazars) based on the specific shapes of their spectral lines, with great potential for discoveries of yet unknown types of objects.

This motivation initiated our development of VO-CLOUD, a distributed cloud-based engine providing the user with a comfortable web-based environment for conducting machine learning experiments on large amounts of spectra, while at the same time allowing visual backtracking of each individual input spectrum through all stages of its processing. This feature is very important for checking the nature of outliers or the precision of classification.

We present the architecture of VO-CLOUD, its capabilities, and the typical workflow used for machine learning on spectra, emphasising also the role of the Virtual Observatory protocols used for remote data access.

1 Big Spectral Surveys

The currently largest collections of millions of spectra come from two projects:

• Sloan Digital Sky Survey (SDSS). Its DR12 contains more than 4 million optical spectra. Two spectrographs were so far fed by 640 fibres placed in pre-drilled holes of a focal plate; recently the new BOSS spectrograph with 1000 fibres has been used, as well as the specialised spectrographs APOGEE and MARVELS.

• LAMOST survey. Its DR2 contains 4.1 million spectra. The 16 LAMOST spectrographs are fed by 4000 fibres positioned by micro-motors. More than 2.2 million stars have estimated parameters.

Fig. 1. The SDSS telescope and its focal plane with fibres in drilled holes

Fig. 2. The LAMOST telescope and its focal plane with fibres moved by micro-motors

The processing of both surveys is done by several automatic pipelines which classify objects by best-matching templates and measure redshift. The global shape of the spectra is the primary source of information in most cases. The individual local features (e.g. individual line profiles) are mostly ignored.

2 Emission line objects

There are many objects in the Universe that may show interesting shapes of some important spectral lines. Especially interesting are objects presenting emission lines, such as Be stars, cataclysmic variables, or quasars, where a gaseous envelope in the shape of a sphere or a disc is expected. Under different physical conditions the emission lines may present a single peak, a double peak with different ratios, or even complicated combined emission and absorption profiles, as shown in Fig. 3.

Fig. 3. Examples of Hα line profiles in Be stars

3 Automatic Classification by Supervised Learning

To find emission line objects in a big survey, an automatic procedure based on the principles of supervised machine learning must be used. It is basically a pattern recognition problem. The shape of a line is described by several parameters (called a feature vector). Then a sample of both positive and negative examples (with labels assigned manually) is selected for training the machine learning classifier. The samples must be randomly mixed, and many-fold cross-validation is applied until the system correctly recognises the maximum number of positive samples in any mixture of input vectors. The resulting classifier is applied to unknown spectra.
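As an illustration only (not the actual VO-CLOUD worker code), the workflow above can be sketched in Python with scikit-learn; the Random Forest matches the RDF module mentioned in Sect. 6.4, while the feature files and their contents are hypothetical assumptions.

```python
# Sketch of the supervised workflow described above, using scikit-learn.
# The feature extraction (line-profile parameters) and file names are
# illustrative assumptions, not the actual VO-CLOUD implementation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle

# X: feature vectors describing the line shape (e.g. peak height,
#    equivalent width, asymmetry); y: manually assigned labels
#    (1 = emission-line object, 0 = negative example).
X = np.loadtxt("features.csv", delimiter=",")           # hypothetical file
y = np.loadtxt("labels.csv", delimiter=",", dtype=int)  # hypothetical file

# Randomly mix the positive and negative samples.
X, y = shuffle(X, y, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42)

# Many-fold cross-validation to check how well positives are recognised.
scores = cross_val_score(clf, X, y, cv=10, scoring="recall")
print("10-fold recall on positive samples:", scores.mean())

# Train on the full labelled sample and apply to unknown spectra.
clf.fit(X, y)
X_unknown = np.loadtxt("unknown_features.csv", delimiter=",")
predictions = clf.predict(X_unknown)
```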

4 Finding Outliers with Unsupervised Learning

While the supervised training described above helps to classify the spectral archive and thus find objects of a given class that have already been identified in a sample and labelled accordingly, unsupervised learning tries to identify similar classes automatically, without human intervention. One very useful method is the Kohonen Self-Organising Map (SOM), which can identify outliers. Thus unknown rare objects with strange features hidden in the spectral archive, or even sources with a yet undiscovered physical mechanism, may be found using SOM.
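As a sketch of the idea (again not the VO-CLOUD SOM worker itself), a SOM-based outlier search could look as follows, here using the third-party MiniSom package; the file name, map dimensions, and outlier threshold are illustrative assumptions.

```python
# Illustrative SOM outlier search with the MiniSom package; this is a
# sketch of the idea, not the VO-CLOUD SOM worker itself.
import numpy as np
from minisom import MiniSom

# Each row is one pre-processed spectrum (same wavelength grid, normalised).
spectra = np.loadtxt("preprocessed_spectra.csv", delimiter=",")  # hypothetical file

som = MiniSom(x=20, y=20, input_len=spectra.shape[1],
              sigma=1.5, learning_rate=0.5, random_seed=42)
som.random_weights_init(spectra)
som.train_random(spectra, num_iteration=10000)

# Distance of each spectrum to its best-matching unit (quantisation error).
weights = som.get_weights()
errors = np.array([np.linalg.norm(s - weights[som.winner(s)]) for s in spectra])

# Spectra that the map represents poorly are outlier candidates.
threshold = errors.mean() + 3 * errors.std()
outliers = np.where(errors > threshold)[0]
print("Outlier candidates (row indices):", outliers)
```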

5 Machine Learning in Mega-Spectra Surveys

Running machine learning experiments on an archive with millions of spectra is a challenging problem. Tuning the parameters of the algorithms used requires multiple runs on the same data, combined with detailed visualisation of individual spectra at various stages of processing. In addition, parallelisation over multiple nodes reduces the execution time of the problem considerably. The user has to work with relatively large data sets multiple times, while the tuning needs only small changes of a few parameters of the machine learning procedures. So the concept of a scientific cloud, controlled even from a smartphone, suits the problem well. This was the main motivation for the development of the VO-CLOUD infrastructure.

6 VO-CLOUD

VO-CLOUD is a distributed cloud-based system implementing the basic principles and concepts of the Virtual Observatory. The system consists of one master server providing a graphical user interface for communication with the user, and several distributed nodes where the computational tasks (workers) are executed.

The whole system is implemented using modern technologies of the Java EE platform, including the Java Persistence API (JPA), Contexts and Dependency Injection (CDI), Enterprise JavaBeans (EJB), and JavaServer Faces (JSF); for the asynchronous communication between the computational workers and the master server, the Universal Worker Service (UWS) protocol is used. Despite the use of Java technology in the distributed system, the computation itself is not limited to it.
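UWS is an IVOA REST protocol for asynchronous jobs; the following minimal sketch shows how a client might create, start, and poll a UWS job over HTTP. The worker endpoint and the job parameters are assumptions made only for illustration.

```python
# Minimal sketch of driving a UWS (Universal Worker Service) job over HTTP.
# The worker base URL and job parameters are illustrative assumptions.
import time
import requests

BASE = "http://worker.example.org/uws/jobs"   # hypothetical worker endpoint

# 1. Create the job; UWS answers with a redirect to the new job resource.
resp = requests.post(BASE, data={"config": "som_experiment.json"},
                     allow_redirects=False)
job_url = resp.headers["Location"]

# 2. Start the job by setting its phase to RUN.
requests.post(job_url + "/phase", data={"PHASE": "RUN"})

# 3. Poll the phase until the job finishes.
while True:
    phase = requests.get(job_url + "/phase").text.strip()
    if phase in ("COMPLETED", "ERROR", "ABORTED"):
        break
    time.sleep(5)

print("Job finished with phase:", phase)
```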

Any technology such as Python, C++, or Fortran can be used for computation on the distributed nodes, and it can even utilise CUDA hardware acceleration. A big advantage is VO-CLOUD's capability to execute computational experiments on data downloaded directly from VO archives using the standardised astronomical protocols SSAP and DataLink. VO-CLOUD consists of several parts, described below:

Fig. 4. Schema of VO-CLOUD deployment

6.1 Data Manager

The Data Manager is a tool for acquiring the spectra for experiments from remote archives, preferably from Virtual Observatory servers. A privileged user may prepare the base of knowledge (BoK) by placing a collection of spectra on the shared data space of VO-CLOUD. As the data volume may be quite large (even terabytes), the setup of a new BoK may be done only by an experienced user who takes the responsibility for occupying the space. The Data Manager allows input spectra to be obtained by several means:

• Direct upload from a local disk.

• Remote HTTP download using a given URL.

• A Virtual Observatory VOTable with a list of spectra. The table may be prepared e.g. in the TOPCAT tool.

• The Simple Spectral Access (SSA) protocol of the Virtual Observatory. In addition to getting the data, the additional VO protocol DataLink may be used to convert spectra on the fly (e.g. normalise to the continuum or cut to a given wavelength range); a minimal query sketch is shown below.
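To make the SSA step concrete, a minimal query sketch using the third-party pyvo package is given below; the service URL, the position, and the search diameter are assumptions, and on-the-fly DataLink processing would be an extra call on services that support it.

```python
# Illustrative SSA (Simple Spectral Access) query using pyvo; the service
# URL and the search position are assumptions for this sketch.
import pyvo
import astropy.units as u
from astropy.coordinates import SkyCoord

ssa_url = "http://example.org/ssap"          # hypothetical SSAP endpoint
service = pyvo.dal.SSAService(ssa_url)

pos = SkyCoord(ra=10.68 * u.deg, dec=41.27 * u.deg)
results = service.search(pos=pos, diameter=0.1 * u.deg)

# Print the access URL of each matched spectrum; these could then be
# downloaded to the shared storage used by VO-CLOUD.
for record in results:
    print(record.getdataurl())
```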

Fig. 5. Data management screen

6.2 Pre-processing

An important part of data preparation before applying machine learning is the pre-processing. The data are read as binary-table FITS files from the shared storage. Then all spectra have to be cut to the same wavelength range and re-binned onto the same grid of wavelength points. In addition, various dimensionality reduction methods, such as PCA or the wavelet transform, may be used. The result of the pre-processing is one big CSV file with all spectra and some metadata. A very important feature is the preview of selected spectra after pre-processing, shown directly from this CSV file. The file can also be uploaded to the shared storage or downloaded from the cloud to a local computer. Many different pre-processing jobs may run in parallel, driven by the UWS job scheduler.
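A minimal sketch of the cut-and-rebin step, assuming FITS binary tables with wavelength and flux columns (the actual column names of the survey products differ from archive to archive and are an assumption here):

```python
# Sketch of the pre-processing step: cut all spectra to a common wavelength
# range and rebin them onto one grid, then write a single CSV file.
# Column names, wavelength range, and file layout are illustrative assumptions.
import glob
import numpy as np
from astropy.io import fits

common_grid = np.linspace(6500.0, 6700.0, 200)   # around H-alpha, in Angstrom
rows = []

for path in glob.glob("shared_storage/*.fits"):   # hypothetical shared storage
    with fits.open(path) as hdul:
        table = hdul[1].data                       # binary table extension
        wave = table["WAVELENGTH"]                 # assumed column name
        flux = table["FLUX"]                       # assumed column name
    # Linear interpolation onto the common grid does the cut and rebin at once.
    rows.append(np.interp(common_grid, wave, flux))

# One big CSV with all spectra; metadata columns could be prepended here.
np.savetxt("preprocessed_spectra.csv", np.vstack(rows), delimiter=",")
```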

6.3 Interactive Job Control System

The user interacts with the master database and the job scheduler through a web environment. He or she can create a new job, run it, modify its parameters, and resend it for execution. It is also possible to suspend already running experiments or cancel them completely. Older experiments may be re-executed and older results reviewed. The results of a job are stored in the user database and may be sent automatically by e-mail. A supervisor may see the jobs of other users; an ordinary user sees only his or her own jobs. The success or failure of a job execution is displayed in different colours, and all the logs are available even in the case of job abortion.

Fig. 6. Jobs status screen

6.4 Machine Learning Modules

Currently the system contains modules for unsupervised training based on SOM and for supervised classification based on Random Decision Forests (RDF). More modules will be added in the future. Every worker must have the required libraries installed and a single script for execution, as well as its code for producing HTML graphics (which is rendered on the client in JavaScript).

6.5 Web-based Visualisation

Results of machine learning experiments are presented visually using web-based graphics. Currently the JavaScript-based dygraphs library is used for visualising spectra and the Highcharts library for other graphics (e.g. a clickable confusion matrix, the SOM map, or the U-Matrix of the SOM map).

6.6 Future Extensions of VO-CLOUD

VO-CLOUD is designed as a modular system. Several machine learning modules are currently being developed. One of the modules to be included in VO-CLOUD in the near future is the vocloud-deeplearning module, aiming to ease the usage of deep neural networks on both numerical sequences and tabular astronomical data. It is based on the convolutional neural network framework Caffe. A novel experiment has been conducted to prove the utility of convolutional networks for spectra classification. The convolutional layer is known from image recognition, and it gives very promising results on one-dimensional numerical series. The module is able to run on both CPU and GPU, reaching up to a 55-fold speedup when using the GPU over the CPU.
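The vocloud-deeplearning module itself is built on Caffe; purely to illustrate how a convolutional layer is applied to a one-dimensional spectrum, a small sketch in PyTorch (a different framework, used here only for illustration) is shown below. The layer sizes and number of classes are arbitrary assumptions.

```python
# Illustrative 1D convolutional classifier for spectra, written in PyTorch
# only to show the idea; the actual vocloud-deeplearning module uses Caffe.
import torch
import torch.nn as nn

class SpectrumCNN(nn.Module):
    def __init__(self, n_bins: int, n_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3),  # convolution over wavelength
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.classifier = nn.Linear(32 * (n_bins // 4), n_classes)

    def forward(self, x):                     # x: (batch, 1, n_bins)
        x = self.features(x)
        return self.classifier(x.flatten(1))

# Example: a batch of 8 spectra rebinned to 200 wavelength points, 2 classes.
model = SpectrumCNN(n_bins=200, n_classes=2)
logits = model(torch.randn(8, 1, 200))
print(logits.shape)        # torch.Size([8, 2])
```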

As the current deployment of workers with different algorithmic payloads on several nodes is somewhat complicated, taking into account the different configurations of individual nodes, we want to simplify the deployment of the different workers and their libraries as individual containers using the Docker system. This will allow us to create identical OS images and deploy them automatically when more nodes become available.

Conclusions

Big spectral archives are a good source of data suitable for machine learning of interesting objects according to their characteristic spectral line shapes. Supervised learning can be used to find objects of a given class, e.g. emission-line stars or quasars, while advanced unsupervised methods like Self-Organising Maps help to identify outliers, possibly even yet unknown objects. The VO-CLOUD engine enables machine learning on millions of spectra thanks to its dedicated design as a distributed system with many parallel workers, as well as detailed visualisation of the results of machine learning experiments. An important role is also played by the protocols of the Virtual Observatory, allowing both powerful queries of multiple databases and on-the-fly pre-processing of spectra.

Acknowledgements

This work was supported by grant 13-08195S of the Czech Science Foundation as well as by the COST Action TD1403 BIG-SKY-EARTH. For this research a number of spectra from the Ondřejov 2m Perek telescope, the public LAMOST DR1 survey, and the Sloan Digital Sky Survey were used.

Poster presented at Astroinformatics 2015, 5–10 October 2015, Dubrovnik, Croatia