Extreme Big Data - Applications




http://www.gsic.titech.ac.jp/sc16

Collaborative work with DENSO CORPORATION and DENSO IT LABORATORY, INC.

Ultra-fast Metagenome Analysis

Next Generation Big Data Infrastructure Technologies
Towards Yottabyte / Year

Corpus: large collection of texts

Efficient and compact intra-node data structure based on ternary trees

Distributed implementation with the MPI library

Co-occurrence matrix in Compressed Sparse Row (CSR) format, stored to parallel storage in HDF5

Integration with SLEPc, a high-performance distributed sparse linear algebra library, for dimensionality reduction

Various machine learning algorithms

The whole pipeline orchestrated from Python

For more details, please visit http://vsm.blackbird.pw
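The first two pipeline stages above can be illustrated with a minimal, pure-Python sketch: counting word co-occurrences in a window and flattening the counts into the three CSR arrays (data, indices, indptr). This is only an illustration; the actual pipeline uses ternary trees per node, MPI for distribution, and HDF5 for storage, none of which is reproduced here.

```python
# Illustrative sketch: build a word co-occurrence matrix in CSR form
# from a tokenized corpus. Pure Python for clarity; not the poster's
# ternary-tree / MPI / HDF5 implementation.
from collections import defaultdict

def cooccurrence_csr(tokens, window=2):
    # assign each distinct word an index in order of first appearance
    vocab = {w: i for i, w in enumerate(dict.fromkeys(tokens))}
    counts = defaultdict(int)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                counts[(vocab[w], vocab[tokens[j]])] += 1
    # flatten the (row, col) -> count dict into the three CSR arrays
    n = len(vocab)
    data, indices, indptr = [], [], [0]
    for row in range(n):
        for col in sorted(c for (r, c) in counts if r == row):
            data.append(counts[(row, col)])
            indices.append(col)
        indptr.append(len(data))
    return vocab, data, indices, indptr
```

The CSR triple maps directly onto three HDF5 datasets, which is what makes this layout convenient for parallel storage and for handing the matrix to a sparse solver.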

Fig. 1 Validation Error of ILSVRC2012 Classification Task on Two Platforms
[Fig. 1 axes: top-5 validation error [%] (0-50) vs. epoch (0-600), for a 48-GPU system and a 1-GPU system; the 48-GPU system converges 1.5% worse than 1-GPU training.]
Fig. 2 Target DL App. Architecture

Fig. 3 Measured/Predicted Average Mini-batch Size on TSUBAME-KFC/DL
[Fig. 3 axes: average mini-batch size (0-500) vs. (# node, sub-batch size) from (1,1) to (16,11); measured and predicted series.]
Fig. 4 Predicted Epoch Time of ILSVRC2012 Classification Task on TSUBAME-KFC/DL
[Fig. 4 axes: sub-batch size vs. # nodes; the fastest configurations within a tolerable mini-batch size of 138 ± 25% are marked; other configurations are faster but may degrade accuracy.]

Many studies have shown that Deep Convolutional Neural Networks (DCNNs) achieve high accuracy on image recognition tasks given large training datasets. The optimization technique known as mini-batch Stochastic Gradient Descent (SGD) is widely used for deep learning because it delivers fast training and good recognition accuracy when the mini-batch size is set within an appropriate range. We propose a performance model of a distributed DCNN training system called “SPRINT”, which uses asynchronous GPU processing based on mini-batch SGD, taking into account the average mini-batch size, i.e., the average number of training samples used in a single weight update. Our performance model takes the DCNN architecture and machine specifications as input parameters, and predicts the time to sweep the entire dataset and the average mini-batch size with 8% error on average on a certain supercomputer. Experimental results on two different supercomputers show that our model can consistently choose the fastest machine configuration that nearly meets a target mini-batch size.
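Why the average mini-batch size is a statistic rather than a fixed hyperparameter can be seen with a toy discrete-event simulation. This is not SPRINT's published analytical model; it is a hedged stand-in with hypothetical timings: each of N nodes repeatedly computes a gradient over a fixed sub-batch taking t_grad time units, and a central updater applies, every t_update time units, all gradients that finished since its last update.

```python
# Illustrative simulation (not the SPRINT model): in asynchronous SGD,
# the number of samples behind each weight update depends on how node
# gradient times (t_grad) interleave with the update period (t_update).
import heapq

def average_minibatch_size(n_nodes, sub_batch, t_grad, t_update, n_updates=10_000):
    # min-heap of each node's next gradient-completion time (t_grad > 0)
    ready = [t_grad] * n_nodes
    heapq.heapify(ready)
    now, total_samples = 0, 0
    for _ in range(n_updates):
        now += t_update
        # consume every gradient finished by 'now'; that node immediately
        # starts computing its next gradient
        while ready[0] <= now:
            t = heapq.heappop(ready)
            total_samples += sub_batch
            heapq.heappush(ready, t + t_grad)
    return total_samples / n_updates

# 16 nodes, sub-batch 8: if a gradient takes twice as long as an update,
# on average half the nodes contribute per update -> 16 * 8 / 2 = 64
print(average_minibatch_size(16, 8, t_grad=2, t_update=1))
```

In the steady state this converges to n_nodes * sub_batch * t_update / t_grad, which is why a model that predicts per-step times can also predict the average mini-batch size for a given (# node, sub-batch size) configuration.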

O(n) measurement data
O(m) reference databases
O(m x n) calculation: similarity search
A typical EBD x EBD calculation!

GHOSTZ Algorithm [1][2]

Similarity filtering using triangle inequality among subsequences

GHOSTZ-GPU [3]: GPU Acceleration

GPU/MPI Parallelization
GHOSTZ is ~185-261x faster than BLASTX.
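The triangle-inequality filtering idea can be sketched in a few lines. This is only the general principle, not GHOSTZ's actual implementation: Hamming distance stands in for the real subsequence distance, and the representative-to-database distances, computed in the loop here, would in practice be precomputed offline.

```python
# Hedged sketch of triangle-inequality pruning for similarity search:
# for any metric d and representative r, d(q, s) >= |d(q, r) - d(r, s)|,
# so s can be skipped when that lower bound already exceeds the threshold.
# The pruning is exact: no true hit within the threshold is ever discarded.

def hamming(a, b):
    """Hamming distance between equal-length subsequences (a metric)."""
    return sum(x != y for x, y in zip(a, b))

def filtered_search(query, database, representative, threshold):
    d_qr = hamming(query, representative)
    hits, computed = [], 0
    for s in database:
        # lower bound uses only the (precomputable) distance of s to r
        if abs(d_qr - hamming(s, representative)) > threshold:
            continue  # pruned without touching the query at all
        computed += 1
        if hamming(query, s) <= threshold:
            hits.append(s)
    return hits, computed
```

With one representative the bound is loose; using several representatives (and taking the max of the bounds) prunes more candidates at the cost of more precomputed distances.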

[1] Suzuki S, Kakuta M, Ishida T, Akiyama Y. PLoS ONE, 9(8): e103833, 2014.
[2] Suzuki S, Kakuta M, Ishida T, Akiyama Y. Bioinformatics, 31(8): 1183-1190, 2015.
[3] Suzuki S, Kakuta M, Ishida T, Akiyama Y. PLoS ONE, 11(8): e0157338, 2016.
Acknowledgments. This work was partly supported by the Strategic Programs for Innovative Research (SPIRE) Field 1 Supercomputational Life Science of the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan and Core Research for Evolutional Science and Technology (CREST) “Extreme Big Data” from the Japan Science and Technology Agency (JST).

• 1-million PPIs can be predicted in a day with 3,600 CPU cores + 1,200 K20x GPUs

• Available on CPU cluster, GPU cluster [5], MIC (Xeon Phi) cluster [6]
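As a back-of-envelope check of the throughput claim above (1 million PPI pairs in 24 hours on 3,600 CPU cores + 1,200 K20x GPUs):

```python
# Back-of-envelope arithmetic on the stated MEGADOCK throughput.
pairs, seconds, gpus = 1_000_000, 24 * 3600, 1200
print(f"{pairs / seconds:.1f} pairs/s system-wide")          # ~11.6 pairs/s
print(f"{seconds * gpus / pairs:.1f} GPU-seconds per pair")  # ~103.7 GPU-s
```

So each docking pair has a budget of roughly 100 GPU-seconds, which is the scale an exhaustive all-vs-all PPI screen has to hit.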

[4] Ohue M, Matsuzaki Y, Uchikoga N, Ishida T, Akiyama Y. Protein and Peptide Letters, 21(8): 766-778, 2014.
[5] Ohue M, Shimoda T, Suzuki S, Matsuzaki Y, Ishida T, Akiyama Y. Bioinformatics, 30(22): 3281-3283, 2014.
[6] Shimoda T, Suzuki S, Ohue M, Ishida T, Akiyama Y. BMC Systems Biology, 9(Suppl 1): S6, 2015.
Acknowledgments. This work was partly supported by the Next-generation Integrated Living Matter Simulation (ISLiM) project of the Ministry of Education, Culture, Sports, Science, and Technology (MEXT) Japan; Core Research for Evolutional Science and Technology (CREST) “Extreme Big Data” from the Japan Science and Technology Agency (JST); the HPCI system research project (hp120131); Microsoft Japan Co., Ltd.; and a Leave a Nest research grant.


Exhaustive PPI Predictions

Predicting Statistics of a Distributed DL System

Algorithms and Tools for Computational Linguistics
Process billion-word-scale corpora in-memory, distributed, fast

Bibliography
[1] A. Drozd, A. Gladkova, and S. Matsuoka, “Word embeddings, analogies, and machine learning: beyond king - man + woman = queen,” accepted for COLING 2016.
[2] A. Gladkova, A. Drozd, and S. Matsuoka, “Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn’t,” in Proceedings of the NAACL-HLT SRW, San Diego, CA, June 12-17, 2016, pp. 47–54.
[3] A. Gladkova and A. Drozd, “Intrinsic evaluations of word embeddings: what can we do better?,” in Proceedings of The 1st Workshop on Evaluating Vector Space Representations for NLP, Berlin, Germany, 2016, pp. 36–42.

[4] A. Drozd, A. Gladkova, and S. Matsuoka, “Discovering Aspectual Classes of Russian Verbs in Untagged Large Corpora,” in Proceedings of 2015 IEEE International Conference on Data Science and Data Intensive Systems (DSDIS), 2015, pp. 61–68.
[5] A. Drozd, A. Gladkova, and S. Matsuoka, “Python, Performance, and Natural Language Processing,” in Proceedings of the 5th Workshop on Python for High-Performance and Scientific Computing, New York, NY, USA, 2015, pp. 1:1–1:10.

Acknowledgments. This research was partly supported by JST, CREST (Research Area: Advanced Core Technologies for Big Data Integration).

Acknowledgments. This research was supported by JST, CREST (Research Area: Advanced Core Technologies for Big Data Integration).

Metagenome Analysis

GHOSTZ-MP: Ultra-fast Sequence Homology Search

Oral Metagenome Application

Protein-Protein Interactions (PPIs)

MEGADOCK [4]: Ultra-fast PPI Predictions

Azure-based PPI Prediction Platform

Elucidation of protein-protein interactions (PPIs) is important for understanding disease mechanisms and for drug discovery.

[Fig. 2 components: per-node CPU running an updating thread and GPU threads; GPUs holding the CNN model weights and summed gradients (Σ Grad) over parts of the dataset read from SSD; an interconnect with an MPI all-reduce buffer; the latest weights, momentum, etc.; labeled steps (1)-(3) and (a)-(c).]

image by lanl.gov

Bibliography
[1] Yosuke Oyama, Akihiro Nomura, Ikuro Sato, Hiroki Nishimura, Yukimasa Tamatsu, and Satoshi Matsuoka, “Predicting Statistics of Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers,” in Proceedings of the 2016 IEEE International Conference on Big Data (IEEE BigData 2016), Washington D.C., Dec. 5-8, 2016 (to appear).