Big Data and Advanced Data Intensive Computing

Jongwook Woo (PhD), High-Performance Information Computing Center (HiPIC), Computer Information Systems Department, California State University, Los Angeles; Cloudera Academic Partner and Grants Awardee of Amazon AWS. Presented at Yonsei University, Shin-Chon, Korea, June 18th, 2014.

DESCRIPTION

MapReduce does not work well for real-time processing or iterative algorithms, which are common in machine learning and graph analysis. This slide deck presents Spark, Giraph, and Hadoop use cases in science rather than in business.

TRANSCRIPT

Page 1: Big Data and Advanced Data Intensive Computing

jwoo Woo

HiPIC

CSULA

Big Data and Advanced Data Intensive Computing

Yonsei University, Shin-Chon, Korea

June 18th 2014

Jongwook Woo (PhD)

High-Performance Information Computing Center (HiPIC)

Cloudera Academic Partner and Grants Awardee of Amazon AWS

Computer Information Systems Department

California State University, Los Angeles

Page 2: Big Data and Advanced Data Intensive Computing

High Performance Information Computing Center, Jongwook Woo

CSULA

Contents

Introduction
Emerging Big Data Technology
Big Data Use Cases
Hadoop 2.0
Training in Big Data

Page 3: Big Data and Advanced Data Intensive Computing


Me

Name: Jongwook Woo (우종욱)
Occupation:

Professor (rank: Associate Professor), California State University, Los Angeles – the capital city of entertainment

Career: Professor since 2002: Computer Information Systems Dept, College of Business and Economics
– www.calstatela.edu/faculty/jwoo5

Consulting for many companies, in Hollywood and elsewhere, since 1998
– mainly building eBusiness applications with J2EE middleware
– information extraction and integration with the FAST, Lucene/Solr, and Sphinx search engines
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.

Interested in Hadoop and Big Data since around 2009

Page 4: Big Data and Advanced Data Intensive Computing


Experience in Big Data

Grants
Received Microsoft Windows Azure Educator Grant (Oct 2013 - July 2014)
Received Amazon AWS in Education Research Grant (July 2012 - July 2014)
Received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011)

Partnership
Academic Education Partnership with Cloudera since June 2012
Linked with Hortonworks since May 2013
– Positive about providing a partnership

Page 5: Big Data and Advanced Data Intensive Computing


Experience in Big Data

Certificates
Certified Cloudera Instructor
Certified Cloudera Hadoop Developer / Administrator
Certificate of Achievement in the Big Data University Training Course, “Hadoop Fundamentals I”, July 8, 2012
Certificate of 10gen Training Course, “M101: MongoDB Development”, Dec 24, 2012

Blog and GitHub for Hadoop and its ecosystems
http://dal-cloudcomputing.blogspot.com/
– Hadoop, AWS, Cloudera
https://github.com/hipic
– Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming, RHadoop
https://github.com/dalgual

Page 6: Big Data and Advanced Data Intensive Computing


Experience in Big Data

Several publications regarding Hadoop and NoSQL:

Deeksha Lakshmi, Iksuk Kim, Jongwook Woo, “Analysis of MovieLens Data Set using Hive”, Journal of Science and Technology, ARPN, Vol. 3, No. 12, pp. 1194-1198, Dec 2013

Chul Sung, Jongwook Woo, Matthew Goodman, Todd Huffman, and Yoonsuck Choe, “Scalable, Incremental Learning with MapReduce Parallelization for Cell Detection in High-Resolution 3D Microscopy Data”, Proceedings of the International Joint Conference on Neural Networks, 2013

Jongwook Woo, “Apriori-Map/Reduce Algorithm”, PDPTA 2012, Las Vegas, July 16-19, 2012

Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, “Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop”, EDB 2012, Incheon, Aug. 25-27, 2011

Jongwook Woo and Yuhang Xu, “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, PDPTA 2011, Las Vegas, July 18-21, 2011

Collaboration with universities and companies:
USC, Texas A&M, Cloudera, Amazon, Microsoft

Page 7: Big Data and Advanced Data Intensive Computing


What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing

Cloudera, HortonWorks

AWS

NoSQL DB

Page 8: Big Data and Advanced Data Intensive Computing


Data

Google: “We don’t have a better algorithm than others, but we have more data than others.”

Page 9: Big Data and Advanced Data Intensive Computing


New Data Trend

Sparsity
Unstructured
Schema-free data with sparse attributes
– Semantic or social relations

No relational properties
– No complex join queries
• e.g., log data

Immutable
No need to update or delete data

Page 10: Big Data and Advanced Data Intensive Computing


Data Issues

Large-scale data
Terabyte (10^12), Petabyte (10^15)
– Because of the web
– Sensor data, bioinformatics, social computing, smartphones, online games…

Cannot be handled with the legacy approach
– Too big
– Un-/semi-structured data
– Too expensive

Need new systems
– Inexpensive

Page 11: Big Data and Advanced Data Intensive Computing


Two Cores in Big Data

How to store Big Data
How to compute Big Data

Google
How to store Big Data
– GFS
– On inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with multiple inexpensive computers
• Owns its own supercomputers

Page 12: Big Data and Advanced Data Intensive Computing


Hadoop 1.0

Hadoop
Doug Cutting
– Creator of Hadoop
– Founder of the Apache Lucene, Nutch, Avro, and Hadoop projects
– Board member of the Apache Software Foundation
– Chief Architect at Cloudera

MapReduce
HDFS
Restricted parallel programming
– Not for iterative algorithms
– Not for graph algorithms

Page 13: Big Data and Advanced Data Intensive Computing


Emerging Big Data Technology

Giraph
Spark and Shark
Use Cases
Use Cases Experienced

Page 14: Big Data and Advanced Data Intensive Computing


Spark and Shark

High-speed in-memory analytics over Hadoop and Hive data
http://www.slideshare.net/Hadoop_Summit/spark-and-shark

Fast data sharing
– Iterative graph algorithms
• Data mining (classification/clustering)
– Interactive queries

Page 15: Big Data and Advanced Data Intensive Computing


Giraph

BSP
Facebook
http://www.slideshare.net/aladagemre/a-talk-on-apache-giraph

Page 16: Big Data and Advanced Data Intensive Computing


Josh Wills (Cloudera): “I have found that many kinds of scientists – such as astronomers, geneticists, and geophysicists – are working with very large data sets in order to build models that do not involve statistics or machine learning, and that these scientists encounter data challenges that would be familiar to data scientists at Facebook, Twitter, and LinkedIn.”

“Data science is a set of techniques used by many scientists to solve problems across a wide array of scientific fields.”

Page 17: Big Data and Advanced Data Intensive Computing


Use Cases experienced

Log Analysis
Log files from IPS and IDS
– 1.5 GB per day for each system
Extracting unusual cases using Hadoop, Solr, and Flume on Cloudera

Customer Behavior Analysis
Market Basket Analysis Algorithm

Machine Learning for Image Processing with Texas A&M
Hadoop Streaming API

Movie Data Analysis
Hive, Impala

Page 18: Big Data and Advanced Data Intensive Computing


Scalable, Incremental Learning with MapReduce Parallelization for Cell Detection in High-Resolution 3D Microscopy Data (IJCNN 2013)

Chul Sung, Yoonsuck Choe

Brain Networks Laboratory

Computer Science and Engineering

TAMU

Jongwook Woo

Computer Information Systems, CALSTATE-LA

Matthew Goodman, Todd Huffman

3SCAN

Page 19: Big Data and Advanced Data Intensive Computing


Motivation

Analysis of neuronal distribution in the brain plays an important role in the diagnosis of disorders of the brain.

E.g., Purkinje cell reduction in autism [3]
A. Normal cerebellum

B. Reduction of neurons in the Purkinje cell layer

Normal human brain

Autistic human brain

Page 20: Big Data and Advanced Data Intensive Computing


Approach

Use a machine learning approach to detect neurons.

Learn a binary classifier:

Input: local volume data
Output: cell center (1) or off-center (0)

Page 21: Big Data and Advanced Data Intensive Computing


Requirement: Effective Incremental Learning

Several properties are desired:
– Low computational cost
– Non-iterative
– No accumulation of data points
– No retraining
– Yet, sufficient accuracy

Page 22: Big Data and Advanced Data Intensive Computing


Proposed Algorithm

Principal Components Analysis (PCA)-based supervised learning

– No need for retraining
– Highly scalable, since only the eigenvector matrices are stored
– Highly parallelizable due to its incremental nature
• We keep the eigenvectors as new training samples become available and additionally use them in the testing process.

Page 23: Big Data and Advanced Data Intensive Computing


MapReduce Parallelization

A highly effective and popular framework for big data analytics
Parallel data processing tasks
– Map phase: tasks are divided and results are emitted
– Reduce phase: the emitted results are sorted and consolidated

Apache Hadoop
– Open-source project of the Apache Foundation
– Storage: Hadoop Distributed File System (HDFS)
– Processing: Map/Reduce (fault-tolerant distributed processing)

Slide from Dr. Jongwook Woo’s SWRC 2013 Presentation
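The two phases described above can be sketched in a few lines of plain Python (a toy illustration of the paradigm, not Hadoop itself): map emits key/value pairs, a sort stands in for the shuffle, and reduce consolidates the values per key.

```python
from itertools import groupby
from operator import itemgetter

# Toy word count in the MapReduce style described on this slide.

def map_phase(lines):
    # Map: divide the work line by line, emit (word, 1) pairs.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Hadoop sorts/shuffles between phases; sorted() stands in for that.
    # Reduce: consolidate the emitted values for each key.
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (key, sum(v for _, v in group))

counts = dict(reduce_phase(map_phase(["to be or not to be"])))
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In a real Hadoop job the map and reduce functions run on different machines over HDFS blocks; the control flow, however, is exactly this.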

Page 24: Big Data and Advanced Data Intensive Computing


Hadoop Streaming

Hadoop MapReduce for non-Java code: Python, Ruby

Requirements
– A running Hadoop cluster
– The Hadoop Streaming API
• hadoop-streaming.jar
– Mapper and reducer code
• A simple conversion from sequential code: STDIN > mapper > reducer > STDOUT

Page 25: Big Data and Advanced Data Intensive Computing


Hadoop Streaming

MapReduce Python execution
http://wiki.apache.org/hadoop/HadoopStreaming

Syntax
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar [options]
Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file

Example

$ bin/hadoop jar contrib/streaming/hadoop-streaming.jar \

-file /home/jwoo/mapper.py -mapper /home/jwoo/mapper.py \

-file /home/jwoo/reducer.py -reducer /home/jwoo/reducer.py \

-input /user/jwoo/shakespeare/* -output /user/jwoo/shakespeare-output
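A minimal word-count mapper.py / reducer.py pair for a command like the one above might look like the following sketch (illustrative code, not the actual files from the talk; the two functions are shown in one file for brevity). Hadoop Streaming runs each script over STDIN/STDOUT, with the framework sorting keys between the two steps.

```python
#!/usr/bin/env python
import sys

def mapper(stdin, stdout):
    # mapper.py: emit "word<TAB>1" for every word on every input line.
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word.lower()}\t1\n")

def reducer(stdin, stdout):
    # reducer.py: input arrives sorted by key, so equal words are adjacent.
    current, count = None, 0
    for line in stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                stdout.write(f"{current}\t{count}\n")
            current, count = word, 0
        count += int(n)
    if current is not None:
        stdout.write(f"{current}\t{count}\n")

if __name__ == "__main__":
    # Select the role when deployed as two separate scripts.
    (mapper if "map" in sys.argv[1:2] else reducer)(sys.stdin, sys.stdout)
```

Such scripts can be tested locally, without a cluster, via the Unix pipeline the slide mentions: `cat input.txt | ./mapper.py map | sort | ./reducer.py`.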

Page 26: Big Data and Advanced Data Intensive Computing


Training

PCA is run separately on these class-specific subsets, resulting in class-specific eigenvector matrices.

[Figure: the Class 1 and Class 2 subsets of the training set are each run through PCA, producing Eigenvectors 1 and Eigenvectors 2; an input vector is shown as its XY, XZ, and YZ views.]

Page 27: Big Data and Advanced Data Intensive Computing


Testing

Each data vector x is projected using the two class-specific PCA eigenvector matrices.

The class associated with the more accurate reconstruction determines the label for the new data vector: if ‖x − x̃1‖ < ‖x − x̃2‖, the input is labeled Class 1; otherwise, Class 2.

[Figure: a novel input x is projected onto Eigenvectors 1 and Eigenvectors 2, reconstructed from each, and the two reconstruction errors are compared.]
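The reconstruction-error comparison above can be sketched with NumPy (an illustrative sketch of PCA reconstruction-error classification, not the paper's implementation; the function names and synthetic data are ours):

```python
import numpy as np

def fit_class_eigenvectors(X, k):
    # Rows of X are training vectors for one class; keep the top-k axes.
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def reconstruction_error(x, model):
    mean, V = model
    proj = (x - mean) @ V.T      # project onto the class subspace
    recon = mean + proj @ V      # reconstruct from the projection
    return np.linalg.norm(x - recon)

def classify(x, model1, model2):
    # Smaller reconstruction error wins, as in the slide's comparison.
    return 1 if reconstruction_error(x, model1) < reconstruction_error(x, model2) else 2

rng = np.random.default_rng(0)
# Two synthetic classes living in different 1-D subspaces of R^3.
c1 = rng.normal(size=(100, 1)) @ np.array([[1.0, 0.0, 0.0]])
c2 = rng.normal(size=(100, 1)) @ np.array([[0.0, 1.0, 0.0]])
m1 = fit_class_eigenvectors(c1 + 0.01 * rng.normal(size=c1.shape), k=1)
m2 = fit_class_eigenvectors(c2 + 0.01 * rng.normal(size=c2.shape), k=1)
print(classify(np.array([2.0, 0.0, 0.0]), m1, m2))  # 1
```

Because only the per-class mean and eigenvector matrix need to be stored, this style of classifier matches the scalability and incremental properties the slides emphasize.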

Page 28: Big Data and Advanced Data Intensive Computing


Reconstruction Examples

Reconstruction of cell-center and off-center data using matching vs. non-matching eigenvector matrices

Reconstruction is accurate only with the matching eigenvector matrix
Proximity: cell-center proximity value (e.g., 1.0 is cell center and 0.1 is off-center)

Page 29: Big Data and Advanced Data Intensive Computing


MapReduce Parallelization

Our algorithm is highly parallelizable.

To exploit this property, we developed a MapReduce-based implementation of the algorithm.

Page 30: Big Data and Advanced Data Intensive Computing


MapReduce Parallelization (Training)

Parallel PCA computations of the class-specific subsets from the training sets, generating two eigenvector matrices per training set.

[Figure: map-phase workers read the input files (Class 1 and Class 2 subsets of Training Sets 1..k), perform eigen-decomposition, and write one eigenvector matrix per class per training set to the output files.]

Page 31: Big Data and Advanced Data Intensive Computing


MapReduce Parallelization (Testing) - Map

1. We prepare data vectors from all voxels in the data volume, to decide whether each data vector is in the cell-center class.

[Figure: map-phase workers read each split of the novel input, project and reconstruct it with each class's eigenvector matrices, and emit the reconstruction errors |x − x̃| to intermediate files; reduce-phase workers average the reconstruction errors into the output files.]

Page 32: Big Data and Advanced Data Intensive Computing


Results: MapReduce Performance

[Chart: average running time vs. cluster configuration A-D]

A: Single Node
B: One Master, One Slave
C: One Master, Five Slaves
D: One Master, Ten Slaves

Performance comparison during testing
35 map tasks and 10 reduce tasks per job (except for the A case)
Performance was greatly improved (nearly 10 times)
Not much gain during training

Each compute node has a quad-core 2x Intel Xeon X5570 CPU and 23.00 GB of memory.

Page 33: Big Data and Advanced Data Intensive Computing


Conclusion

Developed a novel scalable incremental learning algorithm for fast quantitative analysis of massive, growing, sparsely labeled data.

Our algorithm showed high accuracy (AUC of 0.9614).

10x speed-up using MapReduce.

Expected to be broadly applicable to the analysis of high-throughput medical imaging data.

Page 34: Big Data and Advanced Data Intensive Computing


Use Cases in Science

Seismology
HEP

Page 35: Big Data and Advanced Data Intensive Computing


Hadoop Use Cases in Science

Reflection Seismology
Marine Seismic Survey

Sears (Retail)
Gravity (Online Publishing, Personalized Content)

Page 36: Big Data and Advanced Data Intensive Computing


Reflection Seismology

A set of techniques for solving a classic inverse problem:
– given a collection of seismograms and associated metadata,
– generate an image of the subsurface of the Earth that generated those seismograms.

Big Data
– A modern seismic survey
• tens of thousands of shots and multiple terabytes of trace data.

Goals of reflection seismology
To locate oil and natural gas deposits.
To identify the location of the Chicxulub Crater
– which has been linked to the extinction of the dinosaurs.

Page 37: Big Data and Advanced Data Intensive Computing


Marine Seismic Survey

Page 38: Big Data and Advanced Data Intensive Computing


Common Depth Point (CDP) Gather

The purpose of CDP:
By comparing the time it took for the seismic waves to travel from the different source and receiver locations, and by experimenting with different velocity models for the waves moving through the rock,
– we can estimate the depth of the common surface point that the waves reflected off of.

Page 39: Big Data and Advanced Data Intensive Computing


Reflection Seismology and Hadoop

By aggregating a large number of these estimates, we construct a complete image of the subsurface.
As we increase the density and the number of traces,
– we create higher-quality images
• that improve our understanding of the subsurface geology

A 3D seismic image of Japan’s southeastern margin

Page 40: Big Data and Advanced Data Intensive Computing


Reflection Seismology and Hadoop(Legacy Seismic Data Processing)

Geophysicists
Used the first Cray supercomputers
– as well as the massively parallel Connection Machine.

Parallel computing
– One must file a request to move the data into active storage,
• then consume precious cluster resources in order to process the data.

Page 41: Big Data and Advanced Data Intensive Computing


Reflection Seismology and Hadoop(Legacy Seismic Data Processing)

Open-source software tools in seismic data processing
The Seismic Unix project
– from the Colorado School of Mines
SEPlib
– from Stanford University
SeisSpace
– a commercial toolkit for seismic data processing
• Built on top of an open-source foundation, the JavaSeis project.

Page 42: Big Data and Advanced Data Intensive Computing


Emerge of Apache Hadoop for Seismology

Seismic Hadoop by Cloudera
Data-intensive computing
– store and process seismic data in a Hadoop cluster
• Enables many of the most I/O-intensive steps of seismic data processing to be moved into the Hadoop cluster

Combines Seismic Unix with Crunch,
– the Java library for creating MapReduce pipelines.

Seismic Unix makes extensive use of Unix pipes in order to construct complex data processing tasks from a set of simple procedures:

sufilter f=10,20,30,40 | suchw key1=gx,cdp key2=offset,gx key3=sx,sx b=1,1 c=1,1 d=1,2 | susort cdp gx

This Seismic Unix pipeline
– first applies a filter to the trace data,
– then edits some metadata associated with each trace,
– and finally sorts the traces by the metadata just edited.
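The pipe-composition style above can be mimicked in plain Python (a toy sketch in the Seismic Unix spirit; the stage names bandpass, edit_meta, and sort_by are our illustrative stand-ins for sufilter, suchw, and susort, not real Seismic Unix or Crunch APIs): each stage consumes the previous stage's records just as each Unix process consumes the previous one's STDOUT.

```python
def bandpass(traces, lo, hi):
    # Stand-in for sufilter: drop samples outside a band.
    for t in traces:
        yield {**t, "samples": [s for s in t["samples"] if lo <= s <= hi]}

def edit_meta(traces, **updates):
    # Stand-in for suchw: rewrite header fields on each trace.
    for t in traces:
        yield {**t, **updates}

def sort_by(traces, *keys):
    # Stand-in for susort: sorting is the only stage that must buffer.
    return sorted(traces, key=lambda t: tuple(t[k] for k in keys))

raw = [{"cdp": 2, "gx": 1, "samples": [5, 50, 15]},
       {"cdp": 1, "gx": 3, "samples": [80, 25]}]
# filter | edit headers | sort, exactly as in the shell pipeline above.
out = sort_by(edit_meta(bandpass(raw, 10, 40), survey="demo"), "cdp", "gx")
print([t["cdp"] for t in out])  # [1, 2]
```

The generator-based stages stream one trace at a time, which is what makes this composition style map naturally onto MapReduce pipelines such as Crunch.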

Page 43: Big Data and Advanced Data Intensive Computing


What is HEP?

High Energy Physics

Definition: Involves colliding highly energetic, common particles together

– in order to create small, exotic, and incredibly short-lived particles.

Page 44: Big Data and Advanced Data Intensive Computing


Large Hadron Collider

Collides protons together at an energy of 7 TeV per particle.
Protons travel around the rings and are collided inside particle detectors.
Collisions occur every 25 nanoseconds.

Page 45: Big Data and Advanced Data Intensive Computing


Compact Muon Solenoid

Big Data
Collisions at a rate of 40 MHz
– Each collision carries about 1 MB of data.
– 40 MHz x 1 MB = 320 Tbps (an unmanageable amount)

A complex custom compute system (called the trigger) cuts the full collision rate down to about 300 Hz, which means that the significant data are statistically determined.
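The back-of-the-envelope rates can be checked directly (plain arithmetic using only the numbers on the slide; nothing CMS-specific beyond them):

```python
# Rate arithmetic from the slide: 40 MHz of collisions, ~1 MB each.
collision_rate_hz = 40e6
bytes_per_collision = 1e6              # ~1 MB

raw_bytes_per_sec = collision_rate_hz * bytes_per_collision  # 4e13 B/s
raw_bits_per_sec = raw_bytes_per_sec * 8
print(raw_bits_per_sec / 1e12)         # 320.0 Tbps, as on the slide

# After the trigger cuts the rate down to ~300 Hz:
triggered_mb_per_sec = 300 * bytes_per_collision / 1e6
print(triggered_mb_per_sec)            # 300.0 MB/s, a manageable stream
```

The roughly five-orders-of-magnitude reduction (40 MHz to 300 Hz) is what turns an unmanageable firehose into a storable data stream for the tiers below.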

Page 46: Big Data and Advanced Data Intensive Computing


From Raw Data to Significant

Raw Sensor Data: 1 MB
Reconstructed Data: 110 KB
Analysis-oriented Data: 1 KB
Physicist-specific N-tuples

Tier 1 - at CERN
Tier 2 - Big Data

Page 47: Big Data and Advanced Data Intensive Computing


Characteristics of Tier 2

Need Hadoop
– Large amount of data (400 TB)
– Large data rate (in the range of 10 Gbps) to analyze
– Need for reliability, but not archival storage
– Proper use of resources
– Need for interoperability

Page 48: Big Data and Advanced Data Intensive Computing


HDFS Structure

HDFS
– Mounted with FUSE

Worker Nodes

SRM
– Generic web-services interface
Globus GridFTP
– Standard grid protocol for WAN transfers

• FUSE allows physicists’ C++ applications to access HDFS without modification.
• The two grid components allow interoperation with non-Hadoop sites.

Page 49: Big Data and Advanced Data Intensive Computing


MapReduce 1.0 Cons and Future

Bad for:
– Fast response times
– Large amounts of shared data
– Fine-grained synchronization
– CPU-intensive (rather than data-intensive) work
– Continuous input streams

Hadoop 2.0: YARN

Page 50: Big Data and Advanced Data Intensive Computing


Hadoop 2.0: YARN

Data processing applications and services
– Online serving: HOYA (HBase on YARN)
– Real-time event processing: Storm, S4, other commercial platforms
– Tez: generic framework to run a complex DAG
– MPI: OpenMPI, MPICH2
– Master-Worker
– Machine learning: Spark
– Graph processing: Giraph

Enabled by allowing the use of paradigm-specific application masters

[http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex]

Page 51: Big Data and Advanced Data Intensive Computing


Training in Big Data

Learn by yourself? You will miss many important topics.
Cloudera
– With hands-on exercises

Cloudera courses
– Hadoop Developer
– Hadoop System Administrator
– Hadoop Data Analyst / Data Scientist

Page 52: Big Data and Advanced Data Intensive Computing


Conclusion

Era of Big Data
Need to store and compute Big Data
Many solutions, but Hadoop stands out
Hadoop is a supercomputer that you can own
Hadoop 2.0
Training is important

Page 53: Big Data and Advanced Data Intensive Computing


Questions?