Introduction to Big Data and Use Cases on Hadoop

Jongwook Woo

HiPIC

CSULA

Seoul Technology Society Meetup: Hack'n'Tell night #3

Seoul, Korea, July 25th 2014

Jongwook Woo (PhD)

High-Performance Information Computing Center (HiPIC)

Cloudera Academic Partner and Grants Awardee of Amazon AWS

California State University Los Angeles

Introduction To Big Data and Use Cases on Hadoop


Contents

Introduction
Big Data Use Cases
Hadoop 2.0
Training in Big Data

Me

Name: Jongwook Woo, PhD
Background:
Since 1998, consulting for companies in Hollywood
  – Implementing eBusiness applications using J2EE
  – Search applications using FAST, Lucene/Solr, Sphinx
    • Data integration, data feeds
  – Warner Bros (Matrix online game), E!, citysearch.com, ARM
Teaching since 2002
  – California State University Los Angeles
Exposed to Hadoop since 2008
Exposed to Cloudera since 2010

Experience in Big Data

Certificates
  – Certified Cloudera Instructor
  – Certified Cloudera Hadoop Developer / Administrator
Partnership
  – Received Academic Education Partnership with Cloudera since June 2012
Grants
  – Received Microsoft Windows Azure Educator Grant (Oct 2013 - July 2014)
  – Received Amazon AWS in Education Research Grant (July 2012 - July 2014)
  – Received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011)

What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing

[Figure: Cloudera, Hortonworks, AWS, NoSQL DB]

Data

Google
  – “We don’t have a better algorithm than others but we have more data than others”

Data Issues

Large-scale data
  – Terabyte (10^12 bytes), petabyte (10^15 bytes)
  – Because of the web
    • Sensor data, bioinformatics, social computing, smart phones, online games…
Cannot be handled with the legacy approach
  – Too big
  – Un-/semi-structured data
  – Too expensive
Need new systems
  – Inexpensive

Two Cores in Big Data

How to store Big Data
How to compute Big Data
Google
  How to store Big Data
    – GFS
    – On inexpensive commodity computers
  How to compute Big Data
    – MapReduce
    – Parallel computing with multiple inexpensive computers
      • Own supercomputers

Hadoop 1.0

Hadoop
Doug Cutting
  – Hadoop founder
  – Initiated the Apache Lucene, Nutch, Avro, and Hadoop projects
  – Board member of the Apache Software Foundation
  – Chief Architect at Cloudera
MapReduce
HDFS
Restricted parallel programming
  – Not for iterative algorithms
  – Not for graph processing

MapReduce

Provides a restricted parallel programming model on Hadoop
User implements Map() and Reduce()
Libraries (Hadoop) take care of EVERYTHING else
  – Parallelization
  – Fault tolerance
  – Data distribution
  – Load balancing

Now you can own a supercomputer
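
A minimal sketch of this division of labor, in plain Python rather than the Hadoop API. The user writes only `map_fn` and `reduce_fn`; the hypothetical `run_mapreduce` driver is a single-process stand-in for the framework, which in real Hadoop also handles parallelization, fault tolerance, data distribution, and load balancing across machines. All names and the sample input are illustrative assumptions.

```python
from collections import defaultdict

def map_fn(line):
    """User-written Map(): emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """User-written Reduce(): sum all counts collected for one word."""
    return word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    """Toy single-process stand-in for the framework (not the Hadoop API)."""
    groups = defaultdict(list)
    for record in records:                  # map phase
        for key, value in mapper(record):
            groups[key].append(value)       # shuffle: group values by key
    # reduce phase: one reducer call per distinct key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))

# Made-up input lines for illustration.
lines = ["big data on hadoop", "hadoop stores big data"]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
print(word_counts)
```

Because `map_fn` looks at one record at a time and `reduce_fn` at one key at a time, a real framework is free to spread both phases over many machines.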



Definition: Big Data

Inexpensive frameworks that can store large-scale data and process it faster in parallel
Hadoop
  – You can build and run your applications

Legacy Example

In late 2007, the New York Times wanted to make available over the web its entire archive of articles
  – 11 million in all, dating back to 1851
  – A four-terabyte pile of images in TIFF format
  – Needed to translate that four-terabyte pile of TIFFs into more web-friendly PDF files
    • Not a particularly complicated but a large computing chore,
    • requiring a whole lot of computer processing time

Legacy Example (Cont’d)

In late 2007, the New York Times wanted to make available over the web its entire archive of articles
A software programmer at the Times, Derek Gottfrid,
  – playing around with Amazon Web Services, Elastic Compute Cloud (EC2),
    • uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3)
    • In less than 24 hours, 11 million PDFs, all stored neatly in S3 and ready to be served up to visitors to the Times site
The total cost for the computing job? $240
  – 10 cents per computer-hour times 100 computers times 24 hours
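
The $240 figure follows directly from the three numbers on the slide. A quick check, using integer cents to avoid floating-point rounding (the variable names are only for illustration):

```python
CENTS_PER_COMPUTER_HOUR = 10   # 10 cents per computer-hour (from the slide)
COMPUTERS = 100                # number of EC2 machines (from the slide)
HOURS = 24                     # upper bound on the run time (from the slide)

total_cents = CENTS_PER_COMPUTER_HOUR * COMPUTERS * HOURS
total_dollars = total_cents // 100
print(f"${total_dollars}")     # prints "$240"
```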



HuffPost | AOL

Two Machine Learning Use Cases
Comment Moderation
  – Evaluate all new HuffPost user comments every day
    • Identify abusive / aggressive comments
    • Auto delete / publish ~25% of comments every day
Article Classification
  – Tag articles for advertising
    • E.g.: scary, salacious, …

Use Cases experienced

Log Analysis
  – Log files from IPS and IDS
    • 1.5 GB per day for each system
  – Extracting unusual cases using Hadoop, Solr, Flume on Cloudera
Customer Behavior Analysis
  – Market Basket Analysis algorithm
Machine Learning for Image Processing with Texas A&M
  – Hadoop Streaming API
Movie Data Analysis
  – Hive, Impala
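
The slides do not show the market basket algorithm itself. As a rough sketch, assuming the common formulation that counts how often two items appear in the same transaction, the map step can emit every item pair in a basket and the reduce step can sum the counts per pair. The sample baskets and function names below are made up for illustration, and the code runs in a single process rather than on Hadoop.

```python
from collections import Counter
from itertools import combinations

def map_pairs(basket):
    """Map step: emit every unordered item pair in one transaction."""
    items = sorted(set(basket))             # dedupe and fix the pair order
    for pair in combinations(items, 2):
        yield pair, 1

def count_pairs(baskets):
    """Reduce step: sum the counts emitted for each pair."""
    totals = Counter()
    for basket in baskets:
        for pair, one in map_pairs(basket):
            totals[pair] += one
    return totals

# Hypothetical transactions, not data from the talk.
baskets = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "eggs"],
]
pair_counts = count_pairs(baskets)
print(pair_counts.most_common(2))
```

Sorting the items before pairing means ("bread", "milk") and ("milk", "bread") land on the same key, which is what lets a reducer add them up.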



Hadoop 2.0: YARN

Data processing applications and services
  – Impala on MPP
  – Tez: generic framework to run a complex DAG
  – Machine learning, data streaming: Spark
  – Graph processing: Giraph

Training in Big Data

Learn by yourself?
  – You miss many important topics
Cloudera: a leading Big Data Hadoop distributor
  – With hands-on exercises
Cloudera Training series
  – Hadoop Developer
  – Hadoop Systems Administrator
  – Hadoop Data Analyst/Scientist

Conclusion

Era of Big Data
Need to store and compute Big Data
Many solutions but Hadoop is the way to go
Hadoop is a supercomputer that you can own
Hadoop 2.0
Training is important

Question?
