
DESCRIPTION

Brief Introduction to Big Data using Hadoop and its use cases

TRANSCRIPT

Jongwook Woo

HiPIC

CSULA

Seoul Technology Society Meetup: Hack'n'Tell night #3

Seoul, Korea, July 25th, 2014

Jongwook Woo (PhD)

High-Performance Information Computing Center (HiPIC)

Cloudera Academic Partner and Grants Awardee of Amazon AWS

California State University Los Angeles

Introduction To Big Data and Use Cases on Hadoop


Contents

– Introduction
– Big Data Use Cases
– Hadoop 2.0
– Training in Big Data


Me

Name: Jongwook Woo, PhD

Background:
– Since 1998, consulting for companies in Hollywood
• Implementing eBusiness applications using J2EE
• Search applications using FAST, Lucene/Solr, Sphinx
• Data Integration, Data Feeds
– Clients: Warner Bros (Matrix online game), E!, citysearch.com, ARM
– Teaching since 2002: California State University Los Angeles
– Exposed to Hadoop since 2008
– Exposed to Cloudera since 2010


Experience in Big Data

Certificates
– Certified Cloudera Instructor
– Certified Cloudera Hadoop Developer / Administrator

Partnership
– Received Academic Education Partnership with Cloudera since June 2012

Grants
– Received Microsoft Windows Azure Educator Grant (Oct 2013 - July 2014)
– Received Amazon AWS in Education Research Grant (July 2012 - July 2014)
– Received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011)


What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing

[Slide diagram: Cloudera, HortonWorks, AWS, NoSQL DB]


Data

Google: “We don’t have a better algorithm than others, but we have more data than others.”


Data Issues

Large-scale data
– Terabytes (10^12 bytes), petabytes (10^15 bytes)
– Because of the web
• Sensor data, bioinformatics, social computing, smartphones, online games…

Cannot be handled with the legacy approach
– Too big
– Un-/semi-structured data
– Too expensive

Need new systems
– Inexpensive


Two Cores in Big Data

– How to store Big Data
– How to compute Big Data

Google's answers:

How to store Big Data
– GFS
– On inexpensive commodity computers

How to compute Big Data
– MapReduce
– Parallel computing with multiple inexpensive computers
• Like owning your own supercomputer


Hadoop 1.0

Doug Cutting
– Hadoop founder
– Initiated the Apache Lucene, Nutch, Avro, and Hadoop projects
– Board member of the Apache Software Foundation
– Chief Architect at Cloudera

MapReduce and HDFS
Restricted parallel programming model
– Not for iterative algorithms
– Not for graph processing
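The storage half of Hadoop 1.0 is HDFS. As a rough illustration (a minimal sketch, not from the slides; the path and class name here are hypothetical), writing a file through Hadoop's Java FileSystem API looks like this, with block replication across the commodity nodes handled by HDFS itself:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; HDFS splits the file into blocks and
        // replicates them across inexpensive commodity nodes
        Path path = new Path("/user/demo/sample.txt");
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("hello, hdfs\n");
        }
        fs.close();
    }
}
```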


MapReduce

Provides a restricted parallel programming model on Hadoop
– The user implements Map() and Reduce()
– The libraries (Hadoop) take care of EVERYTHING else:
• Parallelization
• Fault tolerance
• Data distribution
• Load balancing

Now you can own a supercomputer
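To make that division of labor concrete, here is the canonical WordCount in Hadoop's Java MapReduce API, offered as a minimal sketch rather than anything from the talk: the user code is only map() and reduce(); splitting, shuffling, retries, and scheduling come from the framework.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map(): emit (word, 1) for every word in the input split
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce(): sum the counts the shuffle grouped under each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```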


Definition: Big Data

Inexpensive frameworks that can store large-scale data and process it fast in parallel

Hadoop
– You can build and run your applications on it


Legacy Example

In late 2007, the New York Times wanted to make its entire archive of articles available over the web: 11 million in all, dating back to 1851
– A four-terabyte pile of images in TIFF format
– Needed to translate that four-terabyte pile of TIFFs into more web-friendly PDF files
• Not a particularly complicated but large computing chore, requiring a whole lot of computer processing time


Legacy Example (Cont’d)

Derek Gottfrid, a software programmer at the Times,
– playing around with Amazon Web Services' Elastic Compute Cloud (EC2),
• uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3)
• in less than 24 hours had 11 million PDFs, all stored neatly in S3 and ready to be served up to visitors to the Times site

The total cost for the computing job? $240
– 10 cents per computer-hour × 100 computers × 24 hours


HuffPost | AOL

Two machine learning use cases

Comment Moderation
– Evaluate all new HuffPost user comments every day
– Identify abusive / aggressive comments
– Auto-delete / publish ~25% of comments every day

Article Classification
– Tag articles for advertising
• E.g.: scary, salacious, …


Use Cases experienced

Log Analysis
– Log files from IPS and IDS
• 1.5 GB per day for each system
– Extracting unusual cases using Hadoop, Solr, and Flume on Cloudera

Customer Behavior Analysis
– Market Basket Analysis algorithm (sketched below)

Machine Learning for Image Processing, with Texas A&M
– Hadoop Streaming API

Movie Data Analysis
– Hive, Impala
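As a hedged sketch of how the Market Basket Analysis above can be phrased in MapReduce (the input format and class name are assumptions, not details from the talk): each input line is one transaction's comma-separated items, the mapper emits every pair of items bought together, and a summing reducer like WordCount's counts how often each pair co-occurs.

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical input line: "bread,butter,milk"
// Emits: ("bread,butter", 1), ("bread,milk", 1), ("butter,milk", 1)
public class BasketPairMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] items = value.toString().split(",");
        // Sort so (milk, bread) and (bread, milk) land on the same key
        Arrays.sort(items);
        for (int i = 0; i < items.length; i++) {
            for (int j = i + 1; j < items.length; j++) {
                pair.set(items[i] + "," + items[j]);
                context.write(pair, ONE);
            }
        }
    }
}
// A summing reducer (as in WordCount) then yields the frequency of
// each item pair; the frequent pairs feed "customers also bought" rules.
```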


Hadoop 2.0: YARN

Data processing applications and services
– Impala on MPP
– Tez: a generic framework to run a complex DAG
– Machine learning, data streaming: Spark (sketched below)
– Graph processing: Giraph
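Spark is the piece that lifts the "not for iterative algorithms" restriction noted for Hadoop 1.0. Here is a minimal, hedged sketch in Spark's Java API (Spark 2.x signatures; the toy data, class name, and learning rate are assumptions): the dataset is cached in memory once and reused on every pass of a gradient-style loop, instead of being re-read from disk per iteration as plain MapReduce would.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("iterative-sketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Toy data; a real job would load from HDFS via sc.textFile(...)
        JavaDoubleRDD data = sc.parallelizeDoubles(
                Arrays.asList(1.0, 4.0, 6.0, 9.0)).cache(); // cached once
        long n = data.count();

        // Gradient-descent-style loop: every pass reuses the cached RDD
        double estimate = 0.0;
        for (int i = 0; i < 20; i++) {
            final double current = estimate;
            double gradient = data.map(x -> current - x)
                                  .reduce((a, b) -> a + b) / n;
            estimate -= 0.1 * gradient; // step toward the data mean
        }
        System.out.println("estimate ~ " + estimate); // approaches 5.0
        sc.stop();
    }
}
```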


Training in Big Data

Learn by yourself?
– You will miss many important topics

Cloudera: a leading Big Data Hadoop distributor
– With hands-on exercises

Cloudera training series
– Hadoop Developer
– Hadoop Systems Administrator
– Hadoop Data Analyst / Scientist


Conclusion
– Era of Big Data
– Need to store and compute Big Data
– Many solutions, but Hadoop is the way to go
– Hadoop is a supercomputer that you can own
– Hadoop 2.0
– Training is important


Questions?
