Introduction to Big Data and Use Cases on Hadoop
Post on 28-Nov-2014
DESCRIPTION
Brief introduction to Big Data using Hadoop and its use cases
TRANSCRIPT
Jongwook Woo
HiPIC
CSULA
Seoul Technology Society Meetup: Hack'n'Tell night #3
Seoul, Korea
July 25th 2014
Jongwook Woo (PhD)
High-Performance Information Computing Center (HiPIC)
Cloudera Academic Partner and Grants Awardee of Amazon AWS
California State University Los Angeles
Introduction To Big Data and Use Cases on Hadoop
High Performance Information Computing Center
Jongwook Woo
CSULA
Contents
Introduction
Big Data Use Cases
Hadoop 2.0
Training in Big Data
Me
Name: Jongwook Woo, PhD
Backgrounds:
Since 1998, consulting for companies in Hollywood
– Implementing eBusiness applications using J2EE
– Search applications using FAST, Lucene/Solr, Sphinx
• Data Integration, Data Feed
– Warner Bros (Matrix online game), E!, citysearch.com, ARM
Teaching since 2002:
– California State University Los Angeles
Exposed to Hadoop since 2008
Exposed to Cloudera since 2010
Experience in Big Data
Certificate
– Certified Cloudera Instructor
– Certified Cloudera Hadoop Developer / Administrator
Partnership
– Received Academic Education Partnership with Cloudera since June 2012
Grants
– Received Microsoft Windows Azure Educator Grant (Oct 2013 - July 2014)
– Received Amazon AWS in Education Research Grant (July 2012 - July 2014)
– Received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011)
What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing
[Diagram labels: Cloudera, Hortonworks, AWS, NoSQL DB]
Data
Google: “We don’t have a better algorithm than others but we have more data than others”
Data Issues
Large-scale data
– Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of the web
• Sensor data, bioinformatics, social computing, smartphones, online games…
Cannot be handled with the legacy approach
– Too big
– Un-/semi-structured data
– Too expensive
Need new systems
– Inexpensive
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– On inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with multiple inexpensive computers
• Own supercomputers
Hadoop 1.0
Hadoop
Doug Cutting
– Hadoop founder
– Initiated the Apache Lucene, Nutch, Avro, and Hadoop projects
– Board member of the Apache Software Foundation
– Chief Architect at Cloudera
MapReduce
HDFS
Restricted Parallel Programming
– Not for iterative algorithms
– Not for graph processing
MapReduce
Provides a restricted parallel programming model on Hadoop
User implements Map() and Reduce()
Libraries (Hadoop) take care of EVERYTHING else
– Parallelization
– Fault Tolerance
– Data Distribution
– Load Balancing
Now you can own a supercomputer
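To make the Map()/Reduce() split concrete, here is a minimal single-process sketch of a word count in Python. It is not Hadoop code: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and `run_mapreduce` only imitates in memory the grouping step that the Hadoop libraries would perform across many machines.

```python
from collections import defaultdict


def map_fn(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce(): combine all the counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(lines):
    """Stand-in for the framework (an assumption, not the Hadoop API):
    run the map step, group values by key, then run the reduce step."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    print(run_mapreduce(["big data on hadoop", "hadoop stores big data"]))
```

Only `map_fn` and `reduce_fn` correspond to what the user writes; parallelization, fault tolerance, data distribution, and load balancing are absent from the sketch because the framework owns them.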
Definition: Big Data
Inexpensive frameworks that can store large-scale data and process it faster in parallel
Hadoop
– You can build and run your applications
Legacy Example
In late 2007, the New York Times wanted to make available over the web its entire archive of articles,
– 11 million in all, dating back to 1851
– A four-terabyte pile of images in TIFF format
Needed to translate that four-terabyte pile of TIFFs into more web-friendly PDF files
– Not a particularly complicated but large computing chore,
• requiring a whole lot of computer processing time
Legacy Example (Cont’d)
In late 2007, the New York Times wanted to make available over the web its entire archive of articles
A software programmer at the Times, Derek Gottfrid,
– playing around with Amazon Web Services, Elastic Compute Cloud (EC2),
• uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3)
• In less than 24 hours, 11 million PDFs, all stored neatly in S3 and ready to be served up to visitors to the Times site
The total cost for the computing job? $240
– 10 cents per computer-hour times 100 computers times 24 hours
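The $240 figure follows directly from the three numbers on the slide. A small sketch of the arithmetic, where the function name `job_cost_dollars` is a hypothetical helper and the calculation is done in cents to avoid floating-point rounding:

```python
def job_cost_dollars(cents_per_machine_hour, machines, hours):
    """Total rental cost: price per machine-hour x machines x hours."""
    total_cents = cents_per_machine_hour * machines * hours
    return total_cents / 100


if __name__ == "__main__":
    # 10 cents per computer-hour, 100 computers, 24 hours (figures from the slide)
    print(job_cost_dollars(10, 100, 24))  # 240.0
```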
HuffPost | AOL
Two Machine Learning Use Cases
Comment Moderation
– Evaluate All New HuffPost User Comments Every Day
– Identify Abusive / Aggressive Comments
– Auto Delete / Publish ~25% Comments Every Day
Article Classification
– Tag Articles for Advertising
– E.g.: scary, salacious, …
Use Cases experienced
Log Analysis
– Log files from IPS and IDS
• 1.5GB per day for each system
– Extracting unusual cases using Hadoop, Solr, Flume on Cloudera
Customer Behavior Analysis
– Market Basket Analysis Algorithm
Machine Learning for Image Processing with Texas A&M
– Hadoop Streaming API
Movie Data Analysis
– Hive, Impala
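The slide does not say how the Market Basket Analysis algorithm was implemented. One common formulation that fits the map/reduce model is counting how often two items appear in the same basket, sketched below in plain Python. The names `map_pairs` and `count_pairs` and the sample baskets are illustrative assumptions, not the speaker's code or data.

```python
from collections import Counter
from itertools import combinations


def map_pairs(basket):
    """Map step: emit ((item_a, item_b), 1) for each unordered pair in a basket.
    Items are de-duplicated and sorted so (a, b) and (b, a) share one key."""
    items = sorted(set(basket))
    for pair in combinations(items, 2):
        yield pair, 1


def count_pairs(baskets):
    """Reduce step (done in memory here): sum the counts for each pair key."""
    counts = Counter()
    for basket in baskets:
        for pair, one in map_pairs(basket):
            counts[pair] += one
    return counts


if __name__ == "__main__":
    # Hypothetical sample transactions
    sample = [["milk", "bread", "eggs"], ["bread", "milk"], ["eggs", "bread"]]
    for pair, count in sorted(count_pairs(sample).items()):
        print(pair, count)
```

Pairs with high counts are candidates for "bought together" rules. On a cluster, the map step would run per transaction and the framework would group identical pair keys before the summation.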
Hadoop 2.0: YARN
Data processing applications and services
– Impala on MPP
– Tez: generic framework to run a complex DAG
– Machine Learning, Data Streaming: Spark
– Graph processing: Giraph
Training in Big Data
Learn by yourself?
– Miss many important topics
Cloudera: a leading Big Data Hadoop distributor
– With hands-on exercises
Cloudera Training series
– Hadoop Developer
– Hadoop Systems Administrator
– Hadoop Data Analyst/Scientist
Conclusion
Era of Big Data
Need to store and compute Big Data
Many solutions but Hadoop is the way to go
Hadoop is a supercomputer that you can own
Hadoop 2.0
Training is important
Questions?