Big Data with Apache Hadoop
DESCRIPTION
Slide deck from our seminar about Hadoop (08/10/2014).
Topics covered:
- What is Big Data?
- About Apache Hadoop
- HDFS
- MapReduce
- Pig
- Hive
- HBase
- Mahout & Machine Learning
- Other tooling: Sqoop, Oozie, ...
- Hadoop deployment options
- Real-life cases
TRANSCRIPT
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Data Science Company
Big Data with Apache Hadoop
12/04/2023
Who am I
BEN VERMEERSCH
Big Data Consultant
Cloudera Certified Developer for Apache Hadoop
About InfoFarm
Data Science
Big Data
Identifying, extracting and using data of all types and origins; exploring, correlating and using it in new and innovative ways in order to extract meaning and business value from it.
About InfoFarm
• 2 Data Scientists
• 4 Big Data Consultants
• 1 Infrastructure Specialist
Java
PHP E-Commerce
Mobile
Web Development
Agenda
• 09:30 – What is Big Data?
• 09:45 – Hadoop – HDFS & MapReduce
• 10:00 – HDFS & MapReduce in Practice
• 10:30 – The Hadoop Ecosystem
• 11:30 – Examples
• 12:00 – Wrap up and Lunch
What is Big Data?
What is Big Data not?
• a technology
• a solution (certainly not a silver bullet) to any IT problem
• a replacement for an RDBMS
• a cloud storage system
• …
Big Data definition attempt
“a description of a problem domain with specific challenges and solutions which has become relevant with increasing volume, velocity and variety in business data and the increasing requirements towards processing of this data”
The 3 V’s
Working the (Hadoop) Big Data way
• Bringing data processing to the data (vs centralized db)
• Using unstructured or semi-structured data
• Store first, process later
• Simple techniques applied at massive scale
• Your hardware will fail!
Hadoop (limited) overview
• Storage: HDFS (Distributed File System), Amazon S3, Local FS
• YARN: Distributed Data Processing
• MapReduce
• HBase: NoSQL
• Hive: Data Mart
• Pig: Scripting
• Sqoop: SQL Import/Export
• Mahout: Machine Learning
• Oozie: Workflow
• …
HDFS
HDFS Rack Topology
MapReduce
• A method for distributing tasks across multiple nodes
• Data is processed where it is stored (where possible)
• Two phases:
  – Map
  – Reduce
• Both phases take key-value pairs as input and produce key-value pairs as output; the programmer chooses the key and value types
• The output from the mappers is used by the reducers
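The two phases above can be sketched in plain Python, without a cluster. This is a minimal illustration, not Hadoop's API: `map_fn`, `reduce_fn` and `run_job` are hypothetical names, and the word-count task and the two input lines are assumptions chosen for the example.

```python
from collections import defaultdict


def map_fn(_offset, line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce phase: receive every count for one word and emit the total."""
    yield word, sum(counts)


def run_job(lines):
    """Run map, group the intermediate pairs by key, then run reduce."""
    intermediate = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            intermediate[key].append(value)

    output = {}
    for key in sorted(intermediate):  # one reduce call per distinct key
        for out_key, out_value in reduce_fn(key, intermediate[key]):
            output[out_key] = out_value
    return output


word_counts = run_job(["big data with hadoop", "hadoop stores big data"])
print(word_counts)
```

On a real cluster the map calls run in parallel on the nodes that hold the input blocks; here they simply run in a loop.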
Map & Reduce
Mapper input Mapper output Reducer input Reducer output
Map function
Input.txt: Block 1, Block 2, Block 3
Node 1: Block 1, Block 2
Node 2: Block 2, Block 3
Node 3: Block 1, Block 3
Shuffle and sort
• Hadoop automatically sorts and merges output from all map tasks
  – This intermediate process is known as the shuffle and sort
  – The result is supplied to reduce tasks
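As a rough sketch of what the shuffle and sort does, the snippet below merges the output of two map tasks, sorts it by key and groups the values per key. The function name `shuffle_and_sort` and the sample pairs are assumptions for illustration; Hadoop performs this step itself, across the network and on disk.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(map_outputs):
    """Merge the output of all map tasks, sort by key, group values per key."""
    merged = [pair for task_output in map_outputs for pair in task_output]
    merged.sort(key=itemgetter(0))  # sort on the key only
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]


# Hypothetical output of two independent map tasks.
map_task_1 = [("hadoop", 1), ("big", 1)]
map_task_2 = [("big", 1), ("data", 1), ("hadoop", 1)]

grouped = shuffle_and_sort([map_task_1, map_task_2])
print(grouped)
```

Each `(key, values)` entry in `grouped` corresponds to one call of the reduce function.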
Reduce function
• Reducer input comes from the shuffle and sort process
  – Receives one record at a time
  – Receives all records for a given key
  – Emits zero or more output records
• Example: a reduce function sums the total per person and emits the employee name (key) and total (value) as output
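The summing example could look roughly like this in Python. The employee names and amounts are made up, `reduce_totals` is a hypothetical name, and skipping employees whose total is zero is an assumption added only to show that a reducer may emit zero records for a key.

```python
def reduce_totals(employee, amounts):
    """Reduce function: called once per key with all values for that key."""
    total = 0
    for amount in amounts:  # values arrive one record at a time
        total += amount
    if total > 0:  # assumption for illustration: emit nothing for empty totals
        yield employee, total


# Hypothetical reducer input, as produced by the shuffle and sort.
reducer_input = [("Ann", [100, 250]), ("Bob", [75]), ("Cid", [0])]

totals = [
    record
    for employee, amounts in reducer_input
    for record in reduce_totals(employee, iter(amounts))
]
print(totals)
```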
MapReduce under the hood
Diagram components: Client, ResourceManager, AppMaster, Node 1, Node 2, Node 3, HDFS
Joining
Input table A (users):
User  Name
1     John
2     Maria
3     Jane
Input table B (comments):
User  Comment
1     Cool
2     Nonono
2     Hi there
3     Hadoop is awesome
Mapper output for table A:
Key  Value
1    AJohn
2    AMaria
3    AJane
Mapper output for table B:
Key  Value
1    BCool
2    BNonono
2    BHi there
3    BHadoop is awesome
Joining
Mapper output for table A:
Key  Value
1    AJohn
2    AMaria
3    AJane
Mapper output for table B:
Key  Value
1    BCool
2    BNonono
2    BHi there
3    BHadoop is awesome
After Shuffle/Sort, the Reducer receives:
Key  Values
1    AJohn; BCool
2    AMaria; BNonono; BHi there
3    AJane; BHadoop is awesome
Joining
Reducer input:
Key  Values
1    AJohn; BCool
2    AMaria; BNonono; BHi there
3    AJane; BHadoop is awesome
Reducer output:
Userid  Name   Comment
1       John   Cool
2       Maria  Nonono
2       Maria  Hi there
3       Jane   Hadoop is awesome
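The join on these slides can be reproduced with a small in-memory sketch. It uses the same data and the same A/B tagging; the function names (`map_users`, `map_comments`, `reduce_join`) are hypothetical, and a dictionary stands in for Hadoop's shuffle and sort.

```python
from collections import defaultdict

users = [(1, "John"), (2, "Maria"), (3, "Jane")]
comments = [(1, "Cool"), (2, "Nonono"), (2, "Hi there"), (3, "Hadoop is awesome")]


def map_users(user_id, name):
    yield user_id, "A" + name  # tag A marks a record from the user table


def map_comments(user_id, comment):
    yield user_id, "B" + comment  # tag B marks a record from the comment table


def reduce_join(user_id, tagged_values):
    """Split the values by tag and emit one row per (name, comment) pair."""
    names = [v[1:] for v in tagged_values if v.startswith("A")]
    texts = [v[1:] for v in tagged_values if v.startswith("B")]
    for name in names:
        for text in texts:
            yield user_id, name, text


# Stand-in for shuffle and sort: collect all tagged values per key.
grouped = defaultdict(list)
for user_id, name in users:
    for key, value in map_users(user_id, name):
        grouped[key].append(value)
for user_id, comment in comments:
    for key, value in map_comments(user_id, comment):
        grouped[key].append(value)

joined = [row for key in sorted(grouped) for row in reduce_join(key, grouped[key])]
for row in joined:
    print(row)
```

The tag is what lets a single reducer tell the two sources apart once their records have been mixed under the same key.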
MapReduce Design Patterns
• More info:
• Frameworks on top of MapReduce like Hive or Pig make this easier
The Hadoop Ecosystem
• Storage: HDFS (Distributed File System), Amazon S3, Local FS
• YARN: Distributed Data Processing
• MapReduce
• HBase: NoSQL
• Hive: Data Mart
• Pig: Scripting
• Sqoop: SQL Import/Export
• Mahout: Machine Learning
• Oozie: Workflow
• …
Apache Pig
• Processing framework for (large) datasets
• Pig Latin
• Runs on Hadoop (or local) with MapReduce
• Extensible with UDFs
Apache Hive
• SQL-like querying on Hadoop datasets
• Translates to MapReduce under the hood
• Originally developed at Facebook
• Now Apache Top Level project
Hive <-> Traditional RDBMS
Hive:
• Schema on read
• Fast initial load
• Flexible schema
• No update or delete (only insert into)
• HiveQL (subset of SQL)
Traditional RDBMS:
• Schema on write
• Slow initial load
• Fixed schema
• Updates, deletes, inserts all possible
• SQL compliant
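To make "schema on read" concrete, here is a small Python sketch of the idea, not of Hive itself: raw lines are stored untouched, and a schema is only applied when the data is read. The sample lines, the `read_with_schema` helper and the choice to turn unparseable fields into `None` are all assumptions for illustration.

```python
# Raw data is loaded as-is; nothing is validated at load time (fast initial load).
raw_lines = ["1\tJohn\t34", "2\tMaria\tunknown", "3\tJane"]


def read_with_schema(lines, schema):
    """Apply (column name, converter) pairs to raw lines at query time."""
    for line in lines:
        fields = line.split("\t")
        row = {}
        for index, (name, convert) in enumerate(schema):
            try:
                row[name] = convert(fields[index])
            except (IndexError, ValueError):
                row[name] = None  # missing or malformed values read as empty
        yield row


schema = [("id", int), ("name", str), ("age", int)]
rows = list(read_with_schema(raw_lines, schema))
for row in rows:
    print(row)
```

A schema-on-write system would instead reject the second and third lines at load time, which is part of why its initial load is slower and its schema less flexible.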
HBase
• Column-oriented Data Store
• Distributed
• Type of NoSQL-DB
• Based on Google BigTable
HBase
• Lots and lots of data
• Large number of clients
• Single selects
• Range scan by key
• Variable schema
• Not a traditional RDBMS (none of the following):
  – Transactions
  – Group by
  – Join
  – Where
  – Like
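The access patterns listed above (single selects, range scans by key, variable schema) can be illustrated with a toy in-memory store that keeps its row keys sorted. This is only a sketch of the idea; `SortedRowStore` and its methods are hypothetical and are not the HBase client API, and the row keys and columns are made up.

```python
from bisect import bisect_left, insort


class SortedRowStore:
    """Toy key-sorted row store: rows are dicts, so each row may have different columns."""

    def __init__(self):
        self._keys = []  # row keys, kept in sorted order
        self._rows = {}  # row key -> {column: value}

    def put(self, row_key, columns):
        if row_key not in self._rows:
            insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key].update(columns)

    def get(self, row_key):
        """Single select by key."""
        return self._rows.get(row_key)

    def scan(self, start_key, stop_key):
        """Range scan: yield rows with start_key <= key < stop_key, in key order."""
        start = bisect_left(self._keys, start_key)
        stop = bisect_left(self._keys, stop_key)
        for key in self._keys[start:stop]:
            yield key, self._rows[key]


store = SortedRowStore()
store.put("user#002", {"name": "Maria"})
store.put("user#001", {"name": "John"})
store.put("user#003", {"name": "Jane", "city": "Kontich"})  # extra column on one row

scanned = list(store.scan("user#001", "user#003"))
print(scanned)
```

Because rows are only reachable by key or key range, anything like a join, a group by or a `LIKE` filter has to be designed into the row key or computed elsewhere.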
Sqoop
• Import data from a structured data source (typically an RDBMS) into Hadoop
• Export data from Hadoop into structured data sources
• sqoop import --connect jdbc:mysql://localhost/salesdb --table orders
• sqoop export --connect jdbc:mysql://localhost/salesdb --table orders --export-dir /user/test/orders --input-fields-terminated-by '\t'
Mahout
• Scalable Machine Learning
  – Recommendation
  – Classification
  – Clustering
Recommendation
Clustering
More information:
• Free seminar: Machine Learning in practice
• Fri 7th of November 2014 12:00 – 16:00
• Kontich
• http://www.buzzberry.be/events/
Integrating Hadoop in your IT landscape
JDBC
Tools – BigData – IT options
• Hadoop is not a trivial piece of software to manage!
• On-premise
  – Commodity Hardware
  – Advantage: full control & performance
  – Disadvantage: required skills, migrations, backup, ...
• Cloud – Amazon AWS
  – EMR (Elastic Map Reduce)
  – Storage in S3
  – Very competitive offering financially
  – Manageability and flexibility
• Cloud – IBM SoftLayer
• Hardware options (performance)
Beyond MapReduce
There is more…
Oak3 Courses
• Data Science
• Hadoop
• HBase
• http://www.oak3.be/
Questions?