CentraleSupelec 2019-2020 MSC DSBA / DATA SCIENCES Big Data Algorithms, Techniques and Platforms
Distributed Databases Hadoop Applications and Ecosystem.
Hugues Talbot & Céline Hudelot, professors.
Big Data Hadoop Stack
Lecture #1
Hadoop Beginnings
What is Hadoop?
Apache Hadoop is an open-source software framework for storage and
large-scale processing of data sets on clusters of commodity hardware.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
Cutting named the project after his son's toy elephant.
Moving Computation to Data
Computation
Data
Scalability at Hadoop’s core!
Reliability! Reliability! Reliability!
Google File System
If one computer fails about once a year, a cluster of 365 computers sees a failure about once a day.
On a large enough cluster, failures happen hourly.
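The failure figures on these slides follow from simple arithmetic. The sketch below is only a back-of-the-envelope illustration: it assumes every machine fails independently and about once a year on average, and the function name is made up for this example.

```python
# Back-of-the-envelope failure arithmetic.
# Assumption: each machine fails independently, about once per year on average.
HOURS_PER_YEAR = 365 * 24


def hours_between_failures(num_machines, failures_per_machine_per_year=1.0):
    """Average number of hours between failures somewhere in the cluster."""
    cluster_failures_per_year = num_machines * failures_per_machine_per_year
    return HOURS_PER_YEAR / cluster_failures_per_year


for n in (1, 365, 8760):
    print(n, "machines ->", hours_between_failures(n), "hours between failures")
```

Under this assumption, 365 machines give one failure every 24 hours, and 8,760 machines (365 × 24) give one failure per hour, which is why reliability has to be handled in software rather than by buying better hardware.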
New Approach to Data
Keep all data
New Kinds of Analysis
Schema-on-read style
New analysis
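As a rough illustration of the schema-on-read idea, the sketch below keeps raw records untouched and applies a schema only when the data is read. The record layout, field names, and helper function are hypothetical and chosen only for this example.

```python
import json

# Raw events are kept exactly as they arrived; no schema is enforced on write.
raw_store = [
    '{"user": "alice", "action": "click", "ms": 120}',
    '{"user": "bob", "action": "view"}',
    '{"user": "alice", "action": "view", "ms": 80, "page": "/home"}',
]


def read_with_schema(lines, schema):
    """Apply a schema (field name -> default value) while reading."""
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field, default) for field, default in schema.items()}


# Two different analyses project two different structures onto the same data.
actions_schema = {"user": None, "action": None}
latency_schema = {"user": None, "ms": 0}

actions = list(read_with_schema(raw_store, actions_schema))
total_ms = sum(r["ms"] for r in read_with_schema(raw_store, latency_schema))
print(actions[1])
print(total_ms)
```

Because nothing is discarded at write time, a new analysis only needs a new schema, not a reload of the data.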
Small Data & Complex Algorithm vs. Large Data & Simple Algorithm
Lecture #2
Apache Framework Hadoop Modules
Apache Framework Basic Modules
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce
Lecture #3
Hadoop Distributed File System (HDFS)
HDFS: Hadoop Distributed File System
Distributed, scalable, and portable file system written in Java for the Hadoop framework
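To make "distributed" more concrete, here is a minimal sketch of how a file might be cut into blocks and each block copied to several machines. The 128 MB block size and replication factor of 3 are assumed here as common defaults, the node names are hypothetical, and the round-robin placement is a simplification rather than HDFS's actual placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (a common HDFS default)
REPLICATION = 3                 # assumed replication factor (a common default)


def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split a file into blocks and assign each block to `replication` nodes.

    Placement is plain round-robin, a simplification for illustration only.
    """
    num_blocks = -(-file_size // block_size)  # ceiling division
    plan = []
    for b in range(num_blocks):
        replicas = [nodes[(b + r) % len(nodes)] for r in range(replication)]
        plan.append((b, replicas))
    return plan


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical machine names
plan = plan_blocks(300 * 1024 * 1024, nodes)
for block_id, replicas in plan:
    print(block_id, replicas)
```

A 300 MB file becomes three blocks, and losing any single node still leaves two copies of every block.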
MapReduce Engine
• Job Tracker
• Task Tracker
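In the classic MapReduce engine, the Job Tracker hands out map and reduce tasks to Task Trackers running on the cluster nodes. The sketch below only imitates the three logical phases (map, shuffle, reduce) in a single Python process with a word count; it does not use the Hadoop API, and all function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)


lines = ["Hadoop stores data", "Hadoop processes data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

On a real cluster each input line (or block) would be mapped on the node that stores it, which is the "moving computation to data" idea from Lecture #1.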
Apache Hadoop NextGen MapReduce (YARN)
What is YARN?
• YARN enhances the power of a Hadoop compute cluster
Scalability
MapReduce Compatibility
Improved cluster utilization
Supports Other Workloads
Lecture #4
The Hadoop “Zoo”
How to figure out the Zoo??
Original Google Stack
Facebook’s Version of the Stack
Yahoo’s Version of the Stack
LinkedIn’s Version of the Stack
Cloudera’s Version of the Stack
(Stack diagrams; each is organized into the same layers: Data Integration, Coordination, Languages and Compilers, Data Store.)
Lecture #5
Hadoop Ecosystem Major Components
Apache Sqoop
• Tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
HBase
• Column-oriented database management system
• Key-value store
• Based on Google Bigtable
• Can hold extremely large data
• Dynamic data model
• Not a relational DBMS
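One way to picture the "key-value store" and "dynamic data model" points is as a nested map from row key to column family to column qualifier. The sketch below is a plain in-memory Python analogy, not the HBase client API; the table contents and helper names are hypothetical.

```python
from collections import defaultdict

# table[row_key][column_family][qualifier] = value
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, qualifier, value):
    """Store one cell; columns are created on the fly."""
    table[row_key][family][qualifier] = value


def get(row_key, family, qualifier, default=None):
    """Read one cell without creating missing rows or columns."""
    return table.get(row_key, {}).get(family, {}).get(qualifier, default)


put("user#1", "info", "name", "Alice")
put("user#1", "info", "email", "alice@example.org")
put("user#2", "info", "name", "Bob")
put("user#2", "info", "logins", 7)  # a column that other rows simply do not have

print(get("user#2", "info", "logins"))
print(get("user#1", "info", "logins"))
for row_key in sorted(table):  # rows are scanned in key order
    print(row_key, dict(table[row_key]))
```

Rows need not share the same columns, which is the main contrast with a relational table. In HBase itself, column families are declared when the table is created, while qualifiers within a family may vary from row to row.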
Pig
• High-level programming on top of Hadoop MapReduce
• The language: Pig Latin
• Expresses data analysis problems as data flows
• Originally developed at Yahoo in 2006
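The "data flow" style can be sketched as a chain of small steps, each consuming the output of the previous one. The example below is written in Python rather than Pig Latin, and the input records and step names are hypothetical; it only mirrors the load, filter, group, aggregate shape that such scripts typically have.

```python
from collections import defaultdict

# Hypothetical input records: (user, url, seconds_on_page)
visits = [
    ("alice", "/home", 12),
    ("bob", "/home", 3),
    ("alice", "/docs", 40),
    ("carol", "/docs", 25),
]


def load(records):
    """Step 1: read records as a stream of tuples."""
    yield from records


def filter_long(rows, min_seconds):
    """Step 2: keep only visits lasting at least min_seconds."""
    return (row for row in rows if row[2] >= min_seconds)


def group_by_url(rows):
    """Step 3: group the remaining rows by URL."""
    groups = defaultdict(list)
    for _user, url, seconds in rows:
        groups[url].append(seconds)
    return groups


def average(groups):
    """Step 4: compute one aggregate per group."""
    return {url: sum(vals) / len(vals) for url, vals in groups.items()}


result = average(group_by_url(filter_long(load(visits), 10)))
print(result)
```

Each step describes what happens to the data next, and a system like Pig is then responsible for turning such a flow into MapReduce jobs.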
Pig for ETL
Apache Hive
• Data warehouse software that facilitates querying and managing large datasets residing in distributed storage (HDFS)
• Provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL
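To give a feel for what "SQL-like" means in practice, the sketch below runs an aggregate query from Python. SQLite is used purely as a local stand-in so the example is runnable; it is not Hive, and the table name and columns are hypothetical. Simple SELECT and GROUP BY queries of this shape look much the same in HiveQL.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a table defined over files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (visitor TEXT, url TEXT, seconds INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [("alice", "/home", 12), ("bob", "/home", 3), ("alice", "/docs", 40)],
)

query = """
    SELECT url, COUNT(*) AS views, SUM(seconds) AS total_seconds
    FROM page_views
    GROUP BY url
    ORDER BY url
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

The point is the declarative style: the analyst states the result wanted, and the engine decides how to compute it over the stored data.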
Oozie
• Workflow scheduler system to manage Apache Hadoop jobs
• Oozie Coordinator jobs
• Supports MapReduce, Pig, Apache Hive, Sqoop, etc.
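A workflow scheduler essentially runs jobs in an order that respects their dependencies. The sketch below models a workflow as a dependency graph and derives a valid execution order with Python's standard `graphlib` module (Python 3.9+). The job names are hypothetical and this is not Oozie's own workflow definition format.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_report": {"pig_clean"},
    "mapreduce_index": {"pig_clean"},
    "notify": {"hive_report", "mapreduce_index"},
}

executed = []


def run(job):
    """Placeholder for submitting a job; here it only records the name."""
    executed.append(job)


# static_order() yields every job after all of its dependencies.
for job in TopologicalSorter(workflow).static_order():
    run(job)

print(executed)
```

The two middle jobs have no dependency on each other, so a scheduler is free to run them in either order or in parallel.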
ZooKeeper
• Provides operational services for a Hadoop cluster
• Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
Flume
• Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
Additional Cloudera Hadoop Components
Impala
• Cloudera's open-source massively parallel processing (MPP) SQL query engine for Apache Hadoop
Spark: The New Paradigm
Apache Spark™ is a fast and general engine for large-scale data processing
Spark Benefits
• Multi-stage in-memory primitives provide performance up to 100 times faster for certain applications
• Allows user programs to load data into a cluster's memory and query it repeatedly
• Well suited to machine learning
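The benefit of loading data into memory once and querying it repeatedly can be shown with a small single-machine sketch. This is plain Python, not the Spark API; the class and the loader function are hypothetical stand-ins for a cached distributed dataset and a slow read from disk.

```python
class CachedDataset:
    """Loads data once, keeps it in memory, and serves repeated queries."""

    def __init__(self, loader):
        self._loader = loader
        self._data = None
        self.loads = 0  # how many times the slow load actually ran

    def _get(self):
        # Run the slow load only on first use, then reuse the in-memory copy.
        if self._data is None:
            self._data = list(self._loader())
            self.loads += 1
        return self._data

    def query(self, predicate):
        return [x for x in self._get() if predicate(x)]


def slow_load():
    """Stand-in for reading a large file from disk."""
    return range(1, 11)


ds = CachedDataset(slow_load)
evens = ds.query(lambda x: x % 2 == 0)
big = ds.query(lambda x: x > 7)
print(evens, big, ds.loads)
```

Iterative machine learning algorithms pass over the same data many times, so avoiding a disk read on every pass is where most of the speed-up comes from.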
Up Next
Tour of Cloudera's QuickStart VM
Apache Pig
• Overview of apps, high-level languages, services
• Databases/Stores
• Querying
• Machine Learning
• Graph Processing
Databases/Stores
• Avro: data structures within the context of Hadoop MapReduce jobs
• HBase: distributed non-relational database
• Cassandra: distributed data management system
Querying
• Pig: Platform for analyzing large data sets in HDFS
• Hive: Query and manage large datasets
• Impala: High-performance, low-latency SQL querying of data in Hadoop file formats
• Spark: General processing engine for streaming, SQL, machine learning and graph processing.
Machine Learning, Graph Processing
• Giraph: Iterative graph processing using the Hadoop framework
• Mahout: Framework for machine learning applications using Hadoop, Spark
• Spark: General processing engine for streaming, SQL, machine learning and graph processing.
Apache Pig
• Pig components: Pig Latin and infrastructure layer
• Typical Pig use cases
• Run Pig with Hadoop integration.
Apache Pig
• Platform for data processing
• Pig Latin: High-level language
• Pig execution environment: Local, MapReduce, Tez
• Built-in operators and functions
• Extensible
Pig Usage Areas
• Extract, Transform, Load (ETL) operations
• Manipulating, analyzing “raw” data
• Widely used, extensive list at: https://cwiki.apache.org/confluence/display/PIG/PoweredBy
Pig Example
• Load passwd file and work with data.
• Step 1: hdfs dfs -put /etc/passwd /user/cloudera
(Note: this is a single line)
• Step 2: pig -x mapreduce
Pig Example
• Puts you in “grunt” shell.
• clear
• Load the file: A = load '/user/cloudera/passwd' using PigStorage(':');
• Pick subset of values: B = foreach A generate $0, $4, $5; dump B;
Pig Example
• Backend Hadoop job info.
Pig Example
• Outputs username, full name, and home directory path.
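To make the positional fields concrete, here is a minimal plain-Python sketch of what the load and foreach steps compute. It is not Pig code: the `load` and `project` helpers and the two sample records are made up for illustration, and only the field positions ($0, $4, $5) and the ':' delimiter come from the example.

```python
from typing import List, Tuple

# Two made-up lines in /etc/passwd format (fields separated by ':').
SAMPLE_PASSWD = """\
root:x:0:0:root:/root:/bin/bash
cloudera:x:501:501:Cloudera User:/home/cloudera:/bin/bash
"""


def load(text: str, delimiter: str = ":") -> List[List[str]]:
    """Rough analogue of: A = load ... using PigStorage(':');"""
    return [line.split(delimiter) for line in text.splitlines() if line]


def project(rows: List[List[str]]) -> List[Tuple[str, str, str]]:
    """Rough analogue of: B = foreach A generate $0, $4, $5;"""
    return [(r[0], r[4], r[5]) for r in rows]


A = load(SAMPLE_PASSWD)
B = project(A)
for record in B:  # rough analogue of: dump B;
    print(record)
```

Field $0 is the username, $4 the full name, and $5 the home directory, which matches the three columns in the output.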
Pig Example
• Can store this processed data in HDFS
• Command: store B into 'userinfo.out';
Pig Example
• Verify the new data is in HDFS.
Summary
• Used interactive shell for Pig example
• Can also run using scripts
• Also as embedded programs in a host language (Java for example).
Apache Hive
• Query and manage data using HiveQL
• Run interactively using beeline.
• Other run mechanisms
Apache Hive
• Data warehouse software
• HiveQL: SQL-like language to structure and query data
• Execution environment: MapReduce, Tez, Spark
• Data in HDFS, HBase
• Custom mappers/reducers
Hive Usage Areas
• Data mining, analytics
• Machine learning
• Ad hoc analysis
• Widely used, extensive list at: https://cwiki.apache.org/confluence/display/Hive/PoweredBy
Hive Example
• Revisit the /etc/passwd file example from the Pig lesson video.
• Start by loading file into HDFS: hdfs dfs -put /etc/passwd /tmp/
• Run beeline to access interactively: beeline -u jdbc:hive2://
Hive Example
• Copy passwd file to HDFS
• Running interactively using beeline.
Hive Example: Command list
CREATE TABLE userinfo ( uname STRING, pswd STRING, uid INT, gid INT, fullname STRING, hdir STRING, shell STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ':'
STORED AS TEXTFILE;
LOAD DATA INPATH '/tmp/passwd' OVERWRITE INTO TABLE userinfo;
SELECT uname, fullname, hdir FROM userinfo ORDER BY uname;
Hive Example
• Run the CREATE TABLE command
Hive Example
• Load passwd file from HDFS
Hive Example
• Select info: this launches the Hadoop job and prints the output once it's complete.
Hive Example
• Completed MapReduce jobs; output shows username, fullname, and home directory.
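The create, load, and select sequence from the command list can be tried locally with a small stand-in. The sketch below uses Python's built-in `sqlite3` module instead of Hive, so it only mirrors the table layout and the final query; the three sample records are made up, the Hive types STRING and INT are mapped to SQLite's TEXT and INTEGER, and splitting on ':' plays the role of the ROW FORMAT DELIMITED clause.

```python
import sqlite3

# Made-up sample records in /etc/passwd format.
SAMPLE_PASSWD = """\
vcsa:x:69:69:virtual console memory owner:/dev:/sbin/nologin
cloudera:x:501:501:Cloudera User:/home/cloudera:/bin/bash
root:x:0:0:root:/root:/bin/bash
"""

conn = sqlite3.connect(":memory:")

# Same column names as the Hive table in the command list.
conn.execute(
    "CREATE TABLE userinfo (uname TEXT, pswd TEXT, uid INTEGER, gid INTEGER,"
    " fullname TEXT, hdir TEXT, shell TEXT)"
)

# Stand-in for LOAD DATA: split each line on ':' and insert it as one row.
rows = [line.split(":") for line in SAMPLE_PASSWD.splitlines() if line]
conn.executemany("INSERT INTO userinfo VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Same projection and ordering as the HiveQL SELECT.
result = conn.execute(
    "SELECT uname, fullname, hdir FROM userinfo ORDER BY uname"
).fetchall()
for row in result:
    print(row)
```

One difference from the sketch: in Hive, LOAD DATA INPATH moves the file from /tmp/passwd into the table's storage location rather than copying rows.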
Summary
• Used beeline for interactive Hive example
• Can also use
  • Hive command line interface (CLI)
  • HCatalog
  • WebHCat.
Apache HBase
• HBase features
• Run interactively using HBase shell.
• List other access mechanisms
Apache HBase
• Scalable data store
• Non-relational distributed database
• Runs on top of HDFS
• Compression
• In-memory operations: MemStore, BlockCache
HBase Features
• Consistency
• High Availability
• Automatic Sharding
HBase Features
• Replication
• Security
• SQL-like access (Hive, Spark, Impala)
HBase Example
• Start HBase shell: hbase shell
HBase Example
• Create Table: create 'userinfotable',{NAME=>'username'},{NAME=>'fullname'},{NAME=>'homedir'}
HBase Example
• Add data: put 'userinfotable','r1','username','vcsa'
HBase Example
• Scan table after data entry: scan 'userinfotable'
HBase Example
• Select info from all rows corresponding to column family 'fullname'.
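The put and scan steps are easier to follow with the data model in view: a table maps a row key to a set of columns, each holding a value. The sketch below imitates that model with nested Python dictionaries. It is not the HBase API: the `put` and `scan` helpers are hypothetical, every sample value except 'vcsa' is made up, and real HBase additionally versions each cell with a timestamp, which is left out here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# table[row_key][column] -> value
Table = Dict[str, Dict[str, str]]


def put(table: Table, row: str, column: str, value: str) -> None:
    """Store one cell, loosely like: put 'userinfotable','r1','username','vcsa'"""
    table[row][column] = value


def scan(table: Table, column: str = "") -> List[Tuple[str, str, str]]:
    """Return cells in row-key order, optionally restricted to one column."""
    cells = []
    for row in sorted(table):
        for col in sorted(table[row]):
            if not column or col == column:
                cells.append((row, col, table[row][col]))
    return cells


userinfotable: Table = defaultdict(dict)
put(userinfotable, "r1", "username", "vcsa")
put(userinfotable, "r1", "fullname", "Virtual Console")  # made-up value
put(userinfotable, "r1", "homedir", "/dev")              # made-up value
put(userinfotable, "r2", "username", "root")             # made-up row
put(userinfotable, "r2", "fullname", "root")
put(userinfotable, "r2", "homedir", "/root")

all_cells = scan(userinfotable)                # like: scan 'userinfotable'
fullnames = scan(userinfotable, "fullname")    # only the 'fullname' cells
for cell in fullnames:
    print(cell)
```

Each put writes a single cell, so filling one row with three columns takes three puts.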
Summary
• We used: Apache HBase Shell
• Other options:
  • HBase, MapReduce
  • HBase API
  • HBase External API