Hadoop - An Introduction

1. Shankar Radhakrishnan
HCL Technologies
Hadoop - An Introduction

2. State of the Data
What is Hadoop
Hadoop Ecosystem
References
Agenda
3. Data driven businesses
Businesses have been collecting information all the time
Mine more == Collect more (and vice-versa)
Challenges
Application Complexities
Data growth
Infrastructure
Economics
Need of the day
State of the data
4. Data driven business
Businesses have been collecting information all the time
Challenges
Data growth
Infrastructure
Economics
State of the data
5. Applications
Searches, Message posts, Comments, Emails, Blogs, Photos, Video Clips, Product Listings
ERP, CRM, Databases, Internal Applications, Customer/Consumer facing products
Mobile
Context
Web, Customers, Products, Business Systems, Processes, Services
Support Systems
CRM, SOA, Recommendation Systems/Processes, Data Warehouses, Business Intelligence, BPM
Data driven business
Challenges
Data growth
Infrastructure
Economics
State of the data
7. Drivers
ROI
Customer Retention
Product Affinity
Market Trends
Research Analysis
Customer/Consumer Analytics
Process
Clustering
Classification
Build Relationships
Regression
Types
Structured
Semi-structured
Unstructured
Mine more
Challenges
Data growth
Infrastructure
Economics
State of the data
9. Complex Applications
Data integration is a worthwhile but complex problem to solve
Data Growth
Growth is exponential
Infrastructure
Availability
Unscalable hardware
Economics
Managing high data volume comes at a price
Failures are very costly
Challenges
10. System that can handle high-volume data
System that can perform complex operations
Scalable
Robust
Highly Available
Fault Tolerant
Cheap
Need of the day
11. Top level Apache project
Open source
Inspired by Google's white papers on Map/Reduce (MR) and the Google File System (GFS)
Originally developed to support the Apache Nutch search engine
Software framework written in Java
Designed
For sophisticated analysis
To deal with structured and unstructured complex data
12. Runs on commodity hardware
Shared-nothing architecture
Scale hardware whenever you want
System compensates for hardware scaling and issues (if any)
Run large-scale, high-volume data processes
Scales well with complex analysis jobs
Handles failures
Ideal to consolidate data from both new and legacy data sources
Value to the business
Why Hadoop?
13. Hadoop in an enterprise - Example
14. HDFS: Hadoop Distributed File System
Map/Reduce: Software framework for clustered, distributed data processing
ZooKeeper: Coordination service for distributed applications
Avro: Data serialization
Chukwa: Data collection system to monitor distributed systems
HBase: Data storage for large distributed tables
Hive: Data warehousing infrastructure
Pig: High-level query language
Hadoop Ecosystem
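As a rough illustration of the Map/Reduce entry above, the sketch below runs a word count in plain Python, in a single process. It does not use Hadoop's API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample sentences are assumptions made up for this example. It only shows the shape of the model: map emits key/value pairs, the framework groups them by key, and reduce collapses each group.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group emitted values by key (done by the framework in a real cluster)."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    mapped = []
    for doc in documents:
        mapped.extend(map_phase(doc))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical input: each string stands in for one input split.
counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

In a real cluster the map calls would run in parallel on the nodes that hold the data, and the shuffle would move pairs across the network; here everything is sequential to keep the idea visible.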
15. Master/Slave Architecture
Runs on commodity hardware
Fault Tolerant
Handle large volumes of data
Provides High Throughput
Streaming data-access
Simple file coherency model
Portable to heterogeneous hardware and software
Robust
Handles disk failures, replication (& re-replication)
Performs cluster rebalancing, data integrity checks
HDFS: Hadoop Distributed File System
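The replication and re-replication point above can be sketched in a few lines of Python. This is a toy model, not HDFS code: the block size of 64 MB, the replication factor of 3, the round-robin placement and the node names dn1 to dn4 are all assumptions chosen for illustration, and real HDFS placement also takes racks into account.

```python
import math

BLOCK_SIZE_MB = 64       # assumed block size, for illustration only
REPLICATION_FACTOR = 3   # assumed number of copies per block

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return how many fixed-size blocks are needed to hold a file."""
    return math.ceil(file_size_mb / block_size_mb)

def place_replicas(num_blocks, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes len(nodes) >= replication so the chosen nodes are distinct.
    """
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + i) % len(nodes)] for i in range(replication)]
    return placement

def re_replicate(placement, failed_node, live_nodes, replication=REPLICATION_FACTOR):
    """Drop a failed node, then copy under-replicated blocks to other live nodes."""
    for holders in placement.values():
        holders[:] = [node for node in holders if node != failed_node]
        candidates = [node for node in live_nodes if node not in holders]
        while len(holders) < replication and candidates:
            holders.append(candidates.pop(0))
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]                   # hypothetical data nodes
placement = place_replicas(split_into_blocks(200), nodes)
re_replicate(placement, "dn2", ["dn1", "dn3", "dn4"])  # simulate losing dn2
print(placement)
```

After the simulated failure every block is back to three copies, none of them on the failed node, which is the behaviour the slide calls re-replication.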
16. HDFS Example
Name node
File system operations
17. Maps data-nodes
Data node
Process read/write
18. Handles data-blocks
19. Replication
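The division of labour between the name node and the data nodes can be sketched as follows. This is a simplified model under stated assumptions, not the HDFS client API: the NameNode and DataNode classes, their method names, the block id format and the path /logs/a.txt are all hypothetical. The point is only that the name node keeps metadata (which blocks make up a file and where they live) while the data nodes hold the blocks and serve reads and writes.

```python
class DataNode:
    """Holds block contents and serves read/write requests for them."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Keeps metadata only: the blocks of each file and the nodes holding them."""

    def __init__(self):
        self.file_to_blocks = {}
        self.block_to_nodes = {}

    def register(self, path, block_id, node):
        blocks = self.file_to_blocks.setdefault(path, [])
        if block_id not in blocks:
            blocks.append(block_id)
        self.block_to_nodes.setdefault(block_id, []).append(node)

    def locate(self, path):
        """Return (block_id, nodes) pairs in file order."""
        return [(b, self.block_to_nodes[b]) for b in self.file_to_blocks[path]]


def write_file(name_node, data_nodes, path, chunks):
    """Store each chunk as one block on a data node and record it in the name node."""
    for i, chunk in enumerate(chunks):
        block_id = f"{path}#blk{i}"                # hypothetical block id format
        node = data_nodes[i % len(data_nodes)]
        node.write_block(block_id, chunk)
        name_node.register(path, block_id, node)


def read_file(name_node, path):
    """Ask the name node where the blocks are, then read them from the data nodes."""
    return "".join(nodes[0].read_block(b) for b, nodes in name_node.locate(path))


name_node = NameNode()
data_nodes = [DataNode("dn1"), DataNode("dn2")]
write_file(name_node, data_nodes, "/logs/a.txt", ["hello ", "hadoop ", "world"])
print(read_file(name_node, "/logs/a.txt"))
```

File contents never pass through the name node in this sketch; it answers only the lookup, and the data itself is read directly from the data nodes.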
