Hadoop Beyond Hype

Complex Adaptive Systems Conference, Nov 16, 2012

Viswa Sharma, Solutions Architect, Tata Consultancy Services




Agenda

• What is Hadoop?
• Why Hadoop? The Net Generation is here
• Sizing the Hadoop
• Gartner Hadoop Hype Cycle
• TCS View Point
• Hadoop Eco System Landscape
• Examples of Uses of Hadoop
• Transformational Platform
• Ad Hoc Analysis
• Analytics with Hadoop
• Applications of Hadoop Analytics
• Near Real Time Analysis
• What is the Market?
• Thank You


What is Hadoop? 


SCALE OUT COMPUTING PLATFORM WHICH PROCESSES INTERNET SIZE DATA

Hadoop is the Name of a Toy Elephant

COMMODITY HARDWARE

PARALLEL PROGRAMMING ENVIRONMENT GOOGLE MAP/REDUCE

OPEN SOURCE SOFTWARE

PARALLEL FILE SYSTEM MODELED AFTER GOOGLE FILE SYSTEM

Given To 
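The map/reduce programming model named on this slide can be illustrated with a minimal single-process sketch. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are assumptions chosen for illustration. On a real cluster the framework would run the map and reduce phases in parallel across nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different node; here we loop sequentially.
    emitted = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(emitted)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input records.
    docs = ["big data on commodity hardware", "big clusters of commodity hardware"]
    print(word_count(docs))
```

Because the map step looks at one record at a time and the reduce step at one key at a time, both can be spread over many commodity machines, which is the scale-out property the slide describes.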


Why Hadoop? The Net Generation is here

Information is exploding all around, but the challenge is to understand it.

The Net Generation is inter-connected on a variety of Web-based and digital channels.

Big Data: Web Scale
• 50 billion web pages
• 800 million Facebook users
• 1000 million Facebook pages
• 200 million Twitter accounts
• 100 million tweets per day
• 5 billion Google queries per day
• Millions of servers, petabytes of data

Varieties of Data
• Video / Audio
• Images / Pictures
• Diverse internal and external data

Sources of Data
• News / Feeds / Blogs / Forums
• Groups / Polls / Chats / Wiki


Sizing the Hadoop 

Source: Pawyi Lee


Hadoop Hype Cycle Starts

Gartner Hype Cycle  2012 


TCS View Point: Hadoop Technology is here now…

Big Data technology handles data at extreme scale and is characterized by

• Massively parallel computing to divide and conquer workloads
• Extreme flexibility, allowing unlimited data manipulation and transformation
• Massive scalability in terms of both technology and cost

Hadoop: massively parallel processing capability, running on commodity hardware

HBase and Hadoop/HDFS are designed to store and manage massive amounts of data

Hive, Mahout and R enable query, analysis and running in-memory compute-intensive applications

The ecosystem of Hadoop technology is affordable, and within the reach of companies


Hadoop Eco System Landscape

• NoSQL
• Hadoop Distributions
• Cloud Distributions
• Distributed File System
• Map-Reduce
• Appliance / MR Re-write
• Analytics / Visualization
• Data Integration
• Query-Oriented Data Warehouse
• CEP
• Search
• Tools
• Languages / Libraries


Examples of Uses of Hadoop …

Insurance
• Claims analysis & Premium forecasting
• Claims fraud detection & Revenue comparison
• Overall risk analysis & Re-insurance risk assessment
• Policy pricing & Customer retention

Travel, Transportation & Hospitality
• Better travel searches
• Geo-fencing
• Cross selling and up-selling
• Intelligent traffic management

Government
• Fraud detection and cyber security
• Compliance and regulatory analysis
• Energy consumption and carbon footprint management
• Disaster management

Energy, Resources & Utilities
• Weather impact analysis on power generation
• Oil rig data monitoring
• Smart meter data analysis
• Terrain data analysis for wind energy

Hi Tech
• Process control for microchip fabrication
• Network management
• Supply chain management and analysis
• New product development
• Content management solutions

Smart Grids


Hadoop as Transformation Platform in ETL

Diagram labels: Transactional Systems, Hadoop Cluster (HDFS; MapReduce / Hive / Pig), Data Warehouse

Within Hadoop Ecosystem

MapReduce / Hive / Pig could be used to transform data within the distributed file system (HDFS).

Tools like Sqoop could be leveraged to load data from and to HDFS.

Fewer, higher-end nodes


Hadoop as an ad-hoc analysis platform

Diagram labels: Transactional Systems, Hadoop Cluster (HDFS; MapReduce / Hive / Pig), Data Warehouse

Tools like Sqoop could be leveraged to load data from and to HDFS.

MapReduce / Hive / Pig could be used to transform data within the distributed file system (HDFS). This could give the business analytics team a platform for innovation.

Higher number of nodes for larger storage

Data at lowest grain


Analytics With Hadoop

Prescriptive (What should happen?): Optimizing outcomes
• Optimization
• Simulation

Predictive (What will happen?): Identifying possible outcomes
• Domain Expertise
• Text Analytics
• Data Mining
• Knowledge
• Predictive Modeling
• Statistical Analysis
• Visual Analytics
• Forecasting

Descriptive (What has happened?): Describing and analyzing outcomes
• Analysis, Drill-Down, Ad-Hoc Reporting
• Dashboards and Scorecards
• Visual Analytics


Applications for Hadoop Analytics

Homeland Security

Finance

Smarter Healthcare

Multi-channel sales

Telecom

Manufacturing

Traffic Control

Trading Analytics

Fraud and Risk

Log Analysis

Search Quality

Retail: Churn, NBO


Hadoop Near Real Time Analytics

Diagram labels: Transactional Systems; External Inputs (incl. Social Media); Online; Offline; Rule Application; Rule Discovery

Complex Event Processing
• Rule / Pattern Matching on Streams.
• Dist Processing: Processing is distributed on a set of nodes and not the data.

Distributed Stream Processing [using MR]
• Rule / Pattern Discovery on Streams.
• Dist Processing: Both processing and data are distributed on a set of nodes.
• e.g. C-MR (academic project)

[Time Series] Mining and Rule Discovery
• Rule / Pattern Discovery [on Time Series]
• Dist Processing: Map-Reduce or scalable time-series pattern mining.
• Batch Map-Reduce Processing

Example uses
• Fraud Detection
• Online Price Mgmt
• Yield Management
• Learn Fraud Patterns
• Demand Signal Refinement
• Real Time Self Learning Systems
• Complex / Dynamic Pattern Matching, e.g. Trading Patterns, Mining Current Influencers
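The split between offline rule discovery and online rule application on this slide can be sketched as follows. This is a minimal single-process illustration, not a CEP engine or a Map-Reduce job: the threshold rule (mean plus `k` standard deviations), the function names `discover_rule` and `apply_rule`, and the sample amounts are all assumptions made for the example.

```python
from statistics import mean, pstdev


def discover_rule(historical_amounts, k=3.0):
    """Offline step: learn a simple threshold rule from a batch of past transactions.

    Assumption: an amount more than k standard deviations above the mean is suspicious.
    """
    mu = mean(historical_amounts)
    sigma = pstdev(historical_amounts)
    return mu + k * sigma


def apply_rule(stream, threshold):
    """Online step: match each incoming event against the learned rule as it arrives."""
    for event_id, amount in stream:
        if amount > threshold:
            yield event_id, amount


if __name__ == "__main__":
    history = [20.0, 25.0, 22.0, 30.0, 27.0, 24.0]  # hypothetical past amounts
    threshold = discover_rule(history)
    incoming = [("t1", 23.0), ("t2", 500.0), ("t3", 29.0)]  # hypothetical live events
    for event_id, amount in apply_rule(incoming, threshold):
        print(f"flagged {event_id}: {amount}")
```

In the arrangement the slide describes, the discovery step would run as a batch job over historical data, while the application step would sit on the event stream and only needs the small learned rule, not the full history.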


What is the Market?


5 December, 2012

Thank You