Using Hadoop for Cognitive Analytics Pedro Desouza, Ph.D. Associate Partner Big Data & Analytics Center of Competence IBM Global Business Services June 29, 2016

Upload: hadoop-summit

Post on 07-Jan-2017


Page 1: Using Hadoop for Cognitive Analytics

Using Hadoop for Cognitive Analytics

Pedro Desouza, Ph.D.

Associate Partner

Big Data & Analytics Center of Competence

IBM Global Business Services

June 29, 2016

Page 2: Using Hadoop for Cognitive Analytics

© 2016 IBM Corporation

Global Business Services

2

Outline

• Metro Pulse: Enhancing Decision Making Processes With Hyperlocal Data
• Dashboards
• Use Cases In Multiple Industries
• Geographic Hierarchies, External Metrics, and Mapping Representation
• Integration Of External and Customer-Specific Metrics
• Solution Architecture
• Technological Components
• Micro Services for Data Ingestion and Curation

Page 3: Using Hadoop for Cognitive Analytics


Improving Decision Making Accuracy by Combining Business Metrics with Hyperlocal Data

Weather

Social Media Sentiment

Economics…

Events

Thousands of them together, on a single repository

Other Points of Interest

Subway Stations

Demographics

Hyperlocal Data

Business decisions can be made on precise hyperlocal context for each store

Store Context

Combining business metrics of each store with hyperlocal data provides insights via visual inspection and advanced analytics

Demand Forecast, Marketing Campaign, Distribution Plan and many other business decisions are usually based on aggregate levels of data that don’t precisely consider the context where the business operates.

Stores in London

Page 4: Using Hadoop for Cognitive Analytics


Improving Forecast Accuracy with External Data

Traditional Method: Neural Net, ARIMA…

Forecast based on Neural Network with External Data:

23.9% better accuracy

Actuals of a retail store

Riemer, M., Vempaty, A., Calmon, F., Heath, F., Hull, R., and Khabiri, E., Correcting Forecasts with Multifactor Neural Attention, Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. http://jmlr.org/proceedings/papers/v48/riemer16.pdf

T. J. Watson IBM Research Center

Page 5: Using Hadoop for Cognitive Analytics


Retail Use Case: Identification of Low/High Performers

Groups of similar stores in locations with similar hyperlocal contexts

Category: “All Products”, “Electronics”, or “Cosmetics”…

Same color on the map → Similar context considering all external metrics

[Chart: for each of Group 1 to Group 4, the Top Performer is compared with the Group Baseline; the gap is shown as the Potential Revenue Increase of that group (Rev Inc G1, Rev Inc G2, …).]

Micro-Segmentation + External Metrics → Higher Accuracy for Root Cause Analysis and Revenue Increase
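The group comparison can be sketched in a few lines, assuming each store already carries a group label (stores with similar hyperlocal context) and a revenue figure. The slide does not define the baseline or the revenue-increase formula; here the baseline is assumed to be the group median, and the potential increase is assumed to be the total gap between below-baseline stores and that baseline. The function name, store IDs, and numbers are all illustrative.

```python
from collections import defaultdict
from statistics import median


def analyze_groups(stores):
    """stores: iterable of (store_id, group, revenue) tuples."""
    by_group = defaultdict(list)
    for store_id, group, revenue in stores:
        by_group[group].append((store_id, revenue))

    report = {}
    for group, members in by_group.items():
        # Assumption: the group baseline is the median revenue of the group.
        baseline = median(rev for _, rev in members)
        top_id, _ = max(members, key=lambda m: m[1])
        low = [sid for sid, rev in members if rev < baseline]
        # Assumption: potential increase = lifting low performers to baseline.
        potential = sum(baseline - rev for _, rev in members if rev < baseline)
        report[group] = {
            "baseline": baseline,
            "top_performer": top_id,
            "low_performers": low,
            "potential_increase": potential,
        }
    return report


stores = [
    ("s1", "G1", 100.0), ("s2", "G1", 80.0), ("s3", "G1", 120.0),
    ("s4", "G2", 50.0), ("s5", "G2", 70.0),
]
report = analyze_groups(stores)
for group, row in sorted(report.items()):
    print(group, row)
```

Comparing stores only within their own group is the point of the micro-segmentation: a low performer is judged against stores that share its external context, not against the chain-wide average.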

Page 6: Using Hadoop for Cognitive Analytics


Population Movement Analytics

Store in Dallas

Percentage of visits based on buyer’s Home Location, obtained via anonymous app use analysis.

[Map: share of visits by home region: 15%, 20%, 7%, 9%, 12%, 5%, 18%. Annotations: “Close, but few visits. Why?” and “Potential location for a new store.”]

Other Use Cases

Advertisement
• Population demographics
• Where people are and go

Market Campaign
• Interests of each region (% of visits)
• Population density

City Planning
• Traffic growth
• Precise route
• Emergency Services

Page 7: Using Hadoop for Cognitive Analytics


Telecommunication Use Cases: Quality of Services (Tower Location)

[Map legend: Affluent Houses Life Time Revenue (LTR): High, Medium, Low. Congestion: High, Medium, Low.]

Intuition: New tower

Max LTR: Ideal position for a new tower

Congestion

Famous band free show, Saturday, 9-11 PM: Tower will be over capacity

Schedule a mobile base antenna during event

Page 8: Using Hadoop for Cognitive Analytics


City Analytics Industry Use Cases

Use cases are countless…

Banking and Finance
1. Branch Segmentation / New Market Opportunities
2. Cash Demand Forecasting
3. Promotion Customization
4. Staffing Mix / Specialty Account Services
5. Customer Churn
6. ATM Kiosk-to-Location Ratio Optimization

Retail
1. Uncaptured Opportunity
2. Assortment Optimization
3. Out of Stock
4. Demand Forecasting
5. Dynamic Pricing
6. Promotion Effectiveness

Insurance
1. Risk Management and Pricing Optimization
2. Portfolio Suitability
3. Demand Forecasting
4. Staffing Mix / Specialty Account Services
5. Damage Forecasting

Consumer Packaged Goods
1. Product mix
2. Out of Stock
3. Visibility
4. Expansion Opportunity
5. Customer Churn
6. Promotion Effectiveness

Travel and Transportation
1. Booking Traffic Forecasting Based on POIs
2. Service Relative Pricing Model
3. Promotion Customization
4. Amenity Mix
5. Cancellation Forecasting

Telecommunications
1. Customer Churn
2. Package/Service Offering Optimization
3. Coverage Optimization
4. New Product Demand
5. Device Repair Services
6. Service Outage Forecasting

Page 9: Using Hadoop for Cognitive Analytics


Geographic Hierarchy, External Metrics, and Polygons

Nodes of the geographic hierarchy:

Level 0: New York

Level 1: Manhattan, Brooklyn, Queens

Level 2: Soho, Midtown, Rockaways, Central, Southern, Eastern

External Metric Data Point Domain defined by coordinates: Temperature at (x,y) is 72 F.

External Metric Data Point Domain defined by a node of the hierarchy: It’s raining in Queens. It’s raining in all polygons under Queens.

Most cities have files with the boundaries of sub-regions (e.g., Rockaways, Central) represented as polygons:

Region made of a single polygon (Polygon 1): ((lat lon, lat lon, … , lat lon))

Region made of several polygons (Polygon 2, Polygon 3): ((lat lon, lat lon, … , lat lon), (lat lon, lat lon, … , lat lon))
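To make the polygon notation concrete, here is a minimal parsing sketch. It assumes the boundary text follows exactly the `((lat lon, lat lon, …), (…))` layout shown on the slide; the function name `parse_polygons` and the sample coordinates are made up for illustration, and real city files may use other formats.

```python
import re


def parse_polygons(text):
    """Parse '((lat lon, lat lon), (lat lon, ...))' into a list of polygons.

    Each polygon is returned as a list of (lat, lon) float tuples.
    """
    polygons = []
    # Every innermost parenthesised group is treated as one polygon.
    for ring in re.findall(r"\(([^()]+)\)", text):
        points = []
        for pair in ring.split(","):
            lat, lon = pair.split()
            points.append((float(lat), float(lon)))
        polygons.append(points)
    return polygons


# Hypothetical coordinates, only to exercise the parser.
single = parse_polygons("((40.58 -73.85, 40.60 -73.80, 40.57 -73.78))")
multi = parse_polygons(
    "((40.0 -73.0, 40.1 -73.0, 40.1 -73.1), (41.0 -74.0, 41.1 -74.0, 41.1 -74.1))"
)
print(len(single), len(multi))
```

A region made of several disjoint areas simply yields more than one polygon in the returned list.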

Page 10: Using Hadoop for Cognitive Analytics


Associating External and Internal Contexts

IBM Metro Pulse Solution

External/Public Context (same for all customers)
• External Metrics, Events, News…
• Geographic Hierarchy
• Polygons

Internal/Customer-Specific Context (easily replaced for any customer)
• Prime Entities (Stores, Towers, ATM…)
• Customer-Specific Metrics
• Customer Hierarchies (Product, Sales…)

Coordinates of Prime Entities of any customer can instantly leverage the external context associated to polygons

Page 11: Using Hadoop for Cognitive Analytics


Fundamental Polygon Functions

1) point_in_polygon(“Point X”, “Polygon P”)

point_in_polygon(“A”, “Pol 2”) returns 1

point_in_polygon(“B”, “Pol 3”) returns 0

Data Quality: All Prime Entities and Points of Interest must belong to one and only one polygon in each geographic hierarchy.

2) polygons_intersection(“Polygon P”, “Polygon Q”)

polygons_intersection(“Pol 1”, “Pol 2”) returns 1

polygons_intersection(“Pol 1”, “Pol 3”) returns 0

Data Quality: No two polygons under the same hierarchy can intersect on any point other than on the edges or vertices.
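A possible implementation of point_in_polygon and of the "one and only one polygon" rule is sketched below. The slides do not say how the function is implemented; this version assumes the common ray-casting test on planar (x, y) coordinates, and the helper `locate` and the sample squares are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Return 1 if point is inside polygon (list of (x, y) vertices), else 0.

    Ray casting: count how many edges a horizontal ray from the point crosses.
    Points lying exactly on an edge may fall on either side.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's y coordinate
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return 1 if inside else 0


def locate(point, polygons):
    """Data-quality check: the point must fall in exactly one polygon."""
    matches = [name for name, poly in polygons.items()
               if point_in_polygon(point, poly)]
    if len(matches) != 1:
        raise ValueError(f"point {point} matched {len(matches)} polygons")
    return matches[0]


# Two adjacent squares sharing one edge (hypothetical shapes).
polygons = {
    "Pol 1": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "Pol 2": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
print(locate((3, 1), polygons))
```

Running `locate` over every Prime Entity and Point of Interest at load time is one way to enforce the data-quality rule before any metric is joined to an entity.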

Page 12: Using Hadoop for Cognitive Analytics


External Data Normalization Via a Reference Polygon

[Figure: three panels, each divided into Pol 1 to Pol 4: the Reference Polygon, Metric 1: Original, and Metric 1: Normalized.]

“Metric 1” values are based on a set of polygons that don’t match the reference polygon.

Different types of metrics (e.g., count, temperature) require different types of aggregation methods.
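One way to picture the normalization is area-weighted reallocation. The slides do not name the aggregation methods used, so the two rules below are assumptions: counts are split across reference polygons in proportion to overlap area, while temperature-like values are averaged with overlap area as the weight. The overlap areas are taken as given inputs, and every name and number is illustrative.

```python
from collections import defaultdict


def normalize_metric(values, overlaps, kind):
    """Re-express a metric on the reference polygons.

    values:   {source_polygon: metric value}
    overlaps: {(source_polygon, reference_polygon): overlap area}
    kind:     "count" (split proportionally) or "average" (area-weighted mean)
    """
    if kind not in ("count", "average"):
        raise ValueError(f"unknown metric kind: {kind}")

    src_area = defaultdict(float)
    ref_area = defaultdict(float)
    for (src, ref), area in overlaps.items():
        src_area[src] += area
        ref_area[ref] += area

    result = defaultdict(float)
    for (src, ref), area in overlaps.items():
        if kind == "count":
            # Share of the source polygon that lies in this reference polygon.
            result[ref] += values[src] * area / src_area[src]
        else:
            # Weight of this source within the reference polygon.
            result[ref] += values[src] * area / ref_area[ref]
    return dict(result)


overlaps = {("S1", "R1"): 1.0, ("S1", "R2"): 3.0, ("S2", "R2"): 1.0}
population = normalize_metric({"S1": 100.0, "S2": 60.0}, overlaps, "count")
temperature = normalize_metric({"S1": 70.0, "S2": 80.0}, overlaps, "average")
print(population, temperature)
```

The count rule preserves the grand total (160 in the example), while the average rule keeps each result within the range of its inputs, which is why the two kinds of metric cannot share one method.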

Page 13: Using Hadoop for Cognitive Analytics


Metro Pulse High Level Architecture

IBM Data Lake
• External Data Landing Zone (External Data From Cities All Over The World)
• Geographic Boundaries, Polygons, and Hierarchies
• Global Enriched City Repository
• Analytics Workbench Gold Copy

On the Cloud
• Analytics Workbench Customer A: Cities relevant to Customer A, Customer A Specific Data
• Analytics Workbench Customer B: Cities relevant to Customer B, Customer B Specific Data
• Analytics Workbench Customer F: Cities relevant to Customer F, Customer F Specific Data
• ...

On Premise
• Analytics Workbench Customer G: Cities relevant to Customer G, Customer G Specific Data
• Analytics Workbench Customer J: Cities relevant to Customer J, Customer J Specific Data
• ...

DaaS (customers interested in external data only)
• Cities relevant to Customer K
• Cities relevant to Customer L
• Cities relevant to Customer Z
• ...

Page 14: Using Hadoop for Cognitive Analytics


Metro Pulse Architecture – Version: 2.1

GBS Data Lake
• External Data by City: Weather, Twitter, Census, ...
• Landing Zone
• Geographical Borders, Polygons, and Hierarchies
• Metro Pulse Global City Repository (Curated Data)
• Access: REST API, DaaS
• Users: Power Users

Metro Pulse Analytical Workbench Gold Copy (One Deployment per Customer)
• Customer-Specific Data by City/Site: POS, ATM, Cell Towers, ...
• Transfer: Files, Tables; SFTP / Direct Connections
• Ingestion Layer
• Customer-Specific City Repository
• Analytics components: Core Analytics, Size of Prize, Movement Analytics, News Analysis, Enhanced Forecast, Modeling, Parameters Repository, Sandbox, ...
• Performance Layer
• Access Services: REST API, DaaS, Visualization
• Users: Data Scientists, Power Users, Business User

Page 15: Using Hadoop for Cognitive Analytics


Analytics Workbench Data Flow

Internal data stages (Customer’s Site, SFTP, Staging Node): Raw Internal Data, Clean Internal Data, Validated Internal Data, Tabular Internal Data

External data stages (Data Lake): Raw External Data, Clean External Data, Validated External Data, Tabular External Data

Downstream stages: Integrated Data, Derived Data, Consumable Data, Published Data, Cached Published Data, Visualized Data

Sandboxes: Data Samples, New Core Analytics / New Analytics, Results, Published in Production

Additional inputs at the Customer’s Site: User’s Additional Data, User’s Database

Technologies: Hadoop Cluster (HDFS and HBASE), Spark, Cassandra, Redis, Node.js, D3…

Micro services reusable not only for other customers, but also for other solutions

Page 16: Using Hadoop for Cognitive Analytics


Micro Services for Data Ingestion and Curation

Data Sources: RDBMS, Structured Files, Unstructured

Ingestion Engine and Curation Engine steps: Get Data, Copy Data (Hadoop Edge Node), Prepare Raw Data, Curate Data, Transform / Enrich Data

Stores: Raw Data Store, Conformed/Polyglot Data Store

Analytic Persistence: Hadoop, HBASE, Cassandra, Redis…

Micro Services

1 Error & Exception Processing
2 Configuration set up
3 Audit, Balance, & Control
4 Transport Data from Source to Edge Node
5 Convert Data Formats
6 Copy/Move Data to Hadoop
7 Preprocessing Service
8 Technical Data Validation (TDQ)
9 Source Delta Processing
10 Persist Raw Data
11 Catalog Raw Data
12 Profile Data
13 Cross File Analysis
14 Causality Analysis
15 Target Load Service
16 Business Data Validation
17 Merge / Match
18 Manage Keys
19 Reference Data Lookup
20 Transform Data
21 Enrich Data
22 Archive
23 Purge
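The idea of composing small, reusable services can be sketched as a chain of functions wrapped by shared error handling and audit counts. This is only an illustration: the slides do not show how the services are implemented, the record layout is invented, and the three step functions are simplified stand-ins for services 8 (Technical Data Validation), 9 (Source Delta Processing), and 21 (Enrich Data).

```python
def run_pipeline(records, steps):
    """Run steps in order, recording in/out counts per step (audit, balance,
    and control) and stopping at the first failure (error processing)."""
    audit = []
    for name, step in steps:
        count_in = len(records)
        try:
            records = step(records)
        except Exception as exc:
            audit.append({"step": name, "in": count_in, "out": None,
                          "error": str(exc)})
            break
        audit.append({"step": name, "in": count_in, "out": len(records),
                      "error": None})
    return records, audit


def technical_validation(records):
    # Stand-in rule: drop records without an id.
    return [r for r in records if r.get("id") is not None]


def make_source_delta(already_loaded):
    def source_delta(records):
        # Keep only records that have not been loaded before.
        return [r for r in records if r["id"] not in already_loaded]
    return source_delta


def enrich(records):
    # Stand-in enrichment: tag each record with a hypothetical source name.
    return [{**r, "source": "pos"} for r in records]


raw = [{"id": 1, "amount": 10}, {"id": None, "amount": 5}, {"id": 2, "amount": 7}]
steps = [
    ("technical_validation", technical_validation),
    ("source_delta", make_source_delta({1})),
    ("enrich", enrich),
]
curated, audit = run_pipeline(raw, steps)
print(curated)
print(audit)
```

Because each step only takes and returns a list of records, the same step can be reused in a different pipeline for another customer or another solution.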

Page 17: Using Hadoop for Cognitive Analytics


Loading Geographic Hierarchy to HBASE

One table per hierarchy level (L0 to L3). Each table has two column families: Data (Name, Desc, History, …) and Polygons (P1, P2, P3, …, PN).

Table L0
Row: London. Name: London. Desc: London is… History: Great Britain…
Row: Paris. Name: Paris. Desc: Paris is… History: Continental Europe…

Table L1
Row: London:Central. Name: Central. Desc: … History: …
Row: London:North. Name: North. Desc: … History: …
Row: Paris:Central. Name: Central. Desc: … History: …

Table L2
Row: London:Central:Kensington. Name: Kensington. Desc: … History: …
Row: London:Central:Buckingham. Name: Buckingham. Desc: … History: …

Table L3
Row: London:Central:Kensington:Notting Barns. Name: Notting Barns. Desc: … History: …
Row: …

Sample values shown in the Polygons columns: 50484 51673 54735 53896; 75736 78493 78303 79659; 50484 51673 54735; 50484 51673; 50484
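The row keys on this slide encode the full path from the city down to the node, which makes "everything under a node" a prefix lookup. The sketch below illustrates that idea with plain dictionaries standing in for the HBase tables; it does not use an HBase client, and the helper names are hypothetical.

```python
def row_key(path):
    """Join a hierarchy path into a row key, e.g. London:Central:Kensington."""
    return ":".join(path)


def table_for(path):
    """A path of length n lives in table L(n-1): cities go to L0."""
    return f"L{len(path) - 1}"


def descendants(tables, node_key):
    """Return every row key below node_key, found by key prefix."""
    prefix = node_key + ":"
    found = []
    for table in sorted(tables):
        for key in sorted(tables[table]):
            if key.startswith(prefix):
                found.append(key)
    return found


paths = [
    ("London",), ("Paris",),
    ("London", "Central"), ("London", "North"), ("Paris", "Central"),
    ("London", "Central", "Kensington"), ("London", "Central", "Buckingham"),
    ("London", "Central", "Kensington", "Notting Barns"),
]

tables = {}  # in-memory stand-in: {table name: {row key: columns}}
for path in paths:
    tables.setdefault(table_for(path), {})[row_key(path)] = {"Data:Name": path[-1]}

print(descendants(tables, "London:Central"))
```

A metric attached to a node, such as rain reported for a whole borough, can then be applied to every row returned by `descendants`. Including the parent path in the key also keeps Paris:Central distinct from London:Central even though both are named Central.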

Page 18: Using Hadoop for Cognitive Analytics


Ingesting External Data via Flume

Internet

Metro Pulse Global Repository Flume Server
• Flume Agent: Tweets
• Flume Agent: Weather
• Flume Agent: News
• . . .

Global City Repository: Tweets, Weather, News

Metro Pulse Analytical Workbench Edge Node (one per customer workbench; four shown)
• Flume Agent: Tweets
• Flume Agent: Weather
• Flume Agent: News
• . . .

Hadoop Data Nodes: HDFS (Tweets, Weather, News, . . .)

Easy to broadcast same data to multiple customers. Easy to add new customers.

• Agents can be optimally configured according to the characteristics of each data source
• Each agent writes to a different HDFS folder: no conflict, good for parallel execution
• Each source is captured as an HBASE column family
• One data source per agent: easy to add new sources
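The one-agent-per-source layout can be illustrated by generating one Flume properties file per data source, each with its own HDFS folder. This is a hedged sketch, not the configuration used in Metro Pulse: the agent names, the folder root, and the source types are placeholders, and the memory channel is an assumption. Only the general property layout (sources, channels, sinks, an `hdfs` sink with `hdfs.path`) follows standard Flume conventions.

```python
def flume_agent_config(source_name, source_type, hdfs_root):
    """Build a Flume properties text for one agent handling one source."""
    agent = f"{source_name}_agent"  # hypothetical naming scheme
    lines = [
        f"{agent}.sources = src",
        f"{agent}.channels = ch",
        f"{agent}.sinks = sink",
        f"{agent}.sources.src.type = {source_type}",
        f"{agent}.sources.src.channels = ch",
        f"{agent}.channels.ch.type = memory",  # assumed channel type
        f"{agent}.sinks.sink.type = hdfs",
        # One folder per source, so agents never write to the same path.
        f"{agent}.sinks.sink.hdfs.path = {hdfs_root}/{source_name}",
        f"{agent}.sinks.sink.channel = ch",
    ]
    return "\n".join(lines)


# Placeholder source types: substitute the real Flume source class per feed.
sources = {
    "tweets": "TWEETS_SOURCE_TYPE",
    "weather": "WEATHER_SOURCE_TYPE",
    "news": "NEWS_SOURCE_TYPE",
}
configs = {
    name: flume_agent_config(name, src_type, "/metro_pulse/landing")
    for name, src_type in sources.items()
}
print(configs["weather"])
```

Adding a new data source then amounts to adding one entry to `sources`, which mirrors the "easy to add new sources" point.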

Page 19: Using Hadoop for Cognitive Analytics


Performance Layer

Views in Cassandra:
- V_Transaction
- V_Level_Entity
- V_Polygon_Entity
- V_Size_of_Prize
...

Views cached in Redis:
- V_Level_Entity
- V_Size_of_Prize

API calls the Cache Manager: Get_View(“XYZ”)
- If “XYZ” in Redis, return “XYZ”
- Else:
  - Get “XYZ” from Cassandra
  - Return “XYZ” to the API
  - Load “XYZ” to Redis

Eviction Policy: Least Recently Used

Dashboards (small files): sub-second latency and high throughput

DaaS (large files): high throughput
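The Get_View logic is a cache-aside lookup with least-recently-used eviction. The sketch below reproduces that control flow with in-memory stand-ins: an `OrderedDict` plays the role of Redis and a plain dictionary plays the role of Cassandra. The class name, capacity, and view contents are illustrative, and no real Redis or Cassandra client is used.

```python
from collections import OrderedDict


class CacheManager:
    def __init__(self, backing_store, capacity):
        self.backing_store = backing_store  # stand-in for Cassandra
        self.cache = OrderedDict()          # stand-in for Redis
        self.capacity = capacity

    def get_view(self, name):
        if name in self.cache:
            # Cache hit: mark the view as most recently used.
            self.cache.move_to_end(name)
            return self.cache[name]
        # Cache miss: read from the backing store, then load into the cache.
        view = self.backing_store[name]
        self.cache[name] = view
        if len(self.cache) > self.capacity:
            # Evict the least recently used view.
            self.cache.popitem(last=False)
        return view


store = {
    "V_Transaction": "transactions...",
    "V_Level_Entity": "levels...",
    "V_Polygon_Entity": "polygons...",
    "V_Size_of_Prize": "prizes...",
}
manager = CacheManager(store, capacity=2)
for name in ["V_Level_Entity", "V_Size_of_Prize", "V_Level_Entity", "V_Transaction"]:
    manager.get_view(name)
print(list(manager.cache))
```

In the usage above, V_Size_of_Prize is evicted when V_Transaction arrives, because V_Level_Entity was touched more recently.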

Page 20: Using Hadoop for Cognitive Analytics


Sample of Visualization Objects on D3.js
