Using Hadoop for Cognitive Analytics
Pedro Desouza, Ph.D.
Associate Partner
Big Data & Analytics Center of Competence
IBM Global Business Services
June 29, 2016
© 2016 IBM Corporation
Global Business Services
Outline
- Metro Pulse: Enhancing Decision Making Processes With Hyperlocal Data
- Dashboards
- Use Cases In Multiple Industries
- Geographic Hierarchies, External Metrics, and Mapping Representation
- Integration Of External and Customer-Specific Metrics
- Solution Architecture
- Technological Components
- Micro Services for Data Ingestion and Curation
Improving Decision Making Accuracy by Combining Business Metrics with Hyperlocal Data
Hyperlocal Data: Weather, Social Media Sentiment, Economics, Events, Other Points of Interest, Subway Stations, Demographics…
Thousands of them together, in a single repository.
Store Context: Business decisions can be made on the precise hyperlocal context of each store.
Combining business metrics of each store with hyperlocal data provides insights via visual inspection and advanced analytics.
Demand Forecast, Marketing Campaign, Distribution Plan and many other business decisions are usually based on aggregate levels of data that don’t precisely consider the context where the business operates.
Stores in London
Improving Forecast Accuracy with External Data
Traditional Method: Neural Net, ARIMA…
Forecast based on Neural Network with External Data: 23.9% better accuracy
Actuals of a retail store
Riemer, M., Vempaty, A., Calmon, F., Heath, F., Hull, R., and Khabiri, E., Correcting Forecast with Multifactor Neural Attention, Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. http://jmlr.org/proceedings/papers/v48/riemer16.pdf
T. J. Watson IBM Research Center
Retail Use Case: Identification of Low/High Performers
Groups of similar stores in locations with similar hyperlocal contexts
Category: “All Products”, “Electronics”, or “Cosmetics”…
Same color on the map: similar context considering all external metrics
[Chart: Top Performer and Group Baseline for each of Groups 1 to 4, with a Potential Revenue Increase shown per group.]
Micro-Segmentation + External Metrics: Higher Accuracy for Root Cause Analysis and Revenue Increase
Population Movement Analytics
Store in Dallas
Close, but few visits. Why?
Percentage of visits based on buyer’s home location, obtained via anonymous app use analysis.
[Map: share of visits by home region: 15%, 20%, 7%, 9%, 12%, 5%, 18%.]
Potential location for a new store.
Other Use Cases
- Advertisement: population demographics; where people are and go
- Market Campaign: interests of each region (% of visits); population density
- City Planning: traffic growth; precise route; emergency services
Telecommunication Use Cases: Quality of Services (Tower Location)
[Map legend: Affluent Houses Life Time Revenue (LTR): High, Medium, Low. Congestion: High, Medium, Low.]
Intuition: New tower
Max LTR: Ideal position for a new tower
Congestion
Famous band free show, Saturday, 9-11 PM: tower will be over capacity.
Schedule a mobile base antenna during the event.
City Analytics Industry Use Cases
Use cases are countless…

Banking and Finance
1. Branch Segmentation / New Market Opportunities
2. Cash Demand Forecasting
3. Promotion Customization
4. Staffing Mix / Specialty Account Services
5. Customer Churn
6. ATM Kiosk-to-Location Ratio Optimization

Retail
1. Uncaptured Opportunity
2. Assortment Optimization
3. Out of Stock
4. Demand Forecasting
5. Dynamic Pricing
6. Promotion Effectiveness

Insurance
1. Risk Management and Pricing Optimization
2. Portfolio Suitability
3. Demand Forecasting
4. Staffing Mix / Specialty Account Services
5. Damage Forecasting

Consumer Packaged Goods
1. Product Mix
2. Out of Stock
3. Visibility
4. Expansion Opportunity
5. Customer Churn
6. Promotion Effectiveness

Travel and Transportation
1. Booking Traffic Forecasting Based on POIs
2. Service Relative Pricing Model
3. Promotion Customization
4. Amenity Mix
5. Cancellation Forecasting

Telecommunications
1. Customer Churn
2. Package/Service Offering Optimization
3. Coverage Optimization
4. New Product Demand
5. Device Repair Services
6. Service Outage Forecasting
Geographic Hierarchy, External Metrics, and Polygons
[Map of New York with labeled regions: Rockaways, Manhattan, Soho, Midtown, Brooklyn, Queens, Southern, Eastern, Central.]
External Metric Data Point Domain defined by coordinates: Temperature at (x,y) is 72 F.
External Metric Data Point Domain defined by a node of the hierarchy: It’s raining in Queens. It’s raining in all polygons under Queens.
Nodes of the geographic hierarchy:
Level 0: New York
Level 1: Manhattan, Brooklyn, Queens
Level 2: Soho, Midtown, Rockaways, Central, Southern, Eastern
Most cities have files with the boundaries of sub-regions represented as polygons:
Rockaways: ((lat lon, lat lon, … , lat lon)) [Polygon 1]
Central: ((lat lon, lat lon, … , lat lon), (lat lon, lat lon, … , lat lon)) [Polygon 2, Polygon 3]
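The rule that a metric attached to a hierarchy node applies to everything under that node ("It’s raining in Queens") can be illustrated with a small tree. This is a minimal sketch, not the Metro Pulse implementation: the `Node` class and `set_metric` function are hypothetical names, and only part of the New York hierarchy is built.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the geographic hierarchy (hypothetical structure)."""
    name: str
    children: list = field(default_factory=list)
    metrics: dict = field(default_factory=dict)

    def add(self, name):
        """Create a child node, attach it, and return it."""
        child = Node(name)
        self.children.append(child)
        return child


def set_metric(node, metric, value):
    """Attach a metric to a node and to every node below it."""
    node.metrics[metric] = value
    for child in node.children:
        set_metric(child, metric, value)


# Part of the hierarchy from the slide (Level 0, Level 1, Level 2).
new_york = Node("New York")
manhattan = new_york.add("Manhattan")
soho = manhattan.add("Soho")
midtown = manhattan.add("Midtown")
brooklyn = new_york.add("Brooklyn")
queens = new_york.add("Queens")
rockaways = queens.add("Rockaways")

# "It's raining in Queens" applies to everything under Queens only.
set_metric(queens, "raining", True)
```

After the call, `rockaways.metrics` carries the `raining` value, while `soho.metrics` and `new_york.metrics` stay empty.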
Associating External and Internal Contexts
External/Public Context (same for all customers):
- External Metrics, Events, News…
- Geographic Hierarchy
- Polygons
Internal/Customer-Specific Context (easily replaced for any customer):
- Prime Entities (Stores, Towers, ATM…)
- Customer-Specific Metrics
- Customer Hierarchies (Product, Sales…)
Coordinates of Prime Entities of any customer can instantly leverage the external context associated with polygons.
IBM Metro Pulse Solution
Fundamental Polygon Functions
1) point_in_polygon(“Point X”, “Polygon P”)
[Figure: polygons Pol 1 to Pol 4 with points A, B, and C.]
point_in_polygon(“A”, “Pol 2”) returns 1
point_in_polygon(“B”, “Pol 3”) returns 0
Data Quality: All Prime Entities and Points of Interest must belong to one and only one polygon in each geographic hierarchy.

2) polygons_intersection(“Polygon P”, “Polygon Q”)
[Figure: polygons Pol 1, Pol 2, and Pol 3.]
polygons_intersection(“Pol 1”, “Pol 2”) returns 1
polygons_intersection(“Pol 1”, “Pol 3”) returns 0
Data Quality: No two polygons under the same hierarchy can intersect at any point other than on the edges or vertices.
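As an illustration of the point-in-polygon test and the "one and only one polygon" data quality rule, here is a minimal sketch using the standard ray casting technique. The slides do not say how `point_in_polygon` is implemented, so the algorithm choice, the helper `entities_violating_rule`, and the sample coordinates are all assumptions.

```python
def point_in_polygon(point, polygon):
    """Return 1 if point is inside polygon, else 0 (ray casting).

    point is an (x, y) pair; polygon is a list of (x, y) vertices.
    Points lying exactly on an edge are not treated specially here.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray going right from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return 1 if inside else 0


def entities_violating_rule(entities, polygons):
    """Map each entity that is not in exactly one polygon to its owners."""
    violations = {}
    for name, point in entities.items():
        owners = [p for p, verts in polygons.items()
                  if point_in_polygon(point, verts)]
        if len(owners) != 1:
            violations[name] = owners
    return violations


# Hypothetical sample data: two adjacent squares and three points.
polygons = {
    "Pol 1": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "Pol 2": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
entities = {"A": (3, 1), "B": (1, 1), "C": (5, 5)}
violations = entities_violating_rule(entities, polygons)
```

In this sample, A falls in Pol 2 and B in Pol 1, while C lies outside both polygons and is reported as a violation.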
External Data Normalization Via a Reference Polygon
[Figure: three panels, each over polygons Pol 1 to Pol 4: Reference Polygon; Metric 1: Original; Metric 1: Normalized.]
“Metric 1” values are based on a set of polygons that don’t match the reference polygon.
Different types of metrics (e.g., count, temperature) require different types of aggregation methods.
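One common way to move a metric from its original polygons onto reference polygons is areal weighting. The sketch below assumes that technique; the slides do not name the method used. It also assumes the overlap areas between source and reference polygons are already known, and all names and numbers are hypothetical. Counts are split in proportion to the overlapping area, while temperature-like metrics are averaged with the overlap areas as weights.

```python
def normalize_count(values, source_area, overlap):
    """Redistribute an additive metric (e.g., a count) to reference polygons.

    overlap maps (source, reference) pairs to the shared area.
    Each source value is split in proportion to the overlapping area.
    """
    result = {}
    for (src, ref), area in overlap.items():
        share = area / source_area[src]
        result[ref] = result.get(ref, 0.0) + values[src] * share
    return result


def normalize_average(values, overlap):
    """Area-weighted average for a non-additive metric (e.g., temperature)."""
    weighted, total = {}, {}
    for (src, ref), area in overlap.items():
        weighted[ref] = weighted.get(ref, 0.0) + values[src] * area
        total[ref] = total.get(ref, 0.0) + area
    return {ref: weighted[ref] / total[ref] for ref in weighted}


# Hypothetical data: source polygons S1 and S2, reference polygons R1 and R2.
source_area = {"S1": 4.0, "S2": 4.0}
overlap = {("S1", "R1"): 4.0, ("S2", "R1"): 2.0, ("S2", "R2"): 2.0}

counts = normalize_count({"S1": 100, "S2": 60}, source_area, overlap)
temps = normalize_average({"S1": 70.0, "S2": 76.0}, overlap)
```

Here S2 straddles both reference polygons, so its count of 60 is split evenly between R1 and R2, while its temperature contributes to both averages without being divided.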
Metro Pulse High Level Architecture
IBM Data Lake
- External Data Landing Zone: external data from cities all over the world
- Global Enriched City Repository
- Geographic Boundaries, Polygons, and Hierarchies
- Analytics Workbench Gold Copy
On the Cloud
- Analytics Workbench for Customers A, B, ..., F
- Each workbench holds the cities relevant to that customer and that customer's specific data
On Premise
- Analytics Workbench for Customers G, J, ...
- Each workbench holds the cities relevant to that customer and that customer's specific data
DaaS
- Cities relevant to Customers K, L, ..., Z
- Customers interested in external data only
Metro Pulse Architecture – Version: 2.1
GBS Data Lake
- External Data by City: Weather, Census, ...
- Geographical Borders, Polygons, and Hierarchies
- Landing Zone
- Metro Pulse Global City Repository (Curated Data)
- REST API, DaaS
- Power Users
Metro Pulse Analytical Workbench Gold Copy (One Deployment per Customer)
- Customer-Specific Data by City/Site: POS, ATM, Cell Towers, ...
- Files, Tables; SFTP / Direct Connections
- Ingestion Layer
- Customer-Specific City Repository
- Core Analytics: Size of Prize, Movement Analytics, News Analysis, ...
- Modeling, Enhanced Forecast
- Parameters Repository
- Sandbox
- Performance Layer
- Access Services: REST API, DaaS, Visualization
- Data Scientists, Power Users, Business User
Analytics Workbench Data Flow
Data Lake
- Internal data stages: Raw Internal Data, Clean Internal Data, Validated Internal Data, Tabular Internal Data (transferred via SFTP)
- External data stages: Raw External Data, Clean External Data, Validated External Data, Tabular External Data
- Combined stages: Integrated Data, Derived Data, Consumable Data, Published Data, Cached Published Data, Visualized Data
- Sandbox: Data Samples, Results, New Core Analytics, Published in Production
- Sandbox: Data Samples, Results, New Analytics, Published in Production
- Components: Staging Node, Hadoop Cluster (HDFS and HBASE), Spark, Cassandra, Redis, Node.js, D3…
- At the Customer’s Site: User’s Additional Data, User’s Database
Micro services are reusable not only for other customers, but also for other solutions.
Micro Services for Data Ingestion and Curation
Data Sources: RDBMS, Structured Files, Unstructured
Ingestion Engine: Copy Data (Hadoop Edge Node), Get Data (Raw Data Store), Prepare Raw Data
Curation Engine: Curate Data, Transform / Enrich Data (Conformed/Polyglot Data Store)
Analytic Persistence: Hadoop, HBASE, Cassandra, Redis…
Service numbers as grouped in the diagram: 1 2 3; 4 5 6; 7 8 9 10; 11 12 13 14 15; 16 17 18; 19 20 21; 22 23

Micro Services
1 Error & Exception Processing
2 Configuration Setup
3 Audit, Balance, & Control
4 Transport Data from Source to Edge Node
5 Convert Data Formats
6 Copy/Move Data to Hadoop
7 Preprocessing Service
8 Technical Data Validation (TDQ)
9 Source Delta Processing
10 Persist Raw Data
11 Catalog Raw Data
12 Profile Data
13 Cross File Analysis
14 Causality Analysis
15 Target Load Service
16 Business Data Validation
17 Merge / Match
18 Manage Keys
19 Reference Data Lookup
20 Transform Data
21 Enrich Data
22 Archive
23 Purge
Loading Geographic Hierarchy to HBASE
Each table has two column families: Data (Name, Desc, History, …) and Polygons (P1, P2, P3, …, PN).

Table L0
Row | Name | Desc | History
London | London | London is… | Great Britain…
Paris | Paris | Paris is… | Continental Europe…

Table L1
Row | Name | Desc | History
London:Central | Central | … | …
London:North | North | … | …
Paris:Central | Central | … | …
…

Table L2
Row | Name | Desc | History
London:Central:Kensington | Kensington | … | …
London:Central:Buckingham | Buckingham | … | …
…

Table L3
Row | Name | Desc | History
London:Central:Kensington:Notting Barns | Notting Barns | … | …
…

Sample values shown under the Polygons column family on the slide:
50484 51673 54735 53896
75736 78493 78303 79659
50484 51673 54735
50484 51673
50484
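The row keys on this slide, such as London:Central:Kensington, encode the full path through the hierarchy, with one table per level. The sketch below shows how such keys and cells might be derived. It uses no HBase client: `to_put` is a hypothetical helper that only builds the table name, row key, and `family:qualifier` cells, and the polygon strings are placeholders.

```python
def row_key(path):
    """Join the hierarchy path into a row key, e.g. London:Central."""
    return ":".join(path)


def table_name(path):
    """Level tables are named L0, L1, L2, ... by depth in the hierarchy."""
    return f"L{len(path) - 1}"


def to_put(path, desc, history, polygons):
    """Build (table, row key, cells) for one node of the hierarchy.

    Cells use 'ColumnFamily:qualifier' names for the Data and Polygons
    column families; polygons become qualifiers P1, P2, ..., PN.
    """
    cells = {
        "Data:Name": path[-1],
        "Data:Desc": desc,
        "Data:History": history,
    }
    for i, polygon in enumerate(polygons, start=1):
        cells[f"Polygons:P{i}"] = polygon
    return table_name(path), row_key(path), cells


# Placeholder polygon strings; real values would hold coordinates.
put = to_put(
    ["London", "Central", "Kensington"],
    desc="...",
    history="...",
    polygons=["((lat lon, ...))", "((lat lon, ...))"],
)

# Because keys share a prefix, all rows under London:Central can be
# selected with a prefix match on the row key.
keys = ["London:Central:Kensington", "London:Central:Buckingham",
        "London:North:Camden"]  # the last key is a made-up example
under_central = [k for k in keys if k.startswith("London:Central:")]
```

One consequence of this key design is that every child of a node shares its parent's key as a prefix, which suits the prefix scans that HBase supports on sorted row keys.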
Ingesting External Data via Flume
Metro Pulse Global Repository Flume Server (source: Internet)
- Flume Agent: Tweets
- Flume Agent: Weather
- Flume Agent: News
- . . .
- Global City Repository: Tweets, Weather, News
Metro Pulse Analytical Workbench Edge Node (one per customer; four shown on the slide)
- Flume Agent: Tweets
- Flume Agent: Weather
- Flume Agent: News
- Hadoop Data Nodes (HDFS): Tweets, Weather, News, . . .
Easy to broadcast the same data to multiple customers. Easy to add new customers.
- Agents can be optimally configured according to the characteristics of each data source
- Each agent writes to a different HDFS folder: no conflicts, good for parallel execution
- Each source is captured as an HBASE column family
- One data source per agent: easy to add new sources
Performance Layer
Views:
- V_Transaction
- V_Level_Entity
- V_Polygon_Entity
- V_Size_of_Prize
- ...
Cached views:
- V_Level_Entity
- V_Size_of_Prize
Cache Manager: the API calls Get_View(“XYZ”)
- If “XYZ” is in Redis, return “XYZ”
- Else:
  - Get “XYZ” from Cassandra
  - Return “XYZ” to the API
  - Load “XYZ” to Redis
Eviction Policy: Least Recently Used
Dashboards (small files): sub-second latency and high throughput
DaaS (large files): high throughput
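The Get_View logic can be sketched as a cache-aside lookup with least-recently-used eviction. This is a minimal illustration rather than the Metro Pulse code: an `OrderedDict` stands in for Redis, a plain dictionary stands in for Cassandra, and the `CacheManager` class and its `capacity` parameter are hypothetical.

```python
from collections import OrderedDict


class CacheManager:
    """Cache-aside view lookup with least-recently-used eviction."""

    def __init__(self, backing_store, capacity):
        self.backing_store = backing_store  # stand-in for Cassandra
        self.cache = OrderedDict()          # stand-in for Redis
        self.capacity = capacity            # max number of cached views

    def get_view(self, name):
        # Cache hit: mark the view as most recently used and return it.
        if name in self.cache:
            self.cache.move_to_end(name)
            return self.cache[name]
        # Cache miss: read from the backing store, then load the cache.
        view = self.backing_store[name]
        self.cache[name] = view
        # Evict the least recently used view if over capacity.
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)
        return view


# Hypothetical view contents keyed by the view names from the slide.
store = {
    "V_Transaction": "transaction rows",
    "V_Level_Entity": "level/entity rows",
    "V_Size_of_Prize": "size-of-prize rows",
}
manager = CacheManager(store, capacity=2)
manager.get_view("V_Level_Entity")
manager.get_view("V_Transaction")
manager.get_view("V_Level_Entity")   # hit: becomes most recently used
manager.get_view("V_Size_of_Prize")  # miss: evicts V_Transaction
cached = list(manager.cache)
```

In a real deployment the eviction would typically be left to Redis itself through its memory policy settings rather than handled in application code.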
Sample of Visualization Objects on D3.js