Power product innovation with Big Data technologies
Introducing:
Zhixuan Wang Experian
Hua Li Experian
In the 21st century, data is the new oil, Big Data analytics is the new engine, and Big Data tools are the new machinery.
4/21/2017 Experian Public Vision 2017
Big Data and open source landscape
Apache Hadoop Stack
Tip: Use Hadoop Streaming to write the mapper and reducer in your favorite programming language
Credit Card Attrition Trigger: Transaction Data Insight System (TDIS)
1. Historical spend enables a probability expectation (profile) to be computed
2. As time passes, new transactions adjust the probability expectation
3. Notify when a transaction does not occur within the probability expectation threshold
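The three steps can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the profile is reduced to the mean gap in days between consecutive transactions, and a hypothetical `tolerance` multiplier stands in for the probability expectation threshold. The function names are hypothetical as well, and an account needs at least two historical transactions.

```python
from datetime import date

def build_profile(tx_dates):
    """Step 1: summarize history as the mean gap (days) between transactions."""
    ordered = sorted(tx_dates)
    gaps = [(b - a).days for a, b in zip(ordered, ordered[1:])]
    return {"last_seen": ordered[-1],
            "mean_gap": sum(gaps) / len(gaps),
            "count": len(gaps)}

def update_profile(profile, tx_date):
    """Step 2: fold a new transaction into the running mean gap."""
    gap = (tx_date - profile["last_seen"]).days
    n = profile["count"]
    return {"last_seen": tx_date,
            "mean_gap": (profile["mean_gap"] * n + gap) / (n + 1),
            "count": n + 1}

def should_notify(profile, today, tolerance=3.0):
    """Step 3: flag the account when the silence exceeds tolerance x mean gap.

    The tolerance rule is an assumption made for this sketch only.
    """
    silent_days = (today - profile["last_seen"]).days
    return silent_days > tolerance * profile["mean_gap"]

# Made-up sample account with one transaction per week.
history = [date(2017, 1, 1), date(2017, 1, 8), date(2017, 1, 15)]
profile = update_profile(build_profile(history), date(2017, 1, 22))
```

With this weekly spender, a check one week after the last transaction stays quiet, while a check after more than three weeks of silence raises the trigger.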
Hadoop Streaming with secondary sort
Calculate triggers in reducer
• Build up the profile from account-grouped, date-time-ordered transactions
• Reuse old Python code
Results
• 10M accounts with 1.2B transactions over 24 months
• No profile data to be stored: ~50GB / snapshot
• Finishes in 1 hour 17 minutes on 6 machines with 8 cores each
• Trigger delivery moved from weekly to daily
Sort
• Primary key = account number
• Secondary key = date-time
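A reducer for this pattern can be sketched as below. It assumes, hypothetically, that each input line is tab-separated as account, ISO date-time, amount, and that the streaming job is already configured to partition on the account number and sort on both keys. The per-account summary (count, total, last date-time) is a stand-in for the real profile logic, which the slides do not show.

```python
import sys
from itertools import groupby

def parse(line):
    """Split one record: account, ISO date-time, amount (assumed layout)."""
    account, timestamp, amount = line.rstrip("\n").split("\t")
    return account, timestamp, float(amount)

def reduce_stream(lines):
    """Yield one summary line per account.

    Input must already be sorted by (account, date-time), which is what
    the secondary sort guarantees.
    """
    records = (parse(line) for line in lines if line.strip())
    for account, group in groupby(records, key=lambda r: r[0]):
        count, total, last_seen = 0, 0.0, None
        for _, timestamp, amount in group:
            # Records arrive in date-time order, so a running profile can
            # be updated without buffering the whole account in memory.
            count += 1
            total += amount
            last_seen = timestamp
        yield f"{account}\t{count}\t{total:.2f}\t{last_seen}"

def main():
    """Entry point when used as a streaming reducer: stdin in, stdout out."""
    for out in reduce_stream(sys.stdin):
        print(out)

# Made-up sample input, already sorted by account and then date-time.
sample = [
    "A1\t2017-01-01T10:00:00\t10.00\n",
    "A1\t2017-01-03T09:00:00\t5.50\n",
    "B2\t2017-01-02T12:00:00\t7.25\n",
]
summaries = list(reduce_stream(sample))
```

Because `groupby` only groups adjacent records, the reducer depends on the sort order; without the secondary key it would still group by account but could not assume chronological order inside a group.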
Apache Spark
• Interactive data exploration via Spark Shell (Scala)
Spark use example
Credit card transaction data: 24 months
• 25GB bzipped
• 1.2B transactions
– 18 fields / transaction
• 8 machines
– 32 cores / machine
– 256GB memory / machine
Interactively explore credit card transaction data
Split, convert and load data
Split, convert, and load data
Fire up Spark-shell
Cache data
Check cached data and executors
Cache it!
Explore data (fast!)
Take a peek
Five number summary on TRAN_AMT
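The Spark code from this slide did not survive extraction. As a reminder of what a five-number summary reports, here is a plain Python sketch on a small in-memory sample; the amounts are made up, and on the cluster the same statistics would be computed by Spark over the cached `TRAN_AMT` column rather than by this function.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    # "inclusive" treats the sample as the full population, so the
    # quartiles never fall outside the observed range.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Made-up transaction amounts standing in for TRAN_AMT.
tran_amt = [12.5, 3.0, 40.0, 7.5, 20.0]
summary = five_number_summary(tran_amt)
```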
Save results
Top merchant ZIP Codes™
Save results
Start Spark Shell
• Set proper number of executors and memory per executor
Convert, load, cache data
• Spark 1.6 and later: memory efficient
• Partition data to fit executor’s memory limit
Explore
Recap and tips
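The sizing tip comes down to simple arithmetic. The sketch below uses the cluster described earlier in the deck (8 machines, 32 cores and 256GB each); everything else is an assumption for illustration and not a recommendation from the slides: 5 cores per executor, 1 core and 8GB reserved per machine for the operating system, and 10% of executor memory set aside as off-heap overhead.

```python
def plan_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=5, reserved_cores=1,
                   reserved_mem_gb=8, overhead_fraction=0.10):
    """Estimate executor count and heap size from cluster hardware.

    All default values are rule-of-thumb assumptions for this sketch.
    """
    usable_cores = cores_per_node - reserved_cores
    executors_per_node = usable_cores // cores_per_executor
    mem_per_executor = (mem_gb_per_node - reserved_mem_gb) / executors_per_node
    heap_gb = int(mem_per_executor * (1 - overhead_fraction))
    return {
        "num_executors": nodes * executors_per_node,
        "executor_cores": cores_per_executor,
        "executor_memory_gb": heap_gb,
    }

plan = plan_executors(nodes=8, cores_per_node=32, mem_gb_per_node=256)
```

Under these assumptions the plan is 48 executors with 5 cores and a 37GB heap each. The resulting numbers would then be passed to the shell at start-up, and the data partitioned so that each cached partition fits comfortably within one executor's share.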
Graph database
Challenge: Finding the missing link
Potential applications:
• Healthcare: Elderly patients living close to their children
• Wealth service: Identify the heirs of elderly customers
• Retail: Condolence / celebration / holiday gifts and services
• Anti-money laundering: Domestic politically exposed persons
• Fraud prevention: Synthetic ID fraud
Who are my family members?
What is a graph database?
Graph
• A collection of vertices (nodes) and edges (relationships) that connect them
Graph database
• Index-free adjacency: connected nodes physically “point” to each other in the database
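Index-free adjacency can be pictured with ordinary object references. The sketch below is a toy in-memory model, not how any particular graph database stores data; the `Node` class, the `LIVES_AT` relationship name, and the sample values are all hypothetical.

```python
class Node:
    """A vertex that holds direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.edges = []  # list of (relationship, Node) pairs

    def relate(self, relationship, other):
        """Add a directed edge from this node to another node."""
        self.edges.append((relationship, other))

# Made-up sample data.
alice = Node("Alice")
home = Node("12 Main St")
alice.relate("LIVES_AT", home)

# Traversal follows object references; no lookup table or index is consulted.
neighbours = [(rel, node.name) for rel, node in alice.edges]
```

The cost of each hop depends only on the number of edges on the current node, not on the total size of the graph, which is the property the slide refers to.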
• Extremely flexible data format
• Most of the time, family members are not directly connected
• Nodes that are useful family indicators:
– Address
– Phone number
– Email address
– Last name
• Other uses:
– Meetup / eHarmony (based on hobbies, tastes, etc.)
– Facebook / LinkedIn (based on co-workers, classmates, etc.)
Design the graph for family search
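Since family members are linked only through shared indicator nodes, a search is a two-hop traversal: person to indicator to person. The sketch below illustrates that idea in plain Python. The sample people and values are made up, and the rule that two people must share at least two kinds of indicator (`min_shared`) is an assumption for this example, not the scoring used by the authors.

```python
from collections import defaultdict
from itertools import combinations

# (person, indicator kind, indicator value) edges; made-up sample data.
EDGES = [
    ("Ann Lee", "address", "12 Main St"),
    ("Bob Lee", "address", "12 Main St"),
    ("Ann Lee", "last_name", "Lee"),
    ("Bob Lee", "last_name", "Lee"),
    ("Cy Lee", "last_name", "Lee"),
    ("Ann Lee", "phone", "555-0100"),
    ("Dee Wu", "phone", "555-0100"),
]

def family_candidates(edges, min_shared=2):
    """Return pairs of people who share at least min_shared indicator kinds."""
    # First hop: group people by the indicator node they connect to.
    people_by_indicator = defaultdict(set)
    for person, kind, value in edges:
        people_by_indicator[(kind, value)].add(person)

    # Second hop: every two people on the same indicator node are linked.
    shared = defaultdict(set)
    for (kind, _value), people in people_by_indicator.items():
        for a, b in combinations(sorted(people), 2):
            shared[(a, b)].add(kind)

    return {pair: kinds for pair, kinds in shared.items()
            if len(kinds) >= min_shared}

candidates = family_candidates(EDGES)
```

Here only Ann Lee and Bob Lee qualify, because they share both an address and a last name; a shared last name or phone number alone is not enough under the assumed rule.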
Comparison
SQL query (RDBMS) vs. Cypher query (graph database)
Geolocation database with PostgreSQL
Geolocation data
Exponential growth of mobile location data with the rise of smartphones
Wide applications:
• Home / work location detection
• Favorite shops
• Mobile marketing service
• Passenger analysis
Key question:
• Where has the consumer been?
Supporting components:
• Where is the point-of-interest (POI) data?
• Which POIs are around the consumer?
where you work
where you shop
how you get there
events you attend
where you travel
where you live
where you spend free time
OpenStreetMap: best free source for points of interest
OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world
• Not as accurate as Google, but getting closer and closer, especially in major cities
• Points, lines, polygons
• Rich tags:
– Addr: House number, street, city, etc.
– Shop: Alcohol, beverage, computer
– Admin_level: 2 (country), 4 (state), 6 (city)
– Highway: Residential, primary, cycleway, track, etc.
– Amenity: Library, school, parking area, bar
– Cuisine: Coffee, pizza, Chinese, sushi
• Can be easily imported into PostgreSQL with the PostGIS extension
What POIs are around me?
Cells within geohash 9:
9p 9r 9x 9z
9n 9q 9w 9y
9j 9m 9t 9v
9h 9k 9s 9u
95 97 9e 9g
94 96 9d 9f
91 93 99 9c
90 92 98 9b
Cells within geohash 9q:
9qb 9qc 9qf 9qg 9qu 9qv 9qy 9qz
9q8 9q9 9qd 9qe 9qs 9qt 9qw 9qx
9q2 9q3 9q6 9q7 9qk 9qm 9qq 9qr
9q0 9q1 9q4 9q5 9qh 9qj 9qn 9qp
• Hierarchical group coding of (latitude, longitude) coordinates
• Arbitrary accuracy
• Fast encoding
Geohash
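The encoding itself is short. The sketch below follows the standard public geohash scheme: longitude and latitude ranges are halved alternately, one bit is emitted per halving, and every five bits become one base-32 character. The function name `geohash_encode` and the sample coordinates are chosen for this example only.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=6):
    """Encode a (latitude, longitude) pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, value = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # five bits make one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)

# A point in San Francisco falls in cell 9q, as in the grid above.
coarse = geohash_encode(37.7749, -122.4194, precision=2)
fine = geohash_encode(37.7749, -122.4194, precision=6)
```

The hierarchy shows up as a prefix property: the two-character code is the start of the six-character code, so truncating a geohash gives the enclosing, coarser cell.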
Nearby points: easy case of vicinity search
Which store am I visiting?
Identify the search radius
Find POI candidates within the candidate geohash cells
Filter by actual distance calculation
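The filter-then-measure pattern can be sketched as follows. To keep the example self-contained, a simple latitude/longitude grid cell stands in for the geohash cell (an assumption; the deck uses geohash), and the exact distance is computed with the haversine formula. The cell size, POI names, and coordinates are made up, and the 3x3 neighbourhood is only complete when the search radius is smaller than one cell.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_000
CELL_DEG = 0.01  # hypothetical cell size, roughly 1.1 km in latitude

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def cell_of(lat, lon):
    """Grid cell key; plays the role of a geohash prefix in this sketch."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))

def build_index(pois):
    """Bucket (name, lat, lon) tuples by cell."""
    index = defaultdict(list)
    for name, lat, lon in pois:
        index[cell_of(lat, lon)].append((name, lat, lon))
    return index

def nearby(index, lat, lon, radius_m):
    """Return POI names within radius_m, nearest first."""
    ci, cj = cell_of(lat, lon)
    hits = []
    for di in (-1, 0, 1):          # candidate cells: own cell plus 8 neighbours
        for dj in (-1, 0, 1):
            for name, plat, plon in index.get((ci + di, cj + dj), []):
                d = haversine_m(lat, lon, plat, plon)
                if d <= radius_m:  # exact check only on the few candidates
                    hits.append((d, name))
    return [name for _, name in sorted(hits)]

pois = [("cafe", 37.7750, -122.4190), ("far_store", 37.8000, -122.4000)]
index = build_index(pois)
found = nearby(index, 37.7749, -122.4194, radius_m=100)
```

The distance function runs only on POIs in the nine candidate cells, which is the point of filtering first.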
Nearby polygons: advanced case of vicinity search
Challenge #1: The geohash of a polygon is the geohash of its center, but the boundary could be very far away from its center
Which park am I visiting?
Solution: Categorize polygons by size first, then customize the search radius by the search
Nearby polygons: advanced case of vicinity search
Multi-level search: find polygons of all sizes
Which park am I visiting?
Nearby polygons: advanced case of vicinity search
Am I in the park?
Challenge #2: Given a point, how do we determine whether it is inside the polygon?
Solution: ST_Within (PostGIS built-in function), using the ray-casting algorithm:
• Draw a ray from the point in an arbitrary direction
• Count the number of intersections with the polygon boundary
• Odd: inside. Even: outside.
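The ray-casting idea fits in a few lines. The sketch below is a plain Python illustration, not the PostGIS implementation: it fixes the ray direction to the positive x axis instead of a random direction, treats the polygon as a list of (x, y) vertices, and does not handle points lying exactly on the boundary. The function name and sample shapes are made up.

```python
def point_in_polygon(x, y, polygon):
    """Return True if (x, y) is strictly inside the polygon.

    polygon is a list of (x, y) vertices in order; the last vertex is
    implicitly connected back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:          # crossing lies on the ray to the right
                inside = not inside  # odd count = inside, even = outside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
l_shape = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
in_square = point_in_polygon(2, 2, square)
in_notch = point_in_polygon(3, 3, l_shape)
```

The point (3, 3) sits in the cut-out corner of the L shape, so its ray crosses no edges and it is reported as outside, which a bounding-box check alone would get wrong.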
Key takeaways
• Use OpenStreetMap + PostgreSQL (PostGIS) to handle your geolocation data
• Filter the candidates first before you calculate distance
Tips on some of the latest techniques, based on our experience
• Spark:
– Set proper number of executors and memory per executor
– Partition data to fit executor’s memory limit
• Graph Database:
– Much more efficient than a traditional RDBMS when a query needs multiple joins
– Much more flexible
• Geolocation data:
– OpenStreetMap + PostgreSQL
– Filter candidates before a proximity search
Summary
http://www.experian.com/big-data/datalabs.html
Experian contact:
Hua Li [email protected]
Zhixuan Wang [email protected]
Questions and answers