Search, Engage, Influence

SEEi SEarch, Engage, Influence. Insight Data Engineering, Silicon Valley. Yan Jiang

Upload: yan-jiang

Post on 21-Feb-2017


TRANSCRIPT

Page 1: Search, Engage, Influence

SEEi SEarch, Engage, Influence

Insight Data Engineering, Silicon Valley

Yan Jiang

Page 2: Search, Engage, Influence

Motivation

Search for a keyword, in a time window, inside a social network

● User ID
● Keyword
● Time window (start + end)
● Social network (follower relationships)

Page 3: Search, Engage, Influence

Demo

http://ec2-52-24-96-242.us-west-2.compute.amazonaws.com/email

Page 4: Search, Engage, Influence

Pipeline

User Request

Network Relationship

Tweets + Network Relationship

Filtering + Map-reduce

Page 5: Search, Engage, Influence

Input Data

1.4 TB of JSON tweets

Table 1. Tweets Table

User ID      Timestamp     Tweets
123456789    2015-01-01    Hello world!

Table 2. User and Follower Relationship

User ID      Follower ID    Earliest Retweet Time
123456789    987654321      2015-01-01
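A minimal sketch of how one raw JSON tweet could be reduced to a row of Table 1. The slides do not show the raw layout, so the field names (`user.id`, `created_at`, `text`) and the timestamp format are assumptions based on the Twitter v1.1 JSON format, and `parse_tweet` is a hypothetical helper, not code from the project.

```python
import json
from datetime import datetime

# Assumed layout of the `created_at` field (Twitter v1.1 style).
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_tweet(line):
    """Turn one JSON line into a (user_id, date, text) row, or None if unusable."""
    try:
        record = json.loads(line)
        user_id = record["user"]["id"]
        created = datetime.strptime(record["created_at"], CREATED_AT_FORMAT)
        text = record["text"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing fields, or non-tweet records are skipped.
        return None
    return (user_id, created.date().isoformat(), text)


# Hypothetical sample record mirroring the row shown in Table 1.
sample = json.dumps({
    "user": {"id": 123456789},
    "created_at": "Thu Jan 01 12:00:00 +0000 2015",
    "text": "Hello world!",
})
print(parse_tweet(sample))
```

Returning `None` for bad records lets a later filtering step drop them without stopping the whole job, which matters when the input is on the order of terabytes.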

Page 6: Search, Engage, Influence

Challenges

1. Track the reach impact inside the user and follower network

Tweet date: 05/10/2015
Mother's Day is hard when your mom deserves an island but you can only afford a candle

Page 7: Search, Engage, Influence

Challenges

1. Track the reach impact inside the user and follower network

Tweet date: 05/10/2015
Mother's Day is hard when your mom deserves an island but you can only afford a candle

Earliest retweet date: 05/15/2015
My mom taught me not to break people's heart.

Earliest retweet date: 05/09/2015
@PerfectAmeezy mOM

Earliest retweet date: 05/11/2015
my mom is either my best friend or satan there is no in between

Earliest retweet date: 05/01/2015
Happy Mother's Day to the best mom out there http://t.co/TSoesf2vuw ❤️

Page 8: Search, Engage, Influence

Current version: Find the earliest retweet time of each user and follower pair

Increase Spark memory to 6 GB

Benchmark the map-reduce job

Future improvement: a better way (Airflow)
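The "earliest retweet time per user and follower pair" step can be sketched in plain Python as a reduce-by-key with `min`. This is only an illustration: the event tuple layout and the follower ID `555` are assumptions, and the actual Spark job is not shown in the slides.

```python
def earliest_retweets(events):
    """Map each (user_id, follower_id) pair to its earliest retweet date.

    `events` is an iterable of (user_id, follower_id, iso_date) tuples.
    ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    """
    earliest = {}
    for user_id, follower_id, date in events:
        key = (user_id, follower_id)           # "map": emit the pair as the key
        if key not in earliest or date < earliest[key]:
            earliest[key] = date               # "reduce": keep the minimum date
    return earliest


# Hypothetical retweet events: (original author, retweeting follower, date).
events = [
    (123456789, 987654321, "2015-05-15"),
    (123456789, 987654321, "2015-05-09"),
    (123456789, 555, "2015-05-11"),
]
print(earliest_retweets(events))
```

In Spark, the same idea would typically be written as a pair RDD keyed by `(user_id, follower_id)` followed by `reduceByKey(min)`, so that only one date per pair has to be kept during the shuffle.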

Page 9: Search, Engage, Influence

Challenges

2. How to improve search efficiency and scalability?

● Multi-step filters (sequential queries vs. one big table)

● I/O optimization
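One way to read "multi-step filters" is a chain of narrowing queries, each applied to the output of the previous one, instead of a scan over one pre-joined table. The sketch below assumes an order (network, then time window, then keyword) that the slides do not specify. The `search` function, the in-memory data layout, and the IDs `555` and `777` are all hypothetical.

```python
def search(user_id, keyword, start, end, tweets, followers):
    """Return tweets by followers of `user_id` that mention `keyword` in [start, end]."""
    # Step 1: network filter. Only the target user's followers are kept.
    network = followers.get(user_id, set())
    in_network = (t for t in tweets if t[0] in network)
    # Step 2: time-window filter on ISO dates (string order equals date order).
    in_window = (t for t in in_network if start <= t[1] <= end)
    # Step 3: keyword filter, case-insensitive substring match.
    needle = keyword.lower()
    return [t for t in in_window if needle in t[2].lower()]


# Hypothetical data: follower sets per user, and (user_id, date, text) rows.
followers = {123456789: {987654321, 555}}
tweets = [
    (987654321, "2015-05-09", "@PerfectAmeezy mOM"),
    (555, "2015-05-20", "my mom is either my best friend or satan"),
    (777, "2015-05-10", "Happy Mother's Day to the best mom out there"),
    (987654321, "2015-05-11", "Hello world!"),
]
hits = search(123456789, "mom", "2015-05-01", "2015-05-15", tweets, followers)
print(hits)
```

Putting the cheap set-membership and date comparisons first means the string match only runs on rows that have already passed the earlier steps.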

Page 10: Search, Engage, Influence

About Yan

Data Analyst, US EPA, 2016
● design & auto-process metric
● geospatial/statistical modeling

M.S., Purdue University, 2013
● environmental informatics
● mathematical modeling

I LOVE nature, yoga, meditation...

Page 11: Search, Engage, Influence