search, engage, influence
SEEi: SEarch, Engage, Influence
Insight Data Engineering, Silicon Valley
Yan Jiang
Search keyword, in a time window, inside social network
● User ID
● Keyword
● Time window (start + end)
● Social network (follower relationship)
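The four inputs above can be read as one request object plus a per-tweet check. The following is only a minimal sketch of that idea, not the project's code: the names `SearchRequest` and `matches`, the tuple layout of a tweet, and the case-insensitive substring match are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SearchRequest:
    """Hypothetical request shape: user, keyword, and an inclusive time window."""
    user_id: int
    keyword: str
    start: date
    end: date


def matches(request, followers, tweet):
    """Return True if a tweet (author_id, day, text) satisfies the request.

    `followers` is the set of follower IDs of request.user_id (assumed input).
    """
    author_id, day, text = tweet
    return (
        author_id in followers                       # inside the social network
        and request.start <= day <= request.end      # inside the time window
        and request.keyword.lower() in text.lower()  # contains the keyword
    )


# Toy data, loosely based on the sample rows and tweets in the slides.
followers = {987654321}
request = SearchRequest(123456789, "mom", date(2015, 5, 1), date(2015, 5, 31))
tweets = [
    (987654321, date(2015, 5, 11), "my mom is either my best friend or satan"),
    (987654321, date(2015, 6, 2), "mom"),   # outside the time window
    (555, date(2015, 5, 11), "mom"),        # not a follower
]
hits = [t for t in tweets if matches(request, followers, t)]
```

Only the first toy tweet passes all three checks; the other two fail on the time window and on the follower relationship respectively.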
Motivation
Demo
http://ec2-52-24-96-242.us-west-2.compute.amazonaws.com/email
User Request
Network Relationship
Pipeline
Tweets + Network Relationship
Filtering + Map-reduce
Input Data
1.4 TB of JSON tweets

Table 1. Tweets Table
User ID      Timestamp     Tweets
123456789    2015-01-01    Hello world!

Table 2. User and Follower Relationship
User ID      Follower ID    Earliest Retweet Time
123456789    987654321      2015-01-01
Challenges
1. Track the reach impact inside the user and follower network
Tweet date: 05/10/2015
Mother's Day is hard when your mom deserves an island but you can only afford a candle
Earliest retweet date: 05/15/2015
My mom taught me not to break people's heart.
Earliest retweet date: 05/09/2015
@PerfectAmeezy mOM
Earliest retweet date: 05/11/2015
my mom is either my best friend or satan there is no in between
Earliest retweet date: 05/01/2015
Happy Mother's Day to the best mom out there http://t.co/TSoesf2vuw ❤️
Current version: find the earliest retweet time of each user and follower pair
Increase Spark memory to 6 GB
Benchmark the map-reduce job
Future improvement: a better way (Airflow)
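The "earliest retweet time of each user and follower pair" step can be sketched as a map phase that keys each record by the pair and a reduce phase that keeps the minimum date. This is an illustrative plain-Python version, not the Spark job from the talk; the function names and the record layout `(user_id, follower_id, retweet_day)` are assumptions.

```python
from datetime import date


def map_retweet(record):
    """Map phase: key each record by its (user, follower) pair."""
    user_id, follower_id, retweet_day = record
    return (user_id, follower_id), retweet_day


def reduce_earliest(keyed_days):
    """Reduce phase: keep the smallest (earliest) date seen for each key."""
    earliest = {}
    for key, day in keyed_days:
        if key not in earliest or day < earliest[key]:
            earliest[key] = day
    return earliest


# Toy records: the same pair appears twice with different retweet dates.
records = [
    (123456789, 987654321, date(2015, 5, 15)),
    (123456789, 987654321, date(2015, 5, 9)),
    (123456789, 111, date(2015, 5, 11)),
]
earliest = reduce_earliest(map(map_retweet, records))
```

On a Spark RDD of such key-value pairs, the same reduction would typically be written as `reduceByKey(min)`; whether the original job did it that way is not stated in the slides.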
2. How to improve search efficiency and scalability?
● Multi-step filters (sequential query vs. big table)
● I/O optimization
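One way to picture the "sequential query vs. big table" trade-off is to run the same search both ways on toy data. The sketch below is an assumption about what the two options could look like, not the implementation from the talk: `follows`, `tweets_by_author`, `big_table`, and both query functions are hypothetical names.

```python
from datetime import date

# Two small lookup tables (hypothetical layout).
follows = {123456789: {987654321, 111}}
tweets_by_author = {
    987654321: [(date(2015, 5, 11), "my mom is my best friend"),
                (date(2015, 6, 1), "mom again")],
    111: [(date(2015, 5, 9), "hello world")],
    222: [(date(2015, 5, 10), "mom")],  # not followed by anyone here
}


def sequential_query(user_id, keyword, start, end):
    """Multi-step filter: followers first, then time window, then keyword."""
    hits = []
    for follower_id in follows.get(user_id, ()):
        for day, text in tweets_by_author.get(follower_id, ()):
            if start <= day <= end and keyword in text.lower():
                hits.append((follower_id, day, text))
    return sorted(hits)


# Big table: the relationship and the tweets joined ahead of time.
big_table = [
    (user_id, follower_id, day, text)
    for user_id, follower_ids in follows.items()
    for follower_id in follower_ids
    for day, text in tweets_by_author.get(follower_id, ())
]


def big_table_query(user_id, keyword, start, end):
    """Single scan over the pre-joined table with all filters at once."""
    return sorted(
        (follower_id, day, text)
        for uid, follower_id, day, text in big_table
        if uid == user_id and start <= day <= end and keyword in text.lower()
    )


window = (date(2015, 5, 1), date(2015, 5, 31))
seq_hits = sequential_query(123456789, "mom", *window)
big_hits = big_table_query(123456789, "mom", *window)
```

Both calls return the same rows. The sequential version touches only the followers' tweets per request, while the big-table version scans one wider, pre-joined table that costs more to store.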
About Yan
Data Analyst, US EPA, 2016
● design & auto-process metric
● geospatial/statistical modeling
M.S., Purdue University, 2013
● environmental informatics
● mathematical modeling
I LOVE nature, yoga, meditation...