adventure in data: a tour of visualization projects at twitter
Krist Wongsuphasawat / @kristw
Adventure in data: A tour of visualizations at Twitter
Computer Engineer, Bangkok, Thailand
PhD in Computer Science, Univ. of Maryland (Information Visualization)
IBM, Microsoft
Data Visualization Scientist, Twitter
Adventure in data: A whirlwind tour of visualization projects at Twitter
Get data
Having all Tweets
How people think I feel. How I really feel.
Challenges
• Too much data. Want only relevant Tweets
  • hashtag: #BRA
  • keywords: “goal”
• Need to aggregate & reduce size
• Long processing time (hours)
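A minimal sketch of the "want only relevant Tweets" step, written in Python for illustration only (the slides run this stage on Hadoop). The `text` and `created_at` field names and the sample Tweets are assumptions, not the real Tweet schema.

```python
def relevant_tweets(tweets, hashtags=("#BRA",), keywords=("goal",)):
    """Yield only Tweets whose text contains a wanted hashtag or keyword."""
    wanted = [term.lower() for term in [*hashtags, *keywords]]
    for tweet in tweets:
        text = tweet["text"].lower()
        # Plain substring match: "goal" also matches "goalkeeper".
        if any(term in text for term in wanted):
            # Keep only the fields needed later, to reduce size.
            yield {"created_at": tweet["created_at"], "text": tweet["text"]}


# Hypothetical sample data.
sample = [
    {"id": 1, "created_at": "2014-06-12T20:11:00", "text": "GOAL!!! #BRA"},
    {"id": 2, "created_at": "2014-06-12T20:11:05", "text": "Lunch time"},
    {"id": 3, "created_at": "2014-06-12T20:11:09", "text": "What a goal"},
]
smaller = list(relevant_tweets(sample))
```

Filtering early and dropping unused fields is what makes the later, laptop-side steps fast.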
Workflow
Hadoop Cluster: Data Storage → Tool: Pig / Scalding (slow) → Smaller dataset
Your laptop: Smaller dataset → Tool: node.js / python / excel (fast) → Final dataset
Cleaning data: a story of my life
Projects
Storytelling: To understand the world and share the stories
Analytics Tools: To understand Twitter users and improve the service
Creative: To showcase the data and inspire
1. Storytelling
So many things we could learn from Twitter data
Events: World Cup, Election, Oscars, TV Shows, New Year, Earthquake, Super Bowl, Protest, …
Behaviors: Sleeping, Daylight saving, Language, Fasting, Information spread, Commute, …
Give us interesting vis about xxxx by Nov 10
Challenge accepted
Tweets
What? TEXT (+ media: photos, videos)
Where? GEO
When? TIME
Visualize Tweets
Time: Tweets/second
Time: Tweets/second + Annotation
http://www.flickr.com/photos/twitteroffice/5681263084/
Geo: Heatmap (low density to high density)
Geo: San Francisco (low density to high density)
flickr.com/photos/twitteroffice/8798020541
Rebuild the world based on tweet volumes
twitter.github.io/interactive/andes/
Text: Word cloud of Tweets right after the 1st goal
www.wordle.net
It was an “own” goal.
Text: WordTree [Wattenberg & Viégas 2008]
www.jasondavies.com/wordtree
Text: word/phrase/hashtag count, topic
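The word/hashtag counts behind a word cloud can be sketched as follows. The tokenizer regex and the sample texts are assumptions for illustration; phrase and topic extraction are not covered.

```python
import re
from collections import Counter

# A token is a run of word characters, optionally prefixed by "#" (hashtag).
TOKEN = re.compile(r"#?\w+")


def term_counts(texts):
    """Count lowercase words and hashtags across Tweet texts."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN.findall(text.lower()))
    return counts


# Hypothetical sample texts.
counts = term_counts(["Own goal! #BRA", "GOAL #BRA #WorldCup", "what a goal"])
top_terms = counts.most_common(2)
```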
Time + Geo: Tweet pattern [Rios & Lin 2012]
Legend: Night, Late night, Daytime
Geo + Text: Real-time Tweet map
most frequent term
Gmail was down, Jan 24, 2014
Nelson Mandela passed away, Dec 5, 2013
Time + Text: UEFA Champions League
Biggest tournament for European soccer clubs
Many Tweets during the matches
Dortmund vs. Bayern Munich
Count Tweets mentioning the teams every minute (Team 1, Team 2)
+ “goal” count + context
+ “offside”
+ players
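Counting Tweets that mention each team every minute could look roughly like this. The alias lists and the `(timestamp, text)` input shape are assumptions; real matching would need more care with nicknames and hashtags.

```python
from collections import Counter
from datetime import datetime

# Hypothetical aliases used to detect a team mention.
TEAMS = {"Dortmund": ("dortmund",), "Bayern Munich": ("bayern",)}


def mentions_per_minute(tweets, teams=TEAMS):
    """Count Tweets mentioning each team, bucketed by minute.

    `tweets` is an iterable of (datetime, text) pairs.
    """
    counts = Counter()
    for created_at, text in tweets:
        minute = created_at.replace(second=0, microsecond=0)
        lowered = text.lower()
        for team, aliases in teams.items():
            if any(alias in lowered for alias in aliases):
                counts[(minute, team)] += 1
    return counts


# Hypothetical sample data.
sample = [
    (datetime(2013, 5, 25, 20, 0, 5), "Come on Dortmund!"),
    (datetime(2013, 5, 25, 20, 0, 40), "Bayern look strong"),
    (datetime(2013, 5, 25, 20, 1, 10), "Dortmund vs Bayern, what a match"),
]
per_minute = mentions_per_minute(sample)
```

The same loop extends to the later layers by adding terms such as “goal”, “offside” or player names as extra keys.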
Time + Text + Geo: State of the Union
twitter.github.io/interactive/sotu2014
1) timeline + topic from Tweets
2) context (speech)
3) Volume of Tweets by topics during selected part of the SOTU
4) Density map of Tweets about selected topic
Time + Text: World Cup 2014
Time + Text + Geo: World Cup 2014
interactive.twitter.com/wccompetitree
Visualize Tweets + Non-Twitter data (CONTEXT)
Time + Text: New Year 2014
Time + Text + Geo (c): New Year 2014
twitter.github.io/interactive/newyear2014/
2. Analytics Tools
Data sources → get → explore → analyze → present → Output
* ad-hoc scripts
* tools for exploration
User activity logs
Users → Use → Twitter
Engineers → Instrument → Twitter → Write → Log data in Hadoop
Product Managers: Curious
What is being logged?
Activities: tweet (tweet from home timeline on twitter.com, tweet from search page on iPhone), sign up, log in, retweet, etc.
Organize?
log event a.k.a. “client event” [Lee et al. 2012]
client : page : section : component : element : action
web : home : timeline : tweet_box : button : tweet
1) User ID 2) Timestamp 3) Event name 4) Event detail
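The six-level event name can be split into named fields. This is a minimal sketch; the `ClientEvent` type and `parse_event_name` helper are illustrative names, not part of Twitter's logging code.

```python
from typing import NamedTuple


class ClientEvent(NamedTuple):
    """The six levels of a client event name."""
    client: str
    page: str
    section: str
    component: str
    element: str
    action: str


def parse_event_name(name: str) -> ClientEvent:
    """Split 'client : page : section : component : element : action'."""
    parts = [part.strip() for part in name.split(":")]
    if len(parts) != 6:
        raise ValueError(f"expected 6 levels, got {len(parts)}: {name!r}")
    return ClientEvent(*parts)


event = parse_event_name("web : home : timeline : tweet_box : button : tweet")
```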
Log data: bigger than Tweet data
Product Managers (Curious) → Ask → Data Scientists
Engineers, Data Scientists → Find, Clean, Analyze, Monitor → Log data in Hadoop
Part I Find & Monitor Client Events
Motivation
Log data in Hadoop: billions of rows
Aggregate → Client event collection: 10,000+ event types
date      client  page  section  comp.  elem.  action      count
20141011  web     home  home     -      -      impression  100
20141011  web     home  wtf      -      -      click       20
(wtf = Who-to-Follow)
Engineers & Data Scientists → Find → Search
Search fields: client, page, section, component, element, action
section? component? element?
Query: web home * * impression*
Results:
web : home : home : - : - : impression
web : home : wtf : - : - : impression
• search can be better
• not everybody knows: What are all sections under web:home?
• one graph / event, x 10,000
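A wildcard search over the six event-name fields might be sketched like this. Writing the query as six colon-separated patterns is an assumption (the query on the slide is only partly legible), and the event list is sample data.

```python
from fnmatch import fnmatchcase


def search_events(event_names, query):
    """Return event names whose fields match the query, field by field.

    Both names and the query are 'a : b : c : d : e : f' strings;
    each query field may use shell-style wildcards such as '*'.
    """
    patterns = [p.strip() for p in query.split(":")]
    results = []
    for name in event_names:
        fields = [f.strip() for f in name.split(":")]
        if len(fields) == len(patterns) and all(
            fnmatchcase(field, pattern)
            for field, pattern in zip(fields, patterns)
        ):
            results.append(name)
    return results


# Sample collection (the first two names appear on the slides).
events = [
    "web : home : home : - : - : impression",
    "web : home : wtf : - : - : impression",
    "web : home : wtf : - : - : click",
    "iphone : home : - : - : - : impression",
]
matches = search_events(events, "web : home : * : * : * : impression")
```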
Goals
• Search for client events
• Explore client event collection
• Monitor changes
Design
Engineers & Data Scientists → See → Client event collection
Interactions: search box => filter (narrow down)
How to visualize?
client : page : section : component : element : action
Client event hierarchy
iphone:home:-:-:-:impression
iphone:home:-:tweet:tweet:click
iphone
  home
    -
      - : - : impression
      tweet : tweet : click
Detect changes
TODAY compared to 7 DAYS AGO (same client event hierarchy)
Calculate changes
DIFF
iphone, home, -: +5% +5% +5%
-, -, impression: +10% +10% +10%
tweet, tweet, click: -5% -5% -5%
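One way to compute these per-node changes is to roll counts up the hierarchy for both days and compare. The counts below are invented so that the result reproduces the +5% / +10% / -5% pattern; the function names are illustrative.

```python
from collections import defaultdict


def rollup(counts):
    """Sum leaf counts into every ancestor node of the event hierarchy."""
    totals = defaultdict(int)
    for name, count in counts.items():
        parts = tuple(name.split(":"))
        for depth in range(1, len(parts) + 1):
            totals[parts[:depth]] += count
    return dict(totals)


def percent_change(today, week_ago):
    """Percent change per node, for nodes with a nonzero count 7 days ago."""
    now, before = rollup(today), rollup(week_ago)
    return {
        node: 100 * (now[node] - before[node]) / before[node]
        for node in now
        if before.get(node)
    }


# Hypothetical counts.
week_ago = {"iphone:home:-:-:-:impression": 200, "iphone:home:-:tweet:tweet:click": 100}
today = {"iphone:home:-:-:-:impression": 220, "iphone:home:-:tweet:tweet:click": 95}
diff = percent_change(today, week_ago)
```

Nodes that did not exist 7 days ago are skipped here; a real tool would probably flag them as new events instead.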
Display changes
Map of the Market [Wattenberg 1999], StemView [Guerra-Gomez et al. 2013]
Demo / Scribe Radar
Twitter for Banana
Part II Analysis
Count page visits
home page: banana : home : - : - : - : impression
Funnel analysis
home page → profile page: 1 job, 1 hour
  banana : home : - : - : - : impression
  banana : profile : - : - : - : impression
home page → profile page, home page → search page: 2 jobs, 2 hours
  banana : search : - : - : - : impression
Specify all funnels manually! n jobs, n hours
Goal
1 job => all funnels, visualized
Related work
• Visualize an overview of event sequences [Wongsuphasawat et al. 2011, Monroe et al. 2013, …]
• Big data? eBay checkout sequences [Shen et al. 2013]
User sessions
Session#1: start → A → B → end
Session#2: start → A → B → end
Session#3: start → A → C → end
Session#4: start → A → end
Aggregate (4 sessions)
start → A, then B → end, C → end, or end
Aggregate (4,000,000 sessions)
start → A, then B → end, C → end, or end
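The aggregation of sessions into a single tree can be sketched as a prefix tree with a count on every node. The nested-dict node layout is an assumption made for brevity.

```python
def new_node():
    return {"count": 0, "children": {}}


def aggregate_sessions(sessions):
    """Merge sessions that share a prefix into one tree of counted nodes."""
    root = new_node()  # represents "start"
    for events in sessions:
        node = root
        node["count"] += 1
        # Append an explicit "end" so session endings show up as nodes.
        for event in [*events, "end"]:
            node = node["children"].setdefault(event, new_node())
            node["count"] += 1
    return root


# The four sessions from the slides.
tree = aggregate_sessions([["A", "B"], ["A", "B"], ["A", "C"], ["A"]])
a_node = tree["children"]["A"]
```

In this tree `a_node` counts all four sessions, and its children `B`, `C` and `end` hold 2, 1 and 1 sessions respectively.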
Twitter for Banana
try with sample data (millions of sessions, 10,000+ event types)
original paper (100,000 sessions, ~10 event types)
fail…
How to make it work?
Reduce # of unique sequences
1. Reduce event types
   10,000 types → select: tweet, sign up, log out
   10,000 types → merge: tweet from home timeline, tweet from search page, tweet … = tweet
2. Reduce sequence length
   session: 1000 events → 10 events after (window size & direction) visit home page (alignment)
(1 and 2: ask users for input)
3. More aggregation on Hadoop
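Steps 1 and 2 can be sketched in a few lines, under the assumption that users supply a mapping from raw event names to simplified labels plus an alignment event, window size and direction. Every event name in `EVENT_MAP` is hypothetical.

```python
# Hypothetical user input: raw event name -> simplified label.
# Events missing from the map are dropped (select); several raw events
# sharing one label are merged.
EVENT_MAP = {
    "web:home:-:-:-:impression": "visit home page",
    "web:home:timeline:tweet_box:button:tweet": "tweet",
    "web:search:-:tweet_box:button:tweet": "tweet",
    "web:profile:-:-:-:impression": "visit profile page",
}


def reduce_event_types(events, event_map=EVENT_MAP):
    """Keep only selected events and replace them with merged labels."""
    return [event_map[e] for e in events if e in event_map]


def window(events, alignment, size, direction="after"):
    """Keep the alignment event plus `size` events after (or before) it."""
    if alignment not in events:
        return []
    i = events.index(alignment)  # align on the first occurrence
    if direction == "after":
        return events[i:i + size + 1]
    return events[max(0, i - size):i + 1]


session = [
    "web:login:-:-:-:impression",
    "web:home:-:-:-:impression",
    "web:home:timeline:tweet_box:button:tweet",
    "web:search:-:tweet_box:button:tweet",
    "web:profile:-:-:-:impression",
]
reduced = reduce_event_types(session)
first_two_after_home = window(reduced, "visit home page", size=2)
```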
Collapse events
e.g. tweet, tweet, tweet, … = tweet
Sequence (before): ABBBCCCC ABBCC ABC ABCCCC ABCD ABCCCD ABCCE ABCDF ABCDG ABCDH
Sequence (after):  ABC ABC ABC ABC ABCD ABCD ABCE ABCDF ABCDG ABCDH
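Collapsing consecutive repeats takes one line with `itertools.groupby`. This is a sketch of the idea, not the Hadoop implementation.

```python
from itertools import groupby


def collapse(sequence):
    """Replace each run of identical consecutive events with one event."""
    return [event for event, _run in groupby(sequence)]


collapsed = ["".join(collapse(s)) for s in ["ABBBCCCC", "ABBCC", "ABCCCD", "ABCCE"]]
```

Only adjacent repeats are merged, so a sequence such as A, B, A keeps both A events.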
Group & Count
Sequence  ABC   ABCD  ABCE  ABCDF  ABCDG  ABCDH  ABCDI  ABCDJK  ABCDJL
Count     2000  80    20    1      1      1      1      1       1
rare sequences (count < threshold)
Truncate: replace last event with x (…)
Sequence  ABC   ABCD  ABCE  ABCDx  ABCDx  ABCDx  ABCDx  ABCDJx  ABCDJx
Count     2000  80    20    1      1      1      1      1       1
Group & Count
Sequence  ABC   ABCD  ABCE  ABCDx  ABCDJx
Count     2000  80    20    4      2
Truncate more
Sequence  ABC   ABCD  ABCE  ABCDx  ABCDx
Count     2000  80    20    4      2
Group & Count
Sequence  ABC   ABCD  ABCE  ABCDx
Count     2000  80    20    6
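The truncate-and-regroup loop could be sketched as follows. The threshold of 3 and the rule "repeat until no rare sequence remains" are assumptions chosen to reproduce the example counts; the slides do not state the actual threshold or stopping rule.

```python
from collections import Counter


def truncate_rare(counts, threshold, marker="x"):
    """Repeatedly shorten rare sequences and regroup their counts.

    A rare sequence (count < threshold) has its last event replaced with
    `marker`; if it already ends with the marker, the event before the
    marker is dropped. Sequences are tuples of events.
    """
    counts = Counter(counts)
    while True:
        rare = [s for s, c in counts.items() if c < threshold and len(s) > 1]
        if not rare:
            return dict(counts)
        for seq in rare:
            count = counts.pop(seq)
            kept = seq[:-2] if seq[-1] == marker else seq[:-1]
            counts[kept + (marker,)] += count  # group & count again


raw = {"ABC": 2000, "ABCD": 80, "ABCE": 20, "ABCDF": 1, "ABCDG": 1,
       "ABCDH": 1, "ABCDI": 1, "ABCDJK": 1, "ABCDJL": 1}
result = truncate_rare({tuple(s): c for s, c in raw.items()}, threshold=3)
readable = {"".join(seq): count for seq, count in result.items()}
```

With these inputs `readable` ends up with ABC, ABCD, ABCE and ABCDx, where ABCDx holds the six rare sessions.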
Final process
1. Define set of events
2. Pick alignment, direction and window size
3. Run Hadoop job (with more aggregation)
4. Wait for it… (2+ hrs)
5. Visualize
gazillion patterns (TBs) → ~100,000 patterns (10MB)
Demo / Flying Sessions
Summary
• Large-scale User Activity Logs + Visual Analytics
• Used in day-to-day operations at Twitter
• Generalize to smaller systems
Challenge
big data → aggregate & sacrifice → small data → visualize & interact
3. Creative
Data sources → … → Output
https://medium.com/@kristw/designing-the-game-of-tweets-7f87c30dc5a2
Demo / Game of Tweets
Projects
Storytelling: To understand the world and share the stories
Analytics Tools: To understand Twitter users and improve the service
Creative: To showcase the data and inspire
Oh no…. NOT AGAIN
Reusable Toolkits: To implement once and for all
Coming soon
Demo / Labella.js
Demo / d3Kit
https://github.com/twitter/d3kit
http://www.slideshare.net/kristw/d3kit
Conclusions
Data are everywhere.
Many applications: Journalism, Product development, Art, etc.
Combine visualization with other skills: HCI, Design, Stats, ML, etc.
Don’t repeat yourself.
Krist Wongsuphasawat / @kristw
interactive.twitter.com
kristw.yellowpigz.com
Acknowledgement
@philogb @trebor @miguelrios @smrogers @lintool @linuslee @chuangl4
and many other colleagues at @twitter
Thank you
Questions?