Measuring the New Wikipedia Community (PyData SV 2013)
DESCRIPTION
Talk given by Ryan Faulkner at PyData Silicon Valley 2013
TRANSCRIPT
![Page 1: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/1.jpg)
Measuring the New Wikipedia Community
PyData 2013
Ryan Faulkner
Wikimedia Foundation
![Page 2: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/2.jpg)
Overview
Introduction
Problem & Motivation
Proposed Solution
User Metrics
A Short Example
Extending the Solution
Using the Tool
Live Demo!!
![Page 3: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/3.jpg)
Introduction
Me: Data Analyst at Wikimedia
Machine Learning @ McGill
Fundraising - A/B testing
Editor Experiments - increasing the number of active editors
Editor Engagement Experiments (E3) team @ the Wikimedia Foundation
Micro-feature experimentation
![Page 4: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/4.jpg)
Problem
What's wrong with Wikipedia?
![Page 5: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/5.jpg)
Problem - Editor Decline
http://strategy.wikimedia.org/wiki/Editor_Trends_Study
![Page 6: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/6.jpg)
Problem - Approach
Can we stimulate the community of users to become more numerous and productive?
○ Focus on new users
■ Encourage contribution, make it easier
○ Lower the threshold for account creation
■ Bring more people in.
○ Rapid experimentation on features that retain more users and stimulate increased participation.
■ This will help us determine what works at lower cost
![Page 7: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/7.jpg)
Problem - Evaluation
○ Data Consistency
■ Anomaly Detection
■ Auto-correlation (seasonality)
○ "A/B" testing
■ Hypothesis testing - Student's t, chi-square
■ Linear / Logistic regression
○ Multivariate testing
■ Analysis of variance
![Page 8: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/8.jpg)
Problem - What we need
Currently a lot of the analysis work is done manually and is a large drain on resources:
○ Faster data gathering
○ Knowing what we're logging and measuring & faster ETL
○ Faster analysis
○ Broadening service and iterating on results
![Page 9: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/9.jpg)
Problem - What we need
Build better infrastructure around how we interpret and analyze our data.
○ Determine what to measure.
■ Rigorously define relevant metrics
○ Expose the metrics from our data store
■ Python is great for writing code quickly to handle tasks with data
■ Library support for data analysis (pandas, numpy)
![Page 10: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/10.jpg)
Solution
The tools to build.
![Page 11: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/11.jpg)
Solution - Proposed
We need to measure User Behaviour
"User Metrics" & "UMAPI"
User Metrics & UMAPI
A Python implementation for gathering data from MediaWiki data stores, producing well-defined metrics, and facilitating subsequent modelling and analysis. It also provides an interface for making different types of requests and returning standard responses.
![Page 12: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/12.jpg)
Solution - Why Bother
What exactly do we gain by building these classes? Why not just query the database?
1. Reproducibility & Standardization
2. Extensibility
3. Concise definition
4. Faster turnaround
a. Multiprocessing to optimize metrics generation (e.g. revert rate on 100K users: via MySQL = 24 hrs, via User Metrics < 10 mins)
![Page 13: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/13.jpg)
Solution - Why Python?
Why not C++, Java, or PHP?
1. Speed of development
2. Simplify the code base & easy extensibility
a. more "Scientist Friendly"
3. Good support for data processing
4. Better integration for downstream data analysis
5. The way that metrics work lends them to "Pythonic" artifacts: list comprehensions, decorator patterns, duck typing, RESTful API.
![Page 14: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/14.jpg)
User Metrics
How do we form a picture about what happens on Wikipedia?
![Page 15: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/15.jpg)
User Metrics - User activity
Events (not exhaustive):
■ Registration
■ Making an edit
■ Contributions of Namespaces
■ Reverting edits
■ Blocking
![Page 16: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/16.jpg)
User Metrics - What do we want to know about users?
○ How much do they contribute?
○ How often do they contribute?
○ Potential vandals. Do they go on to be reverted,
blocked, banned?
![Page 17: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/17.jpg)
User Metrics - Metrics Definitions
https://meta.wikimedia.org/wiki/Research:Metrics
Retention Metrics
Survival(t): Boolean measure of an editor surviving beyond t
Threshold(t,n): Boolean measure of an editor reaching activity threshold n by time t
Live Account(t): Boolean measure of whether the new user clicked the edit button
Volume Metrics
Edit Rate: Float result of the user's rate of contribution
Content: Integer bytes added by revision, and edit count
Sessions: Average session length (future)
Time to Threshold: Time to reach a threshold (e.g. first edit)
![Page 18: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/18.jpg)
User Metrics - Metrics Definitions
Content Quality
Revert Rate: Float representing the proportion of revisions reverted
Block: Boolean indicating a block event on the user
Content Persistence: Integer indicating how long this user's edits survive (future)
Contribution Type
Namespace of Edits: Integer edit counts in all namespaces
Scale of Change: Float representing the fraction of total page content modified (future)
![Page 19: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/19.jpg)
User Metrics - Bytes Added
user revision history (over a predefined period)
Revision k: byte increase
(user ID, bytes_added, bytes_removed, edit count)
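The bytes-added flow above can be sketched roughly as follows. This is only an illustration, not the user_metrics implementation: the `Revision` tuple and its `byte_delta` field are hypothetical stand-ins for rows of a user's revision history.

```python
from collections import namedtuple

# Hypothetical record for one row of revision history.
# byte_delta is the change in page size caused by that revision.
Revision = namedtuple("Revision", ["user_id", "byte_delta"])


def bytes_added(user_id, revisions):
    """Return (user ID, bytes_added, bytes_removed, edit count) for one user."""
    deltas = [r.byte_delta for r in revisions if r.user_id == user_id]
    added = sum(d for d in deltas if d > 0)
    removed = sum(-d for d in deltas if d < 0)
    return (user_id, added, removed, len(deltas))


history = [
    Revision(1, 120),
    Revision(1, -30),
    Revision(2, 500),
    Revision(1, 45),
]
result = bytes_added(1, history)
```

Keeping additions and removals in separate counters means a user who adds and then deletes the same text is not reported as having contributed nothing.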
![Page 20: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/20.jpg)
User Metrics - Threshold
user revision history (over a predefined period)
registration
Events since registration up to time "t"
if len(event_list) >= n:
    threshold_reached = True
else:
    threshold_reached = False
(user ID, threshold_reached={0,1})
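A slightly fuller sketch of the threshold check, assuming events are plain timestamps and "t" is measured in hours since registration. The function name and argument layout are hypothetical, chosen for illustration only.

```python
from datetime import datetime, timedelta


def threshold_reached(registration, event_times, t_hours, n):
    """True if at least n events fall within t_hours of registration."""
    cutoff = registration + timedelta(hours=t_hours)
    # Keep only events between registration and the cutoff.
    event_list = [ts for ts in event_times if registration <= ts <= cutoff]
    return len(event_list) >= n


registered = datetime(2013, 1, 1)
# Three edits: 1, 5 and 30 hours after registration.
edits = [registered + timedelta(hours=h) for h in (1, 5, 30)]
two_in_a_day = threshold_reached(registered, edits, t_hours=24, n=2)
```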
![Page 21: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/21.jpg)
User Metrics - Revert Rate
user revision history (over a predefined period)
for each revision look at page history
Past Revisions: checksum i
Future Revisions: checksum k
if checksum i == checksum k:  # reverted!
(user ID, revert_rate, total_revisions)
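One plausible reading of the revert check above, sketched under assumptions: a revision counts as reverted when some later revision of the same page has the same content checksum as some earlier one, meaning the page was restored to a prior state. The data layout (lists of checksums per page, revisions as page/index pairs) is hypothetical.

```python
def is_reverted(page_checksums, idx):
    """True if a revision after idx restores a checksum seen before idx."""
    past = set(page_checksums[:idx])
    return any(c in past for c in page_checksums[idx + 1:])


def revert_rate(user_id, user_revisions, page_histories):
    """Return (user ID, revert_rate, total_revisions).

    user_revisions: list of (page, index into that page's history).
    page_histories: dict mapping page -> ordered list of content checksums.
    """
    total = len(user_revisions)
    reverted = sum(
        1 for page, idx in user_revisions
        if is_reverted(page_histories[page], idx)
    )
    rate = reverted / total if total else 0.0
    return (user_id, rate, total)


histories = {
    "Page_A": ["aaa", "bbb", "aaa"],  # revision 1 is undone by revision 2
    "Page_B": ["ccc", "ddd", "eee"],  # nothing is undone
}
result = revert_rate(7, [("Page_A", 1), ("Page_B", 1)], histories)
```

A production version would likely bound how far it looks ahead and behind in the page history; this sketch scans the whole history for brevity.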
![Page 22: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/22.jpg)
User Metrics - Implementation
https://github.com/wikimedia/user_metrics
1. MySQL & Redis (future) data store
a. All of the backend dependency is abstracted out of metrics classes
2. Python implementation - MySQLdb (SQLAlchemy)
3. Strategy Pattern of Parent user metrics class
4. Metrics built mainly from four core MediaWiki tables:
a. revision, user, page, logging
5. Python decorator methods for handling metric aggregation
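As a loose illustration of points 3 and 5, the sketch below shows interchangeable metric classes behind a common parent, plus a decorator that turns a plain function into an aggregator over per-user rows. All class, function, and parameter names here are hypothetical and are not taken from the user_metrics code base.

```python
import functools
import statistics


def aggregator(func):
    """Decorator: apply func to one column of per-user metric rows."""
    @functools.wraps(func)
    def wrapper(rows, column=1):
        return func([row[column] for row in rows])
    return wrapper


@aggregator
def mean_of(values):
    return statistics.mean(values)


class UserMetric:
    """Parent class: shared driver; each subclass supplies its own compute()."""

    def process(self, user_ids, data):
        return [(uid, self.compute(data.get(uid, []))) for uid in user_ids]

    def compute(self, events):
        raise NotImplementedError


class EditCount(UserMetric):
    def compute(self, events):
        return len(events)


class BytesAdded(UserMetric):
    def compute(self, events):
        return sum(d for d in events if d > 0)


def run_metric(metric, user_ids, data):
    """Caller code is the same whichever metric object is plugged in."""
    return metric.process(user_ids, data)


# Hypothetical data: user ID -> list of byte deltas, one per edit.
data = {1: [100, -20, 5], 2: [40]}
counts = run_metric(EditCount(), [1, 2], data)
added = run_metric(BytesAdded(), [1, 2], data)
```

Because aggregation is layered on with a decorator, the metric classes themselves only need to know how to score a single user.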
![Page 23: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/23.jpg)
User Metrics
![Page 24: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/24.jpg)
A Concrete Example
How can we use this framework?
![Page 25: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/25.jpg)
Example - Post Edit Feedback
What effect does editing feedback (confirmation/gratitude) have on new editors?
![Page 26: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/26.jpg)
Example - Results
![Page 27: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/27.jpg)
An Extended Solution
Turn the data machine into a service.
![Page 28: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/28.jpg)
Editor Metrics go beyond feature experimentation ...
It became clear that...
● We needed a service to let clients generate their own user metrics data sets
● We wanted to add a way for this methodology to extend beyond E3 and potentially WMF
● A force multiplier was necessary to iterate on editor data in more interesting ways (Machine Learning & more sophisticated analyses)
![Page 29: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/29.jpg)
User Metrics API [UMAPI]
Open Source (almost) RESTful API (Flask)
Computes metrics per user (User Metrics)
Combines metrics in different ways depending on request types
HTTP response in JSON with resulting data
Store data internally for reuse
![Page 30: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/30.jpg)
UMAPI
http://metrics.wikimedia.org/
https://github.com/wikimedia/user_metrics
https://github.com/rfaulkner/E3_analysis
https://pypi.python.org/pypi/wmf_user_metrics/0.1.3-dev
![Page 31: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/31.jpg)
UMAPI - Overview
Service GET requests based on a combination of URL paths + query params
e.g. /cohort/metric?date_start=..&date_end=...&...
Define user "cohorts" on which to operate
API engine maps to metrics request object (Mediator Pattern) which is handed off to a request manager which builds and runs request
JSON response
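To make the path-plus-query-params idea concrete, here is a minimal sketch that turns a request URL into a request object using only the standard library. The `MetricRequest` tuple is a hypothetical stand-in; the actual request object in UMAPI may carry different fields.

```python
from collections import namedtuple
from urllib.parse import urlsplit, parse_qs

# Hypothetical request object, for illustration only.
MetricRequest = namedtuple("MetricRequest", ["cohort", "metric", "params"])


def parse_request(url):
    """Split a .../<cohort>/<metric>?<params> URL into its parts."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    cohort, metric = segments[-2], segments[-1]
    # keep_blank_values retains flag-style params such as "time_series".
    query = parse_qs(parts.query, keep_blank_values=True)
    params = {key: values[0] for key, values in query.items()}
    return MetricRequest(cohort, metric, params)


req = parse_request(
    "http://metrics.wikimedia.org/cohorts/ryan_test_2/bytes_added"
    "?time_series&start=20120101&end=20130101&aggregator=sum"
)
```

The URL is only parsed, never fetched, so the sketch runs without network access.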
![Page 32: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/32.jpg)
UMAPI - Overview
Basic cPickle file cache for responses
○ Can substitute caching system (e.g. memcached)
Reusing request data where it overlaps
Request Types:
○ "Raw" - metrics per user
○ Aggregation over cohorts: mean, sum, median, etc.
○ Time series requests
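The difference between a "raw" response and an aggregated one can be sketched as below. The aggregator names mirror the ones listed above, but the lookup table and function are assumptions made for illustration.

```python
import statistics

# Hypothetical mapping from aggregator name to the function that applies it.
AGGREGATORS = {
    "mean": statistics.mean,
    "sum": sum,
    "median": statistics.median,
}


def aggregate(raw_rows, name):
    """Collapse "raw" (user ID, value) rows into one cohort-level value."""
    values = [value for _user_id, value in raw_rows]
    return AGGREGATORS[name](values)


# "Raw" result: one metric value per user in the cohort.
raw = [(1, 4), (2, 10), (3, 1)]
cohort_sum = aggregate(raw, "sum")
```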
![Page 33: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/33.jpg)
UMAPI Architecture
(Architecture diagram. Components shown: HTTP GET request and JSON response; Apache with mod_wsgi; Flask / App Server; Request Notifications Listener; Request Control; Response Control; Cache; User Metrics API; Messaging Queues; Metrics objects in separate processes; Asynchronous Callbacks; MediaWiki Slaves.)
![Page 34: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/34.jpg)
UMAPI Architecture - Listeners
Request Notifications Callback
○ Handles management of, and notifications on, job status
Request Controller
○ Queues requests
○ Spawns jobs from metrics objects
○ Coordinates parameters
Response Controller
○ Reconstructs response data
○ Writes to cache
![Page 35: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/35.jpg)
UMAPI - User Cohorts
We will want to consider large groups of users, for instance, a test or control group in some experiment:
Aggregate groups of users
○ lists of user IDs
Cohort registration (under construction)
○ adding new cohorts to the model
Single user endpoint
Boolean expressions over cohorts supported
![Page 36: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/36.jpg)
User Metric Periods
How do we define the periods over which metrics are measured?
Registration
○ Look "t" hours since user registration
User Defined
○ User supplied start and end dates
Conditional Registration
○ Registration as above, with the condition that registration falls within the input dates
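The three period types can be sketched as small functions that each return a (start, end) window. The function names are hypothetical, and the sketch assumes "t" is in hours and that the condition in the third type is an inclusive date range.

```python
from datetime import datetime, timedelta


def registration_period(registered, t_hours):
    """Registration: from registration until t hours later."""
    return registered, registered + timedelta(hours=t_hours)


def user_defined_period(start, end):
    """User defined: the caller supplies both bounds."""
    return start, end


def conditional_registration_period(registered, t_hours, start, end):
    """As registration, but only if registration falls within [start, end]."""
    if start <= registered <= end:
        return registration_period(registered, t_hours)
    return None  # user is excluded from the measurement


registered = datetime(2013, 2, 10)
window = registration_period(registered, 24)
```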
![Page 37: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/37.jpg)
UMAPI - RequestMeta Module
Mediator Pattern to handle passing request data among different portions of the architecture
Abstraction allows for easy filtering and default behaviour of request parameters
Requests can easily be turned into reproducible and unique hashes for caching
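The filtering, defaulting, and hashing ideas above might look roughly like this. The default values, parameter names, and the choice of SHA-1 over a JSON encoding are all assumptions for the sake of the sketch, not details of the RequestMeta module.

```python
import hashlib
import json

# Assumed defaults; the real module may define different parameters.
DEFAULTS = {"aggregator": None, "t": 24, "time_series": False}


def normalise(params):
    """Drop unknown keys and fill in defaults for missing ones."""
    return {key: params.get(key, default) for key, default in DEFAULTS.items()}


def request_hash(cohort, metric, params):
    """Build a reproducible cache key for a request."""
    # sort_keys makes the encoding independent of parameter order.
    payload = json.dumps([cohort, metric, normalise(params)], sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


key_a = request_hash("e3_pef1_confirmation", "threshold", {"t": 48, "aggregator": "sum"})
key_b = request_hash("e3_pef1_confirmation", "threshold", {"aggregator": "sum", "t": 48, "junk": 1})
key_c = request_hash("e3_pef1_confirmation", "threshold", {"t": 24})
```

Two requests that differ only in parameter order or in unrecognised parameters map to the same key, so a cached response can be reused for both.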
![Page 38: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/38.jpg)
How the Service Works
The user experience with user metrics.
![Page 39: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/39.jpg)
UMAPI - Pipeline
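(Pipeline diagram. Input: cohort or cohort combination. Stages shown: Raw (with params), Time Series (with params), Aggregator (with aggregator params). Each stage produces a JSON response.)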
![Page 40: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/40.jpg)
UMAPI - Frontend Flow
![Page 41: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/41.jpg)
Job Queue
As you fire off requests the queue tracks what's running:
![Page 42: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/42.jpg)
Response - Bytes Added
![Page 43: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/43.jpg)
Response - Threshold
![Page 44: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/44.jpg)
Response - Edit Rate
![Page 45: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/45.jpg)
Response - Threshold w/ params
![Page 46: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/46.jpg)
Response - Aggregation
![Page 47: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/47.jpg)
Response - Aggregation
![Page 48: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/48.jpg)
Response - Time series
![Page 49: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/49.jpg)
Response - Combining Cohorts
"usertags_meta" - cohort definitions
![Page 50: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/50.jpg)
Response - Combining Cohorts
Two intersecting cohorts:
![Page 51: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/51.jpg)
Response - Combining Cohorts
AND (&)
![Page 52: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/52.jpg)
Response - Combining Cohorts
OR (~)
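Treating cohorts as sets of user IDs, the AND and OR combinations shown on these slides reduce to intersection and union. The sketch below is an assumption-laden illustration: it handles only a single operator per expression, and the cohort names and parsing are hypothetical.

```python
def combine_cohorts(expression, cohorts):
    """Evaluate 'a&b' (AND) or 'a~b' (OR) over sets of user IDs.

    Mixed expressions such as 'a&b~c' are not handled in this sketch.
    """
    if "&" in expression:
        names = expression.split("&")
        return set.intersection(*(set(cohorts[n]) for n in names))
    names = expression.split("~")
    return set.union(*(set(cohorts[n]) for n in names))


# Two intersecting cohorts (hypothetical user IDs).
cohorts = {
    "test_group": [1, 2, 3, 4],
    "control_group": [3, 4, 5],
}
both = combine_cohorts("test_group&control_group", cohorts)
either = combine_cohorts("test_group~control_group", cohorts)
```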
![Page 53: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/53.jpg)
Response - Single user endpoint
e.g. http://metrics-api.wikimedia.org/user/Renklauf/threshold?t=10000
![Page 54: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/54.jpg)
Looking ahead ...
Connectivity metrics (additional metrics)
○ Graph database? (Neo4j, Gremlin w/ PostgreSQL)
○ User talk and common article edits
Better in-memory modelling
○ python-memcached
○ better reuse of generated data based on request data
Beyond English Wikipedia
○ Implemented!
![Page 55: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/55.jpg)
Looking ahead ...
More sophisticated and robust data modelling
○ Modelling richer data: contribution histories, articles edited, aggregate metrics
○ Classification: Logistic classifiers, Support Vector Machines, Deep Belief Networks, Dimensionality Reduction
○ Modelling revision text - Neural Networks, Hidden Markov Models
![Page 56: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/56.jpg)
DEMO!!
http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/threshold
http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/threshold?aggregator=average
http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/edit_rate
http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/edit_rate?aggregator=dist
http://metrics.wikimedia.org/cohorts/ryan_test_2/bytes_added?time_series&start=20120101&end=20130101&aggregator=sum&group=input&interval=720
![Page 57: Measuring the New Wikipedia Community (PyData SV 2013)](https://reader034.vdocuments.us/reader034/viewer/2022052618/554be542b4c90556328b4a49/html5/thumbnails/57.jpg)
The End
http://metrics.wikimedia.org/
stat1.wikimedia.org:4000
https://github.com/wikimedia/user_metrics
https://github.com/rfaulkner/E3_analysis
https://pypi.python.org/pypi/wmf_user_metrics/0.1.3-dev
Questions?