Collaborative Data Management: How Crowdsourcing Can Help To Manage Data
Edward Curry
Enterprise Data World 2013
Digital Enterprise Research Institute www.deri.ie
Enabling networked knowledge
Overview
• Problems with Data
  – Master Data Management
• Crowdsourcing
• Collaborative Data Management
• Setting up a CDM Process
• Future Directions
The Problems with Data
Knowledge workers need:
  – Access to the right data
  – Confidence in that data
Flawed data affects 25% of critical data in the world’s top companies
The role of data quality in the recent financial crisis:
  – “Assets are defined differently in different programs”
  – “Numbers did not always add up”
  – “Departments do not trust each other’s figures”
  – “Figures … not worth the pixels they were made of”
Master Data Management
• Master Data Management is a process that can improve data quality
• What is Data Quality?
  – Desirable characteristics for an information resource
  – Described as a series of quality dimensions: Discoverability, Accessibility, Timeliness, Completeness, Interpretation, Accuracy, Consistency, Provenance & Reputation
Master Data Management
[Figure: MDM process for Data Quality. Activities: Profile Sources, Define Mappings, Cleanse, Enrich, De-duplicate, Define Rules. Roles: Data Developer, Data Steward, Data Governance, Business Users. Other elements: Product Data (sources), Master Data, Applications.]
Data Quality

Source A
ID    PNAME      PCOLOR  PRICE
APNR  iPod Nano  Red     150
APNS  iPod Nano  Silver  160

Source B
<Product name="iPod Nano">
  <Items>
    <Item code="IPN890">
      <price>150</price>
      <generation>5</generation>
    </Item>
  </Items>
</Product>

[Figure: the Source A rows and the Source B item laid side by side for comparison]
Questions: Schema Difference? Value Conflicts? Entity Duplication?
Roles: Data Developer (Technical), Data Steward (Domain (Technical)), Business Users (Domain)
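The three questions on this slide can be made concrete with a small sketch. It is only an illustration: the common schema (code, name, color, price, generation) and the function names are assumptions introduced here, and matching on product name is one simple heuristic, not the method of any MDM product.

```python
import xml.etree.ElementTree as ET

# Source A: the rows of the slide's relational table.
SOURCE_A = [
    {"ID": "APNR", "PNAME": "iPod Nano", "PCOLOR": "Red", "PRICE": 150},
    {"ID": "APNS", "PNAME": "iPod Nano", "PCOLOR": "Silver", "PRICE": 160},
]

# Source B: the slide's XML document.
SOURCE_B = """
<Product name="iPod Nano">
  <Items>
    <Item code="IPN890">
      <price>150</price>
      <generation>5</generation>
    </Item>
  </Items>
</Product>
"""


def normalize_a(row):
    """Map a Source A row onto the (hypothetical) common schema."""
    return {"code": row["ID"], "name": row["PNAME"], "color": row["PCOLOR"],
            "price": row["PRICE"], "generation": None}


def normalize_b(xml_text):
    """Map every <Item> of Source B onto the same common schema."""
    product = ET.fromstring(xml_text)
    records = []
    for item in product.iter("Item"):
        generation = item.findtext("generation")
        records.append({
            "code": item.get("code"),
            "name": product.get("name"),
            "color": None,  # Source B carries no colour attribute
            "price": int(item.findtext("price")),
            "generation": int(generation) if generation is not None else None,
        })
    return records


def candidate_duplicates(left, right):
    """Pair records that share a name and note whether their prices agree.

    The pairs are only candidates: deciding whether they describe the same
    entity is left to a data steward or business user.
    """
    pairs = []
    for a in left:
        for b in right:
            if a["name"].lower() == b["name"].lower():
                pairs.append((a["code"], b["code"], a["price"] == b["price"]))
    return pairs


records_a = [normalize_a(row) for row in SOURCE_A]
records_b = normalize_b(SOURCE_B)
pairs = candidate_duplicates(records_a, records_b)
```

With the slide's data, both Source A rows pair with IPN890, and only the Red row agrees on price. The schema mapping is the technical part of the job; judging the candidate pairs needs domain knowledge.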
Master Data Management
• Pros
  – Can create a single version of truth
  – Standardized information creation and management
  – Improves data quality
• Cons
  – Significant upfront costs and efforts
  – Participation limited to a few (mostly technical) experts
  – Difficult to scale for large data sources
      – Extended enterprise, e.g. partners, data vendors
  – Small % of data under management (i.e. CRM, Product, …)
Enterprise Data Landscape
• The Managed (MDM): Reference data managed through well-defined policies and a governance council
• The Known (Enterprise Data): Data directly managed by the enterprise and its departments
• The Reality (Relevant External Data): All data relevant to the enterprise and its operations
CROWDSOURCING
Crowdsourcing Industry Landscape
Introduction to Crowdsourcing
• Coordinating a crowd (a large group of workers) to do micro-work (small tasks) that solves problems (that computers or a single user can’t)
• A collection of mechanisms and associated methodologies for scaling and directing crowd activities to achieve goals
• Related Areas
  – Collective Intelligence
  – Social Computing
  – Human Computation
  – Data Mining
When Computers Were Human
• Maskelyne, 1760
  – Used human computers to create an almanac of moon positions
      – Used for shipping/navigation
  – Quality assurance
      – Do calculations twice
      – Compare to third verifier
When Computers Were Human
Human vs Machine Affordances
Human
  ✓ Visual perception
  ✓ Visuospatial thinking
  ✓ Audiolinguistic ability
  ✓ Sociocultural awareness
  ✓ Creativity
  ✓ Domain knowledge
Machine
  ✓ Large-scale data manipulation
  ✓ Collecting and storing large amounts of data
  ✓ Efficient data movement
  ✓ Bias-free analysis
When to Crowdsource?
• Computers cannot do the task
• A single person cannot do the task
• Work can be split into smaller tasks
Tag a Tune
Peekaboom
Foldit
ReCaptcha
• OCR
  – ~1% error rate
  – 20%-30% for 18th and 19th century books
• “40 million ReCAPTCHAs every day” (2008)
  – Fixing 40,000 books a day
Generic Architecture
[Figure: generic crowdsourcing architecture with three parts: Requestors, Platform/Marketplace (Publish Task, Task Management), and Workers, linked by four numbered steps.]
Amazon Mechanical Turk
CrowdFlower
COLLABORATIVE DATA MANAGEMENT
• Collaborative knowledge base maintained by a community of web users
• Users create entity types and their meta-data according to guidelines
• Requires administrative approvals for schema changes by end users
Wikipedia
• Collaboratively built by a large community
  – More than 19,000,000 articles, 270+ languages, 3,200,000+ articles in English
  – More than 157,000 active contributors
• Accuracy and stylistic formality are equivalent to expert-based resources
  – e.g. Columbia and Britannica encyclopedias
• MediaWiki
  – Software behind Wikipedia
  – Widely used inside organizations
  – Intellipedia: 16 U.S. intelligence agencies
  – Wiki Proteins: curated protein data for knowledge discovery
DBPedia Knowledge Base
• DBPedia provides direct access to data
  – Indirectly uses the wiki as a data curation platform
  – Inherits massive volume of curated Wikipedia data
  – 3.4 million entities and 1 billion RDF triples
  – Comprehensive data infrastructure
      – Concept URIs
      – Definitions
      – Basic types
A Bottom-up Approach to MDM
Engage More Human Workers to Collaboratively Manage Enterprise Data
[Figure: Data Control (Top-down to Bottom-up) plotted against Number of Participants (10s-100s to 10,000s-100,000s), positioning MDM and Collaborative Enterprise Data Management.]
Emerging Enterprise Data Landscape
• The Managed (MDM): Reference data managed through well-defined policies and a governance council
• The Known (Enterprise Data): Data directly managed by the enterprise and its departments
• The Reality (Relevant External Data): All data relevant to the enterprise and its operations
• Collaboratively Managed
Algorithm + Crowd
[Figure: Data Sources are processed by Data Quality Algorithms together with Human Computation to produce Clean Data. Participants: Developers, Data Governance, Internal Community, External Crowd.]
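One way to read "Algorithm + Crowd" is that an algorithm settles the cases it is confident about and passes the rest to people. The sketch below illustrates that reading. The thresholds, the `triage` function and the use of Python's `difflib` string similarity are all assumptions chosen for the example; they are not taken from the slides.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds; real values would be tuned on labelled data.
AUTO_ACCEPT = 0.90   # at or above this, the algorithm declares a duplicate
AUTO_REJECT = 0.50   # at or below this, the algorithm declares distinct records


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in the range 0..1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage(pairs):
    """Split record pairs into algorithm decisions and crowd tasks."""
    decided, for_crowd = [], []
    for left, right in pairs:
        score = similarity(left, right)
        if score >= AUTO_ACCEPT:
            decided.append((left, right, "duplicate"))
        elif score <= AUTO_REJECT:
            decided.append((left, right, "distinct"))
        else:
            # Uncertain case: becomes a micro-task for human workers.
            for_crowd.append((left, right))
    return decided, for_crowd


decided, for_crowd = triage([
    ("iPod Nano", "ipod nano"),
    ("iPod Nano", "Kindle Fire"),
    ("iPod Nano", "iPod Nano 5th Gen"),
])
```

Here the first two pairs are settled automatically and only the third, where the names overlap partially, is routed to the crowd.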
Examples of CDM Tasks
• Understanding customer sentiment for the launch of a new product around the world
• Implemented a 24/7 sentiment analysis system with workers from around the world
• Categorizing millions of products in eBay’s catalog with accurate and complete attributes
• Combining the crowd with machine learning to create an affordable and flexible catalog quality system
Examples of CDM Tasks
• Natural Language Processing
  – Dialect Identification, Spelling Correction, Machine Translation, Word Similarity
• Computer Vision
  – Image Similarity, Image Annotation/Analysis
• Classification
  – Data attributes, Improving taxonomy, search results
• Verification
  – Entity consolidation, de-duplicate, cross-check, validate data
• Enrichment
  – Judgments, annotation
SETTING UP A CDM PROCESS
Core Design Questions of CDM
• What (Goal)
• Why (Incentives)
• Who (Workers)
• How (Process)
Malone, T. W., Laubacher, R., & Dellarocas, C. N. Harnessing crowds: Mapping the genome of collective intelligence. MIT Sloan Research Paper 4732-09, (2009).
Who is doing it? (Workers)
• Hierarchy (Assignment)
  – Someone in authority assigns a particular person or group of people to perform the task
  – Within the Enterprise
• Crowd (Choice)
  – Anyone in a large group who chooses to do so
  – Internal or External Crowds
Why are they doing it? (Incentives)
• Motivation
  – Money ($$££)
  – Glory (reputation/prestige)
  – Love (altruism, socialize, enjoyment)
  – Unintended by-product (e.g. re-Captcha, captured in workflow)
  – Self-serving resources (e.g. Wikipedia, product/customer data)
• Determine pay and time for each task
  – Marketplace: Delicate balance
      – Money does not improve quality but can increase participation
  – Internal Hierarchy: Engineering opportunities for recognition
      – Performance review, prizes for top contributors, badges, leaderboards, etc.
Effect of Payment on Quality
• Cost does not affect quality [Mason and Watts, 2009, AdSafe]
• Similar results for bigger tasks [Ariely et al, 2009]
[Panos Ipeirotis. WWW2011 tutorial]
What is being done? (Goal)
• Creation Tasks
  – Create / Generate
  – Find
  – Improve / Edit / Fix
• Decision (Vote) Tasks
  – Accept / Reject
  – Thumbs Up / Thumbs Down
  – Vote for Best
How is it being done? (Process)
• Tasks integrated in the normal workflow of those creating and managing data
  – Can be as simple as vetting or “rating” the results of an algorithm
• Task Design
  – Task Interface
  – Task Assignment/Routing
  – Task Quality Assurance
Task Design
[Figure: Input → Task Router (before computation) → Task Interface (during computation) → Output Aggregation (after computation) → Output]
* Edith Law and Luis von Ahn, Human Computation - Core Research Questions and State of the Art
Pull Routing
• Workers seek tasks and assign them to themselves
  – Search and discovery of tasks supported by the platform
  – Task Recommendation
  – Peer Routing
[Figure: Workers select Tasks through a Search & Browse Interface backed by an Algorithm, and return a Result.]
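Pull routing can be sketched as a task pool that workers search and then claim from. This is a minimal in-memory illustration; `Task`, `TaskPool`, `search` and `claim` are hypothetical names, and a real platform would add persistence, recommendation and concurrency control.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: int
    description: str
    assigned_to: Optional[str] = None  # None means the task is still open


class TaskPool:
    """Hypothetical in-memory pool of tasks that workers browse and claim."""

    def __init__(self, tasks):
        self._tasks = {task.task_id: task for task in tasks}

    def search(self, keyword):
        """Return open tasks whose description mentions the keyword."""
        needle = keyword.lower()
        return [task for task in self._tasks.values()
                if task.assigned_to is None and needle in task.description.lower()]

    def claim(self, task_id, worker):
        """Let a worker self-assign a task; fail if it is already taken."""
        task = self._tasks[task_id]
        if task.assigned_to is not None:
            return False
        task.assigned_to = worker
        return True


pool = TaskPool([
    Task(1, "De-duplicate product records"),
    Task(2, "Validate customer addresses"),
    Task(3, "Categorize product images"),
])
found = [task.task_id for task in pool.search("product")]
first_claim = pool.claim(1, "alice")
second_claim = pool.claim(1, "bob")       # rejected: alice already holds task 1
remaining = [task.task_id for task in pool.search("product")]
```

The key property of pull routing shows in `claim`: the worker, not the system, decides who takes the task, and the platform only prevents double assignment.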
Push Routing
• System assigns tasks to workers based on:
  – Past performance
  – Expertise
  – Cost
  – Latency
[Figure: an Algorithm assigns Tasks to Workers, who complete them through a Task Interface and return a Result.]
* www.mobileworks.com
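The four push-routing criteria can be combined into a single score per worker. The sketch below uses a plain weighted sum; the `Worker` fields, the weights and the linear form are assumptions made for illustration and do not describe how any particular platform routes tasks.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    accuracy: float     # past performance, between 0 and 1
    skills: frozenset   # expertise tags
    cost: float         # price per task
    latency: float      # typical turnaround in minutes


# Hypothetical weights: accuracy and expertise raise the score,
# cost and latency lower it.
WEIGHTS = {"accuracy": 1.0, "expertise": 1.0, "cost": 0.5, "latency": 0.01}


def score(worker, required_skill):
    """Weighted sum of the four routing criteria for one worker."""
    expertise = 1.0 if required_skill in worker.skills else 0.0
    return (WEIGHTS["accuracy"] * worker.accuracy
            + WEIGHTS["expertise"] * expertise
            - WEIGHTS["cost"] * worker.cost
            - WEIGHTS["latency"] * worker.latency)


def assign(required_skill, workers):
    """Push the task to the highest-scoring worker."""
    return max(workers, key=lambda worker: score(worker, required_skill))


workers = [
    Worker("ann", accuracy=0.95, skills=frozenset({"products"}), cost=0.20, latency=30),
    Worker("bob", accuracy=0.80, skills=frozenset({"addresses"}), cost=0.10, latency=10),
    Worker("cat", accuracy=0.90, skills=frozenset({"products"}), cost=0.50, latency=5),
]
product_worker = assign("products", workers)
address_worker = assign("addresses", workers)
```

With these weights a product task goes to cat, whose faster turnaround outweighs ann's slightly higher accuracy and lower price, while an address task goes to bob, the only worker with that skill. Changing the weights changes the trade-off between cost, quality and completion time.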
Managing Task Quality Assurance
• Redundancy: Quorum Votes
  – Replicate the task (e.g. 3 times)
  – Use majority voting to determine the right value (% agreement)
  – Weighted majority vote
• Gold Data / Honey Pots
  – Inject trap questions to test quality
  – Worker fatigue check (habit of saying no all the time)
• Estimation of Worker Quality
  – Redundancy plus gold data
• Qualification Test
  – Use test tasks to determine a user's ability for such tasks
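The redundancy and gold-data techniques above fit together in a few lines. The following is a sketch under simple assumptions: each worker gives one answer per task, worker quality is estimated as the fraction of gold questions answered correctly, and that fraction is used directly as the vote weight. The function names and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return the most common answer and the share of workers who gave it."""
    counts = Counter(answers.values())
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


def gold_accuracy(worker_answers, gold):
    """Estimate worker quality as the fraction of gold questions answered correctly."""
    accuracy = {}
    for worker, given in worker_answers.items():
        graded = [question for question in gold if question in given]
        correct = sum(given[question] == gold[question] for question in graded)
        accuracy[worker] = correct / len(graded) if graded else 0.0
    return accuracy


def weighted_vote(answers, weights):
    """Pick the answer with the largest total weight of supporting workers."""
    totals = defaultdict(float)
    for worker, answer in answers.items():
        totals[answer] += weights.get(worker, 0.0)
    return max(totals, key=totals.get)


# Gold (trap) questions with known answers, and what each worker said.
gold = {"g1": "yes", "g2": "no"}
gold_answers = {
    "w1": {"g1": "yes", "g2": "no"},
    "w2": {"g1": "no", "g2": "no"},
    "w3": {"g1": "no", "g2": "yes"},
}
# The real task, replicated to three workers.
task_answers = {"w1": "duplicate", "w2": "distinct", "w3": "distinct"}

plain_winner, agreement = majority_vote(task_answers)
weights = gold_accuracy(gold_answers, gold)
weighted_winner = weighted_vote(task_answers, weights)
```

In this example the plain majority picks "distinct" with two of three votes, but the weighted vote picks "duplicate" because the only worker who passed both gold questions gave that answer. That difference is the reason to combine redundancy with gold data.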
Future Directions
• Task Management
  – Task assignment, payment, routing
      – Optimizing for Cost, Quality, Completion Time
• Human–Computer Interaction
  – Payment / incentives
  – User interface and interaction design
  – Worker reputation, recruitment, retention
• Quality Control
  – Trust, reliability, spam detection, consensus
Summary
• Collaborative Data Management
  – Emerging trend for data management in the enterprise
  – Crowdsourcing + Micro Tasks
  – A number of emerging platforms to assist
[Figure: Dirty Data → Data Quality Algorithms + Human Computation → Clean Data]
BIG Big Data Public Private Forum
THE BIG PROJECT
Overall objective
Bringing the necessary stakeholders into a self-sustainable, industry-led initiative that will contribute greatly to enhancing EU competitiveness by taking full advantage of Big Data technologies.
Work at technical, business and policy levels, shaping the future through the positioning of IIM and Big Data specifically in Horizon 2020.
Key facts about BIG-project
▶ Type of project: CSA
▶ Project start date: September 2012
▶ Duration: 26 months
▶ Call: FP7-ICT-2011-8
▶ Effort: 552,5 PM
▶ Budget: 3,038 M€
▶ Max EC contribution: 2,499 M€
▶ Consortium: 11 partners
BIG: PROJECT STRUCTURE
Value Chain and Technical areas
• Data acquisition: Structured data, Unstructured data, Event processing, Sensor networks, Streams
• Data analysis: Data preprocessing, Semantic analysis, Sentiment analysis, Other features analysis, Data correlation
• Data curation: Trust, Provenance, Data augmentation, Data validation
• Data storage: RDBMS limitations, NoSQL, Cloud storage
• Data usage: Decision support, Decision making, Automatic steps, Domain-specific usage
Industry-driven working groups (Supply and Needs)
• Health
• Public Sector
• Telco, Media & Entertainment
• Finance & Insurance
• Manufacturing, Retail, Energy, Transport
About the Presenter
Edward is a research scientist at the Digital Enterprise Research Institute. His areas of research include green IT/IS, energy informatics, linked data, integrated reporting, and cloud computing. He has worked extensively with industry and government advising on the adoption patterns, practicalities and benefits of new technologies.
He has published in leading journals and books, and has spoken at international conferences including the MIT CIO Symposium.
URL: www.edwardcurry.org
Email: [email protected]
Twitter: @EdwardACurry
Slides: slideshare.net/edwardcurry
Selected References
• Big Data & Data Quality
  – S. Lavalle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, “Big Data, Analytics and the Path from Insights to Value,” MIT Sloan Management Review, vol. 52, no. 2, pp. 21–32, 2011.
  – A. Haug and J. S. Arlbjørn, “Barriers to master data quality,” Journal of Enterprise Information Management, vol. 24, no. 3, pp. 288–303, 2011.
  – R. Silvola, O. Jaaskelainen, H. Kropsu-Vehkapera, and H. Haapasalo, “Managing one master data – challenges and preconditions,” Industrial Management & Data Systems, vol. 111, no. 1, pp. 146–162, 2011.
  – E. Curry, S. Hasan, and S. O’Riain, “Enterprise Energy Management using a Linked Dataspace for Energy Intelligence,” in Second IFIP Conference on Sustainable Internet and ICT for Sustainability, 2012.
  – D. Loshin, Master Data Management. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2008.
  – B. Otto and A. Reichert, “Organizing Master Data Management: Findings from an Expert Survey,” in Proceedings of the 2010 ACM Symposium on Applied Computing - SAC ’10, 2010, pp. 106–110.
Selected References
• Collective Intelligence, Crowdsourcing & Human Computation
  – A. Doan, R. Ramakrishnan, and A. Y. Halevy, “Crowdsourcing systems on the World-Wide Web,” Communications of the ACM, vol. 54, no. 4, p. 86, Apr. 2011.
  – E. Law and L. von Ahn, “Human Computation,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 5, no. 3, pp. 1–121, Jun. 2011.
  – M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin, “CrowdDB: Answering Queries with Crowdsourcing,” in Proceedings of the 2011 International Conference on Management of Data - SIGMOD ’11, 2011, p. 61.
  – P. Wichmann, A. Borek, R. Kern, P. Woodall, A. K. Parlikad, and G. Satzger, “Exploring the ‘Crowd’ as Enabler of Better Information Quality,” in Proceedings of the 16th International Conference on Information Quality, 2011, pp. 302–312.
  – W. A. Mason and D. J. Watts, “Financial incentives and the ‘performance of crowds’,” SIGKDD Explorations, vol. 11, no. 2, pp. 100–108, 2009.
  – P. Ipeirotis, “Managing Crowdsourced Human Computation,” WWW2011 Tutorial.
  – O. Alonso and M. Lease, “Crowdsourcing 101: Putting the WSDM of Crowds to Work for You,” WSDM, Hong Kong, 2011.
  – When Computers Were Human: http://www.youtube.com/watch?v=YwqltwvPnkw
Selected References
• Collaborative Data Management
  – E. Curry, A. Freitas, and S. O’Riain, “The Role of Community-Driven Data Curation for Enterprises,” in Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25–47.
  – U. ul Hassan, S. O’Riain, and E. Curry, “Towards Expertise Modelling for Routing Data Cleaning Tasks within a Community of Knowledge Workers,” in 17th International Conference on Information Quality (ICIQ 2012), Paris, France, 2012.
  – U. ul Hassan, S. O’Riain, and E. Curry, “Effects of Expertise Assessment on the Quality of Task Routing in Human Computation,” in 2nd International Workshop on Social Media for Crowdsourcing and Human Computation, Paris, France, 2013.
  – U. ul Hassan, S. O’Riain, and E. Curry, “Leveraging Matching Dependencies for Guided User Feedback in Linked Data Applications,” in 9th International Workshop on Information Integration on the Web (IIWeb2012), Scottsdale, Arizona: ACM, 2012.