ordering the chaos: creating websites with imperfect data
TRANSCRIPT
![Page 1: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/1.jpg)
Ordering the chaos: creating websites using imperfect data
Andrew Stretton
Oxford University Web SIG November 2014
![Page 2: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/2.jpg)
Who am I, what is ChemBio Hub?
• Andrew Stretton – Data Architect and Developer
github.com/strets123
@strets123
linkedin (google me)
• ChemBio Hub
http://chembiohub.ox.ac.uk (feel free to link to us!)
@oxchembiohub
github.com/thesgc
![Page 3: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/3.jpg)
ChemBio Hub exists to support research at the interface of chemistry and biology
by enabling sharing of reagents, expertise and data across 20+ departments
![Page 4: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/4.jpg)
Who are we trying to connect and how?
User 1: Scientist at Oxford
User 2: Potential collaborator
Could be in industry or anywhere in academia
Unpublished results
Negative Data
Equipment
Methods
Areas of expertise
Questions and answers
Contacts
Reagents
Publications
Held on other sites or social networks
Organised/linked to by ChemBio Hub
Stored and curated by ChemBio Hub
? Not sure yet
![Page 5: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/5.jpg)
All of these parts require tagging entities in text. How can we do it cheaply and sustainably?
![Page 6: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/6.jpg)
What sorts of messy data are we working with?
• Full text from procedures, biographies, web sites
• Raw CSV/Excel formats from multiple machines or departmental processes
• “Standard” XML and JSON formats from various sources that do not map perfectly to our application
• Multiple external databases to submit data to
![Page 7: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/7.jpg)
How do most of our users like their web-based tools?
Simple Search
Flexible data management
Comprehensive, overlapping tagging
Clear progress, seamless experience
![Page 8: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/8.jpg)
What do we sometimes give them?
• Incomplete or many-to-one tagging
• Hyperlinks instead of the right information from the other site
• Dumb search
• Inflexible schemas
• Lack of linking between datasets
![Page 9: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/9.jpg)
What strategies do we have to deal with messy data?
Create more helpful data management apps
Fill in gaps in tagging by using search engines
Consider creating databases of flat files
Create map reduce / database views for schema normalisation and data analysis
Web crawling - not as hard or messy as it used to be
![Page 10: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/10.jpg)
Let’s look at filling in gaps in tagging first; happy to discuss other areas later…
![Page 11: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/11.jpg)
How do we fill in gaps on un-tagged data?
Let’s do an experiment…
github.com/strets123/web-sig-2014/
![Page 12: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/12.jpg)
Elasticsearch - information extraction on-the-fly
• Take a dataset of 18801 companies
– ~50% tagged
– >80% have some text data
[Bar chart, 0% to 100%: share of companies with Tags, Description, Overview, and Overview or description]
Source data : http://jsonstudio.com/resources/ github.com/strets123/web-sig-2014/
![Page 13: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/13.jpg)
Use the “significant terms” feature…
• Which description/overview words are most strongly linked to each tag?
travel: [travel, travelers, trip, trips, hotels, flights, traveler, travellers, airline, hotel]
education: [students, teachers, learning, education, student, educational]
music: [music, artists, musicians, songs, labels, playlists, bands, song, artist, fans]
realestate: [estate, real, agents, property, listings]
search engine optimization: [seo, optimization, engine, ppc, marketing, search, click, pay]
jobs: [job, jobs, employers, career]
onlinemarketing: [marketing, seo, agency, optimization]
projectmanagement: [project, projects, task, collaboration, teams, management]
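A request of roughly this shape asks Elasticsearch which words are unusually common in documents carrying one tag. It is a minimal sketch: the field names `tag_list` and `overview` are assumptions about how the company records are indexed, and the helper name is made up for illustration.

```python
import json


def significant_terms_query(tag, tag_field="tag_list", text_field="overview", size=10):
    """Build a search body: select documents carrying `tag`, then ask which
    words in `text_field` stand out in that subset (assumed field names)."""
    return {
        "size": 0,  # only the aggregation is wanted, not the matching documents
        "query": {"term": {tag_field: tag}},
        "aggs": {
            "linked_words": {
                "significant_terms": {"field": text_field, "size": size}
            }
        },
    }


body = significant_terms_query("music")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the body of a `_search` request against the index holding the companies.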
![Page 14: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/14.jpg)
Now let’s test these queries
• Which companies have no tag but are most likely to need tagging with “music”…
uPlaya
Description uPlaya provides independent or unsigned musicians with immediate feedback on their music….
Category games_video
Tags -
Webceleb
Description Webceleb is music marketplace and community where musicians and fans engage and profit from discovering, purchasing and downloading the latest independent music.….
Category games_video
Tags -
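Finding untagged candidates like these could be expressed as a query that excludes documents with any tag and ranks the rest by the words linked to “music”. This is a sketch only: the field names are assumptions, and the `bool`/`exists` form shown is the query DSL of later Elasticsearch releases rather than necessarily what was used for the talk.

```python
def untagged_candidates_query(terms, tag_field="tag_list", text_field="overview", size=10):
    """Build a search body for documents that have no tags but whose text
    matches words associated with a tag (field names are assumptions)."""
    return {
        "size": size,
        "query": {
            "bool": {
                # exclude anything that already has a value in the tag field
                "must_not": [{"exists": {"field": tag_field}}],
                # a match query scores documents higher the more words they share
                "must": [{"match": {text_field: " ".join(terms)}}],
            }
        },
    }


music_terms = ["music", "artists", "musicians", "songs", "labels", "playlists", "bands"]
query = untagged_candidates_query(music_terms)
```

The top hits are then candidates for the “music” tag, to be accepted automatically or checked by a curator.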
![Page 15: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/15.jpg)
But what if we have
NO TAGS?
![Page 16: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/16.jpg)
A process to extract tags from text…
• Index data
• Assign resources (e.g. Amazon spot instance for large dataset)
• List word counts with the least frequent first
• Exclude lowest counts
• Aggregate the significant terms for each word
• Filter words that have a lot of high scoring significant terms
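The counting, aggregating and filtering steps can be tried in plain Python on a handful of texts. This is an illustrative stand-in, not the Elasticsearch implementation: an in-memory list replaces the index, the score is a simple "difference in share times ratio of shares", and the thresholds `min_docs`, `min_score` and `min_terms` as well as the toy texts are assumptions.

```python
from collections import Counter


def document_frequencies(docs):
    """Count how many documents each word appears in."""
    counts = Counter()
    for words in docs:
        counts.update(set(words))
    return counts


def significant_terms(word, docs, df, min_score=0.5):
    """Words that are much more common alongside `word` than overall."""
    subset = [words for words in docs if word in words]
    scored = {}
    for term, count in document_frequencies(subset).items():
        fg = count / len(subset)       # share of the subset containing term
        bg = df[term] / len(docs)      # share of all documents containing term
        score = (fg - bg) * (fg / bg)  # simple stand-in scoring formula
        if term != word and score >= min_score:
            scored[term] = score
    return sorted(scored, key=scored.get, reverse=True)


def extract_tags(texts, min_docs=2, min_terms=2):
    """Return {candidate tag: [significant terms]} for a list of texts."""
    docs = [set(text.lower().split()) for text in texts]
    df = document_frequencies(docs)
    tags = {}
    # least frequent words first, skipping those below the count floor
    for word, count in sorted(df.items(), key=lambda item: item[1]):
        if count < min_docs:
            continue
        terms = significant_terms(word, docs, df)
        if len(terms) >= min_terms:  # keep words with several strong terms
            tags[word] = terms
    return tags


texts = [
    "indie music labels and artists",
    "music artists and fans",
    "indie artists and labels",
    "flights and hotels and travel",
    "travel flights and trips",
    "hotels and travel trips",
]
tags = extract_tags(texts)
for tag, terms in tags.items():
    print(tag, terms)
```

Words that appear everywhere, such as "and", never score because their share in any subset equals their share overall, so no stop-word list is needed in this sketch.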
![Page 17: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/17.jpg)
What does this give us?
athletes: [athletes, coaches, athlete, coach, sports, fans]
avatars: [avatars, avatar, multiplayer, virtual, casual, 3d, games, chat, create, game]
clouds: [clouds, cloud, hybrid, computing, private, deploy, public, infrastructure]
dashboards: [dashboards, bi, reports, analytics, reporting, self, analysis, intelligence, features]
dial: [dial, calling, calls, voip, number, call, voice, phone]
exercise: [exercise, sleep, nutrition, fitness, weight, healthy, health]
indie: [indie, labels, artists, music]
logos: [logos, branding, flash, design]
pci: [pci, dss, hipaa, compliance, sensitive, compliant]
portland: [portland, oregon, inc, founded]
ringtones: [ringtones, ringtone, personalization, games]
traders: [traders, forex, trader, trading, quotes, stock, trade]
yellow: [yellow, pages, directory, local]
abc: [abc, cnn, nbc, television]
argentina: [argentina, buenos, aires, chile, uruguay, colombia, brazil, mexico, latin]
aviation: [aviation, aircraft, aerospace, defense, transportation]
airline: [airline, fares, airlines, flights, flight, travel, tickets, hotel, air]
![Page 18: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/18.jpg)
What else can we do with this?
• Filter words that have a lot of high scoring significant terms
• De-duplicate where large overlaps exist
• Assign levels of tags in order of frequency
• Use to categorise new data on the fly using percolate
• Curate manually
• Generate a sidebar menu
• Use the Elasticsearch phrase suggester to create phrase tags
github.com/strets123/web-sig-2014/
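De-duplicating where large overlaps exist could be done by comparing the term lists of candidate tags. The sketch below uses Jaccard overlap with an assumed threshold of 0.5; the "fares" entry is a hypothetical near-duplicate added for illustration, while the other two lists come from the earlier slide.

```python
def jaccard(a, b):
    """Overlap between two term lists: shared terms / all terms."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def deduplicate(tags, threshold=0.5):
    """Keep a tag only if its terms do not overlap heavily with a tag
    that has already been kept (earlier entries win)."""
    kept = {}
    for tag, terms in tags.items():
        if all(jaccard(terms, other) < threshold for other in kept.values()):
            kept[tag] = terms
    return kept


tags = {
    "airline": ["airline", "fares", "airlines", "flights", "flight",
                "travel", "tickets", "hotel", "air"],
    # hypothetical near-duplicate of "airline", for illustration only
    "fares": ["fares", "airline", "airlines", "flights", "flight", "tickets"],
    "aviation": ["aviation", "aircraft", "aerospace", "defense", "transportation"],
}
unique = deduplicate(tags)
print(list(unique))
```

Because earlier entries win, ordering the input by frequency decides which of two overlapping tags survives.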
![Page 19: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/19.jpg)
Advantages over direct curation / supervised learning:
• Simplicity and pragmatism
• Applicable to novel domains
– e.g. Chemical Biology
• Auto generated tags choose more appropriate word combinations than manual curators
• No need for complex data formats like RDF
• Data from many sources can be mixed
– e.g. categories from other universities’ sites…
![Page 20: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/20.jpg)
Where might this technology lead?
• How about a tag-based file system?
• How about an implicit social network?
• Elasticsearch is really easy to scale…
• Which websites, filesystems and datasets do you need to categorise?
– Do you really need RDF ontologies, curators etc. or can you just do something simple?
![Page 21: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/21.jpg)
Summary
• We now have many options to categorise and tidy up messy data
• Managing variations on schemas takes a lot of resources – leave it to the data owners if you can!
• When it comes to tagging…
– Perfection is in the eye of the beholder
– Sustainability is really important
![Page 22: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/22.jpg)
Thanks
• Thanks to the Research informatics team at the NDM Structural Genomics Consortium
– Paul Barrett
– Karen Porter
– Michael O’Hagan
– Brian Marsden
– David Damerell
– Sefa Garsot
– Anthony Bradley
• Thanks to the InfoDev team at IT services for answering my endless questions about webauth
• Funders:
– John Fell Fund
– NDM Strategic
– Wellcome Trust
– Higher Education Funding Council
• To everyone here for listening
![Page 23: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/23.jpg)
Any Questions?
• Andrew Stretton
github.com/strets123
@strets123
linkedin (google me)
• ChemBio Hub
http://chembiohub.ox.ac.uk
@oxchembiohub
github.com/thesgc
Simple example categorisation code in Python is available here:
github.com/strets123/web-sig-2014/
![Page 24: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/24.jpg)
Appendix of other messy data techniques
![Page 25: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/25.jpg)
How do we make it easy to add spreadsheet data to a system?
![Page 26: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/26.jpg)
Working with flat files
• Sometimes a flat file is the right schema for a dataset
– User defined formats
– Different types of research
– Only some of the fields are relevant when comparing experiments
– Data is not in memory unless needed
• Pandas and HDF allow SQL-like queries on flat files
![Page 27: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/27.jpg)
Helpful data management
• Data Wrangler
– https://player.vimeo.com/video/19185801
• Raw
– http://raw.densitydesign.org
• Take these as inspiration for our tool for re-shaping biochemistry data
![Page 28: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/28.jpg)
Simplifying web crawling
• Modern web crawling patterns use class selectors instead of XPath
– Less likelihood of change
• Content can be crawled using a backend web browser
– Dynamic JavaScript elements are included
• Using a website’s data for classification is more acceptable than wholesale reproduction
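To show why class selectors tend to survive page redesigns, the sketch below pulls text out of any element carrying a given class, wherever it sits in the page. It uses only Python's standard `html.parser` as a stand-in for a proper selector library; the class name `ClassTextExtractor` and the sample markup are made up for illustration, and rendering dynamic content in a backend browser is not covered.

```python
from html.parser import HTMLParser

# tags that never have a closing tag, so they must not affect nesting depth
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1                 # nested tag inside a match
        elif self.css_class in classes:
            self.depth = 1                  # entering a matching element
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


page = """
<div class="profile">
  <h2 class="name">Example Lab</h2>
  <p class="expertise highlight">Protein <b>crystallography</b></p>
  <p class="contact">Not extracted</p>
</div>
"""
parser = ClassTextExtractor("expertise")
parser.feed(page)
print(parser.results)
```

An XPath such as a positional path to the second child of the first `div` would break if the heading were moved, whereas the class-based match does not depend on position.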
![Page 29: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/29.jpg)
Managing multiple JSON schemas with views
Couchbase
PostgreSQL – also supported by Rails/ActiveRecord
![Page 30: Ordering the chaos: Creating websites with imperfect data](https://reader036.vdocuments.us/reader036/viewer/2022081404/55a05e0a1a28ab3c2e8b45c9/html5/thumbnails/30.jpg)
Why views over JSON can be useful
• Expose only required fields from e.g. RDF
• Input format may change but we don’t want crawler to break
• Required fields may change
• Versions are easy to support if format normalisation is in the database layer
• Storage is cheap
• View code is executed only once
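The idea of normalising formats in a view can be illustrated in Python, although in Couchbase the view would be a JavaScript map function and in PostgreSQL an SQL view over a JSON column. Both input formats and all field names below are hypothetical.

```python
def normalise(doc):
    """Map a raw document in either known format to the fields the site
    needs; return None for documents the view does not recognise."""
    if "tag_list" in doc:   # hypothetical older format: comma-separated tags
        tags = [t.strip() for t in doc["tag_list"].split(",") if t.strip()]
        return {"name": doc["name"], "text": doc.get("overview", ""), "tags": tags}
    if "tags" in doc:       # hypothetical newer format: list of tags
        return {"name": doc["title"], "text": doc.get("description", ""),
                "tags": list(doc["tags"])}
    return None


raw_docs = [
    {"name": "Company A", "overview": "Music marketplace", "tag_list": "music, indie"},
    {"title": "Company B", "description": "Flight search", "tags": ["travel"],
     "internal_id": 42},    # extra fields are simply not exposed
    {"unexpected": "shape"},  # unknown format is skipped, nothing breaks
]

view = [row for row in map(normalise, raw_docs) if row is not None]
for row in view:
    print(row)
```

The raw documents stay stored as received, so supporting a new input version means adding a branch to the view rather than changing the crawler or rewriting stored data.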