Today
• Finish section on Linked Data
• Begin data cleaning and pre-processing topic
Graphs: Social networks
https://www.flickr.com/photos/marc_smith/5592302165
Protein-Protein Interactions
http://www.nature.com/nrg/journal/v5/n2/fig_tab/nrg1272_F2.html
The Internet Graph (https://en.wikipedia.org/wiki/Opte_Project)
Linked Data
• We need to connect data together: form links.
  – A key part of the Semantic Web
  – Also important for the Internet of Things
    • (26 billion things by 2020, each continuously producing data)
• Principles of links from Tim Berners-Lee
  1. All kinds of conceptual things, they have names now that start with HTTP.
  2. If I take one of these HTTP names and I look it up, I will get back some data in a standard format which is kind of useful data that somebody might like to know about that thing, about that event.
  3. When I get back that information it's not just got somebody's height and weight and when they were born, it's got relationships. And when it has relationships, whenever it expresses a relationship then the other thing that it's related to is given one of those names that starts with HTTP.
Linked Data Examples
• DBPedia
  – ~5 million “things” from Wikipedia
  – Can be linked to external datasets such as CIA World Factbook, US Census Data
  – “Give me all cities in New Jersey with more than 10,000 people”
• Freebase
• FOAF (friend of a friend)
• Google Knowledge Graph
  • https://www.google.com/intl/bn/insidesearch/features/search/knowledge.html
Standards for Linked Data
• Widely used standards (W3C Recommendations)
  – JSON-LD (JSON Linked Data)
  – RDF (Resource Description Framework)
JSON-LD (example from json-ld.org)
• Provides mechanisms for specifying unambiguous meaning in JSON data
• Provides extra keys that start with the “@” sign
  – “@context” (used to define meanings of terms, map to identifiers)
  – “@type”
  – “@id”
• Use cases
  – Google Knowledge Graph
JSON-LD Example (from https://en.wikipedia.org/wiki/JSON-LD)
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": {
      "@id": "http://xmlns.com/foaf/0.1/workplaceHomepage",
      "@type": "@id"
    },
    "Person": "http://xmlns.com/foaf/0.1/Person"
  },
  "@id": "http://me.example.com",
  "@type": "Person",
  "name": "John Smith",
  "homepage": "http://www.example.com/"
}
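To see what “@context” buys us, the following is a minimal Python sketch that replaces each short key with the identifier the context maps it to. The helper names `term_iri` and `expand_terms` are made up for illustration, and this is a deliberately simplified stand-in, not the full JSON-LD expansion algorithm from the W3C specification.

```python
# The JSON-LD document from the example, written as a Python dict.
doc = {
    "@context": {
        "name": "http://xmlns.com/foaf/0.1/name",
        "homepage": {
            "@id": "http://xmlns.com/foaf/0.1/workplaceHomepage",
            "@type": "@id",
        },
        "Person": "http://xmlns.com/foaf/0.1/Person",
    },
    "@id": "http://me.example.com",
    "@type": "Person",
    "name": "John Smith",
    "homepage": "http://www.example.com/",
}


def term_iri(context, term):
    """Look up the identifier a term maps to (hypothetical helper)."""
    definition = context.get(term, term)
    return definition["@id"] if isinstance(definition, dict) else definition


def expand_terms(document):
    """Simplified sketch: swap short keys for the IRIs given in @context."""
    context = document.get("@context", {})
    expanded = {}
    for key, value in document.items():
        if key == "@context":
            continue                      # the context itself is not data
        if key == "@type":
            expanded[key] = term_iri(context, value)
        elif key.startswith("@"):
            expanded[key] = value         # other keywords such as @id pass through
        else:
            definition = context.get(key)
            # "@type": "@id" in the context means the value is itself a link
            if isinstance(definition, dict) and definition.get("@type") == "@id":
                value = {"@id": value}
            expanded[term_iri(context, key)] = value
    return expanded


expanded = expand_terms(doc)
for key, value in expanded.items():
    print(key, "->", value)
```

After expansion every key is either a keyword or a full HTTP name, so two documents that use different short names for the same property can still be merged.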
Graphs – RDF (Resource Description Framework) [materials from w3.org]
Serialisation of RDF Example Graph
This graph can be serialised as XML (don’t worry about syntax!)
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#">
  <contact:Person rdf:about="http://www.w3.org/People/EM/contact#me">
    <contact:fullName>Eric Miller</contact:fullName>
    <contact:mailbox rdf:resource="mailto:[email protected]"/>
    <contact:personalTitle>Dr.</contact:personalTitle>
  </contact:Person>
</rdf:RDF>
RDF – Triple Store
• An alternative format for storing RDF-type data: a triple store (one subject, predicate, object statement per line)

<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#fullName> "Eric Miller" .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#mailbox> <mailto:[email protected]> .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#personalTitle> "Dr." .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/10/swap/pim/contact#Person> .
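Because each triple is just a (subject, predicate, object) statement, a few lines of Python are enough to load them and look things up. The sketch below is an illustration only: the regular expression handles just the simple shapes shown here (IRIs in angle brackets and plain quoted literals), not the full N-Triples grammar, and the mailbox triple is left out because its address is obscured in these notes. For real work a dedicated RDF library would be the safer choice.

```python
import re
from collections import defaultdict

TRIPLES_TEXT = """
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#fullName> "Eric Miller" .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#personalTitle> "Dr." .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/10/swap/pim/contact#Person> .
"""

# <subject> <predicate> then either <object IRI> or "literal", then a final dot.
TRIPLE_PATTERN = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)")\s*\.$')


def parse_triples(text):
    """Return a list of (subject, predicate, object) tuples."""
    triples = []
    for line in text.strip().splitlines():
        match = TRIPLE_PATTERN.match(line.strip())
        if match:
            subject, predicate, iri_object, literal_object = match.groups()
            obj = iri_object if iri_object is not None else literal_object
            triples.append((subject, predicate, obj))
    return triples


triples = parse_triples(TRIPLES_TEXT)

# Index the graph by subject so that all facts about one thing sit together.
facts = defaultdict(dict)
for subject, predicate, obj in triples:
    facts[subject][predicate] = obj

ME = "http://www.w3.org/People/EM/contact#me"
CONTACT = "http://www.w3.org/2000/10/swap/pim/contact#"
print(facts[ME][CONTACT + "fullName"])
```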
Freebase
• A large database that connects entities (facts, people, places, organizations …) together as a graph
  – www.freebase.com
  – Freebase is the basis of the Google Knowledge Graph that is used to improve search.
    • https://developers.google.com/knowledge-graph/
• Retrieving data from the Google Knowledge Graph
  – Example adapted from http://www.nolan-nichols.com/knowledge-graph-via-sparql.html
Other formats for Graphs: Matrix Representation
[Figure: directed graph with nodes A, B, C and D]

     A  B  C  D
A    0  0  1  0
B    0  0  0  0
C    0  1  0  0
D    0  1  0  0

The entry in row X, column Y is ‘1’ iff there is an edge from node X to node Y.

Or use a relational table:

Source  Destination
A       C
C       B
D       B
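The two representations carry the same information, which a short Python sketch can make concrete. It builds the matrix from the source/destination table and then reads the table back out of the matrix. The variable names are chosen for illustration only.

```python
nodes = ["A", "B", "C", "D"]
edges = [("A", "C"), ("C", "B"), ("D", "B")]   # rows of the relational table

# Map each node name to its row/column position in the matrix.
position = {name: i for i, name in enumerate(nodes)}

# Start from an all-zero matrix and set a 1 for every (source, destination) pair.
matrix = [[0] * len(nodes) for _ in nodes]
for source, destination in edges:
    matrix[position[source]][position[destination]] = 1

# Going the other way: every 1 in the matrix is one row of the table.
recovered_edges = [
    (nodes[row], nodes[col])
    for row, values in enumerate(matrix)
    for col, value in enumerate(values)
    if value == 1
]

for name, row in zip(nodes, matrix):
    print(name, row)
```

The matrix needs one cell per pair of nodes whether or not an edge exists, while the table needs one row per edge, so the table tends to be the more compact choice for sparse graphs.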
What you should know about data formats
• Why do we have different data formats and why do we wish to transform between different formats?
• Motivation for using relational databases to manage information
• Difference between a (standard) relational database and a NoSQL database
• What is a CSV, what is a spreadsheet, what is the difference?
• Be able to write regular expressions in Python format (operators .^$*+|[])
• Difference between HTML and XML and when to use each
• Motivation behind using XML and XML namespaces
• Be able to read and write data in XML (elements, attributes, namespaces)
• Be able to read and write data in JSON
• Difference between XML and JSON. Applications where each can be used.
• The purpose of using schemas for XML and JSON data.
• The motivation behind Linked Data and the purpose of using JSON-LD or RDF to represent it.
Further reading
• Further reading
  – Relational databases
    • Pages 403-409 of http://i.stanford.edu/~ullman/focs/ch08.pdf
  – XML
    • http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SG.html
  – JSON and JSON-LD
    • http://json.org
    • http://crypt.codemancers.com/posts/2014-02-11-An-introduction-to-json-schema/
    • https://cloudant.com/blog/webizing-your-database-with-linked-data-in-json-ld/#.Vtp_UMfB_Gw
  – RDF
    • https://www.w3.org/DesignIssues/LinkedData.html
    • http://www-sop.inria.fr/acacia/cours/essi2006/Scientific%20American_%20Feature%20Article_%20The%20Semantic%20Web_%20May%202001.pdf
    • http://www.dlib.org/dlib/may98/miller/05miller.html
COMP20008 Elements of Data Processing Data Pre-Processing and Cleaning
Why is pre-processing needed?
Name Age Date of Birth
“Henry” 20.2 20 years ago
Katherine Forty-one 20/11/66
Michelle 37 5/20/79
Oscar@!! “5” 13th Feb. 2011
- 42 -
Mike___Moore 669 -
巴拉克奥巴马 52 1961年8月4日
Why is pre-processing needed?
• Measuring data quality
  – Accuracy
    • Correct or wrong, accurate or not
  – Completeness
    • Not recorded, unavailable
  – Consistency
    • E.g. discrepancies in representation
  – Timeliness
    • Updated in a timely way
  – Believability
    • Do I trust the data is correct?
  – Interpretability
    • How easily can I understand the data?
Major data preprocessing activities
Data mining concepts and techniques, Han et al 2012
Terminology
Height  Weight  Age  Gender
1.8     80      22   Male
1.53    82      23   Male
1.6     62      18   Female

• The 4 columns (height, weight, age, gender) are features or attributes
• The data items (3 rows) are called instances or objects
• Height, Weight and Age are continuous features
• Gender is a categorical or discrete feature
Data integration
• Bringing data from multiple sources together
  – Resolve conflicts
  – Detect duplicates
• Will cover in depth in weeks 8 and 9

[Figure: two data sources combined into one integrated data source]
Data reduction
• Decrease the number of features (columns) or instances (rows)
  – Sampling strategies
  – Remove irrelevant features and reduce noise
  – Easier to visualise, faster to analyse
• Will cover during section on visualisation (weeks 5 and 6), and feature analysis (weeks 9 and 10)
http://bigdataexaminer.com/data-science/understanding-dimensionality-reduction-principal-component-analysis-and-singular-value-decomposition/
Data cleaning
• Incomplete (missing data)
• Noisy data
• Inconsistent data
• Intentionally disguised data
Data cleaning – The Process
• Many tools exist (Google Refine, Kettle, Talend, …)
  – Data scrubbing
  – Data discrepancy detection
  – Data auditing
  – ETL (Extract Transform Load) tools: users specify transformations via a graphical interface
• Our emphasis will be to understand some of the methods employed by some of these tools
Missing or incomplete data
• Lacking feature values
  • Name=“”
  • Age=null
• Types of missing data (Rubin 1976)
  – Missing completely at random: data are missing independently of observed and unobserved data.
    • E.g. coin flipping to decide whether or not to answer an exam question.
  – Missing not completely at random
    • I create a dataset by surveying the class about how healthy they feel. What is the meaning of missing values for those who don’t respond?
    • I set an exam and ask a question in hard-to-understand language. What is the meaning of missing values for those who don’t answer the question?
Example: USA Salary survey data
• Is Person B’s salary missing at random?
• Very difficult to determine reasons for missingness.
  – In practice, report assumptions about missingness.

Name      Salary
Person C  $59k
Person D  $63k
Person H  $99k
Person E  $102k
Person G  $140k
Person F  $150k
Person A  $180k
Person B  -
Causes of missing data
• Why does it occur?
  – Malfunction of equipment (e.g. sensors)
  – Not recorded due to misunderstanding
  – May not be considered important at time of entry
  – Deliberate
• How to handle it?
  – We will look at a number of strategies
Extreme Missing data
• Movie Recommender systems

Person  Star Wars  Batman  Jurassic World  The Martian  The Revenant  Lego Movie  Selma  …
James   3          2       -               -            -             1           -
John    -          -       1               2            -             -           -
Jill    1          -       -               3            2             1           -

• Users and movies
• Each user only rates a few movies (say 1%)
• Netflix wants to predict the missing ratings for each user
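When almost every cell is missing, it can be more natural to store only the ratings that exist. The Python sketch below holds the three users from the table as nested dictionaries and measures how much of the full user-by-movie grid is empty. The dictionary layout is one possible choice made for illustration, not a prescribed format.

```python
movies = [
    "Star Wars", "Batman", "Jurassic World", "The Martian",
    "The Revenant", "Lego Movie", "Selma",
]

# Only observed ratings are stored; a missing rating is simply an absent key.
ratings = {
    "James": {"Star Wars": 3, "Batman": 2, "Lego Movie": 1},
    "John": {"Jurassic World": 1, "The Martian": 2},
    "Jill": {"Star Wars": 1, "The Martian": 3, "The Revenant": 2, "Lego Movie": 1},
}

total_cells = len(ratings) * len(movies)
observed = sum(len(user_ratings) for user_ratings in ratings.values())
missing = total_cells - observed

print(f"{missing} of {total_cells} ratings are missing "
      f"({missing / total_cells:.0%})")

# The (user, movie) pairs a recommender would need to predict.
to_predict = [
    (user, movie)
    for user, user_ratings in ratings.items()
    for movie in movies
    if movie not in user_ratings
]
```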
Noisy data
• Truncated fields (exceeded 80 character limit)
• Text incorrectly split across cells (e.g. separator issues)
• Salary=“-5”
• Some causes
  – Imprecise instruments
  – Data entry issues
  – Data transmission issues
Inconsistent data
• Different naming representations (“Melbourne University” versus “University of Melbourne”) or (“three” versus “3”)
• Different date formats (“3/4/2016” versus “3rd April 2016”)
• Age=20, Birthdate=“1/1/2002”
• Two students with the same student id
• Outliers
  – E.g. 62,72,75,75,78,80,82,84,86,87,87,89,89,90,999
    • No good if it is a list of ages of hospital patients
    • Might be ok, though, for a list of people’s number of contacts on LinkedIn
  – Can use automated techniques, but also need domain knowledge
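As one example of an automated technique, the sketch below flags values that lie more than 1.5 interquartile ranges outside the quartiles. This rule is a common convention chosen here for illustration; the lecture does not prescribe a particular method, and whether a flagged value is really an error still depends on what the numbers mean.

```python
import statistics

values = [62, 72, 75, 75, 78, 80, 82, 84, 86, 87, 87, 89, 89, 90, 999]

# Quartiles of the data (statistics.quantiles needs Python 3.8 or later).
q1, _median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]
print("Flagged for review:", outliers)
```

Flagging is only the automated half. Deciding whether 999 is a typo (ages) or a plausible value (LinkedIn contacts) is where domain knowledge comes in.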
Disguised data
• Everyone’s birthday is January 1st?
• Email address is [email protected]
• Adriaans and Zantinge
  – “Recently, a colleague rented a car in the USA. Since he was Dutch, his post-code did not fit the fields of the computer program. The car hire representative suggested that she use the zip code of the rental office instead.”
• How to handle
  – Look for “unusual” or suspicious values in the dataset, using knowledge about the domain
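One simple way to look for suspicious values is to count how often each value occurs and inspect anything that is far more common than expected. The Python sketch below does this for birthdays. The sample dates and the 50% threshold are invented for illustration and would need tuning against real data and domain knowledge.

```python
from collections import Counter

# Hypothetical birthdates in ISO format (YYYY-MM-DD), invented for this sketch.
birthdates = [
    "1990-01-01", "1985-01-01", "1979-05-20",
    "2001-01-01", "1966-11-20", "1994-01-01",
]

# Ignore the year: a default value shows up as one day shared by many people.
day_counts = Counter(date[5:] for date in birthdates)

THRESHOLD = 0.5  # assumed cut-off: flag a day shared by over half the records
suspicious = [
    day for day, count in day_counts.items()
    if count / len(birthdates) > THRESHOLD
]
print("Possible disguised missing values:", suspicious)
```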
Dealing with missing data
• What are the consequences of missing data?
  – May break application programs not expecting it
  – Less power for later analysis
  – May bias later analysis
• So, how to handle it?
Strategy 1: Delete all instances with a missing value
• Sometimes called case deletion
• Effects
  – Easy to analyse the new (complete) data
  – May produce bias in the analysis if the new sample size is small or structure exists in the missing data.
Case deletion
Original data:

Person  Star Wars  Batman  Jurassic World  The Martian  The Revenant  Lego Movie  Selma
Mandy   1          2       1               3            3             2           3
James   3          2       -               -            -             1           -
John    -          -       1               2            -             -           -
Jill    1          -       -               3            2             1           -

After case deletion:

Person  Star Wars  Batman  Jurassic World  The Martian  The Revenant  Lego Movie  Selma
Mandy   1          2       1               3            3             2           3
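Case deletion is a one-line filter once missing values have an explicit marker. The sketch below uses Python’s `None` for a missing rating (an assumption of this example) and keeps only the instances with no missing values.

```python
# Ratings per person, in movie order: Star Wars, Batman, Jurassic World,
# The Martian, The Revenant, Lego Movie, Selma. None marks a missing value.
ratings = {
    "Mandy": [1, 2, 1, 3, 3, 2, 3],
    "James": [3, 2, None, None, None, 1, None],
    "John": [None, None, 1, 2, None, None, None],
    "Jill": [1, None, None, 3, 2, 1, None],
}

# Keep an instance only if every one of its values is present.
complete_cases = {
    person: row
    for person, row in ratings.items()
    if all(value is not None for value in row)
}

print(f"Kept {len(complete_cases)} of {len(ratings)} instances:",
      list(complete_cases))
```

Three of the four instances are discarded, which illustrates how quickly case deletion can shrink a sparse dataset.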
Strategy 2: Manually correct
• A human eyeballs the missing value and fills it in using their expert knowledge
https://en.wikipedia.org/wiki/Eye
Strategy 3: Imputation
• Impute a value (replace the missing value with a substitute one)
• After imputing all missing values, can use standard analysis techniques for complete datasets

With missing values:

Person  Star Wars  Batman  Jurassic World  The Martian  The Revenant  Lego Movie  Selma  …
James   3          2       -               -            -             1           -
John    -          -       1               2            -             -           -
Jill    1          -       -               3            2             1           -

After imputation:

Person  Star Wars  Batman  Jurassic World  The Martian  The Revenant  Lego Movie  Selma  …
James   3          2       2               2            1             1           1
John    3          2       1               2            2             1           1
Jill    1          1       1               3            2             1           1
Imputation: Fill in with zeros (or similar)
Person  Star Wars  Batman  Jurassic World  The Martian  The Revenant  Lego Movie  Selma  …
James   3          2       0               0            0             1           0
John    0          0       1               2            0             0           0
Jill    1          0       0               3            2             1           0

• Simple
• Won’t break application programs
• Limited utility for analysis
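A minimal Python sketch of constant-value imputation, again assuming `None` marks a missing rating. The fill value is a parameter because “or similar” could be any agreed placeholder.

```python
ratings = {
    "James": [3, 2, None, None, None, 1, None],
    "John": [None, None, 1, 2, None, None, None],
    "Jill": [1, None, None, 3, 2, 1, None],
}


def fill_constant(row, fill_value=0):
    """Replace every missing entry in a row with the same constant."""
    return [fill_value if value is None else value for value in row]


filled = {person: fill_constant(row) for person, row in ratings.items()}

for person, row in filled.items():
    print(person, row)
```

Note that a filled-in 0 is indistinguishable from a genuine rating of 0 afterwards, which is one reason this approach has limited utility for analysis.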
Imputation: Fill in with mean value
• Popular method
  – Can be good for supervised classification
  – Apply separately to each attribute

Name    Age
Daisy   10
Maisy   15
Harry   2
Jackie  -

Jackie’s age is imputed to be (10+15+2)/3 = 9
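The same calculation as a short Python sketch: compute the mean over the observed ages only, then use it wherever the age is missing (`None` is this example’s marker for a missing value).

```python
from statistics import mean

ages = {"Daisy": 10, "Maisy": 15, "Harry": 2, "Jackie": None}

# The mean is taken over observed values only.
observed_ages = [age for age in ages.values() if age is not None]
mean_age = mean(observed_ages)

imputed_ages = {
    name: mean_age if age is None else age
    for name, age in ages.items()
}

print(imputed_ages)
```

With several attributes, the same steps would be repeated for each column independently.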
Imputation: Fill in with mean value cont
• Drawbacks
  – Reduces the variance of the feature
  – Gives an incorrect view of the distribution of that attribute
  – Relationships to other features change
• Can also use median instead of mean (if distribution is skewed)
• Use mode (most frequent value) imputation for categorical features
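The variance drawback can be checked numerically. The sketch below reuses the small ages example (three observed ages with mean 9), compares the sample variance before and after inserting the mean, and shows the median and mode alternatives. The gender list is included only to illustrate mode imputation.

```python
from statistics import mean, median, mode, variance

observed_ages = [10, 15, 2]

# Mean imputation adds a value exactly at the centre of the data, so the
# spread around the mean shrinks as more values are imputed.
after_mean_imputation = observed_ages + [mean(observed_ages)]

variance_before = variance(observed_ages)
variance_after = variance(after_mean_imputation)
print(f"Sample variance before: {variance_before:.2f}, after: {variance_after:.2f}")

# Median is less sensitive to skew than the mean.
median_age = median(observed_ages)

# Mode (most frequent value) works for categorical features.
observed_genders = ["Female", "Female", "Male"]
most_common_gender = mode(observed_genders)
```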
Fill in with category mean
• Take categories/clusters and compute the mean within each

Name    Age  Gender
Daisy   10   Female
Maisy   15   Female
Harry   2    Male
Jackie  -    Female

Jackie’s age is imputed to be (10+15)/2 = 12.5 (considering the category “Female”)
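A Python sketch of category-mean imputation: group the observed ages by gender, average within each group, and fill each missing age from its own group. The record layout (a list of dictionaries) is an assumption made for this example.

```python
from collections import defaultdict
from statistics import mean

people = [
    {"name": "Daisy", "age": 10, "gender": "Female"},
    {"name": "Maisy", "age": 15, "gender": "Female"},
    {"name": "Harry", "age": 2, "gender": "Male"},
    {"name": "Jackie", "age": None, "gender": "Female"},
]

# Collect the observed ages for each category.
ages_by_gender = defaultdict(list)
for person in people:
    if person["age"] is not None:
        ages_by_gender[person["gender"]].append(person["age"])

category_mean = {gender: mean(ages) for gender, ages in ages_by_gender.items()}

# Fill each missing age with the mean of that person's category.
imputed = [
    dict(person, age=category_mean[person["gender"]])
    if person["age"] is None else person
    for person in people
]

print(imputed[-1])
```

This sketch would fail with a `KeyError` if a category had no observed ages at all, so a fallback (for example the overall mean) would be needed in that case.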
Time series: Last value carried forward
Day    Kilometres Walked
Day 1  8.9
Day 2  8.2
Day 3  9.6
Day 4  -
Day 5  11.6
Day 6  12.0

Kilometres walked on Day 4 = ?
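Under last value carried forward, a missing entry takes the most recent observed value, so Day 4 would be filled with Day 3’s 9.6. A minimal Python sketch, assuming `None` marks the missing day:

```python
kilometres = [8.9, 8.2, 9.6, None, 11.6, 12.0]   # Day 1 to Day 6


def carry_forward(series):
    """Fill each missing entry with the most recent observed value.

    Entries before the first observation stay missing, because there is
    nothing to carry forward yet.
    """
    filled = []
    last_seen = None
    for value in series:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


filled_kilometres = carry_forward(kilometres)
print("Day 4:", filled_kilometres[3])
```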
Acknowledgements
– Data Mining Concepts and Techniques. Han, Kamber and Pei. 3rd edition (chapter 3). Available through library as ebook.
– Data analysis using regression and multilevel hierarchical models. Gelman and Hill (chapter 25), 2006.
Next Week
• Second workshop is available on LMS
  – Practice with JSON and XML and Web scraping
• Project will be released
• Continue data pre-processing and cleaning
  – Look at more complex techniques for value imputation (e.g. for the movie recommender system example)