Data Management for Geoinformatics
A short course on good data management for taught postgraduate students in geoinformatics and related data sciences.
John Murtagh, UEL
Data Integration
Types of Data
• qualitative data
• quantitative data
• structured data
• unstructured data
• machine-readable data
Qualitative data
is everything that refers to the quality of something.
E.g. a description of the colours, texture and feel of an object, a description of experiences, and an interview are all qualitative data.
Qualitative data (2)
refers to forms of data collection and analysis which rely on understanding, with an emphasis on meanings rather than numerical form. It’s typically descriptive.
E.g. diary accounts, open-ended questionnaires, unstructured interviews and unstructured observations.
Quantitative data (1)
is data that refers to a number. E.g. the number of golf balls, the size, the price, a score on a test etc.
Quantitative data (2)
is usually regarded as referring to the collection and analysis of numerical data, which can be put into categories, placed in rank order, or measured in units of measurement.
Structured & Unstructured data
Structured data
If you want your computer to process and analyse your data, a computer has to be able to read and process the data. This means it needs to be structured and in a machine-readable form.
Unstructured data
Unstructured data has no fixed underlying structure. E.g. PDFs and scanned images may contain information which is pleasing to the human eye because it is laid out nicely, but they are not machine-readable.
Machine-readable data
is data (or metadata) which is in a format that can be understood by a computer.
2 Types
1. Human-readable data that is marked up so that it can also be read by machines. Examples: microformats, RDFa.
2. Data file formats intended principally for machines (RDF, XML, JSON).
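As a rough illustration of the second type, the sketch below reads the same record from JSON and from XML using only the Python standard library. The record itself (site name and coordinates) and the XML element names are invented for this example and are not part of the course material.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in two machine-readable formats.
as_json = '{"site": "Docklands", "lat": 51.5079, "lon": 0.0653}'
as_xml = '<site name="Docklands"><lat>51.5079</lat><lon>0.0653</lon></site>'

# JSON maps directly onto a dictionary with typed values.
from_json = json.loads(as_json)

# XML is parsed into a tree; values arrive as text and are converted explicitly.
root = ET.fromstring(as_xml)
from_xml = {
    "site": root.get("name"),
    "lat": float(root.findtext("lat")),
    "lon": float(root.findtext("lon")),
}

# Both formats yield the same structured record.
print(from_json == from_xml)
```

Neither format needs a person to interpret the layout, which is the sense in which both are machine-readable.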
Data Quality
Types of “bad data”
• Incorrect data
• Inaccurate data
• Business rule violations
• Inconsistent data
• Incomplete data
• Nonintegrated data
Incorrect data
• For data to be correct (valid), its values must adhere to its domain (valid values).
• For example, a month must be in the range of 1–12, or a person’s age must be less than 130.
Taken From: ADELMAN, S., ABAI, M., & MOSS, L. T. (2005). Data strategy [...] [...]. Upper Saddle River, NJ [u.a.], Addison-Wesley.
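The domain checks described above could be sketched as follows. The function names and sample rows are hypothetical, and the lower bound of 0 on age is an added assumption; the slide only gives the upper bound of 130.

```python
def is_valid_month(month):
    """A month is within its domain when it is an integer from 1 to 12."""
    return isinstance(month, int) and 1 <= month <= 12


def is_valid_age(age):
    """An age is within its domain when it is below 130 (and, by assumption, not negative)."""
    return isinstance(age, (int, float)) and 0 <= age < 130


# Invented sample rows: the second has an impossible month, the third an impossible age.
rows = [
    {"month": 7, "age": 42},
    {"month": 13, "age": 42},
    {"month": 2, "age": 150},
]

# Rows with at least one value outside its domain count as incorrect data.
incorrect = [
    row for row in rows
    if not (is_valid_month(row["month"]) and is_valid_age(row["age"]))
]
print(incorrect)
```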
Nonintegrated data
• Data that has been created separately & not with the intention of future integration.
• E.g. customer data can exist on 2 or more outsourced systems under different customer numbers with different spellings of the customer name & even different phone numbers or addresses. Integrating data from such systems is a challenge.
Inaccurate data
• A data value can be correct without being accurate. For example, the city “London” and the web country code for France, “.fr”, are both correct (valid) values, but when used together (such as London, France) the country is wrong, because the city of London is not in France; the accurate country code is “.uk”.
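One way to picture the difference is a cross-field check: each value is first tested against its own domain, then the combination is tested against a reference table. The sketch below assumes a tiny, hypothetical lookup table with one country code per city; real place names are often ambiguous, so this is a simplification.

```python
# Hypothetical reference data, invented for illustration.
VALID_CODES = {".uk", ".fr"}
CITY_COUNTRY = {"London": ".uk", "Paris": ".fr"}


def check(city, code):
    """Classify a (city, country code) pair as 'incorrect', 'inaccurate' or 'ok'."""
    # Incorrect: a value falls outside its own domain.
    if code not in VALID_CODES or city not in CITY_COUNTRY:
        return "incorrect"
    # Inaccurate: each value is valid, but the combination is wrong.
    if CITY_COUNTRY[city] != code:
        return "inaccurate"
    return "ok"


print(check("London", ".fr"))
print(check("London", ".uk"))
```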
Inconsistent data
• Uncontrolled data redundancy results in inconsistencies. Every organization is plagued with redundant and inconsistent data.
• For example, names or places: “Smith, David” might sit alongside “David Smith”, and “London, UK” alongside “London, England”.
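A simple way to spot variants such as these is to reduce each value to a key that ignores case, punctuation and word order, then group values that share a key. The sketch below is a simplified version of that idea, not any particular tool's algorithm; the `fingerprint` name and the sample names are illustrative.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Build a key that ignores case, punctuation and word order."""
    cleaned = value.strip().lower()
    # Remove ASCII punctuation such as the comma in "Smith, David".
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    # Sort the distinct words so that word order no longer matters.
    return " ".join(sorted(set(cleaned.split())))


names = ["Smith, David", "David Smith", "Jones, Ann"]

# Group the original spellings under their shared key.
groups = defaultdict(list)
for name in names:
    groups[fingerprint(name)].append(name)

print(dict(groups))
```

A key of this kind cannot tell that “London, UK” and “London, England” mean the same place; that sort of inconsistency needs a lookup table of equivalent terms.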
Incomplete data
A dataset might reliably include elements such as name, postal code, gender, age and NHS number, yet capture other elements only haphazardly, such as ailment, GP name, NHS capture area, or even an incomplete date of birth.
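A first step towards measuring incompleteness is to count empty values per field and to flag partial dates. The records, field names and the `YYYY-MM-DD` date format below are all invented for this sketch.

```python
from datetime import datetime

# Invented sample records; empty strings and None both count as missing.
records = [
    {"name": "Patient A", "postcode": "AB1 2CD", "gp_name": "Dr Shah", "date_of_birth": "1980-04-12"},
    {"name": "Patient B", "postcode": "AB3 4EF", "gp_name": "", "date_of_birth": "1975"},
    {"name": "Patient C", "postcode": "", "gp_name": None, "date_of_birth": "1990-11-03"},
]
FIELDS = ["name", "postcode", "gp_name", "date_of_birth"]


def missing_counts(rows, fields):
    """Count, for each field, how many rows have no value."""
    return {field: sum(1 for row in rows if not row.get(field)) for field in fields}


def is_full_date(text):
    """Return True only when the text is a complete YYYY-MM-DD date."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


counts = missing_counts(records, FIELDS)
partial_dates = [r["name"] for r in records if not is_full_date(r["date_of_birth"])]
print(counts, partial_dates)
```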
Data (cleaning) (cleansing) (scrubbing)
Tools
Open Knowledge Foundation: School of Data (a gentle introduction to cleaning data)
Section 1: “Nuts and chewing gum” looks at the way data is presented in spreadsheets and how it might cause errors.
Section 2: “The Invisible Man is in your spreadsheet” is concerned with the problems of white spaces and non-printable characters and how they affect our ability to use the data.
Section 3: “Your data is a witch’s brew” deals with consistency in data entry, and how to choose the right unit and format for data.
Section 4: “Did you bring the wrong suitcase (again)?” is about where to put data, and how to structure it.
Accompanying these sections is a step-by-step recipe for cleaning a dataset. This is an extensive, handbook-style resource which we refer to in each section. It takes a set of ‘dirty’ data and moves it through the different steps to make it ‘clean’.
See more at: http://schoolofdata.org/handbook/courses/data-cleaning/
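The white space and non-printable character problem covered in Section 2 can also be handled in code. The sketch below is one possible approach in Python, not the method used in the School of Data course, which works in spreadsheets; the `clean_cell` name and sample value are illustrative.

```python
import unicodedata


def clean_cell(text):
    """Remove invisible characters and normalise white space in one cell value."""
    # Keep white space for now, but drop other control (Cc) and format (Cf)
    # characters, such as NUL or the zero-width space.
    kept = "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # Collapse any run of white space (tabs, non-breaking spaces, newlines)
    # into a single ordinary space and trim both ends.
    return " ".join(kept.split())


# A value with leading spaces, a non-breaking space, a tab and a zero-width space.
dirty = "  London\u00a0\t UK\u200b "
print(repr(clean_cell(dirty)))
```

Cleaning cells this way before sorting, filtering or joining means that two values which look identical on screen also compare as equal.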
Cleaning up
Sort and Filter: The basics of spreadsheets
http://schoolofdata.org/handbook/courses/sort-and-filter/
http://schoolofdata.org/handbook/recipes/cleaning-data-with-spreadsheets/
OpenRefine
Possible uses of OpenRefine software:
• Cleaning messy data: for example, if you have a text file with some semi-structured data, you can edit it using transformations, facets and clustering to make the data cleanly structured.
• Transformation of data: converting values to other formats, normalizing and denormalizing.
• Parsing data from web sites: OpenRefine has a URL fetch feature and jsoup HTML parser and DOM engine.
• Adding data to a dataset by fetching it from web services (i.e. ones returning JSON). For example, this can be used for geocoding addresses to geographic coordinates.
• Working with Freebase:
  • Augmentation of datasets with data from Freebase.
  • Contributing data to Freebase using the Schema Alignment feature. This involves reconciliation: mapping string values in cells to entities in Freebase.
http://en.wikipedia.org/wiki/OpenRefine
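To give a feel for the geocoding use case outside OpenRefine, the sketch below adds coordinates to a row from a JSON response. No real service is called: the response body, its field names (`results`, `lat`, `lon`) and the address are invented, and any actual geocoding service defines its own response format.

```python
import json

# Invented response body standing in for what a geocoding web service might return.
sample_response = '{"results": [{"lat": 51.5079, "lon": 0.0653}]}'


def extract_coordinates(body):
    """Return (lat, lon) from the first result, or None when there are no results."""
    results = json.loads(body).get("results", [])
    if not results:
        return None
    first = results[0]
    return (first["lat"], first["lon"])


# Hypothetical dataset row that gains two new columns from the response.
row = {"address": "University Way, London"}
coordinates = extract_coordinates(sample_response)
if coordinates is not None:
    row["lat"], row["lon"] = coordinates

print(row)
```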
Tutorials
Tutorial: OpenRefine/LODRefine – A Power Tool for Cleaning Data
http://schoolofdata.org/category/howto/
Data manipulation
• SPSS
Data manipulation tutorial:
• MS Access
Data manipulation tutorial:
• R
Data manipulation tutorial: on the following page http://www.sr.bham.ac.uk/~ajrs/R/r-manipulate_data.html
• Gephi
The following page is about data manipulation within Gephi.
Primary & Secondary data
• In this video Professor Innes of the University of Edinburgh talks about the differences between using primary and secondary data.
Other sessions as part of Data Management in Geoinformatics:
• Data Collection
• Data Management
• Data Sharing
Data Management for Geoinformatics by John Murtagh, as part of the Jisc-funded project TraD (University of East London), is licensed under a Creative Commons Attribution Share Alike Licence.