Open Data and Data Formats
TCPD Summer School 2018 (c) Ashoka University
Data-based analysis: caveats
Quality: Your analysis is only as good as your data
Correlation vs. Causation
Prone to filter bubbles
Some questions are imprecise: qualitative and quantitative techniques need to co-exist
A lot of data is boring/hard to use (signal vs. noise)
https://www.nytimes.com/2014/04/07/opinion/eight-no-nine-problems-with-big-data.html
Data lifecycle
Graphic: Jeff Heer
Data organization
Databases
How is it stored?
Organization (encoding, structure, meaning…) or “Schema”
a.k.a. code book
How is it queried?
Query language, visualization, interfaces, etc.
Relational Databases
One way to store data: As a table of rows and columns
Pioneered by IBM (E.F. Codd in 1970)
Subsequently used by Oracle in a project for the CIA in the 1970s
Developed into a huge industry over 40 years
Relational Databases
Sets of tables (like a sheet in a spreadsheet)
Each table has rows and columns
Each table has a “schema” (the set of columns and their meanings); all rows follow the same structure
Efficient at retrieving rows quickly based on the values in some of their columns (using keys and indexes), or computing aggregates
Popular query language: Structured Query Language (SQL)
Like a spreadsheet (but no formulae)
A CSV file is a simple way to store a small(ish) database
https://data.gov.in/sites/default/files/NDSAP_Implementation_Guidelines-2.1.pdf
SQL
Query language for databases, e.g.
SELECT CAND_NAME, YEAR, POSITION, VOTES
FROM CANDIDATES
WHERE PC_NUMBER = 543;
(analogous to filtering rows in Excel)
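As a rough illustration, the query above can be tried from Python with the built-in sqlite3 module. The rows below are made-up sample data, not real election results, and the column types are assumptions.

```python
import sqlite3

# In-memory database; the rows are invented sample data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CANDIDATES ("
    "CAND_NAME TEXT, YEAR INTEGER, POSITION INTEGER, VOTES INTEGER, PC_NUMBER INTEGER)"
)
conn.executemany(
    "INSERT INTO CANDIDATES VALUES (?, ?, ?, ?, ?)",
    [
        ("Candidate A", 2014, 1, 50000, 543),
        ("Candidate B", 2014, 2, 42000, 543),
        ("Candidate C", 2014, 1, 61000, 1),
    ],
)

# Same query as on the slide, with the constituency number passed as a parameter.
rows = conn.execute(
    "SELECT CAND_NAME, YEAR, POSITION, VOTES FROM CANDIDATES WHERE PC_NUMBER = ?",
    (543,),
).fetchall()

for row in rows:
    print(row)
```

Passing the value as a parameter rather than pasting it into the SQL string is the usual habit, since it avoids quoting mistakes.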
Other kinds of databases
https://www.flickr.com/photos/caseorganic/4935757995
Graph based
Other kinds of databases
Spatial databases (efficient at quick spatial queries)
Unstructured databases (e.g. plain text)
Image databases
RDF databases
etc.
Unstructured Data Processing
Key Techniques (NLP/Information extraction):
Parsing: identifying the parts of speech and grammatical structure of a (well-formed) sentence
Entity recognition: identifying names of people, places, organizations, etc.
Disambiguation: Which “Ashoka” does Ashoka refer to?
Sentiment analysis: What feeling does a sentence convey about something?
Topic modeling: Splitting a group of documents by topic
Word embeddings: Finding co-occurring or semantically similar words in text
Question answering, e.g. IBM Watson
(Use languages like Python or Java to access these functions)
Open Data
• https://data.gov.in/sites/default/files/NDSAP.pdf
NDSAP guidelines
RTI Act, Section 4(2):
It shall be a constant endeavour of every public authority … to provide as much information suo motu to the public at regular intervals through various means of communications, including internet, so that the public have minimum resort to the use of this Act to obtain information.
Ministries/Departments will upload at least 5 “high-value” datasets on data.gov.in …
All datasets are to be updated regularly every quarter
Also see NDSAP (2012)
https://data.gov.in/sites/default/files/NDSAP_Implementation_Guidelines-2.1.pdf
RDF files
<http://dbpedia.org/resource/Rajiv_Gandhi> <http://dbpedia.org/property/office> <http://dbpedia.org/resource/Prime_Minister_of_India> .
<http://dbpedia.org/resource/Rajiv_Gandhi> <http://dbpedia.org/property/president> <http://dbpedia.org/resource/R._Venkataraman> .
<http://dbpedia.org/resource/Rajiv_Gandhi> <http://dbpedia.org/property/termStart> "1984-10-31"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://dbpedia.org/resource/Rajiv_Gandhi> <http://dbpedia.org/property/termEnd> "1989-12-02"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://dbpedia.org/resource/Rajiv_Gandhi> <http://dbpedia.org/property/predecessor> <http://dbpedia.org/resource/Indira_Gandhi> .
<http://dbpedia.org/resource/Rajiv_Gandhi> <http://dbpedia.org/property/successor> <http://dbpedia.org/resource/V._P._Singh> .
SPARQL
Query language for RDF databases, e.g.
SELECT ?name
WHERE {
?name <http://dbpedia.org/property/office> <http://dbpedia.org/resource/Prime_Minister_of_India> .
}
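To show what this query asks for, here is a minimal plain-Python sketch that stores triples as (subject, predicate, object) tuples and returns every subject matching the pattern. It is not a SPARQL engine; the helper name `subjects_with` is hypothetical, and the triples are a subset of the RDF example from the earlier slide.

```python
# Triples as (subject, predicate, object) tuples; copied from the RDF slide.
DBR = "http://dbpedia.org/resource/"
DBP = "http://dbpedia.org/property/"

triples = [
    (DBR + "Rajiv_Gandhi", DBP + "office", DBR + "Prime_Minister_of_India"),
    (DBR + "Rajiv_Gandhi", DBP + "predecessor", DBR + "Indira_Gandhi"),
    (DBR + "Rajiv_Gandhi", DBP + "successor", DBR + "V._P._Singh"),
]


def subjects_with(triples, predicate, obj):
    """Return subjects ?name such that (?name, predicate, obj) is in the data."""
    return [s for (s, p, o) in triples if p == predicate and o == obj]


names = subjects_with(triples, DBP + "office", DBR + "Prime_Minister_of_India")
print(names)
```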
Lok Dhaba
tcpd.ashoka.edu.in
ECI Data: Problems
Legacy data mostly in PDF format
No standard data schema
Very hard for researchers to work with
No data quality checks
Quality checks
Example issues found on ECI data:
- Same party, seat, year but multiple candidates
- Inconsistent PC Type Information (Gen/SC/ST)
- Multiple parties with the same code
- Inconsistent or missing Sex field
- Uncontested constituencies are completely missing
- Inconsistency of elected winners w.r.t. Lok Sabha records
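The first of these checks could be sketched in Python as follows: flag any (party, seat, year) combination that has more than one candidate. The field names and sample rows are hypothetical, and independents would presumably need to be excluded, since several of them can contest the same seat.

```python
from collections import Counter

# Hypothetical rows; field names are assumptions, not the ECI or TCPD schema.
rows = [
    {"party": "ABC", "seat": 12, "year": 2009, "candidate": "P"},
    {"party": "ABC", "seat": 12, "year": 2009, "candidate": "Q"},
    {"party": "XYZ", "seat": 12, "year": 2009, "candidate": "R"},
]

counts = Counter((r["party"], r["seat"], r["year"]) for r in rows)

# Any combination seen more than once is suspicious and needs a manual look.
duplicates = [key for key, n in counts.items() if n > 1]
print(duplicates)
```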
http://eci.nic.in/eci_main1/ElectionStatistics.aspx
http://164.100.47.194/Loksabha/Members/lokprev.aspx
Incumbency visualization
http://shivangitikekar.com/portfolio/karnataka_viz.html
HD Works
hdworks.org
HD Works
Encourages citizen participation (crowdsourcing)
Provides feedback to city officials
Increases transparency
Allows citizens to subscribe to areas of works of interest
Geo-enabled
To be done: ties to budget spending
To be done: ties to tender documents
What do we know about the functioning of our Parliament? And, (more) importantly, what concerns do our MPs raise in the House?
Parliamentary Procedure
Governed by Article 118 of the Indian Constitution:
“Each House of Parliament may make rules for regulating, subject to the provisions of this Constitution, its procedure and the conduct of its business”
The Question Hour
• Questions are tools through which parliamentarians ensure administrative accountability to the people.
• Not subject to party whips
• Can demand oral or written answers from Ministers
• An MP can submit a maximum of 10 questions for each day; a maximum of 230 questions are admitted
The number of questions received by the Parliamentary Secretariat is far higher, so the questions to be answered are selected by a random ballot.
Data Available on Questions
• Ministry to which question is asked
• House, Starred/Unstarred
• Date on which question is tabled
• Title of question
• Members asking question
• Text of Question
Methods to Understand Content
• Ministry data is insufficient: the Secretariat decides which Ministry will be taking questions and for how many days in each session.
• Often, questions span more than one theme
• The key is to categorize each question by understanding its text
Step 1: Preparing the Dataset
• Scraped the metadata of questions from the Lok Sabha website using programs in Python & R
• Scraped the text of the questions & answers for each question
• Combined this with member data from the ECI data prepared by the TCPD (candidate type, gender, religion, constituency etc.)
Step 2: Extracting Topical Information
• Required a powerful tool to understand the context of each question
• ‘Informative words’ in a question
• Is it sufficient to search for occurrences of ‘Scheduled Caste’ and treat the resulting data as the dataset of all caste questions?
Word Embeddings
• Each word is mapped to a vector of real numbers, which is useful for calculating word similarities
• Word2vec (Mikolov et al., Google, 2013): the representation of a word's meaning is determined by the contexts in which it appears
• Preserves semantic and syntactic relationships and captures the meaning of words
Result for the word ‘frog’ using GloVe
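Word similarity is commonly measured as the cosine of the angle between two word vectors. The sketch below uses made-up 3-dimensional vectors purely for illustration; real word2vec or GloVe vectors have far more dimensions and are learned from text. The function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for this example (not the output of any trained model).
vectors = {
    "frog": [0.9, 0.1, 0.0],
    "toad": [0.8, 0.2, 0.1],
    "budget": [0.0, 0.1, 0.9],
}


def most_similar(word, vectors):
    """Rank all other words by cosine similarity to `word`, highest first."""
    others = [w for w in vectors if w != word]
    return sorted(
        others,
        key=lambda w: cosine_similarity(vectors[word], vectors[w]),
        reverse=True,
    )


print(most_similar("frog", vectors))
```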
Method for Extracting Themes
• Train the word2vec algorithm on the questions from the dataset
• Determine word similarity: get ~500 words similar to the anchor word
• For example: get words similar to ‘woman’, ‘minority’, ‘caste’ etc.
• Prepare a curated list of 50-80 words
Anchor word (theme): words (curated using word2vec)
woman: woman, widowed, mothers, ladies, female, creches, maternity, sabla, janani, pregnant, girl
education: educational, learning, elementary, syllabi, teacher, secondary, rte, shiksha, cabe, pupil, literacy, madarsas, aicte, ncert
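The slides do not spell out how the curated lists are applied. One plausible use, sketched here as an assumption, is to tag a question with every theme whose word list overlaps the question text. The function name `themes_for` and the sample question are hypothetical; the word lists are shortened from the lists above.

```python
import re

# Shortened versions of the curated lists shown above.
THEME_WORDS = {
    "woman": {"woman", "widowed", "mothers", "female", "maternity", "pregnant", "girl"},
    "education": {"educational", "learning", "elementary", "teacher", "literacy", "ncert"},
}


def themes_for(question_text, theme_words=THEME_WORDS):
    """Return the sorted list of themes sharing at least one word with the text."""
    tokens = set(re.findall(r"[a-z]+", question_text.lower()))
    return sorted(theme for theme, words in theme_words.items() if tokens & words)


# Hypothetical question text, written for this example.
question = "Whether the Government proposes to improve literacy among girl children"
print(themes_for(question))
```

A question can receive several themes this way, which matches the earlier observation that questions often span more than one theme.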
Method for Extracting Topics
• What if we don’t know the topics that might appear in a corpus?
• Topic modelling: the LDA topic model captures ‘content’ and clusters the words falling under the same topic (for n topics)
• Input the training data (question text) and the number of topics required; the output is as follows:
Topic: word list (generated by the topic model)
scholarships: students scholarship amount scheduled scheme scholarships government coaching post matric post-matric details belonging proposes increase schemes caste fixed obc tribe
crimes/atrocities: act cases atrocities prevention scheduled courts special sc/st castes protection rights registered pending set state-wise implementation law legal disposal provisions
Working with Data: A few tips
Primary vs. derived data
Data should be broken down into “primary” and “derived” data, e.g.
primary data: gender of a person
derived data: number of women in the dataset
primary data: candidate contesting in an AC
derived data: number of candidates contesting in an AC
Primary vs. derived data
Primary data is assumed to be correct
Derived data is computed from primary data by an automated process (like a formula)
If primary data is correct, derived data is correct (if the automated process is correct)
If primary data changes, derived data can be updated automatically
Not all primary source data is primary data (often sources will provide redundant or derived data)
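A small Python sketch of this split, using hypothetical records: the per-candidate rows are primary data, and the counts are derived by a function that can simply be re-run whenever the primary data changes. Field names and values are assumptions for the example.

```python
from collections import Counter

# Primary data: one record per candidate (hypothetical values and field names).
candidates = [
    {"name": "P", "ac": "AC-1", "sex": "F"},
    {"name": "Q", "ac": "AC-1", "sex": "M"},
    {"name": "R", "ac": "AC-2", "sex": "F"},
]


def derive(candidates):
    """Compute derived values from primary data; never edit these by hand."""
    return {
        "num_women": sum(1 for c in candidates if c["sex"] == "F"),
        "candidates_per_ac": dict(Counter(c["ac"] for c in candidates)),
    }


derived = derive(candidates)
print(derived)

# If primary data changes, re-running derive() brings the derived data up to date.
candidates.append({"name": "S", "ac": "AC-2", "sex": "F"})
derived = derive(candidates)
print(derived)
```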
Data quality checks
Data should not be blindly trusted
Think: What could be wrong? How can I catch anomalies?
Write consistency checks on the value in fields, the relationship between fields in a row, etc.
If derived data is present, check its correctness
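Such checks can be written as small functions that return a list of problems per row. The sketch below is illustrative only: the field names, the allowed values, and the rule that votes cannot exceed electors are assumptions chosen for the example.

```python
ALLOWED_SEX = {"M", "F", "O"}  # assumed coding, for illustration


def check_row(row):
    """Return a list of human-readable problems found in one row."""
    problems = []
    # Check on the value in a single field.
    if row.get("sex") not in ALLOWED_SEX:
        problems.append(f"unexpected sex value: {row.get('sex')!r}")
    # Check on the relationship between fields in the same row.
    if row["votes"] > row["electors"]:
        problems.append("votes exceed electors")
    return problems


# Hypothetical rows: the second and third contain deliberate errors.
rows = [
    {"name": "P", "sex": "F", "votes": 1000, "electors": 5000},
    {"name": "Q", "sex": "", "votes": 1200, "electors": 5000},
    {"name": "R", "sex": "M", "votes": 9000, "electors": 5000},
]

report = {r["name"]: check_row(r) for r in rows}
for name, problems in report.items():
    print(name, problems)
```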
Data corrections
Sometimes primary data may be wrong/inconsistent
Corrections in primary data should be carried out through a script
Script captures intention of update (e.g. change “ASAWARPUR” to “ASAVARPUR”)
Script can be re-applied automatically if base data changes
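A correction script can be as simple as a table of intended changes applied by code. In this sketch the spelling fix is the one from the slide; the field name `constituency`, the function name, and the sample rows are hypothetical.

```python
# Each entry records the intention of a correction: (field, wrong value, right value).
CORRECTIONS = [
    ("constituency", "ASAWARPUR", "ASAVARPUR"),
]


def apply_corrections(rows, corrections=CORRECTIONS):
    """Return corrected copies of the rows; the base data is left untouched."""
    fixed = []
    for row in rows:
        row = dict(row)  # copy, so the original record is not modified
        for field, wrong, right in corrections:
            if row.get(field) == wrong:
                row[field] = right
        fixed.append(row)
    return fixed


# Hypothetical base data; re-run apply_corrections() whenever it is refreshed.
base = [
    {"constituency": "ASAWARPUR", "year": 1977},
    {"constituency": "KARNAL", "year": 1977},
]
corrected = apply_corrections(base)
print(corrected)
```

Keeping the corrections in a list separate from the data also gives reviewers a single place to read and question each change.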
Data versioning
Store all versions of data
Tools like git can store many versions efficiently (use simple formats like CSV)
Allows rollback if anything changes by mistake
Allows branching and multiple people working on different sections of the data
Git support is built in to RStudio
Data provenance
Source of data needs to be properly documented (and acknowledged)
All updates should be controlled and documented carefully
If you make a data correction, have it reviewed by someone else, and write a justification
When providing data-based analysis, also offer primary data for reproducibility (many wrong conclusions have been drawn due to errors in spreadsheets!)
Schema
Building a long-lived schema can be quite hard
Assumptions change, new variables come in, etc.
Think of the types of each column:
What are the allowed values?
What are the allowed ranges?
What field/combination of fields uniquely identifies a row (i.e., is a “key”)?
e.g. <Const#, Year of election> is not sufficient: the same year could have multiple (by)polls in the same constituency, and constituency numbers also change across delimitations
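Whether a set of columns really is a key can be tested directly against the data. This sketch uses hypothetical rows in which a general election and a by-election fall in the same year; the extra `poll_no` column and the function name `is_key` are assumptions used here to tell the two polls apart.

```python
from collections import Counter


def is_key(rows, fields):
    """True if no two rows share the same values for all of `fields`."""
    counts = Counter(tuple(r[f] for f in fields) for r in rows)
    return all(n == 1 for n in counts.values())


# Hypothetical rows, one per election held in a constituency.
rows = [
    {"const_no": 7, "year": 1999, "poll_no": 0},  # general election
    {"const_no": 7, "year": 1999, "poll_no": 1},  # by-election, same year
    {"const_no": 8, "year": 1999, "poll_no": 0},
]

print(is_key(rows, ["const_no", "year"]))
print(is_key(rows, ["const_no", "year", "poll_no"]))
```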
Types of variables
Nominal (a.k.a. categorical): discrete set of values (e.g. candidate caste, sex)
Ordinal: numeric, with ordering only (e.g. rank)
Quantitative: numeric, with math operations (e.g. number of votes)
Think carefully about which type of variable each column is
http://eci.nic.in/eci_main1/ElectionStatistics.aspx