1 Wedtech - 24 Aug. 05 What is the Proof necessary for Truth (whatever that is) Tom Johnson Managing Director Institute for Analytic Journalism Santa Fe, New Mexico Presentation to FRIAMGroup's Applied Complexity Lecture Series Santa Fe, NM USA 24 August 2005



What is the Proof necessary for Truth (whatever that is)

Tom Johnson
Managing Director
Institute for Analytic Journalism
Santa Fe, New Mexico

Presentation to FRIAMGroup's Applied Complexity Lecture Series
Santa Fe, NM USA
24 August 2005


What is the IAJ

Analysis using a variety of tools and methods from multiple disciplines

Understand multiple phenomena

Communicate results to multiple audiences in a variety of ways.


Cornerstones of IAJ

General Systems Theory

Statistics

Visual statistics/infographics

Simulation modeling


Problem of the day

So what’s the problem of the day for analytic journalists?


So what’s the problem?

Ever increasing -- beyond estimate -- number of public records databases

DBs increasingly used for a broad spectrum of decision-making

Assumption that data, as given, is correct. Anecdotal evidence suggests that’s not so.


Examples of bad data

St. Louis Post-Dispatch 1997-98: 350 S. Ill. sex offenders

“…found that hundreds of convicted sex offenders don't actually live at the addresses listed on the sex offender registries for St. Louis, St. Louis County and the Metro East area.”

Every record carried a 30-50% probability of error

1999 - City of St. Louis: “About 700 Sex Offenders Do Not Appear To Live At The Addresses Listed On A St. Louis Registry.”

Boston 2000 BPD - 6 detectives assigned to cleaning up sex offenders DB


Examples of bad data

2000 - Florida voter registration rolls

State hires DBT Online/Choicepoint to “purge rolls.”

“Some [counties] found the list too unreliable and didn't use it at all. … Counties that did their best to vet the file discovered a high level of errors, with as many as 15 percent of names incorrectly identified as felons.”

Source: Palast, Greg. http://www.gregpalast.com/detail.cfm?artid=55


More bad data

2004 - Dallas Morning News

“…The state criminal convictions database is so riddled with holes that law enforcement officials say public safety is at risk.”

“… the state has only 69 percent of the complete criminal histories records for 2002. In 2001, the state had only 60 percent. Hundreds of thousands of records are missing.”


Surely there is a simple solution….

Is there a methodology to measure, to know -- or to anticipate -- the quality, i.e. veracity, of a given database?

What are the best -- and most objective -- ways to “X-ray” a DB to note internal problems or potential problems?

Hoping for answers from statisticians, data miners, forensic accountants, bioinformatics, genomics, physics, etc. ‘cause journalists don’t have much of a clue


Approaches to database analysis

Theoretical/statistical

What can we know about a database only based on its size and whether a record’s field/cell is occupied?

Are there cheap, fast and good templates/tools to X-ray the DB?

Contextual/statistical

How would knowing the context/meaning of data -- or lack of data -- in cells change our answers to previous questions?

Are there methodologies to help us weigh the importance of a variable relative to the veracity of a record? e.g. is “name” more important than SS#?


Approaches to database analysis

Theoretical/statistical

What can we know about a database -- and its potential validity -- only based on its size and whether a record’s field/cell is occupied?

Are there cheap, fast and good templates/tools to X-ray the DB?

Contextual/statistical

How would knowing the context/meaning of data -- or lack of data -- in cells change our answers to previous question?

Are there methodologies to help us weigh the importance of a variable relative to the veracity of a record? e.g. is “name” more important than SS#?

Both/all approaches vary with the question(s) being asked


Theoretical database structure

DB =

Metadata

Coding sheet

Fields/elements

Field tag (name)

Character limited/open field

Numeric/alpha

End-of-Record character

Number of records
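The structural pieces listed above can be written down as a small data structure. This is only a sketch: the class names `FieldSpec` and `CodingSheet`, the attribute names, and the default end-of-record character are assumptions made for illustration, not part of the presentation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FieldSpec:
    """One line of a hypothetical coding sheet."""
    tag: str                          # field tag (name)
    kind: str                         # "numeric" or "alpha"
    max_chars: Optional[int] = None   # None means an open (unlimited) field


@dataclass(frozen=True)
class CodingSheet:
    """Metadata describing a database, independent of its contents."""
    fields: Tuple[FieldSpec, ...]     # ordered field definitions
    record_count: int                 # number of records the DB claims to hold
    end_of_record: str = "\n"         # assumed end-of-record character

    def cell_count(self) -> int:
        """Total number of cells the database should contain."""
        return len(self.fields) * self.record_count


# The 100-record, 10-field matrix used as the theoretical database.
sheet = CodingSheet(
    fields=tuple(FieldSpec(f"fld_{i}", "alpha") for i in range(1, 11)),
    record_count=100,
)
print(sheet.cell_count())  # 1000
```

Keeping the metadata separate from the records makes it possible to ask size-only questions before looking at any cell.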


Theoretical database

Assume matrix - 100 records, 10 fields

Assume a given -- and occupied -- index field (serial record number)


Theoretical database

Assume matrix - 100 records, 10 fields

Assume a given -- and occupied -- index field (serial record number)

 Does a record's LCI (Loaded Cell Index), from 10% to 100%, constitute "proof" of anything? 
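One possible reading of the Loaded Cell Index is the share of a record's cells that are occupied, with the index field counted, so a 10-field record ranges from 10% to 100%. The sketch below assumes that reading; the function names and the rule that blank or whitespace-only cells count as empty are assumptions.

```python
def is_occupied(cell):
    """Treat None and blank/whitespace-only values as empty cells."""
    return cell is not None and str(cell).strip() != ""


def loaded_cell_index(record):
    """Share of occupied cells in one record (index field included)."""
    if not record:
        raise ValueError("record has no fields")
    return sum(1 for cell in record if is_occupied(cell)) / len(record)


# A hypothetical 10-field record: serial number plus three filled cells.
record = [17, "SMITH", "", None, "NM", "", "", "87501", "", ""]
print(loaded_cell_index(record))  # 0.4
```

The number says how full a record is, not whether any filled cell is correct.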


Theoretical database

LAs (logical adjacencies) are not necessarily physically adjacent in the record layout.

As in a genome, data present -- or not present -- in one field can trigger the presence or lack of data in another.

[Figure: record layout showing Fld #1 through Fld #10]
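One rough way to look for logical adjacencies is to ask how often field j is occupied when field i is occupied. The sketch below assumes that interpretation; `conditional_fill` and the sample records are hypothetical.

```python
def is_occupied(cell):
    """Treat None and blank/whitespace-only values as empty cells."""
    return cell is not None and str(cell).strip() != ""


def conditional_fill(records, i, j):
    """Estimate P(field j occupied | field i occupied).

    Returns None when field i is never occupied.
    """
    rows = [r for r in records if is_occupied(r[i])]
    if not rows:
        return None
    return sum(1 for r in rows if is_occupied(r[j])) / len(rows)


# Hypothetical records: serial number, a flag field, a date field.
records = [
    ["1", "yes", "2001-05-01"],
    ["2", "", ""],
    ["3", "yes", "1999-01-12"],
    ["4", "", "2003-07-30"],
]
print(conditional_fill(records, 1, 2))  # 1.0: a flag always comes with a date
print(conditional_fill(records, 2, 1))  # a date does not always come with a flag
```

The asymmetry between the two directions is the interesting part: it hints at which field "triggers" the other.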


Assumptions???

The greater a record’s LCI, the greater potential (probability?) that record has enough “Proof” to achieve “True Data” status. Do we think this is true?

Probably, even when we have no idea what the data is/means. Still, “proof” seems to occupy a density-of-data continuum reaching for some critical mass. How do we measure that criticality?

When software achieves critical mass, it can never be fixed; it can only be discarded and rewritten.

Same for DBs?

How do programmers measure that critical mass?


Assumptions???

Probably, even when we have no idea what the data is/means. Still, “proof” seems to occupy a continuum reaching for some critical mass. How do we measure that criticality?

When the focus is on an individual record, we must have context/meaning/definition for the variables/elements; otherwise it is a nonsensical array of possibly random numbers.

There is no opportunity for Proof of anything, much less Truth.


Search for patterns (in 100+k records)

Are there patterns? How can I quickly identify them?

Are there consistencies?

Do populated cells suggest anything about hierarchy of importance?

Are there "Logical Adjacencies" (LAs)?
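A quick first pass at pattern-spotting is to reduce every record to its occupancy pattern and count how often each pattern occurs. This is a sketch only; the `X`/`.` encoding and the sample records are assumptions.

```python
from collections import Counter


def is_occupied(cell):
    """Treat None and blank/whitespace-only values as empty cells."""
    return cell is not None and str(cell).strip() != ""


def occupancy_pattern(record):
    """Encode a record as a string: 'X' for a filled cell, '.' for an empty one."""
    return "".join("X" if is_occupied(c) else "." for c in record)


def pattern_counts(records):
    """Count how many records share each occupancy pattern."""
    return Counter(occupancy_pattern(r) for r in records)


records = [
    ["1", "a", "b", ""],
    ["2", "a", "", ""],
    ["3", "c", "d", ""],
    ["4", "", "", ""],
    ["5", "e", "f", ""],
]
for pattern, count in pattern_counts(records).most_common():
    print(pattern, count)
```

On 100k+ records, a handful of dominant patterns plus a long tail of rare ones would suggest where to look first; a field that is empty in every pattern is worth a question to whoever keeps the DB.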


Demographics of a database

Logical Adjacencies

Patterns in LAs?

Is there a hierarchy of import/value of LAs?

Are there various thresholds of LAs present, i.e. is it better Proof to have four LAs than three? 

Maybe, maybe not. So how do we create rules to weigh (a) a cell and (b) LAs?
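If such rules existed, they might look like a weighted score: each occupied cell adds its weight, and each logical adjacency adds a bonus when both of its fields are occupied. The sketch below assumes that form; the function name and every weight in it are made up for illustration and carry no empirical meaning.

```python
def is_occupied(cell):
    """Treat None and blank/whitespace-only values as empty cells."""
    return cell is not None and str(cell).strip() != ""


def proof_score(record, cell_weights, la_weights):
    """Weighted fullness of a record, scaled to the range 0..1.

    cell_weights: one weight per field.
    la_weights:   {(i, j): weight}, credited only when fields i and j
                  are both occupied.
    """
    filled = [is_occupied(c) for c in record]
    score = sum(w for f, w in zip(filled, cell_weights) if f)
    score += sum(w for (i, j), w in la_weights.items() if filled[i] and filled[j])
    best = sum(cell_weights) + sum(la_weights.values())
    return score / best


# Hypothetical record: serial number, name, SS#, ZIP code.
record = ["1", "SMITH, J", "", "87501"]
cell_weights = [1, 3, 5, 1]            # assumed: SS# matters most
la_weights = {(1, 3): 2, (1, 2): 3}    # assumed name+ZIP and name+SS# adjacencies
print(round(proof_score(record, cell_weights, la_weights), 3))
```

The hard part the slide points at is choosing the weights, which this sketch leaves entirely to the analyst.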


Demographics of a database

Logical Adjacencies

If a record does not meet some standard of LA-ness, do we discard it from the analysis because it lacks the potential for Proof? (Discarded outlier problem)

Do patterns of populated cells suggest anything about hierarchy of importance or only data input process?

Are some records “better” records?

Any “truth” to be found?

Tools to quickly, easily see these answers?


Working with the real stuff

Fundrace 2004 Neighbor Search http://www.fundrace.org/neighbors.php

Political Money Line http://www.fecinfo.com/cgi-win/indexhtml.exe?MBF=zipcode


Missing data problem. Significant?
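Whether missing data is significant starts with knowing how much is missing, field by field. A minimal sketch, assuming records arrive as dictionaries; the field names and sample rows are hypothetical, not taken from the campaign-contribution data shown.

```python
def is_occupied(cell):
    """Treat None and blank/whitespace-only values as empty cells."""
    return cell is not None and str(cell).strip() != ""


def missing_rates(rows, fields):
    """Fraction of rows in which each field is empty or absent."""
    if not rows:
        raise ValueError("no rows to examine")
    return {
        f: sum(1 for r in rows if not is_occupied(r.get(f))) / len(rows)
        for f in fields
    }


rows = [
    {"name": "SMITH, J", "occupation": "Attorney", "zip": "87501"},
    {"name": "SMITH, JOHN", "occupation": "", "zip": "87501"},
    {"name": "GARCIA, M", "occupation": None, "zip": "87505"},
    {"name": "LEE, K", "occupation": "Engineer", "zip": ""},
]
print(missing_rates(rows, ["name", "occupation", "zip"]))
```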


Realities of DBs

The NAME problem

Can this be “cleaned” automatically?
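Part of the name problem can be cleaned automatically, at least for case, punctuation, "Last, First" ordering and common titles. The sketch below is deliberately crude and the `TITLES` list is an assumption; it will not resolve nicknames, initials or misspellings.

```python
import re

# Assumed list of tokens to drop; extend for the database at hand.
TITLES = {"MR", "MRS", "MS", "DR", "JR", "SR", "II", "III"}


def normalize_name(raw):
    """Reduce a name to upper-case 'FIRST ... LAST' tokens for comparison."""
    name = raw.upper()
    name = re.sub(r"[^A-Z, ]", "", name)      # drop periods, digits, apostrophes
    if "," in name:                           # "LAST, FIRST" -> "FIRST LAST"
        last, _, rest = name.partition(",")
        name = f"{rest} {last}"
    tokens = name.replace(",", " ").split()
    return " ".join(t for t in tokens if t not in TITLES)


for raw in ["Johnson, Tom", "  tom  JOHNSON jr.", "Dr. Tom Johnson"]:
    print(repr(raw), "->", normalize_name(raw))
```

All three spellings above reduce to the same string, which is what makes them comparable; whether they are the same person is a separate question.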


“Dirty” campaign contributions

Same person?



“Dirty” campaign contributions

How do we easily spot these problems in large DB?

How do we rectify them in large DB?

Same person? Same job?
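One way to spot "same person?" candidates in a large DB is to compare names only within the same ZIP code and flag pairs whose spellings are nearly identical. This sketch uses the standard library's `difflib.SequenceMatcher`; the 0.85 threshold, the field names and the sample records are assumptions, and flagged pairs are leads to check, not confirmed matches.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Similarity of two strings in the range 0..1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_same_person(records, threshold=0.85):
    """Return (i, j, score) for record pairs that share a ZIP code
    and have names at least `threshold` similar."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if a["zip"] != b["zip"]:
            continue
        score = similarity(a["name"], b["name"])
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs


records = [
    {"name": "Johnson, Thomas", "zip": "87501"},
    {"name": "Johnson, Thomas B.", "zip": "87501"},
    {"name": "Garcia, Maria", "zip": "87501"},
    {"name": "Johnson, Thomas", "zip": "10001"},
]
print(likely_same_person(records))
```

Comparing every pair is quadratic, so on 100k+ records the ZIP-code grouping (or some other blocking key) does most of the work of keeping this affordable.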


Wrong data

Huh?

Is there any way to vet this cell’s data?

How many triangulated DBs are necessary to meet some “proof” index?

Does this field have enough importance (the hierarchy of importance?) to be worth X time/money to verify?

Is there a better way than drawing a sample and tracking down original data?
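For reference, the sample-and-verify baseline can be sketched as follows: draw a random sample of records, check each against the original source by hand, and put an interval around the observed error rate. The Wilson score interval used here is a standard choice, not one named in the presentation, and the figures in the example are invented.

```python
import math
import random


def draw_sample(record_ids, k, seed=0):
    """Pick k record ids at random, reproducibly, for manual checking."""
    return random.Random(seed).sample(list(record_ids), k)


def wilson_interval(errors, n, z=1.96):
    """Approximate 95% Wilson score interval for an error rate."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical check: 200 sampled records, 30 found wrong at the source.
to_check = draw_sample(range(100_000), 200)
low, high = wilson_interval(errors=30, n=200)
print(f"estimated error rate between {low:.1%} and {high:.1%}")
```

The interval narrows only with the square root of the sample size, which is why the slide asks whether there is a cheaper route.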


Ver 1.0 workshop April 9-12, 2006

Workshop on public database verification for journalists and social scientists

“Ver” as in “verification” and “verify” and, from the Spanish verb ver: “to see; to look into; to examine.”

Ver 1.0 Objectives

1. Developing new statistical methods for DB verification;

2. Building a flowchart/decision tree for the DB verification process;

3. Developing rules for creation of a hierarchy of importance/significance of record elements, i.e. variables, in common databases.

Seeking suggestions:

Automated

Affordable

Generic or easily adapted to various DBs

Easily understood and with error trapping

Easy to learn/apply
