
Data Privacy in Biomedicine

Lecture 9: Availability of Data and

(time permitting) the Curse of the SSN

Bradley Malin, PhD ([email protected])

Professor of Biomedical Informatics, Biostatistics, & Computer Science

Vanderbilt University

February 10, 2020

© 2020 Bradley Malin 2Lecture 9: Availability & Prediction

Overview

◼ Information Generation

◼ Models of Availability

◼ Some Resources

◼ A Look at Voter Registration

◼ Curse of the SSN

Information Explosion

[Figure: two panels covering 1983 to 2003. "Growth in available disk storage": GDSP (MB/person), axis 0 to 500. "Growth in active web servers": Servers (in millions), axis 0 to 35. Annotated years: 1991, 1996, 2001; one annotation reads "1st WWW conference".]

L. Sweeney. Information explosion. In L. Zayatz, et al. (eds) Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies. Urban Institute, Washington, DC, 2001

Newer Data

◼ Exabyte = 1 billion gigabytes

Year   New Data Generated in the “World”   Source
2003   5 exabytes*                         UC Berkeley
2006   161 exabytes**                      IDC

* Includes analog data (radio communications, paper memos, etc.)

** Includes new and replicated data. Some estimates put original information closer to 40 exabytes

[Figure: exabytes per year, 2002 to 2010, axis 0 to 1200; series: New, New + Replicated]

Latest Numbers

◼ On average, the US alone is now generating 2,657,700 GB (about 2.66 petabytes, or roughly 2.66 quadrillion bytes) of Internet data every minute

◼ https://www.domo.com/learn/data-never-sleeps-5
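As a rough sanity check on the scale of these figures, the unit arithmetic can be sketched in a few lines of Python. The sketch assumes decimal (SI) units, consistent with "Exabyte = 1 billion gigabytes"; the function names are illustrative only.

```python
# Decimal (SI) units are assumed: 1 GB = 10**9 bytes, 1 EB = 10**18 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_EB = 10**18

def gb_to_eb(gigabytes):
    """Convert gigabytes to exabytes (1 EB = 1 billion GB)."""
    return gigabytes * BYTES_PER_GB / BYTES_PER_EB

def per_minute_to_per_day(rate_per_minute):
    """Scale a per-minute rate to a per-day rate."""
    return rate_per_minute * 60 * 24

us_gb_per_minute = 2_657_700  # figure quoted on the slide
us_eb_per_day = gb_to_eb(per_minute_to_per_day(us_gb_per_minute))
print(f"{us_eb_per_day:.2f} EB per day")  # prints "3.83 EB per day"
```

At that rate, the 5 exabytes estimated for all of 2003 would be matched by US Internet traffic alone in roughly 1.3 days.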

Birth Certificates (circa 1925)

Field# Field name

1 Child's first name

2 Child's middle name (sometimes or initial)

3 Child's last name

4 Day, month and year of birth

5 City and/or County of birth (sometimes hospital)

6 Father's name

7 Mother's name (including maiden name)

8 Place of birth (address and town/city)

9 Mother's age and address

10 Mother's birthplace (town/city, state, county)

11 Mother's occupation

12 Mother, number of previous children

13 Father's age and address

14 Father's birthplace (town/city, state, county)

15 Father's occupation


Electronic Birth Certificates (post 1999)

Field# Size Field name

1 1 File Status

2 50 Baby’s First Name

3 50 Baby’s Middle Name

4 50 Baby’s Last Name

5 1 Baby’s Suffix Code

6 3 Baby’s Suffix Text

7 8 Baby’s Date of Birth

8 5 Baby’s Time of Birth

9 1 AM/PM Indicator

10 1 Baby’s Sex

11 3 Blood Type

12 1 Born Here?

13 40 Place of Birth

14 1 Facility Type

15 20 City of Birth

Beyond the Google Phenomenon

16 20 County of Birth

17 6 Certifier’s Code

18 30 Certifier’s Name

19 1 Certifier’s Title

20 30 Attendant’s Name

21 1 Attendant’s Title

22 23 Attendant’s Address

23 19 Attendant’s City

24 2 Attendant’s State

25 10 Attendant’s Zip Code

26 50 Mother’s First Name

27 50 Mother’s Middle Name

28 50 Mother’s Last Name

29 9 Mother’s Social Security Number

30 8 Mother’s Date of Birth


31 3 Mother’s State of Birth

32 7 Mother’s Residence Address

33 2 Mother’s Residence Direction

34 20 Residence Street Address

35 10 Residence Type

36 2 Residence Extension

37 10 Residence Apartment #

38 20 Mother’s Town of Residence

39 1 Mother’s Residence in City Limits

40 14 Mother’s County of Residence

41 3 Mother’s State of Residence

42 10 Mother’s Residence Zip Code

43 38 Mother’s Mailing Address

44 19 Mother’s Mailing City

45 2 Mother’s Mailing State

46 10 Mother’s Mailing Zip Code

47 1 Mother Married?

48 50 Father’s First Name

49 50 Father’s Middle Name

50 50 Father’s Last Name

51 1 Father’s Suffix Code

52 9 Father’s Suffix Text

53 9 Father’s Social Security Number

54 8 Father’s Date of Birth

55 3 Father’s State of Birth

56 14 Mother’s Origin

57 14 Mother’s Race

58 2 Mother’s Elementary Education

59 2 Mother’s College Education

60 11 Mother’s Occupation


61 11 Mother’s Industry

62 14 Father’s Origin

63 14 Father’s Race

64 2 Father’s Elementary Education

65 2 Father’s College Education

66 11 Father’s Occupation

67 11 Father’s Industry

68 1 Plurality

69 1 Birth Order

70 2 Live Births Still Living

71 2 Live Births Now Dead

72 4 Month/Year Last Live Birth

73 2 Number of Terminations

74 4 Month/Year Last Termination

75 1 Baby’s Weight Unit

76 5 Baby’s Weight

77 6 Date of Last Normal Menses

78 1 Month Prenatal Care Began

79 2 Total Number of Visits

80 2 Apgar Score – 1 Minute

81 2 Apgar Score – 5 Minute

82 2 Estimate of Gestation

83 6 Date of Blood Test

84 22 Laboratory

85 1 Mother Transferred In

86 30 Facility Mother Transferred From

87 1 Baby Transferred Out

88 30 Facility Baby Transferred To

89 1 Tobacco Use During Pregnancy

90 3 Number of Cigarettes/Day


91 1 Alcohol Use During Pregnancy

92 3 Number of Drinks/Week

93 3 Mother’s Weight Gain

94 1 Release Info For SSN

95 6 Operator Code

96 12 Hospital ID

97 1 Sent to Romans

98 1 Sent to APORS

99 16 Other Certifier Specify

100 12 Temporary Audit Number

101 16 Other Facility Specify

102 16 Other Attendant Specify

103 1 Mother’s Race

104 1 Father’s Race

105 2 Mother’s Origin

106 2 Father’s Origin

107 1 Attendant Same YN

108 1 Mailing Address Same YN

109 1 Capture Father’s Info YN

110 2 Mother’s Age

111 2 Father’s Age

112 12 Baby’s Hospital Med. Rec.

113 1 High Risk Pregnancy YN

114 1 Care Giver (For Chicago)

115 1 Record Selected For Download

116 1 Downloaded

117 1 Printed

118 12 Form Number

MEDICAL RISK FACTORS

119 1 Anemia


120 1 Cardiac Disease

121 1 Acute/Chronic Lung Disease

122 1 Diabetes

123 1 Genital Herpes

124 1 Hydramnios/Oligohydramnios

125 1 Hemoglobinopathy

126 1 Hypertension, Chronic

127 1 Hypertension, Preg. Assoc.

128 1 Eclampsia

129 1 Incompetent Cervix

130 1 Previous Infant 4000+ Grams

131 1 Previous Preterm or SGA Infant

132 1 Renal Disease

133 1 Rh Sensitization

134 1 Uterine Bleeding

135 1 No Medical Risk Factors

136 40 Other Medical Risk Factors

OBSTETRIC PROCEDURES

137 1 Amniocentesis

138 1 Electronic Fetal Monitoring

139 1 Induction of Labor

140 1 Stimulation of Labor

141 1 Tocolysis

142 1 Ultrasound

143 1 No Obstetric Procedures

144 40 Other Obstetric Procedures

COMPLICATIONS OF LABOR & DELIVERY

145 1 Febrile (>100 or 38C)

146 1 Meconium Moderate, Heavy

147 1 Premature Rupture (>12 Hrs)


METHOD OF DELIVERY

162 1 Vaginal

163 1 Vaginal After Previous C-Section

164 1 Primary C-Section

165 1 Repeat C-Section

166 1 Forceps

167 1 Vacuum

ABNORMAL CONDITIONS OF NEWBORN

168 1 Anemia

169 1 Birth Injury

170 1 Fetal Alcohol Syndrome

171 1 Hyaline Membrane Disease/RDS

172 1 Meconium Aspiration Syndrome

173 1 Assisted Ventilation <30

174 1 Assisted Ventilation >30

175 1 Seizures

176 1 No Abnormal Conditions of Newborn

177 40 Other Abnormal Condition of Newborn

CONGENITAL ANOMALIES OF CHILD

178 1 Anencephalus

179 1 Spina Bifida/Meningocele

180 1 Hydrocephalus

181 1 Microcephalus

182 40 Other CNS Anomalies

183 1 Heart Malformations

184 40 Other Circ./Resp. Anomalies

185 1 Rectal Atresia/Stenosis

186 1 Tracheo-Esophageal Fistula/Esophageal Atresia

187 1 Omphalocele/Gastroschisis

188 40 Other Gastrointestinal Ano.


189 1 Malformed Genitalia

190 1 Renal Agenesis

191 40 Other Urogenital Anomalies

192 1 Cleft Lip/Palate

193 1 Polydactyly/Syndactyly/Adactyly

194 1 Club Foot

195 1 Diaphragmatic Hernia

196 40 Other Musculoskeletal/Integumental Anomalies

197 1 Down’s Syndrome

198 40 Other Chromosomal Anomalies

199 1 No Congenital Anomalies

200 40 Other Congenital Anomalies

CODE STRIP

201 1 Record Complete YN

202 1 Record Type

203 4 Facility ID

204 4 City of Birth

205 3 County of Birth

206 2 Mother’s State of Birth

207 2 Mother’s State of Residence

208 4 Mother’s Town of Residence

209 3 Mother’s County of Residence

210 2 Father’s State of Birth

211 14 Certifier’s License Number

212 6 Laboratory ID Number

213 4 Mother Xfer Code

214 3 Mother Xfer County Code

215 4 Baby Xfer Code

216 3 Baby Xfer County Code

217 4 Year of Birth


218 7 Certificate #

219 1 Unique Code

220 8 File Date

221 2 Community Area

222 4 Census Tract

223 2 Century of Last Live Birth

224 2 Century of Last Termination

225 2 Century of Last Menses
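The Field# / Size / Field name layout above reads like a fixed-width record specification. As a minimal sketch, assuming (this is not stated on the slides) that a record is simply the fields concatenated in listed order, each padded to its size, the first few fields could be sliced out as follows. The sample record and its YYYYMMDD date encoding are made up for illustration.

```python
# (size, name) pairs for fields 1-7, copied from the table above.
LAYOUT = [
    (1, "File Status"),
    (50, "Baby's First Name"),
    (50, "Baby's Middle Name"),
    (50, "Baby's Last Name"),
    (1, "Baby's Suffix Code"),
    (3, "Baby's Suffix Text"),
    (8, "Baby's Date of Birth"),
]

def parse_record(line, layout=LAYOUT):
    """Slice a fixed-width line into a {field name: value} dict."""
    record, offset = {}, 0
    for size, name in layout:
        record[name] = line[offset:offset + size].strip()
        offset += size
    return record

# Hypothetical record, padded to the declared widths.
sample = (
    "A"
    + "Jane".ljust(50)
    + "Q".ljust(50)
    + "Doe".ljust(50)
    + " "
    + "".ljust(3)
    + "20000101"
)
parsed = parse_record(sample)
print(parsed["Baby's Last Name"], parsed["Baby's Date of Birth"])
```

The point of the sketch is how little work it takes to turn such a file into structured, linkable attributes once the layout is known.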


Overview

◼ Information Generation

◼ Models of Availability

◼ Some Resources

◼ A Look at Voter Registration

◼ Curse of the SSN


Accessibility

◼ Characterization of datasets / data

◼ Meta-information

◼ Cost: Price per record or cost per dataset?

◼ Attribute: Type of data (e.g., name, birthdate, profession)

◼ Availability: Credentials needed to access the dataset

[Diagram: Semantics (Dataset, Attribute); Credentials (Dataset, Availability); Economics (Dataset, Cost)]

Availability

◼ Public: Anyone can access the information with little, if any, constraints (e.g., Google / Public Records)

◼ Semi-Public: The data is there, but there are some barriers to entry (e.g., money)

◼ Semi-Private: Requires certain credentials to access such information (e.g., Census researchers)

◼ Private: Only privileged individuals are privy to the information (e.g., Top Secret)
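The three kinds of meta-information (attributes, availability, cost) and the four availability levels can be written down as a small data structure. This is only a sketch of one possible encoding: the class names, the numeric ordering of the levels, and the catalog entries are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum

class Availability(IntEnum):
    """The four access levels, ordered from least to most restricted."""
    PUBLIC = 1
    SEMI_PUBLIC = 2
    SEMI_PRIVATE = 3
    PRIVATE = 4

@dataclass(frozen=True)
class DatasetInfo:
    """Meta-information about a dataset: semantics, credentials, economics."""
    name: str
    attributes: tuple            # e.g., ("name", "birthdate")
    availability: Availability
    cost_usd: float              # price of the whole dataset (hypothetical)

def accessible(datasets, max_level, budget_usd):
    """Datasets reachable by someone cleared up to max_level with budget_usd."""
    return [d for d in datasets
            if d.availability <= max_level and d.cost_usd <= budget_usd]

# Hypothetical catalog entries, for illustration only.
catalog = [
    DatasetInfo("voter list", ("name", "address", "birthdate"), Availability.SEMI_PUBLIC, 500.0),
    DatasetInfo("property records", ("owner", "address"), Availability.PUBLIC, 0.0),
    DatasetInfo("census microdata", ("age", "income"), Availability.SEMI_PRIVATE, 0.0),
]
print([d.name for d in accessible(catalog, Availability.SEMI_PUBLIC, 100.0)])
# prints ['property records']
```

Raising the budget to 1000 would add the voter list, which illustrates the "barrier to entry" that separates Semi-Public from Public data.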


Overview

◼ Information Generation

◼ Models of Availability

◼ Some Resources

◼ A Look at Voter Registration

◼ Curse of the SSN


https://data.census.gov/cedsci/


The Rise of Twitbookin

◼ …Do we really need to talk about it?

Facebook, Twitter, LinkedIn


Intelius.com


Property Assessments

◼ Tennessee

http://www.assessment.cot.tn.gov/RE_Assessment/

◼ Davidson County

http://www.padctn.org/real-property-search/

Search by {Owner, Parcel, Street Address}

◼ Imagine combining with Google Maps (http://maps.google.com) or Zillow (http://www.zillow.com)

Reverse Lookups

◼ Phone

http://www.anywho.com/reverse-lookup

https://www.ussearch.com/reverse-phone-lookup

◼ DNS

http://remote.12dt.com/

http://psacake.com/web/eg.asp

http://www.dnsstuff.com/


Collections on Everything

◼ Bankruptcy

◼ Birth

◼ Criminal

◼ Death

◼ Divorce

◼ DNS

◼ Employment

◼ Financial (e.g., donations)

◼ Marriage

◼ Military

◼ Residential

◼ Social Security

◼ Phone

◼ Voting

◼ …


Brokers are Real


Brokers are Big Business

Take a look at IQVIA


Birthdays

◼ http://www.birthdatabase.com/

◼ Search by {First Name, Last Name, Expected Age}

◼ Where does this information come from?

◼ Why is this available?

◼ Imagine combining with Facebook’s place of birth to reveal DOB

Combining Databases?

◼ How do you integrate these databases?

Do you trust names?

Do you trust phone numbers?

How much information would you need until you’re confident of a match?

(We’ll return to this next lecture)
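One way to make the "how much information" question concrete is to count how many records share each combination of attribute values: a combination held by exactly one record pins down a match. The sketch below uses made-up records and a hypothetical helper name; it illustrates the idea and is not a method taken from the lecture.

```python
from collections import Counter

def unique_fraction(records, attributes):
    """Fraction of records whose values on `attributes` are shared by no other record."""
    def key(record):
        return tuple(record[a] for a in attributes)
    counts = Counter(key(r) for r in records)
    return sum(1 for r in records if counts[key(r)] == 1) / len(records)

# Made-up records, for illustration only.
people = [
    {"name": "J. Smith", "zip": "37203", "birth_year": 1980},
    {"name": "J. Smith", "zip": "37203", "birth_year": 1975},
    {"name": "J. Smith", "zip": "37212", "birth_year": 1980},
    {"name": "A. Jones", "zip": "37203", "birth_year": 1980},
]
for attrs in (["name"], ["name", "zip"], ["name", "zip", "birth_year"]):
    print(attrs, unique_fraction(people, attrs))
```

In this toy table the name alone isolates one record in four, name plus ZIP isolates half, and adding birth year isolates everyone.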


Registries

◼ National Sex Offender Registry

https://www.nsopw.gov/

Tennessee

◼ http://sor.tbi.tn.gov/SOMainpg.aspx

# Field

1 Photo

2 Date of Birth

3 Race

4 Sex

5 Home Address

6 County of Residence

7 Last Date Information Updated

8 Last Registration / Report Date

9 Status

10 Classification

11 TID

12 Supervision Site

13 Offenses

14 Aliases


Drug Offender Registries

◼ TN: https://apps.tn.gov/methor/

◼ Searchable by County or Name + First Initial

Why?

# Field

1 Last Name

2 First Name

3 Type of Name

4 Date of Birth

5 County

6 Offense(s)

7 Date of Conviction


Overview

◼ Information Generation

◼ Models of Availability

◼ Some Resources

◼ A Look at Voter Registration

◼ Curse of the SSN


Remember the Voter Database?

◼ Public Information Sharing

◼ Example: Washington

If you are a voter, your name, address, political jurisdiction, gender, date of birth, voting record, date of registration, and registration number are public information under state law. (RCW 29A.08.710)

◼ This is public record by law and does not violate security or privacy policy, however

Behind Closed Doors

◼ More than public information for registration of voters

◼ Goal: Enable sharing of information with other states securely and accurately in fulfillment of HAVA (the Help America Vote Act)

◼ Example: Pennsylvania

“Statewide Uniform Registry of Electors”: county election officials have direct access to the centralized statewide database

The state uses “identifying number, name, and date of birth” for linking to motor vehicle and/or Social Security records

◼ Use a hybrid match: the number and first two characters of the last name must match exactly, with discretion left to the county commission to determine if the rest of the record is a match

◼ Currently uses the AAMVA (American Association of Motor Vehicle Administrators) criteria to match information with SSN digits: exact match of the last four digits of Social Security Number, first name, last name, month of birth, and year of birth.
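The two matching rules described above can be sketched as simple predicates. The record field names are hypothetical, and the case-insensitive comparison of the last-name prefix is an assumption; the slides specify only which fields must agree exactly.

```python
def hybrid_match(a, b):
    """Exact match on identifying number and first two characters of last name.

    A True result only nominates a candidate pair; per the slide, the decision
    on the rest of the record is left to county officials.
    """
    return (a["id_number"] == b["id_number"]
            and a["last_name"][:2].upper() == b["last_name"][:2].upper())

def aamva_match(a, b):
    """Exact match on SSN last four, first name, last name, birth month and year."""
    fields = ("ssn_last4", "first_name", "last_name", "birth_month", "birth_year")
    return all(a[f] == b[f] for f in fields)

# Made-up records: same person, last name spelled differently in the two files.
voter = {"id_number": "12345678", "ssn_last4": "0000", "first_name": "Ann",
         "last_name": "Smith", "birth_month": 5, "birth_year": 1970}
dmv = dict(voter, last_name="Smyth")

print(hybrid_match(voter, dmv))  # prints True
print(aamva_match(voter, dmv))   # prints False
```

The example shows the trade-off: the hybrid rule tolerates a spelling variant but needs human review, while the all-exact rule rejects the pair outright.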


Washington


https://dl.ncsbe.gov/index.html?prefix=data/

North Carolina


Georgia

http://sos.ga.gov/index.php/elections/order_voter_registration_lists_and_files

Privacy Violations By Inference

Georgia Voting 2007 History

Name       Date         Number of Voters
Catoosa    9/18/2007    1
Cobb       9/18/2007    1
Clayton    9/18/2007    1
Lee        11/06/2007   1
Gwinnett   9/18/2007    1
Dekalb     9/18/2007    3
Chattaho   11/06/2007   3
Sumter     11/06/2007   3
Seminole   11/06/2007   4
Charlton   11/06/2007   5
Dodge      9/18/2007    6
Mitchell   3/20/2007    7
Dawson     9/18/2007    7

Registration number: 76686

When it is revealed how Catoosa county voted in this election (aggregate results), then we uniquely link this voter to their vote.
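The inference risk comes from groups that are too small: when a county and election date have a single voter, the published aggregate result is that voter's ballot. A minimal sketch of how such groups could be flagged, using a few rows from the table; the threshold parameter `k` is an illustrative addition and does not come from the slides.

```python
# (county, election date, number of voters) rows copied from the table.
turnout = [
    ("Catoosa", "9/18/2007", 1),
    ("Cobb", "9/18/2007", 1),
    ("Dekalb", "9/18/2007", 3),
    ("Dodge", "9/18/2007", 6),
    ("Mitchell", "3/20/2007", 7),
]

def at_risk(rows, k=1):
    """Return the (county, date) groups with at most k voters."""
    return [(county, date) for county, date, n in rows if n <= k]

print(at_risk(turnout))       # groups with a single voter
print(at_risk(turnout, k=5))  # a looser, illustrative threshold
```

Groups slightly larger than one are not safe either: if all three voters in a county voted the same way, the aggregate still reveals each of their votes.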

Policy & Usage Restrictions

RCW 29A.08.740

Violations of restricted use of registered voter data - Penalties - Liabilities. (Effective January 1, 2006.)

(1) Any person who uses registered voter data furnished under RCW 29A.08.720 for the purpose of mailing or delivering any advertisement or offer for any property, establishment, organization, product, or service or for the purpose of mailing or delivering any solicitation for money, services, or anything of value is guilty of a class C felony punishable by imprisonment in a state correctional facility for a period of not more than five years or a fine of not more than ten thousand dollars or both such fine and imprisonment, and is liable to each person provided such advertisement or solicitation, without the person's consent, for the nuisance value of such person having to dispose of it, which value is herein established at five dollars for each item mailed or delivered to the person's residence. …

(2) Each person furnished data under RCW 29A.08.720 shall take reasonable precautions designed to assure that the data is not used for the purpose of mailing or delivering any advertisement or offer for any property, establishment, organization, product, or service or for the purpose of mailing or delivering any solicitation for money, services, or anything of value. However, the data may be used for any political purpose. Where failure to exercise due care in carrying out this responsibility results in the data being used for such purposes, then such person is jointly and severally liable for damages under subsection (1) of this section along with any other person liable under subsection (1) of this section for the misuse of such data.

[2005 c 246 § 19. Prior: 2003 c 111 § 249; 2003 c 53 § 176; 1999 c 298 § 2; 1992 c 7 § 32; 1974 ex.s. c 127 § 3; 1973 1st ex.s. c 111 § 4. Formerly RCW 29.04.120.]


Overview

◼ Information Generation

◼ Models of Availability

◼ Some Resources

◼ A Look at Voter Registration

◼ Curse of the SSN


SSNs – Who Cares?

◼ The Social Security Number is one of, if not the, most overloaded numbers in the United States

◼ It binds records on finances, insurance, education, death, taxes…

◼ Two HUGE problems: Fraud & Identity Theft

https://www.ftc.gov/system/files/documents/reports/consumer-sentinel-network-data-book-january-december-2016/csn_cy-2016_data_book.pdf

Year   Total # of Complaints   % of Complaints Reporting Amount Paid   Reported Amount Paid   Avg. Amount Paid   Median Amount Paid
2003   327,479                 78%                                     $459M                  $1.8k              $222
2004   406,193                 76%                                     $567M                  $1.8k              $267
2005   431,118                 66%                                     $682M                  $2.4k              $350
2016   3,000,000               51%                                     $744M                  $1.1k              $450

◼ 63% Fraud: Internet Auction (12%), Foreign Money Offer (8%)

◼ 37% ID theft: Credit card (26%), Phone / Utilities (18%), Employment (12%), Government documents / benefits (9%), Loan (5%)


A Brief Demonstration


Location and Age Matter (2005)

https://www.ftc.gov/system/files/documents/reports/consumer-sentinel-network-data-book-january-december-2016/csn_cy-2016_data_book.pdf

◼ Highest per capita rate of identity theft (metropolitan areas)

Rank   Metropolitan Area                        Complaints Per 100,000 People
1      Phoenix-Mesa-Scottsdale, AZ              178.3
2      Las Vegas-Paradise, NV                   158.5
3      Riverside-San Bernardino-Ontario, CA     145.7
43     Nashville-Davidson--Murfreesboro, TN     63.6

◼ Rate of victimization by age range

Location and Age Matter (2016)

https://www.ftc.gov/system/files/documents/reports/consumer-sentinel-network-data-book-january-december-2016/csn_cy-2016_data_book.pdf

◼ Highest per capita rate of identity theft (metropolitan areas)

◼ Rate of victimization by age range

SSNs

◼ Federal paternalism for “social insurance”

◼ Benefits based on payroll tax contributions → Federal Old-Age Benefits

◼ Issuance began in late 1936

◼ Issued by the Social Security Administration

Permanent residents

Temporary / working residents

http://www.ssa.gov/history

◼ First number to John David Sweeney Jr. (Baltimore, MD)

◼ Lowest number issued: 001-01-0001

[Photo caption: John David Sweeney]

SSN Policy Chronology

◼ 1935: Social Security Act creates “social insurance” program

◼ 1943: Executive Order – All federal agencies use SSN when identification needed

◼ 1950–1971: “Adult category” for state run Supplemental Security Income

◼ 1961:

Civil Service adopts SSNs as Federal employee identifier

IRS requires taxpayers to use SSNs for tax reporting

◼ 1964: Treasury Dept. asks H bond purchasers for SSN

◼ 1966: VA adopts SSN as patient identifier

(1967 – Weed begins work on first EMR)

◼ 1969: DOD adopts SSN as Armed Forces personnel ID

◼ 1970: Bank Records & Foreign Transactions Act: all banks, savings & loan, credit unions & securities brokers/dealers → obtain SSNs of all customers

◼ 1971: SSA Task Force warns against overuse

◼ 1972: SSA Amendment – all legal aliens get SSN

Look at http://www.ssa.gov/history/ssn/ssnchron.html


SSN Policy Chronology

◼ 1974: Privacy Act – State & local gov’t cannot withhold benefit due to failure of SSN presentation

◼ 1975: Social Services Amendment of ’74: Parent Locator Service can collect SSN and whereabouts from SSA records

◼ 1976: Tax Reform Act of 1976: States can use SSN for tax, general public assistance, driver's license or motor vehicle registration

◼ 1981: Omnibus Budget Reconciliation Act: SSN of each adult member in household of child applying to school lunch program

◼ 1982: Debt Collection Act: Federal loan program requires SSN in application

◼ 1987: SSNs for infants

◼ 1998: Identity Theft & Assumption Deterrence Act: "means of identification" includes SSN

◼ 2005: Real ID Act: States must confirm SSN (with SSA) for driver's license or identity card issuance

Medicare

◼ http://my.medicare.gov/

◼ Medicare Identification Number (MIN) is usually SSN + an added letter

◼ Ex: 000-00-0000A

A = wage earner (primary)

If spouse becomes eligible for Medicare benefits through primary, they are assigned a B

Many valid suffixes

◼ MIN may be different than the SSN
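Because the MIN is usually a nine-digit number plus a suffix, separating the two parts is trivial. The sketch below shows this under an assumed format: nine digits with optional hyphens, followed by one letter and an optional second letter or digit. That pattern is a guess at what "many valid suffixes" covers, not an official specification.

```python
import re

# Assumed pattern: 9 digits (hyphens optional) followed by a suffix of
# one letter and an optional further letter or digit.
MIN_PATTERN = re.compile(r"^(\d{3})-?(\d{2})-?(\d{4})([A-Z][A-Z0-9]?)$")

def split_min(min_id):
    """Split a Medicare Identification Number into (9-digit number, suffix)."""
    m = MIN_PATTERN.match(min_id.strip().upper())
    if m is None:
        raise ValueError(f"not in the assumed MIN format: {min_id!r}")
    return "".join(m.group(1, 2, 3)), m.group(4)

print(split_min("000-00-0000A"))  # prints ('000000000', 'A')
```

Since a spouse can be enrolled through the primary wage earner's number, the numeric part should not be assumed to be the card holder's own SSN.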


Ferree Snafu

◼ 1938: Wallet manufacturer, E.H. Ferree, promoted new wallet

◼ Sample card was a real card

Hilda Schrader Whitcher, secretary of the company vice president

◼ Wallet sold by Woolworth department stores across the USA

◼ 1943: 5,755 people using the number

◼ SSA voided the number; issued Hilda new card

◼ > 40,000 people reported the Whitcher number as their own

◼ 1976: 40 people found using the number

◼ 1977: 12 people found using the number

◼ It’s known as “the Social Security Number issued by Woolworth”

◼ Many other cases

1940: The 219-09-9999 vs. “Provo, Utah” Case


SSN Assignment

◼ SSNs are almost a one-time shot

◼ You can get a new SSN only in extremely rare circumstances

◼ You must prove

Someone has stolen your number

Someone is using it illegally

The misuse is causing you serious harm

http://www.socialsecurity.gov/ssnumber/ss5doc.htm


Modern Times: Restricted Use

◼ Some state laws restrict SSN use, display, and transfer (e.g., CA in 2001)

◼ Michigan prohibits use of more than 4 consecutive digits of an SSN

Is that sufficient protection?

The SSN

XXX-YY-ZZZZ

XXX: Area (AN)

YY: Group (GN)

ZZZZ: Serial (SN)
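The three-part structure can be made explicit with a small parser. This is a sketch that assumes the hyphenated form with a four-digit serial number; the function name is illustrative.

```python
import re

SSN_PATTERN = re.compile(r"^(\d{3})-(\d{2})-(\d{4})$")

def split_ssn(ssn):
    """Split 'XXX-YY-ZZZZ' into (area, group, serial) strings."""
    m = SSN_PATTERN.match(ssn)
    if m is None:
        raise ValueError(f"expected XXX-YY-ZZZZ, got {ssn!r}")
    return m.groups()

area, group, serial = split_ssn("001-01-0001")
print(area, group, serial)  # prints "001 01 0001"
```

The parts are kept as strings so that leading zeros survive.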


Area Numbers: XXX

◼ Prior to 1972: represented the state from which a person applied for their social security card

◼ After 1972: based on zip code in the mailing address provided on the application form

Area Numbers: XXX

#                            STATE
001-003                      NH
004-007                      ME
008-009                      VT
010-034                      MA
035-039                      RI
040-049                      CT
050-134                      NY
135-158                      NJ
159-211                      PA
212-220                      MD
221-222                      DE
223-231, 691-699             VA
232-236                      WV
232, 237-246, 681-690        NC
247-251, 654-658             SC
252-260, 667-675             GA
261-267, 589-595, 766-772    FL
268-302                      OH
303-317                      IN
318-361                      IL
362-386                      MI
387-399                      WI
400-407                      KY
408-415, 756-763             TN
416-424                      AL
425-428, 587-588, 752-755    MS
429-432, 676-679             AR
433-439, 659-665             LA
440-448                      OK
449-467, 627-645             TX
468-477                      MN
478-485                      IA
486-500                      MO
501-502                      ND
503-504                      SD
505-508                      NE
509-515                      KS
516-517                      MT
518-519                      ID
520                          WY
521-524, 650-653             CO

Area Numbers: XXX

#                            STATE
525, 585, 648-649            NM
526-527, 600-601, 764-765    AZ
528-529, 646-647             UT
530, 680                     NV
531-539                      WA
540-544                      OR
545-573, 602-626             CA
574                          AK
575-576, 750-751             HI
577-579                      DC
580                          Virgin Islands
580-584, 596-599             Puerto Rico
586                          Guam
586                          American Samoa
586                          Philippine Islands
700-728                      Railroad Board**
729-733                      Enumeration at Entry

◼ **Discontinued 7/1/63

◼ 000 will NEVER be a valid XXX number
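Because area numbers were allocated in published ranges, the first three digits of an older SSN hint at where it was issued. A minimal lookup sketch follows, using only a handful of ranges from the tables; the function name is illustrative, and returning a list reflects that some area numbers (232, 580) appear under more than one label.

```python
# A few (low, high, label) ranges copied from the tables; the full
# allocation has many more rows.
AREA_RANGES = [
    (1, 3, "NH"),
    (50, 134, "NY"),
    (232, 232, "NC"),
    (232, 236, "WV"),
    (408, 415, "TN"),
    (756, 763, "TN"),
    (545, 573, "CA"),
    (602, 626, "CA"),
    (580, 580, "Virgin Islands"),
    (580, 584, "Puerto Rico"),
]

def area_to_states(area):
    """Return every label whose range contains the area number."""
    n = int(area)
    if n == 0:
        raise ValueError("000 is never a valid area number")
    return sorted({label for low, high, label in AREA_RANGES if low <= n <= high})

print(area_to_states("410"))  # prints ['TN']
print(area_to_states("232"))  # prints ['NC', 'WV']
```

An empty list here only means the area number falls outside the sample ranges included in the sketch.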


Group Numbers: YY

◼ Range from 01-99

◼ They’re not allocated consecutively!

Order   Type   Range
1st     Odd    01 through 09
2nd     Even   10 through 98
3rd     Even   02 through 08
4th     Odd    11 through 99

◼ Highest group issued as of 1/2/08

http://www.ssa.gov/employer/highgroup.txt

◼ Can also trace the allocation of group numbers over time:

http://www.ssa.gov/employer/ssnvhighgroup.htm
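The four-step allocation order can be written out directly, which makes it easy to compare two group numbers by issuance order rather than by numeric value. The function names below are illustrative.

```python
def group_issue_order():
    """Group numbers in allocation order: odd 01-09, even 10-98, even 02-08, odd 11-99."""
    return (list(range(1, 10, 2))
            + list(range(10, 99, 2))
            + list(range(2, 9, 2))
            + list(range(11, 100, 2)))

ORDER = group_issue_order()

def issued_before(group_a, group_b):
    """True if group_a comes earlier than group_b in the allocation order."""
    return ORDER.index(int(group_a)) < ORDER.index(int(group_b))

print(ORDER[:7])                  # prints [1, 3, 5, 7, 9, 10, 12]
print(issued_before("98", "02"))  # prints True
```

Combined with the published highest-group list for an area, a group number's position in this order suggests roughly when an SSN was issued.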

Serial Numbers: ZZZZ

Last 4 digits

SNs have been issued “in monotonically increasing order” within each State and within each GN
▪ From 0001 to 9999

However, SSA also writes:
“SSNs are assigned randomly by computer within the confines of the area numbers allocated to a particular state based on data keyed to the Modernized Enumeration System” (50, RM00201.060)
…reflecting SSA’s belief that idiosyncrasies in the SSN application and vetting processes make the SN assignment effectively random

From A. Acquisti
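Putting the three components together, the following Python sketch splits an SSN string into area (XXX), group (YY), and serial (ZZZZ) and applies only the range rules stated on these slides. `SSNParts` and `split_ssn` are hypothetical names, and this is not a full validity check (for example, it does not consult the high-group list).

```python
import re
from typing import NamedTuple, Optional


class SSNParts(NamedTuple):
    area: int    # XXX
    group: int   # YY
    serial: int  # ZZZZ


_SSN_RE = re.compile(r"^(\d{3})-?(\d{2})-?(\d{4})$")


def split_ssn(text: str) -> Optional[SSNParts]:
    """Split 'XXX-YY-ZZZZ' into its parts; return None if a slide rule is violated."""
    match = _SSN_RE.match(text.strip())
    if match is None:
        return None
    area, group, serial = (int(g) for g in match.groups())
    # Slide rules: area 000 is never valid, groups run 01-99, serials run 0001-9999.
    if area == 0 or not 1 <= group <= 99 or not 1 <= serial <= 9999:
        return None
    return SSNParts(area, group, serial)


print(split_ssn("022-10-3459"))
```

The sample value is the fictional JOHN SMITH record used later in the lecture, not a real person's number.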

Abuse Registry

https://apps.health.tn.gov/abuseregistry/


Tennessee

◼ Committing Fraud

1. Birth records → Mother’s Maiden Name

https://sos.tn.gov/products/tsla/birth-records

2. Birthday → Remember those public records databases?

3. Social Security Number

Ah…


Inside Knowledge?

◼ What about insiders’ information?

◼ Could you steal someone’s SSN?

How would you achieve this feat?

Do you think you would be caught?

Please, please, do not make an attempt.

◼ What if it was in plain sight?


http://www.mc.vanderbilt.edu/root/vumc.php?site=vanderbiltnursing&doc=9352


http://sitemason.vanderbilt.edu/newspub/crmQtG?id=21688


http://osfp.mc.vanderbilt.edu/Policies/Construction%20Identification%20Badge%20and%20Orientation%20Policy%20(12.28.06).pdf

Online Validation

◼ http://www.ssnvalidator.com

Chances of correctly matching SSN digits by random guess, under status quo knowledge

                                         Alaska, 1998                        New York, 1998
                                         First 5 digits  All 9 digits        First 5 digits  All 9 digits
                                         (1 guess)       (< 1,000 guesses)   (1 guess)       (< 1,000 guesses)
No auxiliary knowledge                   0.0014%         0.00014%            0.0014%         0.00014%
Knowledge of state of SSN application    1%              0.1%                0.012%          0.0012%

Adapted from A. Acquisti
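The "knowledge of state" row can be reproduced with simple arithmetic. The Python sketch below assumes guesses are uniform over every area number listed for the state, groups 01-99, and serials 0001-9999. That uniformity is an assumption of the sketch, and the area counts come from the area-number table (574 for AK, 050-134 for NY). The "no auxiliary knowledge" row is not reproduced because the slides do not state how many area numbers were in use nationally.

```python
GROUPS = 99     # group numbers 01-99
SERIALS = 9999  # serial numbers 0001-9999


def chance_first5(num_areas, guesses=1):
    """Chance of hitting the area+group (first 5 digits) with uniform guesses."""
    return min(1.0, guesses / (num_areas * GROUPS))


def chance_full(num_areas, guesses=1000):
    """Chance of hitting all 9 digits with uniform guesses."""
    return min(1.0, guesses / (num_areas * GROUPS * SERIALS))


ak_areas = 1             # Alaska: 574 only
ny_areas = 134 - 50 + 1  # New York: 050-134

for name, areas in (("Alaska", ak_areas), ("New York", ny_areas)):
    print(name, f"{chance_first5(areas):.4%}", f"{chance_full(areas):.5%}")
```

Rounded, these come out to about 1% and 0.1% for Alaska and about 0.012% and 0.0012% for New York, matching the slide.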

Or SSDI

◼ You could wait until someone dies…

◼ Social Security Death Index (SSDI) Database

http://search.ancestry.com/search/db.aspx?dbid=3693

◼ Death reported to the Social Security Administration
Possibly submitted by “relation” requesting Social Security benefits (http://www.ssa.gov/pubs/10084.html) or to stop benefits


Rise of the SSN

◼ Over 80 million records
◼ Data back to 1937, but the majority is after 1962

[Figure: Population Size by Year, 1935 to 2005; series: SSDI and US Pop / 100]

Reasons to believe that the assignment lacks sufficient randomness

◼ In the last 30 years, SSN issuance has become more regular
Increasing computerization of the public administration, including SSA and its various field offices
After 1972, SSN assignment centralized from Baltimore, MD
After 1989, Enumeration at Birth Process (EAB)
◼ Prior to 1989, only a small percentage of people received an SSN when they were born
◼ Currently at least 90 percent of all newborns receive an SSN via EAB together with their birth certificate

Adapted from A. Acquisti


Hence, two hypotheses

1. Expect SSN issuance patterns to have become more regular over the years, i.e., increasingly correlated with an individual’s birthday and birthplace
This should be detected through analysis of available SSN data
2. Expect these patterns to have become so regular that it is possible to infer unknown SSNs based on the patterns detected on available SSNs
This should be verified by contrasting estimated SSNs against known SSNs

Adapted from A. Acquisti

Compared to previous knowledge

◼ The SSN assignment scheme follows geographical and chronological patterns - this is well known
◼ Focused on the inverse, harder, and much more consequential inference: exploiting the presumptive day and location of SSN application to predict unknown SSNs
Discovered that the interpretation of the assignment scheme held outside SSA was wrong, and SSA’s assumption of randomness was wrong

Adapted from A. Acquisti

Predictions Based on Public Data

The Social Security Administration’s Death Master File (DMF) is a publicly available database of the SSNs of individuals who are deceased
▪ More recent and up-to-date than the SSDI
▪ One purpose of making this data available is to combat fraud!
▪ But it can be analyzed to find SSN issuance patterns

Used DMF to find patterns in the issuance of SSNs by date of birth and State of SSN issuance for deceased individuals
▪ Sorted records by reported DOB and grouped them by reported State of issuance

Adapted from A. Acquisti
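The sort-and-group step described above might look like the following Python sketch. The records are invented placeholders, not DMF data, and the tuple layout `(state, dob, ssn)` is an assumption made for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (state of issuance, date of birth, SSN). All values are made up.
records = [
    ("MT", date(1987, 7, 14), "516-03-0042"),
    ("AK", date(1987, 7, 2), "574-21-0100"),
    ("MT", date(1987, 7, 1), "516-01-9000"),
    ("MT", date(1987, 7, 9), "517-01-9500"),
]


def group_by_state_sorted_by_dob(rows):
    """Group records by state of issuance, then sort each group by date of birth."""
    grouped = defaultdict(list)
    for state, dob, ssn in rows:
        grouped[state].append((dob, ssn))
    for state_rows in grouped.values():
        state_rows.sort(key=lambda row: row[0])
    return dict(grouped)


for state, rows in group_by_state_sorted_by_dob(records).items():
    print(state, [ssn for _, ssn in rows])
```

Reading each state's SSNs in birth-date order is what exposes any regularity in how area, group, and serial numbers advance over time.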

A DMF record (example)

Name         Birth         Death      Last Residence                   SSN           Issued
JOHN SMITH   21 Jun 1904   Oct 1979   33540 (Zephyrhills, Pasco, FL)   022-10-3459   Massachusetts

Adapted from A. Acquisti

SSN assignment patterns: Two representative States

SSN issuance sequence (MT)

Expected       Observed
516 01 0001    516 01 ????
516 01 0002    517 01 ????
516 01 9999    516 01 ????
517 01 0001    517 01 ????
517 01 0002    516 03 ????
517 01 9999    517 03 ????
516 03 0001

Adapted from A. Acquisti


SSN predictions

1. TEST 1: Used > 500,000 DMF records to detect patterns in SSN issuance based on birthplace and state of issuance, and used those patterns to predict (and verify) individual SSNs in the DMF
2. TEST 2: Mined data from an online social network to retrieve individuals’ self-reported birthdays and birthplaces, and estimated their SSNs by interpolating that data with DMF patterns.
   1. Verified the estimates against official Enrollment data under a protected (and IRB-approved) protocol

Adapted from A. Acquisti

Prediction Approach

◼ Area number

Mode AN in target’s state around target’s birthday

◼ Group number

Mode GN in target’s state around target’s birthday

◼ Serial number

Based on regression coefficients, inserting target’s birthday as dd

Adapted from A. Acquisti
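As a rough illustration of the three steps above, the Python sketch below takes the modal AN and GN among same-state records near the target birthday, then fits a straight line of serial number against day offset and reads it off at the target day. The 30-day window, the ordinary least-squares fit, and all names and data are assumptions of this sketch; the slides do not specify the actual regression model.

```python
from collections import Counter
from datetime import date, timedelta


def predict_ssn_parts(records, target_dob, window_days=30):
    """Estimate (area, group, serial) from same-state (dob, area, group, serial) records."""
    nearby = [r for r in records if abs((r[0] - target_dob).days) <= window_days]
    if not nearby:
        return None
    # Modal area number and modal group number around the target's birthday.
    area = Counter(r[1] for r in nearby).most_common(1)[0][0]
    group = Counter(r[2] for r in nearby).most_common(1)[0][0]

    # Regress serial on day offset, using only records in the modal area and group.
    points = [((r[0] - target_dob).days, r[3])
              for r in nearby if r[1] == area and r[2] == group]
    if not points:
        return None
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        serial = mean_y  # all records share one birthday: fall back to the mean
    else:
        slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sxx
        serial = mean_y - slope * mean_x  # fitted value at offset 0, the target day
    return area, group, min(9999, max(1, round(serial)))


# Synthetic records: serials rise by 10 per day from 1 July 1987 (made-up data).
start = date(1987, 7, 1)
sample = [(start + timedelta(days=d), 516, 1, 100 + 10 * d) for d in range(0, 21, 2)]
sample.append((start + timedelta(days=3), 517, 1, 50))
print(predict_ssn_parts(sample, date(1987, 7, 11)))
```

A point estimate like this would normally be widened into a ranked list of nearby serials, which is what the multi-attempt success metrics measure.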

Success metrics

◼ Accuracy in prediction of the first 5 digits of an individual’s SSN with 1, 10, 100, and 1000 attempts
Note: 1,000 attempts is equivalent to a 3-digit PIN, which is very insecure and vulnerable to brute-force attacks

Adapted from A. Acquisti
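One way to compute "accuracy within k attempts" is sketched below in Python: for each target, check whether the true value appears among the first k ranked guesses. The function name and the toy 5-digit strings are hypothetical and only illustrate the metric.

```python
def hit_rate_within(ranked_guesses, truths, k):
    """Fraction of targets whose true value is among the first k ranked guesses."""
    hits = sum(1 for guesses, truth in zip(ranked_guesses, truths) if truth in guesses[:k])
    return hits / len(truths)


# Toy example: ranked guesses for the first 5 digits of three made-up targets.
ranked = [
    ["51601", "51701", "51603"],
    ["57401", "57403"],
    ["05001", "05003"],
]
truth = ["51701", "57405", "05001"]

for k in (1, 2, 3):
    print(k, hit_rate_within(ranked, truth, k))
```

The 3-digit PIN comparison follows from the size of the search space: a 3-digit PIN has 10^3 = 1,000 possible values, so 1,000 attempts exhaust it.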

AN-GN predictability (first 5 digits)

[Figure: axis labels 1973 to 2003 and CA to ME; annotation: EAB starts here (1989)]

Adapted from A. Acquisti

Full SSN predictability with <1,000 attempts

Adapted from A. Acquisti

Test 1: Overall results for DMF records

▪ With a single attempt (first five digits only):
▪ 7% (1973-1988), 44% (1989-2003)
▪ With 10 attempts (complete 9-digit SSNs):
▪ 0.01% (1973-1988), 0.1% (1989-2003)
▪ With 1,000 attempts (complete 9-digit SSNs):
▪ 0.8% (1973-1988), 8.5% (1989-2003)
▪ These are weighted averages – for smaller states and recent years, prediction rates are higher
▪ (e.g., 1 out of 20 SSNs in DE, 1996, are identifiable with 10 or fewer attempts)

Adapted from A. Acquisti


Chances of correctly matching SSN digits by random guess, under our algorithm

                                         Alaska, 1998                        New York, 1998
                                         First 5 digits  All 9 digits        First 5 digits  All 9 digits
                                         (1 guess)       (< 1,000 guesses)   (1 guess)       (< 1,000 guesses)
No auxiliary knowledge                   0.0014%         0.00014%            0.0014%         0.00014%
Knowledge of state of SSN application    1%              0.1%                0.012%          0.0012%
Predictions based on the algorithm       94%             58%                 30%             3%

Adapted from A. Acquisti

Test 2: From social networks data to SSNs

▪ Used birthday data of 621 living individuals to predict their SSN, based on interpolation with DMF data
▪ Sample: born in 1986-1990 (i.e., mostly before EAB)
▪ In most populous states (i.e., worst case scenario)
▪ Birthday and birthplace data can be obtained from several sources, but most easily, and in mass amounts, from online social networks

Adapted from A. Acquisti

The approach, revisited

Name         Birth          Death      Last Residence   SSN           Issued
JOHN DOE     28 July 1987   Nov 2001   94720            022-12-6744   NJ
JOHN FBOOK   14 July 1987   ?          ?                ?             NJ
JOHN SMITH   1 July 1987    Oct 2005   33540            022-10-4592   NJ

Adapted from A. Acquisti

Facebook estimations

◼ Test 2 results confirmed Test 1 predictions

Overall AN prediction accuracy: 8.5%

Overall GN prediction accuracy: 29.1%

Combined AN-GN prediction accuracy: 6.3%

◼ Compare to corresponding weighted sample in Test 1 (based on DMF data): 11.21%

Adapted from A. Acquisti

Results and extrapolations

◼ Confirms that interpolating SSN data for deceased individuals with birthday data for living individuals can lead to the prediction of the latter’s SSNs

◼ Extrapolating to living US population, that would imply the identification of around 40 million SSNs’ first 5 digits and almost 8 million individuals’ complete SSNs

Assumes knowledge of birth data

Adapted from A. Acquisti
