26 th annual mis conference february 13, 2013 washington, dc
DESCRIPTION
Disclosure Avoidance and the U.S. Department of Education’s School-Level Assessment Data Release. 26 th Annual MIS Conference February 13, 2013 Washington, DC. Michael Hawes Statistical Privacy Advisor U.S. Department of Education. Overview. Details of the Data Release - PowerPoint PPT PresentationTRANSCRIPT
Disclosure Avoidance and the U.S. Department of Education’s
School-Level Assessment Data Release
26th Annual MIS ConferenceFebruary 13, 2013Washington, DC
Michael HawesStatistical Privacy Advisor
U.S. Department of Education
2
Overview
Details of the Data Release
FERPA and the Need for Disclosure Avoidance
Disclosure Avoidance Techniques
Challenges
Strategy and Objectives
Process and Decisions
Selected Methodology
Evaluation
Questions
3
Details of the Data Release
School-level Math and Reading Proficiency (# of Valid Tests, % Proficient)– By Grade– By Sub-group
• Race/Ethnicity• Gender• Children with Disabilities• Economically Disadvantaged• Limited English Proficiency• Homeless• Migrant
– School Years 2008-2009, 2009-2010, 2010-2011
Available via https://explore.data.gov
4
Data Release
Public Data – Contains no PII
5
FERPA and the Need for Disclosure Avoidance FERPA protects personally identifiable information (PII)
from education records from unauthorized disclosure Requirement for written consent before sharing PII Exceptions from the consent requirement for:
– “Studies” – “Audits and Evaluations”– Health and Safety emergencies– And other purposes as specified in §99.31
Personally Identifiable Information (PII)
Name Name of parents or other family members Address Personal identifier (e.g., SSN, Student ID#) Other indirect identifiers (e.g., date or place of birth) “Other information that, alone or in combination, is linked or
linkable to a specific student that would allow a reasonable person in the school community, who does not have personal knowledge of the relevant circumstances, to identify the student with reasonable certainty.” (34 CFR § 99.3)
6
Reporting vs. Privacy
Department of Education regulations require reporting on a number of issues, often broken down across numerous sub-groups, including:
• Gender• Race/Ethnicity• Disability Status• Limited English Proficiency• Migrant Status• Economically Disadvantaged Students
BUT, slicing the data this many ways increases the risks of disclosure, and the regulations also require states to “implement appropriate strategies to protect the privacy of individual students…” (§200.7)
7
Disclosure Avoidance Techniques
Publishing even aggregate data may include PII – You must use disclosure avoidance!!!
Three Basic Flavors:– Suppression– Blurring– Perturbation
8
Suppression
Definition: Removing data to prevent the identification of individuals in small cells or with unique characteristics
Examples: Cell Suppression Row Suppression Sampling
Effect on Data Utility: Results in very little data being produced for small populations
Requires suppression of additional, non-sensitive data (e.g., complementary suppression)
Residual Risk of Disclosure:
Suppression can be difficult to perform correctly (especially for large multi-dimensional tables)
If additional data are available elsewhere, the suppressed data may be re-calculated.
9
Blurring
Definition: Reducing the precision of data that is presented to reduce the certainty of identification
Examples: Aggregation Percents Ranges Top/Bottom-Coding Rounding
Effect on Data Utility: Users cannot make inferences about small changes in the data
Reduces the ability to perform time-series or cross-case analysis
Residual Risk of Disclosure:
Generally low risk, but if row/column totals are published (or available elsewhere), then it may be possible to calculate the actual values of sensitive cells
10
Perturbation
Definition: Making small changes to the data to prevent identification of individuals from unique or rare characteristics
Examples: Data Swapping Noise Synthetic Data
Effect on Data Utility: Can minimize loss of utility compared to other methods
Issues of transparency and credibility of data
Residual Risk of Disclosure:
If someone has access to some (e.g., a single state’s) original data, they may be able to reverse-engineer the perturbation rules used to alter the rest of the data
11
12
Challenges
“Reasonable Person” Standard
Small Subgroups Challenge
State-published Data
13
Strategy and Objectives
In 2011, the Department created a Data Release Working
Group to develop a coordinated Departmental strategy for
protecting privacy in public data releases.
14
Data Release Working GroupData Release Working Group
Chief Privacy Officer Kathleen Styles
NCES Commissioner Jack Buckley
NCES Chief Statistician Marilyn Seastrom
Statistical Privacy Advisor (Chair) Michael Hawes
OPEPD Representative Lily Clark
PIMS Representative Ross Santy
IES Representative Janis Brown
OCR Representative Rebecca FitchAbby Potts
OESE Representative Jane Clark
OSERS Representative Meredith Miceli
OVAE Representative Ric Hernandez
OUS Representative Taylor StanekJohn O’Bergh
OCFO Representative Jeanne Nathanson
FSA Representative John FareBarry Goldstein
OGC (counsel) Bucky MethfesselKaren Mayo-Tall
15
Strategy and Objectives
DRWG Objectives:
Cross-departmental coordination
Protect privacy while ensuring maximum data utility
(n.b., data utility ≠ data quality)
16
Process and Decisions
Two-fold Data Release Public Data Release (privacy protected)
Restricted-use NCES Licensing (raw data)
• Qualified researchers
• Security requirements
• NCES’ (ESRA) legal protections
• FERPA-permitted uses
17
Process and Decisions
Examination of State Methods§ PTAC review of each state’s disclosure avoidance for assessment data
§ For more information, please come see today’s 4:15 session:
Session V-F“Privacy Technical Assistance Center’s
(PTAC) Analysis of State Public Reports4:15-5:15 Massachusetts Room”
18
Process and Decisions
Most states are using suppression to protect privacy Implementation details vary widely (e.g., suppression at cell or subgroup
level, with/without complementary suppression)
Suppression threshold (n-size) varies from 5-30 (most use n=10)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 300
5
10
15
20
25
30
35State Adoption of Sub-Group Suppression Rules
Suppression Rule (n < X)
19
Process and Decisions
Privacy Threshold Each reported cell is really two cells (% proficient, % not proficient)
Rule of Three
Goal: For any reported data, to avoid knowing with certainty the
proficiency of students when there are fewer than three individuals in
either the proficient and/or non-proficient category
Implications for subgroups with fewer than six valid tests, and (for larger
groups) for extreme values where there are fewer than three individuals
in either outcome category.
20
Process and Decisions
Initial Attempt:
Primary suppression (n<6)
Top/Bottom coding of extreme values
Complementary suppression
21
Result:
Process and Decisions
Stock Image
22
Process and Decisions RAWDATA 3th Grade 4th Grade 5th Grade All Grades
Number of Valid
Tests
Percent Proficient
Number of Valid
Tests
Percent Proficient
Number of Valid
Tests
Percent Proficient
Number of Valid Tests
Percent Proficient
American Indian . . . . . . . .
Asian . . . . 1
100%
1
100%
Black
78
79%
100
76%
101
89%
279
82%
Hispanic . . . . . . . .
White 5
80%
8
100%
6
83%
19
89%
Two or More Races
. . . . . . . .
All Students
83
80%
108
78%
108
89%
299
82%
Fictitious Data – Contains no PII
23
Process and Decisions INITIAL
METHOD 3th Grade 4th Grade 5th Grade All Grades Number
of Valid Tests
Percent Proficient
Number of Valid
Tests
Percent Proficient
Number of Valid
Tests
Percent Proficient
Number of Valid Tests
Percent Proficient
American Indian . . . . . . . .
Asian . . . . 1
*
1
*
Black
78 *
100
*
101
89%
279
82%
Hispanic . . . . . . . .
White 5
*
8
≥50%
6
*
19
*
Two or More Races
. . . . . . . .
All Students
83
80%
108
78%
108
89%
299
82%
Fictitious Data – Contains no PII
24
Process and Decisions
Problems with Initial Approach:
“Swiss-cheese” effect
Implementation Challenges
Vulnerability
We needed something SIMPLER, that made more data
available for small groups
25
Selected Methodology
Primary Suppression of small cells (n=1-5)
“Blurring” of remaining data Top/Bottom coding of extreme values
Ranges for subgroups with n=16-300
Whole number percentages for n>300
Special rule for All Students/All Grades
26
Selected Methodology
Magnitude of reported ranges determined by the size of the
group:Number of Students Reported in the Cell
Ranges Used for Reporting the Percent Proficient for that Group
6-15 <50%, ≥50%16-30 ≤20%, 21-39%, 40-59%, 60-79% ≥80%31-60 ≤10%, 11-19%, 20-29%, 30-39%, 40-49%, 50-
59%, 60-69%, 70-79%, 80-89%, ≥90%
61-300 ≤5%, 6-9%, 10-14%, 15-19%, 20-24%, 24-29%, 30-34%, 35-39%, 40-44%, 45-49%, 50-54%, 55-59%, 60-64%, 65-69%, 70-74%, 75-79%, 80-84%, 85-89%, 90-94%, ≥95%
More than 300 ≤1%, whole number percentages, ≥99%
27
Process and Decisions INITIAL
METHOD 3th Grade 4th Grade 5th Grade All Grades Number
of Valid Tests
Percent Proficient
Number of Valid
Tests
Percent Proficient
Number of Valid
Tests
Percent Proficient
Number of Valid Tests
Percent Proficient
American Indian . . . . . . . .
Asian . . . . 1
*
1
*
Black
78 *
100
*
101
89%
279
82%
Hispanic . . . . . . . .
White 5
*
8
≥50%
6
*
19
*
Two or More Races
. . . . . . . .
All Students
83
80%
108
78%
108
89%
299
82%
Fictitious Data – Contains no PII
28
Process and DecisionsFINAL
METHOD 3th Grade 4th Grade 5th Grade All Grades Number
of Valid Tests
Percent Proficient
Number of Valid
Tests
Percent Proficient
Number of Valid
Tests
Percent Proficient
Number of Valid Tests
Percent Proficient
American Indian . . . . . . . .
Asian . . . . 1
PS
1
PS
Black
78
75-79%
100
75-79%
101
85-89%
279
80-84%
Hispanic . . . . . . . .
White 5
PS
8
≥50%
6
≥50%
19
≥80%
Two or More Races
. . . . . . . .
All Students
83
80-84%
108
75-79%
108
85-89%
299
82%
Fictitious Data – Contains no PII
29
Evaluation
More (albeit less precise) data about small subgroups
Cross-comparison of state reports with ED methodology
does not increase disclosure risk
Simple to program
Your feedback?
30
Questions
Why are you reporting the number of valid tests for small
subgroups? Isn’t that a disclosure?
31
Questions
Why are you reporting the number of valid tests for small
subgroups? Isn’t that a disclosure?
Number of Valid Tests ≠ Number of Students– Movement in/out of the district
– Unreadable or damaged tests
– Students absent during testing
Number Valid often published on state websites – US ED
did not want to rely on privacy protection through
obscurity
32
Questions
The data that ED has reported doesn’t match the data on our
state/school website. Why?
33
Questions
The data that ED has reported doesn’t match the data on our
state/school website. Why?
States may update their websites on different schedules
than they use to report to ED.
States may only count students who were present for the
full academic year (ED includes all student with valid tests
regardless of academic year status).
34
Other Questions
Stock Photo
Contact Information
Michael HawesStatistical Privacy AdvisorU.S. Department of [email protected](202) 453-7017