
SLDS Topical Webinar Summary: Processes for Handling Multiple IDs to Ensure Data Quality, November 23, 2015

This product of the Institute of Education Sciences (IES) was developed with the help of knowledgeable staff from state education agencies and partner organizations. The content of this publication was derived from a Statewide Longitudinal Data Systems (SLDS) Grant Program monthly topical webinar that took place on November 23, 2015. The views expressed do not necessarily represent those of the IES SLDS Grant Program. We thank the following people for their valuable contributions:

Webinar Presenters:

Michael McGroarty, Michigan Center for Educational Performance and Information

Chuck Murphy, Kentucky Center for Education & Workforce Statistics

Meghann Omo, Michigan Center for Educational Performance and Information

John Sabel, Washington State Education Research & Data Center

Moderators:

Kate Akers, Kentucky Center for Education & Workforce Statistics, SLDS Data Linking Workgroup

Kathy Gosa, SLDS Grant Program, State Support Team

For more information on the IES SLDS Grant Program or for support with system development, please visit http://nces.ed.gov/programs/SLDS.

As statewide longitudinal data systems (SLDSs) take in data for new individuals, for new years, and from new data sources over time, it is critical to ensure that individuals are correctly identified and that their records are matched accurately within the system. The record-matching process must include a method of resolving multiple identifiers (IDs) for individuals—whether consolidating IDs used by different agencies, correcting cases where multiple IDs are assigned to the same individual, or separating records for different individuals who are mistakenly assigned the same ID.

A thorough and reliable ID-resolution process helps ensure the accuracy of metrics such as graduation, dropout, college enrollment, and job placement rates, whereas incorrectly matched SLDS data can affect the quality and reproducibility of those measures.

Representatives from Kentucky, Washington State, and Michigan discuss their different processes for resolving multiple IDs, as well as how they review and improve those processes.

Kentucky

Matching process overview

Kentucky’s centralized P-20W (early childhood through workforce) data system uses a master person index to store the most recent, good-quality data for a unique individual along with alias records showing alternative agency-specific IDs or other data received for that person. Data contributed to the SLDS from partner agencies are linked to existing records using several levels of matching algorithms, which are used sequentially to find similar records based on different combinations of data elements. Incoming records are assigned a point-in-time identifier (PID) that replaces the agency-specific identifier for an individual after his or her record has been run through the matching algorithms.

Once the records are connected to an existing person in the master person index—or entered as a new person—the PIDs are stored in a cross-reference table alongside the master person index’s global unique ID (P20GUID) for that individual. Agency-specific identifiers such as the public K12 school system’s state student ID are not used in the SLDS once the matching process is complete.

Finding and resolving multiple IDs

The different combinations of data elements used by Kentucky’s series of matching algorithms help spot records that are assigned different identifiers but likely represent the same individual. For example, a batch of K12 student data submitted to the SLDS may contain course records for a student who changed schools or who enrolled at more than one school and was mistakenly assigned two different K12 state student IDs (SSIDs) (see figure 1). SSIDs are used as matching criteria in several levels of Kentucky’s matching algorithms, but other levels rely on information such as name, date of birth, gender, ethnicity, or Social Security number rather than SSID. Kentucky runs incoming records through its matching process one at a time, meaning that the second record for that student to enter the system will not match the first on SSID but likely will match on other data elements. The PID for the record containing the alternative SSID will be stored as an alias alongside that student’s entry in the master person index.
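The sequential matching described above can be illustrated with a minimal Python sketch. It is not Kentucky's implementation: the match levels, field names, and the `MasterPersonIndex` class are assumptions made for illustration, and real matching algorithms would typically use fuzzy comparisons rather than exact equality. The sketch shows how a second record with a different SSID can still be linked to the same person and stored as an alias.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    pid: str               # point-in-time identifier assigned on intake
    ssid: Optional[str]    # agency-specific K12 state student ID
    ssn: Optional[str]
    first: str
    last: str
    dob: str               # ISO date string, e.g. "2001-04-09"


# Hypothetical match levels, tried in order. Each level lists the fields
# that must all be present and equal for two records to match.
MATCH_LEVELS = [
    ("ssid", "last", "dob"),
    ("ssn", "dob"),
    ("first", "last", "dob"),
]


class MasterPersonIndex:
    def __init__(self):
        self.people = {}      # P20GUID -> list of records (master plus aliases)
        self.crosswalk = {}   # PID -> P20GUID cross-reference table

    def _find(self, rec):
        for level in MATCH_LEVELS:
            key = tuple(getattr(rec, f) for f in level)
            if any(v is None for v in key):
                continue  # this level cannot be evaluated for this record
            for guid, records in self.people.items():
                for existing in records:
                    if tuple(getattr(existing, f) for f in level) == key:
                        return guid
        return None

    def ingest(self, rec):
        """Link one incoming record to an existing person or create a new one."""
        guid = self._find(rec) or str(uuid.uuid4())
        self.people.setdefault(guid, []).append(rec)
        self.crosswalk[rec.pid] = guid
        return guid


index = MasterPersonIndex()
first_guid = index.ingest(Record("PID-1", "S100", None, "Ana", "Lee", "2001-04-09"))
# Same student, mistakenly given a second SSID: the SSID level fails,
# but the name and date-of-birth level links the record as an alias.
second_guid = index.ingest(Record("PID-2", "S200", None, "Ana", "Lee", "2001-04-09"))
```

In this sketch both PIDs end up in the cross-reference table under the same P20GUID, and the agency-specific SSIDs are no longer needed for lookups once matching is complete.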



Figure 1. If multiple IDs are assigned to the same individual, Kentucky’s matching process will link records based on additional demographic information and store alternative IDs as aliases.

Depending on the strength of the match, two records representing the same person may be linked automatically in the SLDS, or they may be reviewed manually by SLDS staff using an interface that displays possible matches and highlights their common data elements. Staff members can select an existing person in the data system to link the new records to, or they can create a new P20GUID for that individual.
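As a rough illustration of routing by match strength, the sketch below sends a scored candidate pair to automatic linking, manual review, or new-person creation, and lists the data elements two records share so a reviewer could see them highlighted. The thresholds, the 0-to-1 score scale, and the function names are assumptions; the webinar did not describe how Kentucky scores matches.

```python
from enum import Enum


class Decision(Enum):
    AUTO_LINK = "auto_link"
    MANUAL_REVIEW = "manual_review"
    NEW_PERSON = "new_person"


# Hypothetical thresholds on a 0-to-1 match score.
AUTO_LINK_AT = 0.90
REVIEW_AT = 0.60


def route_match(score: float) -> Decision:
    """Decide what happens to a candidate match based on its strength."""
    if score >= AUTO_LINK_AT:
        return Decision.AUTO_LINK
    if score >= REVIEW_AT:
        return Decision.MANUAL_REVIEW
    return Decision.NEW_PERSON


def common_elements(a: dict, b: dict) -> list:
    """Return the names of fields that are present and equal in both records."""
    return sorted(
        field for field in a.keys() & b.keys()
        if a[field] is not None and a[field] == b[field]
    )


incoming = {"ssid": "S200", "last": "Lee", "dob": "2001-04-09"}
existing = {"ssid": "S100", "last": "Lee", "dob": "2001-04-09"}
shared = common_elements(incoming, existing)
```

Treating very weak matches as new people is a simplification here; in practice staff members make that choice through the review interface.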

Additional processes

After the initial matching process, Kentucky periodically uses scripts to compare records and help identify any that were matched incorrectly. If records for multiple individuals are found to have been mistakenly linked to the same P20GUID, Kentucky’s SLDS staff members can reassign PIDs for those records to new or existing P20GUIDs. Similarly, if the matching algorithms find a record that shares an ID but no other demographic characteristics with an existing entity in the SLDS, staff members can manually review the record and link it with an existing individual or create a new entry in the master person index. Significant issues are communicated to the agency that originally submitted the dataset in hopes that errors can be corrected in future submissions.
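The check for a record that shares an ID but no other demographic characteristics with an existing entity might look like the following sketch. The field names and the `shares_id_only` helper are hypothetical, and the comparison is exact equality for simplicity.

```python
# Hypothetical demographic fields compared alongside the shared identifier.
DEMOGRAPHIC_FIELDS = ("first", "last", "dob", "gender")


def shares_id_only(incoming: dict, existing: dict, id_field: str = "ssid") -> bool:
    """True when two records agree on the ID but on no demographic field."""
    incoming_id = incoming.get(id_field)
    if incoming_id is None or incoming_id != existing.get(id_field):
        return False
    agrees_on_any = any(
        incoming.get(f) is not None and incoming.get(f) == existing.get(f)
        for f in DEMOGRAPHIC_FIELDS
    )
    return not agrees_on_any


existing = {"ssid": "S100", "first": "Ana", "last": "Lee",
            "dob": "2001-04-09", "gender": "F"}
suspect = {"ssid": "S100", "first": "Ben", "last": "Ortiz",
           "dob": "2000-01-15", "gender": "M"}
needs_review = shares_id_only(suspect, existing)
```

A record flagged this way would go to manual review rather than being linked automatically on the strength of the shared ID alone.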

Washington State

Matching process overview

Unlike Kentucky’s system of a master person index with associated aliases, Washington State’s centralized SLDS does not create a hierarchy among records and data for the same individual. After standardizing the format of identifying data such as names, dates of birth, Social Security numbers, and agency-specific identifiers for incoming records, Washington’s system combines those data into an “identity token” to represent each unique individual (see figure 2).

The identity tokens are used to aggregate additional data such as class schedules and grade point averages to the greatest extent possible while ensuring that each set of data represents only one person. Each new identity token entering Washington’s system is assigned a P20ID, the key identifier for unique individuals in the SLDS. P20IDs are then merged when common identifiers and demographic information show that their associated records belong to the same individual.
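One way to picture an identity token is as an immutable bundle of standardized identifying fields. The sketch below is an assumption-laden illustration, not Washington's design: the `IdentityToken` fields, the standardization rules, and the `P20-000001` ID format are all hypothetical.

```python
import itertools
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class IdentityToken:
    first_name: str
    last_name: str
    birth_date: Optional[date]
    ssn: Optional[str]
    agency_id: Optional[str]


def _clean_name(value: str) -> str:
    # Uppercase and collapse whitespace so formatting differences do not matter.
    return " ".join(value.upper().split())


def _clean_ssn(value: Optional[str]) -> Optional[str]:
    # Keep digits only; treat anything that is not nine digits as missing.
    digits = re.sub(r"\D", "", value or "")
    return digits if len(digits) == 9 else None


def make_token(first_name, last_name, birth_date, ssn=None, agency_id=None):
    """Standardize raw identifying data and combine it into one token."""
    return IdentityToken(
        _clean_name(first_name),
        _clean_name(last_name),
        date.fromisoformat(birth_date) if birth_date else None,
        _clean_ssn(ssn),
        agency_id.strip() if agency_id else None,
    )


_next_number = itertools.count(1)


def assign_p20id(token: IdentityToken, token_to_p20id: dict) -> str:
    """Give each previously unseen token a new (hypothetical) P20ID."""
    if token not in token_to_p20id:
        token_to_p20id[token] = f"P20-{next(_next_number):06d}"
    return token_to_p20id[token]


registry = {}
t1 = make_token("  ana ", "lee", "2001-04-09", "123-45-6789")
t2 = make_token("ANA", "Lee", "2001-04-09", "123456789")
```

Because the token is frozen and hashable, two submissions that differ only in formatting produce equal tokens and therefore resolve to the same P20ID.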

Finding and resolving multiple IDs

Like Kentucky, Washington can use combinations of a variety of data elements to compare records and find potential matches. These elements can include names, dates of birth, state student IDs, and Social Security numbers, as well as elements such as “Last high school reported,” which has helped Washington make links among K12, career and technical education, and higher education records.

As new identity tokens enter the system, they may contain data indicating that records previously stored under two separate P20IDs actually belong to the same individual. In these cases, SLDS staff members can reassign identity tokens associated with one P20ID to another P20ID. These changes are made to a table storing the relationships among identity tokens and P20IDs; all other records in the SLDS remain intact.
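The reassignment described above touches only the relationship table. A minimal sketch, assuming that table is a simple mapping from token to P20ID and that tokens are represented by placeholder strings:

```python
def merge_p20ids(token_to_p20id: dict, keep: str, retire: str) -> int:
    """Reassign every token under `retire` to `keep`; return how many moved."""
    moved = 0
    for token, p20id in token_to_p20id.items():
        if p20id == retire:
            token_to_p20id[token] = keep  # only the relationship changes
            moved += 1
    return moved


# Hypothetical relationship table and a separate table of source data.
token_to_p20id = {"token-A": "P20-1", "token-B": "P20-2", "token-C": "P20-2"}
course_records = {"token-B": ["Algebra I"], "token-C": ["Biology"]}

moved = merge_p20ids(token_to_p20id, keep="P20-1", retire="P20-2")
```

The `course_records` table is never modified, which mirrors the point that all other records in the SLDS remain intact when P20IDs are merged.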

Additional processes

Comparing records over time allows Washington to uncover cases where data have been linked to the wrong individual. For example, if a single P20ID is consistently associated with K12 data from two different school districts over time, SLDS staff members can manually review its related identity tokens and other identifiers to confirm whether data from more than one student are being stored under a single P20ID. They can then reassign certain identity tokens to another P20ID or to a new P20ID as needed.
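A check along these lines could be sketched as follows. The input layout (P20ID, school year, district) and the rule of flagging a P20ID that appears in more than one district in at least two school years are assumptions chosen for illustration.

```python
from collections import defaultdict


def flag_multi_district(enrollments, min_years=2):
    """Return P20IDs seen in more than one district in at least `min_years` years.

    `enrollments` is an iterable of (p20id, school_year, district) tuples.
    """
    districts_by_year = defaultdict(set)
    for p20id, school_year, district in enrollments:
        districts_by_year[(p20id, school_year)].add(district)

    conflict_years = defaultdict(int)
    for (p20id, _school_year), districts in districts_by_year.items():
        if len(districts) > 1:
            conflict_years[p20id] += 1

    return sorted(p for p, years in conflict_years.items() if years >= min_years)


rows = [
    ("P20-1", 2013, "District A"), ("P20-1", 2013, "District B"),
    ("P20-1", 2014, "District A"), ("P20-1", 2014, "District B"),
    ("P20-2", 2013, "District A"), ("P20-2", 2014, "District B"),  # a transfer
]
flagged = flag_multi_district(rows)
```

A student who simply transfers between years is not flagged; only a P20ID that repeatedly appears in two districts at once is sent for manual review.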

Figure 2. Identity tokens are created from a combination of identifiers to represent a single, unique person.


Michigan

Matching process overview

Michigan’s P-20W SLDS identifies individuals using Unique Identification Codes (UICs), which are assigned by the state’s Center for Educational Performance and Information (CEPI) and stored by agencies contributing data to the centralized data system. Michigan began using UICs for K12 students in 2002 and expanded the identifier to higher education in 2011. Currently, UICs are being implemented for treasury and workforce agency records, and several Head Start agencies are piloting the identifier.

When representatives from partner agencies submit data to CEPI’s data collection application—which feeds the SLDS—they receive direct feedback about records that cannot be matched to an existing UIC. For example, after K12 district users upload their datasets, they can browse a list of potential matches for student records that could not be matched automatically. The user can select the correct UIC for a student manually, or they can request that CEPI create a new UIC for a student. All records need to be matched to an existing or new UIC before the data submission can be certified, which is required as part of Michigan’s funding procedures for districts.
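The certification rule can be expressed as a simple gate: a submission is certifiable only when every record carries a UIC. The record layout and function names below are hypothetical and are not drawn from CEPI's data collection application.

```python
def unmatched_records(records):
    """Return the IDs of submitted records that still lack a UIC."""
    return [r["record_id"] for r in records if not r.get("uic")]


def can_certify(records) -> bool:
    """A submission can be certified only when no record is unmatched."""
    return not unmatched_records(records)


submission = [
    {"record_id": 1, "uic": "1234567890"},
    {"record_id": 2, "uic": None},   # awaiting manual selection or a new UIC
]
pending = unmatched_records(submission)
```

Listing the unmatched record IDs, rather than returning only a yes or no answer, reflects the direct feedback that submitters receive about records needing attention.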

Finding and resolving multiple IDs

In addition to matching records when they are first submitted, CEPI regularly runs its mass-linking process on subsets of SLDS data specifically to find cases where multiple UICs have been assigned to the same student. This check for duplicate identifiers is often performed at the end of the K12 school year before cohort graduation and dropout rates are calculated, when new postsecondary data are submitted to the system, and at other times as needed. The mass-linking process generates a list of records that could belong to the same student under multiple UICs. CEPI reviews this list to link UICs for the same student manually, or it sends the list to its data contributors to help determine which records need to be linked.
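A simplified version of generating that candidate list is sketched below. It groups records on exact agreement of name, date of birth, and gender after light normalization, which is an assumption; the actual mass-linking criteria were not described in the webinar and are likely more tolerant of variation.

```python
from collections import defaultdict


def duplicate_uic_candidates(records):
    """Return groups of UICs whose records share the same identifying fields."""
    uics_by_person = defaultdict(set)
    for r in records:
        key = (
            r["first"].strip().upper(),
            r["last"].strip().upper(),
            r["dob"],
            r["gender"],
        )
        uics_by_person[key].add(r["uic"])
    # Only groups with more than one UIC need review.
    return sorted(sorted(uics) for uics in uics_by_person.values() if len(uics) > 1)


records = [
    {"uic": "U1", "first": "Ana", "last": "Lee", "dob": "2001-04-09", "gender": "F"},
    {"uic": "U2", "first": "ANA ", "last": "lee", "dob": "2001-04-09", "gender": "F"},
    {"uic": "U3", "first": "Ben", "last": "Ortiz", "dob": "2000-01-15", "gender": "M"},
]
candidates = duplicate_uic_candidates(records)
```

The output is a review list only; deciding whether the UICs truly belong to one student remains a manual step for CEPI or the data contributors.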

When multiple UICs are confirmed to represent the same individual, one UIC is designated as the primary identifier, and all student history data are merged under that UIC. The additional UICs are stored as secondary identifiers for the individual. To date, Michigan has assigned more than 6.1 million UICs, about 240,000 of which were identified as duplicates.

Additional processes

Linked primary and secondary UICs can be unlinked if their records are later found to represent different individuals. Michigan also has found cases where history records for different students become intermingled even if their UICs were never linked. Data submitters who spot these errors can request that records and UICs be split, and a CEPI system administrator will review and reassign UICs as needed.
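The primary and secondary UIC relationship, including unlinking, can be sketched as a small registry. The `UicRegistry` class is hypothetical. One design choice is assumed here: history stays stored under the UIC it arrived with and is merged only when read, so that a later unlink can cleanly separate the records again.

```python
from collections import defaultdict


class UicRegistry:
    def __init__(self):
        self.primary_of = {}               # secondary UIC -> primary UIC
        self.history = defaultdict(list)   # UIC as submitted -> history events

    def add_history(self, uic, event):
        self.history[uic].append(event)

    def resolve(self, uic):
        """Return the primary UIC for any UIC, linked or not."""
        return self.primary_of.get(uic, uic)

    def link(self, primary, secondary):
        primary = self.resolve(primary)
        if secondary == primary:
            raise ValueError("a UIC cannot be linked to itself")
        # Re-point anything already linked to the new secondary.
        for uic, current in list(self.primary_of.items()):
            if current == secondary:
                self.primary_of[uic] = primary
        self.primary_of[secondary] = primary

    def unlink(self, secondary):
        self.primary_of.pop(secondary, None)

    def merged_history(self, uic):
        """All history for the individual, primary UIC first."""
        primary = self.resolve(uic)
        members = [primary] + sorted(
            s for s, p in self.primary_of.items() if p == primary
        )
        return [event for m in members for event in self.history[m]]


registry = UicRegistry()
registry.add_history("U1", "grade 9 enrollment")
registry.add_history("U2", "grade 10 enrollment")
registry.link(primary="U1", secondary="U2")
combined = registry.merged_history("U2")
```

After `registry.unlink("U2")`, each UIC again returns only its own history, which corresponds to splitting records that were found to represent different individuals.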

In the future, Michigan plans to expand access to the UIC web service system that helps resolve unique identifiers as data are submitted. These web services are currently used by a subset of data-contributing agencies.

Continuing Challenges

Kentucky, Washington State, and Michigan adjust their record-matching procedures as needed to better resolve cases where IDs are incorrectly assigned or matched in the SLDS. Even with improvements to the processes, several common challenges remain.

• Time. Identifying matching errors and implementing processes to resolve them is time-consuming for SLDS staff.

• Shortcomings in matching tools. The information technology products a state uses to import, match, and manage data in its SLDS may not support all the processes the state needs to identify and resolve matching issues. In Washington, the current version of the vendor-built matching engine used for the SLDS does not display scores used to measure the strength of the match between records. SLDS staff had to develop a workaround solution involving another vendor product to generate scores for use when manually reviewing matches.

• Incomplete data. The different agencies submitting data to an SLDS will collect different data about an individual, and even the same data elements may be recorded differently across agencies. In Michigan, records collected from higher education institutions are sometimes missing one or more of the “core fields” (first name, last name, date of birth, and gender) required to assign a UIC to a new individual in the SLDS.

• Correcting historical data. An error discovered in one dataset might also affect longitudinal records collected and matched over several years. In Kentucky, data-contributing agencies will often re-submit historical data when a matching error is found, sometimes resulting in new inconsistencies that need to be resolved when records included in the original submission are missing from the re-submitted data. Such inconsistencies can affect the reliability of reports generated with SLDS data over time.
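For the incomplete-data challenge noted above, a pre-submission check for Michigan's four core fields might look like this sketch. The dictionary keys are hypothetical names for the core fields listed in the text.

```python
# Core fields named in the text; the key spellings are assumptions.
CORE_FIELDS = ("first_name", "last_name", "date_of_birth", "gender")


def missing_core_fields(record: dict) -> list:
    """Return the core fields that are absent or blank in a record."""
    return [
        field for field in CORE_FIELDS
        if not str(record.get(field) or "").strip()
    ]


higher_ed_record = {
    "first_name": "Ana",
    "last_name": "Lee",
    "date_of_birth": "",      # blank in the submitted file
}
missing = missing_core_fields(higher_ed_record)
```

A record with any missing core field could be returned to the submitting institution before a new UIC is requested.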


Additional Resources

Kentucky Center for Education & Workforce Statistics https://kcews.ky.gov/

Linking Early Childhood and K12 Data: A State Example from Kentucky: SLDS Webinar https://slds.grads360.org/#communities/pdc/documents/6948

Linking K12 Education Data to Workforce: SLDS Webinar https://slds.grads360.org/#communities/pdc/documents/5871

Linking K12 Student Data with Postsecondary Data: SLDS Webinar https://slds.grads360.org/#communities/pdc/documents/5793

The Match Rate Dilemma: SLDS Webinar https://slds.grads360.org/#communities/pdc/documents/8982

Michigan Center for Educational Performance and Information http://www.michigan.gov/cepi

Washington State Education Research & Data Center http://www.erdc.wa.gov/
