THE DE-IDENTIFICATION OF MOOC DATASETS TO DEMONSTRATE THE POTENTIAL FERPA
COMPLIANCE OF MOOC PROVIDERS
A thesis presented by
Michelle H. Lessly, M.Ed.
To
the Doctor of Law and Policy Program
In partial fulfillment of the requirements for the degree of Doctor of Law and Policy
College of Professional Studies Northeastern University Boston, Massachusetts
June 2016
ACKNOWLEDGEMENTS
Completing this thesis was not a solitary task. I want to thank the faculty and staff of the
Doctorate of Law and Public Policy at Northeastern University. I want to extend a special
expression of gratitude to Dr. Edward F. Kammerer, Jr., my primary advisor, and Dr. Neenah
Estrella-Luna for her patience and support throughout this endeavor. I would like to thank
William D. McCants, Esq., my second reader. Additionally, it is with deep appreciation that I
want to recognize my peers and friends, Cohort VIII. I am forever grateful for your challenge,
support, and friendship over the past few years, and the many years to come.
I also want to recognize my family and friends who supported me throughout this
program. Specifically, I want to thank my parents who have been an unrelenting source of
encouragement. Since I was young, you have provided me the resources and opportunities
through which I could pursue my dream of earning a terminal degree. I am proud to be your
daughter; I hope I have made you proud in return.
Additional thanks to: Clinton Blackburn, Todd Karr, John Daries, Rachel Meidl,
Monqiue Cunningham Brijbasi, Keenan Davis, Ted Johnson, Bryan Coyne, Noradeen Farlekas,
Jalisa Williams, Joni Beshansky, Michelle Puhlick, Jonathan Kramer, Melissa Feiser, Melody
Spoziti, Dr. Anne McCants, Jon Daries, Julie Rothhaar-Sanders, Nivedita Chandrasekaran,
Rebeca Kjaerbye, Kristen Covino, and the many friends and colleagues who supported me
throughout this program.
ABSTRACT
The disruptive technology of massive open online courses (MOOCs) offers users access
to college level courses and gives MOOC providers access to big data concerning how their
users learn. This data, which is often used for educational research, also includes users’
personally identifiable information (PII). The Family Educational Rights and Privacy Act of
1974 (FERPA) protects PII and the educational records of students who attend traditional
educational institutions, but the protection of this legislation is not currently extended to MOOC
providers or their users.
A legal analysis of FERPA demonstrates analogous relationships between key statutory
definitions and MOOC users, providers, and their datasets. By imposing the k-anonymity and l-
diversity standards, this replication study of Daries et al.’s (2014) work attempts to de-identify
MOOC datasets in accordance with C.F.R. Title 34 Part 99 Subpart D §99.31 (b)(1) and (2) to
exhibit how to redact these datasets to be FERPA compliant and still maintain their utility for
research purposes. This study also seeks to determine if this de-identification method can be
standardized for universal use across MOOC providers.
The replication study, coupled with the legal analysis, suggests FERPA may not be the
proper statute to regulate the privacy protections MOOC providers afford their users. Rather, the
U.S. Department of Education and Congress should promulgate policy that outlines the
minimum privacy standards MOOC providers and other disruptive technologies afford their
users. Future research will aid in determining best practices for de-identifying MOOC datasets.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS ......................................................................................................... 2
ABSTRACT ................................................................................................................................... 3
Introduction ................................................................................................................................... 9
Literature Review ....................................................................................................................... 14
MOOCs and Public Policy ...................................................................................................... 14
The Family Educational Rights and Privacy Act of 1974 ...................................................... 16
Theoretical Framework ........................................................................................................... 18
Digital Privacy Theory. ..................................................................................................... 18
Solove’s Taxonomy of Privacy. ........................................................................................ 21
Critical Review of the Literature ............................................................................................ 24
Method and Research Design .................................................................................................... 28
Objectives and Research Question .......................................................................................... 28
Understanding Daries et al.’s De-identification Process ........................................................ 30
K-anonymity ..................................................................................................................... 30
L-diversity ......................................................................................................................... 36
Replication of Daries et al.’s Method ..................................................................................... 37
Data Collection ....................................................................................................................... 37
FERPA Document Review and Legal Analysis ............................................................... 38
Sampling Populations for De-Identification Process ........................................................ 38
De-identification Code and Process. ................................................................................. 40
Analysis................................................................................................................................... 42
Document Review of FERPA ........................................................................................... 42
Measuring K-anonymous Utility. ..................................................................................... 42
Limitations .............................................................................................................................. 42
Legal Analysis ............................................................................................................................. 44
Statutory Definitions of Key Terms as they Pertain to MOOCs ............................................ 45
Who is a Student? ............................................................................................................. 46
How is Attendance Defined? ............................................................................................ 50
Are MOOC Providers Educational Institutions or Agencies? .......................................... 51
What Constitutes an Educational Record? ........................................................................ 53
What is PII and how is it Protected? ................................................................................. 57
FERPA’s Application to MOOCs ........................................................................................... 62
Results .......................................................................................................................................... 63
Results of De-identification Process ....................................................................................... 63
Iteration I, MITx 2.01x. .................................................................................................... 63
Iterations II-IV, MITx 2.01x. ............................................................................................ 65
Iteration I, MITx 3.091x. .................................................................................................. 65
Troubleshooting the Program. .......................................................................................... 66
Assessing Replicability ........................................................................................................... 69
Effectiveness of Daries’ De-identification Program. ........................................................ 69
Role of Terms of Service Agreements and Privacy Policies on Data Releases for De-identification. ....... 70
Protecting and Releasing User Data ................................................................................. 73
Results of Legal Analysis ....................................................................................................... 73
Are MOOC Users Students? ............................................................................................. 74
Does Enrolling in a MOOC Constitute Attendance? ........................................................ 75
Are MOOC Providers Educational Institutions or Agencies? .......................................... 76
Are MOOC Datasets Classified as Educational Records and do they Include PII? ......... 76
Is Metadata an Educational Record? ................................................................................. 77
Is Metadata PII? ................................................................................................................ 78
Recommendations and Conclusions .......................................................................................... 80
Conclusion .............................................................................................................................. 80
Recommendations ................................................................................................................... 83
For the Department of Education. ..................................................................................... 83
For Congress. .................................................................................................................... 83
For Researchers ................................................................................................................. 84
For MOOC Providers. ....................................................................................................... 84
References .................................................................................................................................... 86
Appendix A: Notification of IRB Action ................................................................................... 92
Appendix B: Outbound Data Use Agreement: MITx Data .................................................... 93
Appendix C: De-identification Code ....................................................................................... 101
LIST OF TABLES
Table 3.1. Measures ...................................................................................................................... 29
Table 3.2. Variables ...................................................................................................................... 34
Table 3.3. PreUtility Matrix for MITx 2.01x ................................................................................ 41
Table 4.1. Definitions of a Student ............................................................................................... 49
Table 5.1. Variables Selected when Running Daries De-identification Program on MITx 2.01x, Iterations I-III .......................................................... 67
LIST OF FIGURES
Figure 3.1. Risk of Re-identification due to the Intersection of MOOC User Data, Quasi-identifiers, and User-generated, Publicly Available Information ........................................ 31
Figure 3.2. Example of Suppression and Generalization Emphases ............................................ 32
Chapter 1
Introduction
Massive open online courses (MOOCs) offer a promising 21st Century solution to the
problem of access and affordability of higher education. Initially launched in the United States in
2011, MOOCs offer low-to-no cost college-level courses through partnerships with universities
or corporations. This disruptive educational model differs from the traditional college model or
online courses. MOOCs have no admission requirements and occur entirely online, allowing
thousands of users from around the world to take a class simultaneously and learn from one
another through interactions on discussion forums (Jones & Regner, 2015; Young, 2015). These courses
are often offered on demand and deliver course content through videos, filmed lectures,
discussion boards, forums, readings, and homework, all without the active intervention of a
professor. MOOCs, operated by third party providers, can be affiliated with a post-secondary
institution such as Harvard and MIT’s edX, the only open source, nonprofit MOOC provider
(edX, 2016). They can also operate as a private company such as Udacity and Coursera, both of
which were co-founded by former Stanford professors.
MOOC enrollment continues to grow annually by 6% (Allen & Seaman, 2014), now
reaching approximately 16 million users worldwide (Shah, 2014). The New York Times declared
2012 as the “year of the MOOC” (Pappano, 2012), but by 2014, skepticism regarding the MOOC
revolution was at an all-time high (Friedman, 2014). This doubt may have been propelled by
developmental setbacks such as San Jose State University’s unsuccessful attempt to offer
Udacity courses to its underprepared students (Rivard, 2013)1. Numerous reports reveal MOOC
1 In January 2013, San Jose State University announced a pilot program, in partnership with Udacity, to offer three entry-level MOOC courses to matriculating students (Fain, 2013). However, due to poor student performance, the pilot was cancelled in June 2013 (Rivard, 2013).
course attrition rates consistently hover between 90% and 96% (Pope, 2014). Still, the claims that
MOOCs miss the mark overlook the innovations they contribute to the field of educational
technology and research. The truly transformative nature of this non-formal education platform
rests not in the method of knowledge delivery or course retention rates, but in the opportunities it
creates for the analysis of knowledge acquisition, especially in the digital age.
MOOCs have a multi-pronged business model, for in addition to providing access to
college courses, MOOCs function as education data warehouses. This information is known as
metadata, or “structured information that describes, explains, locates, or otherwise makes it
easier to retrieve, use, or manage an information resource” (National Information Standards
Organization, 2004, p. 1). For MOOCs, this includes users’ personally identifiable information
(PII)2 as described in the Family Educational Rights and Privacy Act of 1974 (20 U.S.C. §1232g; Title 34
CFR Part 99), commonly known as FERPA, as well as data about the amount of time a user
spends watching a video, mouse clicks on a page within the course’s site, and the frequency with
which a user logs onto the learning platform. With an average of 43,000 registrants per course
(Ferenstein, 2014), one MOOC course can generate up to 20 terabytes of data (Hazlett, 2014).
Such collections of metadata can accumulate to become big data, which are datasets that are not
only massive, but are easily searchable and sortable (Boyd & Crawford, 2012) and are retained
for the purposes of evaluating minute details to determine patterns or trends within representative
sample populations (Young, 2015).
Big data creates privacy concerns for both users and data holders. As a wider cross-section of
organizations and companies collects data on the different facets of a user’s life, the resulting
digital dossier creates new privacy challenges.
2 FERPA defines PII as the student’s name, the names of the student’s family members, the student’s address, personal identification numbers, other indirect identifiers such as birthdate, and other information that may be linked to a specific student (Title 34 Part 99 Subpart D §99.3).
Big data has an exceptional
ability to connect seemingly isolated pieces of information to create a holistic depiction of an
individual’s identity. These digital dossiers create a tension between the utility, or usability, of
big data and expectations for consumer privacy grounded in law and ethics. The legal and ethical
framework that guides data management must be broad enough in scope to address the
potentially conflicting needs of both data holders and the individuals providing the content of the
dataset. Dataset owners must respect those individuals by assuming the responsibility for
protecting their privacy rights (Hoser & Nitschke, 2010).
Within the context of education, FERPA, a federal statute, and its attendant regulations
with interpretative guidance detail the regulatory obligations schools have when safeguarding
student data. This law protects student privacy by regulating the collection, retention, and
distribution protocols educational institutions use to collect the information included in student
educational records. Unfortunately, the protections afforded to traditional students have yet to be
extended to MOOC users, since the U.S. Department of Education has not yet determined
whether MOOC providers are classified as educational agencies under FERPA (Young, 2015).
This leads some educators to speculate that the Department does not believe it has the authority
to determine if FERPA is applicable to this new learning platform (Kolowich, 2014). This
conjecture is further supported by the fact that MOOC providers do not currently receive federal
funding, a prerequisite of the FERPA compliance structure. Moreover, to further complicate the
question of applicability:
If FERPA applies to MOOCs, it is more likely to apply to the data, not the MOOC
provider itself. Thus, data ownership becomes an important component of how FERPA
relates to MOOCs. If data is owned by an actual educational institution, then use of that
data must follow a fairly standard pattern: The institution can share the data with student
consent or share the data absent consent through exceptions or de-identification (Young,
2015, p. 578).
Thus, FERPA, conceived over four decades ago, was not prepared for the reciprocal partnership
between MOOC providers and postsecondary institutions, and neither the Dept. of Education nor
Congress has taken deliberate steps to address the issue of MOOC user data privacy. MOOC
providers have not been officially recognized by the Dept. of Education as
educational agencies, and since MOOCs do not receive federal funding, which would require
them to comply with FERPA, MOOC users are left without the same safeguards afforded to their
university student counterparts who attend the same class in-person or on-line. edX is currently
the only MOOC provider that voluntarily complies with FERPA (edX, 2014).
That said, MOOCs are becoming a more widely-accepted form of higher education, as
demonstrated by the partnership between edX and Arizona State University (ASU). Their
collaboration, known as the Global Freshman Academy, offers for-credit courses to
matriculating students at a significantly reduced tuition rate. MIT’s admissions-free
MicroMasters program provides would-be students the opportunity to take an entire semester’s
worth of courses on the edX platform before taking a qualifying exam to earn admission to the
on-campus, one-semester full master’s degree program. Since MIT’s MicroMasters program
requires taking edX courses as part of the degree, might the enrollees of the program be
classified as MOOC users or students who should receive FERPA protection?
Therefore, the question of whether MOOCs should comply with FERPA warrants an
urgent response from the Dept. of Education. It also raises the related question of whether
MOOCs can be compliant with FERPA and still generate usable data for the purposes of
research. Policy makers must address the conflict between the regulatory requirements of
FERPA and the uniqueness of MOOCs. Examining this conflict through the lens of digital
privacy theory and Solove’s taxonomy of privacy will provide a critical perspective and
necessary understanding of how the Dept. of Education should address MOOCs’ evolving
impact on the American higher education system.
This study seeks to provide a solution to this burgeoning policy concern by asking: in
what ways might MOOC provider datasets be de-identified to meet the requirements of C.F.R.
Title 34 Part 99 Subpart D §99.31 (b)(1) and (2) of the Family Educational Rights and Privacy
Act of 1974, and still maintain their utility for the purposes of research dissemination? To answer
this question, an examination of the literature on MOOCs, FERPA, and digital privacy theory
will provide the context in which MOOC providers and policy makers must resolve this issue. A
legal analysis of the legislative and judicial history of FERPA will inform a methodology for de-
identifying MOOC datasets to be FERPA compliant. The results of this study will yield
recommendations for MOOC providers, researchers, and policy makers to resolve the concerns of
user privacy, data utility, and the potential need for MOOCs to comply with FERPA.
Chapter 2
Literature Review
MOOCs and Public Policy
MOOCs first focused on providing open access to courses at globally-recognized, highly
ranked universities such as Harvard, Oxford, Stanford, and MIT. They have since evolved to
offer courses ranging from Google-developed coding classes to public relations seminars and
conversational English courses for non-native speakers. Though a general level of digital literacy
is required for MOOC course navigation, users are not limited by course prerequisites or
admissions requirements to enroll in their course of choice. MOOCs operate under an open
learning model, requiring users to rely on self-motivation to progress through a course, rather
than external motivators such as deadlines for homework assignments or attendance
requirements. Moreover, by divorcing online learning from the matriculating enrollment model
at a traditional university, MOOCs have developed into a new type of non-traditional educational
program.
In light of the collaboration between ASU and edX to create the Global Freshman
Academy, many MOOC providers and postsecondary institutions are exploring, and in some
cases implementing, such hybrid educational models. The American Council on Education
(ACE) recommends colleges and universities offer credit for up to five MOOC courses (ACE,
2013). By 2013, both California and Florida state legislators considered recommendations to
make MOOCs part of the degree-granting curriculum for their public college systems. While
Florida legislators did approve the use of MOOC classes in the K-12 system, concerns regarding
course quality prevented expanding the bill to public postsecondary institutions (Inside Higher
Ed, 2013). Faculty union fears prevented California lawmakers from making MOOCs a
component of the state’s three public higher education systems (Kolowich, 2013). In Arizona,
however, the Global Freshman Academy drew over 34,000 registrants in its first year by offering
6 credit-granting, transferable classes at $200 per credit hour (Straumsheim, 2015). This MOOC
hybrid-model, if proven successful, challenges the traditional post-secondary education
experience.
This new type of educational experience is at the center of the FERPA compliance
problem for MOOCs as it presents many challenges for MOOC providers, their university
partners, and legislators. The amorphous state of the MOOC provider does not match current
legal constructs (Jones & Regner, 2015), nor do the privacy and safety needs of MOOC users
equitably align with current legislation. For example, the Clery Act requires colleges and
universities to track crime data on and around their campuses, but is a MOOC required to report
threats or an incident of sexual harassment between two students on a course’s discussion board?
Can a MOOC provider’s course site be considered a campus? What if these two students reside
in different countries?
The hybrid MOOC model presents even more of a challenge for FERPA in that
compliance requires a student’s enrollment at a recognized educational agency that receives
some form of federal funds. If a user signs up for a university-created, certificate-granting course
through edX’s platform, is the user enrolled as a student at that FERPA-regulated university, or
at edX, which is not currently an educational agency under FERPA rules? Or, is
the user not entitled to any of the FERPA protections available to a student in a physical
classroom?
The President’s Council of Advisors on Science and Technology (PCAST) recognized
the range of MOOC related privacy challenges in their 2014 report. The big data element of
MOOCs makes protecting user privacy much more demanding than in the case of a traditional
student whose FERPA-protected information is confined to PII, including their name, birthday,
and email address, and their educational record which contains information such as graded
coursework and transcripts. PII does not include the wide range of metadata collected by
MOOCs such as a user’s highest level of education, how many times they watched a course-
related video, or the date of their last activity on a discussion board. Thus, the majority of the
information held by MOOC providers would likely remain unregulated even if FERPA were to
apply (Young, 2015). This leaves tort law as a potential safeguard for metadata, but an ideal
privacy apparatus protects both PII and metadata. Thus, PCAST’s recommendations for privacy
protections include encryption1 and de-identification by removing full and quasi-identifiable2
variables from a dataset (Daries, Reich, Waldo, Young, Whittinghill, Ho, & Chuang, 2014).
These recommendations surpass FERPA’s current privacy regulations, demonstrating the
revisions necessary to bring FERPA up-to-date with digital privacy needs. No longer is simply
redacting PII sufficient to protect a student’s identity. Lawmakers must contemplate the totality
of the data collected on students when promulgating privacy legislation.
The Family Educational Rights and Privacy Act of 1974
First introduced as the Buckley Amendment and signed into law by President Ford in the
summer of 1974, FERPA enables students to control both access to and the content of their
educational record (Graham, Hall, & Gilmer, 2008). This statute regulates the privacy needs
of students in the K-12 system by allowing both students and their parents to review and correct
their educational record. FERPA does revoke parental review rights for
1 PCAST defines encryption as the process that converts data into a cryptography-protected form, rendering it useless to those without the decryption key. 2 Quasi-identifiers are pieces of data that, when combined with other data, can uniquely identify an individual. Examples include gender and birth date (Sweeney, 2002).
students once they turn 18 or are enrolled in a post-secondary institution, but it otherwise
remains applicable to colleges and universities.
Compliance is required of all institutions that receive federal funds, including federal
student aid and grant monies. Withholding these funds is the only statutorily authorized
enforcement mechanism permitted. However, when a FERPA complaint is filed, the Dept. of
Education prefers to resolve the matter through administrative actions such as required policy
revisions or trainings (Family Policy Compliance Office, 2015), rather than revoking federal
funds. The consequences of the latter not only penalize the academic institution but can also have
significant negative repercussions that are passed on to the student. Revoking an institution’s
federal funds due to a FERPA violation potentially means the institution can no longer afford to
educate students in the same way it did prior to the complaint. To date, the Dept. of Education has not
withheld funds for a FERPA violation (Young, 2015).
Since 1974, FERPA has been amended eleven times (20 U.S.C §1232g). As a result, this
statute is notoriously challenging to interpret and at times seems contradictory. Until 2008, the
Dept. of Education actively abstained from providing clarity for colleges and universities on how
to interpret and implement FERPA (Lomonte, 2010). In that same year, the Secretary of
Education issued an amendment to FERPA in order to implement stricter written notification
requirements for the release of student records to a third party, including parents, while
simultaneously making notification exceptions when information is released for the purposes of
research (Ramirez, 2009; Family Educational Rights and Privacy Act, 2008).
These recent amendments demonstrate the conflicting nature of the privacy expectations
of students and their institution’s need to share student information for the purposes of
scholarship or safety. They highlight that FERPA was created in a time when its drafters were
unable to conceive of a virtual learning environment in which the scope of personally identifiable
data collected would be much more expansive than the current statutory definition of PII.
FERPA permits disclosing student data when the PII is de-identified3, but how might this process
be accomplished to scale for a MOOC course?
The historical interpretation of FERPA’s standard for de-identifying student PII may not
be enough to prevent the re-identification of MOOC users. The removal of PII in compliance
with FERPA will still leave behind additional quasi-identifying information, such as VPNs,
gender, and online user-generated content, which can be used to re-identify MOOC users.
Unfortunately, FERPA does not account for these quasi-identifiers. Therefore, even once a
MOOC dataset is de-identified according to FERPA’s regulations, the statute’s safeguards will
not be applied to the dataset’s quasi-identifiers, leaving that information public and unprotected.
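To make the risk concrete, the following sketch (with invented records and field names, not drawn from any actual MOOC release) shows how removing only FERPA-defined PII can leave quasi-identifier combinations that still single out a user:

```python
from collections import Counter

# Invented toy records for illustration only.
records = [
    {"name": "A. Jones", "email": "aj@example.com",
     "country": "US", "gender": "m", "birth_year": 1991},
    {"name": "B. Smith", "email": "bs@example.com",
     "country": "US", "gender": "m", "birth_year": 1991},
    {"name": "C. Wu", "email": "cw@example.com",
     "country": "IN", "gender": "f", "birth_year": 1988},
]

PII = {"name", "email"}                       # direct identifiers of the FERPA kind
QUASI = ("country", "gender", "birth_year")   # untouched by a PII-only redaction

# A FERPA-style de-identification removes only the direct identifiers.
redacted = [{k: v for k, v in r.items() if k not in PII} for r in records]

# Any quasi-identifier combination that occurs only once still isolates
# exactly one individual, who can be re-identified by linking outside data.
combos = Counter(tuple(r[q] for q in QUASI) for r in redacted)
re_identifiable = [c for c, n in combos.items() if n == 1]
print(re_identifiable)  # the IN/f/1988 record remains unique
```

In this toy dataset, the third user is still uniquely described by country, gender, and birth year after redaction, which is precisely the gap the de-identification standards discussed below are meant to close.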
Theoretical Framework
Digital Privacy Theory. As MOOC providers continue to develop their ability to gather
both PII and quasi-identifiers from their users, the need to ensure individual users’ privacy
grows. However, increasing privacy protections on this data may negatively impact the utility of
the dataset. To combat this problem, MOOC providers might employ the k-anonymity algorithm
(Sweeney, 2002), the l-diversity standard (Machanavajjhala, Kifer, Gehrke, &
Venkitasubramaniam, 2007), and Dwork’s (2008) differential privacy model.
Sweeney’s k-anonymity Algorithm. In an effort to better secure privacy within datasets
while retaining research utility, Sweeney (2002) recommends employing the k-anonymity
algorithm. Applying k-anonymity to a dataset of individual-level data points can “produce a
release of the data with scientific guarantees that the individuals who are the subjects of the data
3 See C.F.R. Title 34 Part 99 Subpart D §99.31 (b)(1), (2)
cannot be re-identified while the data remain practically useful” (p. 557). To be successful, a k-anonymous
dataset ensures that every combination of attribute values is shared by at least k records,
leaving each individual indistinguishable from at least k-1 others and reducing the ability to
re-identify an individual based on the totality of the information provided in the dataset.
By utilizing anonymization through the methods of generalization and suppression, k-anonymity
introduces noise into a dataset, diluting the information enough to keep it secure while
maintaining its utility.
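Concretely, a dataset satisfies k-anonymity when every combination of quasi-identifier values is shared by at least k records. A minimal Python sketch of this check (the records and field names below are invented for illustration, not drawn from any actual MOOC dataset):

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values appears in at
    least k records, i.e., each individual is indistinguishable from at
    least k-1 others on those attributes."""
    combos = Counter(
        tuple(rec[qi] for qi in quasi_identifiers) for rec in records
    )
    return all(count >= k for count in combos.values())

# Invented records whose age and location are already generalized.
records = [
    {"age": "25-29", "region": "New England", "grade": 0.91},
    {"age": "25-29", "region": "New England", "grade": 0.55},
    {"age": "30-34", "region": "Midwest", "grade": 0.78},
]

print(is_k_anonymous(records, ["age", "region"], 2))  # False: the lone Midwest record violates k=2
```

A dataset failing this check would need further generalization or suppression until its smallest group reaches size k.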
The two redaction methods, generalization and suppression, alter data while retaining the
type of attributes collected within the dataset. It is through these two methods that noise is
injected into the dataset and generates the k-value between attributes. Through generalization, a
specific attribute is removed but still captured through a generic, yet representative category. It
replaces specific attributes, such as ages or other data that can be represented accordingly, with
ranges. For example, using generalization, a 25-year-old male who lives in Boston, MA could be
represented in a k-anonymous dataset as a 25-to-29-year-old male who lives in the region of New
England. However, this method only works for certain types of data. Suppression is employed
for data that cannot be easily generalized. As in the previous example, the gender of the 25-year-old
male could be represented in a k-anonymous dataset as a symbol, most commonly an asterisk,
indicating the data was collected but suppressed for the purposes of anonymization. It is
important to note generalization and suppression may be used alone or in combination depending
upon different types of data and different research questions.
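The two redaction methods described above can be sketched in a few lines of Python; the record fields and the five-year bucket width are illustrative assumptions, not the scheme any particular provider uses:

```python
def generalize_age(age, width=5):
    """Generalization: replace an exact age with a representative range,
    e.g. 25 -> '25-29' with the default five-year buckets."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def suppress(_value):
    """Suppression: replace a value that cannot be easily generalized with
    an asterisk, signaling it was collected but redacted."""
    return "*"

record = {"age": 25, "gender": "male", "city": "Boston, MA"}
anonymized = {
    "age": generalize_age(record["age"]),  # generalization
    "gender": suppress(record["gender"]),  # suppression
    "region": "New England",               # location generalized by hand here
}
print(anonymized)  # {'age': '25-29', 'gender': '*', 'region': 'New England'}
```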
L-diversity. Whereas k-anonymity is a fairly comprehensive data privacy theory,
Machanavajjhala et al. (2007) argue that a k-anonymous dataset can still expose contextual
information through which individuals may be re-identified. l-diversity adds an additional level of protection for datasets that are
sensitive to privacy breaches due to the totality of the data made available to the public,
including not only the attributes represented in the data, but the background of the attributes.
Therefore, even if the 25-year-old male who lives in Boston, MA is represented in a k-anonymous
dataset as a * (25-to-29-year-old) who lives in Massachusetts, if that information is
published in an unaggregated manner that provides the context in which the data was collected,
the k-anonymous data is still vulnerable to attack. These attacks fall into two category types:
homogeneity attacks and background knowledge attacks.
Homogeneity attacks occur when the attributes in the dataset are not diverse enough to
create true anonymity on an individual level. For example, an attacker may know a user enrolled
in a MOOC who is a prolific poster on the course’s discussion board. An attacker who knows that
person’s age, gender, zip code, and course may be able to determine, through the process of
elimination, how many posts that individual made, if given access to that class’s discussion
board. Homogeneity attackers do not need to know the user personally; access to that user’s
demographic information is enough to make an identification.
Background attacks build on homogeneity attacks by using contextual information to
make an identification. Background attacks are a result of an attacker having personal knowledge
about a user and making connections between sensitive data and quasi-identifiers based on
societal background knowledge or information on a specific population represented in the
dataset. Continuing with the previous example, if the attacker also knew the user was struggling
with the course content and sought assistance from others in the class, the attacker may be able to
determine which posts were the user’s. This example demonstrates a background attack using
quasi-identifiers. Based upon the vulnerability presented by these attacks, the l-diversity
algorithm increases the noise in a dataset by increasing the diversity of sensitive attributes.
However, as sensitive attributes become more l-diverse, the utility of the data may be reduced.
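The l-diversity requirement can be checked mechanically: within each equivalence class (records sharing the same quasi-identifier values), the sensitive attribute must take at least l distinct values. A sketch with invented records:

```python
from collections import defaultdict

def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every equivalence class contains at least l distinct
    values of the sensitive attribute."""
    classes = defaultdict(set)
    for rec in records:
        key = tuple(rec[qi] for qi in quasi_identifiers)
        classes[key].add(rec[sensitive])
    return all(len(values) >= l for values in classes.values())

records = [
    {"age": "25-29", "region": "New England", "grade": "A"},
    {"age": "25-29", "region": "New England", "grade": "B"},
    {"age": "25-29", "region": "New England", "grade": "A"},
]
# k-anonymous at k=3, yet the sensitive 'grade' attribute is only 2-diverse.
print(is_l_diverse(records, ["age", "region"], "grade", 2))  # True
print(is_l_diverse(records, ["age", "region"], "grade", 3))  # False
```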
The Differential Privacy Model. Dwork (2008) also challenges k-anonymity, claiming
there is no such thing as an impenetrable privacy protection algorithm, and suggests the
differential privacy model provides a more effective anonymization solution. This model
injects noise into the release mechanism of the data, not into the data itself. Encoding
the data release, rather than altering the underlying data through methods such as
generalization and suppression, interferes with an attacker’s ability to accurately capture
information or to trace it back to re-identify individuals, while retaining the utility of
the data for the purposes of analysis. The differential privacy model focuses on producing
information about the data released in a published dataset.
This algorithm prevents an attacker from being “able to learn any information about any
participant that they could not learn if the participant had opted out of the database” (Tockar,
2014, n.p.). By adding noise to the release mechanism, such as a chart or graph, an attacker is
unable to determine seemingly random patterns in the data that may lead to re-identification.
Thus, the differential privacy model redefines the concept of digital privacy, moving from a
system that attempts to defend the entire dataset against attacks, to a tiered design that makes
datasets systematically less vulnerable when an inevitable attack occurs.
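Dwork’s model is commonly realized with the Laplace mechanism: random noise, calibrated to the query’s sensitivity, is added to the released statistic rather than to the stored records. A sketch of a noisy count (the records and the privacy parameter epsilon are arbitrary choices for illustration):

```python
import math
import random

def noisy_count(records, predicate, epsilon=0.5):
    """Release a count with Laplace(1/epsilon) noise, the standard
    calibration for a counting query whose sensitivity is 1. The noise
    is injected into the release, never into the stored records."""
    true_count = sum(1 for rec in records if predicate(rec))
    u = random.random() - 0.5  # inverse-transform sample of Laplace noise
    while abs(u) >= 0.5:       # avoid log(0) at the distribution's edge
        u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

records = [{"grade": g} for g in (0.91, 0.55, 0.78, 0.64)]
released = noisy_count(records, lambda r: r["grade"] >= 0.6)
# The attacker sees only the noisy value; adding or removing any one user
# changes the true answer by at most 1, which the noise masks.
print(round(released, 2))
```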
Solove’s Taxonomy of Privacy. In the context of MOOCs, user privacy should not
simply be reduced to the application of a security algorithm or a debate about identity protection.
A more satisfactory understanding of user privacy looks beyond anonymity and scrutinizes the
rationales behind the collection of the data in order to determine if it should be collected in the
first place. Solove’s (2008) taxonomy of privacy provides a framework for MOOC providers to
ethically develop and disseminate their user-populated datasets while maintaining the necessary
type of privacy. His argument that a single concept of privacy is not constant and cannot be
consistently applied reflects the complexities of the MOOC user privacy issue. By shifting the
locus of privacy from the data owner to the data subject, Solove’s taxonomy can explore the
impact of the integration of six privacy concepts: the right to be left alone, limited access to the
self, secrecy, control over personal information, personhood, and intimacy.
The concept of the right to be left alone is the underpinning for today’s privacy torts and
is similar to the notion that privacy is limited access to the self, a principle that insists an
individual should be the gatekeeper of their own personal information (Warren & Brandeis,
1890). The concept of secrecy, as popularized by Posner (1978), is the “appropriation [of] social
benefits to the entrepreneur who creates them while in private life it is more likely to conceal
discreditable facts” (p. 404). The desire for secrecy leads individuals to limit access to
information about themselves and leads to the concept of control over personal information,
which recognizes information as one’s personal property. The concept of personhood expands
upon that of personal property by viewing one’s information as a manifestation of one’s identity
and reputation. Finally, the concept of intimacy asserts the need to keep information private is
not just for the protection of one’s self, but to secure the information of those with whom the
individual may be associated. Whereas Sweeney and Dwork consider privacy from the utilitarian
perspective of the dataset owner, Solove recognizes that it is the individual who assumes more
risk when a third party, such as a MOOC provider, collects and disseminates data.
This becomes especially problematic due to exclusion, or “the failure to provide
individuals with notice and input about their records” (Solove, 2006, p. 521). Exclusion presents
a harm different from that of data privacy and security in that rather than being concerned with
re-identification, exclusion removes an individual’s ability to control what happens to their data
(Solove, 2006). FERPA’s primary goal is to eliminate exclusion, but it is this goal that further
complicates the application of FERPA to MOOCs. In order to register for a course, users are
often required to agree to the MOOC provider’s terms of service, which can exclude them from
the decision-making process as to how and when their information is used and deny them the
ability to review the data to ensure it is an accurate portrayal of their identity. This may become
problematic if MOOCs are required to become FERPA compliant, as it requires educational
agencies to grant students access to their educational record and the ability to correct it when
necessary. That said, those terms of service agreements that do not align with FERPA may
become void under the law, which easily resolves the policy concern, but still leaves MOOC
providers with the responsibility to audit massive amounts of data to ensure compliance.
When examining digital privacy from the user’s perspective, Solove’s model highlights
the porous nature of the relationship between data subjects and data holders. To rectify this, the
taxonomy identifies four activities of the data collection process: information collection,
information processing, information dissemination, and invasion. The taxonomy’s intentional
design around the data subject, identified as “the individual whose life is most directly affected
by the activities classified in the taxonomy” (Solove, 2008, p. 103), and not around a specific
privacy conception, allows for the evolution of privacy needs in the digital age.
A MOOC provider’s act of collecting information includes user registration information
and the surveillance of their subsequent activity online. This leads to the second action in the
taxonomy, processing information, which may be aggregated and analyzed without user
knowledge. Though the purposes of MOOC data research include learning about the potential
functionality of the platform and expanding the field of knowledge on education technology,
sharing this information can violate user trust. Moreover, the third activity, information
dissemination, reveals the vulnerability of MOOC users’ information. Poorly managed user
information creates opportunities in which information may be inappropriately disclosed or
privacy agreements may be violated, leading to the fourth activity of invasion. If a user’s
information is improperly disclosed, leading to an attack on their personhood, what impact might
this have on the likelihood they will feel safe enough to enroll in another MOOC course?
Critical Review of the Literature
If the Dept. of Education is to evaluate the relationship between MOOC providers and
user privacy concerns, it must also consider FERPA’s definition of PII as it pertains to big data.
The current statutory standard for de-identification is reducing or eliminating PII to create a
reasonable determination of anonymization (C.F.R. Title 34 Part 99 Subpart D §99.31 (b)(1)).
This binary conceptualization of privacy successfully operates in a traditional educational
setting, but cannot be reasonably applied in an online setting. Metadata, such as the course name,
when the course started, and the user’s IP address, are quasi-identifying data points that may be
concatenated for the purposes of re-identification (Daries, 2014). The current assumption, that
redacting what FERPA clearly considers to be PII provides sufficient user privacy protections, is
antiquated and may not hinder MOOC providers from openly sharing quasi-identifiers.
However, an examination of the current understanding of PII in a digital
learning environment might be moot, as some critics suggest FERPA does not pertain to
MOOCs. Since the Dept. of Education has remained silent on the matter, MOOC providers
currently have the liberty to make their own determination as to whether or not their course users
are protected by FERPA. Neither Udacity nor Coursera mentions a stance on FERPA
on its website, whereas edX, a provider owned and operated by Harvard and MIT,
specifically states it complies with FERPA.
Still, the undetermined status of FERPA’s applicability to MOOCs has the potential to
diminish the future utility of different providers’ data (Hollands & Tirthali, 2014). If the
Dept. of Education or Congress determines that MOOC providers are required to comply with
FERPA or other privacy regulations, those MOOC providers that have decided to not create
FERPA compliant datasets may be limited in their capacity to operate under their own business
models when attempting to share data with researchers. Moreover, if MOOC providers have no
clarity on what may legally or ethically be released, how then are researchers to take advantage
of MOOC-sourced big data?
Yet, a determination of mandatory compliance will not immediately resolve the issue of
user data privacy. Standardizing the privacy protection practice of traditional colleges and
universities is seemingly impractical if not impossible in the MOOC classroom. Whereas
redefining PII will aid in privatizing data, it does not remedy the problem of user exclusion
(Solove, 2011). MOOC providers require users to agree with their terms of service when
registering for a course, but the efficacy of these documents is dubious (Solove, 2013). Terms of
service agreements often rely on the average user not being well versed in the language and
structure of such documents, leading to common user misperceptions about the quality of privacy
controls (Turow, Feldman, & Meltzer, 2005). Since less than ten percent of individuals actually
read a terms of service agreement when registering for an online service (Smithers, 2011),
trusting in such contracts as a form of user consent for metadata collection is questionable at
best.
Fair Information Practice Principles (FIPPs) should be used to reduce users’ confusion
about their waived privacy. FIPPs insist that data holders act ethically with their data by
maintaining transparency of the data management process, keeping users informed of what
personal data is recorded, and seeking user consent when their data is repurposed (U.S.
Department of Health, Education, & Welfare, 1973). Incorporating FIPPs into FERPA’s
regulatory structure will help to reduce user confusion over their privacy controls and increase
MOOC provider accountability for data management practices. Or, MOOCs may use FIPPs and
FERPA as guidelines to create their own data privacy protection standards.
Additionally, policy makers must also consider how the global scope of MOOCs will
complicate statutory compliance. Whereas digital privacy theory can address the concerns
regarding data protection, it cannot account for cultural privacy norms. Solove’s taxonomy
intentionally allows for applicability within a cross-cultural context, but it fails to anticipate how
a culture’s understanding of power dynamics ebbs and flows through each activity of the data
collection process (Sweeney, 2012). This can be especially problematic when determining how
public policy applies to a MOOC dataset when the MOOC and the partner institution, or user, are
not American. Notably, the European Union has very detailed requirements for protecting its
citizens’ privacy, even when those citizens are accessing education resources outside of the EU.
Policy makers and MOOC researchers must pay additional attention to the issue of governance in
an international educational setting.
The National Association of College and University Attorneys (NACUA) recognizes that
the legal uncertainty surrounding FERPA and MOOCs may change at any point in time due to a
number of factors. For example, in the instances when a user borrows federal funds to pay for a
course, a professor incorporates MOOC course elements into their on-campus classroom
instruction, or postsecondary institutions require students to enroll in a MOOC course to gain
degree-seeking credit, MOOC providers will need to comply with FERPA (NACUA, 2013). It is
unreasonable to expect MOOC providers, which interact with hundreds of thousands
of users and numerous institutional partners in a given day, to self-monitor for these factors that
might change their compliance requirements. In order to optimize for both educational and
research potential, policy makers should examine how MOOCs can be effectively regulated
under FERPA.
Finally, the most prolific critics of MOOCs, university professors, claim this educational
delivery platform jeopardizes the tradition of the academy and the American system of higher
education. However, the data collected by MOOC providers may be advantageous in the
classroom and when conducting research. Unfortunately, the vast majority of MOOC research is
quantitative, and almost exclusively examines MOOCs from the perspective of user satisfaction.
Shifting the focus of MOOC research from determining the efficacy of the delivery method to
the utility of their user data will aid in the sustainability and mainstreaming of MOOCs in the
education marketplace for public, private, and online organizations.
Critiques and research on MOOCs can help MOOC providers and policy makers
understand better the barriers to the platform’s success. Rigorous studies of the San Jose State
failure have led to vast improvements in course design and content delivery (Lewin, 2013).
Investigations on open, self-directed learning indicate that user success may be contingent upon
their perception of the security of the online learning environment (Fournier, Kop, & Durand,
2014). If users think their metadata is too readily accessible to MOOC provider personnel or
believe that their privacy has been compromised, they are less likely to be retained (Hughes,
Ventura, & Dando, 2007). There is a need for increased attention to metadata privacy and for
regulatory oversight of MOOCs as a means of ensuring user retention.
Chapter 3
Method and Research Design
Objectives and Research Question
My study explored the feasibility of requiring MOOC providers to be FERPA compliant
by asking in what ways might MOOC provider datasets be de-identified to meet the requirements
of C.F.R. Title 34 Part 99 Subpart D §99.31 (b)(1) and (2) of the Family Educational Rights and
Privacy Act of 1974, and still maintain their utility for the purposes of research dissemination. In
addition to this question, my study also sought to determine a standard, systematic method for
de-identifying MOOC platform datasets.
My study was motivated by Daries et al.’s (2014) claim:
It is possible to quantify the difference between replications from the de-identified data
and original findings; however, it is difficult to fully anticipate whether findings from
novel analyses will result in valid insights or artifacts of de-identification. Higher
standards for de-identification can lead to lower-value de-identified data. . . If findings
are likely to be biased by the de-identification process, why should researchers spend
their scarce time on de-identified data? (p. 57)
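The concern in this passage can be made concrete: compute the same statistic on the original and the de-identified data and report the shift. A toy sketch with invented grades, using a crude suppression-emphasis rule that drops records in groups smaller than k:

```python
from statistics import mean

# Invented original records: (age_range, final_grade).
original = [("25-29", 0.90), ("25-29", 0.60), ("25-29", 0.75),
            ("30-34", 0.40), ("30-34", 0.55), ("35-39", 0.95)]

# Suppression emphasis at k=3: drop any record whose age range is shared
# by fewer than 3 records.
k = 3
counts = {}
for age, _ in original:
    counts[age] = counts.get(age, 0) + 1
deidentified = [(age, g) for age, g in original if counts[age] >= k]

# The mean grade shifts because the dropped groups had lower grades.
bias = mean(g for _, g in original) - mean(g for _, g in deidentified)
print(f"mean shift introduced by de-identification: {bias:+.3f}")
```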
To answer the research question, my study assumed a mixed-methods approach by
conducting a document review and legal analysis of FERPA, and attempting to replicate Daries
et al.’s research on measuring the impact the k-anonymity standard has on a MOOC provider
dataset while ensuring the potential for FERPA compliance. Daries and his team, comprised of
MIT and Harvard researchers, examined the feasibility of generating “a policy-based solution
that allows open access to possibly re-identifiable data while policing the uses of the data” (p.
58) according to the regulations promulgated in FERPA. Whereas Daries et al. approached the
problem of de-identification for the purposes of finding an equilibrium between privacy and
utility in advancing social research, my study examined the question of the application of C.F.R.
Title 34 Part 99 Subpart D §99.31(b)(1) and (2) to a publishable MOOC dataset for the purpose
of evaluating the feasibility of applying FERPA or other relevant public policies to MOOC
providers in order to protect users and their data.
Table 3.1. Measures

Measure: Can the de-identification process be successfully executed using the same protocol on sample MOOC datasets?
Definition: The de-identification process can be executed in the same manner on sample MOOC datasets and yield viable utility while maintaining FERPA compliance.

Measure: What is an acceptable level of utility?
Definition: Maintains a k-value of 5 for quasi-identifying variables and l-diversity for sensitive variables while minimizing entropy of the dataset after the de-identification of explicit-identifying variables (Daries et al., 2014).
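The entropy criterion in Table 3.1 can be operationalized with Shannon entropy: the drop in an attribute's entropy after generalization is one simple proxy for lost utility (Daries et al.'s actual utility matrix tracks additional statistics; the values below are invented):

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy, in bits, of a column's value distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

ages = [21, 22, 23, 26, 27, 31]                        # original attribute
generalized = ["20-24", "20-24", "20-24", "25-29", "25-29", "30-34"]

loss = shannon_entropy(ages) - shannon_entropy(generalized)
print(f"entropy lost to generalization: {loss:.3f} bits")
```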
Daries et al.’s research focused on the first edX dataset to be made publicly available,
known as the HarvardX-MITx Person-Course Dataset AY2013 (Person-Course). In an effort to
validate and expand upon their work, my study employed the k-anonymity standard, a process in
which data unique to a user are removed to reduce the risk of re-identification, on at least one
dataset from two MOOC providers. Since FERPA does not require a precise value for k-
anonymity, Daries et al. consulted the Department of Education’s Privacy Technical Assistance
Center standards and determined that a k-value of five (k-5) created a safely de-identified dataset
and met MIT’s standards for de-identification. My study used the same metric of de-
identification.
In keeping with the original research, I generated k-anonymous datasets through the
methods of generalization emphasis and suppression emphasis. Daries et al. stressed the purpose
of engaging both generalization and suppression emphases was to evaluate each method’s merits
and challenges as they related to the utility impact on the data. Therefore, my study evaluated both
generalization and suppression on their ability to better secure users’ personally identifiable
information and to meet the standards as promulgated in C.F.R. Title 34 Part 99 Subpart
D §99.31(b)(1) and (2).
Understanding Daries et al.’s De-identification Process
Daries et al.’s method for de-identification included applying k-anonymity and l-diversity
to MOOC datasets. Additionally, to quantifiably measure the shift in efficacy of the datasets,
they employed a utility matrix as seen in Table 4.3. The authors’ utility matrix was modeled after
Dwork’s (2006) utility vector, which combined descriptive and general statistics to assess the
utility impact the de-identification had on the MOOC datasets.
K-anonymity. To begin the de-identification process, Daries et al. determined which
attributes, or quasi-identifiers, within the existing identified dataset should be removed to meet
MIT’s Institutional Research standards for both anonymization and report composition. The
challenge in de-identifying Person-Course came with the amount of quasi-identifiers available
within the data. One quasi-identifier may not be enough to distinguish a user, but as more unique
attributes are made available, a more holistic account emerges, making a user more
vulnerable to attack. Additionally, if a user were actively posting about their MOOC experience
on social media during the course, this increases the likelihood for re-identification based upon
the information provided in the publicly available Person-Course dataset (see Figure 4.1).
Controlling for this potential variable was too challenging for Daries et al., but they theorized
that it could be offset by using a higher standard for anonymization.
To do this, Daries et al. (2014) used Sweeney’s (2002) k-anonymity model. In the case of
a MOOC dataset, which can have quasi-identifiers ranging from username to the number of
mouse clicks per page, a greater k-value is required to promote anonymity. For the purposes of
the Person-Course dataset, the researchers assigned a value of k-5, meaning the “k-anonymized
dataset has at least 5 records for each value combination” (Emam & Dankar, 2008, p. 628). In
order for this de-identification approach to be successful, the researchers determined they needed
to generalize or suppress quasi-identifiers until every combination of their values appeared in at
least five records, which in turn served as a filtering mechanism in reducing the risk of re-identification. As the k-value increases, the data’s vulnerability
to attack decreases. However, as Daries et al. noted, as the k-value increases, so does the
likelihood that the utility of the data may be compromised.

Figure 3.1. Risk of Re-identification due to the Intersection of MOOC User Data, Quasi-identifiers, and User-generated, Publicly Available Information. [The figure depicts how data collected by MOOC providers (user name, IP address, email address, course grade, gender, course name, birthdate, enrollment date) intersects with user-generated, publicly available information (blogs, Facebook posts, tweets, other social media) to form quasi-identifiers that may identify a MOOC user if not anonymized properly. Adapted from Sweeney, 2002.]
To impose the k-anonymity model on the MOOC datasets, Daries et al. (2014) employed
both the suppression and generalization emphases. The suppression emphasis removed
identifiable attributes from the dataset and replaced it with a character to represent information
that was collected and subsequently redacted. The generalization emphasis replaced attributes
with corresponding or representative values. For example, in order to de-identify a dataset
containing users’ age, the suppression technique eliminated the cell value while maintaining the
attribute category. The generalization technique replaced the cell value with an age range, as
seen in Figure 3.2.
Figure 3.2. Example of Suppression and Generalization Emphases

Attribute   | Suppression | Generalization
User_1 Age  | *           | 20-24
User_2 Age  | *           | 15-19
User_3 Age  | *           | 30-34
In the case of Person-Course, Daries et al. (2014) identified 20 attributes as variables that
may be used to identify MOOC users (see Table 3.2). The attributes were grouped into two
categories: administrative, meaning the data was generated by the MOOC provider or by the
researchers, and user-provided, meaning the data points were generated by the user at the
the time of registration with the MOOC provider. Attributes that were altered as a result of the k-
-
33
anonymity process were tagged with the suffix DI. Null cells, or data that was not made available
by either the MOOC provider or the user was indicated in the attribute inconsistent_flag.
Table 3.2. Variables

Attribute | Code | Type | Description
Course ID | course_id | Administrative | Course name, institution, and term
User ID | userid_DI | Administrative | Researcher-assigned indiscriminate ID number that correlates to a given dataset
Registered for course | registered | Administrative | User registered for a given course
Gender | gender | User-provided | Values include female, male, and other
Country of residence | final_cc_cname_DI | Administrative, user-provided | IP address or user disclosed; altered through generalization emphasis
Birth year | YoB | User-provided | User’s year of birth
Education | LoE | User-provided | User’s highest level of completed education
Registration | start_time_DI | Administrative | Date user registered for course
Forum posts | nforums_posts | Administrative | Number of user posts to discussion forum
Activity | ndays_act | Administrative | Number of days user was active in the course
Class visits | viewed | Administrative | Users who viewed content in the course tab
Course interactions | nevents | Administrative | Number of user interactions with the course as determined by tracking logs
Video events | nplay_video | Administrative | Number of times user played course videos
Chapters accessed | nchapters | Administrative | Number of course chapters accessed by user
Chapters explored | explored | Administrative | Users who accessed at least half of the assigned chapters
Seeking certificate | certified | Administrative | Users who earned a course certificate
Final grade | grade | Administrative, l-diversity sensitive | User’s final grade in the course
Activity end | last_event_DI | Administrative | Date of user’s final interaction with course
Non-user participant | role | Administrative | Classifies instructors or staff in the course
Null values | inconsistent_flag | Administrative | Classifies values that are not available due to data inconsistencies
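The naming conventions in Table 3.2 (a _DI suffix on attributes altered during de-identification, and an inconsistent_flag marking unavailable values) can be sketched as a small post-processing step; the record and field values here are invented, not actual Person-Course data:

```python
def tag_deidentified(record, altered_fields):
    """Rename attributes altered by the de-identification process with a
    '_DI' suffix and flag records containing null values, mirroring the
    conventions described for the Person-Course dataset."""
    out = {}
    for field, value in record.items():
        key = field + "_DI" if field in altered_fields else field
        out[key] = value
    out["inconsistent_flag"] = any(v is None for v in record.values())
    return out

raw = {"userid": "a1b2c3", "final_cc_cname": "New England", "grade": None}
tagged = tag_deidentified(raw, altered_fields={"userid", "final_cc_cname"})
print(tagged)
# {'userid_DI': 'a1b2c3', 'final_cc_cname_DI': 'New England',
#  'grade': None, 'inconsistent_flag': True}
```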
L-diversity. Daries et al. (2014) also accounted for l-diversity in the de-identified Person-
Course dataset. The researchers were able to create a k-anonymous dataset that was effective in
reducing identification risks for individual MOOC users, but it still left the possibility of a
“homogeneity attack” (Machanavajjhala, Gehrke, Kifer, & Venkitasubramaniam, 2007, p. 3).
In this type of data breach, an attacker capitalizes on contextual knowledge of a given individual,
perhaps learned through social media sites, and, employing deductive reasoning informed
by the data provided in a k-anonymous dataset, re-identifies that individual. The initial k-
anonymity process yielded individual-user population groups with sensitive variables that might
be used for re-identification. In the case of Person-Course, by knowing how a user was classified
in a few sensitive variable categories, such as date of enrollment, course name, and their IP
address at the time of their involvement in the course, it might be possible to determine which
specific user posted on a discussion board on a given date.
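This re-identification risk can be demonstrated as a simple filter: the attacker intersects facts already known about the target with the released records, and if only one record survives, the target is identified and every sensitive attribute in that record is exposed. The released records below are invented:

```python
def matching_records(dataset, known_facts):
    """Records consistent with everything the attacker already knows."""
    return [rec for rec in dataset
            if all(rec.get(field) == value
                   for field, value in known_facts.items())]

released = [
    {"course": "6.002x", "start": "2013-09", "country": "US", "nforum_posts": 3},
    {"course": "6.002x", "start": "2013-09", "country": "FR", "nforum_posts": 41},
    {"course": "6.002x", "start": "2013-10", "country": "US", "nforum_posts": 0},
]
# Suppose the attacker knows (e.g., from social media) that the target
# enrolled in 6.002x in September 2013 from France.
candidates = matching_records(released, {"course": "6.002x",
                                         "start": "2013-09",
                                         "country": "FR"})
print(len(candidates))  # 1: the target's posting history is exposed
```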
L-diversity could also be used to reduce statistically based data breaches known
as “background knowledge attacks” (Machanavajjhala et al., 2007, p. 4). This type of data breach
allows an attacker to capitalize on information they have about a specific demographic of
users and to use that information to reduce the number of attributes to be
examined when attempting to identify a specific user. However, for the purposes of their
research, Daries et al. (2014) decided to focus only on their datasets’ vulnerability based upon a
homogeneity attack.
After the Person-Course dataset was de-identified for k-anonymity, Daries et al. (2014)
assessed the data for l-diversity sensitive variables, or attributes that may be especially
vulnerable if an attacker learned of their values. For example, a study about students in a
traditional college course may provide the gender, age, and ethnicity of the learners, but in order
for the data to be considered l-diverse, the sensitive variable of a student’s GPA would need to
be redacted in order to protect the privacy of those students. For the purposes of Person-Course,
Daries et al.’s (2014) analysis determined that the only sensitive variable was final course grade
(grade), which would be subject to removal from the dataset if believed to present homogeneity
vulnerability. My research also ascribed the sensitive variable value to the grade attribute.
Replication of Daries et al.’s Method
To replicate Daries et al.’s (2014) study, I received approval from Northeastern
University’s Institutional Review Board and signed a data release with MIT’s Office of
Institutional Research. Correspondence with Daries provided access to a GitHub page featuring
his study’s de-identification process manual and the open-source Python code I used to de-
identify my datasets. Daries also provided additional information regarding the background,
theory, and process for his study via the MITx and HarvardX Dataverse, which included the
Person-Course Documentation (Daries, 2014) and Person-Course De-identification (Daries,
2014) files. I consulted these materials frequently throughout the data collection, coding, and analysis processes.
Data Collection
The research process consisted of two distinct phases: the simultaneous document review
and legal analysis of FERPA, and the coding of identified MOOC datasets. The document review
and analysis included an evaluation of the case law that examines the application of the key
terms found in C.F.R. Title 34 Part 99 Subpart A §99.3, and Subpart D §99.31(b)(1) and (2)
which regulate the conditions in which an institution may disclose information without seeking a
student’s prior consent. The process of de-identifying the MOOC datasets included running the
Python program written by Daries.
FERPA Document Review and Legal Analysis. The document review and legal
analysis was conducted in order to determine the statutory definition of key terms and
regulations for the collection, retention, and dissemination of a student’s education record.
Subpart A §99.3 provided term definitions and Subpart D §99.31(b)(1) and (2) stipulated the
regulations for releasing student information without that student’s consent. The definitions and
case law review provided the infrastructure for the analysis of both the de-identified datasets and
the content included in the datasets that might be considered an educational record. The key
terms reviewed included student, attendance, educational agency or institution, educational
record, and personally identifiable information (PII). The review of Subpart D §99.31(b)(1)
and (2) provided the context in which the de-identification process would be necessary in order
to permit the release of a dataset.
Sampling Populations for De-Identification Process. My study sought to expand the
scope of Daries et al.’s (2014) study through purposive sampling which included datasets from
the two most popular MOOC providers, edX and Coursera. These platforms were selected not
only due to their prominence in the MOOC industry, but also for their focus on accessibility to higher
education, wide range of course offerings, average number of users per course, terms of service
agreements, and user privacy policies. Udacity, another popular MOOC provider, was not
included in this study as it recently shifted its focus to providing computer science courses and
nanodegree programs through partnerships with corporate sponsors, not post-secondary
institutions.
Datasets were requested from edX, Coursera, and Daries. edX was unable to provide
datasets per their agreement with their partner institutions, but recommended requesting datasets
directly from those partner institutions, which included MIT, Daries’ home institution. Coursera
did not respond to any inquiries. My requests for datasets from 12 of Coursera’s partner
institutions were also denied. Daries responded by providing instructions for requesting access to
the datasets he and his team used for his study, which were MITx courses hosted on the edX
platform, as well as links to the de-identification Python code stored on GitHub, an open-source,
project hosting website.
Through the MITx Data Request protocol, I received access to four MITx datasets: MITx
2.01x (2013), MITx 3.091x (2013), MITx 8.02x (2013), and MITx 8.MReV (2013). These
datasets were selected from the collection of the original 16 datasets used in the Person-Course
study and were chosen due to the size of the user population. Sampling from courses with
smaller user populations allowed for easier data management and reduced the number of records
to be deleted. Yet these datasets were still large enough to be representative of a typical
MOOC dataset, with a mean user population of 20,586. The datasets were stored on a secure,
encrypted external hard drive and transferred electronically using a Pretty Good Privacy (PGP)
key. Once de-identified and assessed, the original datasets were deleted.
Using the data request method suggested by edX, datasets were solicited from Coursera’s
partner institutions. Using convenience sampling, 12 institutions located in the United States, and
thus potentially subject to FERPA compliance, were contacted via email to request
access to their Coursera-hosted course datasets. However, no institution was willing to
participate in this study. Even though partner institutions have unique, individual contracts with
Coursera, many of the universities I contacted declined my request for data citing their terms of
use agreement with the provider. These agreements prohibited sharing their participants’
identities without seeking permission from the users whose information was included in the
datasets (Coursera, 2015). Providing me with their datasets would require the partner institutions
to contact potentially thousands of domestic and international users. Resources were not
available to accomplish this task.
Attempting to De-Identify MOOC Datasets
The original de-identification code was forked, or copied, from Daries’ GitHub page
onto my GitHub page and then imported into the software program PyCharm. The datasets were
also imported into a private directory in PyCharm, which allowed the code to be run on the raw
dataset in a protected virtual environment. The data was then converted from SQL to CSV files
and run through the de-identification code in Jupyter Notebook. The results were imported and
saved in PyCharm.
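The SQL-to-CSV conversion step might look like the following sketch, assuming a SQLite source; the actual format of the SQL data received from MIT is not specified here, and the table name is hypothetical.

```python
import csv
import sqlite3

def export_table_to_csv(db_path, table, csv_path):
    """Dump one table from a SQLite database to a CSV file so that the
    de-identification code can consume it. (Illustrative only: the
    original datasets arrived as SQL data, not necessarily SQLite.)"""
    con = sqlite3.connect(db_path)
    cur = con.execute(f"SELECT * FROM {table}")  # table name assumed trusted
    header = [col[0] for col in cur.description]
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)   # column names as the CSV header row
        writer.writerows(cur)     # stream all rows into the file
    con.close()
```

A conversion of this kind preserves column names in the CSV header so the downstream code can address attributes by name.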
De-identification Code and Process. I attempted to de-identify the MITx 2.01x and
MITx 3.091x datasets. Due to programming errors, I was unable to perform the de-identification
process on the MITx 8.02x and MITx 8.MReV datasets. The de-identification progam was run
on the MITx 2.01x dataset six times and the MITx 3.091x dataset once.
In order to prepare the datasets for de-identification, and per Daries et al.’s (2014)
original research design, each user was given a 16-digit identification number composed of both
a unique identifier and the course ID. The datasets were then evaluated by quasi-identifiable,
user-specific attributes: IP address, gender, year of birth, enrollment date, last day active, days
active, and number of forum posts. I selected these attributes to be consistent with the original
study. Daries et al. report choosing these variables due to their increased probability of being
publicly available.
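The pseudonymous identification step can be illustrated with a short sketch. The hashing scheme and digit layout here are assumptions for illustration; the original study specifies only that each identifier was 16 digits and combined a unique identifier with the course ID.

```python
import hashlib

def pseudonymous_id(username, course_id, digits=16):
    """Derive a stable numeric pseudonym from a username and course ID.
    The hash is one-way, so the released ID cannot be reversed to recover
    the username. (Hypothetical scheme, not Daries et al.'s exact format.)"""
    payload = f"{course_id}:{username}".encode("utf-8")
    digest = int.from_bytes(hashlib.sha256(payload).digest(), "big")
    return str(digest)[:digits].zfill(digits)

print(pseudonymous_id("learner42", "MITx/2.01x/2013"))  # a stable 16-digit string
```

Because the same input always yields the same pseudonym, records belonging to one user remain linkable within a dataset without exposing the username itself.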
I applied generalization and suppression to these attributes to reduce re-identification
risks and to delete extreme outliers in the dataset, which allowed for the analysis of
the truncated mean. Country names, derived from the users’ IP addresses, were changed to their
respective geographic regions, and, in order to reduce skew in the results, users with 60 or more
forum posts were deleted. Then the data was concatenated by stringing the quasi-identifier
variables together into groups no smaller than five students. In order to minimize the impact on entropy, the
code was applied systematically to each quasi-identifier represented in the utility matrix. This
process attempted to yield a k-anonymous and l-diverse dataset ready for its utility assessment.
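The generalization and suppression steps described above can be sketched as follows. This is a simplified reconstruction, not Daries’ code: the country-to-region mapping, the record layout, and the default parameters are assumed for illustration, though the 60-post outlier threshold and the minimum group size of five follow the text.

```python
from collections import Counter

# Assumed country-to-region mapping (the study derived regions from
# IP-based country names; only a fragment is shown here).
COUNTRY_TO_REGION = {"France": "Europe", "Ghana": "Africa", "India": "Asia"}

def generalize_and_suppress(records, quasi_ids, k=5, forum_cap=60):
    """Generalize country to region, delete extreme forum-post outliers,
    then suppress any record whose quasi-identifier combination is shared
    by fewer than k records, yielding a k-anonymous table."""
    staged = []
    for row in records:
        if row["nforum_posts"] >= forum_cap:
            continue  # outlier deletion, per the 60-post threshold
        staged.append(dict(row, country=COUNTRY_TO_REGION.get(row["country"], "Other")))
    # Count each quasi-identifier combination, then drop rows whose
    # group falls below the k-anonymity threshold.
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in staged)
    return [r for r in staged if counts[tuple(r[q] for q in quasi_ids)] >= k]
```

Each surviving record is then indistinguishable from at least k − 1 others on the chosen quasi-identifiers.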
I then attempted to determine the utility of the k-anonymous and l-diverse datasets by
completing the utility matrix. Composed of a nine-by-three grid, this matrix measured the de-
identified dataset’s entropy, mean, and standard deviation for each quasi-identifier (see Table
3.3). Generated by the Python code, this matrix was run on the original identified dataset and
once again each time a variable was coded for k-anonymity. The utility matrix was to be
recorded for each iteration of the analysis for each dataset, but the program encountered an error,
preventing the utility matrix from being completed.
Table 3.3. Pre-Utility Matrix for MITx 2.01x
Variables Entropy Mean (n) Standard Deviation
viewed 0.893515 0.689704 0.462615
explored 1.38352 0.194336 0.395689
certified 0.345462 0.0646054 0.245828
grade 1.80109 0.0692774 0.211211
nevents 8.29603 799.21 2229.94
ndays_act 4.177 9.48864 17.5364
nplay_video 5.49129 78.207 239.401
nchapters 2.92928 3.90965 3.7522
nforum_posts 0.640539 5.8006 26.4147
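The three quantities in each row of Table 3.3 can be reproduced for any single variable with a short sketch. This is illustrative only, assuming a plain list of numeric values rather than the study’s full dataset pipeline.

```python
import math
from collections import Counter

def utility_row(values):
    """Shannon entropy (in bits), mean, and population standard deviation
    of one variable: the three columns of the utility matrix."""
    n = len(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(values).values())
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return entropy, mean, std

# e.g. a binary 'viewed' flag for four hypothetical users
print(utility_row([1, 1, 1, 0]))
```

Comparing these values before and after de-identification quantifies how much information the generalization and suppression steps removed.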
Analysis
Document Review of FERPA. I determined if MOOC providers’ datasets could meet
the statutory requirements of FERPA by analyzing the regulatory definition of the terms
educational record, PII, student, and educational agency or institution as found in Subpart A
§99.3. I also assessed if MOOC users may be considered students according to §99.3 and the
relevant case law. An in-depth analysis of the statute’s applicability to MOOCs is provided in the
subsequent chapter.
Measuring K-anonymous Utility. The de-identified datasets were then analyzed to
determine their utility. In the original study, this process allowed Daries (2014) and his team to
quantify the impact the deletion of variables had on the accuracy of the de-identified dataset.
The analysis was to measure the change between the raw datasets and the k-anonymous,
l-diverse datasets by employing a utility matrix (see Table 3.3) modeled on Dwork’s (2006)
utility vector. This matrix was also designed to measure the shift in Shannon entropy, a common
metric in information theory, as well as the mean and standard deviation of nine nominal
variables between the pre- and post-de-identification datasets. However, due to unresolved bugs
in the code, I was unable to measure the utility of any of the k-anonymous datasets.
Limitations
My inability to gain access to a Coursera dataset was a significant limitation of this study.
Without a representative dataset from a second MOOC provider, I was unable to determine if
this methodology can effectively de-identify non-edX data. Therefore, I was unable to answer
my secondary research goal of determining a standardized methodology for MOOC data de-
identification. Additionally, Daries et al. (2014) did not provide the standards by which they
determined whether a dataset has maintained its utility. This is problematic as the utility impact may
vary depending upon how the attributes are grouped, categorized, or eliminated. Also, there are
currently no industry standards for quantifying dataset utility.
With the additional goal of creating a systematic process for de-identifying datasets that
may be used on any type of MOOC provider and still maintain the dataset’s efficacy, my study
necessitated defining utility as “the validity and quality of a released dataset to be used as an
analytical resource” (Woo, Reiter, Oganian, & Karr, 2009). The values for entropy, mean, and
standard deviation will be discussed in Chapter 5. The broad scope of this term offered a baseline
understanding of what should be the resulting usability of a de-identified dataset. However, it
must be noted that though a general definition of utility is provided for my study, in practice,
utility may be determined on a case-by-case basis, dependent upon the needs of the individual
using the dataset.
Finally, I encountered a number of bugs in Daries’ program, which will be discussed
more in depth in Chapter 5. Due to these problems with the code, I was unable to complete the
method in its entirety as outlined in this chapter. This limitation of my study is reflective of the
problem with Daries’ code, not the method itself.
Chapter 4
Legal Analysis
In the aftermath of the Watergate scandal, when the public’s desire for governmental
transparency was at an all-time high (Stone & Stoner, 2002), Senator James Buckley proposed an
amendment to the General Education Provisions Act (GEPA) that would become the Family
Educational Rights and Privacy Act of 1974, more commonly known as FERPA (20 U.S. C.
§1232g). The rationale for FERPA, as articulated in Senator Buckley’s initial appeal to
Congress, recognizes the need to curtail the “abuses of personal data by schools and Government
[sic] agencies” (120 Congressional Record, 14580). Months later in the Joint Statement in
Explanation of Buckley/Pell Amendment (120 Congressional Record, 39862-39866), Senator
Buckley claimed the purpose of the law is to provide both parents and eligible students the
ability to review their education records, as well as limit the sharing of those records without
student or parental consent in an effort to promote student privacy. FERPA was authorized as an
amendment to GEPA and therefore did not undergo Congressional committee review, limiting its
legislative history to the Joint Statement (Stone & Stoner, 2002). FERPA became law in the
summer of 1974.
Over the past 40 years, FERPA has been amended eleven times and faced significant
criticism. Many of these amendments were enacted in response to nationally publicized, critical
incidents in higher education, such as the Campus Security Act in 1990, the USA PATRIOT Act
of 2001, and the Amendments of 2008 (Ramirez, 2009). However, because these amendments
were made in conjunction with other laws, such as the Jeanne Clery Act, or as an addendum to
the Higher Education Act, the legislative history for these amendments is also limited.
Despite these modifications, the statute’s language is indisputably imprecise, leaving
institutions to interpret the statute’s terminology of educational record to meet their own needs
(Graham, Hall, & Gilmer, 2008). Until 2008, the Dept. of Education actively abstained from
providing clarity for colleges and universities on how to interpret and implement FERPA
(Lomonte, 2010). This hesitation by the Dept. of Education to offer more guidance on FERPA
compliance is a consequence of the statute’s lack of detailed legislative history.
FERPA regulates K-12 and post-secondary education systems, but critics suggest it fails
to take into account the distinctive needs of these two very different populations (Lomonte,
2010). The Dept. of Education first recognized the disparate privacy goals of higher education
students and institutions through its 2011 proposal to strengthen protections around statewide
longitudinal data systems (L'Orange, Blegen, & Garcia, 2011). However, the application of this
law and the sharing of information is contingent on multiple factors including timing, the
relationship of the parties in question to the student, and the purpose for disclosure (Meloy,
2012).
To date, MOOCs have not been litigated in any United States court. Therefore, the
following examination of the statutory definitions of key terms in FERPA, and the review of
applicable case law, is intended to be persuasive only. The cases presented are not an
authoritative assertion of the binding precedent to be enforced on MOOC providers or MOOC
users.
Statutory Definitions of Key Terms as they Pertain to MOOCs
For MOOCs and the privacy needs of their users, examining how the definitions included
in FERPA relate to this learning platform is essential in determining whether
MOOC datasets can and should be de-identified in a manner that is compliant with FERPA. In
order to determine qualifications for compliance, as determined for the purposes of this study,
the terms evaluated in this analysis include student, attendance, educational agency or institution,
educational record, and personally identifiable information (PII). These definitions are provided
in FERPA in §99.3, as authorized by 20 U.S.C. §1232g.
Who is a Student? The statute defines a student as “any individual who is or has been in
attendance at an educational agency or institution and regarding whom the agency or institution
maintains education records” (§99.3). However, determining who and under what conditions an
individual meets the statutory definition of a student is a complicated process. The term student
appears 208 times in the statute, often in conjunction with other key terms such as educational
record or PII. This is especially problematic considering these terms heavily rely on the
designation of student in their own definitions. For example, FERPA classifies many types of
information that may be considered a component of an educational record, but each relies on the
qualifier that it relates to the student in some way. The definition of a student is not independent
from the term educational record, and the meaning of educational record cannot be understood
without including the term student. The same is true of PII and attendance.
FERPA is only authorized to regulate records and information that pertain to students;
therefore, it is reasonable to conclude that the reliance on the term student is necessary for the
success of the statute, but is problematic due to its circular nature (Young, 2015). FERPA’s
definition of “student” is vague, creating difficulty in determining if a new type of learner may
seek protection under FERPA, or if a new learning platform may be subject to regulation.
The term student has maintained its original meaning from FERPA’s enactment in 1974.
Without any amendments that directly address the definition of the term student, one must turn to
case law in assessing if a MOOC user can be considered a student under the statute. The
application of the definition of student is examined in a number of cases, including Klein
Independent School District v. Mattox, 830 F.2d 756 (5th Cir. 1987), and Tarka v. Franklin, 891
F.2d 102 (5th Cir. 1989).
Klein Independent School District v. Mattox. Under the newly established Texas Open
Records Act, a request to review the college transcripts of Rebecca Holt, a teacher in the Klein
Independent School District, raised questions regarding the FERPA rights of employees. Klein v.
Mattox (1987) examines if FERPA may be used to protect educational records that are included
in a personnel record. The United States Court of Appeal for the 5th Circuit held that, because
Holt’s relationship with the Klein Independent School District was as an employee and never as
a student who attended classes within the district, she could not seek relief under FERPA.
Klein’s significance for MOOCs extends beyond the definition of student and raises the
question of the value of personal privacy when contrasted against the public’s best interest in the
context of FERPA. The court did suggest the need to vet the competency and credentialing of the
school district’s educators outweighs Holt’s desire to keep her transcripts private, thus the release
of such information does