THE DE-IDENTIFICATION OF MOOC DATASETS TO DEMONSTRATE THE POTENTIAL FERPA
COMPLIANCE OF MOOC PROVIDERS
A thesis presented by
Michelle H. Lessly, M.Ed.
To
the Doctor of Law and Policy Program
In partial fulfillment of the requirements for the degree of Doctor of Law and Policy
College of Professional Studies Northeastern University Boston, Massachusetts
June 2016
ACKNOWLEDGEMENTS
Completing this thesis was not a solitary task. I want to thank the faculty and staff of the
Doctorate of Law and Public Policy at Northeastern University. I want to extend a special
expression of gratitude to Dr. Edward F. Kammerer, Jr., my primary advisor, and Dr. Neenah
Estrella-Luna for her patience and support throughout this endeavor. I would like to thank
William D. McCants, Esq., my second reader. Additionally, it is with deep appreciation that I
want to recognize my peers and friends, Cohort VIII. I am forever grateful for your challenge,
support, and friendship over the past few years, and the many years to come.
I also want to recognize my family and friends who supported me throughout this
program. Specifically, I want to thank my parents who have been an unrelenting source of
encouragement. Since I was young, you have provided me the resources and opportunities
through which I could pursue my dream of earning a terminal degree. I am proud to be your
daughter; I hope I have made you proud in return.
Additional thanks to: Clinton Blackburn, Todd Karr, John Daries, Rachel Meidl,
Monqiue Cunningham Brijbasi, Keenan Davis, Ted Johnson, Bryan Coyne, Noradeen Farlekas,
Jalisa Williams, Joni Beshansky, Michelle Puhlick, Jonathan Kramer, Melissa Feiser, Melody
Spoziti, Dr. Anne McCants, Jon Daries, Julie Rothhaar-Sanders, Nivedita Chandrasekaran,
Rebeca Kjaerbye, Kristen Covino, and the many friends and colleagues who supported me
throughout this program.
ABSTRACT
The disruptive technology of massive open online courses (MOOCs) offers users access
to college level courses and gives MOOC providers access to big data concerning how their
users learn. This data, which is often used for educational research, also includes users’
personally identifiable information (PII). The Family Educational Rights and Privacy Act of
1974 (FERPA) protects PII and the educational records of students who attend traditional
educational institutions, but the protection of this legislation is not currently extended to MOOC
providers or their users.
A legal analysis of FERPA demonstrates analogous relationships between key statutory
definitions and MOOC users, providers, and their datasets. By imposing the k-anonymity and l-
diversity standards, this replication study of Daries et al.’s (2014) work attempts to de-identify
MOOC datasets in accordance with C.F.R. Title 34 Part 99 Subpart D §99.31 (b)(1) and (2) to
exhibit how to redact these datasets to be FERPA compliant and still maintain their utility for
research purposes. This study also seeks to determine if this de-identification method can be
standardized for universal use across MOOC providers.
The replication study, coupled with the legal analysis, suggests FERPA may not be the
proper statute to regulate the privacy protections MOOC providers afford their users. Rather, the
U.S. Department of Education and Congress should promulgate policy that outlines the
minimum privacy standards MOOC providers and other disruptive technologies afford their
users. Future research will aid in determining best practices for de-identifying MOOC datasets.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS ......................................................................................................... 2
ABSTRACT ................................................................................................................................... 3
Introduction ................................................................................................................................... 9
Literature Review ....................................................................................................................... 14
MOOCs and Public Policy ...................................................................................................... 14
The Family Educational Rights and Privacy Act of 1974 ...................................................... 16
Theoretical Framework ........................................................................................................... 18
Digital Privacy Theory. ..................................................................................................... 18
Solove’s Taxonomy of Privacy. ........................................................................................ 21
Critical Review of the Literature ............................................................................................ 24
Method and Research Design .................................................................................................... 28
Objectives and Research Question .......................................................................................... 28
Understanding Daries et al.’s De-identification Process ........................................................ 30
K-anonymity ..................................................................................................................... 30
L-diversity ......................................................................................................................... 36
Replication of Daries et al.’s Method ..................................................................................... 37
Data Collection ....................................................................................................................... 37
FERPA Document Review and Legal Analysis ............................................................... 38
Sampling Populations for De-Identification Process ........................................................ 38
De-identification Code and Process. ................................................................................. 40
Analysis................................................................................................................................... 42
Document Review of FERPA ........................................................................................... 42
Measuring K-anonymous Utility. ..................................................................................... 42
Limitations .............................................................................................................................. 42
Legal Analysis ............................................................................................................................. 44
Statutory Definitions of Key Terms as they Pertain to MOOCs ............................................ 45
Who is a Student? ............................................................................................................. 46
How is Attendance Defined? ............................................................................................ 50
Are MOOC Providers Educational Institutions or Agencies? .......................................... 51
What Constitutes an Educational Record? ........................................................................ 53
What is PII and how is it Protected? ................................................................................. 57
FERPA’s Application to MOOCs ........................................................................................... 62
Results .......................................................................................................................................... 63
Results of De-identification Process ....................................................................................... 63
Iteration I, MITx 2.01x. .................................................................................................... 63
Iterations II-IV, MITx 2.01x. ............................................................................................ 65
Iteration I, MITx 3.091x. .................................................................................................. 65
Troubleshooting the Program. .......................................................................................... 66
Assessing Replicability ........................................................................................................... 69
Effectiveness of Daries’ De-identification Program. ........................................................ 69
Role of Terms of Service Agreements and Privacy Policies on Data Releases for De-identification. ....... 70
Protecting and Releasing User Data ................................................................................. 73
Results of Legal Analysis ....................................................................................................... 73
Are MOOC Users Students? ............................................................................................. 74
Does Enrolling in a MOOC Constitute Attendance? ........................................................ 75
Are MOOC Providers Educational Institutions or Agencies? .......................................... 76
Are MOOC Datasets Classified as Educational Records and do they Include PII? ......... 76
Is Metadata an Educational Record? ................................................................................. 77
Is Metadata PII? ................................................................................................................ 78
Recommendations and Conclusions .......................................................................................... 80
Conclusion .............................................................................................................................. 80
Recommendations ................................................................................................................... 83
For the Department of Education. ..................................................................................... 83
For Congress. .................................................................................................................... 83
For Researchers ................................................................................................................. 84
For MOOC Providers. ....................................................................................................... 84
References .................................................................................................................................... 86
Appendix A: Notification of IRB Action ................................................................................... 92
Appendix B: Outbound Data Use Agreement: MITx Data .................................................... 93
Appendix C: De-identification Code ....................................................................................... 101
LIST OF TABLES
Table 3.1. Measures ...................................................................................................................... 29
Table 3.2. Variables ...................................................................................................................... 34
Table 3.3. PreUtility Matrix for MITx 2.01x ................................................................................ 41
Table 4.1. Definitions of a Student ............................................................................................... 49
Table 5.1. Variables Selected when Running Daries De-identification Program on MITx 2.01x, Iterations I-III .......................................................... 67
LIST OF FIGURES
Figure 3.1. Risk of Re-identification due to the Intersection of MOOC User Data, Quasi-identifiers, and User-generated, Publicly Available Information ........................................ 31
Figure 3.2. Example of Suppression and Generalization Emphases ............................................ 32
Chapter 1
Introduction
Massive open online courses (MOOCs) offer a promising 21st Century solution to the
problem of access and affordability of higher education. Initially launched in the United States in
2011, MOOCs offer low-to-no cost college-level courses through partnerships with universities
or corporations. This disruptive educational model differs from the traditional college model or
online courses. MOOCs have no admission requirements and occur entirely online, allowing
thousands of users from around the world to take a class simultaneously and learn from one
another through interactions on discussion forums (Jones & Regner, 2015; Young, 2015). These courses
are often offered on demand and deliver course content through videos, filmed lectures,
discussion boards, forums, readings, and homework, all without the active intervention of a
professor. MOOCs, operated by third party providers, can be affiliated with a post-secondary
institution such as Harvard and MIT’s edX, the only open source, nonprofit MOOC provider
(edX, 2016). They can also operate as a private company such as Udacity and Coursera, both of
which were co-founded by former Stanford professors.
MOOC enrollment continues to grow annually by 6% (Allen & Seaman, 2014), now
reaching approximately 16 million users worldwide (Shah, 2014). The New York Times declared
2012 as the “year of the MOOC” (Pappano, 2012), but by 2014, skepticism regarding the MOOC
revolution was at an all-time high (Friedman, 2014). This doubt may have been propelled by
developmental setbacks such as San Jose State University’s unsuccessful attempt to offer
Udacity courses to its underprepared students (Rivard, 2013)1. Numerous reports reveal MOOC
1 In January 2013, San Jose State University announced a pilot program, in partnership with Udacity, to offer three entry-level MOOC courses to matriculating students (Fain, 2013). However, due to poor student performance, the pilot was cancelled in June 2013 (Rivard, 2013).
course attrition rates consistently hover between 90% and 96% (Pope, 2014). Still, the claims that
MOOCs miss the mark overlook the innovations they contribute to the field of educational
technology and research. The truly transformative nature of this non-formal education platform
rests not in the method of knowledge delivery or course retention rates, but in the opportunities it
creates for the analysis of knowledge acquisition, especially in the digital age.
MOOCs have a multi-pronged business model, for in addition to providing access to
college courses, MOOCs function as education data warehouses. This information is known as
metadata, or “structured information that describes, explains, locates, or otherwise makes it
easier to retrieve, use, or manage an information resource” (National Information Standards
Organization, 2004, p. 1). For MOOCs, this includes users’ personally identifiable information
(PII)2 as described in the Family Educational Rights and Privacy Act of 1974 (20 U.S.C. §1232g; Title 34
CFR Part 99), commonly known as FERPA, as well as data about the amount of time a user
spends watching a video, mouse clicks on a page within the course’s site, and the frequency with
which a user logs onto the learning platform. With an average of 43,000 registrants per course
(Ferenstein, 2014), one MOOC course can generate up to 20 terabytes of data (Hazlett, 2014).
Such collections of metadata can accumulate to become big data, which are datasets that are not
only massive, but are easily searchable and sortable (Boyd & Crawford, 2012) and are retained
for the purposes of evaluating minute details to determine patterns or trends within representative
sample populations (Young, 2015).
Big data creates privacy concerns for both users and data holders. As a wider cross-section of
organizations and companies collects data on the different facets of a user’s life, the resulting
digital dossier creates new privacy challenges.
2 FERPA defines PII as the student’s name, the names of the student’s family members, the student’s address, personal identification numbers, other indirect identifiers such as birthdate, and other information that may be linked to a specific student (Title 34 Part 99 Subpart D §99.3).
Big data has an exceptional
ability to connect seemingly isolated pieces of information to create a holistic depiction of an
individual’s identity. These digital dossiers create a tension between the utility, or usability, of
big data and expectations for consumer privacy grounded in law and ethics. The legal and ethical
framework that guides data management must be broad enough in scope to address the
potentially conflicting needs of both data holders and the individuals providing the content of the
dataset. Dataset owners must respect those individuals by assuming the responsibility for
protecting their privacy rights (Hoser & Nitschke, 2010).
Within the context of education, FERPA, a federal statute, and its attendant regulations
with interpretative guidance detail the regulatory obligations schools have when safeguarding
student data. This law protects student privacy by regulating the collection, retention, and
distribution protocols educational institutions use to collect the information included in student
educational records. Unfortunately, the protections afforded to traditional students have yet to be
extended to MOOC users, since the U.S. Department of Education has not yet determined
whether MOOC providers are classified as educational agencies under FERPA (Young, 2015).
This leads some educators to speculate that the Department does not believe it has the authority
to determine if FERPA is applicable to this new learning platform (Kolowich, 2014). This
conjecture is further supported by the fact that MOOC providers do not currently receive federal
funding, a prerequisite of the FERPA compliance structure. Moreover, to further complicate the
question of applicability:
If FERPA applies to MOOCs, it is more likely to apply to the data, not the MOOC
provider itself. Thus, data ownership becomes an important component of how FERPA
relates to MOOCs. If data is owned by an actual educational institution, then use of that
data must follow a fairly standard pattern: The institution can share the data with student
consent or share the data absent consent through exceptions or de-identification (Young,
2015, p. 578).
Thus, FERPA, conceived over four decades ago, was not prepared for the reciprocal partnership
between MOOC providers and postsecondary institutions, and neither the Dept. of Education nor
Congress has taken deliberate steps to address the issue of MOOC user data privacy. MOOC
providers have not been officially recognized by the Dept. of Education as
educational agencies, and since MOOCs do not receive federal funding, which would require
them to comply with FERPA, MOOC users are left without the same safeguards afforded to their
university student counterparts who attend the same class in-person or on-line. edX is currently
the only MOOC provider that voluntarily complies with FERPA (edX, 2014).
That said, MOOCs are becoming a more widely-accepted form of higher education, as
demonstrated by the partnership between edX and Arizona State University (ASU). Their
collaboration, known as the Global Freshman Academy, offers for-credit courses to
matriculating students at a significantly reduced tuition rate. MIT’s admissions-free
MicroMasters program provides would-be students the opportunity to take an entire semester’s
worth of courses on the edX platform before taking a qualifying exam to earn admission to the
on-campus, one-semester full master’s degree program. Since MIT’s MicroMasters program
requires taking edX courses as part of the degree, might the enrollees of the program be
classified as MOOC users or students who should receive FERPA protection?
Therefore, the question of whether MOOCs should comply with FERPA warrants an
urgent response from the Dept. of Education. It also raises the related question of whether
MOOCs can be compliant with FERPA and still generate usable data for the purposes of
research. Policy makers must address the conflict between the regulatory requirements of
FERPA and the uniqueness of MOOCs. Examining this conflict through the lens of digital
privacy theory and Solove’s taxonomy of privacy will provide a critical perspective and
necessary understanding of how the Dept. of Education should address MOOCs’ evolving
impact on the American higher education system.
This study seeks to provide a solution to this burgeoning policy concern by asking: in
what ways might MOOC provider datasets be de-identified to meet the requirements of C.F.R.
Title 34 Part 99 Subpart D §99.31 (b)(1) and (2) of the Family Educational Rights and Privacy
Act of 1974, and still maintain their utility for the purposes of research dissemination? To answer
this question, an examination of the literature on MOOCs, FERPA, and digital privacy theory
will provide the context in which MOOC providers and policy makers must resolve this issue. A
legal analysis of the legislative and judicial history of FERPA will inform a methodology for de-
identifying MOOC datasets to be FERPA compliant. The results of this study will yield
recommendations for MOOC providers, researchers, and policy makers to resolve the concerns of
user privacy, data utility, and the potential need for MOOCs to comply with FERPA.
Chapter 2
Literature Review
MOOCs and Public Policy
MOOCs first focused on providing open access to courses at globally-recognized, highly
ranked universities such as Harvard, Oxford, Stanford, and MIT. They have since evolved to
offer courses ranging from Google-developed coding classes to public relations seminars and
conversational English courses for non-native speakers. Though a general level of digital literacy
is required for MOOC course navigation, users are not limited by course prerequisites or
admissions requirements to enroll in their course of choice. MOOCs operate under an open
learning model, requiring users to rely on self-motivation to progress through a course, rather
than external motivators such as deadlines for homework assignments or attendance
requirements. Moreover, by divorcing online learning from the matriculating enrollment model
at a traditional university, MOOCs have developed into a new type of non-traditional educational
program.
In light of the collaboration between ASU and edX to create the Global Freshman
Academy, many MOOC providers and postsecondary institutions are exploring, and in some
cases implementing, such hybrid educational models. The American Council on Education
(ACE) recommends colleges and universities offer credit for up to five MOOC courses (ACE,
2013). By 2013, both California and Florida state legislators considered recommendations to
make MOOCs part of the degree-granting curriculum for their public college systems. While
Florida legislators did approve the use of MOOC classes in the K-12 system, concerns regarding
course quality prevented expanding the bill to public postsecondary institutions (Inside Higher
Ed, 2013). Faculty union fears prevented California lawmakers from making MOOCs a
component of the state’s three public higher education systems (Kolowich, 2013). In Arizona,
however, the Global Freshman Academy drew over 34,000 registrants in its first year by offering
6 credit-granting, transferable classes at $200 per credit hour (Straumsheim, 2015). This MOOC
hybrid-model, if proven successful, challenges the traditional post-secondary education
experience.
This new type of educational experience is at the center of the FERPA compliance
problem for MOOCs as it presents many challenges for MOOC providers, their university
partners, and legislators. The amorphous state of the MOOC provider does not match current
legal constructs (Jones & Regner, 2015), nor do the privacy and safety needs of MOOC users
equitably align with current legislation. For example, the Clery Act requires colleges and
universities to track crime data on and around their campuses, but is a MOOC required to report
threats or an incident of sexual harassment between two students on a course’s discussion board?
Can a MOOC provider’s course site be considered a campus? What if these two students reside
in different countries?
The hybrid MOOC model presents even more of a challenge for FERPA in that
compliance requires a student’s enrollment at a recognized educational agency that receives
some form of federal funds. If a user signs up for a university-created, certificate-granting course
through edX’s platform, is the user enrolled as a student at that FERPA-regulated university, or
at edX, which is not currently an educational agency under FERPA rules? Or, is
the user not entitled to any of the FERPA protections available to a student in a physical
classroom?
The President’s Council of Advisors on Science and Technology (PCAST) recognized
the range of MOOC related privacy challenges in their 2014 report. The big data element of
MOOCs makes protecting user privacy much more demanding than in the case of a traditional
student whose FERPA-protected information is confined to PII, including their name, birthday,
and email address, and their educational record which contains information such as graded
coursework and transcripts. PII does not include the wide range of metadata collected by
MOOCs such as a user’s highest level of education, how many times they watched a course-
related video, or the date of their last activity on a discussion board. Thus, the majority of the
information held by MOOC providers would likely remain unregulated even if FERPA were to
apply (Young, 2015). This leaves tort law as a potential safeguard for metadata, but an ideal
privacy apparatus protects both PII and metadata. Thus, PCAST’s recommendations for privacy
protections include encryption1 and de-identification by removing full and quasi-identifiable2
variables from a dataset (Daries, Reich, Waldo, Young, Whittinghill, Ho, & Chuang, 2014).
These recommendations surpass FERPA’s current privacy regulations, demonstrating the
revisions necessary to bring FERPA up-to-date with digital privacy needs. No longer is simply
redacting PII sufficient to protect a student’s identity. Lawmakers must contemplate the totality
of the data collected on students when promulgating privacy legislation.
The Family Educational Rights and Privacy Act of 1974
First introduced as the Buckley Amendment and signed into law by President Ford in the
summer of 1974, FERPA enables students to control both access to and the content of their
educational record (Graham, Hall, & Gilmer, 2008). This statute regulates the privacy needs
of students in the K-12 system by allowing both students and their parents to review and correct
their educational record. FERPA does revoke parental review rights for
1 PCAST defines encryption as the process that converts data into a cryptography-protected form, rendering it useless to those without the decryption key. 2 Quasi-identifiers are pieces of data that, when combined with other data, can uniquely identify an individual. Examples include gender and birth date (Sweeney, 2002).
students once they turn 18 or are enrolled in a post-secondary institution, but it otherwise
remains applicable to colleges and universities.
Compliance is required of all institutions that receive federal funds, including federal
student aid and grant monies. Withholding these funds is the only statutorily authorized
enforcement mechanism permitted. However, when a FERPA complaint is filed, the Dept. of
Education prefers to resolve the matter through administrative actions such as required policy
revisions or trainings (Family Policy Compliance Office, 2015), rather than revoking federal
funds. The consequences of the latter not only penalize the academic institution but can also have
significant negative repercussions that are passed on to the student. Revoking an institution’s
federal funds due to a FERPA violation potentially means the institution can no longer afford to
educate students in the same way it did prior to the complaint. To date, the Dept. of Education has not
withheld funds for a FERPA violation (Young, 2015).
Since 1974, FERPA has been amended eleven times (20 U.S.C §1232g). As a result, this
statute is notoriously challenging to interpret and at times seems contradictory. Until 2008, the
Dept. of Education actively abstained from providing clarity for colleges and universities on how
to interpret and implement FERPA (Lomonte, 2010). In that same year, the Secretary of
Education issued an amendment to FERPA in order to implement stricter written notification
requirements for the release of student records to a third party, including parents, while
simultaneously making notification exceptions when information is released for the purposes of
research (Ramirez, 2009; Family Educational Rights and Privacy Act, 2008).
These recent amendments demonstrate the conflicting nature of the privacy expectations
of students and their institution’s need to share student information for the purposes of
scholarship or safety. They highlight that FERPA was created in a time when its drafters were
unable to conceive of a virtual learning environment in which the scope of personally identifiable
data collected would be much more expansive than the current statutory definition of PII.
FERPA permits disclosing student data when the PII is de-identified3, but how might this process
be accomplished to scale for a MOOC course?
The historical interpretation of FERPA’s standard for de-identifying student PII may not
be enough to prevent the re-identification of MOOC users. The removal of PII in compliance
with FERPA will still leave behind additional quasi-identifying information, such as VPNs,
gender, and online user-generated content, which can be used to re-identify MOOC users.
Unfortunately, FERPA does not account for these quasi-identifiers. Therefore, even once a
MOOC dataset is de-identified according to FERPA’s regulations, the statute’s safeguards will
not be applied to the dataset’s quasi-identifiers, leaving that information public and unprotected.
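To make the risk concrete, the following sketch (with invented records and field names, not drawn from any actual MOOC release) shows how removing only FERPA-defined PII can leave quasi-identifier combinations that still single out a user:

```python
from collections import Counter

# Invented toy records for illustration only.
records = [
    {"name": "A. Jones", "email": "aj@example.com",
     "country": "US", "gender": "m", "birth_year": 1991},
    {"name": "B. Smith", "email": "bs@example.com",
     "country": "US", "gender": "m", "birth_year": 1991},
    {"name": "C. Wu", "email": "cw@example.com",
     "country": "IN", "gender": "f", "birth_year": 1988},
]

PII = {"name", "email"}                       # direct identifiers of the FERPA kind
QUASI = ("country", "gender", "birth_year")   # untouched by a PII-only redaction

# A FERPA-style de-identification removes only the direct identifiers.
redacted = [{k: v for k, v in r.items() if k not in PII} for r in records]

# Any quasi-identifier combination that occurs only once still isolates
# exactly one individual, who can be re-identified by linking outside data.
combos = Counter(tuple(r[q] for q in QUASI) for r in redacted)
re_identifiable = [c for c, n in combos.items() if n == 1]
print(re_identifiable)  # the IN/f/1988 record remains unique
```

In this toy dataset, the third user is still uniquely described by country, gender, and birth year after redaction, which is precisely the gap the de-identification standards discussed below are meant to close.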
Theoretical Framework
Digital Privacy Theory. As MOOC providers continue to develop their ability to gather
both PII and quasi-identifiers from their users, the need to ensure individual users’ privacy
grows. However, increasing privacy protections on this data may negatively impact the utility of
the dataset. To combat this problem, MOOC providers might employ the k-anonymity algorithm
(Sweeney, 2002), the l-diversity standard (Machanavajjhala, Kifer, Gehrke, &
Venkitasubramaniam, 2007), and Dwork’s (2008) differential privacy model.
Sweeney’s k-anonymity Algorithm. In an effort to better secure privacy within datasets
while retaining research utility, Sweeney (2002) recommends employing the k-anonymity
algorithm. Applying k-anonymity to a dataset of individual-level data points can “produce a
release of the data with scientific guarantees that the individuals who are the subjects of the data
3 See C.F.R. Title 34 Part 99 Subpart D §99.31 (b)(1), (2)
cannot be re-identified while the data remain practically useful” (p. 557). To be successful, a k-anonymous
dataset ensures that every combination of attribute values is shared by at least k records,
leaving each individual indistinguishable from at least k-1 others and reducing the ability to
re-identify an individual based on the totality of the information provided in the dataset.
By utilizing anonymization through the methods of generalization and suppression, k-anonymity
introduces noise into a dataset, diluting the information enough to keep it secure while
maintaining its utility.
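Concretely, a dataset satisfies k-anonymity when every combination of quasi-identifier values is shared by at least k records. A minimal Python sketch of this check (the records and field names below are invented for illustration, not drawn from any actual MOOC dataset):

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values appears in at
    least k records, i.e., each individual is indistinguishable from at
    least k-1 others on those attributes."""
    combos = Counter(
        tuple(rec[qi] for qi in quasi_identifiers) for rec in records
    )
    return all(count >= k for count in combos.values())

# Invented records whose age and location are already generalized.
records = [
    {"age": "25-29", "region": "New England", "grade": 0.91},
    {"age": "25-29", "region": "New England", "grade": 0.55},
    {"age": "30-34", "region": "Midwest", "grade": 0.78},
]

print(is_k_anonymous(records, ["age", "region"], 2))  # False: the lone Midwest record violates k=2
```

A dataset failing this check would need further generalization or suppression until its smallest group reaches size k.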
The two redaction methods, generalization and suppression, alter data while retaining the
type of attributes collected within the dataset. It is through these two methods that noise is
injected into the dataset and generates the k-value between attributes. Through generalization, a
specific attribute is removed but still captured through a generic, yet representative category. It
replaces specific attributes, such as ages or other data that can be represented accordingly, with
ranges. For example, using generalization, a 25-year-old male who lives in Boston, MA could be
represented in a k-anonymous dataset as a 25-to-29-year-old male who lives in the region of New
England. However, this method only works for certain types of data. Suppression is employed
for data that cannot be easily generalized. As in the previous example, the gender of the 25-year-old
male could be represented in a k-anonymous dataset as a symbol, most commonly an asterisk,
indicating the data was collected but suppressed for the purposes of anonymization. It is
important to note generalization and suppression may be used alone or in combination depending
upon different types of data and different research questions.
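The two redaction methods described above can be sketched in a few lines of Python; the record fields and the five-year bucket width are illustrative assumptions, not the scheme any particular provider uses:

```python
def generalize_age(age, width=5):
    """Generalization: replace an exact age with a representative range,
    e.g. 25 -> '25-29' with the default five-year buckets."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def suppress(_value):
    """Suppression: replace a value that cannot be easily generalized with
    an asterisk, signaling it was collected but redacted."""
    return "*"

record = {"age": 25, "gender": "male", "city": "Boston, MA"}
anonymized = {
    "age": generalize_age(record["age"]),  # generalization
    "gender": suppress(record["gender"]),  # suppression
    "region": "New England",               # location generalized by hand here
}
print(anonymized)  # {'age': '25-29', 'gender': '*', 'region': 'New England'}
```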
L-diversity. Whereas k-anonymity is a fairly comprehensive data privacy theory,
Machanavajjhala et al. (2007) argue that a k-anonymous dataset can still expose contextual
information through which individuals may be re-identified. l-diversity adds an additional level of protection for datasets that are
sensitive to privacy breaches due to the totality of the data made available to the public,
including not only the attributes represented in the data, but the background of the attributes.
Therefore, even if the 25-year-old male who lives in Boston, MA is represented in a k-anonymous
dataset as a * (25-to-29-year-old) who lives in Massachusetts, if that information is
published in an unaggregated manner that provides the context in which the data was collected,
the k-anonymous data is still vulnerable to attack. These attacks fall into two category types:
homogeneity attacks and background knowledge attacks.
Homogeneity attacks occur when the attributes in the dataset are not diverse enough to
create true anonymity on an individual level. For example, an attacker may know a user enrolled
in a MOOC who is a prolific poster on the course’s discussion board. An attacker who knows that
person’s age, gender, zip code, and course may be able to determine, through the process of
elimination, how many posts that individual made, if given access to that class’s discussion
board. Homogeneity attackers do not need to know the user personally; access to that user’s
demographic information is enough to make an identification.
Background attacks build on homogeneity attacks by using contextual information to
make an identification. Background attacks are a result of an attacker having personal knowledge
about a user and making connections between sensitive data and quasi-identifiers based on
societal background knowledge or information on a specific population represented in the
dataset. Continuing with the previous example, if the attacker also knew the user was struggling
with the course content and sought assistance from others in the class, the attacker may be able to
determine which posts were the user’s. This example demonstrates a background attack using
quasi-identifiers. Based upon the vulnerability presented by these attacks, the l-diversity
algorithm increases the noise in a dataset by increasing the diversity of sensitive attributes.
However, as sensitive attributes become more l-diverse, the utility of the data may be reduced.
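The l-diversity requirement can be checked mechanically: within each equivalence class (records sharing the same quasi-identifier values), the sensitive attribute must take at least l distinct values. A sketch with invented records:

```python
from collections import defaultdict

def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every equivalence class contains at least l distinct
    values of the sensitive attribute."""
    classes = defaultdict(set)
    for rec in records:
        key = tuple(rec[qi] for qi in quasi_identifiers)
        classes[key].add(rec[sensitive])
    return all(len(values) >= l for values in classes.values())

records = [
    {"age": "25-29", "region": "New England", "grade": "A"},
    {"age": "25-29", "region": "New England", "grade": "B"},
    {"age": "25-29", "region": "New England", "grade": "A"},
]
# k-anonymous at k=3, yet the sensitive 'grade' attribute is only 2-diverse.
print(is_l_diverse(records, ["age", "region"], "grade", 2))  # True
print(is_l_diverse(records, ["age", "region"], "grade", 3))  # False
```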
The Differential Privacy Model. Dwork (2008) also challenges k-anonymity, claiming
there is no such thing as an impenetrable privacy protection algorithm, and suggests the
differential privacy model provides a more effective anonymization solution. This model
injects noise into the release mechanism of the data, not into the data itself. Encoding
the data release, rather than altering the underlying data through methods such as
generalization and suppression, interferes with an attacker’s ability to accurately capture
information or to trace it back to re-identify individuals, while retaining the utility of
the data for the purposes of analysis. The differential privacy model focuses on producing
information about the data released in a published dataset.
This algorithm prevents an attacker from being “able to learn any information about any
participant that they could not learn if the participant had opted out of the database” (Tockar,
2014, n.p.). By adding noise to the release mechanism, such as a chart or graph, an attacker is
unable to determine seemingly random patterns in the data that may lead to re-identification.
Thus, the differential privacy model redefines the concept of digital privacy, moving from a
system that attempts to defend the entire dataset against attacks, to a tiered design that makes
datasets systematically less vulnerable when an inevitable attack occurs.
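Dwork’s model is commonly realized with the Laplace mechanism: random noise, calibrated to the query’s sensitivity, is added to the released statistic rather than to the stored records. A sketch of a noisy count (the records and the privacy parameter epsilon are arbitrary choices for illustration):

```python
import math
import random

def noisy_count(records, predicate, epsilon=0.5):
    """Release a count with Laplace(1/epsilon) noise, the standard
    calibration for a counting query whose sensitivity is 1. The noise
    is injected into the release, never into the stored records."""
    true_count = sum(1 for rec in records if predicate(rec))
    u = random.random() - 0.5  # inverse-transform sample of Laplace noise
    while abs(u) >= 0.5:       # avoid log(0) at the distribution's edge
        u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

records = [{"grade": g} for g in (0.91, 0.55, 0.78, 0.64)]
released = noisy_count(records, lambda r: r["grade"] >= 0.6)
# The attacker sees only the noisy value; adding or removing any one user
# changes the true answer by at most 1, which the noise masks.
print(round(released, 2))
```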
Solove’s Taxonomy of Privacy. In the context of MOOCs, user privacy should not
simply be reduced to the application of a security algorithm or a debate about identity protection.
A more satisfactory understanding of user privacy looks beyond anonymity and scrutinizes the
rationales behind the collection of the data in order to determine if it should be collected in the
first place. Solove’s (2008) taxonomy of privacy provides a framework for MOOC providers to
ethically develop and disseminate their user-populated datasets while maintaining the necessary
type of privacy. His argument that a single concept of privacy is not constant and cannot be
consistently applied reflects the complexities of the MOOC user privacy issue. By shifting the
locus of privacy from the data owner to the data subject, Solove’s taxonomy can explore the
impact of the integration of six privacy concepts: the right to be left alone, limited access to the
self, secrecy, control over personal information, personhood, and intimacy.
The concept of the right to be left alone is the underpinning for today’s privacy torts and
is similar to the notion that privacy is limited access to the self, a principle that insists an
individual should be the gatekeeper of their own personal information (Warren & Brandeis,
1890). The concept of secrecy, as popularized by Posner (1978), is the “appropriation [of] social
benefits to the entrepreneur who creates them while in private life it is more likely to conceal
discreditable facts” (p. 404). The desire for secrecy leads individuals to limit access to
information about themselves and leads to the concept of control over personal information,
which recognizes information as one’s personal property. The concept of personhood expands
upon that of personal property by viewing one’s information as a manifestation of one’s identity
and reputation. Finally, the concept of intimacy asserts the need to keep information private is
not just for the protection of one’s self, but to secure the information of those with whom the
individual may be associated. Whereas Sweeney and Dwork consider privacy from the utilitarian
perspective of the dataset owner, Solove recognizes that it is the individual who assumes more
risk when a third party, such as a MOOC provider, collects and disseminates data.
This becomes especially problematic due to exclusion, or “the failure to provide
individuals with notice and input about their records” (Solove, 2006, p. 521). Exclusion presents
a harm different from that of data privacy and security in that rather than being concerned with
re-identification, exclusion removes an individual’s ability to control what happens to their data
(Solove, 2006). FERPA’s primary goal is to eliminate exclusion, but it is this goal that further
complicates the application of FERPA to MOOCs. In order to register for a course, users are
often required to agree to the MOOC provider’s terms of service, which can exclude them from
the decision-making process as to how and when their information is used and deny them the
ability to review the data to ensure it is an accurate portrayal of their identity. This may become
problematic if MOOCs are required to become FERPA compliant, as it requires educational
agencies to grant students access to their educational record and the ability to correct it when
necessary. That said, those terms of service agreements that do not align with FERPA may
become void under the law, which easily resolves the policy concern, but still leaves MOOC
providers with the responsibility to audit massive amounts of data to ensure compliance.
When examining digital privacy from the user’s perspective, Solove’s model highlights
the porous nature of the relationship between data subjects and data holders. To rectify this, the
taxonomy identifies four activities of the data collection process: information collection,
information processing, information dissemination, and invasion. The taxonomy’s intentional
design around the data subject, identified as “the individual whose life is most directly affected
by the activities classified in the taxonomy” (Solove, 2008, p. 103), and not around a specific
privacy conception, allows for the evolution of privacy needs in the digital age.
A MOOC provider’s act of collecting information includes user registration information
and the surveillance of their subsequent activity online. This leads to the second action in the
taxonomy, processing information, which may be aggregated and analyzed without user
knowledge. Though the purposes of MOOC data research include learning about the potential
functionality of the platform and expanding the field of knowledge on education technology,
sharing this information can violate user trust. Moreover, the third activity, information
dissemination, reveals the vulnerability of MOOC users’ information. Poorly managed user
information creates opportunities in which information may be inappropriately disclosed or
privacy agreements may be violated, leading to the fourth activity of invasion. If a user’s
information is improperly disclosed, leading to an attack on their personhood, what impact might
this have on the likelihood they will feel safe enough to enroll in another MOOC course?
Critical Review of the Literature
If the Dept. of Education is to evaluate the relationship between MOOC providers and
user privacy concerns, it must also consider FERPA’s definition of PII as it pertains to big data.
The current statutory standard for de-identification is reducing or eliminating PII to create a
reasonable determination of anonymization (C.F.R. Title 34 Part 99 Subpart D §99.31 (b)(1)).
This binary conceptualization of privacy successfully operates in a traditional educational
setting, but cannot be reasonably applied in an online setting. Metadata, such as the course name,
when the course started, and the user’s IP address, are quasi-identifying data points that may be
concatenated for the purposes of re-identification (Daries, 2014). The current assumption, that
redacting what FERPA clearly considers to be PII provides sufficient user privacy protections, is
antiquated and may not hinder MOOC providers from openly sharing quasi-identifiers.
However, an examination of the current understanding of PII in a digital
learning environment might be moot, as some critics suggest FERPA does not pertain to
MOOCs. Since the Dept. of Education has remained silent on the matter, MOOC providers
currently have the liberty to make their own determination as to whether or not their course users
are protected by FERPA. Neither Udacity nor Coursera mentions a stance on FERPA
on its website, whereas edX, a provider owned and operated by Harvard and MIT,
specifically states it complies with FERPA.
Still, the undetermined status of FERPA’s applicability to MOOCs has the potential to
diminish the future utility of different providers’ data (Hollands & Tirthali, 2014). If the
Dept. of Education or Congress determines that MOOC providers are required to comply with
FERPA or other privacy regulations, those MOOC providers that have decided to not create
FERPA compliant datasets may be limited in their capacity to operate under their own business
models when attempting to share data with researchers. Moreover, if MOOC providers have no
clarity on what may legally or ethically be released, how then are researchers to take advantage
of MOOC-sourced big data?
Yet, a determination of mandatory compliance will not immediately resolve the issue of
user data privacy. Standardizing the privacy protection practice of traditional colleges and
universities is seemingly impractical if not impossible in the MOOC classroom. Whereas
redefining PII will aid in privatizing data, it does not remedy the problem of user exclusion
(Solove, 2011). MOOC providers require users to agree with their terms of service when
registering for a course, but the efficacy of these documents is dubious (Solove, 2013). Terms of
service agreements often rely on the average user not being well versed in the language and
structure of such documents, leading to common user misperceptions about the quality of privacy
controls (Turow, Feldman, & Meltzer, 2005). Since less than ten percent of individuals actually
read a terms of service agreement when registering for an online service (Smithers, 2011),
trusting in such contracts as a form of user consent for metadata collection is questionable at
best.
Fair Information Practice Principles (FIPPs) should be used to reduce users’ confusion
about their waived privacy. FIPPs insist that data holders act ethically with their data by
maintaining transparency of the data management process, keeping users informed of what
personal data is recorded, and seeking user consent when their data is repurposed (U.S.
Department of Health, Education, & Welfare, 1973). Incorporating FIPPs into FERPA’s
regulatory structure will help to reduce user confusion over their privacy controls and increase
MOOC provider accountability for data management practices. Or, MOOCs may use FIPPs and
FERPA as guidelines to create their own data privacy protection standards.
Additionally, policy makers must also consider how the global scope of MOOCs will
complicate statutory compliance. Whereas digital privacy theory can address the concerns
regarding data protection, it cannot account for cultural privacy norms. Solove’s taxonomy
intentionally allows for applicability within a cross-cultural context, but it fails to anticipate how
a culture’s understanding of power dynamics ebbs and flows through each activity of the data
collection process (Sweeney, 2012). This can be especially problematic when determining how
public policy applies to a MOOC dataset when the MOOC and the partner institution, or user, are
not American. Notably, the European Union has very detailed requirements for protecting its
citizens’ privacy, even when those citizens are accessing education resources outside of the EU.
Policy makers and MOOC researchers must pay additional attention to the issue of governance in
an international educational setting.
The National Association of College and University Attorneys (NACUA) recognizes that
the legal uncertainty surrounding FERPA and MOOCs may change at any point in time due to a
number of factors. For example, in the instances when a user borrows federal funds to pay for a
course, a professor incorporates MOOC course elements into their on-campus classroom
instruction, or postsecondary institutions require students to enroll in a MOOC course to gain
degree-seeking credit, MOOC providers will need to comply with FERPA (NACUA, 2013). It is
unreasonable to expect MOOC providers, which interact with hundreds of thousands
of users and numerous institutional partners in a given day, to self-monitor for these factors that
might change their compliance requirements. In order to optimize for both educational and
research potential, policy makers should examine how MOOCs can be effectively regulated
under FERPA.
Finally, the most prolific critics of MOOCs, university professors, claim this educational
delivery platform jeopardizes the tradition of the academy and the American system of higher
education. However, the data collected by MOOC providers may be advantageous in the
classroom and when conducting research. Unfortunately, the vast majority of MOOC research is
quantitative, and almost exclusively examines MOOCs from the perspective of user satisfaction.
Shifting the focus of MOOC research from determining the efficacy of the delivery method to
the utility of their user data will aid in the sustainability and mainstreaming of MOOCs in the
education marketplace for public, private, and online organizations.
Critiques and research on MOOCs can help MOOC providers and policy makers
understand better the barriers to the platform’s success. Rigorous studies of the San Jose State
failure have led to vast improvements in course design and content delivery (Lewin, 2013).
Investigations on open, self-directed learning indicate that user success may be contingent upon
their perception of the security of the online learning environment (Fournier, Kop, & Durand,
2014). If users think their metadata is too readily accessible to MOOC provider personnel or
believe that their privacy has been compromised, they are less likely to be retained (Hughes,
Ventura, & Dando, 2007). There is a need for increased attention to metadata privacy and for
regulatory oversight of MOOCs as a means of ensuring user retention.
Chapter 3
Method and Research Design
Objectives and Research Question
My study explored the feasibility of requiring MOOC providers to be FERPA compliant
by asking in what ways might MOOC provider datasets be de-identified to meet the requirements
of C.F.R. Title 34 Part 99 Subpart D §99.31 (b)(1) and (2) of the Family Educational Rights and
Privacy Act of 1974, and still maintain their utility for the purposes of research dissemination. In
addition to this question, my study also sought to determine a standard, systematic method for
de-identifying MOOC platform datasets.
My study was motivated by Daries et al.’s (2014) claim:
It is possible to quantify the difference between replications from the de-identified data
and original findings; however, it is difficult to fully anticipate whether findings from
novel analyses will result in valid insights or artifacts of de-identification. Higher
standards for de-identification can lead to lower-value de-identified data. . . If findings
are likely to be biased by the de-identification process, why should researchers spend
their scarce time on de-identified data? (p. 57)
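The concern in this passage can be made concrete: compute the same statistic on the original and the de-identified data and report the shift. A toy sketch with invented grades, using a crude suppression-emphasis rule that drops records in groups smaller than k:

```python
from statistics import mean

# Invented original records: (age_range, final_grade).
original = [("25-29", 0.90), ("25-29", 0.60), ("25-29", 0.75),
            ("30-34", 0.40), ("30-34", 0.55), ("35-39", 0.95)]

# Suppression emphasis at k=3: drop any record whose age range is shared
# by fewer than 3 records.
k = 3
counts = {}
for age, _ in original:
    counts[age] = counts.get(age, 0) + 1
deidentified = [(age, g) for age, g in original if counts[age] >= k]

# The mean grade shifts because the dropped groups had lower grades.
bias = mean(g for _, g in original) - mean(g for _, g in deidentified)
print(f"mean shift introduced by de-identification: {bias:+.3f}")
```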
To answer the research question, my study assumed a mixed-methods approach by
conducting a document review and legal analysis of FERPA, and attempting to replicate Daries
et al.’s research on measuring the impact the k-anonymity standard has on a MOOC provider
dataset while ensuring the potential for FERPA compliance. Daries and his team, comprised of
MIT and Harvard researchers, examined the feasibility of generating “a policy-based solution
that allows open access to possibly re-identifiable data while policing the uses of the data” (p.
58) according to the regulations promulgated in FERPA. Whereas Daries et al. approached the
problem of de-identification for the purposes of finding an equilibrium between privacy and
utility in advancing social research, my study examined the question of the application of C.F.R.
Title 34 Part 99 Subpart D §99.31(b)(1) and (2) to a publishable MOOC dataset for the purpose
of evaluating the feasibility of applying FERPA or other relevant public policies to MOOC
providers in order to protect users and their data.
Table 3.1. Measures

Measure: Can the de-identification process be successfully executed using the same protocol on sample MOOC datasets?
Definition: The de-identification process can be executed in the same manner on sample MOOC datasets and yield viable utility while maintaining FERPA compliance.

Measure: What is an acceptable level of utility?
Definition: Maintains a k-value of 5 for quasi-identifying variables and l-diversity for sensitive variables while minimizing entropy of the dataset after the de-identification of explicit-identifying variables (Daries et al., 2014).
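The entropy criterion in Table 3.1 can be operationalized with Shannon entropy: the drop in an attribute's entropy after generalization is one simple proxy for lost utility (Daries et al.'s actual utility matrix tracks additional statistics; the values below are invented):

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy, in bits, of a column's value distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

ages = [21, 22, 23, 26, 27, 31]                        # original attribute
generalized = ["20-24", "20-24", "20-24", "25-29", "25-29", "30-34"]

loss = shannon_entropy(ages) - shannon_entropy(generalized)
print(f"entropy lost to generalization: {loss:.3f} bits")
```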
Daries et al.’s research focused on the first edX dataset to be made publicly available,
known as the HarvardX-MITx Person-Course Dataset AY2013 (Person-Course). In an effort to
validate and expand upon their work, my study employed the k-anonymity standard, a process in
which data unique to a user are removed to reduce the risk of re-identification, on at least one
dataset from two MOOC providers. Since FERPA does not require a precise value for k-
anonymity, Daries et al. consulted the Department of Education’s Privacy Technical Assistance
Center standards and determined that a k-value of five (k-5) created a safely de-identified dataset
and met MIT’s standards for de-identification. My study used the same metric of de-
identification.
In keeping with the original research, I generated k-anonymous datasets through the
methods of generalization emphasis and suppression emphasis. Daries et al. stressed the purpose
of engaging both generalization and suppression emphases was to evaluate each method’s merits
and challenges as they related to the utility impact on the data. Therefore, my study evaluated both
generalization and suppression on their ability to better secure users’ personally identifiable
information and to meet the standards as promulgated in C.F.R. Title 34 Part 99 Subpart
D §99.31(b)(1) and (2).
Understanding Daries et al.’s De-identification Process
Daries et al.’s method for de-identification included applying k-anonymity and l-diversity
to MOOC datasets. Additionally, to quantifiably measure the shift in efficacy of the datasets,
they employed a utility matrix as seen in Table 4.3. The authors’ utility matrix was modeled after
Dwork’s (2006) utility vector, which combined descriptive and general statistics to assess the
utility impact the de-identification had on the MOOC datasets.
K-anonymity. To begin the de-identification process, Daries et al. determined which
attributes, or quasi-identifiers, within the existing identified dataset should be removed to meet
MIT’s Institutional Research standards for both anonymization and report composition. The
challenge in de-identifying Person-Course came with the amount of quasi-identifiers available
within the data. One quasi-identifier may not be enough to distinguish a user, but as more unique
attributes are made available, a more holistic account emerges, making a user more
vulnerable to attack. Additionally, if a user were actively posting about their MOOC experience
on social media during the course, this increases the likelihood for re-identification based upon
the information provided in the publicly available Person-Course dataset (see Figure 4.1).
Controlling for this potential variable was too challenging for Daries et al., but they theorized
that it could be offset by using a higher standard for anonymization.
To do this, Daries et al. (2014) used Sweeney’s (2002) k-anonymity model. In the case of
a MOOC dataset, which can have quasi-identifiers ranging from username to the number of
mouse clicks per page, a greater k-value is required to promote anonymity. For the purposes of
the Person-Course dataset, the researchers assigned a value of k-5, meaning the “k-anonymized
dataset has at least 5 records for each value combination” (Emam & Dankar, 2008, p. 628). In
order for this de-identification approach to be successful, the researchers determined they needed
to generalize or suppress quasi-identifiers until every combination of their values appeared in at
least five records, which in turn served as a filtering mechanism in reducing the risk of re-identification. As the k-value increases, the data’s vulnerability
to attack decreases. However, as Daries et al. noted, as the k-value increases, so does the
likelihood that the utility of the data may be compromised.

Figure 3.1. Risk of Re-identification due to the Intersection of MOOC User Data, Quasi-identifiers, and User-generated, Publicly Available Information. [The figure depicts how data collected by MOOC providers (user name, IP address, email address, course grade, gender, course name, birthdate, enrollment date) intersects with user-generated, publicly available information (blogs, Facebook posts, tweets, other social media) to form quasi-identifiers that may identify a MOOC user if not anonymized properly. Adapted from Sweeney, 2002.]
To impose the k-anonymity model on the MOOC datasets, Daries et al. (2014) employed
both the suppression and generalization emphases. The suppression emphasis removed
identifiable attributes from the dataset and replaced it with a character to represent information
that was collected and subsequently redacted. The generalization emphasis replaced attributes
with corresponding or representative values. For example, in order to de-identify a dataset
containing users’ age, the suppression technique eliminated the cell value while maintaining the
attribute category. The generalization technique replaced the cell value with an age range, as
seen in Figure 3.2.
Figure 3.2. Example of Suppression and Generalization Emphases

Attribute   | Suppression | Generalization
User_1 Age  | *           | 20-24
User_2 Age  | *           | 15-19
User_3 Age  | *           | 30-34
In the case of Person-Course, Daries et al. (2014) identified 20 attributes as variables that
may be used to identify MOOC users (see Table 3.2). The attributes were grouped into two
categories: administrative, meaning the data was generated by the MOOC provider or by the
researchers, and user-provided, meaning the data points were generated by the user at the
the time of registration with the MOOC provider. Attributes that were altered as a result of the k-
-
33
anonymity process were tagged with the suffix DI. Null cells, or data that was not made available
by either the MOOC provider or the user was indicated in the attribute inconsistent_flag.
Table 3.2. Variables

Attribute | Code | Type | Description
Course ID | course_id | Administrative | Course name, institution, and term
User ID | userid_DI | Administrative | Researcher-assigned indiscriminate ID number that correlates to a given dataset
Registered for course | registered | Administrative | User registered for a given course
Gender | gender | User-provided | Values include female, male, and other
Country of residence | final_cc_cname_DI | Administrative, user-provided | IP address or user disclosed; altered through generalization emphasis
Birth year | YoB | User-provided | User’s year of birth
Education | LoE | User-provided | User’s highest level of completed education
Registration | start_time_DI | Administrative | Date user registered for course
Forum posts | nforums_posts | Administrative | Number of user posts to discussion forum
Activity | ndays_act | Administrative | Number of days user was active in the course
Class visits | viewed | Administrative | Users who viewed content in the course tab
Course interactions | nevents | Administrative | Number of user interactions with the course as determined by tracking logs
Video events | nplay_video | Administrative | Number of times user played course videos
Chapters accessed | nchapters | Administrative | Number of course chapters accessed by user
Chapters explored | explored | Administrative | Users who accessed at least half of the assigned chapters
Seeking certificate | certified | Administrative | Users who earned a course certificate
Final grade | grade | Administrative, l-diversity sensitive | User’s final grade in the course
Activity end | last_event_DI | Administrative | Date of user’s final interaction with course
Non-user participant | role | Administrative | Classifies instructors or staff in the course
Null values | inconsistent_flag | Administrative | Classifies values that are not available due to data inconsistencies
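The naming conventions in Table 3.2 (a _DI suffix on attributes altered during de-identification, and an inconsistent_flag marking unavailable values) can be sketched as a small post-processing step; the record and field values here are invented, not actual Person-Course data:

```python
def tag_deidentified(record, altered_fields):
    """Rename attributes altered by the de-identification process with a
    '_DI' suffix and flag records containing null values, mirroring the
    conventions described for the Person-Course dataset."""
    out = {}
    for field, value in record.items():
        key = field + "_DI" if field in altered_fields else field
        out[key] = value
    out["inconsistent_flag"] = any(v is None for v in record.values())
    return out

raw = {"userid": "a1b2c3", "final_cc_cname": "New England", "grade": None}
tagged = tag_deidentified(raw, altered_fields={"userid", "final_cc_cname"})
print(tagged)
# {'userid_DI': 'a1b2c3', 'final_cc_cname_DI': 'New England',
#  'grade': None, 'inconsistent_flag': True}
```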
L-diversity. Daries et al. (2014) also accounted for l-diversity in the de-identified Person-
Course dataset. The researchers were able to create a k-anonymous dataset that was effective in
reducing identification risks for individual MOOC users, but it still left the possibility of a
“homogeneity attack” (Machanavajjhala, Gehrke, Kifer, & Venkitasubramaniam, 2007, p. 3).
In this type of data breach, an attacker capitalizes on contextual knowledge of a given individual,
perhaps learned through social media sites, and, employing deductive reasoning informed
by the data provided in a k-anonymous dataset, re-identifies that individual. The initial k-
anonymity process yielded individual-user population groups with sensitive variables that might
be used for re-identification. In the case of Person-Course, by knowing how a user was classified
in a few sensitive variable categories, such as date of enrollment, course name, and their IP
address at the time of their involvement in the course, it might be possible to determine which
specific user posted on a discussion board on a given date.
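This re-identification risk can be demonstrated as a simple filter: the attacker intersects facts already known about the target with the released records, and if only one record survives, the target is identified and every sensitive attribute in that record is exposed. The released records below are invented:

```python
def matching_records(dataset, known_facts):
    """Records consistent with everything the attacker already knows."""
    return [rec for rec in dataset
            if all(rec.get(field) == value
                   for field, value in known_facts.items())]

released = [
    {"course": "6.002x", "start": "2013-09", "country": "US", "nforum_posts": 3},
    {"course": "6.002x", "start": "2013-09", "country": "FR", "nforum_posts": 41},
    {"course": "6.002x", "start": "2013-10", "country": "US", "nforum_posts": 0},
]
# Suppose the attacker knows (e.g., from social media) that the target
# enrolled in 6.002x in September 2013 from France.
candidates = matching_records(released, {"course": "6.002x",
                                         "start": "2013-09",
                                         "country": "FR"})
print(len(candidates))  # 1: the target's posting history is exposed
```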
L-diversity could also be used to reduce statistically based data breaches known
as “background knowledge attacks” (Machanavajjhala et al., 2007, p. 4). This type of data breach
allows an attacker to capitalize on information they have about a specific demographic of
users and to use that information to reduce the number of attributes to be
examined when attempting to identify a specific user. However, for the purposes of their
research, Daries et al. (2014) decided to focus only on their datasets’ vulnerability based upon a
homogeneity attack.
After the Person-Course dataset was de-identified for k-anonymity, Daries et al. (2014)
assessed the data for l-diversity sensitive variables, or attributes that may be especially
vulnerable if an attacker learned of their values. For example, a study about students in a
traditional college course may provide the gender, age, and ethnicity of the learners, but in order
for the data to be considered l-diverse, the sensitive variable of a student’s GPA would need to
be redacted in order to protect the privacy of those students. For the purposes of Person-Course,
Daries et al.’s (2014) analysis determined that the only sensitive variable was final course grade
(grade), which would be subject to removal from the dataset if believed to present homogeneity
vulnerability. My research also ascribed the sensitive variable value to the grade attribute.
Replication of Daries et al.’s Method
To replicate Daries et al.’s (2014) study, I received approval from Northeastern
University’s Institutional Review Board and signed a data release with MIT’s Office of
Institutional Research. Correspondence with Daries provided access to a GitHub page featuring
his study’s de-identification process manual and the open-source Python code I used to de-
identify my datasets. Daries also provided additional information regarding the background,
theory, and process for his study via the MITx and HarvardX Dataverse, which included the
Person-Course Documentation (Daries, 2014) and Person-Course De-identification (Daries,
2014) files. I consulted these materials frequently throughout the data collection, coding, and analysis processes.
Data Collection
The research process consisted of two distinct phases: the simultaneous document review
and legal analysis of FERPA, and the coding of identified MOOC datasets. The document review
and analysis included an evaluation of the case law that examines the application of the key
terms found in C.F.R. Title 34 Part 99 Subpart A §99.3, and Subpart D §99.31(b)(1) and (2)
which regulate the conditions in which an institution may disclose information without seeking a
student’s prior consent. The process of de-identifying the MOOC datasets included running the
Python program written by Daries.
FERPA Document Review and Legal Analysis. The document review and legal
analysis was conducted in order to determine the statutory definition of key terms and
regulations for the collection, retention, and dissemination of a student’s education record.
Subpart A §99.3 provided term definitions and Subpart D §99.31(b)(1) and (2) stipulated the
regulations for releasing student information without that student’s consent. The definitions and
case law review provided the infrastructure for the analysis of both the de-identified datasets and
the content included in the datasets that might be considered an educational record. The key
terms reviewed included student, attendance, educational agency or institution, educational
record, and personally identifiable information (PII). The review of Subpart D §99.31(b)(1)
and (2) provided the context in which the de-identification process would be necessary in order
to permit the release of a dataset.
Sampling Populations for De-Identification Process. My study sought to expand the
scope of Daries et al.’s (2014) study through purposive sampling which included datasets from
the two most popular MOOC providers, edX and Coursera. These platforms were selected not
only due to their prominence in the MOOC industry, but also for their focus on accessibility to higher
education, wide range of course offerings, average number of users per course, terms of service
agreements, and user privacy policies. Udacity, another popular MOOC provider, was not
included in this study as it recently shifted its focus to providing computer science courses and
nanodegree programs through partnerships with corporate sponsors, not post-secondary
institutions.
Datasets were requested from edX, Coursera, and Daries. edX was unable to provide
datasets per their agreement with their partner institutions, but recommended requesting datasets
directly from those partner institutions, which included MIT, Daries’ home institution. Coursera
did not respond to any inquiries. My requests for datasets from 12 of Coursera’s partner
institutions were also denied. Daries responded by providing instructions for requesting access to
the datasets he and his team used for his study, which were MITx courses hosted on the edX
platform, as well as links to the de-identification Python code stored on GitHub, an open-source,
project hosting website.
Through the MITx Data Request protocol, I received access to four MITx datasets: MITx
2.01x (2013), MITx 3.091x (2013), MITx 8.02x (2013), and MITx 8.MReV (2013). These
datasets were selected from the collection of the original 16 datasets used in the Person-Course
study and were chosen due to the size of the user population. Sampling from courses with
smaller user populations allowed for easier data management and reduced the number of records
to be deleted. Yet these datasets were still large enough to be representative of a typical
MOOC dataset, with a mean user population of 20,586. The datasets were stored on a secure,
encrypted external hard drive and transferred electronically using a Pretty Good Privacy (PGP)
key. Once de-identified and assessed, the original datasets were deleted.
Using the data request method suggested by edX, datasets were solicited from Coursera’s
partner institutions. Using convenience sampling, 12 institutions located in the United States, and
thus potentially subject to FERPA compliance, were contacted via email to request
access to their Coursera-hosted course datasets. However, no institution was willing to
participate in this study. Even though partner institutions have unique, individual contracts with
Coursera, many of the universities I contacted declined my request for data citing their terms of
use agreement with the provider. These agreements prohibited sharing their participants’
identities without seeking permission from the users whose information was included in the
datasets (Coursera, 2015). Providing me with their datasets would require the partner institutions
to contact potentially thousands of domestic and international users. Resources were not
available to accomplish this task.
Attempting to De-Identify MOOC Datasets
The original de-identification code was forked, or copied, from Daries’ GitHub page
onto my GitHub page and then imported into the software program PyCharm. The datasets were
also imported into a private directory in PyCharm, which allowed the code to be run on the raw
dataset in a protected virtual environment. The data was then converted from SQL to CSV files
and run through the de-identification code in Jupyter Notebook. The results were imported and
saved in PyCharm.
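The SQL-to-CSV conversion step might look like the following sketch, assuming a SQLite source; the actual format of the SQL data received from MIT is not specified here, and the table name is hypothetical.

```python
import csv
import sqlite3

def export_table_to_csv(db_path, table, csv_path):
    """Dump one table from a SQLite database to a CSV file so that the
    de-identification code can consume it. (Illustrative only: the
    original datasets arrived as SQL data, not necessarily SQLite.)"""
    con = sqlite3.connect(db_path)
    cur = con.execute(f"SELECT * FROM {table}")  # table name assumed trusted
    header = [col[0] for col in cur.description]
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)   # column names as the CSV header row
        writer.writerows(cur)     # stream all rows into the file
    con.close()
```

A conversion of this kind preserves column names in the CSV header so the downstream code can address attributes by name.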
De-identification Code and Process. I attempted to de-identify the MITx 2.01x and
MITx 3.091x datasets. Due to programming errors, I was unable to perform the de-identification
process on the MITx 8.02x and MITx 8.MReV datasets. The de-identification progam was run
on the MITx 2.01x dataset six times and the MITx 3.091x dataset once.
In order to prepare the datasets for de-identification, and per Daries et al.’s (2014)
original research design, each user was given a 16-digit identification number composed of both
a unique identifier and the course ID. The datasets were then evaluated by quasi-identifiable,
user-specific attributes: IP address, gender, year of birth, enrollment date, last day active, days
active, and number of forum posts. I selected these attributes to be consistent with the original
study. Daries et al. report choosing these variables due to their increased probability of being
publicly available.
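The pseudonymous identification step can be illustrated with a short sketch. The hashing scheme and digit layout here are assumptions for illustration; the original study specifies only that each identifier was 16 digits and combined a unique identifier with the course ID.

```python
import hashlib

def pseudonymous_id(username, course_id, digits=16):
    """Derive a stable numeric pseudonym from a username and course ID.
    The hash is one-way, so the released ID cannot be reversed to recover
    the username. (Hypothetical scheme, not Daries et al.'s exact format.)"""
    payload = f"{course_id}:{username}".encode("utf-8")
    digest = int.from_bytes(hashlib.sha256(payload).digest(), "big")
    return str(digest)[:digits].zfill(digits)

print(pseudonymous_id("learner42", "MITx/2.01x/2013"))  # a stable 16-digit string
```

Because the same input always yields the same pseudonym, records belonging to one user remain linkable within a dataset without exposing the username itself.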
I applied generalization and suppression to these attributes to reduce re-identification
risks and to delete extreme outliers in the dataset, which allowed for the analysis of
the truncated mean. Country names, derived from the users’ IP addresses, were changed to their
respective geographic regions, and, in order to reduce skew in the results, users with 60 or more
forum posts were deleted. Then the data was concatenated by stringing the quasi-identifier
variables together into groups no smaller than five students. In order to minimize the impact on entropy, the
code was applied systematically to each quasi-identifier represented in the utility matrix. This
process attempted to yield a k-anonymous and l-diverse dataset ready for its utility assessment.
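The generalization and suppression steps described above can be sketched as follows. This is a simplified reconstruction, not Daries’ code: the country-to-region mapping, the record layout, and the default parameters are assumed for illustration, though the 60-post outlier threshold and the minimum group size of five follow the text.

```python
from collections import Counter

# Assumed country-to-region mapping (the study derived regions from
# IP-based country names; only a fragment is shown here).
COUNTRY_TO_REGION = {"France": "Europe", "Ghana": "Africa", "India": "Asia"}

def generalize_and_suppress(records, quasi_ids, k=5, forum_cap=60):
    """Generalize country to region, delete extreme forum-post outliers,
    then suppress any record whose quasi-identifier combination is shared
    by fewer than k records, yielding a k-anonymous table."""
    staged = []
    for row in records:
        if row["nforum_posts"] >= forum_cap:
            continue  # outlier deletion, per the 60-post threshold
        staged.append(dict(row, country=COUNTRY_TO_REGION.get(row["country"], "Other")))
    # Count each quasi-identifier combination, then drop rows whose
    # group falls below the k-anonymity threshold.
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in staged)
    return [r for r in staged if counts[tuple(r[q] for q in quasi_ids)] >= k]
```

Each surviving record is then indistinguishable from at least k − 1 others on the chosen quasi-identifiers.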
I then attempted to determine the utility of the k-anonymous and l-diverse datasets by
completing the utility matrix. Composed of a nine-by-three grid, this matrix measured the de-
identified dataset’s entropy, mean, and standard deviation for each quasi-identifier (see Table
3.3). Generated by the Python code, this matrix was run on the original identified dataset and
once again each time a variable was coded for k-anonymity. The utility matrix was to be
recorded for each iteration of the analysis for each dataset, but the program encountered an error,
preventing the utility matrix from being completed.
Table 3.3. Pre-Utility Matrix for MITx 2.01x
Variables Entropy Mean (n) Standard Deviation
viewed 0.893515 0.689704 0.462615
explored 1.38352 0.194336 0.395689
certified 0.345462 0.0646054 0.245828
grade 1.80109 0.0692774 0.211211
nevents 8.29603 799.21 2229.94
ndays_act 4.177 9.48864 17.5364
nplay_video 5.49129 78.207 239.401
nchapters 2.92928 3.90965 3.7522
nforum_posts 0.640539 5.8006 26.4147
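The three quantities in each row of Table 3.3 can be reproduced for any single variable with a short sketch. This is illustrative only, assuming a plain list of numeric values rather than the study’s full dataset pipeline.

```python
import math
from collections import Counter

def utility_row(values):
    """Shannon entropy (in bits), mean, and population standard deviation
    of one variable: the three columns of the utility matrix."""
    n = len(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(values).values())
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return entropy, mean, std

# e.g. a binary 'viewed' flag for four hypothetical users
print(utility_row([1, 1, 1, 0]))
```

Comparing these values before and after de-identification quantifies how much information the generalization and suppression steps removed.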
Analysis
Document Review of FERPA. I determined if MOOC providers’ datasets could meet
the statutory requirements of FERPA by analyzing the regulatory definition of the terms
educational record, PII, student, and educational agency or institution as found in Subpart A
§99.3. I also assessed if MOOC users may be considered students according to §99.3 and the
relevant case law. An in-depth analysis of the statute’s applicability to MOOCs is provided in the
subsequent chapter.
Measuring K-anonymous Utility. The de-identified datasets were then analyzed to
determine their utility. In the original study, this process allowed Daries (2014) and his team to
quantify the impact the deletion of variables had on the accuracy of the de-identified dataset.
The analysis was to measure the change between the raw datasets and the k-anonymous,
l-diverse datasets by employing a utility matrix (see Table 3.3) modeled on Dwork’s (2006)
utility vector. This matrix was also designed to measure the shift in Shannon entropy, a common
metric in information theory, as well as the mean and standard deviation of nine nominal
variables between the pre- and post-de-identification datasets. However, due to unresolved bugs
in the code, I was unable to measure the utility of any of the k-anonymous datasets.
Limitations
My inability to gain access to a Coursera dataset was a significant limitation of this study.
Without a representative dataset from a second MOOC provider, I was unable to determine if
this methodology can effectively de-identify non-edX data. Therefore, I was unable to answer
my secondary research goal of determining a standardized methodology for MOOC data de-
identification. Additionally, Daries et al. (2014) did not provide the standards by which they
determined whether a dataset has maintained its utility. This is problematic as the utility impact may
vary depending upon how the attributes are grouped, categorized, or eliminated. Also, there are
currently no industry standards for quantifying dataset utility.
With the additional goal of creating a systematic process for de-identifying datasets that
may be used on any type of MOOC provider and still maintain the dataset’s efficacy, my study
necessitated defining utility as “the validity and quality of a released dataset to be used as an
analytical resource” (Woo, Reiter, Oganian, & Karr, 2009). The values for entropy, mean, and
standard deviation will be discussed in Chapter 5. The broad scope of this term offered a baseline
understanding of what should be the resulting usability of a de-identified dataset. However, it
must be noted that though a general definition of utility is provided for my study, in practice,
utility may be determined on a case-by-case basis, dependent upon the needs of the individual
using the dataset.
Finally, I encountered a number of bugs in Daries’ program, which will be discussed
more in depth in Chapter 5. Due to these problems with the code, I was unable to complete the
method in its entirety as outlined in this chapter. This limitation of my study is reflective of the
problem with Daries’ code, not the method itself.
Chapter 4
Legal Analysis
In the aftermath of the Watergate scandal, when the public’s desire for governmental
transparency was at an all-time high (Stone & Stoner, 2002), Senator James Buckley proposed an
amendment to the General Education Provisions Act (GEPA) that would become the Family
Educational Rights and Privacy Act of 1974, more commonly known as FERPA (20 U.S. C.
§1232g). The rationale for FERPA, as articulated in Senator Buckley’s initial appeal to
Congress, recognizes the need to curtail the “abuses of personal data by schools and Government
[sic] agencies” (120 Congressional Record, 14580). Months later in the Joint Statement in
Explanation of Buckley/Pell Amendment (120 Congressional Record, 39862-39866), Senator
Buckley claimed the purpose of the law is to provide both parents and eligible students the
ability to review their education records, as well as limit the sharing of those records without
student or parental consent in an effort to promote student privacy. FERPA was authorized as an
amendment to GEPA and therefore did not undergo Congressional committee review, limiting its
legislative history to the Joint Statement (Stone & Stoner, 2002). FERPA became law in the
summer of 1974.
Over the past 40 years, FERPA has been amended eleven times and faced significant
criticism. Many of these amendments were enacted in response to nationally publicized, critical
incidents in higher education, such as the Campus Security Act in 1990, the USA PATRIOT Act
of 2001, and the Amendments of 2008 (Ramirez, 2009). However, because these amendments
were made in conjunction with other laws, such as the Jeanne Clery Act, or as an addendum to
the Higher Education Act, the legislative history for these amendments is also limited.
Despite these modifications, the statute’s language is indisputably imprecise, leaving
institutions to interpret the statute’s terminology of educational record to meet their own needs
(Graham, Hall, & Gilmer, 2008). Until 2008, the Dept. of Education actively abstained from
providing clarity for colleges and universities on how to interpret and implement FERPA
(Lomonte, 2010). This hesitation by the Dept. of Education to offer more guidance on FERPA
compliance is a consequence of the statute’s lack of detailed legislative history.
FERPA regulates K-12 and post-secondary education systems, but critics suggest it fails
to take into account the distinctive needs of these two very different populations (Lomonte,
2010). The Dept. of Education first recognized the disparate privacy goals of higher education
students and institutions through its 2011 proposal to strengthen protections around statewide
longitudinal data systems (L'Orange, Blegen, & Garcia, 2011). However, the application of this
law and the sharing of information is contingent on multiple factors including timing, the
relationship of the parties in question to the student, and the purpose for disclosure (Meloy,
2012).
To date, MOOCs have not been litigated in any United States court. Therefore, the
following examination of the statutory definitions of key terms in FERPA, and the review of
applicable case law, is intended to be persuasive only. The cases presented are not an
authoritative assertion of the binding precedent to be enforced on MOOC providers or MOOC
users.
Statutory Definitions of Key Terms as they Pertain to MOOCs
For MOOCs and the privacy needs of their users, examining how the definitions included
in FERPA relate to this learning platform is essential in determining whether
MOOC datasets can and should be de-identified in a manner that is compliant with FERPA. In
order to determine qualifications for compliance, as determined for the purposes of this study,
the terms evaluated in this analysis include student, attendance, educational agency or institution,
educational record, and personally identifiable information (PII). These definitions are provided
in FERPA in §99.3, as authorized by 20 U.S.C. §1232g.
Who is a Student? The statute defines a student as “any individual who is or has been in
attendance at an educational agency or institution and regarding whom the agency or institution
maintains education records” (§99.3). However, determining who and under what conditions an
individual meets the statutory definition of a student is a complicated process. The term student
appears 208 times in the statute, often in conjunction with other key terms such as educational
record or PII. This is especially problematic considering these terms heavily rely on the
designation of student in their own definitions. For example, FERPA classifies many types of
information that may be considered a component of an educational record, but each relies on the
qualifier that it relates to the student in some way. The definition of a student is not independent
from the term educational record, and the meaning of educational record cannot be understood
without including the term student. The same is true of PII and attendance.
FERPA is only authorized to regulate records and information that pertain to students;
therefore, it is reasonable to conclude that the reliance on the term student is necessary for the
success of the statute, but is problematic due to its circular nature (Young, 2015). FERPA’s
definition of “student” is vague, creating difficulty in determining if a new type of learner may
seek protection under FERPA, or if a new learning platform may be subject to regulation.
The term student has maintained its original meaning from FERPA’s enactment in 1974.
Without any amendments that directly address the definition of the term student, one must turn to
case law in assessing if a MOOC user can be considered a student under the statute. The
application of the definition of student is examined in a number of cases, including Klein
Independent School District v. Mattox, 830 F.2d 756 (5th Cir. 1987), and Tarka v. Franklin, 891
F.2d 102 (5th Cir. 1989).
Klein Independent School District v. Mattox. Under the newly established Texas Open
Records Act, a request to review the college transcripts of Rebecca Holt, a teacher in the Klein
Independent School District, raised questions regarding the FERPA rights of employees. Klein v.
Mattox (1987) examines if FERPA may be used to protect educational records that are included
in a personnel record. The United States Court of Appeal for the 5th Circuit held that, because
Holt’s relationship with the Klein Independent School District was as an employee and never as
a student who attended classes within the district, she could not seek relief under FERPA.
Klein’s significance for MOOCs extends beyond the definition of student and raises the
question of the value of personal privacy when contrasted against the public’s best interest in the
context of FERPA. The court did suggest the need to vet the competency and credentialing of the
school district’s educators outweighs Holt’s desire to keep her transcripts private, thus the release
of such information does