what your tweets tell us about you, speaker notes

• Introduce paper title
• Ask people to interact, comment, and respond to our questions during the presentation using #tweetprivacy
• Credits
o Charlesworth – whose Digital Lives Report was one of the only papers that provided any analysis and guidance in the area of social media archiving.

Interest in social media data is multidisciplinary, resulting in conflicting views regarding the ethical management of captured datasets. Curators will be required to navigate these conflicting views as they work to provide appropriate mechanisms for access to and reuse of these data. We hope to encourage researchers and library, archive, or repository staff to engage in a cross-disciplinary conversation about the privacy issues (as well as the host of other issues) inherent in using social media as a primary source for research.

We’re going to show you a clip from Laila Sakr’s presentation at the Tech@state Data Visualization Conference in Washington DC. The clip provides a good example of how researchers are using Twitter and other social media data. [Play Clip]

There are two key things I want to point out:

1. Long-term archiving of this data and other curatorial issues like value, authenticity, and significant properties are absent from this talk, which is not surprising. They were also absent from many of the papers we read that utilized Twitter data. This demonstrates that researchers’ overall emphasis at this point is on collection and analysis rather than on preservation.

2. Sakr makes sure to say that she is downloading only the publicly available tweets using the search API, and she discusses how this could affect her sample and its validity. She’s not talking about it in terms of privacy issues, which further illustrates that the focus is on analysis rather than privacy or ethics.

We’d like to take an informal poll, similar to last night’s poll of the audience’s willingness to have their genome sequenced. Who among those of you who use Twitter as a communication tool is completely fine with having your tweets, profile information, images, and location data downloaded, analyzed, archived, and preserved? Of those of you with your hands raised, how many of you have tweeted something of a more personal nature that you might not want archived?
And who here is actively involved with the collection of Twitter data, or any social media data? [What do you do with it? Tweet here]

The reason I ask is that we found, through our work with the HyperCities Egypt Twitter data, that whether or not there are privacy concerns with a data source like Twitter is essentially a research ethics issue, one that varies depending on the role and/or subject background of the researcher and how they view the context of the data creation. (Refer to Confounding Relationships to point out the various roles.)

So, our central thesis is that perceptions of privacy in social media platforms are formed by disciplinary culture, the capabilities and constraints of the platform, and the community norms of the platform itself. Does analyzing a person’s Tweets constitute researching a human subject? Or are Tweets a creative text which requires proper citation and credit to the authors or tweeters? Or are Tweets part of the open public record? Social scientists tend to view the data as Human Subject research, while Humanists tend to view the data as a form of publication. These very different ways of viewing the data require different methods for dealing with privacy.

We feel it is important to state that social media data are not homogeneous; each platform has its own unique constraints on the creation and inclusion of content, as well as constraints on how users may engage in the space, and its own expectations and norms of interaction. Our case study focuses on Twitter, so while we provide a general framework for assessing privacy issues with social media, it must be understood that, because of the uniqueness of Twitter’s Privacy Policies, Terms of Service, and Developers Rules of the Road, the analysis and interpretation are not necessarily generalizable to other platforms, such as Facebook. As with many data curation activities, some facets can be generalized, while others may be platform or subject specific. Part of determining the curation needs of social media data will be to determine these boundaries.

What can we learn about you from Twitter? [Show different visualizations, then tweet map, tweet image] Depending on how the data are visualized, we can learn about you as an individual, about your internet relations, about you as part of a huge collective, or nothing about you as an individual (r-shief image). Some visualizations will enable better anonymization than others.
However, if your account is unprotected, the underlying dataset used to generate the visualizations will still contain your name, location, photos, and anything else you decide to share in your timeline. So if you include other personal info, like an email address, we can find it out about you. But what else can we find out about you? [Show the Alyaa Gad slide, then the Google Search] Thanks to the power of search engines like Google, we can get a lot more information, which may be collected and archived as well.

Our Case Study, or what I like to call “we’ve got tweets, now what?”

Todd Presner, a UCLA faculty member, and two researchers collected a subset of the overall Twitter data available. He asked the library to archive it. Before we could do anything with it, we had to assess what he had collected. The HyperCities team used the Twitter Search API to pull data based on the location parameter (within 200 km of the center of Cairo), time period (January 30, 2011 through February 24, 2011), AND one of three hashtags (#jan25 OR #egypt OR #tahrir). They downloaded approximately 420,000 public Tweets during the initial phase of this analysis and continue to supply their site with live feeds.
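To make the capture criteria concrete, here is a minimal sketch of the same three-part filter (location AND time period AND any one hashtag) applied to tweets already in hand. This is not the HyperCities code and does not call the Twitter Search API; the coordinates for the center of Cairo, the use of UTC, the inclusive end date, and the tweet dictionary keys (`text`, `lat`, `lon`, `created_at`) are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from math import asin, cos, radians, sin, sqrt

# Assumed point for "the center of Cairo"; the notes do not give exact coordinates.
CAIRO = (30.0444, 31.2357)
RADIUS_KM = 200.0
# Assumed to be inclusive of all of February 24, in UTC.
START = datetime(2011, 1, 30, tzinfo=timezone.utc)
END = datetime(2011, 2, 24, 23, 59, 59, tzinfo=timezone.utc)
HASHTAGS = {"#jan25", "#egypt", "#tahrir"}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def matches_capture_criteria(tweet):
    """True if a tweet dict meets location AND time period AND any one hashtag.

    The dict keys used here are hypothetical, not Twitter's field names.
    """
    words = {w.strip(".,!?:;").lower() for w in tweet["text"].split()}
    return (
        haversine_km(CAIRO, (tweet["lat"], tweet["lon"])) <= RADIUS_KM
        and START <= tweet["created_at"] <= END
        and bool(words & HASHTAGS)
    )
```

A tweet that fails any one of the three tests never enters the data set, which is one reason the search parameters need to be documented alongside the archived files.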

As with Sakr’s work, the data capture was motivated by the fact that significant events were taking place using Twitter, and because Twitter data disappears quickly (10 days), they decided to start downloading it and to make it available to as many people as possible for future reference and study. There wasn’t necessarily any research question or overarching thesis behind the collection other than to provide a glimpse back to the Egyptian Revolution Twitterverse. As Dr. Charlesworth pointed out yesterday morning, legal issues with gathering this type of data won’t be at the forefront of the researcher’s mind. Based on the search parameters, the data set captured eight out of approximately forty possible Twitter data fields, revealing how the method of capture and the search parameters profoundly shape the resultant data. The data is sitting on Prof. Presner’s personal server as JSON files, but it will soon be converted into XML for ease in depositing and managing the data in Islandora. These facts must be documented in order for future users to have a clear understanding of the data set.

“But the data are already public…”

So if the general understanding is that your Twitter data is open and public, and that people using these platforms want to be seen AND heard, why should we be concerned about privacy? The Privacy Policy of Twitter stipulates that while you “own” your content, anyone, including Twitter or any third party, is given the right to access your data and re-use it. (our reading of the privacy policy) Those who see Twitter data as data that contains potentially identifying information about human subjects may want to anonymize the data for the authors’ protection, and may see displaying user names as unethical. This runs contrary to Twitter’s Rules of the Road, which require the display of a user id to give credit to the person who tweeted.
Yet Twitter also acknowledges this public/private tension in its own policies by suggesting that if making a user id or other information available raises privacy or security concerns, the individual or media should get in touch with them. The debate about the capture, reuse, and display of Twitter data concerns the line between the legality of collecting this content and the ethics of doing so. To date, as far as we are aware, there haven’t been any formal legal challenges to the downloading, use, and archiving of Twitter data. Thus ensues a wide-ranging debate among scholars, who characterize privacy issues with social media data in the following ways:

Most researchers take a harm-based view of privacy, in which the goal is to protect users’ information from negative actors.

This includes concern for security issues (use by government agencies to track and arrest; use as evidence).

It also includes recognizing that there are loopholes in the data which enable someone to get a lot of information about an individual, even if all they have is a username; even after an account is deleted or changed from public to private, content that has already been captured will still be available.

Finally, there is the “buyer beware” view: those users who have opted to make their accounts public have no grounds for complaint about the collection and reuse of their content, even if they did not anticipate reuse by researchers or commercial firms (Thelwall, 2010; Vieweg, 2010).

Danah boyd still asks: just because we can collect it, should we?

Michael Zimmer, an Internet Privacy scholar, argues instead for a dignity-based view of privacy, which holds that the act of another person taking one’s personal information from the social networking sphere, amassing it into a database, and making it available for use and scrutiny is an affront to the users’/subjects’ human dignity and their ability to control the flow of their personal information.

Finally, what are the user’s expectations of how their tweets will be used? How many here have actually read Twitter’s privacy policy? Facebook’s? Do you understand the implications of re-use? Schmidt, Trepte, and Reinecke (2011) observe that users develop shared routines and expectations of self-disclosure, noting that privacy management is performed for a specific audience. Facebook, for example, enables users to select privacy settings on a post-by-post basis, choosing who is able to read, comment on, and interact with specific content, and allowing the user fairly granular control over the flow of their information. Twitter allows only binary control; users can designate their account as “protected” (i.e. Tweets are only visible to approved followers), or “public” (enabled by default), which makes a user’s profile and timeline accessible to anyone, even those without a Twitter account.

The ethical jury is going to be out on this for a while, at least until scholarly communities work out parameters and provide guidance for acceptable use of social media data. In the meantime, what are we to do? Legal and ethical policy related to privacy and social media data is still in flux and almost always lags behind the pace of research.
Yet libraries and other repositories are pressured to act now to archive the data. Data repositories will be caught in the middle of these divergent viewpoints when trying to determine the best methods of providing access to the data. The norms of individual research disciplines often provide guidance for curators, but when researchers with divergent norms seek access to the same data, it can be difficult to determine how best to serve the broadest number of users. Experience with this data set and the literature review led us to the following recommendations:

Libraries or other data repositories will need to decide if archiving social media data fits with their overall institutional mission and goals.

Libraries should determine the overall risks associated with collecting and archiving social media data and design strategies to mitigate those risks. Because of the significance scholars are placing on the need to collect and now, in our case, archive Twitter data, we are convinced that providing for the collection, preservation, and reuse of social media data requires at the very least conversations among researchers, libraries, archives, institutional review boards, scholarly societies, and other national and international organizations concerned with the production and preservation of scholarship. Part of the discussion will need to include the context or conditions under which the data have been collected. One important aspect of this, as Dr. Charlesworth mentioned in his talk yesterday, is a way to gather “legal metadata” so that going forward the archive or repository will have the necessary privacy i’s dotted and t’s crossed, insofar as that is possible.

Libraries should engage researchers as early as possible in the research process. Curators are presented with a golden opportunity to collaborate with researchers as close to the beginning of the research lifecycle as possible. Through a collaborative process, we can ideally facilitate the creation of collections that balance openness with privacy concerns and encourage broad reuse. While that early intervention may not happen, we can employ curatorial strategies on the back end of the data gathering that will hopefully push the issue (one of which will be discussed in our next recommendation). Here we start to address, somewhat, the question that was asked yesterday at Dr. Charlesworth’s presentation about educating the researchers.

Libraries choosing to archive social media data should develop clear and easy-to-use collection and deposit policies, forms, and tools. It has been our argument since first working with Twitter data that a way both to educate researchers and to create data that is ingestible into a repository and reusable is to create a workflow that asks the necessary questions of researchers, which would aid in the creation of a codebook and documentation for the data. We created a Twitter deposit form that is geared toward raising the privacy issues with this platform and educating the researcher, as well as providing a way to record the basic legal and descriptive metadata necessary for contextualizing the data for re-use.
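As a rough illustration of what such a form might record, here is a small sketch of a deposit record that tracks which questions are still unanswered. It is not the deposit form we created; the class name and every field name are hypothetical, and the example values are taken only from the facts about the HyperCities data set stated earlier in these notes.

```python
from dataclasses import asdict, dataclass, fields
from typing import Optional
import json


@dataclass
class DepositRecord:
    """Hypothetical deposit-form answers; all field names are illustrative."""

    depositor: Optional[str] = None
    capture_method: Optional[str] = None       # how the data were gathered
    search_parameters: Optional[str] = None    # location, dates, hashtags
    approx_tweet_count: Optional[int] = None
    fields_captured: Optional[str] = None      # which data fields were kept
    file_format: Optional[str] = None
    includes_usernames: Optional[bool] = None  # privacy-relevant question
    terms_of_service_reviewed: Optional[bool] = None  # a piece of "legal metadata"

    def unanswered(self):
        """Names of the questions the depositor has not answered yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def to_json(self):
        """Serialize the answers so they can travel with the data set."""
        return json.dumps(asdict(self), indent=2)


# Example filled in from the facts stated in these notes.
record = DepositRecord(
    depositor="HyperCities team",
    capture_method="Twitter Search API",
    search_parameters=(
        "within 200 km of the center of Cairo; "
        "January 30, 2011 through February 24, 2011; "
        "#jan25 OR #egypt OR #tahrir"
    ),
    approx_tweet_count=420000,
    fields_captured="8 of approximately 40 possible fields",
    file_format="JSON",
)
print(record.unanswered())
```

The two questions left open in this example are the privacy and legal ones, which is the kind of gap a deposit workflow is meant to surface before the data are ingested.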
Teachable moments for information literacy librarians. Understand and know the source of information.

Twitter adds language to its privacy policy that more explicitly states use (gain consent, and then the data truly become open). Ideally, Twitter would take a different approach to releasing the data for research, partnering directly with researchers rather than with a third party like GNIP, which charges for the data and isn’t clear about what can be done with it once it’s been purchased.

Lastly, thanks to all who have been tweeting during the session; we wanted to let you know that we’ve archived the tweets in TwapperKeeper.
