Design and Multilingual Users on Twitter and Wikipedia

Scott A. Hale (http://www.scotthale.net/), Oxford Internet Institute, University of Oxford. 17 June 2014.

Upload: scott-a-hale

Post on 19-Jun-2015


DESCRIPTION

Presentation given at MIT Media Lab on June 17, 2014. Presents ongoing work on design and multilingual users. Two recent papers are "Global Connectivity and Multilinguals in the Twitter Network" (http://www.scotthale.net/pubs/?chi2014) and "Multilinguals and Wikipedia Editing" (http://www.scotthale.net/pubs/?websci2014).

TRANSCRIPT

  • 1. Design and Multilingual Users on Twitter and Wikipedia. Scott A. Hale, http://www.scotthale.net/, Oxford Internet Institute, University of Oxford, 17 June 2014.

  • 2–3. Importance of design
  • 4–5. Content is diverse across languages. multilingualism...[is] the norm for most of the world's societies (Birner, 2005), with over half of Europe and over a fifth of the US multilingual (Erard, 2012); yet, many platforms are designed only with monolingual users in mind. In an Uzbekistan survey, Internet users reported accessing content in foreign languages even while simultaneously reporting poor foreign language skills (Wei & Kolko, 2005). Users often contribute local content/knowledge (Hecht & Gergle, 2010a). Large diversity in information between languages (Hecht & Gergle, 2010b). Can lead to self-focus bias (Hecht & Gergle, 2009).
  • 6–7. Motivations. Language clustering vs. small-worlds: Users thought to cluster by language in most online platforms (Barnett & Choi, 1995; Hale, 2012a, 2012b; Herring et al., 2007; Nordenstreng & Varis, 1974; Takhteyev, Gruzd, & Wellman, 2011; Wilkinson & Thelwall, 2012). Many online platforms thought to exhibit the small-world phenomenon of small path lengths between users (despite high clustering). Role of multilingual users: If users cluster by language and platforms are small-worlds, there must be brokers bridging different language groups (spanning structural holes). Multilingual users are possible bridge users. Only one study investigating this: an ego-net level study of Twitter following-follower network structure (Eleta & Golbeck, 2012). No multiplatform study, no study at a large-scale level.
  • 8. Outline. What are the roles of multilinguals and platform design in shaping the spread of information in social media? Twitter and Wikipedia at a global level. (1) Language will have a strong role in structuring the platform. (2) Users engaging with content in multiple languages (multilingual users) serve as bridges between different clusters/editions. (3) Users primarily writing in less-represented languages will be more likely to cross language boundaries than users writing in highly-represented languages. (4) When users cross languages they will cross to larger languages (e.g. English), and thus at a language level English will form more bridges than other languages.
  • 9. Data. Twitter: mentions and retweets network; 18 days of the spritzer 1% sample stream from June 2011; 7,341,271 nodes; 8,545,693 directed, weighted edges. Wikipedia: edits from the top 46 language editions; 8 July to 9 August 2013; 3.5 million non-minor edits by 55,568 registered users. Papers: Global Connectivity and Multilinguals in the Twitter Network (2014), http://www.scotthale.net/pubs/?chi2014; Multilinguals and Wikipedia Editing (2014), http://www.scotthale.net/pubs/?websci2014
  • 10–13. Twitter: Data cleaning. Language classification: Clean text of tweets for language detection (remove URLs, usernames, emoticons). Use Chromium Compact Language Detection kit for language detection (Graham, Hale, & Gaffney, 2013). Remove users with fewer than 2 tweets or 20% of the user's tweets in one language. Remove users with fewer than four tweets total. Bots and spam users: Remove users with no mentions (indegree=0). Select only the largest weakly-connected component (88% of nodes). End result: 916,836 nodes (users) and 2,652,618 directed edges (mentions/retweets). Each user is assigned a most-used language and the frequency [0-1] with which that language is used.
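The Twitter cleaning steps on the slides above can be sketched in Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the regular expressions, the function names `clean_tweet` and `summarize_users`, and the input format are hypothetical; language detection is assumed to have already produced a language code per tweet; emoticon removal and the per-language 2-tweet/20% rule are left out.

```python
import re
from collections import Counter, defaultdict

# Assumed patterns; the slides do not say how URLs and usernames were matched.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def clean_tweet(text):
    """Strip URLs and @usernames, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return " ".join(text.split())


def summarize_users(tweets, min_tweets=4):
    """Assign each user a most-used language and its share of their tweets.

    tweets: iterable of (user, language_code) pairs, one per tweet.
    Users with fewer than min_tweets tweets are dropped.
    Returns {user: (most_used_language, share_between_0_and_1)}.
    """
    per_user = defaultdict(Counter)
    for user, lang in tweets:
        per_user[user][lang] += 1

    summary = {}
    for user, counts in per_user.items():
        total = sum(counts.values())
        if total < min_tweets:
            continue  # too few tweets to classify this user
        lang, n = counts.most_common(1)[0]
        summary[user] = (lang, n / total)
    return summary
```

A user whose share is below 1.0 in this sketch has tweets detected in more than one language; whether that user counts as multilingual would still depend on the per-language thresholds described on the slide.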
  • 14. Wikipedia: Data cleaning. Non-minor edits by registered, human users to articles: Only edits to the main (article) namespace. Removed articles flagged as being created by bots. Removed anonymous users. Removed undeclared bots and users with only one edit session in the month. Require at least four edits and at least 2 edits to one edition. Matching users and articles across languages: Look for common usernames across language editions. Check usernames are indeed linked global accounts. WikiData dump to match articles across languages. 55,568 users (excluding Simple English edition) with a total of 3,518,955 edits.
  • 15. User counts. Twitter (language: user count): English (en) 375,474; Japanese (ja) 137,263; Portuguese (pt) 133,501; Malay/Indonesian (ms) 106,223; Spanish (es) 70,246; Dutch (nl) 31,035; Korean (ko) 16,123; Thai (th) 8,629; Arabic (ar) 7,679; French (fr) 5,769; Filipino/Tagalog (fil) 5,393. Wikipedia (language: user count): English 22,412; German 4,920; French 3,430; Russian 3,330; Spanish 3,299; Japanese 3,164; Italian 2,202; Chinese 1,975; Portuguese 1,220; Polish 1,011; Dutch 1,007.
  • 16. Twitter: Multilinguals vs Monolinguals. On Twitter, 11% of users (103,000) were observed to use more than one language and designated as multilingual users. [Figure: multilingual vs. monolingual users, comparison of tweet count, out-degree, and in-degree.]
  • 17–18. Wikipedia: Multilinguals vs Monolinguals. On Wikipedia, 15.4% of users (8,544) edited more than one language edition and were designated as multilingual users. [Figure: density plot comparing the number of edits made by monolingual and multilingual Wikipedia users.] Size of edits does not differ significantly. Only 2.6% of edits are from users writing in their non-primary languages on Wikipedia.
  • 19. Twitter: Language and structure. Label propagation algorithm (Raghavan, Albert, & Kumara, 2007) found 20,253 communities. [Figure: histograms of the size of communities (left) and the number of languages within each community (right).] Modularity score of 0.81 for this community structure.
  • 20. Twitter: Language and structure. [Figure: scatter plot of community size and the percentage of users in the community most often using the most prevalent language.]
  • 21. Language and structure. Table: Clusters with over 10,000 nodes found through the label propagation algorithm (most-used language: % users in most-used language, number of languages, number of nodes). Malay (ms): 78.3, 41, 123,616. English (en): 99.3, 39, 114,826. Portuguese (pt): 94.3, 40, 101,987. Japanese (ja): 99.6, 19, 83,785. English (en): 75.7, 44, 80,387. English (en): 55.1, 42, 37,688. Dutch (nl): 90.6, 23, 20,634. Collectively 61% of all users are in one of these clusters.
  • 22. Twitter: Do multilinguals bridge clusters? [Figure: size of the largest weakly-connected component (left), total number of components (center), and average size of the components (right) created by removing all multilingual users, an equivalent number of monolingual users randomly, an equivalent number of all users randomly, and removing all multilingual users from a network with the same degree distribution but with edges randomly shuffled. Box plots show values from 100 realizations. Mean values are indicated with +.]
  • 23–24. Wikipedia: Do multilinguals bridge editions? Do multilinguals edit similar articles across languages? A large number of users did not edit any of the same articles in their primary languages, but a large number of users also always edited the same articles in their primary languages.
  • 25. Variations by language. [Figure, panels Twitter and Wikipedia: number of users in each language compared to the percentage of these users classified as multilingual.]
  • 26. Twitter: Cross-language connections. [Figure: network graph of mentions and retweets across languages; node labels: ar, de, en, es, fil, fr, gl, it, ja, ko, ms, nl, pt, th.] Nodes represent most-used language. Directed, weighted edges show the log of the number of users primarily using one language who mention/retweet users in another language. Only edges with weights over 1.96 standard deviations above the mean are shown. Colors indicate communities found by the infomap community detection algorithm. N.B. This differs from the published paper, where edges were normalized by the expected number of connections between language pairs if tweets were directed at users randomly without regard to language.
  • 27. Wikipedia: Language crossings. [Figure: co-editing network graph; node labels: ar, bg, ca, cs, da, de, en, es, fa, fi, fr, he, hu, id, it, ja, ko, nl, no, pl, pt, ro, ru, sv, tr, uk, zh.] Nodes represent language editions. Directed, weighted edges show the log of the number of users primarily editing one language edition who edited another edition. Only edges with weights over 1.96 standard deviations above the mean are shown. Colors indicate communities found by the infomap community detection algorithm.
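The edge filter used in the cross-language graphs (log of the user count, kept only when more than 1.96 standard deviations above the mean) can be sketched as follows. The slides do not state the log base or whether a sample or population standard deviation was used; the natural log and the population standard deviation are assumptions here, and `strong_edges` and its input format are hypothetical names for illustration.

```python
import math
from statistics import mean, pstdev


def strong_edges(pair_counts, z=1.96):
    """Keep language pairs whose log-count is more than z SDs above the mean.

    pair_counts: {(source_language, target_language): number_of_users}.
    Returns {(source_language, target_language): log_weight} for kept edges.
    """
    # Edge weight is the log of the user count; zero counts carry no edge.
    weights = {pair: math.log(n) for pair, n in pair_counts.items() if n > 0}
    mu = mean(weights.values())
    sigma = pstdev(weights.values())  # assumption: population SD
    cutoff = mu + z * sigma
    return {pair: w for pair, w in weights.items() if w > cutoff}
```

Because the cutoff is computed over all language pairs at once, a few very heavy edges (for example, many pairs pointing to English) raise the bar for every other pair, which is one reason the Wikipedia graph is also shown with English removed.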
  • 28. Wikipedia: Language crossings (English removed). [Figure: co-editing network graph; node labels: ca, cs, de, es, fr, it, ja, nl, pl, pt, ru, sv, uk, zh.] Nodes represent language editions. Directed, weighted edges show the log of the number of users primarily editing one language edition who edited another edition. Only edges with weights over 1.96 standard deviations above the mean are shown. Colors indicate communities found by the infomap community detection algorithm.
  • 29–32. Summary and Implications. Multilingualism correlated with activity on both platforms: Design for multilingual users. Allow users to have multiple preferred languages when personalizing search results, friend recommendations, etc. Structured by language: Language has a strong role structuring both platforms. Multilingual users are in a position to bridge clusters/editions, but there is mixed evidence on their actual role. Multilingual user percentage ∝ 1/self-focus bias. Important per-language variations: Users in less-represented languages are more likely to cross language boundaries on Wikipedia, but there is no correlation on Twitter. Platform differences? Consistent findings of English and Japanese as outliers. Larger languages form bridges: Especially English, but other geolinguistic patterns are evident. Global connectivity results through the combination of multilinguals across many language pairs.
  • 33. I would like to thank Eric T. Meyer, Taha Yasseri, Jonathan Bright, and Mike Thelwall who provided helpful comments on various aspects of this research.
  • 34–38. References
    Barnett, G. A., & Choi, Y. (1995). Physical Distance and Language as Determinants of the International Telecommunications Network. International Political Science Review, 16(3), 249–265. Available from http://ips.sagepub.com/content/16/3/249.abstract
    Birner, B. (2005). Bilingualism (Tech. Rep.). Washington, DC, USA: Linguistic Society of America. Available from http://www.linguisticsociety.org/files/Bilingual.pdf
    Eleta, I., & Golbeck, J. (2012). Bridging Languages in Social Networks: How Multilingual Users of Twitter Connect Language Communities. Proceedings of the American Society for Information Science and Technology, 49(1), 1–4. Available from http://dx.doi.org/10.1002/meet.14504901327
    Erard, M. (2012, January). Are we Really Monolingual? Available from http://www.nytimes.com/2012/01/15/opinion/sunday/are-we-really-monolingual.html
    Graham, M., Hale, S. A., & Gaffney, D. (2013). Where in the world are you? Geolocation and language identification in Twitter. Professional Geographer.
    Hale, S. A. (2012a). Impact of platform design on cross-language information exchange. In Proceedings of the 2012 ACM annual conference on human factors in computing systems extended abstracts (pp. 1363–1368). New York, NY, USA: ACM. Available from http://doi.acm.org/10.1145/2212776.2212456
    Hale, S. A. (2012b). Net Increase? Cross-Lingual Linking in the Blogosphere. Journal of Computer-Mediated Communication, 17(2), 135–151. Available from http://onlinelibrary.wiley.com/doi/10.1111/j.1083-6101.2011.01568.x/full
    Hale, S. A. (2014a). Global Connectivity and Multilinguals in the Twitter Network. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 833–842). New York, NY, USA: ACM. Available from http://doi.acm.org/10.1145/2556288.2557203
    Hale, S. A. (2014b). Multilinguals and Wikipedia Editing. In Proceedings of the 6th annual ACM web science conference. New York, NY, USA: ACM. Available from http://arxiv.org/abs/1312.0976
    Hecht, B., & Gergle, D. (2009). Measuring self-focus bias in community-maintained knowledge repositories. In Proceedings of the fourth international conference on communities and technologies (pp. 11–20). New York, NY, USA: ACM. Available from http://doi.acm.org/10.1145/1556460.1556463
    Hecht, B., & Gergle, D. (2010a). On the localness of user-generated content. In Proceedings of the 2010 ACM conference on computer supported cooperative work (pp. 229–232). New York, NY, USA: ACM. Available from http://doi.acm.org/10.1145/1718918.1718962
    Hecht, B., & Gergle, D. (2010b). The Tower of Babel meets Web 2.0: User-generated content and its applications in a multilingual context. In Proceedings of the 28th international conference on human factors in computing systems (pp. 291–300). New York, NY, USA: ACM. Available from http://doi.acm.org/10.1145/1753326.1753370
    Herring, S. C., Paolillo, J. C., Ramos-Vielba, I., Kouper, I., Wright, E., Stoerger, S., et al. (2007). Language Networks on LiveJournal. In Proceedings of the 40th annual Hawaii international conference on system sciences. Washington, DC, USA: IEEE Computer Society. Available from http://dx.doi.org/10.1109/HICSS.2007.320
    Nordenstreng, K., & Varis, T. (1974). Television traffic: A one-way street? A survey and analysis of the international flow of television programme material. Reports and Papers on Mass Communication (70).
    Raghavan, U. N., Albert, R., & Kumara, S. (2007, September). Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E, 76(3), 036106. Available from http://link.aps.org/doi/10.1103/PhysRevE.76.036106
    Takhteyev, Y., Gruzd, A., & Wellman, B. (2011). Geography of Twitter networks. Social Networks, 126. Available from http://www.sciencedirect.com/science/article/pii/S0378873311000359#FCANote
    Wei, C. Y., & Kolko, B. E. (2005). Resistance to globalization: Language and Internet diffusion patterns in Uzbekistan. New Review of Hypermedia and Multimedia, 11(2), 205–220.
    Wilkinson, D., & Thelwall, M. (2012). Trending Twitter topics in English: An international comparison. Journal of the American Society for Information Science and Technology, 63(8), 1631–1646. Available from http://dx.doi.org/10.1002/asi.22713
    Zuckerman, E. (2008). Meet the bridgebloggers. Public Choice, 134(1), 47–65.
    Zuckerman, E. (2013). Rewire: Digital Cosmopolitans in the Age of Connection. London: W. W. Norton & Company.