Bettina Berendt

Humboldt University Berlin, Germany – www.berendt.de

Social-media blog tagging: Metadata or “just more content”?

OR: How do normal people describe what they’ve written – and how do readers understand that?



Acknowledgements

This presentation is based on the paper

Tags are not metadata, but “just more content” – to some people

at the International Conference on Weblogs and Social Media,

Boulder, CO, USA, March 2007

http://www.icwsm.org/papers/paper12.html

I thank my co-author

Christoph Hanser

(now at Resco, Hamburg, Germany)


Agenda

Blogs and other social media

What blog tags (should) say: “executive summary“ of this talk

Tag functions

Empirical study

1. Confirmatory part: Quality of automated content classification

2. Exploratory part: Complementarity and individual differences between classification methods

Some remarks on method


Blogs

A weblog (blog) is

a website

containing journal-style entries

presented in reverse chronological order,

often written by a single user


Blogs and other social media (“Web 2.0”)

Blogs (e.g., Livejournal; Huffington Post)

Sharing / linking by: hyperlinks, comments, blogroll, trackback links

Social network sites (e.g., MySpace)

Instant message exchanges (Twitter)

Wikis (e.g., Wikipedia)

“Annotation platforms” (e.g., del.icio.us)


Blogs and other social media, and their activity focus

Blogs (e.g., Livejournal; Huffington Post)

Sharing / linking by: hyperlinks, comments, blogroll, trackback links

Publication; Expression

Social network sites (e.g., MySpace)

(Self-)profiling, meeting people

Communication

Instant message exchanges (Twitter)

Wikis (e.g., Wikipedia)

Creating content

“Annotation platforms” (e.g., del.icio.us)

Organizing content


Blogs and other social media, and some of their origins in older media

Blogs (e.g., Livejournal; Huffington Post)

Sharing / linking by: hyperlinks, comments, blogroll, trackback links

Diaries; (often political) journalism; PR; press releases

Social network sites (e.g., MySpace)

Dating sites

Chatrooms

Instant message exchanges (Twitter)

Wikis (e.g., Wikipedia)

Computer-supported cooperative work

“Annotation platforms” (e.g., del.icio.us)

Bookmarks; www.dmoz.org

Usenet


Blogs: Publication bordering on communication



Blogs and other social media: Where tagging (= adding keywords) is most prominent

Blogs (e.g., Livejournal; Huffington Post)

Sharing / linking by: hyperlinks, comments, blogroll, trackback links

Diaries; (often political) journalism; PR; press releases

“Annotation platforms” (e.g., del.icio.us)

Bookmarks

Reader tags

Author tags

Usenet




Tags in blogs (I): What does a tag tell us in terms of what a blog is about?

On March 20, 2003, Baghdad got its first taste of "Shock & Awe". A fiery mix of cruise missiles and high-IQ bombs lit up the night sky and unleashed the most lethal air bombing campaign since Vietnam. [...] Four years on, Shock & Awe is the daily reality of a war that has killed 3,200 US soldiers and well over 60,000 Iraqi civilians. [...]

Tags: Iraq, war

We expect tags to mirror content.


Tags in blogs (II): However ...

Tags: GPS systems

... tags often complement or add to content.

... this is perceived differently by different readers.

Reader 1: geography & computers

Reader 2: computers & politics

Level of access, allowing a precision of position determination of less than 20 meter selective availability would be turned off, and so now all users enjoy nearly the same however, on may 1 , 2000 , then us president bill clinton announced that this The system also provides . for the user to add intelligence, as perceived by the blind user, to the central server hosting the spatial database The codes are well suited to decoding a message embedded in noise signals which may be orders of magnitude larger than the signal itself […]




1 million new tags every month

“… bloggers are not settling on common, decentralized meanings for tags; rather, they are often independently choosing distinct tags to refer to the same concepts”

(Brooks & Montanez, 2006)


Tag functions in del.icio.us (Golder & Huberman, 2006)

(also) typical author tags

typical reader tags

Identifying what (or who) the tagged item is about

Refining categories

Identifying qualities or characteristics

Identifying what the tagged item is

Identifying who owns the tagged content / item

Self reference

Task organizing


Tag functions mirror standard metadata elements (here: Dublin Core)

(also) typical author tags

typical reader tags

dc:subject (“the topic of the resource“) – Identifying what (or who) the tagged item is about

dc:description – Refining categories

dc:description – Identifying qualities or characteristics

dc:type, dc:format – Identifying what the tagged item is

dc:creator (dc:rights ?) – Identifying who owns the tagged content / item

dc:contributor – Self reference

(dc:relation ?) – Task organizing


Do we expect tags / metadata to add to content?

(also) typical author tags

typical reader tags

YES

NO dc:subject (“the topic of the resource”) – Identifying what (or who) the tagged item is about

dc:description – Refining categories

dc:description – Identifying qualities or characteristics

dc:type, dc:format – Identifying what …

dc:creator (dc:rights ?) – Identifying who …

dc:contributor – Self reference

(dc:relation ?) – Task organizing

[...] Eleven years ago, as a condition for ending the Persian Gulf War, the Iraqi regime was required to destroy its weapons of mass destruction, to cease all development of such weapons, […] The Iraqi regime has violated all of those obligations. It possesses and produces chemical and biological weapons. [...]

Identifying who owns the tagged item: http://www.whitehouse.gov/news/releases/2002/...


How do author tags relate to content?




Empirical study – Part I: Overview

Blog corpus

Content classification with several text mining methods

“Gold standard” derived from content classification by human annotators

Controlled vocabulary / taxonomy


Data

Taken from the Weblogging Ecosystems 2006 corpus

random sample of 100 blog posts

written on 4th July 2006 (the first day of the large corpus)

written in English

tagged by their authors

length 50-500 words

no blogs with (only) meaningless tags

Tags were nearly all “topic tags”, broad range of topics


Classification taxonomy: WordNet Domains (Magnini & Cavaglia, 2000) – excerpt

| FACTOTUM | DOCTRINES | FREE_TIME | APPLIED_SCIENCE | PURE_SCIENCE | SOCIAL_SCIENCE |
|---|---|---|---|---|---|
| / number | / archaeology | / play | / agriculture | / astronomy | / administration |
| / color | / astrology | / / betting | / alimentation | / / topography | / anthropology |
| / time_period | / history | / / card | / gastronomy | / biology | / ethnology |
| / person | / / heraldry | / / chess | / architecture | / / biochemistry | / / folklore |
| / quality | / linguistics | / sport | / / town_planning | / / ecology | / artisanship |
| / metrology | / / grammar | / / badminton | / / building_industry | / / plants | / body_care |
| | / literature | / / baseball | / / furniture | / / zoology | / commerce |
| | / / philology | / / basketball | / computer_science | / / / entomology | / economy |
| | / philosophy | / / cricket | / engineering | / / anatomy | / / banking |
| | / psychology | / / football | / / mechanics | / / physiology | / / book_keeping |
| | / / psychoanalysis | / / golf | / / astronautics | / / genetics | / / enterprise |
| | / art | / / rugby | / / electrotechnics | / chemistry | / / exchange |
| | / / dance | / / soccer | / / hydraulics | / earth | / / insurance |
| | / / drawing | / / table_tennis | / medicine | / / geology | / / money |
| | / / / painting | / / tennis | / / dentistry | / / meteorology | / / tax |
| | / / / philately | / / volleyball | / / pharmacy | / / oceanography | / / finance |
| | / / music | / / cycling | / / psychiatry | / / paleontology | / fashion |


Human annotations (“reader tags“)

5 graduate students

received corpus + WND hierarchy

Labelled each post with ≥ 0 domains (recommended: 0-3)

assigned 340/500 * 1 domain; 160/500 * 2 domains, 23/500 * 0 domains

Aggregation: consensus = domains with at least 2 votes

IAA = average pairwise Jaccard similarity

Jaccard similarity: $J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{1}{2}\left(1 + \frac{|A \cap B|}{|A \cup B|}\right) F_1(A,B)$

IAA = 0.39

Similarity to consensus {0.31, 0.47}

Pairwise similarity {0.26, 0.4}

The corresponding F1 values can be found at the paper’s Web site: http://www.wiwi.hu-berlin.de/~berendt/Papers/ICWSM07/
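The aggregation and agreement measures on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the function names, the example labels, and the treatment of two empty label sets as identical are all assumptions, and the slides do not say how the per-post values were averaged into the reported IAA.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical (assumption)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def f1(a, b):
    """Set-based F1 = 2 |A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def consensus(labelings, min_votes=2):
    """Domains chosen by at least `min_votes` annotators for one post."""
    votes = Counter(d for labels in labelings for d in set(labels))
    return {d for d, n in votes.items() if n >= min_votes}


def average_pairwise_jaccard(labelings):
    """Mean Jaccard similarity over all annotator pairs for one post."""
    pairs = list(combinations(labelings, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical labels from five annotators for a single post.
post_labels = [
    {"music", "computer_science"},
    {"music"},
    {"computer_science"},
    {"play"},
    set(),
]

print(consensus(post_labels))
print(round(average_pairwise_jaccard(post_labels), 3))
```

For any pair of label sets, `jaccard(a, b)` equals `0.5 * (1 + jaccard(a, b)) * f1(a, b)`, which is the identity stated on the slide.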


Automated classification

Cleaning, POS tagging, lemmatizing

All 10 combinations of

feature sets for bag-of-words:

author tag(s) (tag)

nouns from the title (titleN)

nouns from the blog post body (bodyN)

the top 5 TF.IDF keyphrases (noun n-grams) from the body (TF.IDF)

the top 5 IDF keyphrases (noun n-grams) from the body (IDF)

word sense disambiguation strategies:

the top sense of the word (T)

all senses of the word (A)

From term to sense to domain
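A rough sketch of the term-to-sense-to-domain step may help here. The lexicon below is a hypothetical toy stand-in for WordNet and WordNet Domains, and picking the most frequent domain is an assumed selection rule; the slides do not state how domains were ranked.

```python
from collections import Counter
from itertools import product

FEATURE_SETS = ["tag", "titleN", "bodyN", "TF.IDF", "IDF"]
WSD_STRATEGIES = ["T", "A"]  # T = top sense only, A = all senses

# Hypothetical toy lexicon: term -> senses (most frequent first) -> domains.
# A real run would use WordNet senses and the WordNet Domains mapping instead.
TOY_SENSES = {
    "bank": ["bank#1", "bank#2"],
    "loan": ["loan#1"],
    "river": ["river#1"],
}
TOY_DOMAINS = {
    "bank#1": ["banking"],
    "bank#2": ["geography"],
    "loan#1": ["banking"],
    "river#1": ["geography"],
}


def classify(terms, strategy, top_k=1):
    """Map terms to senses to domains and return the top_k most frequent domains."""
    counts = Counter()
    for term in terms:
        senses = TOY_SENSES.get(term, [])
        if strategy == "T":
            senses = senses[:1]  # keep only the top sense
        for sense in senses:
            counts.update(TOY_DOMAINS.get(sense, []))
    return [domain for domain, _ in counts.most_common(top_k)]


# Every feature set is paired with every disambiguation strategy.
methods = list(product(FEATURE_SETS, WSD_STRATEGIES))
print(len(methods))

print(classify(["bank", "loan"], "T"))
print(classify(["bank", "river", "river"], "A"))
```

With strategy T only the first listed sense of "bank" contributes, whereas with strategy A both senses vote, so the surrounding terms decide which domain wins.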


WordNet: words → senses


WordNet Domains as the classification hierarchy: senses → domains


Study I: Results




Part II: Do automated methods err in the same way? What about automated-method – annotator suitability?

Blog corpus

Content classification with several text mining methods

“Gold standard” derived from content classification by human annotators

Controlled vocabulary / taxonomy


Similarity between methods

Automated methods

Human annotators


Combining automated methods

Tags complement content.

Definition: In a corpus of posts consisting of body elements (text, title, ...) and author tags, the tags are not metadata but content if (1) the tags have a low similarity with the body (such that body features cannot be used to predict the tags, or vice versa), and (2) the combination of body and tags predicts the human consensus classification of content better than either body or tags alone.

But: HOW are they more content?


Tags provide additional information

Yes! Ive been looking for a way to easily transfer songs on my iPOD to my computer. I want to back all of them up to DVD, but Apple makes it very difficult to pull them off. Thats for copyright purposes, Im sure, but there are legitimate uses for it as well. iPOD Agent, a $15 shareware program, allows you to do that and much more. Synchs contacts/notes/etc with Outlook, gets horoscopes/movie times/weather/RSS feeds. Good stuff. iPOD Soft Go get it! Adam

Tags: General, Music

Tag-based methods: Music

Body-based methods: Computer Science, Literature

Human annotators: 3 × (Music, Computer Science); Play; Free time
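The labels on this slide can be turned into a small worked example of the two conditions in the definition of "tags as content", applied here to a single post rather than a whole corpus. Two readings are assumptions: that three annotators each chose both Music and Computer Science while Play and Free time received one vote each, and that tag-based and body-based results are combined by a plain set union.

```python
def jaccard(a, b):
    """Jaccard similarity of two label sets; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Labels read off the slide; the vote counts are an assumed reading of "3 x (...)".
tag_based = {"Music"}
body_based = {"Computer Science", "Literature"}
annotator_votes = {"Music": 3, "Computer Science": 3, "Play": 1, "Free time": 1}

# Consensus = domains with at least 2 votes, as in the aggregation rule of the study.
consensus = {d for d, n in annotator_votes.items() if n >= 2}
combined = tag_based | body_based  # simple union; the combination rule is an assumption

print("tags vs body:      ", round(jaccard(tag_based, body_based), 2))
print("tags vs consensus: ", round(jaccard(tag_based, consensus), 2))
print("body vs consensus: ", round(jaccard(body_based, consensus), 2))
print("both vs consensus: ", round(jaccard(combined, consensus), 2))
```

Under these assumptions the tag-based and body-based labels do not overlap at all (0.0), and their union matches the consensus better (0.67) than tags alone (0.5) or body alone (0.33).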


A closer look at agreement


Tags don’t give all the information – go on reading!

Todays New York Times includes this report about Sleeper Cell, a 10-part Showtime series about a faithful Muslim named Darwyn (yes, we get it) who infiltrates a terrorist group. […] discuss the shows idealism: "You learn there are peace-loving souls in every religion," said Mr. Fehr, who once served in the Israeli military. "We have to respect and strengthen the peace-believers, and hopefully find a way turn the terrorists." In that sense, the production, for all its violence […] is perhaps most ambitious for the idealism that courses through it. […]

Tags: Radio & TV, Islam

Tag-based methods: TV, Religion

Body-based methods: Politics (and others)

A1: Religion, TV

A3: Politics


Interpretation requires knowledge of blogspace authoring conventions

I first wrote about Ludwika Ogorzelec s Space Crystallization Cycle after seeing her show here in NYC last February. Her prolific installation of site specific cellophane lattice has graced a broad range of settings since the series began a couple years ago. The latest... farmland. Farming With Mary is a Queensland Australia project that brought environmental artists from all over the globe to the farming community. Ludwika installed three pieces, each comprised of about 5km of cellophane, on a farm in Tuchikoi in the Mary Valley Region. She also installed one piece in Noosa Woods. Pictures after the jump.

Tags: Art

Tag-based methods: Art

Body-based methods: (other domains)


A1: Art

A3: Photography




Searching for the basic level: Coarsening to hierarchy level 2 (which contains most human-assigned tags: average no. of occurrences of the four levels = 0.83, 8.0, 1.96, and 0.14)

A typical uncertainty of lay annotators: Religion (level 2) OR Theology (level 3)?
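Coarsening to level 2 can be sketched as walking up the hierarchy until the second level is reached. The parent links below are a small illustrative excerpt written by hand, and the helper names are hypothetical; level 1 is taken to be the top-level category (for example DOCTRINES).

```python
# Illustrative excerpt of child -> parent links; not the full WordNet Domains hierarchy.
PARENT = {
    "theology": "religion",
    "religion": "doctrines",
    "painting": "drawing",
    "drawing": "art",
    "art": "doctrines",
}


def path_from_root(domain):
    """Return the chain of domains from the top-level category down to `domain`."""
    path = [domain]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


def coarsen(domain, level=2):
    """Replace a domain by its ancestor at `level`; shallower domains are left unchanged."""
    path = path_from_root(domain)
    return path[min(level, len(path)) - 1]


print(coarsen("theology"))
print(coarsen("religion"))
print(coarsen("painting"))
```

After coarsening, an annotator who chose Theology and one who chose Religion count as agreeing, which is the point of moving the comparison to level 2.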


News is easier to classify by content: comparison of classification on the blog corpus vs. Reuters RCV1

settings: all nouns, all synsets, TF.IDF weighting
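For readers unfamiliar with TF.IDF weighting, the following is a minimal sketch of one common variant (length-normalised term frequency times the natural log of inverse document frequency). The slides do not say which variant was used, so the formula, the function names and the toy documents are assumptions; the study also worked with nouns and noun n-grams rather than raw tokens.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        })
    return weights


def top_terms(term_weights, k=5):
    """The k highest-weighted terms; ties are broken alphabetically."""
    ranked = sorted(term_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus of already-extracted nouns.
docs = [
    ["war", "iraq", "war", "bomb"],
    ["ipod", "music", "computer"],
    ["music", "war", "film"],
]

weights = tf_idf(docs)
print(top_terms(weights[0], k=2))
```

In the first toy document "war" is the most frequent term, but it also occurs in another document, so the rarer terms "bomb" and "iraq" receive higher weights.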


Summary

Blog post body and tag information are complementary.

For some readers, tags or body are the best / only indicators of content.

A gold standard may be a meaningless concept!

Respect this in the design of search engines and labelling recommenders!

Q: What do users (readers and …) do with tags?


Outlook

Improve on the mapping to the WordNet domain hierarchy

e.g., relatedness of domains (Berendt & Navigli, 2006)

Use different classification system (but: which one??)

Investigate differences that arise with named entities / proper nouns

Combine methods: extraction from body, tags, lexical resources, similarity mappings / collaborative filtering, machine learning, interactive (tagyu.com, AutoTag, TagAssist, …)

Larger samples, (quasi-)experimental designs, psycholinguistic methods

Investigate impact of language and other factors

Investigate impact of tagging-system features (Marlow et al., 2006) …

Study developments / convergence of tags

Hayes et al., 2007: “A-tags / A-blogs”

Tags: Thank you for your attention!


Sources

See reference list of the paper, except these two papers from ICWSM 2007:

TagAssist

Sanjay Sood, Sara Owsley, Kristian Hammond and Larry Birnbaum

TagAssist: Automatic Tag Suggestion for Blog Posts

http://www.icwsm.org/papers/paper10.html

Hayes

Conor Hayes and Paolo Avesani

Using Tags and Clustering to Identify Topic-Relevant Blogs

http://www.icwsm.org/papers/paper23.html