visualising a text with a tree cloud

44
IFCS 2009 Dresden – 17/03/2009 Visualising a text with a tree cloud Philippe Gambette, Jean Véronis

Upload: philippe-gambette

Post on 23-Aug-2014

12.580 views

Category:

News & Politics


1 download

DESCRIPTION

A new visualization tool to display the words of a text (newspaper article, blog content, political speech) is presented, the tree cloud, a kind of improved tag cloud: http://www.treecloud.org.

TRANSCRIPT

Page 1: Visualising a text with a tree cloud

IFCS 2009Dresden – 17/03/2009

Visualising a text with a tree cloud

Philippe Gambette, Jean Véronis

Page 2: Visualising a text with a tree cloud

• Tag and word clouds

Outline

• Enhanced tag clouds• Tree clouds• Construction steps• Quality control

Page 3: Visualising a text with a tree cloud

• Built from a set of tags• Font size related to frequency

Tag clouds

What is considered the first tag cloud, from

D. Coupland: Microserfs, HarperCollins, Toronto,

1995

Page 4: Visualising a text with a tree cloud

• Gained popularity with Flickr

• Built from a set of tags• Font size related to frequency

Tag clouds

Flickr's all time most popular tags

Page 5: Visualising a text with a tree cloud

• Gained popularity with Wordle

• Built from a set of words from a text• Font size related to frequency

Word clouds

Page 6: Visualising a text with a tree cloud

• Gained popularity with Wordle

• Built from a set of words from a text• Font size related to frequency

Word clouds

GoogleImage(obama inaugural address wordle)

Page 7: Visualising a text with a tree cloud

• Gained popularity with Wordle

• Built from a set of words from a text• Font size related to frequency

Word clouds

GoogleImage(obama inaugural address wordle)

Page 8: Visualising a text with a tree cloud

• Gained popularity with Wordle

• Built from a set of words from a text• Font size related to frequency

Word clouds

GoogleImage(obama inaugural address wordle)

Page 9: Visualising a text with a tree cloud

• shared tags in red on del.icio.us

Add information from the text:• color intensity to express recency in Amazon

Enhanced tag / word clouds

• optimize blank space and semantic proximityKaser & Lemire, WWW'07

• “topigraphy”: 2D placement according to cooccurrenceFujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08

• group together cooccurring tags on the same lineHassan-Montero & Herrero-Solana, InScit'06

Page 10: Visualising a text with a tree cloud

• shared tags in red on del.icio.us

Add information from the text:• color intensity to express recency in Amazon

Enhanced tag / word clouds

• optimize blank space and semantic proximityKaser & Lemire, WWW'07

• “topigraphy”: 2D placement according to cooccurrenceFujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08

• group together cooccurring tags on the same lineHassan-Montero & Herrero-Solana, InScit'06

Page 11: Visualising a text with a tree cloud

• shared tags in red on del.icio.us

Add information from the text:• color intensity to express recency in Amazon

Enhanced tag / word clouds

• optimize blank space and semantic proximityKaser & Lemire, WWW'07

• “topigraphy”: 2D placement according to cooccurrenceFujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08

• group together cooccurring tags on the same lineHassan-Montero & Herrero-Solana, InScit'06

Page 12: Visualising a text with a tree cloud

• optimize blank space and semantic proximityKaser & Lemire, WWW'07

• shared tags in red on del.icio.us

Add information from the text:• color intensity to express recency in Amazon

Enhanced tag / word clouds

• “topigraphy”: 2D placement according to cooccurrenceFujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08

• group together cooccurring tags on the same lineHassan-Montero & Herrero-Solana, InScit'06

Page 13: Visualising a text with a tree cloud

• optimize blank space and semantic proximityKaser & Lemire, WWW'07

• shared tags in red on del.icio.us

Add information from the text:• color intensity to express recency in Amazon

Enhanced tag / word clouds

• “topigraphy”: 2D placement according to cooccurrenceFujimura, Fujimura, Matsubayashi, Yamada & Okuda, WWW'08

• group together cooccurring tags on the same lineHassan-Montero & Herrero-Solana, InScit'06

Page 14: Visualising a text with a tree cloud

Extract semantic information from a text

• natural language processing:sense desamibiguation

Véronis (Hyperlex)

• literature analysis:philological approach: only consider the text

Brody

• text mining:semantic graph

Grimmer (Wordmapper)

• discourse analysis:tree analysis or cooccurrence graph, geodesic projection

Brunet (Hyperbase), Viprey (Astartex)

Page 15: Visualising a text with a tree cloud

Extract semantic information from a text

• natural language processing:sense desamibiguation

Véronis (Hyperlex)

• literature analysis:philological approach: only consider the text

Brody

• text mining:semantic graph

Grimmer (Wordmapper)

• discourse analysis:tree analysis or cooccurrence graph, geodesic projection

Brunet (Hyperbase), Viprey (Astartex)

Mayaffre, Quand travail, famille, et patrie cooccurrent dans le

discours de NicolasSarkozy, JADT'08

Page 16: Visualising a text with a tree cloud

Extract semantic information from a text

• natural language processing:sense desamibiguation

Véronis (Hyperlex)

• literature analysis:philological approach: only consider the text

Brody

• text mining:semantic graph

Grimmer (Wordmapper)

• discourse analysis:tree analysis or cooccurrence graph, geodesic projection

Brunet (Hyperbase), Viprey (Astartex)

Brunet, Les séquences (suite),JADT'08

Page 17: Visualising a text with a tree cloud

Extract semantic information from a text

• natural language processing:word sense disamibiguation

Véronis (Hyperlex)

• literature analysis:philological approach: only consider the text

Brody

• text mining:semantic graph

Grimmer (Wordmapper)

• discourse analysis:tree analysis or cooccurrence graph, geodesic projection

Brunet (Hyperbase), Viprey (Astartex)

Barry, Viprey,Approche comparative

des résultats d'exploration textuelle des discours

de deux leaders africainsKeita et Touré,

JADT'08

Page 18: Visualising a text with a tree cloud

Extract semantic information from a text

• natural language processing:word sense disamibiguation

Véronis (Hyperlex)

• literature analysis:philological approach: only consider the text

Brody

• text mining:semantic graph

Grimmer (Wordmapper)

Peyrat-Guillard,Analyse du discours

syndical sur l’entreprise, JADT'08

• discourse analysis:tree analysis or cooccurrence graph, geodesic projection

Brunet (Hyperbase), Viprey (Astartex)

Page 19: Visualising a text with a tree cloud

Extract semantic information from a text

• natural language processing:word sense disambiguation

Véronis (Hyperlex)

• literature analysis:philological approach: only consider the text

Brody

• text mining:semantic graph

Grimmer (Wordmapper)

Disambiguation of word “barrage”: dam, play-off,

roadblock, police cordon.

Véronis, HyperLex:Lexical Cartography for

Information Retrieval, 2004

• discourse analysis:tree analysis or cooccurrence graph, geodesic projection

Brunet (Hyperbase), Viprey (Astartex)

Page 20: Visualising a text with a tree cloud

Tag cloud + tree = tree cloud

GPL-licensed Treecloud in Python,available at http://www.treecloud.org

Built with TreeCloud and

SplitsTree: Huson 1998, Huson & Bryant 2006

Page 21: Visualising a text with a tree cloud

The first tree cloud

Tree cloud of the blog posts containing “Laurence Ferrari”from 25/11/2007 to 10/12/2007, by Jean Véronis

http://aixtal.blogspot.com/2007/12/actu-une-ferrari-dans-un-arbre.html

Page 22: Visualising a text with a tree cloud

Building a tree cloud – extracting the words

Extract words with frequency:

• stoplist?without stoplist with stoplist

Page 23: Visualising a text with a tree cloud

Building a tree cloud – extracting the words

Extract words with frequency:

• lemmatization?

• groups words with similar meaning?

70 most frequent words inObama's campaign speeches,

winsize=30, distance=dice, NJ-tree.

no lemmatization...sometimes interesting

Page 24: Visualising a text with a tree cloud

Building a tree cloud – dissimilarity matrix

Many semantic distance formulas based on cooccurrence

Page 25: Visualising a text with a tree cloud

Building a tree cloud – dissimilarity matrix

Many semantic distance formulas based on cooccurrence

slidingwindow S

Text

width wsliding step s

cooccurrence matricesO

11, O

12, O

21, O

22

semantic dissimilarity matrixchi squared, mutual information, liddel, dice, jaccard, gmean, hyperlex, minimum sensitivity, odds ratio, zscore, log likelihood, poisson-stirling...

Evert,Statistics of words cooccurrences,

PhD Thesis, 2005

Page 26: Visualising a text with a tree cloud

Building a tree cloud – dissimilarity matrix

Transformations needed on the dissimilarity:

• transform similarity into dissimilarity

• linear normalization for positive matrices to get distances in [0,1]

• affine normalization for matrices with positive or negative numbers, to get distances in [α,1] (for example α=0.1)

Page 27: Visualising a text with a tree cloud

Building a tree cloud – tree reconstruction

Many existing methods:

• Neighbor-JoiningSaitou & Nei, 1987

• Addtree variantsBarthelemy & Luong, 1987

• Quartet heuristicCilibrasi & Vitanyi, 2007

Page 28: Visualising a text with a tree cloud

Building a tree cloud – tree decoration

Choice of word sizes:

• computed directly from frequency (apply a log!)

or

• computed from frequency ranking (exponential distribution)

or

• statistical significance with respect to a reference corpus

Page 29: Visualising a text with a tree cloud

Building a tree cloud – tree decoration

Colors: chronology

150 most frequent words inObama's campaign speeches, winsize=30, distance=oddsratio, color=chronology, NJ-tree.

oldrecent

Built with TreeCloud and

Page 30: Visualising a text with a tree cloud

Building a tree cloud – tree decoration

Colors: dispersion

standarddeviationof the position

150 most frequent words inObama's campaign speeches, winsize=30, distance=oddsratio, color=dispersion, NJ-tree.

sparsedense

too blue?

Page 31: Visualising a text with a tree cloud

Building a tree cloud – tree decoration

Colors: dispersion

standarddeviationof the position word frequency

150 most frequent words inObama's campaign speeches, winsize=30, distance=oddsratio, color=norm-dispersion, NJ-tree.

sparsedense

too red?

Page 32: Visualising a text with a tree cloud

Building a tree cloud – tree decoration

Edge color or thickness:quality of the inducedcluster.

150 most frequent words inObama's campaign speeches, winsize=30, distance=oddsratio, color=chronology, NJ-tree.

Page 33: Visualising a text with a tree cloud

Quality control

Is there an objective quality measure of tree clouds?

Page 34: Visualising a text with a tree cloud

Quality control

Is there an objective quality measure of tree clouds?

What is the best method to build a tree cloud from my data?

Page 35: Visualising a text with a tree cloud

Quality control

Is there an objective quality measure of tree clouds?

What is the best method to build a tree cloud from my data?

Tree cloud variations if small changes?bootstrap to evaluate:

- stability of the result- robustness of the method

Page 36: Visualising a text with a tree cloud

Quality control

Is there an objective quality measure of tree clouds?

What is the best method to build a tree cloud from my data?

Tree cloud variations if small changes?bootstrap to evaluate:

- stability of the result- robustness of the method

Still, is there a more direct method?arboricity to show whether the distance matrix fits with a tree, which should imply stability?

Guénoche & Garreta, 2001Guénoche & Darlu, 2009

Page 37: Visualising a text with a tree cloud

Quality control – bootstrap

• Randomly delete words with probability 50%.

• Built tree cloud of original text, and altered text.

• Compute similarity of both trees (1-normalized RobinsonFoulds)

chisquaredmi

liddelldice

jaccardgmean

hyperlexms

oddsratiozscore

loglikelihoodpoissonstirling

0,5

0,55

0,6

0,65

0,7

0,75

0,8

0,85

0,9

0,95

similarity

4 altered versions of 10 Obama's speeches,

3000 words in average,width=30, NJ-tree.

Page 38: Visualising a text with a tree cloud

Quality control – arboricity

Relationship between “bootstrap quality” and arboricity:

0,52 0,54 0,56 0,58 0,6 0,62 0,64 0,66 0,68 0,70

10

20

30

40

50

60

70

80

arboricity (%)

bootstrap quality

Page 39: Visualising a text with a tree cloud

Quality control – arboricity

Relationship between “bootstrap quality” and arboricity:

0,52 0,54 0,56 0,58 0,6 0,62 0,64 0,66 0,68 0,70

10

20

30

40

50

60

70

80

correlation coefficient: 0.64

arboricity below 50%: danger!

arboricity (%)

bootstrap quality

Page 40: Visualising a text with a tree cloud

Perspectives

• Make the tool available on a web interface

• Evaluate tree clouds for discourse analysis

• Build the daily tree cloud of people popular on blogs,

with

http://labs.wikio.net

http://www.treecloud.org

Page 41: Visualising a text with a tree cloud

Thank you for your attention!

http://www.treecloud.org - http://www.splitstree.org

Tree cloud of the words appearingtwice or more in the IFCS 2009 call for paper

lemmatization, width=20, distance=dice, NJ-tree.

Page 42: Visualising a text with a tree cloud

Tree clouds focused on one word

http://www.treecloud.org - http://www.splitstree.org

Tree cloud of the neighborhood of “McCain” in Obama's campaign speeches

Page 43: Visualising a text with a tree cloud

Tree clouds focused on one word

http://www.treecloud.org - http://www.splitstree.org

Tree cloud of the neighborhood of “Bush” in Obama's campaign speeches

Page 44: Visualising a text with a tree cloud

Tree clouds focused on one word

http://www.treecloud.org - http://www.splitstree.org

Tree cloud of the neighborhood of “world” in Obama's campaign speeches