Universal Scaling of Semantic Information Revealed from IB word clusters or Human language as optimal biological adaptation. Naftali Tishby, School of Computer Science & Engineering & Interdisciplinary Center for Neural Computation, The Hebrew University, Jerusalem, Israel. http://www.cs.huji.ac.il/~tishby. Workshop on Machine Learning in Natural Language Processing, CRI, Haifa University, December 2006.


Page 1: Universal Scaling of Semantic Information Revealed from IB word clusters or Human language as optimal biological adaptation Naftali Tishby School of Computer

Universal Scaling of Semantic Information
Revealed from IB word clusters
or
Human language as optimal biological adaptation

Naftali Tishby
School of Computer Science & Engineering & Interdisciplinary Center for Neural Computation
The Hebrew University, Jerusalem, Israel

http://www.cs.huji.ac.il/~tishby

Workshop on Machine Learning in Natural Language Processing
CRI, Haifa University

December 2006

Page 2:

Outline:

Language – a window into our cognitive processing
• What can we learn from word statistics?
• How can we quantify it?
• Is there a “correct level” of description?

Information Bottleneck (IB) and the representation of relevance
• Finding approximate sufficient statistics

Words, documents and meaning…
• Trading complexity and accuracy

Scaling of semantic information
• Possible models: small world properties

Page 3:

What are words?
• acquired persistent neural activity associated with perception and cognitive functions
• appear in every language at a regular, sub-linear power-law rate

[Figure: number of different words vs. number of observed words (linear axes), and log number of different words vs. log number of words with a linear fit, y = 0.64x + 2.07.]
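The sub-linear power-law growth described on this slide can be checked on any token stream by counting distinct words as tokens arrive and fitting a line in log-log space. The following is a minimal sketch, not the procedure used in the talk: the function names `vocabulary_growth` and `fit_power_law` are hypothetical, and the tokens are synthetic draws from a Zipf-like distribution rather than any of the corpora shown.

```python
import numpy as np

def vocabulary_growth(tokens):
    """Return (number of observed tokens, number of distinct tokens) after each token."""
    seen = set()
    distinct = []
    for tok in tokens:
        seen.add(tok)
        distinct.append(len(seen))
    n = np.arange(1, len(distinct) + 1)
    return n, np.array(distinct)

def fit_power_law(n, v):
    """Least-squares line in log-log space: log v = slope * log n + intercept."""
    slope, intercept = np.polyfit(np.log(n), np.log(v), 1)
    return slope, intercept

# Synthetic stand-in for a corpus (assumption: Zipf-distributed token ids).
rng = np.random.default_rng(0)
tokens = rng.zipf(1.5, size=50_000).tolist()

n, v = vocabulary_growth(tokens)
# Skip the first 100 observations, where the curve is dominated by start-up noise.
slope, intercept = fit_power_law(n[100:], v[100:])
print(f"log(#different words) = {slope:.2f} * log(#words) + {intercept:.2f}")
```

A slope between 0 and 1 corresponds to the sub-linear growth on the slide; the fitted value depends on the synthetic distribution and is not comparable to the reported 0.64.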

Page 4:

English

[Figure: number of different words vs. # docs and vs. # words, on linear and log-log axes (first 100 docs are not displayed). Log-log linear fits: y = 0.55x + 5.92 (vs. log(# docs)) and y = 0.56x + 2.81 (vs. log(# words)).]

Page 5:

Hebrew

[Figure: number of different words vs. # docs and vs. # words, on linear and log-log axes (first 100 docs are not displayed). Log-log linear fits: y = 0.65x + 4.57 (vs. log(# docs)) and y = 0.64x + 2.07 (vs. log(# words)).]

Page 6:

Korean

[Figure: number of different words vs. # docs and vs. # words, on linear and log-log axes (first 100 docs are not displayed). Log-log linear fits: y = 0.77x + 5.13 (vs. log(# docs)) and y = 0.70x + 2.21 (vs. log(# words)).]

Page 7:

Rank – Frequency of words

Words exhibit “scale-free” statistics: Zipf’s law

[Figure: Hebrew Zipf curve, log frequency vs. log rank.]
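A rank-frequency curve like the one on this slide is obtained by counting words, sorting the counts in decreasing order, and plotting log frequency against log rank. The sketch below shows only that bookkeeping; the helper name `rank_frequency` and the toy sentence are illustrative assumptions, not material from the talk.

```python
from collections import Counter
import numpy as np

def rank_frequency(tokens):
    """Return (ranks, relative frequencies), most frequent word first."""
    counts = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    freqs = counts / counts.sum()
    ranks = np.arange(1, len(freqs) + 1)
    return ranks, freqs

# Toy text (assumption); a real Zipf curve needs a large corpus.
text = "the cat sat on the mat and the dog sat on the log".split()
ranks, freqs = rank_frequency(text)

# These are the coordinates one would plot for a Zipf curve.
log_rank, log_freq = np.log(ranks), np.log(freqs)
```

On a large corpus, an approximately straight line in these coordinates is what the slide calls “scale-free” statistics.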

Page 8:

How are words/languages generated?

Basic observations:
• serve for communication and representation
• adapt to variable world statistics
• collective (social) entity
• acquired continuously (individually and collectively)

Competition between communication efficiency and adaptability / learnability

Page 9:

Complexity – Accuracy Tradeoff

[Figure: accuracy vs. complexity, with labels: Possible Models/representations, Limited data, Bounded Computation.]

Page 10:

Can we quantify it…?

When there is a (relevant) prediction or distortion measure:

Accuracy: good predictions (low distortion/error)
Complexity: long minimal description (optimal codes)

A general tradeoff between distortion and compression: Information Theory

Page 11:

What can we learn from word co-occurrence...?

          Audio  Health  www  Drug  noise  Dos  Doctor  ...
Topic1      12       0    0     0      8    0       0   ...
Topic2       0       9    2    11      1    0       6   ...
Topic3       0      10    1     6      0    0      20   ...
Topic4       9       1    0     0      7    0       1   ...
Topic5       0       3    9     0      1   10       0   ...
Topic6       1      11    0     6      0    1       7   ...
Topic7       0       0    8     0      2   12       2   ...
Topic8      15       0    1     1     10    0       0   ...
Topic9       0      12    1    16      0    1      12   ...
Topic10      1       0    9     0      1   11       2   ...
...        ...     ...  ...   ...    ...  ...     ...   ...

Page 12:

Representation and Mutual Information

$X \xrightarrow{\; p(\hat{x}|x) \;} \hat{X}$

We need to index the maximal number of non-overlapping green blobs inside the blue blob:

$2^{nH(X)} / 2^{nH(X|\hat{X})} = 2^{nI(X,\hat{X})}$

(mutual information!)

[Figure: a blue blob of size $2^{nH(X)}$ containing non-overlapping green blobs of size $2^{nH(X|\hat{X})}$.]

Page 13:

IB: an Information Theoretic Principle for extracting Relevant structure

The minimal representation of X that keeps as much information about another variable, Y, as possible.

[Diagram: $X$, $\hat{X}$ and $Y$, with $I(X,\hat{X})$ between $X$ and $\hat{X}$, $I(\hat{X},Y)$ between $\hat{X}$ and $Y$, and $I(X,Y)$ between $X$ and $Y$.]

$\min_{p(\hat{x}|x)} \; I(X,\hat{X}) - \beta\, I(\hat{X},Y)$

Generalizes the classical notion of “sufficient statistics”.

Page 14:

The Self Consistent Equations

Marginal:
$p(\hat{x}) = \sum_x p(\hat{x}|x)\, p(x)$

Markov condition:
$p(y|\hat{x}) = \sum_x p(y|x)\, p(x|\hat{x})$

Bayes’ rule:
$p(x|\hat{x}) = \frac{p(x)}{p(\hat{x})}\, p(\hat{x}|x)$

$\frac{\delta L[p(\hat{x}|x)]}{\delta p(\hat{x}|x)} = 0 \;\Rightarrow\; p(\hat{x}|x) = \frac{p(\hat{x})}{Z(x,\beta)} \exp\left(-\beta\, D_{KL}[x,\hat{x}]\right)$

Page 15:

The emergent effective distortion measure:

$D_{KL}[x,\hat{x}] = D_{KL}[\,p(y|x)\,|\,p(y|\hat{x})\,] = \sum_y p(y|x) \log \frac{p(y|x)}{p(y|\hat{x})}$

• Regular if $p(y|x)$ is absolutely continuous w.r.t. $p(y|\hat{x})$
• Small if $\hat{x}$ predicts $y$ as well as $x$

[Diagram: $x \to \hat{x}$ via $p(\hat{x}|x)$; $x \to y$ via $p(y|x)$; $\hat{x} \to y$ via $p(y|\hat{x})$.]
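For discrete distributions, the effective distortion on this slide is the Kullback-Leibler divergence between the two conditional distributions over y. The sketch below is one possible implementation; the function name `kl_distortion` is hypothetical, natural logarithms (nats) are an assumption, and returning infinity when absolute continuity fails is a design choice that mirrors the regularity condition above.

```python
import numpy as np

def kl_distortion(p_y_given_x, p_y_given_xhat):
    """D_KL[p(y|x) || p(y|x_hat)] in nats for two discrete distributions over y."""
    p = np.asarray(p_y_given_x, dtype=float)
    q = np.asarray(p_y_given_xhat, dtype=float)
    support = p > 0                      # terms with p(y|x) = 0 contribute nothing
    if np.any(q[support] == 0):          # p(y|x) not absolutely continuous w.r.t. p(y|x_hat)
        return float("inf")
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

same = kl_distortion([0.5, 0.5], [0.5, 0.5])       # x_hat predicts y exactly as x does
coarser = kl_distortion([1.0, 0.0], [0.5, 0.5])    # x_hat is less informative than x
irregular = kl_distortion([0.5, 0.5], [1.0, 0.0])  # regularity condition violated
```

The distortion is zero exactly when the cluster predicts y as well as x does, which is the "small" case on the slide.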

Page 16:

The Information Bottleneck Algorithm

$\min_{p(y|\hat{x}),\, p(\hat{x}),\, p(\hat{x}|x)} \left\{ I(X,\hat{X}) + \beta \langle D_{KL}[x,\hat{x}] \rangle \right\} = \min_{p(y|\hat{x})} \min_{p(\hat{x})} \min_{p(\hat{x}|x)} \; -\langle \log Z(x,\beta) \rangle$ (“free energy”)

$p_{t+1}(\hat{x}|x) = \frac{p_t(\hat{x})}{Z_t(x,\beta)} \exp\left(-\beta\, D_{KL}[x,\hat{x}]\right)$

$p_t(\hat{x}) = \sum_x p(x)\, p_t(\hat{x}|x)$

$p_t(y|\hat{x}) = \sum_x p(y|x)\, p_t(x|\hat{x})$
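The alternating updates on this slide can be sketched for a small discrete joint distribution as follows. This is an illustration under stated assumptions, not the implementation behind the talk's results: the function name `iterative_ib`, the parameters `n_clusters`, `n_iter` and `seed`, the random soft initialisation, the fixed β and the toy joint `p_xy` are all hypothetical choices. The variable `t` plays the role of x̂.

```python
import numpy as np

def iterative_ib(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Iterate the self-consistent IB equations for a joint p(x, y).

    Returns p(t|x), p(t), p(y|t). Assumes every row of p_xy has nonzero mass.
    """
    rng = np.random.default_rng(seed)
    eps = 1e-12                                   # guards against log(0) and 0/0
    p_x = p_xy.sum(axis=1)                        # p(x)
    p_y_x = p_xy / p_x[:, None]                   # p(y|x)
    p_t_x = rng.random((len(p_x), n_clusters))    # random soft assignment p(t|x)
    p_t_x /= p_t_x.sum(axis=1, keepdims=True)

    def marginal_and_decoder(p_t_x):
        p_t = p_x @ p_t_x                                      # p(t) = sum_x p(x) p(t|x)
        p_x_t = (p_t_x * p_x[:, None]) / (p_t[None, :] + eps)  # Bayes: p(x|t)
        p_y_t = p_x_t.T @ p_y_x                                # p(y|t) = sum_x p(y|x) p(x|t)
        return p_t, p_y_t

    for _ in range(n_iter):
        p_t, p_y_t = marginal_and_decoder(p_t_x)
        # D_KL[p(y|x) || p(y|t)] for every pair (x, t)
        log_ratio = np.log(p_y_x[:, None, :] + eps) - np.log(p_y_t[None, :, :] + eps)
        d_kl = np.sum(p_y_x[:, None, :] * log_ratio, axis=2)
        # p(t|x) proportional to p(t) exp(-beta * D_KL); normalising over t is the 1/Z factor
        logits = np.log(p_t + eps)[None, :] - beta * d_kl
        logits -= logits.max(axis=1, keepdims=True)            # numerical stability
        p_t_x = np.exp(logits)
        p_t_x /= p_t_x.sum(axis=1, keepdims=True)

    p_t, p_y_t = marginal_and_decoder(p_t_x)
    return p_t_x, p_t, p_y_t

# Toy joint (assumption): x = 0, 1 predict y the same way, as do x = 2, 3.
p_xy = np.array([[0.20, 0.05],
                 [0.20, 0.05],
                 [0.05, 0.20],
                 [0.05, 0.20]])
p_t_x, p_t, p_y_t = iterative_ib(p_xy, n_clusters=2, beta=5.0)
```

Because the objective is not jointly convex, the result can depend on the initialisation; in practice one would try several seeds or anneal β, as the later slides on deterministic annealing suggest.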

Page 17:

The emergent effective distortion measure:

$D_{KL}[x,\hat{x}] = D_{KL}[\,p(y|x)\,|\,p(y|\hat{x})\,]$

[Diagram: generalized BA-algorithm over $p(\hat{x})$, $p(\hat{x}|x)$ and $p(y|\hat{x})$.]

Page 18:

Information Curves

Can be calculated analytically for Markov chains, Gaussian processes, etc., and numerically in general.

The limit is always the convex envelope of increasing complexity.

[Figure: $I_Y$ vs. $I_X$, with curves $I_Y^{C_1}(I_X^{C_1})$, $I_Y^{C_2}(I_X^{C_2})$ and $I_Y^{C_3}(I_X^{C_3})$.]

Page 19:

Words and topics again...

          Audio  Health  www  Drug  noise  Dos  Doctor  ...
Topic1      12       0    0     0      8    0       0   ...
Topic2       0       9    2    11      1    0       6   ...
Topic3       0      10    1     6      0    0      20   ...
Topic4       9       1    0     0      7    0       1   ...
Topic5       0       3    9     0      1   10       0   ...
Topic6       1      11    0     6      0    1       7   ...
Topic7       0       0    8     0      2   12       2   ...
Topic8      15       0    1     1     10    0       0   ...
Topic9       0      12    1    16      0    1      12   ...
Topic10      1       0    9     0      1   11       2   ...
...        ...     ...  ...   ...    ...  ...     ...   ...

Page 20:

Simple Example

         Audio  Noise  Health  Drug  Doctor  www  Dos  ...
Doc1       12      8       0     0       0    0    0   ...
Doc4        9      7       1     0       1    0    0   ...
Doc8       15     10       0     1       0    1    0   ...
Doc2        0      1       9    11       6    2    0   ...
Doc3        0      0      10     6      20    1    0   ...
Doc6        1      0      11     6       7    0    1   ...
Doc9        0      0      12    16      12    1    1   ...
Doc5        0      1       3     0       0    9   10   ...
Doc7        0      2       0     0       2    8   12   ...
Doc10       1      1       0     0       2    9   11   ...
...       ...    ...     ...   ...     ...  ...  ...   ...

Page 21:

A new compact representation

           Audio  Noise  Health  Drug  Doctor  www  Dos  ...
Cluster1     36     25       1     1       1    1    0   ...
Cluster2      1      1      42    39      45    4    2   ...
Cluster3      1      4       3     0       4   26   33   ...
...         ...    ...     ...   ...     ...  ...  ...   ...

The document clusters preserve the relevant information between the documents and words
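The claim that the clusters preserve the relevant information can be checked numerically on the counts shown in the simple example: sum the rows of each document group and compare the mutual information of the table before and after merging. This is a minimal sketch that uses only the seven displayed word columns (the elided "..." columns are not available); the helper name `mutual_information` and the `labels` array encoding the grouping are assumptions made for the illustration.

```python
import numpy as np

def mutual_information(counts):
    """I(row; column) in bits for a table of co-occurrence counts."""
    p = counts / counts.sum()
    p_row = p.sum(axis=1, keepdims=True)
    p_col = p.sum(axis=0, keepdims=True)
    mask = p > 0                                   # 0 * log 0 is taken as 0
    return float(np.sum(p[mask] * np.log2(p[mask] / (p_row @ p_col)[mask])))

# Columns: Audio, Noise, Health, Drug, Doctor, www, Dos
docs = np.array([
    [12,  8,  0,  0,  0, 0,  0],   # Doc1
    [ 9,  7,  1,  0,  1, 0,  0],   # Doc4
    [15, 10,  0,  1,  0, 1,  0],   # Doc8
    [ 0,  1,  9, 11,  6, 2,  0],   # Doc2
    [ 0,  0, 10,  6, 20, 1,  0],   # Doc3
    [ 1,  0, 11,  6,  7, 0,  1],   # Doc6
    [ 0,  0, 12, 16, 12, 1,  1],   # Doc9
    [ 0,  1,  3,  0,  0, 9, 10],   # Doc5
    [ 0,  2,  0,  0,  2, 8, 12],   # Doc7
    [ 1,  1,  0,  0,  2, 9, 11],   # Doc10
], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])  # cluster index of each row

# Merging documents means summing their count rows.
clusters = np.vstack([docs[labels == k].sum(axis=0) for k in range(3)])

i_full = mutual_information(docs)
i_clustered = mutual_information(clusters)
print(clusters.astype(int))
print(f"I(doc; word) = {i_full:.3f} bits, I(cluster; word) = {i_clustered:.3f} bits")
```

Merging rows can never increase the mutual information, so the ratio `i_clustered / i_full` measures how much of the document-word information survives the compression from ten rows to three.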

Page 22:

Analyzing Co-Occurrence Tables

[Figure: Topics-Words counts matrix; axes: Topics, Words.]

Page 23:

[Figure: the exact same counts matrix after permutation; axes: Topics, Words.]

Page 24:

[Figure: the counts matrix grouped into blocks; axes: Topic clusters, Word clusters.]

The word clusters provide a compact representation that preserves the information about the topics

Page 25:

Quantified by Mutual Information

$I(X_1;X_2) = H(X_1) - H(X_1|X_2) = \sum_{X_1,X_2} P(X_1)\, P(X_2|X_1) \log \frac{P(X_2|X_1)}{P(X_2)}$

The distinctions inside each cluster are less relevant for predicting the class

[Figure labels: Words; Irrelevant distinctions.]

Page 26:

Symmetric IB through Deterministic Annealing

[Figure: P(TC,TW); axes: Newsgroup, Word.]

Newsgroup clusters:
• alt.atheism, rec.autos, rec.motorcycles, rec.sport.*, sci.med, sci.space, soc.religion.christian, talk.politics.*
• comp.*, misc.forsale, sci.crypt, sci.electronics

Word clusters:
• car, turkish, game, team, jesus, gun, hockey, …
• x, file, image, encryption, window, dos, mac, …

Page 27:

Symmetric IB through Deterministic Annealing

[Figure: P(TC,TW); axes: Newsgroup, Word.]

Newsgroup clusters:
• comp.graphics, comp.os.ms-windows.misc, comp.windows.x
• comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, misc.forsale, sci.crypt, sci.electronics

Word clusters:
• windows, image, window, jpeg, graphics, …
• encryption, db, ide, escrow, monitor, …

Page 28:

Symmetric IB through Deterministic Annealing

[Figure: P(TC,TW); axes: Newsgroup, Word.]

Page 29:

Symmetric IB through Deterministic Annealing

[Figure: P(TC,TW); axes: Newsgroup, Word.]

Newsgroup clusters:
• alt.atheism, rec.sport.baseball, rec.sport.hockey, soc.religion.christian, talk.politics.mideast, talk.religion.misc
• rec.autos, rec.motorcycles, sci.med, sci.space, talk.politics.guns, talk.politics.misc

Word clusters:
• armenian, turkish, jesus, hockey, israeli, armenians, …
• car, q, gun, bike, fbi, health, …

Page 30:

Symmetric IB through Deterministic Annealing

[Figure: P(TC,TW); axes: Newsgroup, Word.]

Page 31:

Symmetric IB through Deterministic Annealing

[Figure: P(TC,TW); axes: Newsgroup, Word.]

Page 32:

Symmetric IB through Deterministic Annealing

[Figure: P(TC,TW); axes: Newsgroup, Word.]

Newsgroup cluster:
• alt.atheism, soc.religion.christian, talk.religion.misc

Word cluster:
• atheists, christianity, jesus, bible, sin, faith, …

Page 33:

We observe Semantic Scaling

[Figure: three panels with both axes from 0 to 1, and a log-log panel with data and a linear fit, y = 1.92*x - 0.866.]

$I_Y = I(\hat{X},Y) / I(X,Y)$, $I_X = I(X,\hat{X}) / H(X)$

$1 - I_Y = c\,(1 - I_X)^{1.92}$
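The exponent in the scaling relation is the slope of a straight-line fit of log(1 - I_Y) against log(1 - I_X), and the prefactor c is the exponential of the intercept. The sketch below shows that fit on synthetic points generated from the line reported on the slide, so it recovers 1.92 by construction; it is not the talk's data. The function name `scaling_exponent` and the use of natural logarithms are assumptions.

```python
import numpy as np

def scaling_exponent(i_x, i_y):
    """Fit log(1 - I_Y) = gamma * log(1 - I_X) + log(c); return (gamma, c)."""
    gamma, log_c = np.polyfit(np.log(1.0 - i_x), np.log(1.0 - i_y), 1)
    return gamma, float(np.exp(log_c))

# Synthetic normalised information values (assumption), chosen so that
# log(1 - I_X) spans roughly -3.1 to -2.1 as on the slide's log-log panel.
i_x = np.linspace(0.88, 0.955, 20)
i_y = 1.0 - np.exp(-0.866) * (1.0 - i_x) ** 1.92

gamma, c = scaling_exponent(i_x, i_y)
print(f"gamma = {gamma:.2f}, c = {c:.3f}")
```

With real information curves the points would scatter around the line, and the quality of the fit over the chosen range would need to be inspected before reading the slope as an exponent.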

Page 34:

[Figure: I(T;Y)/I(X;Y) vs. I(T;X)/H(X), both axes from 0 to 1; legend: 20NG Noam data, 20NG Russian data.]

Page 35:

Simplified Chinese     2.09
Traditional Chinese    1.73
Dutch                  2.3
French                 2.22
Hebrew                 1.63
Italian                2.35
Japanese               1.42
Portuguese             2.9
Spanish                1.89

Page 36:

[Figure: log(1-I(T;Y)/I(X;Y)) vs. log(1-I(T;X)/H(X)); legend: Chinese Simplified, Chinese Traditional, Dutch, French, Hebrew, Italian, Japanese, Korean, Portuguese, Spanish, English 20NG Jose, English UTF, English Reuters, English 20NG Noam.]

Page 37:

[Figure: log(1-I(T;Y)/I(X;Y)) vs. log(1-I(T;X)/H(X)) for a random selection of 200 words.]

Page 38:

)();( ˆˆ XcHYXIXX

Can we understand it?

$I_Y = I(\hat{X},Y) / I(X,Y)$, $I_X = I(X,\hat{X}) / H(X)$

[Figure: $I_Y$ vs. $I_X$, both axes from 0 to 1, annotated with $I(X;Y|\hat{X})$ and $H(X|\hat{X})$.]

Any subset of the language has the same exponent!

Page 39:

)();( ˆˆ XcHYXIXX

But what does it tell about Language?

“Efficiency of the words”: log-ratio of added word entropy that is transferred to meaningful information:

$\frac{\Delta \log I(X;Y|\hat{X})}{\Delta \log H(X|\hat{X})} \sim 2$

Language appears to have constant word efficiency!

Page 40:

Possible Explanations?

Power laws are too common to mean anything…
• Zipf’s law and similar…
• “never trust linear log-log plots…”

It’s a property of my analysis, not of language
• How do I know that it’s not all in the way we cluster the words?

Words are generated at a constant level of ambiguity:
• words are generated at a constant rate, depending only on the concept (occurred) ambiguity in usage
• irrespective of vocabulary size or domain

Small world (scale free) properties of word acquisition…

Page 41:

Many Thanks to…

Bill Bialek, Fernando Pereira, Noam Slonim
Dmitry Davidov, Amir Navot, Josemine Magdalen
Banter Co. (z”l)