universal similarity - nvti · russian authors (in original cyrillic) s(t)=0.949 i.s. turgenev,...

Universal Similarity Paul Vitanyi CWI and University of Amsterdam,


Post on 27-Jun-2020




The Problem:

Given: Literal objects (binary files)

[Figure: five objects, numbered 1 to 5]

Determine: “Similarity” distance matrix (distances between every pair)

Applications: Clustering, classification, evolutionary trees of Internet documents, computer programs, chain letters, genomes, languages, texts, music pieces, OCR, ...


Andrey Nikolaevich Kolmogorov (1903-1987, Tambov, Russia)

Measure theory, probability, analysis, intuitionistic logic, cohomology, dynamical systems, hydrodynamics, Kolmogorov complexity


TOOL:

Information Distance (Li, Vitanyi, 96; Bennett, Gacs, Li, Vitanyi, Zurek, 98)

D(x,y) = min { |p| : p(x)=y & p(y)=x }

Here p is a binary program for a universal computer (Lisp, Java, C, universal Turing machine).

Theorem

(i) D(x,y) = max {K(x|y), K(y|x)}, where K(x|y) is the Kolmogorov complexity of x given y, defined as the length of the shortest binary program that outputs x on input y.

(ii) D(x,y) ≤ D’(x,y) for any computable distance D’ satisfying $\sum_y 2^{-D'(x,y)} \le 1$ for every x.

(iii) D(x,y) is a metric.


However:

[Figure: two pairs of objects, x, y and x’, y’]

D(x,y) = D(x’,y’), but x and y are much more similar than x’ and y’.

So, we normalize:

$$d(x,y) = \frac{D(x,y)}{\max\{K(x), K(y)\}}$$

Normalized Information Distance (NID), the “similarity metric”


Properties of NID:

Theorem:

• 0 ≤ d(x,y) ≤ 1

• d(x,y) is a metric: symmetric, triangle inequality, d(x,x)=0

Drawback: NID(x,y) = d(x,y) is noncomputable, since K(.) is!


In Practice:

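Replace NID(x,y) by the Normalized Compression Distance (NCD):

$$\mathrm{NCD}(x,y) = \frac{Z(xy) - \min\{Z(x), Z(y)\}}{\max\{Z(x), Z(y)\}}$$

Here Z(x) is the length (#bits) of the compressed version of x using compressor Z (gzip, bzip2, PPMZ, ...).

This NCD is actually about the same formula as NID, but rewritten using “Z” instead of “K”: since K(x|y) is approximately K(xy) - K(y), the numerator max{K(x|y), K(y|x)} becomes K(xy) - min{K(x), K(y)}.

Li, Badger, Chen, Kwong, Kearney, Zhang 01; Li, Vitanyi 01/02; Li, Chen, Li, Ma, Vitanyi 04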
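The NCD formula can be tried directly with off-the-shelf compressors. The following is a minimal sketch, not the CompLearn implementation: the function names `compressed_size` and `ncd` are made up for this illustration, Python's `zlib` (the DEFLATE method also used by gzip) and `bz2` modules stand in for Z, and lengths are counted in bytes rather than bits, which leaves the ratio unchanged.

```python
import bz2
import zlib


def compressed_size(data: bytes, compressor: str = "zlib") -> int:
    """Return Z(data): the compressed length in bytes under the chosen compressor."""
    if compressor == "zlib":
        return len(zlib.compress(data, 9))
    if compressor == "bz2":
        return len(bz2.compress(data))
    raise ValueError(f"unknown compressor: {compressor}")


def ncd(x: bytes, y: bytes, compressor: str = "zlib") -> float:
    """NCD(x,y) = (Z(xy) - min{Z(x), Z(y)}) / max{Z(x), Z(y)}."""
    zx = compressed_size(x, compressor)
    zy = compressed_size(y, compressor)
    zxy = compressed_size(x + y, compressor)  # xy is the plain concatenation
    return (zxy - min(zx, zy)) / max(zx, zy)
```

With a real compressor the value for identical inputs is close to 0 rather than exactly 0, and values slightly above 1 can occur for unrelated inputs. The inputs should also fit within the compressor's window or block size (about 32 KB of history for `zlib`), otherwise shared content goes unnoticed.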


Family of compression-based similarities

The NCD is actually a family of similarity measures, parametrized with the compressor, e.g., gzip, bzip2, PPMZ, ...

(forget the crippled compressors like compress, awk, ...)


Application: Clustering of Natural Data

Unusual: we don’t know the number of clusters, and we don’t have a criterion to distinguish clusters.

Therefore, we hierarchically cluster to let the data decide these issues naturally.
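To make "letting the data decide" concrete, here is a small sketch of hierarchical clustering driven only by a pairwise distance table. It uses plain average-linkage agglomerative clustering as a stand-in; this is an assumption for illustration and not necessarily the tree-building method behind the figures in this talk. The function name `average_linkage` and the toy distances in the usage note are hypothetical.

```python
from itertools import combinations


def average_linkage(names, dist):
    """Agglomerative clustering; returns the dendrogram as nested tuples.

    names: list of object labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    """
    # Each cluster is (tree, members); tree is a label or a nested tuple.
    clusters = [(name, (name,)) for name in names]

    def cluster_distance(c1, c2):
        # Average pairwise distance between the members of two clusters.
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset((a, b))] for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters; no cluster count is fixed in advance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]
```

For four objects where a and b are at distance 0.1, c and d at 0.2, and every other pair at 0.9, the result is `(("a", "b"), ("c", "d"))`: the two tight pairs are found without being told there are two groups.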


Applications: First One: Phylogeny of Species

Eutherian Orders: Ferungula, Primates, Rodents (Outgroup: Platypus, Wallaroo)

Hasegawa et al. 98 concatenate selected proteins and get different groupings depending on the proteins used.

We use whole mtDNA and approximate K(.) by GenCompress to determine the NCD matrix; we get only one tree.


Who is our closer relative?


Evolutionary Tree of Mammals

Li, Badger, Chen, Kwong, Kearney, Zhang 01; Li, Vitanyi 01/02; Li, Chen, Li, Ma, Vitanyi 04


Embedding NCD matrix in dendrogram (hierarchical clustering) for this large phylogeny (no errors, it seems)

Therian hypothesis versus Marsupionta hypothesis

Mammals:

Eutheria, Metatheria, Prototheria

Which pair is closest?

Cilibrasi, Vitanyi 2005


NCD Matrix 24 Species (mtDNA). Diagonal elements about 0. Distances between primates ca. 0.6.
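A matrix of this kind can be sketched as follows. This is an illustration under stated assumptions, not the setup used for the mtDNA experiment: `zlib` stands in for GenCompress, sizes are in bytes, and the helper names `z` and `ncd_matrix` are hypothetical.

```python
import zlib


def z(data: bytes) -> int:
    """Compressed length in bytes; zlib is an assumed stand-in compressor."""
    return len(zlib.compress(data, 9))


def ncd_matrix(objects):
    """Return (names, matrix) where matrix[i][j] is NCD(names[i], names[j]).

    objects: dict mapping a label to the raw bytes of that object.
    """
    names = list(objects)
    sizes = {name: z(objects[name]) for name in names}
    matrix = []
    for a in names:
        row = []
        for b in names:
            zab = z(objects[a] + objects[b])  # compress the concatenation
            row.append((zab - min(sizes[a], sizes[b])) / max(sizes[a], sizes[b]))
        matrix.append(row)
    return names, matrix
```

The diagonal entries come out near 0 but not exactly 0, because compressing an object followed by a copy of itself still costs a little more than compressing it once.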


Identifying SARS Virus: S(T)=0.988

AvianAdeno1CELO.inp: Fowl adenovirus 1; AvianIB1.inp: Avian infectious bronchitis virus (strain Beaudette US); AvianIB2.inp: Avian infectious bronchitis virus (strain Beaudette CK); BovineAdeno3.inp: Bovine adenovirus 3; DuckAdeno1.inp: Duck adenovirus 1; HumanAdeno40.inp: Human adenovirus type 40; HumanCorona1.inp: Human coronavirus 229E; MeaslesMora.inp: Measles virus strain Moraten; MeaslesSch.inp: Measles virus strain Schwarz; MurineHep11.inp: Murine hepatitis virus strain ML-11; MurineHep2.inp: Murine hepatitis virus strain 2; PRD1.inp: Enterobacteria phage PRD1; RatSialCorona.inp: Rat sialodacryoadenitis coronavirus; SARS.inp: SARS TOR2v120403; SIRV1.inp: Sulfolobus virus SIRV-1; SIRV2.inp: Sulfolobus virus SIRV-2.


Clustering: Phylogeny of 15 languages: Native American, Native African, Native European languages


Applications Everywhere

Genomics and Language Tree just one example; also used with (e.g.):

• MIDI music files (music clustering)

• Plagiarism detection

• Phylogeny of chain letters

• SARS virus classification

• Computer worms and internet traffic (attacks) analysis

• Literature

• OCR

• Astronomy: radio telescope time sequences

• Spam detection

Cilibrasi, Vitanyi, de Wolf, 2003/2004; Cilibrasi, Vitanyi, 2005.

Time sequences (all data bases used in all major data-mining conferences of the last 10 years): superior over all methods in anomaly detection and on heterogeneous data.


Russian Authors (in original Cyrillic), S(T)=0.949

I.S. Turgenev, 1818--1883 [Fathers and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky 1821--1881 [Crime and Punishment, The Gambler, The Idiot, Poor Folk]; L.N. Tolstoy 1828--1910 [Anna Karenina, The Cossacks, Youth, War and Peace]; N.V. Gogol 1809--1852 [Dead Souls, Taras Bulba, The Mysterious Portrait, How the Two Ivans Quarrelled]; M. Bulgakov 1891--1940 [The Master and Margarita, The Fateful Eggs, The Heart of a Dog]


Same Russian Texts in English Translation; S(T)=0.953

Files start to cluster according to translators!

I.S. Turgenev, 1818--1883 [Fathers and Sons (R. Hare), Rudin (Garnett, C. Black), On the Eve (Garnett, C. Black), A House of Gentlefolk (Garnett, C. Black)]; F. Dostoyevsky 1821--1881 [Crime and Punishment (Garnett, C. Black), The Gambler (C.J. Hogarth), The Idiot (E. Martin), Poor Folk (C.J. Hogarth)]; L.N. Tolstoy 1828--1910 [Anna Karenina (Garnett, C. Black), The Cossacks (L. and M. Aylmer), Youth (C.J. Hogarth), War and Peace (L. and M. Aylmer)]; N.V. Gogol 1809--1852 [Dead Souls (C.J. Hogarth), Taras Bulba (≈ G. Tolstoy, 1860, B.C. Baskerville), The Mysterious Portrait + How the Two Ivans Quarrelled (≈ I.F. Hapgood)]; M. Bulgakov 1891--1940 [The Master and Margarita (R. Pevear, L. Volokhonsky), The Fateful Eggs (K. Gook-Horujy), The Heart of a Dog (M. Glenny)]


12 Classical Pieces (Bach, Debussy, Chopin): no errors


Optical Character Recognition: Data

Handwritten digits from the NIST database


Optical Character Recognition: Clustering, S(T)=0.901


Heterogeneous Data; clustering perfect with S(T)=0.95.

Clustering of radically different data.

No features known.

Only our parameter-free method can do this!!


You can use it too!

CompLearn Toolkit: http://www.complearn.org

But what if we do not have the object as a file?

“x” and “y” are literal objects (files). What about abstract objects like “home”, “red”, “Socrates”, “chair”, ...?

Or names for literal objects?


Non-Literal Objects

Googling for Meaning

Google distribution: g(x) = (Google page count of “x”) / (# pages indexed)

Cilibrasi, Vitanyi, 2004/2007.


Google Compressor

Google code length:

G(x) = log (1 / g(x))

This is the Shannon-Fano code length that has minimum expected code word length w.r.t. g(x).

Hence we can view Google as a Google Compressor.
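As a small illustration of the code length, the sketch below computes G(x) from a page count and the number of indexed pages. The function name `google_code_length` is hypothetical, and the logarithm is taken to base 2 so that the result reads as a number of bits; the slide leaves the base unspecified, so that choice is an assumption.

```python
import math


def google_code_length(page_count: int, pages_indexed: int) -> float:
    """G(x) = log2(1 / g(x)), with g(x) = page_count / pages_indexed."""
    if page_count <= 0 or pages_indexed <= 0:
        raise ValueError("counts must be positive")
    # log2(1 / g(x)) equals log2(pages_indexed / page_count).
    return math.log2(pages_indexed / page_count)
```

A term that appears on one page in eight gets a code length of 3 bits; rarer terms get longer codes, exactly as a compressor assigns longer code words to less frequent symbols.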


Normalized Google Distance (NGD)

$$\mathrm{NGD}(x,y) = \frac{G(x,y) - \min\{G(x), G(y)\}}{\max\{G(x), G(y)\}}$$

Same formula as NCD, using Z = G (Google compressor)

Use the Google counts and the CompLearn Toolkit to apply NGD.


Example

“horse”: #hits = 46,700,000

“rider”: #hits = 12,200,000

“horse” “rider”: #hits = 2,630,000

#pages indexed: 8,058,044,651

NGD(horse,rider) = 0.443

Theoretically + empirically: scale-invariant
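The figure above can be checked by plugging the four counts into the NGD formula with G(x) = log(1/g(x)). This is a minimal sketch; the function name `ngd` and its argument names are chosen here for illustration. The base of the logarithm cancels in the ratio, so the natural logarithm is used.

```python
import math


def ngd(hits_x: int, hits_y: int, hits_xy: int, pages_indexed: int) -> float:
    """NGD(x,y) = (G(x,y) - min{G(x), G(y)}) / max{G(x), G(y)}."""
    # G(.) = log(pages_indexed / hits), the code length of each term or pair.
    gx = math.log(pages_indexed / hits_x)
    gy = math.log(pages_indexed / hits_y)
    gxy = math.log(pages_indexed / hits_xy)
    return (gxy - min(gx, gy)) / max(gx, gy)


# Counts quoted in the example for "horse", "rider", and both terms together.
value = ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651)
print(round(value, 3))  # 0.443
```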


Colors and Numbers: The Names! Hierarchical Clustering

[Figure: dendrogram with one cluster labelled “colors” and one labelled “numbers”]


Hierarchical Clustering of 17th Century Dutch Painters. Paintings given by name, without painter’s name.

Hendrickje slapend, Portrait of Maria Trip, Portrait of Johannes Wtenbogaert, The Stone Bridge, The Prophetess Anna, Leiden Baker Arend Oostwaert, Keyzerswaert, Two Men Playing Backgammon, Woman at her Toilet, Prince's Day, The Merry Family, Maria Rey, Consul Titus Manlius Torquatus, Swartenhont, Venus and Adonis


Mathematicians


H5N1 (bird flu) virus mutations


Next: Binary Classification

Here we use the NGD for a Support Vector Machine (SVM) binary classification learner (we could also use a neural network).

Setup: anchor terms, positive/negative examples, test set, accuracy
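One plausible reading of this setup is that each example term is turned into a vector of NGD values against the anchor terms, and those vectors are what the SVM is trained and tested on. The sketch below shows only that feature step, under stated assumptions: the names `ngd` and `ngd_features` and the two lookup tables are hypothetical, and the classifier itself is left out.

```python
import math


def ngd(hits_x, hits_y, hits_xy, pages_indexed):
    """Normalized Google Distance from raw page counts."""
    gx = math.log(pages_indexed / hits_x)
    gy = math.log(pages_indexed / hits_y)
    gxy = math.log(pages_indexed / hits_xy)
    return (gxy - min(gx, gy)) / max(gx, gy)


def ngd_features(term, anchors, single_hits, pair_hits, pages_indexed):
    """Return [NGD(term, a) for each anchor a], one feature per anchor term.

    single_hits: dict mapping a term to its page count (assumed lookup table).
    pair_hits:   dict mapping frozenset({t1, t2}) to the joint page count.
    """
    return [
        ngd(single_hits[term], single_hits[a],
            pair_hits[frozenset((term, a))], pages_indexed)
        for a in anchors
    ]
```

With the page counts quoted elsewhere in this talk for “horse” and “rider”, the term “horse” and the single anchor “rider” give a one-element vector of about 0.443. Positive and negative examples would each be mapped this way before being handed to an SVM library of choice.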


Using NGD in SVM (Support Vector Machines) to learn concepts (binary classification)

Example:

Emergencies


Example: Classifying Prime Numbers

Actually, 91 is not a prime. So accuracy is 17/19 = 89.47%.


Example: Electrical Terms


Example: Religious Terms


Comparison with WordNet Semantics

http://www.cogsci.princeton.edu/~wn

NGD-SVM classifier on 100 randomly selected WordNet categories

Randomly selected positive, negative and test sets

Histogram gives accuracy with respect to the knowledge entered by PhD experts in the WordNet database

Mean accuracy is 0.8725

Standard deviation is 0.1169

Accuracy almost always > 75%, automatically


Translation Using NGD

Problem:

Translation:


Selected Bibliography

D. Benedetto, E. Caglioti, and V. Loreto. Language trees and zipping, Physical Review Letters, 88:4(2002) 048702.

C.H. Bennett, P. Gacs, M. Li, P.M.B. Vitanyi, and W. Zurek. Information Distance, IEEE Transactions on Information Theory, 44:4(1998), 1407--1423.

C.H. Bennett, M. Li, B. Ma, Chain letters and evolutionary histories, Scientific American, June 2003, 76--81.

X. Chen, B. Francia, M. Li, B. McKinnon, A. Seker, Shared information and program plagiarism detection, IEEE Trans. Inform. Th., 50:7(2004), 1545--1551.

R. Cilibrasi, The CompLearn Toolkit, 2003, http://complearn.sourceforge.net/ .

R. Cilibrasi, P.M.B. Vitanyi, R. de Wolf, Algorithmic clustering of music based on string compression, Computer Music Journal, 28:4(2004), 49--67.

R. Cilibrasi, P.M.B. Vitanyi, Clustering by compression, IEEE Trans. Inform. Th., 51:4(2005), 1523--1545.

R. Cilibrasi, P.M.B. Vitanyi, Automatic meaning discovery using Google, http://xxx.lanl.gov/abs/cs.CL/0412098 (2004).

E. Keogh, S. Lonardi, and C.A. Ratanamahatana, Toward parameter-free data mining, In: Proc. 10th ACM SIGKDD Intn'l Conf. Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22--25, 2004, 206--215.

M. Li, J.H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang. An information-based sequence distance and its application to whole mitochondrial genome phylogeny, Bioinformatics, 17:2(2001), 149--154.

M. Li and P.M.B. Vitanyi, Reversibility and adiabatic computation: trading time and space for energy, Proc. Royal Society of London, Series A, 452(1996), 769--789.

M. Li and P.M.B. Vitanyi. Algorithmic Complexity, pp. 376--382 in: International Encyclopedia of the Social & Behavioral Sciences, N.J. Smelser and P.B. Baltes, Eds., Pergamon, Oxford, 2001/2002.

M. Li, X. Chen, X. Li, B. Ma, P.M.B. Vitanyi. The similarity metric, IEEE Trans. Inform. Th., 50:12(2004), 3250--3264.

M. Li and P.M.B. Vitanyi. An Introduction to Kolmogorov Complexity and its Applications, Springer-Verlag, New York, 2nd Edition, 1997.

A. Londei, V. Loreto, M.O. Belardinelli, Music style and authorship categorization by informative compressors, Proc. 5th Triennial Conference of the European Society for the Cognitive Sciences of Music (ESCOM), September 8--13, 2003, Hannover, Germany, pp. 200--203.

S. Wehner, Analyzing network traffic and worms using compression, Manuscript, CWI, 2004. Partially available at http://homepages.cwi.nl/~wehner/worms/