
Universal Similarity

Paul Vitanyi, CWI and University of Amsterdam

The Problem:

Given: literal objects (binary files), say objects 1, 2, 3, 4, 5.

Determine: a “similarity” distance matrix (the distance between every pair).

Applications: clustering, classification, and evolutionary trees of Internet documents, computer programs, chain letters, genomes, languages, texts, music pieces, OCR, ...

Andrey Nikolaevich Kolmogorov (1903--1987, Tambov, Russia)

Measure theory, probability, analysis, intuitionistic logic, cohomology, dynamical systems, hydrodynamics, Kolmogorov complexity.

TOOL:

Information Distance (Li, Vitanyi 1996; Bennett, Gacs, Li, Vitanyi, Zurek 1998)

D(x,y) = min { |p| : p(x)=y and p(y)=x }

where p is a binary program for a universal computer (Lisp, Java, C, a universal Turing machine) that computes y from x and x from y.

Theorem: (i) D(x,y) = max{ K(x|y), K(y|x) }, where K(x|y) is the Kolmogorov complexity of x given y, defined as the length of the shortest binary program that outputs x on input y.

(ii) D(x,y) ≤ D'(x,y) (up to an additive constant) for every computable distance D' satisfying ∑_y 2^(-D'(x,y)) ≤ 1 for every x; that is, D minorizes every such distance and is in this sense universal.

(iii) D(x,y) is a metric.

However, D(x,y) = D(x',y') can hold even though x and y are much more similar to each other than x' and y' are: the absolute distance does not take the complexity (size) of the objects into account.

So, we normalize:

d(x,y) = D(x,y) / max{ K(x), K(y) }
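A hypothetical numeric illustration (the numbers are not from the slides): if K(x) = K(y) = 1,000,000 bits and D(x,y) = 1,000 bits, then d(x,y) = 1,000 / 1,000,000 = 0.001, so x and y are nearly identical relative to their size; if K(x') = K(y') = 2,000 bits with the same D(x',y') = 1,000 bits, then d(x',y') = 1,000 / 2,000 = 0.5, reflecting that about half the information must change to turn one object into the other.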

Normalized Information Distance (NID): The “Similarity Metric”

Properties of the NID:

Theorem:
• 0 ≤ d(x,y) ≤ 1
• d(x,y) is a metric: symmetric, satisfies the triangle inequality, and d(x,x) = 0.

Drawback: NID(x,y) = d(x,y) is noncomputable, since K(·) is!

In practice:

Replace NID(x,y) by the Normalized Compression Distance (NCD):

NCD(x,y) = ( Z(xy) − min{ Z(x), Z(y) } ) / max{ Z(x), Z(y) }

where Z(x) is the length (in bits) of the compressed version of x using a real-world compressor Z (gzip, bzip2, PPMZ, ...). The NCD is essentially the same formula as the NID, rewritten with Z in place of K.

Li, Badger, Chen, Kwong, Kearney, Zhang 2001; Li, Vitanyi 2001/02; Li, Chen, Li, Ma, Vitanyi 2004.

Family of compression-based similarities

The NCD is actually a family of similarity measures, parametrized by the compressor, e.g., gzip, bzip2, PPMZ, ... (forget crippled compressors like compress, awk, ...).
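A minimal sketch of computing the NCD in Python, using the standard bz2 module as the compressor Z (any of the real compressors named above would do); the helper names and sample strings are illustrative:

import bz2

def Z(data: bytes) -> int:
    # Length of the compressed version of `data` (here via bzip2).
    return len(bz2.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    # NCD(x,y) = (Z(xy) - min{Z(x), Z(y)}) / max{Z(x), Z(y)}
    zx, zy, zxy = Z(x), Z(y), Z(x + y)
    return (zxy - min(zx, zy)) / max(zx, zy)

if __name__ == "__main__":
    x = b"the quick brown fox jumps over the lazy dog " * 100
    y = b"the quick brown fox jumped over the lazy dogs " * 100
    print(round(ncd(x, y), 3))  # small value: the two strings are highly similar

Since the NCD is a ratio of compressed lengths, it makes no difference whether Z is measured in bits or bytes.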

Application: Clustering of Natural Data

Unusual situation: we don't know the number of clusters, and we have no criterion to distinguish clusters. Therefore we cluster hierarchically, letting the data decide these issues naturally.
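The slides build their trees with the authors' quartet-tree method in the CompLearn toolkit; as a rough stand-in (not the authors' method), here is a sketch of standard agglomerative hierarchical clustering over an NCD matrix with SciPy, reusing a bzip2-based NCD as above:

import bz2
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram

def ncd(x: bytes, y: bytes) -> float:
    # NCD with bzip2 as the compressor Z (see the earlier sketch).
    zx, zy, zxy = (len(bz2.compress(s)) for s in (x, y, x + y))
    return (zxy - min(zx, zy)) / max(zx, zy)

def ncd_matrix(objects):
    # Symmetric matrix of pairwise NCDs; the diagonal stays (near) zero.
    n = len(objects)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            M[i, j] = M[j, i] = ncd(objects[i], objects[j])
    return M

def hierarchical_tree(objects, labels):
    # Condense the square matrix and run off-the-shelf agglomerative clustering.
    M = ncd_matrix(objects)
    tree = linkage(squareform(M, checks=False), method="average")
    return dendrogram(tree, labels=labels, no_plot=True)

The quartet-tree heuristic used in the papers generally gives better trees for small sets of objects; the point here is only that an NCD matrix plugs directly into any off-the-shelf hierarchical clustering routine.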

Applications. First one: Phylogeny of Species

Eutherian orders: Ferungulates, Primates, Rodents (outgroup: Platypus, Wallaroo).

Hasegawa et al. 1998 concatenate selected proteins and get different groupings depending on which proteins are used.

We use whole mtDNA, approximate K(·) by GenCompress to determine the NCD matrix, and get only one tree.

Who is our closer relative?

Evolutionary Tree of Mammals: Li, Badger, Chen, Kwong, Kearney, Zhang 2001; Li, Vitanyi 2001/02; Li, Chen, Li, Ma, Vitanyi 2004.

Embedding the NCD matrix in a dendrogram (hierarchical clustering) for this large phylogeny (no errors, it seems).

Theria hypothesis versus Marsupionta hypothesis

Mammals: Eutheria, Metatheria, Prototheria.

Which pair is closest?

Cilibrasi, Vitanyi 2005

NCD matrix for 24 species (mtDNA). Diagonal elements are about 0; distances between primates are about 0.6.

Identifying SARS Virus: S(T)=0.988

AvianAdeno1CELO.inp: Fowl adenovirus 1; AvianIB1.inp: Avian infectious bronchitis virus (strain Beaudette US); AvianIB2.inp: Avian infectious bronchitis virus (strain Beaudette CK); BovineAdeno3.inp: Bovine adenovirus 3; DuckAdeno1.inp: Duck adenovirus 1; HumanAdeno40.inp: Human adenovirus type 40; HumanCorona1.inp: Human coronavirus 229E; MeaslesMora.inp: Measles virus strain Moraten; MeaslesSch.inp: Measles virus strain Schwarz; MurineHep11.inp: Murine hepatitis virus strain ML-11; MurineHep2.inp: Murine hepatitis virus strain 2; PRD1.inp: Enterobacteria phage PRD1; RatSialCorona.inp: Rat sialodacryoadenitis coronavirus; SARS.inp: SARS TOR2v120403; SIRV1.inp: Sulfolobus virus SIRV-1; SIRV2.inp: Sulfolobus virus SIRV-2.

Clustering: Phylogeny of 15 languages: Native American, Native African, and Native European languages.

Applications Everywhere

The genomics and language trees are just two examples; the method is also used for (e.g., Cilibrasi, Vitanyi, de Wolf 2003/2004; Cilibrasi, Vitanyi 2005):
• MIDI music files (music clustering)
• Plagiarism detection
• Phylogeny of chain letters
• SARS virus classification
• Computer worms and internet traffic (attack) analysis
• Literature
• OCR
• Astronomy: radio telescope time sequences
• Spam detection

Time sequences (all databases used in all major data-mining conferences of the last 10 years): superior to all other methods for anomaly detection and heterogeneous data.

Russian Authors (in original Cyrillic); S(T)=0.949

I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky 1821--1881 [Crime and Punishment, The Gambler, The Idiot; Poor Folk]; L.N. Tolstoy 1828--1910 [Anna Karenina, The Cossacks, Youth, War and Peace]; N.V. Gogol 1809--1852 [Dead Souls, Taras Bulba, The Mysterious Portrait, How the Two Ivans Quarrelled]; M. Bulgakov 1891--1940 [The Master and Margarita, The Fateful Eggs, The Heart of a Dog]

Same Russian Texts in English Translation; S(T)=0.953

Files start to cluster according to translators! I.S. Turgenev, 1818--1883 [Father and Sons (R. Hare), Rudin (Garnett, C. Black), On the Eve (Garnett, C. Black), A House of Gentlefolk (Garnett, C. Black)]; F. Dostoyevsky 1821--1881 [Crime and Punishment (Garnett, C. Black), The Gambler (C.J. Hogarth), The Idiot (E. Martin); Poor Folk (C.J. Hogarth)]; L.N. Tolstoy 1828--1910 [Anna Karenina (Garnett, C. Black), The Cossacks (L. and M. Aylmer), Youth (C.J. Hogarth), War and Peace (L. and M. Aylmer)]; N.V. Gogol 1809--1852 [Dead Souls (C.J. Hogarth), Taras Bulba (≈ G. Tolstoy, 1860, B.C. Baskerville), The Mysterious Portrait + How the Two Ivans Quarrelled (≈ I.F. Hapgood)]; M. Bulgakov 1891--1940 [The Master and Margarita (R. Pevear, L. Volokhonsky), The Fateful Eggs (K. Gook-Horujy), The Heart of a Dog (M. Glenny)]

12 Classical Pieces (Bach, Debussy, Chopin): no errors.

Optical Character Recognition: Data. Handwritten digits from the NIST database.

Optical Character Recognition: Clustering. S(T)=0.901.

Heterogeneous data; clustering perfect, with S(T)=0.95.

Clustering of radically different data.

No features known.

Only our parameter-free method can do this!!

You can use it too!

CompLearn Toolkit: http://www.complearn.org

“x” and “y” are literal objects (files); What about abstract objects like “home”, “red”, “Socrates”, “chair”, ….?

Or names for literal objects?

But what if we do not have the object as a file?

Non-Literal Objects

Googling for Meaning

Google distribution: g(x) = (Google page count for “x”) / (number of pages indexed).

Cilibrasi, Vitanyi, 2004/2007.

Google Compressor

Google code length:

G(x) = log( 1 / g(x) )

This is the Shannon-Fano code length that has minimum expected code word length w.r.t. g(x).

Hence we can view Google as a Google Compressor.

Normalized Google Distance (NGD)

NGD(x,y) = ( G(x,y) − min{ G(x), G(y) } ) / max{ G(x), G(y) }, i.e., the same formula as the NCD, with Z = G (the Google compressor).

Use the Google counts and the CompLearn Toolkit to apply NGD.

Example

“horse”: #hits = 46,700,000; “rider”: #hits = 12,200,000; “horse” “rider”: #hits = 2,630,000; #pages indexed: 8,058,044,651.

NGD(horse, rider) = 0.443. Theoretically and empirically, the NGD is scale-invariant.
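A small sketch that reproduces this number from the page counts above; the NGD is rewritten in terms of counts via G(x) = log N − log f(x), where N is the number of pages indexed (function and variable names are illustrative):

import math

def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    # NGD(x,y) = (G(x,y) - min{G(x), G(y)}) / max{G(x), G(y)} with G(x) = log(N / f(x)).
    # Expanding G cancels most of the log N terms and gives the formula below.
    lx, ly, lxy, ln_n = math.log(fx), math.log(fy), math.log(fxy), math.log(n)
    return (max(lx, ly) - lxy) / (ln_n - min(lx, ly))

# Page counts from the slide: "horse", "rider", "horse rider", pages indexed.
print(round(ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651), 3))  # 0.443

The base of the logarithm does not matter, since both numerator and denominator are differences of logarithms.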

Colors and Numbers (the names!): Hierarchical Clustering

colors

numbers

Hierarchical clustering of 17th-century Dutch painters; paintings are given by name, without the painter's name.

Hendrickje slapend, Portrait of Maria Trip, Portrait of Johannes Wtenbogaert, The Stone Bridge, The Prophetess Anna, Leiden Baker Arend Oostwaert, Keyzerswaert, Two Men Playing Backgammon, Woman at her Toilet, Prince's Day, The Merry Family, Maria Rey, Consul Titus Manlius Torquatus, Swartenhont, Venus and Adonis

Mathematicians

H5N1 (bird flu) virus mutations

Next: Binary Classification

Here we use the NGD for a Support Vector Machine (SVM) binary classification learner (we could also use a neural network).

Setup: anchor terms, positive/negative examples, test set, accuracy.

Using the NGD in an SVM (Support Vector Machine) to learn concepts (binary classification); a sketch follows below.
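A rough sketch of this setup, assuming an ngd(x, y) helper like the one above backed by real page counts (not reproduced here) and scikit-learn's SVC; the anchor terms and training words are illustrative, not the ones used in the slides:

from sklearn.svm import SVC

def features(term, anchors, ngd):
    # Represent a term by its vector of NGDs to a fixed list of anchor terms.
    return [ngd(term, a) for a in anchors]

def train_concept_classifier(positives, negatives, anchors, ngd):
    # Binary SVM: +1 for terms that belong to the concept, -1 for terms that do not.
    X = [features(t, anchors, ngd) for t in positives + negatives]
    y = [1] * len(positives) + [-1] * len(negatives)
    return SVC(kernel="rbf").fit(X, y)

# Illustrative usage, e.g. for an "electrical terms" concept:
# clf = train_concept_classifier(
#     positives=["voltage", "current", "resistor"],
#     negatives=["banana", "sonnet", "glacier"],
#     anchors=["electricity", "energy", "wire", "poem"],
#     ngd=ngd_from_search_counts,  # hypothetical helper returning NGD values
# )
# clf.predict([features("transformer", anchors, ngd_from_search_counts)])

Accuracy on the test set is then just the fraction of test terms the SVM labels correctly.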

Example:

Emergencies

Example: Classifying Prime Numbers

Actually, 91 is not a prime, so the accuracy is 17/19 = 89.47%.

Example: Electrical Terms

Example: Religious Terms

Comparison with WordNet semantics: http://www.cogsci.princeton.edu/~wn

NGD-SVM classifier on 100 randomly selected WordNet categories.

Randomly selected positive, negative, and test sets.

The histogram gives the accuracy with respect to the knowledge entered into the WordNet database by PhD experts.

Mean accuracy is 0.8725; standard deviation is 0.1169.

Accuracy is almost always > 75%, automatically.

Translation Using NGD

Problem:

Translation:

Selected Bibliography

D. Benedetto, E. Caglioti, and V. Loreto. Language trees and zipping. Physical Review Letters, 88:4(2002), 048702.
C.H. Bennett, P. Gacs, M. Li, P.M.B. Vitanyi, and W. Zurek. Information distance. IEEE Transactions on Information Theory, 44:4(1998), 1407--1423.
C.H. Bennett, M. Li, and B. Ma. Chain letters and evolutionary histories. Scientific American, June 2003, 76--81.
X. Chen, B. Francia, M. Li, B. McKinnon, and A. Seker. Shared information and program plagiarism detection. IEEE Trans. Inform. Th., 50:7(2004), 1545--1551.
R. Cilibrasi. The CompLearn Toolkit, 2003. http://complearn.sourceforge.net/
R. Cilibrasi, P.M.B. Vitanyi, and R. de Wolf. Algorithmic clustering of music based on string compression. Computer Music Journal, 28:4(2004), 49--67.
R. Cilibrasi and P.M.B. Vitanyi. Clustering by compression. IEEE Trans. Inform. Th., 51:4(2005), 1523--1545.
R. Cilibrasi and P.M.B. Vitanyi. Automatic meaning discovery using Google, 2004. http://xxx.lanl.gov/abs/cs.CL/0412098
E. Keogh, S. Lonardi, and C.A. Ratanamahatana. Toward parameter-free data mining. Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22--25, 2004, 206--215.
M. Li, J.H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 17:2(2001), 149--154.
M. Li and P.M.B. Vitanyi. Reversibility and adiabatic computation: trading time and space for energy. Proc. Royal Society of London, Series A, 452(1996), 769--789.
M. Li and P.M.B. Vitanyi. Algorithmic complexity. Pp. 376--382 in: International Encyclopedia of the Social & Behavioral Sciences, N.J. Smelser and P.B. Baltes, Eds., Pergamon, Oxford, 2001/2002.
M. Li, X. Chen, X. Li, B. Ma, and P.M.B. Vitanyi. The similarity metric. IEEE Trans. Inform. Th., 50:12(2004), 3250--3264.
M. Li and P.M.B. Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, New York, 2nd edition, 1997.
A. Londei, V. Loreto, and M.O. Belardinelli. Music style and authorship categorization by informative compressors. Proc. 5th Triennial Conference of the European Society for the Cognitive Sciences of Music (ESCOM), September 8--13, 2003, Hannover, Germany, pp. 200--203.
S. Wehner. Analyzing network traffic and worms using compression. Manuscript, CWI, 2004. Partially available at http://homepages.cwi.nl/~wehner/worms/
