Universal Similarity
Paul Vitanyi, CWI and University of Amsterdam
The Problem:
Given: Literal objects 1, 2, 3, 4, 5, … (binary files)
Determine: “Similarity” distance matrix (distances between every pair)
Applications: Clustering, classification, evolutionary trees of Internet documents, computer programs, chain letters, genomes, languages, texts, music pieces, OCR, …
Andrey Nikolaevich Kolmogorov (1903--1987, Tambov, Russia)
Measure Theory, Probability, Analysis, Intuitionistic Logic, Cohomology, Dynamical Systems, Hydrodynamics, Kolmogorov Complexity
TOOL:
Information Distance (Li, Vitanyi, 96; Bennett,Gacs,Li,Vitanyi,Zurek, 98)
D(x,y) = min { |p| : p(x) = y and p(y) = x }
where p is a binary program for a universal computer (Lisp, Java, C, universal Turing machine)
Theorem: (i) D(x,y) = max { K(x|y), K(y|x) }, where K(x|y), the Kolmogorov complexity of x given y, is defined as the length of the shortest binary program that outputs x on input y.
(ii) D(x,y) ≤ D’(x,y) for any computable distance D’ satisfying ∑_y 2^(−D’(x,y)) ≤ 1 for every x.
(iii) D(x,y) is a metric.
However: there are pairs x, y and x’, y’ with D(x,y) = D(x’,y’) while x and y are much more similar (relative to their sizes) than x’ and y’.
So, we normalize:
d(x,y) = D(x,y) / max { K(x), K(y) }
Normalized Information Distance (NID) The “Similarity metric”
Properties NID:
Theorem:
• 0 ≤ d(x,y) ≤ 1
• d(x,y) is a metric: symmetric, triangle inequality, d(x,x) = 0
Drawback: NID(x,y) = d(x,y) is noncomputable, since K(·) is!
In Practice:
Replace NID(x,y) by the Normalized Compression Distance (NCD):
NCD(x,y) = ( Z(xy) − min { Z(x), Z(y) } ) / max { Z(x), Z(y) }
where Z(x) is the length (# bits) of the compressed version of x using compressor Z (gzip, bzip2, PPMZ, …).
This NCD is essentially the same formula as the NID, but rewritten using Z instead of K.
Li, Badger, Chen, Kwong, Kearney, Zhang ’01; Li, Vitanyi ’01/’02; Li, Chen, Li, Ma, Vitanyi ’04
Family of compression-based similarities
The NCD is actually a family of similarity measures, parametrized by the compressor, e.g., gzip, bzip2, PPMZ, ...
(forget the crippled compressors like compress, awk, ...)
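The NCD formula above can be tried directly with a real compressor from Python's standard library. A minimal sketch using bzip2 as Z; the example byte strings are made up for illustration:

```python
# Sketch of the NCD with bzip2 standing in for the compressor Z.
import bz2

def ncd(x: bytes, y: bytes) -> float:
    """NCD(x,y) = (Z(xy) - min{Z(x),Z(y)}) / max{Z(x),Z(y)}."""
    zx = len(bz2.compress(x))
    zy = len(bz2.compress(y))
    zxy = len(bz2.compress(x + y))
    return (zxy - min(zx, zy)) / max(zx, zy)

a = b"the quick brown fox jumps over the lazy dog " * 40
b_near = b"the quick brown fox jumps over the lazy dog " * 39 + b"cat "
c = bytes(range(256)) * 8  # unrelated byte pattern

# Near-duplicates compress well together, so their NCD is small;
# unrelated inputs give a noticeably larger distance.
print(ncd(a, b_near), ncd(a, c))
```

With a real compressor the result can fall slightly outside [0, 1]; only the relative ordering of distances matters for clustering.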
Application: Clustering of Natural Data
Unusual:
• We don’t know the number of clusters
• We don’t have a criterion to distinguish clusters
Therefore, we hierarchically cluster to let the data decide these issues naturally.
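As a sketch of this idea: the talk's CompLearn tool builds quartet trees, but even a plain average-linkage agglomerative pass over a distance matrix shows how the matrix alone, with no features and no preset cluster count, determines the grouping. The labels and NCD values below are hypothetical:

```python
# Average-linkage agglomerative clustering over a distance matrix,
# in pure Python. Stand-in for the quartet-tree method in the talk.

def cluster(labels, dist):
    """labels: list of names; dist: dict mapping (i, j), i < j, to a distance."""
    clusters = [(name, [i]) for i, name in enumerate(labels)]
    while len(clusters) > 1:
        best = None  # (average distance, index a, index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ia, ib = clusters[a][1], clusters[b][1]
                d = sum(dist[min(i, j), max(i, j)] for i in ia for j in ib)
                d /= len(ia) * len(ib)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = ((clusters[a][0], clusters[b][0]), clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0][0]  # nested tuples = the dendrogram

labels = ["human", "chimp", "platypus"]
dist = {(0, 1): 0.3, (0, 2): 0.9, (1, 2): 0.9}  # hypothetical NCD values
tree = cluster(labels, dist)
print(tree)  # the two close species merge first, the outgroup joins last
```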
Applications:First One: Phylogeny of Species
Eutherian Orders: Ferungula, Primates, Rodents (Outgroup: Platypus, Wallaroo)
Hasegawa et al. ’98 concatenate selected proteins and get different groupings depending on the proteins used.
We use whole mtDNA and approximate K(·) by GenCompress to determine the NCD matrix; we get only one tree.
Who is our closer relative?
Evolutionary Tree of Mammals: Li, Badger, Chen, Kwong, Kearney, Zhang ’01; Li, Vitanyi ’01/’02; Li, Chen, Li, Ma, Vitanyi ’04
Embedding the NCD matrix in a dendrogram (hierarchical clustering) for this large phylogeny (no errors, it seems)
Therian hypothesis versus Marsupionta hypothesis
Mammals:
Eutheria Metatheria Prototheria
Which pair is closest?
Cilibrasi, Vitanyi 2005
NCD matrix, 24 species (mtDNA). Diagonal elements about 0; distances between primates ca. 0.6.
Identifying SARS Virus: S(T)=0.988
AvianAdeno1CELO.inp: Fowl adenovirus 1; AvianIB1.inp: Avian infectious bronchitis virus (strain Beaudette US); AvianIB2.inp: Avian infectious bronchitis virus (strain Beaudette CK); BovineAdeno3.inp: Bovine adenovirus 3; DuckAdeno1.inp: Duck adenovirus 1; HumanAdeno40.inp: Human adenovirus type 40; HumanCorona1.inp: Human coronavirus 229E; MeaslesMora.inp: Measles virus strain Moraten; MeaslesSch.inp: Measles virus strain Schwarz; MurineHep11.inp: Murine hepatitis virus strain ML-11; MurineHep2.inp: Murine hepatitis virus strain 2; PRD1.inp: Enterobacteria phage PRD1; RatSialCorona.inp: Rat sialodacryoadenitis coronavirus; SARS.inp: SARS TOR2v120403; SIRV1.inp: Sulfolobus virus SIRV-1; SIRV2.inp: Sulfolobus virus SIRV-2.
Clustering: Phylogeny of 15 Languages: Native American, Native African, Native European Languages
Applications Everywhere
Genomics and the language tree are just one example; also used with (e.g.) Cilibrasi, Vitanyi, de Wolf, 2003/2004; Cilibrasi, Vitanyi, 2005:
• MIDI music files (music clustering)
• Plagiarism detection
• Phylogeny of chain letters
• SARS virus classification
• Computer worms and internet traffic (attacks) analysis
• Literature
• OCR
• Astronomy: radio telescope time sequences
• Spam detection
Time sequences (all databases used in all major data-mining conferences of the last 10 years): superior over all methods in anomaly detection and heterogeneous data.
Russian Authors (in original Cyrillic); S(T)=0.949
I.S. Turgenev, 1818--1883 [Fathers and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky, 1821--1881 [Crime and Punishment, The Gambler, The Idiot, Poor Folk]; L.N. Tolstoy, 1828--1910 [Anna Karenina, The Cossacks, Youth, War and Peace]; N.V. Gogol, 1809--1852 [Dead Souls, Taras Bulba, The Mysterious Portrait, How the Two Ivans Quarrelled]; M. Bulgakov, 1891--1940 [The Master and Margarita, The Fateful Eggs, The Heart of a Dog]
Same Russian Texts in English Translation; S(T)=0.953
Files start to cluster according to translators!
I.S. Turgenev, 1818--1883 [Fathers and Sons (R. Hare), Rudin (Garnett, C. Black), On the Eve (Garnett, C. Black), A House of Gentlefolk (Garnett, C. Black)]; F. Dostoyevsky, 1821--1881 [Crime and Punishment (Garnett, C. Black), The Gambler (C.J. Hogarth), The Idiot (E. Martin), Poor Folk (C.J. Hogarth)]; L.N. Tolstoy, 1828--1910 [Anna Karenina (Garnett, C. Black), The Cossacks (L. and M. Aylmer), Youth (C.J. Hogarth), War and Peace (L. and M. Aylmer)]; N.V. Gogol, 1809--1852 [Dead Souls (C.J. Hogarth), Taras Bulba (≈ G. Tolstoy, 1860, B.C. Baskerville), The Mysterious Portrait + How the Two Ivans Quarrelled (≈ I.F. Hapgood)]; M. Bulgakov, 1891--1940 [The Master and Margarita (R. Pevear, L. Volokhonsky), The Fateful Eggs (K. Gook-Horujy), The Heart of a Dog (M. Glenny)]
12 Classical Pieces (Bach, Debussy, Chopin) -- no errors
Optical Character Recognition: Data. Handwritten digits from the NIST database.
Optical Character Recognition: Clustering; S(T)=0.901
Heterogeneous data; clustering perfect with S(T)=0.95.
Clustering of radically different data.
No features known.
Only our parameter-free method can do this!!
You can use it too!
CompLearn Toolkit: http://www.complearn.org
“x” and “y” are literal objects (files). What about abstract objects like “home”, “red”, “Socrates”, “chair”, …?
Or names for literal objects?
But what if we do not have the object as a file?
Non-Literal Objects
Googling for Meaning
Google distribution: g(x) = (Google page count for “x”) / (# pages indexed)
Cilibrasi, Vitanyi, 2004/2007.
Google Compressor
Google code length:
G(x) = log( 1 / g(x) )
This is the Shannon-Fano code length that has minimum expected code word length w.r.t. g(x).
Hence we can view Google as a Google Compressor.
Normalized Google Distance (NGD)
NGD(x,y) = ( G(x,y) − min { G(x), G(y) } ) / max { G(x), G(y) }
Same formula as NCD, using Z = G (the Google compressor)
Use the Google counts and the CompLearn Toolkit to apply NGD.
Example
“horse”: #hits = 46,700,000 “rider”: #hits = 12,200,000 “horse” “rider”: #hits = 2,630,000 #pages indexed: 8,058,044,651
NGD(horse, rider) = 0.443. Theoretically and empirically: scale-invariant.
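The slide's number can be reproduced by direct computation from the quoted counts (only figures from the slide are used; the choice of logarithm base cancels out of the ratio, which is the scale invariance mentioned above):

```python
# Checking the slide's arithmetic. With g(x) = f(x)/N, the Google
# code length is G(x) = log(N/f(x)); the log base cancels in the NGD.
from math import log

N = 8_058_044_651        # pages indexed
f_horse = 46_700_000     # hits for "horse"
f_rider = 12_200_000     # hits for "rider"
f_both = 2_630_000       # hits for "horse" "rider" together

def G(f):
    return log(N / f)

ngd = (G(f_both) - min(G(f_horse), G(f_rider))) / max(G(f_horse), G(f_rider))
print(round(ngd, 3))  # 0.443, as on the slide
```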
Colors and Numbers: The Names! Hierarchical Clustering
colors
numbers
Hierarchical Clustering of 17th Century Dutch Painters,Paintings given by name, without painter’s name.
Hendrickje slapend, Portrait of Maria Trip, Portrait of Johannes Wtenbogaert, The Stone Bridge, The Prophetess Anna, Leiden Baker Arend Oostwaert, Keyzerswaert, Two Men Playing Backgammon, Woman at her Toilet, Prince's Day, The Merry Family, Maria Rey, Consul Titus Manlius Torquatus, Swartenhont, Venus and Adonis
Mathematicians
H5N1 (bird flu) virus mutations
Next: Binary Classification
Here we use the NGD for a Support Vector Machine (SVM) binary classification learner (we could also use a neural network).
Setup: anchor terms, positive/negative examples, test set, accuracy
Using NGD in an SVM (Support Vector Machine) to learn concepts (binary classification)
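A minimal sketch of this setup. Each term is represented by its vector of NGDs to a few anchor terms; the talk feeds such vectors to an SVM. Since the point here is the feature construction, a simple nearest-class-mean rule stands in for the SVM below, and all anchor terms and NGD values are hypothetical:

```python
# Binary classification from NGD feature vectors (nearest-class-mean
# stand-in for the SVM in the talk; values below are hypothetical).

anchors = ["emergency", "number"]  # hypothetical anchor terms

# term -> (NGD to each anchor, label): +1 = emergency, -1 = not
train = {
    "fire":  ([0.2, 0.8], +1),
    "flood": ([0.3, 0.9], +1),
    "seven": ([0.9, 0.1], -1),
    "three": ([0.8, 0.2], -1),
}

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

pos_mean = mean([v for v, y in train.values() if y > 0])
neg_mean = mean([v for v, y in train.values() if y < 0])

def classify(vec):
    """Assign +1 or -1 by the nearer class mean (squared Euclidean)."""
    dp = sum((a - b) ** 2 for a, b in zip(vec, pos_mean))
    dn = sum((a - b) ** 2 for a, b in zip(vec, neg_mean))
    return +1 if dp < dn else -1

print(classify([0.25, 0.85]), classify([0.85, 0.15]))
```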
Example:
Emergencies
Example: Classifying Prime Numbers
Actually, 91 is not a prime, so accuracy is 17/19 = 89.47%.
Example: Electrical Terms
Example: Religious Terms
Comparison with WordNet Semantics: http://www.cogsci.princeton.edu/~wn
NGD-SVM classifier on 100 randomly selected WordNet categories
Randomly selected positive, negative, and test sets
Histogram gives accuracy with respect to the knowledge entered by PhD experts into the WordNet database
Mean accuracy is 0.8725; standard deviation is 0.1169
Accuracy almost always > 75%, automatically
Translation Using NGD
Problem:
Translation:
Selected Bibliography
D. Benedetto, E. Caglioti, and V. Loreto. Language trees and zipping, Physical Review Letters, 88:4(2002), 048702.
C.H. Bennett, P. Gacs, M. Li, P.M.B. Vitanyi, and W. Zurek. Information distance, IEEE Transactions on Information Theory, 44:4(1998), 1407--1423.
C.H. Bennett, M. Li, B. Ma. Chain letters and evolutionary histories, Scientific American, June 2003, 76--81.
X. Chen, B. Francia, M. Li, B. McKinnon, A. Seker. Shared information and program plagiarism detection, IEEE Trans. Inform. Th., 50:7(2004), 1545--1551.
R. Cilibrasi. The CompLearn Toolkit, 2003, http://complearn.sourceforge.net/
R. Cilibrasi, P.M.B. Vitanyi, R. de Wolf. Algorithmic clustering of music based on string compression, Computer Music Journal, 28:4(2004), 49--67.
R. Cilibrasi, P.M.B. Vitanyi. Clustering by compression, IEEE Trans. Inform. Th., 51:4(2005), 1523--1545.
R. Cilibrasi, P.M.B. Vitanyi. Automatic meaning discovery using Google, http://xxx.lanl.gov/abs/cs.CL/0412098 (2004).
E. Keogh, S. Lonardi, and C.A. Ratanamahatana. Toward parameter-free data mining, Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22--25, 2004, 206--215.
M. Li, J.H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang. An information-based sequence distance and its application to whole mitochondrial genome phylogeny, Bioinformatics, 17:2(2001), 149--154.
M. Li and P.M.B. Vitanyi. Reversibility and adiabatic computation: trading time and space for energy, Proc. Royal Society of London, Series A, 452(1996), 769--789.
M. Li and P.M.B. Vitanyi. Algorithmic complexity, pp. 376--382 in: International Encyclopedia of the Social & Behavioral Sciences, N.J. Smelser and P.B. Baltes, Eds., Pergamon, Oxford, 2001/2002.
M. Li, X. Chen, X. Li, B. Ma, P.M.B. Vitanyi. The similarity metric, IEEE Trans. Inform. Th., 50:12(2004), 3250--3264.
M. Li and P.M.B. Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications, Springer-Verlag, New York, 2nd Edition, 1997.
A. Londei, V. Loreto, M.O. Belardinelli. Music style and authorship categorization by informative compressors, Proc. 5th Triannual Conference of the European Society for the Cognitive Sciences of Music (ESCOM), September 8--13, 2003, Hannover, Germany, pp. 200--203.
S. Wehner. Analyzing network traffic and worms using compression, Manuscript, CWI, 2004. Partially available at http://homepages.cwi.nl/~wehner/worms/