![Page 1: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/1.jpg)
Universal Similarity
Paul Vitanyi, CWI and University of Amsterdam
![Page 2: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/2.jpg)
The Problem:
Given: Literal objects (binary files), shown in the figure as objects 1, 2, 3, 4, 5
Determine: “Similarity” Distance Matrix (distances between every pair)
Applications: Clustering, Classification, Evolutionary trees of Internet documents, computer programs, chain letters, genomes, languages, texts, music pieces, OCR, ...
![Page 3: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/3.jpg)
Andrey Nikolaevich Kolmogorov (1903-1987, Tambov, Russia)
Measure Theory, Probability, Analysis, Intuitionistic Logic, Cohomology, Dynamical Systems, Hydrodynamics, Kolmogorov complexity
![Page 4: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/4.jpg)
TOOL:
Information Distance (Li, Vitanyi, 96; Bennett, Gacs, Li, Vitanyi, Zurek, 98)
$D(x,y) = \min \{ |p| : p(x)=y \ \&\ p(y)=x \}$
Here p is a binary program for a Universal Computer (Lisp, Java, C, Universal Turing Machine).
Theorem
(i) $D(x,y) = \max \{K(x|y), K(y|x)\}$, where $K(x|y)$ is the Kolmogorov complexity of x given y, defined as the length of the shortest binary program that outputs x on input y.
(ii) $D(x,y) \le D'(x,y)$ for any computable distance $D'$ satisfying $\sum_y 2^{-D'(x,y)} \le 1$ for every x.
(iii) $D(x,y)$ is a metric.
![Page 5: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/5.jpg)
However:
$D(x,y) = D(x',y')$, but x and y are much more similar than x' and y'.
(Figure: the pair x, y next to the pair x', y'.)
So, we Normalize:
$d(x,y) = \frac{D(x,y)}{\max \{K(x), K(y)\}}$
Normalized Information Distance (NID), the “Similarity metric”
![Page 6: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/6.jpg)
Properties NID:
Theorem:
• $0 \le d(x,y) \le 1$
• d(x,y) is a metric: symmetric, triangle inequality, d(x,x)=0
Drawback: NID(x,y) = d(x,y) is noncomputable, since K(.) is!
![Page 7: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/7.jpg)
In Practice:
Replace NID(x,y) by the Normalized Compression Distance (NCD):
$\mathrm{NCD}(x,y) = \frac{Z(xy) - \min \{Z(x), Z(y)\}}{\max \{Z(x), Z(y)\}}$
Z(x) is the length (#bits) of the compressed version of x using compressor Z (gzip, bzip2, PPMZ, ...).
This NCD is actually about the same formula as NID, but rewritten using “Z” instead of “K”.
Li, Badger, Chen, Kwong, Kearney, Zhang 01; Li, Vitanyi 01/02; Li, Chen, Li, Ma, Vitanyi 04
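The NCD formula can be tried directly with an off-the-shelf compressor. The following is a minimal sketch, not the CompLearn implementation: it assumes Python's standard `zlib` module as the compressor Z and measures Z in bytes rather than bits, which leaves the ratio unchanged. The helper names `z`, `ncd` and `noise` are illustrative choices, not taken from the talk.

```python
import random
import zlib


def noise(n: int, seed: int) -> bytes:
    """Return n pseudo-random bytes (practically incompressible test data)."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


def z(data: bytes) -> int:
    """Z(x): size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (Z(xy) - min{Z(x), Z(y)}) / max{Z(x), Z(y)}."""
    zx, zy, zxy = z(x), z(y), z(x + y)
    return (zxy - min(zx, zy)) / max(zx, zy)


if __name__ == "__main__":
    a = b"the quick brown fox jumps over the lazy dog " * 50
    b = b"the quick brown fox jumps over the lazy cat " * 50
    print("similar texts: ", round(ncd(a, b), 3))
    print("text vs. noise:", round(ncd(a, noise(2000, 0)), 3))
```

With a real compressor the value can slightly exceed 1 and NCD(x,x) is small but not exactly 0, because Z only approximates K.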
![Page 8: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/8.jpg)
Family of compression-based similarities
The NCD is actually a family of similarity measures, parametrized with the compressor, e.g.,
gzip, bzip2, PPMZ, ...
(forget the crippled compressors like compress, awk, ...)
![Page 9: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/9.jpg)
Application: Clustering of Natural Data
Unusual: we don’t know the number of clusters, and we don’t have a criterion to distinguish clusters.
Therefore, we hierarchically cluster to let the data decide these issues naturally.
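To make "hierarchical clustering from a distance matrix" concrete, here is a small sketch. It uses plain average-linkage agglomerative clustering as a stand-in, which is an assumption made for illustration; it is not the tree-building method behind the S(T) scores reported in this talk. The function names and the toy matrix are hypothetical.

```python
from typing import List, Sequence, Tuple, Union

# A tree is either a leaf index or a pair of subtrees.
Tree = Union[int, Tuple["Tree", "Tree"]]


def leaves(tree: Tree) -> List[int]:
    """List the leaf indices below a (sub)tree."""
    if isinstance(tree, int):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def average_linkage(dist: Sequence[Sequence[float]]) -> Tree:
    """Repeatedly merge the two clusters with the smallest average distance.

    Expects a non-empty square distance matrix.
    """
    clusters: List[Tree] = list(range(len(dist)))
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                li, lj = leaves(clusters[i]), leaves(clusters[j])
                d = sum(dist[a][b] for a in li for b in lj) / (len(li) * len(lj))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = (clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]


if __name__ == "__main__":
    # Toy distances, invented for illustration: objects 0,1 and 2,3 are close.
    toy = [
        [0.00, 0.10, 0.90, 0.80],
        [0.10, 0.00, 0.85, 0.90],
        [0.90, 0.85, 0.00, 0.20],
        [0.80, 0.90, 0.20, 0.00],
    ]
    print(average_linkage(toy))  # ((0, 1), (2, 3))
```

No number of clusters is passed in: the nesting of the resulting tree is what groups the objects.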
![Page 10: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/10.jpg)
Applications: First One: Phylogeny of Species
Eutherian Orders: Ferungulates, Primates, Rodents (Outgroup: Platypus, Wallaroo)
Hasegawa et al. 98 concatenate selected proteins and get different groupings depending on the proteins used.
We use whole mtDNA and approximate K(.) by GenCompress to determine the NCD matrix; we get only one tree.
![Page 11: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/11.jpg)
Who is our closer relative?
![Page 12: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/12.jpg)
Evolutionary Tree of Mammals:
Li, Badger, Chen, Kwong, Kearney, Zhang 01; Li, Vitanyi 01/02; Li, Chen, Li, Ma, Vitanyi 04
![Page 13: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/13.jpg)
Embedding NCD Matrix in dendrogram (hierarchical clustering) for this Large Phylogeny (no errors it seems)
Theria hypothesis versus Marsupionta hypothesis
Mammals:
Eutheria Metatheria Prototheria
Which pair is closest?
Cilibrasi, Vitanyi 2005
![Page 14: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/14.jpg)
NCD Matrix 24 Species (mtDNA). Diagonal elements about 0. Distances between primates ca. 0.6.
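A pairwise NCD matrix of this kind can be sketched as follows. This is an illustration only: it assumes `zlib` as the compressor instead of GenCompress, and pseudo-random byte strings instead of mtDNA, so the numbers are not comparable to the slide. The names `z`, `ncd`, `ncd_matrix` and `noise` are hypothetical.

```python
import random
import zlib
from typing import List, Sequence


def noise(n: int, seed: int) -> bytes:
    """Return n pseudo-random bytes; stands in for real sequence data here."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


def z(data: bytes) -> int:
    """Compressed size in bytes (zlib is an assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    zx, zy, zxy = z(x), z(y), z(x + y)
    return (zxy - min(zx, zy)) / max(zx, zy)


def ncd_matrix(objects: Sequence[bytes]) -> List[List[float]]:
    """Return the full matrix of NCD values between every pair of objects."""
    return [[ncd(a, b) for b in objects] for a in objects]


if __name__ == "__main__":
    objs = [noise(3000, seed) for seed in range(3)]
    for row in ncd_matrix(objs):
        print(" ".join(f"{value:.3f}" for value in row))
```

The diagonal comes out close to 0 rather than exactly 0, and the matrix is only approximately symmetric, since compressing xy and yx need not give the same length.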
![Page 15: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/15.jpg)
Identifying SARS Virus: S(T)=0.988
AvianAdeno1CELO.inp: Fowl adenovirus 1; AvianIB1.inp: Avian infectious bronchitis virus (strain Beaudette US); AvianIB2.inp: Avian infectious bronchitis virus (strain Beaudette CK); BovineAdeno3.inp: Bovine adenovirus 3; DuckAdeno1.inp: Duck adenovirus 1; HumanAdeno40.inp: Human adenovirus type 40; HumanCorona1.inp: Human coronavirus 229E; MeaslesMora.inp: Measles virus strain Moraten; MeaslesSch.inp: Measles virus strain Schwarz; MurineHep11.inp: Murine hepatitis virus strain ML-11; MurineHep2.inp: Murine hepatitis virus strain 2; PRD1.inp: Enterobacteria phage PRD1; RatSialCorona.inp: Rat sialodacryoadenitis coronavirus; SARS.inp: SARS TOR2v120403; SIRV1.inp: Sulfolobus virus SIRV-1; SIRV2.inp: Sulfolobus virus SIRV-2.
![Page 16: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/16.jpg)
Clustering: Phylogeny of 15 languages: Native American, Native African, Native European Languages
![Page 17: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/17.jpg)
Applications Everywhere
Genomics and Language Tree are just examples; also used with (e.g.):
Cilibrasi, Vitanyi, de Wolf, 2003/2004; Cilibrasi, Vitanyi, 2005.
MIDI music files (music clustering)
Plagiarism detection
Phylogeny of chain letters
SARS virus classification
Computer worms and internet traffic (attacks) analysis
Literature
OCR
Astronomy: radio telescope time sequences
Spam detection
Time sequences (all databases used in all major data-mining conferences of the last 10 years): superior over all methods in:
Anomaly detection
Heterogeneous data
![Page 18: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/18.jpg)
Russian Authors (in original Cyrillic); S(T)=0.949
I.S. Turgenev, 1818--1883 [Fathers and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky 1821--1881 [Crime and Punishment, The Gambler, The Idiot, Poor Folk]; L.N. Tolstoy 1828--1910 [Anna Karenina, The Cossacks, Youth, War and Peace]; N.V. Gogol 1809--1852 [Dead Souls, Taras Bulba, The Mysterious Portrait, How the Two Ivans Quarrelled]; M. Bulgakov 1891--1940 [The Master and Margarita, The Fateful Eggs, The Heart of a Dog]
![Page 19: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/19.jpg)
Same Russian Texts in English Translation; S(T)=0.953
Files start to cluster according to translators!
I.S. Turgenev, 1818--1883 [Fathers and Sons (R. Hare), Rudin (Garnett, C. Black), On the Eve (Garnett, C. Black), A House of Gentlefolk (Garnett, C. Black)]; F. Dostoyevsky 1821--1881 [Crime and Punishment (Garnett, C. Black), The Gambler (C.J. Hogarth), The Idiot (E. Martin), Poor Folk (C.J. Hogarth)]; L.N. Tolstoy 1828--1910 [Anna Karenina (Garnett, C. Black), The Cossacks (L. and M. Aylmer), Youth (C.J. Hogarth), War and Peace (L. and M. Aylmer)]; N.V. Gogol 1809--1852 [Dead Souls (C.J. Hogarth), Taras Bulba ($\approx$ G. Tolstoy, 1860, B.C. Baskerville), The Mysterious Portrait + How the Two Ivans Quarrelled ($\approx$ I.F. Hapgood)]; M. Bulgakov 1891--1940 [The Master and Margarita (R. Pevear, L. Volokhonsky), The Fateful Eggs (K. Gook-Horujy), The Heart of a Dog (M. Glenny)]
![Page 20: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/20.jpg)
12 Classical Pieces (Bach, Debussy, Chopin): no errors
![Page 21: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/21.jpg)
Optical Character Recognition: Data
Handwritten Digits from NIST Data Base
![Page 22: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/22.jpg)
Optical Character Recognition: Clustering; S(T)=0.901
![Page 23: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/23.jpg)
Heterogeneous Data; Clustering perfect with S(T)=0.95.
Clustering of radically different data.
No features known.
Only our parameter-free method can do this!!
![Page 24: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/24.jpg)
You can use it too!
CompLearn Toolkit: http://www.complearn.org
But what if we do not have the object as a file?
“x” and “y” are literal objects (files). What about abstract objects like “home”, “red”, “Socrates”, “chair”, ...?
Or names for literal objects?
![Page 25: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/25.jpg)
Non-Literal Objects
Googling for Meaning
Google distribution: g(x) = (Google page count of “x”) / (# pages indexed)
Cilibrasi, Vitanyi, 2004/2007.
![Page 26: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/26.jpg)
Google Compressor
Google code length:
G(x) = log 1 / g(x)
This is the Shannon-Fano code length that has minimum expected code word length w.r.t. g(x).
Hence we can view Google as a Google Compressor.
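As a small illustration of the code-length idea, the sketch below turns a page count into g(x) and G(x). The slide does not fix the base of the logarithm; base 2 is assumed here so that the result reads as a number of bits. The function names and the counts in the demo are hypothetical, not real search results.

```python
import math


def google_probability(page_count: int, pages_indexed: int) -> float:
    """g(x): fraction of indexed pages that contain the term."""
    return page_count / pages_indexed


def google_code_length(page_count: int, pages_indexed: int) -> float:
    """G(x) = log 1/g(x), in bits (base 2 is an assumption); needs page_count > 0."""
    return math.log2(pages_indexed / page_count)


if __name__ == "__main__":
    # Hypothetical counts: a term found on 1 page out of every 1024.
    print(google_probability(1, 1024))
    print(google_code_length(1, 1024))
```

Frequent terms get short code lengths and rare terms get long ones, which is what lets the page counts play the role of a compressor.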
![Page 27: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/27.jpg)
Normalized Google Distance (NGD)
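$\mathrm{NGD}(x,y) = \frac{G(x,y) - \min \{G(x), G(y)\}}{\max \{G(x), G(y)\}}$
Same formula as NCD, using Z = G (Google compressor); G(x,y) is the code length from the count of pages containing both x and y.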
Use the Google counts and the CompLearn Toolkit to apply NGD.
![Page 28: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/28.jpg)
Example
“horse”: #hits = 46,700,000
“rider”: #hits = 12,200,000
“horse” “rider”: #hits = 2,630,000
#pages indexed: 8,058,044,651
NGD(horse,rider) = 0.443
Theoretically + empirically: scale-invariant
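The figure for horse and rider can be reproduced from the four counts on this slide. The sketch below assumes G(x) = log(N / count(x)), with N the number of pages indexed; the function name `ngd` is an illustrative choice. The base of the logarithm cancels in the ratio, so natural logarithms are used.

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google Distance from page counts (all counts must be > 0).

    fx, fy: pages containing each term; fxy: pages containing both;
    n: total number of pages indexed.
    """
    gx, gy, gxy = (math.log(n / f) for f in (fx, fy, fxy))
    return (gxy - min(gx, gy)) / max(gx, gy)


if __name__ == "__main__":
    value = ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651)
    print(round(value, 3))  # 0.443
```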
![Page 29: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/29.jpg)
Colors and Numbers: The Names! Hierarchical Clustering
(Figure: clustering tree with the groups labeled “colors” and “numbers”.)
![Page 30: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/30.jpg)
Hierarchical Clustering of 17th Century Dutch Painters. Paintings given by name, without painter’s name.
Hendrickje slapend, Portrait of Maria Trip, Portrait of Johannes Wtenbogaert, The Stone Bridge, The Prophetess Anna, Leiden Baker Arend Oostwaert, Keyzerswaert, Two Men Playing Backgammon, Woman at her Toilet, Prince's Day, The Merry Family, Maria Rey, Consul Titus Manlius Torquatus, Swartenhont, Venus and Adonis
![Page 31: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/31.jpg)
Mathematicians
![Page 32: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/32.jpg)
H5N1 (Bird flu) virus mutations
![Page 33: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/33.jpg)
Next: Binary Classification
Here we use the NGD for a Support Vector Machine (SVM) binary classification learner (we could also use a neural network).
Setup: Anchor terms, positive/negative examples, Test set, Accuracy
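One plausible reading of this setup, stated here as an assumption since the slide does not spell it out, is that each example term is turned into a vector of its NGD values to the anchor terms, and those vectors are what the SVM is trained on. The sketch below builds only the feature vectors. The counts are made-up toy numbers, not real search results, and every name in it is hypothetical.

```python
import math
from typing import Callable, List, Sequence


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google Distance from page counts (all counts must be > 0)."""
    gx, gy, gxy = (math.log(n / f) for f in (fx, fy, fxy))
    return (gxy - min(gx, gy)) / max(gx, gy)


def feature_vector(
    term: str,
    anchors: Sequence[str],
    count: Callable[[str], int],
    pair_count: Callable[[str, str], int],
    n: int,
) -> List[float]:
    """One NGD value per anchor term; this vector would be the learner's input."""
    return [ngd(count(term), count(a), pair_count(term, a), n) for a in anchors]


# Made-up toy counts for illustration only, NOT real search results.
TOY_N = 1_000_000
TOY_COUNTS = {"fire": 1000, "chair": 2000, "danger": 500, "rescue": 400}
TOY_PAIRS = {
    frozenset(("fire", "danger")): 200,
    frozenset(("fire", "rescue")): 150,
    frozenset(("chair", "danger")): 5,
    frozenset(("chair", "rescue")): 4,
}


def toy_count(term: str) -> int:
    return TOY_COUNTS[term]


def toy_pair_count(a: str, b: str) -> int:
    return TOY_PAIRS[frozenset((a, b))]


if __name__ == "__main__":
    anchors = ["danger", "rescue"]
    for term in ("fire", "chair"):
        vec = feature_vector(term, anchors, toy_count, toy_pair_count, TOY_N)
        print(term, [round(v, 3) for v in vec])
```

The positive and negative example terms would each get such a vector together with a label, and any standard SVM library could then be trained on them.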
![Page 34: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/34.jpg)
Using NGD in SVM (Support Vector Machines) to learn concepts (binary classification)
Example:
Emergencies
![Page 35: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/35.jpg)
Example: Classifying Prime Numbers
Actually, 91 is not a prime. So accuracy is 17/19 = 89.47%.
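Both statements on this slide are easy to check. The snippet below is only a verification aid with a hypothetical trial-division helper `is_prime`; it is not part of the classifier.

```python
def is_prime(n: int) -> bool:
    """Trial division; adequate for small numbers like the ones on the slide."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


if __name__ == "__main__":
    print(is_prime(91))     # False, because 91 = 7 * 13
    print(f"{17 / 19:.2%}")  # 89.47%
```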
![Page 36: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/36.jpg)
Example: Electrical Terms
![Page 37: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/37.jpg)
Example: Religious Terms
![Page 38: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/38.jpg)
Comparison with WordNet Semantics
http://www.cogsci.princeton.edu/~wn
NGD-SVM Classifier on 100 randomly selected WordNet Categories
Randomly selected positive, negative and test sets
Histogram gives accuracy with respect to the knowledge entered by PhD experts in the WordNet Database
Mean Accuracy is 0.8725
Standard deviation is 0.1169
Accuracy almost always > 75%, automatically
![Page 39: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/39.jpg)
Translation Using NGD
Problem:
Translation:
![Page 40: Universal Similarity - NVTI · Russian Authors (in original Cyrillic) S(T)=0.949 I.S. Turgenev, 1818--1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky](https://reader033.vdocuments.us/reader033/viewer/2022052800/5f0fcc217e708231d445ee53/html5/thumbnails/40.jpg)
Selected Bibliography
D. Benedetto, E. Caglioti, and V. Loreto. Language trees and zipping, Physical Review Letters, 88:4(2002), 048702.
C.H. Bennett, P. Gacs, M. Li, P.M.B. Vitanyi, and W. Zurek. Information Distance, IEEE Transactions on Information Theory, 44:4(1998), 1407--1423.
C.H. Bennett, M. Li, B. Ma, Chain letters and evolutionary histories, Scientific American, June 2003, 76--81.
X. Chen, B. Francia, M. Li, B. McKinnon, A. Seker, Shared information and program plagiarism detection, IEEE Trans. Inform. Th., 50:7(2004), 1545--1551.
R. Cilibrasi, The CompLearn Toolkit, 2003, http://complearn.sourceforge.net/ .
R. Cilibrasi, P.M.B. Vitanyi, R. de Wolf, Algorithmic clustering of music based on string compression, Computer Music Journal, 28:4(2004), 49--67.
R. Cilibrasi, P.M.B. Vitanyi, Clustering by compression, IEEE Trans. Inform. Th., 51:4(2005), 1523--1545.
R. Cilibrasi, P.M.B. Vitanyi, Automatic meaning discovery using Google, http://xxx.lanl.gov/abs/cs.CL/0412098 (2004).
E. Keogh, S. Lonardi, and C.A. Ratanamahatana, Toward parameter-free data mining, In: Proc. 10th ACM SIGKDD Intn'l Conf. Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22--25, 2004, 206--215.
M. Li, J.H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang. An information-based sequence distance and its application to whole mitochondrial genome phylogeny, Bioinformatics, 17:2(2001), 149--154.
M. Li and P.M.B. Vitanyi, Reversibility and adiabatic computation: trading time and space for energy, Proc. Royal Society of London, Series A, 452(1996), 769--789.
M. Li and P.M.B. Vitanyi. Algorithmic Complexity, pp. 376--382 in: International Encyclopedia of the Social & Behavioral Sciences, N.J. Smelser and P.B. Baltes, Eds., Pergamon, Oxford, 2001/2002.
M. Li, X. Chen, X. Li, B. Ma, P.M.B. Vitanyi. The similarity metric, IEEE Trans. Inform. Th., 50:12(2004), 3250--3264.
M. Li and P.M.B. Vitanyi. An Introduction to Kolmogorov Complexity and its Applications, Springer-Verlag, New York, 2nd Edition, 1997.
A. Londei, V. Loreto, M.O. Belardinelli, Music style and authorship categorization by informative compressors, Proc. 5th Triennial Conference of the European Society for the Cognitive Sciences of Music (ESCOM), September 8--13, 2003, Hannover, Germany, pp. 200--203.
S. Wehner, Analyzing network traffic and worms using compression, Manuscript, CWI, 2004. Partially available at http://homepages.cwi.nl/~wehner/worms/