Philips Research, Jan Korst, 26 november 2004 1
Ontology-based Extraction of Information
from the Internet
Jan Korst, Philips Research
Joint work with Michael Verschoor, Nick de Jong, and Gijs Geleijnse
Overview
• Context
• Ontologies
• Searching for enumerations / tables in web pages
• Case Study: Searching for famous persons on the web
• Concluding remarks
Context
Recommender system:
• ontologies and metadata
• matching and reasoning
• preferences, personal history, and calendar
• electronic program guide, cultural agenda
• recommendations for TV shows, exhibitions in museums, theatre shows, etc.
Ontologies
An ontology is a “specification of a conceptualization”. [Tom Gruber]
In other words: a formal description of the concepts and their relationships in a certain domain.
Example: music domain
concepts: composers, songs, albums, performers, …
relationships: …
Semantic web languages such as RDF(S) and OWL are useful for specifying ontologies for given knowledge domains.
Ontologies
An ontology O is defined by a 4-tuple (C, I, P, T), where:
• C is a set of classes c, e.g. composer, song, album, performer, …
• I = { I(c) | c ∈ C }, with I(c) the set of instances of class c
• P is a set of properties p(c, c') for some c, c' ∈ C, e.g. is_composer_of(composer, song), is_contained_in(song, album)
• T = { T(p) | p ∈ P }, with T(p) ⊆ { (s, p, o) | s ∈ I(c), o ∈ I(c') } for each p ∈ P, the set of true statements (triples).
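The 4-tuple above can be sketched as a small data structure. This is only an illustration: the class name `Ontology`, its field and method names, and the placeholder instances `composer_1` and `song_1` are assumptions made for this example, not part of the original work.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Toy container for the 4-tuple (C, I, P, T)."""
    classes: set = field(default_factory=set)        # C: class names
    instances: dict = field(default_factory=dict)    # I: class name -> set of instances
    properties: dict = field(default_factory=dict)   # P: property name -> (c, c')
    triples: dict = field(default_factory=dict)      # T: property name -> set of (s, p, o)

    def add_instance(self, cls, instance):
        """Add an instance to I(cls); the class must already be in C."""
        if cls not in self.classes:
            raise ValueError(f"unknown class: {cls}")
        self.instances.setdefault(cls, set()).add(instance)

    def add_statement(self, s, p, o):
        """Store (s, p, o) only if s is in I(c) and o is in I(c') for p(c, c')."""
        c, c2 = self.properties[p]
        if s not in self.instances.get(c, set()) or o not in self.instances.get(c2, set()):
            raise ValueError(f"({s}, {p}, {o}) does not fit {p}({c}, {c2})")
        self.triples.setdefault(p, set()).add((s, p, o))


# Music-domain example with placeholder (hypothetical) instance names.
music = Ontology(
    classes={"composer", "song", "album"},
    properties={"is_composer_of": ("composer", "song"),
                "is_contained_in": ("song", "album")},
)
music.add_instance("composer", "composer_1")
music.add_instance("song", "song_1")
music.add_statement("composer_1", "is_composer_of", "song_1")
```

Keeping T keyed by property mirrors the definition T = { T(p) | p ∈ P } and makes per-property precision and recall easy to compute later.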
Problem statement
For a partially given ontology O' = (C, I', P, T') of a given knowledge domain, with I' ⊆ I and T' ⊆ T, extend I' to I'' and T' to T'' so that they approximate I and T as well as possible.
In other words: how can we populate databases?
Research questions:
- Can this be automated?
- Can we do this by extracting information from the web?
Quality of Approximation
For each class c, we define precision and recall as follows:
precision(c) = |I(c) ∩ I''(c)| / |I''(c)|
recall(c) = |I(c) ∩ I''(c)| / |I(c)|
For each property p, precision and recall are defined likewise.
[Figure: diagram of the sets I(c), I'(c), and I''(c)]
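As a worked illustration of these two measures, the following sketch computes precision and recall for one class from two Python sets. The function name and the example sets are assumptions chosen for illustration; the instance names are borrowed from the composer example later in the talk.

```python
def precision_recall(true_instances, found_instances):
    """Return (precision, recall) of the found set I''(c) against the true set I(c)."""
    true_set, found = set(true_instances), set(found_instances)
    correct = len(true_set & found)                       # |I(c) ∩ I''(c)|
    precision = correct / len(found) if found else 0.0    # divide by |I''(c)|
    recall = correct / len(true_set) if true_set else 0.0 # divide by |I(c)|
    return precision, recall


# Hypothetical example: four true composers, five extracted terms.
true_composers = {"bach", "mozart", "vivaldi", "haydn"}
extracted = {"bach", "mozart", "haydn", "ensembles", "http"}
precision, recall = precision_recall(true_composers, extracted)
```

Here three of the five extracted terms are correct (precision 0.6) and three of the four true instances are found (recall 0.75).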
Searching for enumerations on the web
Basic idea: words in an enumeration tend to be of the same class.
Given a small subset of instances of a given class, we want to automatically extend this subset: more-of-the-same.
Algorithm:
- select web pages in which a given sequence or given subset of instances occurs, using Google.
- scan these pages for enumerations in which one or more of the given instances occur.
- extract the other terms that are in these enumerations.
A similar approach has been applied to a corpus of documents in molecular biology [Nenadić, Spasić & Ananiadou, 2002].
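The scanning and extraction steps can be sketched as follows. This is a simplified illustration that works on page texts already held in memory. It does not query Google, and the splitting rules (commas and the word "and", at least three items, single-word items) are assumptions made for this example rather than the rules of the original program.

```python
import re
from collections import Counter


def extract_enumeration_terms(pages, seeds, min_seed_hits=1):
    """Count terms that occur in enumerations containing at least one seed instance."""
    seed_set = {s.lower() for s in seeds}
    counts = Counter()
    for text in pages:
        for sentence in re.split(r"[.;:\n]", text):
            parts = [p.strip() for p in re.split(r",|\band\b", sentence) if p.strip()]
            if len(parts) < 3:
                continue  # too short to be treated as an enumeration
            # Boundary items may carry surrounding text: keep the word nearest the list.
            parts[0] = parts[0].split()[-1]
            parts[-1] = parts[-1].split()[0]
            items = [p.lower() for p in parts if re.fullmatch(r"[^\W\d_]+", p)]
            if len(seed_set & set(items)) >= min_seed_hits:
                counts.update(items)
    return counts


pages = [
    "Works by Bach, Vivaldi, Handel and Mozart were played.",
    "Buy cheese, bread and milk.",
]
counts = extract_enumeration_terms(pages, ["bach", "vivaldi", "mozart"])
new_terms = sorted(t for t in counts if t not in {"bach", "vivaldi", "mozart"})
```

In this toy input only the first sentence contains seed instances, so `new_terms` holds the single new candidate `handel`. The grocery list is ignored.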
General structure of the algorithm
1. Preselection of relevant web pages
2. Extraction of instances/statements
3. Filter to remove false positives
Examples
"bach vivaldi mozart" 611 --> [63] bach[154], mozart[46],
vivaldi[45],
haydn[17], beethoven[14], ensembles[9], handel[9], chopin[7],
haendel[5],
schubert[5], bizet[4], j[4], albinoni[3], brahms[3], s[3], sanz[3],
tartini[3], 2[2], chaconne[2], corelligeminiani[2], gershwin[2],
gluck[2],
http[2], inteacutegrale[2], minor[2], paganini[2], ravel[2],
strauss[2],
stravinsky[2], tchaikovsky[2], teleman[2], telemann[2], albeniz[1],
bellini[1], benda[1], berlioz[1], bloch[1], boccherini[1], boellman[1],
boieldieu[1], bruch[1], caccini[1], caldera[1], corelli[1], diabelli[1],
dowland[1], giuliani[1], grieg[1], homekcrrcom[1], jsbach[1],
martin[1],
milano[1], ortiz[1], pergolesi[1], prokofiev[1], purcell[1],
rimskykorsakov[1],
schumann[1], smetana[1], title[1], torelli[1], vieuxtemps[1]
Examples (2)
"france germany england italy" 246 --> [54] france[322],
germany[259], brazil[257], italy[239], argentina[223],
england[218], spain[215], holland[212], yugoslavia[140],
croatia[133], denmark[129], norway[122], chile[91], belgium[88],
nigeria[83], romania[83], mexico[66], bulgaria[59], colombia[54],
scotland[34], austria[33], cameroon[30], team[25], usa[22],
sth[18], states[16], morocco[13], ar[12], netherlands[12],
saudi[11], africa[10], bahamas[10], paraguay[10], czech[8],
jamaica[8], scandinavia[8], canada[7], japan[7], acquitane[4],
australia[4], bali[4], caribbean[4], china[4], czechoslovakia[4],
luxembourg[4], poland[4], us[4], flanders[2], acadeacutemiques[1],
asn[1], cortona[1], europe[1], korea[1], park[1]
Examples (3)
poincare hilbert brouwer 1110 --> [90] brouwer[20], hilbert[20],
abel[18],
deligne[18], gregory[18], mandelbrot[18], taylor[18], turing[18],
cavalieri[17],
poisson[17], banach[16], kolmogorov[16], wiener[16], goldbach[15],
grassmann[15],
cohen[13], hausdorff[13], jacobi[13], kronecker[13], torricelli[13],
vinogradov[13],
riemann[12], dedekind[11], frege[11], artin[10], babbage[10], barrow[10],
boole[10], bourgain[10], eukleidõs[10], euler[10], fraenkel[10],
heaviside[10], legendre[10], möbius[10], shannon[10], tchebychev[10],
borel[9], fibonacci[9], fisher[9], grothendieck[9], aryabhata[8], birkhoff[8],
bolyai[8], cayley[8], church[8],
descartes[8], hypatie[8], markov[8], minkowski[8], bolzano[7], cramer[7],
dee[7],
painlevÕ[7], cantor[6], morgan[6], puthagoras[6], gauss[5], haldane[5],
hauptman[5], irons[5], lejeune[5], schwartz[5], lie[4], bayes[3],
poincareacute[3], poincarÕ[3], biography[2], brahmagupta[2], carnap[2],
goumldel[2], gödel[2], …
Hypernym-based filtering
Patterns that indicate hypernym relations are distinguished:
"h such as i1, i2, …, in" and
"i1, i2, …, in and other h" [Hearst, 1992]
In these patterns, h is the plural of the intended class.
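The two patterns can be matched with regular expressions, as in the sketch below. The function names and the way items are split are assumptions for illustration. In particular, the first item of the second pattern may still carry leading sentence text, which a real filter would have to trim.

```python
import re


def _split_items(enumeration):
    """Split 'i1, i2 and i3' into its items."""
    return [p.strip() for p in re.split(r",|\band\b|\bor\b", enumeration) if p.strip()]


def hearst_instances(text, plural_class):
    """Return candidate instances of a class, given its plural form h."""
    h = re.escape(plural_class)
    found = []
    # Pattern 1: "h such as i1, i2, ..., in"
    for m in re.finditer(rf"\b{h} such as ([^.;:]+)", text, flags=re.IGNORECASE):
        found.extend(_split_items(m.group(1)))
    # Pattern 2: "i1, i2, ..., in and other h"
    for m in re.finditer(rf"([^.;:]+?),? and other {h}\b", text, flags=re.IGNORECASE):
        found.extend(_split_items(m.group(1)))
    return found


text = ("He admired composers such as Bach, Vivaldi and Mozart. "
        "France, Germany and other countries signed.")
composers = hearst_instances(text, "composers")
countries = hearst_instances(text, "countries")
```

A term from an enumeration would then be kept only if it also shows up as an instance in one of these hypernym patterns for the intended class.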
Geographic Data
Extract all countries:
Input set                  Precision   Recall
France, China, Germany     0.89        0.99
Georgia, Ghana, Latvia     0.84        0.99
Kiribati, Monaco, Togo     0.79        0.99
Find out which countries have a border in common.
Case Study: Finding Famous Persons on the Web
Objective: generate a long list of famous persons, by searching the web.
- A famous person is a person that gets enough hits when being Googled.
- We restrict ourselves to persons that have already died.
Definition of number of hits
Using only the last name is not specific enough, e.g. Bach, Smith.
Even the full name might not be specific enough, e.g. Theo van Gogh.
In addition, some persons score better with their middle name, others without, e.g. Johann Sebastian Bach vs. Johann Bach, and Antonio Vivaldi vs. Antonio Lucio Vivaldi.
Others are best known by their initials only, e.g. HG Wells, DH Lawrence.
Definition of number of hits
We use the number of hits that are found with the query:
"<last name> (<year of birth> - <year of death>)", e.g. "Bach (1685 – 1750)"
By not using the full name, we combine different variants, e.g. Johann Sebastian Bach and JS Bach.
For kings, queens, popes, etc., the Roman numeral (ordinal number) is used as the last name. This combines the variants in different languages, e.g. Charles V, Carlos V, Karel V.
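A minimal sketch of how such a query string could be built is shown below. The function name is hypothetical, and taking the last whitespace-separated word as the last name is an assumption that happens to cover both the surname case and the Roman-numeral case.

```python
def hit_query(full_name, year_of_birth, year_of_death):
    """Build the exact-phrase query '"<last name> (<birth> - <death>)"'."""
    last_name = full_name.split()[-1]   # assumption: last word is the last name
    return f'"{last_name} ({year_of_birth} - {year_of_death})"'


bach_query = hit_query("Johann Sebastian Bach", 1685, 1750)
js_bach_query = hit_query("JS Bach", 1685, 1750)
charles_query = hit_query("Charles V", 1500, 1558)
carlos_query = hit_query("Carlos V", 1500, 1558)
```

Name variants of the same person map to the same query string, which is why their hit counts are combined.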
Basic idea
We use potential time intervals
“(<year of birth> - <year of death>)”
as starting point to search for persons.
Issue exact queries to Google of the following form:
allintitle: "(y1 – y2)"
where y1 ∈ [1000..1999] and y2 - y1 ∈ [20..110], and analyse the summaries Google returns.
Look for the six words that precede "(y1 – y2)" and analyse these words.
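The query generation and the extraction of the six preceding words can be sketched as follows. The function names and the tolerance for a hyphen or an en dash between the years are assumptions for illustration. Fetching the summaries from Google is outside this sketch.

```python
import re


def interval_queries(first_year=1000, last_year=1999, min_age=20, max_age=110):
    """Yield one allintitle query per (y1, y2) with y2 - y1 in [min_age..max_age]."""
    for y1 in range(first_year, last_year + 1):
        for age in range(min_age, max_age + 1):
            yield f'allintitle: "({y1} - {y1 + age})"'


def preceding_words(summary, y1, y2, n=6):
    """Return up to n words that directly precede '(y1 - y2)' in a summary."""
    m = re.search(rf"\(\s*{y1}\s*[-–]\s*{y2}\s*\)", summary)
    if m is None:
        return []
    return summary[:m.start()].split()[-n:]


queries = list(interval_queries())
words = preceding_words("Classical DVD: Handel, George Frederic (1685 - 1759)", 1685, 1759)
```

With 1000 birth years and 91 admissible life spans, this enumeration produces 91,000 queries, which is why batch processing is needed.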
Google batch processing
To process the Google queries we use a program that allows batch processing (Nick de Jong).
The program allows parallel execution of multiple queries.
file with queries → GoogleQuery → file with results
Main problem: how to separate person names from other names.
Person name      Other name
Art Blakey       Art Deco
West Mae         West Virginia
Raul Delcroix    Real Decreto
HP Lovecraft     HP Inkjet
Koye Somefun     Have SomeFun
Potential approaches:
- filter out non-persons by using a list of stop words.
- filter out non-persons by using an exhaustive list of first names.
- carry out further tests ("X was born in").
We only used a list of 500 stop words, including: Album, Anniversary, Archive, Articles, Biographie, Biography, Births, Boats, Burials, Catalog, Census, …
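A stop-word filter of this kind can be sketched in a few lines. Only the stop words listed above are included here, since the rest of the 500-word list is not given in the talk. The function name is hypothetical.

```python
import re

# Only the stop words named on the slide; the full list of 500 is not reproduced here.
STOP_WORDS = {
    "album", "anniversary", "archive", "articles", "biographie", "biography",
    "births", "boats", "burials", "catalog", "census",
}


def passes_stop_word_filter(candidate, stop_words=STOP_WORDS):
    """Return False if any word of the candidate name is a stop word."""
    words = re.findall(r"[^\W\d_]+", candidate.lower())
    return not any(w in stop_words for w in words)


kept = [c for c in ["Art Blakey", "Handel Biography", "Births Census", "Art Deco"]
        if passes_stop_word_filter(c)]
```

Note that "Art Deco" survives this filter because neither word is in the short list, which illustrates why a larger list or additional tests are needed.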
Additional problem: a single person can be presented in various ways.
Vasilij Kandinskij
Wassily Kandinsky
Vasily Kandinsky
Vassily Kandinsky
Kandinsky, Wassily
Kandinsky Wassily

Johann Sebastian Bach
JS Bach
Johann Sebastian
Sebastian Bach
Bach, Johann Sebastian
Example of the word sequences that are found:
[allintitle: "(1769 - 1852)" -genealogy -genealogie] 111
Rose-Philippine Duchesne (
Rose-Philippine Duchesne (
Wellesley, 1st Duke of Wellington (
Home Study Service Rose Philippine Duchesne
Arthur, 1st Duke of Wellington (
The Duke of Wellington (
Wellesley, 1st Duke of Wellington (
Arthur Wellesley, Duke of Wellington. (
Wellesley, first Duke of Wellington (
People > Duke of Wellington (
> Pobl > Dug Wellington (
medal depicting Duke of Wellington (
Arthur Wellesley Wellington (
Wellesley, 1st Duke of Wellington (
John Landseer (
Wellington, Arthur Wellesley,Duke of,
Learning Library: WELLINGTON, DUKE OF (
Another example:
George Frederick Handel (
GEORGE F. HANDEL (
X. George Frederick Handel. (
Handel, George Frideric (
George Frederic Handel,
... George Frederic Handel (
CD:Composers - H: Handel, George Frederic (German/British
Classical DVD: Handel, George Frederic (German/British,
George Frederic Handel (
... George Frideric HANDEL (
Georg Frideric Handel |
from Alibris George Frideric Handel (
New Window. George Frideric Handel (
up artist Handel, George F. (
Giulio Cesare. by GF Handel (
piece by HANDEL, Georg Friedrich (
1. First reduce capitals:
If a word consists of capitals only, then replace all but the first by lower-case letters, e.g. HANDEL → Handel.
Unless the word contains a hyphen, e.g. SAINT-SAENS → Saint-Saens.
Unless the word represents a Roman numeral, e.g. Louis XIV → Louis XIV.
Unless the word starts with 'MC', e.g. MCCULLOCH → McCulloch.
Unless the word is an abbreviation (initials), e.g. DE KNUTH → DE Knuth.
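The capital-reduction rules can be sketched as below. Two details are assumptions made for this example and are not stated in the talk: an all-capital word of at most two letters is treated as initials, and Roman numerals are recognised with a standard regular expression.

```python
import re

ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")


def reduce_capitals(word, max_initials=2):
    """Lower-case all but the first letter of an all-capitals word, with exceptions."""
    m = re.fullmatch(r"(\W*)([\w-]*)(\W*)", word)
    if m is None:
        return word                      # e.g. 'CD:Composers' is left untouched
    prefix, core, suffix = m.groups()
    letters = core.replace("-", "")
    if not letters.isalpha() or not letters.isupper():
        return word                      # not an all-capitals word
    if "-" in core:                      # SAINT-SAENS -> Saint-Saens
        core = "-".join(part.capitalize() for part in core.split("-"))
    elif ROMAN.fullmatch(core):          # XIV stays XIV
        pass
    elif len(core) <= max_initials:      # assumed rule for initials: DE stays DE
        pass
    elif core.startswith("MC"):          # MCCULLOCH -> McCulloch
        core = "Mc" + core[2:].capitalize()
    else:                                # HANDEL -> Handel
        core = core.capitalize()
    return prefix + core + suffix


def reduce_capitals_in(sequence):
    """Apply reduce_capitals to every word of a word sequence."""
    return " ".join(reduce_capitals(w) for w in sequence.split())


reduced = reduce_capitals_in("piece by HANDEL, Georg Friedrich (")
```

The two-letter threshold is chosen so that GF and DE are kept while DVD becomes Dvd, matching the behaviour visible in the Handel example.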
Example (after reducing capitals):
George Frederick Handel (
George F. Handel (
X. George Frederick Handel. (
Handel, George Frideric (
George Frederic Handel,
... George Frederic Handel (
CD:Composers - H: Handel, George Frederic (German/British,
Classical Dvd: Handel, George Frederic (German/British,
George Frederic Handel (
... George Frideric Handel (
Georg Frideric Handel |
from Alibris George Frideric Handel (
New Window. George Frideric Handel (
up artist Handel, George F. (
Giulio Cesare. by GF Handel (
piece by Handel, Georg Friedrich (
2. Delete prefixes and suffixes:
Delete parts that cannot be part of the name.
First delete the suffix.
Next, scan through the words from back to front, until e.g. a colon or period is encountered.
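This step can be sketched as follows. The concrete rules are assumptions inferred from the Handel example: trailing tokens starting with "(" or "|" count as suffix, and a single letter followed by a period is treated as an initial rather than as a sentence end.

```python
def is_initial(word):
    """True for a single letter followed by a period, e.g. 'F.' or 'X.'."""
    return len(word) == 2 and word[0].isalpha() and word[1] == "."


def strip_affixes(sequence):
    """Remove the suffix, then cut everything before a colon or sentence-ending period."""
    words = sequence.split()
    # Suffix: drop trailing tokens that open a bracket or are a separator bar.
    while words and words[-1][0] in "(|":
        words.pop()
    # Remove trailing punctuation from the last word, but keep initials intact.
    if words and not is_initial(words[-1]):
        words[-1] = words[-1].rstrip(".,")
        if not words[-1]:
            words.pop()
    # Prefix: scan from back to front and stop at a colon or sentence-ending period.
    kept = []
    for word in reversed(words):
        if word.endswith(":") or (word.endswith(".") and not is_initial(word)):
            break
        kept.append(word)
    return " ".join(reversed(kept))


cleaned = strip_affixes("CD:Composers - H: Handel, George Frederic (German/British,")
```

Leading words such as "from Alibris" or "piece by" are not removed by this step, since no colon or period separates them from the name.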
Example (after deleting prefixes and suffixes):
George Frederick Handel
George F. Handel
X. George Frederick Handel
Handel, George Frideric
George Frederic Handel
George Frederic Handel
Handel, George Frederic
Handel, George Frederic
George Frederic Handel
George Frideric Handel
Georg Frideric Handel
from Alibris George Frideric Handel
George Frideric Handel
up artist Handel, George F.
by GF Handel
piece by Handel, Georg Friedrich
3. Correct inversions:
If two words remain, where the first ends with a comma, then reverse them, e.g. West, Mae → Mae West.
If three words remain, where the first ends with a comma, then move the first word to the end, e.g. Handel, George Frederick → George Frederick Handel.
If three words remain, where the second ends with a comma, then move the third word to the front, e.g. Van Gogh, Vincent → Vincent van Gogh.
Problem: not all inverted names contain commas.
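The three inversion rules translate directly into code, as in this sketch. The function name is hypothetical, and the sketch does not attempt to lower-case a particle such as "Van", which the slide's example output does.

```python
def correct_inversion(name):
    """Undo 'Last, First' inversions in two- and three-word names."""
    words = name.split()
    if len(words) == 2 and words[0].endswith(","):
        # "West, Mae" -> "Mae West"
        return f"{words[1]} {words[0][:-1]}"
    if len(words) == 3 and words[0].endswith(","):
        # "Handel, George Frederick" -> "George Frederick Handel"
        return f"{words[1]} {words[2]} {words[0][:-1]}"
    if len(words) == 3 and words[1].endswith(","):
        # "Van Gogh, Vincent" -> "Vincent Van Gogh"
        return f"{words[2]} {words[0]} {words[1][:-1]}"
    return name  # no comma-based inversion detected


corrected = [correct_inversion(n) for n in
             ["West, Mae", "Handel, George Frederick", "Van Gogh, Vincent", "Kandinsky Wassily"]]
```

The last input shows the stated problem: an inverted name without a comma passes through unchanged.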
Example (after correcting inversions):
George Frederick Handel
George F. Handel
X. George Frederick Handel
George Frideric Handel
George Frederic Handel
George Frederic Handel
George Frederic Handel
George Frederic Handel
George Frederic Handel
George Frideric Handel
Georg Frideric Handel
from Alibris George Frideric Handel
George Frideric Handel
up artist Handel, George F.
by GF Handel
piece by Handel, Georg Friedrich
4. Save two- and three-word names:
Scan the list of strings and store those that consist of two or three words, provided that they do not contain stop words.
In addition, count how often each of them is found.
Example:
Form found                   Count
George Frederic Handel       5
George Frideric Handel       2
George F. Handel             1
George Frederick Handel      1
Georg Frideric Handel        1
by GF Handel                 1
For each last-name/years combination, the form that was found most often is used.
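Counting the stored forms and choosing the most frequent one can be sketched with a `Counter`. The function name, the small candidate list, and the single stop word are assumptions for illustration and do not reproduce the counts on the slide.

```python
from collections import Counter


def most_frequent_form(candidates, stop_words=frozenset()):
    """Count two- and three-word candidates without stop words; return the top form."""
    counts = Counter(
        c for c in candidates
        if 2 <= len(c.split()) <= 3
        and not any(w.lower().strip(".,") in stop_words for w in c.split())
    )
    return counts.most_common(1)[0] if counts else None


candidates = (
    ["George Frederic Handel"] * 3
    + ["George Frideric Handel"] * 2
    + ["from Alibris George Frideric Handel", "Handel Biography"]
)
best = most_frequent_form(candidates, stop_words={"biography"})
```

The five-word candidate and the one containing a stop word are discarded before counting, so `best` is the form "George Frederic Handel" with a count of 3.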
Unexpected Observations
- Franz-Eugen Schlachter (1859 – 1911) has 64,500 hits, but all from the same server!
It concerns an online bible, where each bible page is implemented as a separate web page, with Franz-Eugen Schlachter in the title.
We can use the similar-pages information that Google gives to filter these out.
- Koop Juliana (1948 - 1980) has 8,200 hits.
"Koop Juliana" results in considerably fewer hits than "Juliana (1948 – 1980)". That can be an indication that the first name is not correct.
Number of Persons Found
1000 – 1099:     40
1100 – 1199:     42
1200 – 1299:     79
1300 – 1399:    106
1400 – 1499:    357
1500 – 1599:   1050
1600 – 1699:   2258
1700 – 1799:   7239
1800 – 1899:  28637
1900 – 1999:  12101
Total:        51909
Top 16 born between 1500 and 1599
1 William Shakespeare (1564 - 1616) 51300
2 Rene Descartes (1596 - 1650) 33400
3 Galileo Galilei (1564 - 1642) 27300
4 Francis Bacon (1561 - 1626) 25200
5 John Dowland (1563 - 1626) 25000
6 Orlandus Lassus (1532 - 1594) 23200
7 Johannes Kepler (1571 - 1630) 22700
8 Thomas Hobbes (1588 - 1679) 15400
9 Frescobaldi Girolamo (1583 - 1643) 11900
10 Claudio Monteverdi (1567 - 1643) 11600
11 Peter Paul Rubens (1577 - 1640) 11400
12 Tycho Brahe (1546 - 1601) 11000
13 Michel de Montaigne (1533 - 1592) 10700
14 John Calvin (1509 - 1564) 9990
15 Elizabeth I (1558 - 1603) 7520
16 Andrea Palladio (1508 - 1580) 7140
17 Gibbons Orlando (1508 – 1580) 7030
18 Nicolas Poussin (1594 - 1665) 6790
Top 16 born between 1600 and 1699
1 Johann Sebastian Bach (1685 - 1750) 86600
2 Antonio Vivaldi (1678 - 1741) 39700
3 Henry Purcell (1659 - 1695) 37600
4 Georg Philipp Telemann (1681 - 1767) 35700
5 Georg Friedrich Haendel (1685 - 1759) 33600
6 Voltaire (1694 - 1778) 32800
7 Isaac Newton (1642 - 1727) 31700
8 Domenico Scarlatti (1685 - 1757) 28300
9 Arcangelo Corelli (1653 - 1713) 27300
10 Francois Couperin (1668 - 1733) 27100
11 Jean-Philippe Rameau (1683 - 1764) 26700
12 Alessandro Scarlatti (1660 - 1725) 25600
13 Tomaso Albinoni (1671 - 1751) 25000
14 Jean-Baptiste Lully (1632 - 1687) 24900
15 Giuseppe Tartini (1692 - 1770) 23800
16 de la Barca (1600 - 1681) 23000
17 John Locke (1632 - 1704) 22800
18 Blaise Pascal (1623 - 1662) 22700
Top 16 born between 1700 and 1799
1 Wolfgang Amadeus Mozart (1756 - 1791) 79000
2 Ludwig van Beethoven (1770 - 1827) 69400
3 Franz Schubert (1797 - 1828) 62300
4 Napoleon Bonaparte (1769 - 1821) 61500
5 Joseph Haydn (1732 - 1809) 50300
6 Johann Wolfgang Goethe (1749 - 1832) 45800
7 Immanuel Kant (1724 - 1804) 35800
8 Gioacchino Rossini (1792 - 1868) 34300
9 Benjamin Franklin (1706 - 1790) 28600
10 Washington Irving (1783 - 1859) 26900
11 Luigi Boccherini (1743 - 1805) 25100
12 Luigi Cherubini (1760 - 1842) 24100
13 William Blake (1757 - 1827) 22000
14 Arthur Schopenhauer (1788 - 1860) 21900
15 Thomas Jefferson (1743 - 1826) 20100
16 Jean-Jacques Rousseau (1712 - 1778) 19400
17 Boyce William (1711 - 1779) 17400
18 Heinrich Heine (1797 - 1856) 15900
Top 16 born between 1800 and 1899
1 Charles Darwin (1809 - 1882) 73400
2 Albert Einstein (1879 - 1955) 70500
3 Johannes Brahms (1833 - 1897) 60600
4 James Joyce (1882 - 1941) 59300
5 Peter Iljitsch Tschaikowsky (1840 - 1893) 47600
6 Robert Schumann (1810 - 1856) 45300
7 Frederic Chopin (1810 - 1849) 41200
8 Giuseppe Verdi (1813 - 1901) 41100
9 Claude Debussy (1862 - 1918) 39400
10 Winston Churchill (1874 - 1965) 39300
11 Franz Liszt (1811 - 1886) 38500
12 Richard Wagner (1813 - 1883) 38300
13 Richard Strauss (1864 - 1949) 37800
14 Antonin Dvorak (1841 - 1904) 35700
15 Maurice Ravel (1875 - 1937) 35300
16 Gustav Mahler (1860 - 1911) 34300
Top 16 born between 1900 and 1999
Rank | 16 nov. 2004 | 29 nov. 2004
1 | Ronald Reagan (1911 - 2004) 44800 | Yasser Arafat (1929 - 2004) 84200
2 | Benjamin Britten (1913 - 1976) 31700 | Ronald Reagan (1911 - 2004) 46600
3 | John Peel (1939 - 2004) 27400 | Benjamin Britten (1913 - 1976) 32000
4 | Samuel Barber (1910 - 1981) 26600 | Samuel Barber (1910 - 1981) 26300
5 | John Fitzgerald Kennedy (1917 - 1963) 24100 | John Peel (1939 - 2004) 21700
6 | Robertson Davies (1913 - 1995) 18900 | Robertson Davies (1913 - 1995) 18800
7 | Yasser Arafat (1929 - 2004) 16600 | John F. Kennedy (1917 - 1963) 17300
8 | Peter Ustinov (1921 - 2004) 16500 | Peter Ustinov (1921 - 2004) 16700
9 | Kurt Cobain (1967 - 1994) 14800 | Kurt Cobain (1967 - 1994) 14400
10 | Salvador Dali (1904 - 1989) 14600 | Salvador Dali (1904 - 1989) 14000
11 | Christopher Reeve (1952 - 2004) 13900 | Jon Lee (1968 - 2002) 13900
12 | Jon Lee (1968 - 2002) 13900 | Marlon Brando (1924 - 2004) 11200
13 | Marlon Brando (1924 - 2004) 11200 | Christopher Reeve (1952 - 2004) 10800
14 | Van Gogh (1957 - 2004) 10900 | Jean-Paul Sartre (1905 - 1980) 9790
15 | Albert Camus (1913 - 1960) 9730 | Chostakovitch Dimitri (1906 - 1975) 9640
16 | Jean-Paul Sartre (1905 - 1980) 9630 | Albert Camus (1913 - 1960) 9180
17 | Ted Hughes (1930 - 1998) 8970 | Van Gogh (1957 - 2004) 9050
18 | Jim Morrison (1943 - 1971) 8930 | Steve Reich (1965 - 1995) 8370
Top 16 born between 1000 and 1999
1 Johann Sebastian Bach (1685 - 1750) 86600
2 Wolfgang Amadeus Mozart (1756 - 1791) 79000
3 Charles Darwin (1809 - 1882) 73400
4 Albert Einstein (1879 - 1955) 70500
5 Ludwig van Beethoven (1770 - 1827) 69400
6 Franz Schubert (1797 - 1828) 62300
7 Napoleon Bonaparte (1769 - 1821) 61500
8 Johannes Brahms (1833 - 1897) 60600
9 James Joyce (1882 - 1941) 59300
10 Leonardo da Vinci (1452 - 1519) 53400
11 William Shakespeare (1564 - 1616) 51300
12 Joseph Haydn (1732 - 1809) 50300
13 Peter Iljitsch Tschaikowsky (1840 - 1893) 47600
14 Johann Wolfgang Goethe (1749 - 1832) 45800
15 Robert Schumann (1810 - 1856) 45300
16 Ronald Reagan (1911 - 2004) 44800
Testing recall
Herinneringen in Steen: 195 persons, recall 0.77
150 found: James Baldwin, Olaf Palme, Simone Signoret, Henry Moore, Carel Willink, Joan Miro, Theolonius Monk, Georges Brassens, John Lennon, Jean-Paul Sartre, Simone de Beauvoir, Mae West, Kurt Gödel, Elvis Presley, Maria Callas, Charlie Chaplin, Benjamin Britten, Paul Robeson, Mao Zedong, Agatha Christie, Lotte Lehmann, Robert Stolz, Edward Kennedy, Pablo Picasso, Pablo Casals, Maurits Cornelis Escher, Ezra Pound, Jim Morrison, Louis Armstrong, Igor Stravinsky, Jimi Hendrix, Barnett Newman, Charles de Gaule, Judy Garland, Dwight David Eisenhower, Ho Tsji Minh, Martin Luther King, Robert Kennedy, Erneste Guevara, John William Coltrane,…
45 not found: Louis Paul Boon, Adriaan Roland Holst, Stijn Streuvels, Ernest Claes, Johannes XXIII, Dag Hammarskjöld, William Christopher Handy, Lucien Guitry, Antony Fokker, Pieter Jelles Troelstra, Paul van Ostaijen, Hugo Verriest,…
Testing recall
Het Kunst Boek: the first 200 (dead) persons, recall 0.84
167 found: Jaques-Laurent Agasse, Josef Albers, Allesandro Algardi, Washington Allston, Jacopo Amigoni, Fra Angelico, Antonello da Messina, Alexander Archipenko, Giuseppe Arcimboldo, Hendrick Avercamp, Francis Bacon, Giacomo Balla, Fra Bartolommeo, Jean-Michel Basquiat, Jacopo Bassano, Pompeo Batoni, Willi Baumeister, Frederic Bazille, Domenico Beccafumi, Max Beckmann, Gentille Bellini, Giovanni Bellini, Hans Bellmer, Gianlorenzo Bernini, Josef Beuys, Albert Bierstadt,…
33 not found: Andrea del Sarto, Sofonisba Anguissola, Jean Arp, John James Audubon, Hans Baldung, Andre Beauneveu, Bernardo Bellotto, George Bellows,…
Testing recall
The Science Book: the 156 (dead) persons, recall 0.70
109 found: Leon Battista Alberti, Nicolas Copernicus, Andreas Vesalius, Conrad Gesner, Tycho Brahe, William Gilbert, Johannes Kepler, Galileo Galilei, John Napier, William Harvey, Blaise Pascal, Pierre de Fermat, Christiaan Huygens, James Clerk Maxwell, Robert Boyle, Nicolaus Steno, Giovanni Domenico Cassini, Isaac Newton, Edmond Halley, Carolus Linnaeus, Lazzaro Spallanzani, Johan Heinrich Lambert, Joseph Priestley, Antoine Laurent Lavoisier, William Herschel, Henry Cavendish, James Hutton, Edward Jenner, Pierre-Simon Laplace, Georges Cuvier, Thomas Robert Malthus, Alexander von Humboldt, Allesandro Volta, Thomas Young,...
47 not found: Fibonacci, Piero della Francesca, Jeremiah Horrocks, Antoni van Leeuwenhoek, Rudolph Jacob Camerarius, George Hadley, Carl Wilhelm Scheele, James Hall, Joseph von Frauenhofer, William Smith,…
Testing precision
Counting false positives:
Range             Precision
4900 – 4999       0.90
9900 – 9999       0.88
14900 – 14999     0.96
19900 – 19999     0.97
Povijest Jugoslavije (1918 - 1991); Oeuvre Poetique (1925 - 1965); Alabama Wills (1808 – 1870); Black Tennesseans (1900 - 1930);
Nippon Porcelain (1891 - 1921); Personal Favorites (1977 - 1998); Wheeling Glass (1829 - 1939); Political Impact (1770 - 1814);
Movie Set (1959 - 1980); Transatlantic Dialogues (1775 - 1815); Sailing Navy (1775 - 1854); Home Children (1869 - 1930);
Peace Pilgrim (1908 - 1981); Briton Riviere (1840 - 1920); La Regle (1917 - 1947); Farm Tractors (1890 - 1960);
Western Warfare (1775 - 1882); Le Peintre (1877 - 1968); Exakta Cameras (1933 - 1978); Offene Briefe (1945 - 1968);
Portraitmatilde Muti (1862 - 1943); Nature Morte (1946 - 1993); Dessins Inconnus (1901 - 1954); Jacques Lacan-Seminaires (1952 - 1980);
Legendary Parties (1922 - 1972); Memory Joggers (1940 - 1989); Klondike Ho (1897 - 1997); Events From (1907 - 1977)
Estimated precision for first 5000: 0.90
Some observations
- Composers dominate the top for some centuries.
- Recently-died persons have relatively high score.
- Person names only consisting of one word, such as pseudonyms Voltaire, Caravaggio, and Nadar are not yet found.
- Likewise, names consisting of four or more words are not yet found, such as Joost van den Vondel.
- Also, persons that died as teenagers are not found, such as Jeanne d’Arc and Anne Frank.
- More advanced approximate pattern matching is required to better cluster the name variations of one person and potential errors in years.
Concluding remarks
- Enumeration search offers an interesting approach to find more-of-the-same, since it is generally applicable.
- The famous-persons case study indicates that simple techniques can already yield non-trivial results.
- Further research: extend the case study to also include information on nationality, profession, etc. of persons. Automatically search for biographic data.
- Other intended application domains: music and medical domain.
Fun Section
Election of ‘De Grootste Nederlander’: Vincent van Gogh
Fun Section
Persons who were born and died in the same years:
Sir Christopher Wren (1632 – 1723), Anthony van Leeuwenhoek (1632 – 1723)
Leo Tolstoy (1828 - 1910), Henri Dunant (1828 - 1910)
Edouard Manet (1832 - 1883), Gustave Dore (1832 - 1883)
JRR Tolkien (1892 – 1973), Pearl Buck (1892 – 1973)
Miles Davis (1926 – 1991), Klaus Kinski (1926 – 1991)