Timo Honkela, Modeling Meaning and Knowledge, Spaces of Knowledge, 1.2.2016
Timo Honkela
Modeling Meaning and Knowledge
1 Feb 2016
Spaces of Knowledge
http://www.cs.cornell.edu/Info/Department/Annual95/Faculty/Salton.html
Advent of vector-based information retrieval
● Gerard Salton: Documents and queries represented as vectors of term counts
● Similarity between a document and a query is given by the cosine of the angle between the query vector and the document vector
● TF-IDF (term frequency-inverse document frequency) for weighting a term in a document
● Inverse document frequency was introduced by Karen Spärck Jones in 1972
https://en.wikipedia.org/wiki/Gerard_Salton
https://en.wikipedia.org/wiki/Karen_Sp%C3%A4rck_Jones
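The TF-IDF weighting mentioned above can be sketched as follows. This is a minimal illustration, assuming the common variant tf × log(N / df); the toy documents and the function name `tf_idf` are made up for the example and do not come from the talk.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    counts = [Counter(doc.lower().split()) for doc in docs]
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for c in counts:
        df.update(c.keys())
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in c.items()}
        for c in counts
    ]

# Hypothetical toy collection (not data from the talk).
docs = [
    "university research university society",
    "society and culture",
    "university teaching and society",
]
weights = tf_idf(docs)
for term, w in sorted(weights[0].items()):
    print(term, round(w, 3))
```

A term that occurs in every document ("society" here) gets weight zero, while a term that is specific to one document ("research") is weighted up.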
[Figure: documents (D1, D2, D3) and queries (Q1, Q2, Q3) drawn as vectors; axes: University, Society]
Document 1: The word “university” appears three times and “society” once, etc.
Query 1: “university”
https://en.wikipedia.org/wiki/Cosine_similarity
https://en.wikipedia.org/wiki/Sine
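As a worked example of the cosine measure, the sketch below uses only the counts stated for Document 1 ("university" three times, "society" once) and Query 1 ("university"). The two-axis ordering (university, society) and the function name `cosine` are assumptions made for the illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

# Axes: (university, society)
document_1 = (3, 1)
query_1 = (1, 0)

similarity = cosine(document_1, query_1)
print(round(similarity, 4))  # 3 / sqrt(10), about 0.9487
```

The value depends only on the direction of the vectors, so a longer document with the same proportions of terms would score the same.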
Contexts tell about meaning
● John Rupert Firth: “You shall know a word by the company it keeps”
● Ludwig Wittgenstein: “For a large class of cases of the employment of the word ‘meaning’—though not for all—this word can be explained in this way: the meaning of a word is its use in the language” (PI 43)
https://en.wikipedia.org/wiki/John_Rupert_Firth
http://plato.stanford.edu/entries/wittgenstein/#Mea
https://en.wikipedia.org/wiki/Ludwig_Wittgenstein
Analysis of term-document matrices
● The same idea as in information retrieval can also be applied in studying words and expressions
● Statistical analysis of term-document matrices gives rise to models of the relationships between words or between documents
● Classical examples include
– Latent Semantic Analysis (Deerwester, Dumais et al. 1988)
– Self-Organizing Semantic Maps (Ritter & Kohonen 1989)
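The idea of reading a term-document matrix "sideways" to compare words can be sketched as below. This is only the first step, under stated assumptions: the toy documents are invented, and neither the dimensionality reduction of Latent Semantic Analysis (a truncated singular value decomposition) nor the self-organizing map is shown.

```python
import math
from collections import Counter

# Hypothetical toy documents (not data from the talk).
docs = [
    "king queen castle king",
    "queen castle crown",
    "frog pond fish",
    "fish pond water frog",
]

counts = [Counter(d.split()) for d in docs]
vocab = sorted(set().union(*counts))

# Term-document matrix: one row per term, one column per document.
term_doc = {t: [c[t] for c in counts] for t in vocab}

def cosine(u, v):
    """Cosine similarity between two rows of the matrix."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(round(cosine(term_doc["king"], term_doc["queen"]), 3))
print(round(cosine(term_doc["frog"], term_doc["fish"]), 3))
print(round(cosine(term_doc["king"], term_doc["frog"]), 3))
```

Words that occur in the same documents end up with similar rows, which is the raw material the statistical models above work on.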
Word spaces, clusters, clouds, ...
● The analysis of the statistical information related to word contexts can be turned into visualizations of the word relations
Maps of words in Grimm fairy tales
Honkela, Pulkki & Kohonen 1995
Automated learning of word relations using self-organizing map on text context data
Map of Finnish Science
(T. Honkela & M. Klami 2007)
[Figure: map regions labeled Chemistry; Natural sciences and engineering; Bio- and environmental sciences; Health; Culture and society]
From term weighting to term selection
● TF-IDF is a widely used method for term weighting
● Likey (Language Independent Keyphrase Extraction) was developed to select terms automatically by comparing the corpus at hand with another corpus, called a reference corpus (Paukkeri et al. 2008, Paukkeri & Honkela 2010)
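One way to read the comparison with a reference corpus is as a ratio of frequency ranks: a word that ranks high in the corpus at hand but low (or not at all) in the reference corpus is a good candidate term. The sketch below is a simplified illustration of that reading, not the published Likey implementation. The function names, the toy token lists, the alphabetical tie-breaking, and the rank assigned to words missing from the reference corpus are all assumptions.

```python
from collections import Counter

def ranks(tokens):
    """Map each word to its frequency rank (1 = most frequent).

    Ties are broken alphabetically, a simplification made for this sketch.
    """
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i + 1 for i, w in enumerate(ordered)}

def rank_ratios(corpus_tokens, reference_tokens):
    """Rank in the corpus divided by rank in the reference corpus."""
    corpus_rank = ranks(corpus_tokens)
    ref_rank = ranks(reference_tokens)
    # Assumption: words absent from the reference get the worst rank + 1.
    unseen = len(ref_rank) + 1
    return {w: r / ref_rank.get(w, unseen) for w, r in corpus_rank.items()}

# Hypothetical toy corpora (not the Academy or Europarl data).
corpus = "the research of the university and the research".split()
reference = "the of and the we of the that and the".split()

ratios = rank_ratios(corpus, reference)
selected = sorted(ratios, key=ratios.get)[:2]  # smallest ratios first
print(selected)
```

Function words such as "the" rank high in both corpora and so get a ratio near 1, while content words typical of the corpus at hand get a small ratio.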
Most frequent word forms (types) in two corpora

      Academy corpus          Europarl corpus
Rank  Word        Count       Word   Count
1     the         1276847     the    2023617
2     of          1067918     of     945622
3     and         817852      to     883206
4     in          625330      and    717718
5     to          357453      in     611421
6     for         225307      that   473739
7     is          205723      a      445775
8     on          162509      is     445119
9     research    157251      we     305590
10    be          151475      for    296092
11    with        136854      i      290412
12    will        135992      this   286924
13    as          122707      on     274614
14    are         116508      it     251343
15    by          113878      be     246917
16    university  98003       are    197082
...   ...                     ...
[Figure: processing diagram with the labels Academy corpus, Reference corpus (EU Parliament), Likey, Term list, Documents, Terms, SOM, Document map]
Extralinguistic contexts
● Human beings learn language in real world contexts that include visual, tactile, etc. perceptions
● In order to model meaning in a human-like manner, these other modalities have to be taken into account
● In a project called “Multimodally Grounded Language Technology” we associated visual patterns of human movements with expressions that had been used to describe these movements
[Figure: movement patterns labeled RUNNING, WALKING, LIMPING, JOGGING]
Modeling subjectivity of meaning
● In our method Grounded Intersubjective Concept Analysis (GICA), we added a new “dimension” to the term-document matrices
● We did not assume that each person understands and uses every word in a similar manner but wanted to model the personal variation
● This was achieved by using Subject-Object-Context tensors (Honkela et al. 2012)
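The extra "dimension" can be pictured as a three-way array indexed by subject, object, and context. The sketch below is a minimal illustration under stated assumptions: the subjects, objects, contexts, and all ratings are invented, and unfolding the tensor into one row per (subject, object) pair is only one plausible way to prepare it for further analysis, not necessarily the exact procedure of Honkela et al. 2012.

```python
import math

# Hypothetical data: tensor[subject][object] is a list of ratings,
# one per context, on a 0..1 scale.
subjects = ["s1", "s2"]
objects = ["health", "wellbeing"]
contexts = ["exercise", "hospital", "nutrition"]

tensor = {
    "s1": {"health": [0.9, 0.2, 0.7], "wellbeing": [0.8, 0.1, 0.6]},
    "s2": {"health": [0.3, 0.9, 0.4], "wellbeing": [0.7, 0.2, 0.6]},
}

def unfold(tensor, subjects, objects):
    """Flatten the tensor into one context vector per (subject, object) pair."""
    return {(s, o): list(tensor[s][o]) for s in subjects for o in objects}

rows = unfold(tensor, subjects, objects)

# Compare how differently the two subjects use the same word ...
between_subjects = math.dist(rows[("s1", "health")], rows[("s2", "health")])
# ... with how differently one subject uses two different words.
within_subject = math.dist(rows[("s1", "health")], rows[("s1", "wellbeing")])

print(round(between_subjects, 3), round(within_subject, 3))
```

In this made-up data, the two subjects' uses of "health" differ more from each other than one subject's uses of "health" and "wellbeing" do, which is the kind of personal variation a plain term-document matrix cannot represent.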
GICA: Grounded Intersubjective Concept Analysis
Honkela, Raitio, Lagus & Nieminen 2012
Analysis of “health” in the State of the Union addresses
Subjects on objects in contexts: Using GICA method to quantify epistemological subjectivity. Timo Honkela, Juha Raitio, Krista Lagus, Ilari T. Nieminen, Nina Honkela, and Mika Pantzar. Proc. of IJCNN 2012.
Thank you for your attention!