SIMS 290-2: Applied Natural Language Processing
Posted on 14-Jan-2016
Next Few Classes
This week: lexicons and ontologies
Today: WordNet's structure; computing term similarity
Wed: guest lecture by Prof. Charles Fillmore on FrameNet
Next week: Enron labeling in class; the entire assignment will be due on Nov 15
Following week: Question-Answering
Text Categorization Assignment
Great job, you learned a lot!
– Comparing to a baseline
– Selecting features
– Comparing relative usefulness of features
– Training, testing, cross-validation
I learned a lot too (from your results)! I'll send you your feedback today.
Text Categorization Assignment
Features
– Boosting the weights of terms in the subject line is helpful.
– Stemming does help in some circumstances (it often works well with SVM, for example), but not always.
– Counter-intuitively, stemming can increase the number of features in our implementation, because it increases how many terms pass the minimum-document-occurrence cutoff.
– An example of the Porter stemmer preserving a distinction it might otherwise have hidden: it converts "gaseous" to "gase", and so does not conflate "gas" (fuel, in the motorcycles group) with "gaseous" (in the science group).
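The interaction between stemming and the minimum-document-occurrence cutoff can be sketched in a few lines of pure Python. The stem table here is a hypothetical stand-in for a real stemmer such as Porter's, and the tiny corpus is invented for illustration:

```python
from collections import defaultdict

# Hypothetical stem table standing in for a real stemmer (e.g., Porter's).
STEMS = {"run": "run", "runs": "run", "running": "run",
         "gas": "gas", "gaseous": "gase"}

def doc_freq(docs, stem=False):
    """Count, for each term, how many documents it occurs in."""
    df = defaultdict(int)
    for doc in docs:
        terms = {STEMS.get(w, w) if stem else w for w in doc}
        for t in terms:
            df[t] += 1
    return df

docs = [["run"], ["runs"], ["running", "gas"], ["gaseous"]]

min_df = 2
raw = [t for t, n in doc_freq(docs).items() if n >= min_df]
stemmed = [t for t, n in doc_freq(docs, stem=True).items() if n >= min_df]
print(raw)      # no unstemmed term reaches the cutoff
print(stemmed)  # "run" now occurs in 3 documents and passes the cutoff
```

Note that "gas" and "gaseous" stay distinct even after stemming, mirroring the gas/gaseous example above, while the run/runs/running variants merge and survive the cutoff together.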
Text Categorization Assignment
Features
– Features beyond just the default alphabetic terms are helpful, perhaps in part because they capture domain-name information, but also because they capture technical terms.
– It's probably best to use the Weka feature selector to tell you what *kind* of features are performing well, but not to select those features for exclusive use.
– I'm surprised that no one tried bigrams or noun-noun compounds as features.
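Since no one tried them, here is a minimal sketch of what bigram features look like (pure Python; the whitespace tokenizer and example subject line are deliberately naive stand-ins for a real preprocessing pipeline):

```python
def bigrams(tokens):
    """Pair each token with its successor to form bigram features."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

subject = "ride my motor cycle".split()
features = subject + bigrams(subject)  # unigrams plus bigrams
print(features)
```

A bigram such as "motor_cycle" can separate newsgroups that the unigrams "motor" and "cycle" alone would confuse, which is why they might have helped on the homogeneous comparison.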
Text Categorization Assignment
Feature Weighting
– Tf.idf: almost everyone who tried it found it worked better than raw term frequency (there were exceptions). Binary feature weights with minimum document-count thresholds can be a good substitute.
– An interesting variation on tf.idf is to compute it in a class-based manner:
  – weight terms higher that occur in only one class vs. the others.
  – A couple of students tried this and got good results on the diverse comparison, but less good results on the homogeneous one. This makes sense, since the measure would not help as much in distinguishing similar newsgroups that share many terms.
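One way to realize the class-based idea is an idf computed over classes instead of documents: a term gets a high weight when it occurs in few classes. This is a sketch under that assumption, not the exact scheme the students used; the counts are invented:

```python
import math
from collections import Counter

def class_based_idf(class_term_counts):
    """Weight each term by how concentrated it is in a single class:
    score = log(#classes / #classes containing the term)."""
    n_classes = len(class_term_counts)
    class_occurrence = Counter()
    for counts in class_term_counts.values():
        class_occurrence.update(set(counts))
    return {t: math.log(n_classes / k) for t, k in class_occurrence.items()}

counts = {
    "motorcycles": {"bike": 10, "engine": 4, "the": 50},
    "sci.space":   {"orbit": 7, "engine": 3, "the": 60},
}
weights = class_based_idf(counts)
print(weights)  # class-specific terms score log(2); shared terms score 0
```

With similar newsgroups most terms are shared across classes and score near zero, which matches the observation that the measure helps less on the homogeneous comparison.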
Text Categorization Assignment
Classifiers
– Naïve Bayes Multinomial was a clear winner.
– SVM worked well most of the time, but not as well as NBM.
– Naive Bayes seemed to be more robust to unseen information; the kernel estimator seems to improve on the default Naive Bayes settings.
– VotedPerceptron worked very well, but it only does binary classification, so people who found it did very well on the diverse set did not transfer it to the homogeneous one.
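The winning model is simple enough to sketch directly. This is a minimal multinomial Naive Bayes with add-one smoothing in pure Python (not Weka's implementation, and with an invented toy corpus):

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""
    def fit(self, docs, labels):
        self.vocab = {w for d in docs for w in d}
        self.term_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for d, y in zip(docs, labels):
            self.term_counts[y].update(d)
        return self

    def predict(self, doc):
        def log_post(y):
            # log prior plus smoothed log likelihood of each token
            total = sum(self.term_counts[y].values()) + len(self.vocab)
            lp = math.log(self.class_counts[y] / sum(self.class_counts.values()))
            for w in doc:
                lp += math.log((self.term_counts[y][w] + 1) / total)
            return lp
        return max(self.class_counts, key=log_post)

docs = [["engine", "bike"], ["ride", "bike"],
        ["orbit", "launch"], ["launch", "engine"]]
labels = ["moto", "moto", "space", "space"]
nb = MultinomialNB().fit(docs, labels)
print(nb.predict(["bike", "ride"]))   # → moto
```

The smoothing is one reason for the robustness to unseen information noted above: a word never seen in a class contributes a small but finite probability instead of zeroing out the whole class.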
Today
Lexicons, Semantic Nets and Ontologies
The Structure of WordNet
Computing Similarities
Automatic Acquisition of New Terms
Lexicons, Semantic Nets, and Ontologies
Lexicons are (typically) word lists augmented with some subset of:
– parts of speech
– different word senses
– synonyms
Semantic nets:
– include links to other terms (IS-A, Part-Of, etc.)
– sometimes this term is used for what I call ontologies
Ontologies:
– represent concepts and relationships among concepts
– language-independent (in principle)
– sometimes include inference rules
– different from the definition in philosophy: "the science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality"
Adapted from slide by W. Ceusters, www.landc.be
One approach to linking ontologies and lexicons
[Diagram: a formal domain ontology is linked through a linguistic ontology (Cassandra) to a lexicon and grammar for each of Language A and Language B, and to proprietary terminologies such as MedDRA, ICD, SNOMED, ICPC, and others.]
Example Ontological Relation Types
[Diagram (adapted from W. Ceusters, www.landc.be): a family of spatial relation types, including HAS-SPATIAL-PART / IS-SPATIAL-PART-OF, HAS-PROPER-SPATIAL-PART / IS-PROPER-SPATIAL-PART-OF, tangential and non-tangential spatial parts, IS-SPATIALLY-EQUIVALENT-OF, HAS-PARTIAL-SPATIAL-OVERLAP, HAS-OVERLAPPING-REGION, HAS-CONNECTING-REGION, HAS-EXTERNAL-CONNECTING-REGION, HAS-DISCONNECTED-REGION, HAS-DISCRETE-REGION, IS-TOPO-INSIDE-OF, IS-GEO-INSIDE-OF, IS-INSIDE-CONVEX-HULL-OF, IS-PARTLY-IN-CONVEX-HULL-OF, IS-OUTSIDE-CONVEX-HULL-OF, and HAS-SPATIAL-POINT-REFERENCE.]
Example of applying an ontology: joint anatomy (adapted from W. Ceusters, www.landc.be)
joint HAS-HOLE joint space
joint capsule IS-OUTER-LAYER-OF joint
meniscus IS-INCOMPLETE-FILLER-OF joint space
meniscus IS-TOPO-INSIDE joint capsule
meniscus IS-NON-TANGENTIAL-MATERIAL-PART-OF joint
joint IS-CONNECTOR-OF bone X
joint IS-CONNECTOR-OF bone Y
synovia IS-INCOMPLETE-FILLER-OF joint space
synovial membrane IS-BONAFIDE-BOUNDARY-OF joint space
(This doesn't include the linguistic side.)
Linking Lexicons and Ontologies
[Diagram (adapted from W. Ceusters, www.landc.be): "Patient" IS-A "Human" and Is-possessor-of a "Healthcare phenomenon", via a "Generalised Possession" with Has-possessor and Has-possessed roles; "Patient at risk" IS-A "Patient" and Has-Healthcare-phenomenon a "Risk Factor"; "Patient at risk for osteoporosis" IS-A "Patient at risk" and Has-Healthcare-phenomenon a "Risk factor for osteoporosis", which Is-Risk-Factor-Of "Osteoporosis".]
Linking different lexicons
[Diagram (adapted from W. Ceusters, www.landc.be): MESH-2001 terms "Seizures" and "Convulsions" and Snomed-RT terms "Seizure" and "Convulsion" are linked via Has-CCC relations to the L&C concepts Seizure and Convulsion, which are related by IS-A and IS-narrower-than links to the L&C concepts "Health crisis" and "Epileptic convulsion".]
WordNet
A big lexicon with properties of a semantic net
Started as a language project by Dr. George Miller and Dr. Christiane Fellbaum at Princeton
First became available in 1990
Now on version 2.0
WordNet Relations
Original core relations:
– Synonymy
– Polysemy
– Metonymy
– Hyponymy/Hypernymy
– Meronymy
– Antonymy
New, useful additions for NLP:
– Glosses
– Links between derivationally and semantically related noun/verb pairs
– Domain/topical terms
– Groups of similar verbs
Others on the way:
– Disambiguation of terms in glosses
– Topical clustering
Synonymy
Different ways of expressing related concepts
Examples: cat, feline, Siamese cat
Synonyms are almost never truly substitutable:
– they are used in different contexts
– they have different implications
– this is a point of contention
Polysemy
Most words have more than one sense.
Homonymy: same word form, different meanings:
– bank (river)
– bank (financial)
Polysemy: different but related senses of the same word:
– That dog has floppy ears. / She has a good ear for jazz.
– bank (financial) has several related senses: the building, the institution, the notion of where money is stored
Metonymy
Use one aspect of something to stand for the whole:
– The building stands for the institution of the bank.
– Newscast: "The White House released new figures today."
– Waitperson: "The ham sandwich spilled his drink."
Hyponymy/Hypernymy
ISA relation
Related to superordinate- and subordinate-level categories
hyponym(robin, bird)
hyponym(bird, animal)
hyponym(emu, bird)
A is a hypernym of B if B is a type of A.
A is a hyponym of B if A is a type of B.
Meronymy
Part-of relation:
part-of(beak, bird)
part-of(bark, tree)
Transitive conceptually, but not lexically:
The knob is a part of the door.
The door is a part of the house.
? The knob is a part of the house ?
Antonymy
Lexical opposites:
antonym(large, small)
antonym(big, small)
antonym(big, little)
but not: large, little
Many antonymous relations can be reliably detected by looking for statistical correlations in large text collections (Justeson & Katz, 1991).
Using WordNet to Determine Similarity
The "meet" function in the Python WordNet tool finds the closest common parent (lowest common subsumer) of two terms.
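The idea behind "meet" can be sketched over a hand-built toy taxonomy (pure Python; this is an illustration of the concept, not the actual WordNet tool, and the tiny hierarchy is invented):

```python
# Toy is-a taxonomy: child -> parent (a hand-built stand-in for WordNet).
PARENT = {"robin": "bird", "emu": "bird", "bird": "animal",
          "cat": "animal", "animal": "entity"}

def ancestors(c):
    """Return c plus all of its ancestors, nearest first."""
    chain = [c]
    while c in PARENT:
        c = PARENT[c]
        chain.append(c)
    return chain

def meet(c1, c2):
    """Closest common parent (lowest common subsumer) of two concepts."""
    anc2 = set(ancestors(c2))
    return next(a for a in ancestors(c1) if a in anc2)

print(meet("robin", "emu"))   # → bird
print(meet("robin", "cat"))   # → animal
```

Walking up from one concept and stopping at the first ancestor shared with the other is exactly the "closest common parent" the similarity measures below build on.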
Similarity by Path Length
Count the edges (is-a links) between two concepts and scale.
Leacock and Chodorow, 1998:
lch(c1,c2) = -log( length(c1,c2) / (2 * max-depth) )
Wu and Palmer, 1994:
wup(c1,c2) = 2 * depth(lcs(c1,c2)) / (depth(c1) + depth(c2))
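Both measures can be sketched against the same kind of toy taxonomy (pure Python; the hierarchy is invented, the root is given depth 1, and path length is counted in nodes rather than edges here, a convention that varies between implementations):

```python
import math

# Toy is-a taxonomy: child -> parent.
PARENT = {"robin": "bird", "emu": "bird", "bird": "animal", "cat": "animal"}

def path_to_root(c):
    chain = [c]
    while c in PARENT:
        c = PARENT[c]
        chain.append(c)
    return chain

def depth(c):
    return len(path_to_root(c))            # the root has depth 1

def lcs(c1, c2):
    anc2 = set(path_to_root(c2))
    return next(a for a in path_to_root(c1) if a in anc2)

def edge_length(c1, c2):
    a = lcs(c1, c2)
    return (depth(c1) - depth(a)) + (depth(c2) - depth(a))

def lch(c1, c2, max_depth=3):
    # count path length in nodes (edges + 1) so identical
    # concepts don't produce log(0)
    return -math.log((edge_length(c1, c2) + 1) / (2 * max_depth))

def wup(c1, c2):
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))

print(lch("robin", "emu"), wup("robin", "emu"))
```

On this toy tree, robin and emu meet at bird (two edges apart), so wup gives 2·2/(3+3) = 2/3 and lch gives -log(3/6) = log 2.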
Problems with Path Length
The lengths of the paths are irregular across the hierarchies.
Words that should be in the same hierarchy might not be.
How can we relate terms that are not in the same hierarchies?
The "tennis problem": player, racquet, ball, and net are all in separate hierarchies.
WordNet is working on developing such linkages.
Similarity by Information Content
IC is estimated from a corpus of text (Resnik, 1995):
IC(concept) = -log(P(concept))
– specific concept: high IC (pitchfork)
– general concept: low IC (instrument)
To estimate it:
– Count occurrences of each concept: given a word, increment the count of all concepts associated with that word.
  – increment bank as financial institution and also as river shore
  – assume that senses occur uniformly, lacking evidence to the contrary (e.g., sense-tagged text)
– Counts propagate up the hierarchy.
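The counting scheme above can be sketched directly (pure Python; the tiny sense inventory, hierarchy, and corpus are invented for illustration):

```python
import math
from collections import Counter

# Invented toy hierarchy and sense inventory.
PARENT = {"riverbank": "land", "bank-inst": "institution",
          "land": "entity", "institution": "entity"}
SENSES = {"bank": ["riverbank", "bank-inst"], "shore": ["riverbank"]}

def ic_table(corpus):
    counts = Counter()
    total = 0
    for word in corpus:
        senses = SENSES[word]
        for s in senses:
            share = 1 / len(senses)   # uniform over senses, lacking tagged text
            c = s
            while True:               # propagate the count up the hierarchy
                counts[c] += share
                if c not in PARENT:
                    break
                c = PARENT[c]
        total += 1
    return {c: -math.log(n / total) for c, n in counts.items()}

ic = ic_table(["bank", "bank", "shore", "bank"])
print(ic["entity"], ic["riverbank"])   # the root has the lowest IC
```

Every occurrence of "bank" adds half a count to each of its two senses, and every count flows up to the root, so general concepts like "entity" end up with high probability and low IC, exactly the pitchfork-vs-instrument pattern above.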
Information Content as Similarity
Resnik, 1995: res(c1,c2) = IC(lcs(c1,c2))
Jiang and Conrath, 1997: jcn(c1,c2) = 1 / (IC(c1) + IC(c2) - 2*res(c1,c2))
Lin, 1998: lin(c1,c2) = 2*res(c1,c2) / (IC(c1) + IC(c2))
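Given an IC table and a lowest-common-subsumer function, the three measures are one-liners. This sketch uses made-up IC values and an invented toy hierarchy, purely for illustration:

```python
# Made-up IC values and toy hierarchy, for illustration only.
IC = {"entity": 0.0, "animal": 1.2, "bird": 2.5, "robin": 4.0, "emu": 4.1}
PARENT = {"robin": "bird", "emu": "bird", "bird": "animal", "animal": "entity"}

def lcs(c1, c2):
    anc = [c1]
    while anc[-1] in PARENT:
        anc.append(PARENT[anc[-1]])
    c = c2
    while c not in anc:
        c = PARENT[c]
    return c

def res(c1, c2):                 # Resnik: IC of the lcs
    return IC[lcs(c1, c2)]

def jcn(c1, c2):                 # Jiang-Conrath: inverse of the IC distance
    return 1 / (IC[c1] + IC[c2] - 2 * res(c1, c2))

def lin(c1, c2):                 # Lin: ratio of shared IC to total IC
    return 2 * res(c1, c2) / (IC[c1] + IC[c2])

print(res("robin", "emu"), lin("robin", "emu"))
```

Note how the three measures reuse the same quantity: res is the shared information itself, lin normalizes it by the total, and jcn inverts the leftover (unshared) information.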
All of these (and more!) are implemented in a Perl package by Pedersen et al., WordNet::Similarity: http://wn-similarity.sourceforge.net/
Rearranging WordNet
Try to fix the top-level hierarchies.
Parse the glosses for more information: the eXtended WordNet project, http://xwn.hlt.utdallas.edu/
Acquisition using the Web
Towards Terascale Knowledge Acquisition, Pantel and Lin, 2004
Use a co-occurrence model and a huge collection (the Web) to find similar terms.
Input: a cluster of related words.
Feature vectors are computed for each word:
– contexts such as "catch ___"
– compute mutual information between the word and the context
"Average" the features for each class to create a grammatical template for the class.
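The word-context scoring step can be sketched with pointwise mutual information over toy co-occurrence counts (pure Python; the counts and contexts are invented, not Pantel and Lin's data):

```python
import math

# Invented (word, context) co-occurrence observations.
pairs = ([("ball", "catch ___")] * 8 + [("cold", "catch ___")] * 2 +
         [("ball", "kick ___")] * 5 + [("idea", "have ___")] * 5)

def pmi(word, context, pairs):
    """pmi(w,c) = log( P(w,c) / (P(w) * P(c)) )"""
    n = len(pairs)
    p_wc = pairs.count((word, context)) / n
    p_w = sum(1 for w, _ in pairs if w == word) / n
    p_c = sum(1 for _, c in pairs if c == context) / n
    return math.log(p_wc / (p_w * p_c))

print(pmi("ball", "catch ___", pairs))
```

A context like "catch ___" gets a high PMI with words it co-occurs with more often than chance, so the PMI-weighted feature vectors capture which grammatical contexts characterize each word.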
Acquisition using the Web
Use this template to find new examples of this class of terms (but it makes many errors).