sims 290-2: applied natural language processing

SIMS 290-2: Applied Natural Language Processing

Marti HearstOctober 25, 2004

Next Few Classes

This week: lexicons and ontologiesToday:

WordNet’s structure, computing term similarity

Wed: Guest lecture: Prof. Charles Fillmore on FrameNet

Next week: Enron labeling in classThe entire assignment will be due on Nov 15

Following week: Question-Answering

Text Categorization Assignment

Great job, you learned a lot!Comparing to a baselineSelecting featuresComparing relative usefulness of featuresTraining, testing, cross-validation

I learned a lot too! (from your results)(I’ll send you your feedback today)

FeaturesBoosting weights of terms in subject line is helpful.Stemming does help in some circumstances (often works well with SVM, for example), but not always.

– Counter-intuitively, stemming can increase the number of features in our implementation, because it increases how many terms pass the minimum-document-occurrence cutoff.

– An example of the porter stemmer not hiding differences when it otherwise would: converting gaseous to "gase" and so not conflating "gas" for fuel for motorcycles with "gaseous" for the science group.

FeaturesTerms with more than just the default alphabetical terms are helpful, maybe because in part getting the domain name information, but also because of getting technical terms.It's probably best to use the Weka feature selector to tell you what *kind* of features are performing well, but not to select those for use exclusively. I'm surprised that no one tried bigrams or noun-noun compounds as features.

Feature WeightingTf.idf: Almost everyone who tried it found it was raw term frequency (there were exceptions). Binary feature weights with document count minimum thresholds can be a good substitute.An interesting variation on tf.idf is to do it in a class-based manner.

– weight terms higher that only occur in one class vs. the others.

– A couple of students tried this and got good results on the diverse comparison, but less good on the homogenous. This makes sense since the measure would not help as much in distinguishing similar newsgroups that share many terms.

ClassifiersNaïve-Bayes Multinomial was a clear winnerSVM worked well most of the time, but not as well as NBMNaive Bayes seemed to be more robust to unseen information; the kernel estimator seems to improve the default Naive Bayes settings.VotedPerceptron worked very well, but only does binary classification so people who found it did very well on diverse did not transfer it to homogenous.

Lexicons, Semantic Nets and OntologiesThe Structure of WordNetComputing SimilaritiesAutomatic Acquisition of New Terms

Lexicons, Semantic Nets, and Ontologies

Lexicons are (typically) word lists augmented with some subset of:Parts-of-speechDifferent word sensesSynonyms

Semantic NetsInclude links to other termsIS-A, Part-Of, etc.Sometimes this term is used for what I call ontologies

OntologiesRepresent concepts and relationships among conceptsLanguage independent (in principle)Sometimes include inference rulesDifferent from definition in philosophy

– The science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality

10Adapted from slide by W. Ceusters, www.landc.be

One approach to linking ontologies and lexicons

Formal Domain Ontology

Lexicon

Grammar

Language ALanguage A

Lexicon

Grammar

LanguageLanguage BB

Cassandra Linguistic Ontology MEDRA

SNOMED

Others ...

Proprietary Terminologies

Example Ontological Relation Types

HAS-PARTIAL-SPATIAL-OVERLAP

IS-TOPO-

INSIDE-OF

IS-GEO-INSIDE-

IS-INSIDE-

CONVEX-HULL-OF

IS-PARTLY-IN-CONVEX-

HULL-OFIS-OUTSIDE-CONVEX-HULL-OF

HAS-DISCONNECTED-

REGION

HAS-EXTERNAL-

CONNECTING-REGION

HAS-DISCRETED-REGION

HAS-TANG.-SPAT.-PART

HAS-NON-TANG.-SPAT.-PART

IS-SPAT.-

EQUIV.-OF

IS-TANG.-SPAT.-PART-

IS-NON-TANG.-SPAT.-PART-

HAS-PROPER-SPATIAL

IS-PROPER-

SPAT.-PART-

HAS-SPATIAL

IS-SPATIAL-PART-

HAS-OVERLAPPING

-REGION

HAS-CONNECTING-

REGION

HAS-SPATIAL-POINT-

REFERENCE

Example of applying an ontology: joint anatomy

joint HAS-HOLE joint spacejoint capsule IS-OUTER-LAYER-OF jointmeniscus

IS-INCOMPLETE-FILLER-OF joint spaceIS-TOPO-INSIDE joint capsuleIS-NON-TANGENTIAL-MATERIAL-PART-OF joint

joint IS-CONNECTOR-OF bone XIS-CONNECTOR-OF bone Y

synoviaIS-INCOMPLETE-FILLER-OF joint space

synovial membrane IS-BONAFIDE-BOUNDARY-OF joint space

This doesn’t include the linguistic side

Linking Lexicons and Ontologies

Having a healthcare phenomenon

Generalised PossessionHealthcare phenomenonHuman

Has-possessor Has-

possessed

PatientIs-possessor-of

Patient at risk

IS-A Has-Healthcare-phenomenon

Risk Factor

IS-AIs-Risk-

Factor-Of

Patient at risk for osteoporosis

Risk factor for osteoporosis Osteoporosis

Has-Healthcare-phenomenon

Is-Risk-Factor-Of

IS-A IS-A IS-A

Linking different lexiconsMESH-2001 : “Seizures”

MESH-2001 : “Convulsions”

Snomed-RT : “Convulsion”

Snomed-RT : “Seizure”

L&C : ConvulsionL&C : Seizure

L&C : Health crisis

L&C : Epileptic convulsion

IS-AIS-A

IS-narrower-than ISA

Has-CCC

WordNet

A big lexicon with properties of a semantic netStarted as a language project by Dr George Miller and Dr. Christiane Fellbaum at PrincetonFirst became available in 1990Now on version 2.0

WordNet

Huge amounts of research (and products) use it

WordNet RelationsOriginal core relations:

SynonymyPolysemyMetonymyHyponymy/HyperonymyMeronymyAntonymy

New, useful additions for NLPGlossesLinks between derivationally and semantically related noun/verb pairs. Domain/topical termsGroups of similar verbs

Others on the wayDisambiguation of terms in glossesTopical clustering.

Synonymy

Different ways of expressing related conceptsExamples

cat, feline, Siamese cat

Synonyms are almost never truly substitutable:

Used in different contextsHave different implications

– This is a point of contention.

Polysemy

Most words have more than one senseHomonym: same word, different meaning

– bank (river)– bank (financial)

Polysemy: different senses of same word– That dog has floppy ears.– She has a good ear for jazz.– bank (financial) has several related senses

the building, the institution, the notion of where money is stored

Metonymy

Use one aspect of something to stand for the whole

The building stands for the institution of the bank.Newscast: “The White House released new figures today.”Waitperson: “The ham sandwich spilled his drink.”

Hyponymy/HyperonymyISA relationRelated to Superordinate and Subordinate level categories

hyponym(robin,bird)hyponym(bird,animal)hyponym(emu,bird)

A is a hypernym of B if B is a type of AA is a hyponym of B if A is a type of B

Meronymy

Parts-of relationpart of(beak, bird)part of(bark, tree)

Transitive conceptually but not lexically:The knob is a part of the door.The door is a part of the house.? The knob is a part of the house ?

Antonymy

Lexical oppositesantonym(large, small)antonym(big, small)antonym(big, little)but not large, little

Many antonymous relations can be reliably detected by looking for statistical correlations in large text collections. (Justeson &Katz 91)

Using WordNet in Pythonfrom wordnet import *from wntools import *

Using WordNet in Python

from wordnet import *from wntools import *

sims 290-2: applied natural language processing

Documents

1 sims 290-2: applied natural language processing marti...

sims discover and sims multiview

the sims 4 cheats, codes, unlockables - sims online

1 sims 290-2: applied natural language processing marti...

1 sims 290-2: applied natural language processing marti...

uniudijpam.uniud.it/online_issue/201839/25-jian tang... ·...

sims discover sims multiview chris sherwood. sims discover

1 sims 290-2: applied natural language processing marti...

1 sims 290-2: applied natural language processing barbara...

1 sims 290-2: applied natural language processing marti...

sims agora - capita sims online parent payments made simple...

1 sims 256: applied natural language processing marti hearst...

sims 290-2: applied natural language processing

sims discover sims multiview

1 sims 290-2: applied natural language processing marti...

1 sims 290-2: applied natural language processing barbara...

paris - cloudinary · molly sims paris. molly sims. molly...

1 sims 290-2: applied natural language processing marti...

senior information management system...

1 sims 290-2: applied natural language processing marti...