lecture 19 lexical networks
Post on 28-Jan-2016
Lecture 19
Lexical networks
Slides modified from Dragomir R. Radev
Social data
Blog postings
News stories
Speeches in Congress
Query logs
Movie and book reviews
Scientific papers
Financial reports
Encyclopedia entries
Email
Chat room discussions
Social networking sites
WHAT DO ALL OF THESE HAVE IN COMMON?
Natural language processing
Part of speech tagging
Prepositional phrase attachment
Parsing
Word sense disambiguation
Document indexing
Text summarization
Machine translation
Question answering
Information retrieval
Social network extraction
Topic modeling
Talk outline
Lexical networks
Semantic networks
Lexical centrality
Latent networks
Conclusion
Lexical networks
A special case of networks where nodes are words or documents and edges link semantically related nodes
Other examples:
Words used in dictionary definitions
Names of people mentioned in the same story
Words that translate to the same word
A semantic network consists of a set of nodes that are connected by labeled arcs.
The nodes represent concepts and the arcs represent relations between concepts.
Semantic network
M. Steyvers, J. B. Tenenbaum (2005). The large-scale structure of semantic networks: statistical analyses and a model of semantic growth. Cognitive Science, 29(1)
Free word associations
Dependency network
[Figure: dependency graph over the words Meredith, bought, green, apples, yesterday]
Dependency network
Semantic Networks
So again… A Semantic Network is…
A semantic (or associative) network is a simple representation scheme which uses a graph of labeled nodes and labeled, directed arcs to encode knowledge.
Labeled nodes: objects/classes/concepts
Labeled links: relations/associations between nodes
Labels define the semantics of nodes and links
Usually used to represent static, taxonomic, concept dictionaries
Nodes and Arcs
Nodes denote objects or classes; arcs define binary relationships between objects.
[Figure: semantic network with nodes john, Sue, Max, 5, 34 and arcs labeled mother, father, wife, husband, age]
mother(john,sue)
age(john,5)
wife(sue,max)
age(sue,34)
...
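The predicate-style facts on this slide map directly onto labeled, directed arcs. The following is a minimal Python sketch, assuming a (relation, subject, value) triple representation. The names `FACTS`, `query` and `arcs_from` are illustrative and not from the slides, and only the four facts that are spelled out are stored.

```python
# Each fact relation(subject, value) from the slide becomes one labeled arc.
# Only the four facts written out on the slide are included here.
FACTS = {
    ("mother", "john", "sue"),
    ("age", "john", 5),
    ("wife", "sue", "max"),
    ("age", "sue", 34),
}


def query(relation, subject):
    """Return every value v such that relation(subject, v) is stored."""
    return sorted((v for r, s, v in FACTS if r == relation and s == subject), key=str)


def arcs_from(node):
    """Return the labeled arcs leaving a node as (label, target) pairs."""
    return sorted(((r, v) for r, s, v in FACTS if s == node), key=str)


print(query("mother", "john"))
print(arcs_from("sue"))
```

Storing arcs as a set of triples keeps the node and link labels explicit, which is all a basic semantic network needs.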
Common Semantic Relations
There is no standard set of relations for semantic networks, but the following relations are very common:
INSTANCE: X is an INSTANCE of Y if X is a specific example of the general concept Y.
Example: Elvis is an INSTANCE of Human
ISA: X ISA Y if X is a subset of the more general concept Y.
Example: sparrow ISA bird
HASPART: X HASPART Y if the concept Y is a part of the concept X.
(Any other property can be represented by a relation in the same way.)
Example: sparrow HASPART tail
ISA hierarchy
The ISA (is a) or AKO (a kind of) relation is often used to link a class and its superclass.
And sometimes an instance and its class.
Some links (e.g. has-part) are inherited along ISA paths.
The semantics of a semantic net can be relatively informal or very formal; it is often defined at the implementation level.
[Figure: ISA hierarchy with nodes Red, Rusty, Robin, Bird, Animal, Wings and arcs labeled isa and hasPart]
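Inheritance along ISA paths can be sketched in a few lines of Python. The arc wiring below is an assumption: the figure residue only preserves the node and arc labels, so the choice of which node carries hasPart Wings is a guess, and the names `ISA`, `HAS_PART`, `ancestors` and `inherited_parts` are illustrative.

```python
# Assumed wiring: Red and Rusty are robins, robins are birds, birds are animals,
# and the hasPart arc is attached to Bird. The slide does not confirm this layout.
ISA = {"Red": "Robin", "Rusty": "Robin", "Robin": "Bird", "Bird": "Animal"}
HAS_PART = {"Bird": {"Wings"}}


def ancestors(node):
    """Follow ISA links upward and return the chain of superclasses."""
    chain = []
    while node in ISA:
        node = ISA[node]
        chain.append(node)
    return chain


def inherited_parts(node):
    """Collect hasPart values from the node and from every class above it."""
    parts = set(HAS_PART.get(node, set()))
    for cls in ancestors(node):
        parts |= HAS_PART.get(cls, set())
    return parts


print(ancestors("Red"))
print(inherited_parts("Rusty"))
```

Under this wiring an instance such as Rusty picks up Wings without the fact being stored on Rusty itself, which is the point of inheriting links along ISA paths.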
Inference by association
Red (a robin) is related to Air Force One by association (the directed paths originating from these two nodes join at the nodes Wings and Fly).
Bob and George are not related (no paths originating from them join in this network).
[Figure: semantic network with nodes Red, Rusty, Robin, Bird, Animal, Wings, Fly, Air Force One, Boeing 747, Airplane, Machine, Bob, George and arcs labeled isa, has-part, can-do, owner, passenger]
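Inference by association amounts to checking whether the directed paths leaving two nodes ever meet. The sketch below is a Python illustration under an assumed arc list: the figure's exact wiring is not recoverable from the transcript, so `ARCS` is a guess chosen to be consistent with the two statements on the slide, and `reachable` and `join_nodes` are illustrative names.

```python
from collections import deque

# Assumed arcs (source, label, target). The directions of the owner and
# passenger arcs in particular are a guess.
ARCS = [
    ("Red", "isa", "Robin"), ("Rusty", "isa", "Robin"),
    ("Robin", "isa", "Bird"), ("Bird", "isa", "Animal"),
    ("Bird", "has-part", "Wings"), ("Bird", "can-do", "Fly"),
    ("Air Force One", "isa", "Boeing 747"), ("Boeing 747", "isa", "Airplane"),
    ("Airplane", "isa", "Machine"),
    ("Airplane", "has-part", "Wings"), ("Airplane", "can-do", "Fly"),
    ("Air Force One", "owner", "Bob"), ("Air Force One", "passenger", "George"),
]


def reachable(start):
    """Nodes reachable from start by following arcs in their direction."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for src, _label, dst in ARCS:
            if src == node and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen


def join_nodes(a, b):
    """Nodes where directed paths starting at a and at b meet."""
    return reachable(a) & reachable(b)


print(sorted(join_nodes("Red", "Air Force One")))
print(sorted(join_nodes("Bob", "George")))
```

With this arc list the first query returns Fly and Wings, and the second returns nothing because neither Bob nor George has an outgoing arc.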
Frames – A Semantic Network with properties
A frame represents an entity as a set of slots (attributes) and associated values. Frames act and look much like objects in C++, and they are a more robust and compact version of a semantic network.
Each slot may have constraints that describe legal values that the slot can take.
A frame can represent a specific entity, or a general concept.
Frames are implicitly associated with one another because the value of a slot can be another frame.
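A frame with slots, slot constraints and frame-valued slots can be sketched as a small Python class. This is an illustration rather than a standard frame system: the class name `Frame`, the `parent` link used to look up inherited constraints and values, and the example `Person` frame are all assumptions.

```python
class Frame:
    """A named set of slots; constraints maps a slot name to a validity check."""

    def __init__(self, name, parent=None, constraints=None):
        self.name = name
        self.parent = parent
        self.constraints = constraints or {}
        self.slots = {}

    def constraint_for(self, slot):
        """Find the nearest constraint for a slot, looking up the parent chain."""
        frame = self
        while frame is not None:
            if slot in frame.constraints:
                return frame.constraints[slot]
            frame = frame.parent
        return None

    def set(self, slot, value):
        """Store a value after checking it against the slot's constraint."""
        check = self.constraint_for(slot)
        if check is not None and not check(value):
            raise ValueError(f"illegal value {value!r} for slot {slot!r}")
        self.slots[slot] = value

    def get(self, slot):
        """Return the slot value, falling back to parent frames."""
        frame = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.parent
        raise KeyError(slot)


# A general concept (Person) and two specific entities.
person = Frame("Person", constraints={"age": lambda v: isinstance(v, int) and v >= 0})
sue = Frame("Sue", parent=person)
sue.set("age", 34)
john = Frame("John", parent=person)
john.set("age", 5)
john.set("mother", sue)  # the slot value is itself a frame

print(john.get("mother").get("age"))
```

Because the value of the mother slot is another frame, following slots from John reaches Sue's own slots, which is how frames become implicitly linked.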
Semantic Networks
Rules are appropriate for some types of knowledge, but do not easily map to others.
Semantic nets can easily represent inheritance and exceptions, but are not well-suited for representing negation, disjunction, preferences, conditionals, and cause/effect relationships.
Frames allow arbitrary functions (demons) and typed inheritance. Implementation is a bit more cumbersome.
Lexical Centrality
LexRank – Centrality in Text Graphs
Vertices
Units of text (sentences or documents)
Edges
Pairwise similarity between text units
Intuition
LexRank score is propagated through edges
Central vertices are those that are similar to other central vertices
Recurrence Relation
Can guarantee a solution by allowing a “jump” probability d/N.
[Figure: example text graph with edge similarity weights 0.5, 0.3, 0.8, 0.2, 0.1, 0.3, 0.9, 0.2, 0.4]
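The recurrence can be solved by power iteration. The sketch below assumes the form used in the LexRank paper: each vertex receives d/N plus (1 - d) times the scores of its neighbors, weighted by edge similarity and normalized by each neighbor's total similarity. The function name `lexrank`, the default d = 0.15 and the 4-node matrix are illustrative; the matrix is a toy and does not reproduce the figure's wiring.

```python
def lexrank(sim, d=0.15, tol=1e-10, max_iter=1000):
    """Power iteration on a symmetric similarity matrix with a zero diagonal.

    sim is a list of lists of non-negative weights; d is the jump probability.
    """
    n = len(sim)
    col_sum = [sum(sim[z][v] for z in range(n)) for v in range(n)]
    p = [1.0 / n] * n
    for _ in range(max_iter):
        new = []
        for u in range(n):
            flow = sum(
                sim[u][v] / col_sum[v] * p[v] for v in range(n) if col_sum[v] > 0
            )
            new.append(d / n + (1 - d) * flow)
        total = sum(new)
        new = [x / total for x in new]  # renormalize in case of isolated vertices
        if max(abs(a - b) for a, b in zip(new, p)) < tol:
            return new
        p = new
    return p


# Hypothetical 4-sentence graph, not the one drawn on the slide.
toy_sim = [
    [0.0, 0.5, 0.3, 0.0],
    [0.5, 0.0, 0.8, 0.2],
    [0.3, 0.8, 0.0, 0.1],
    [0.0, 0.2, 0.1, 0.0],
]
scores = lexrank(toy_sim)
print(scores)
```

The jump term makes the iteration a contraction, so it converges from any starting distribution, which is the guarantee the slide refers to.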
NLP and network analysis
... , sagte der Sprecher bei der Sitzung .
... , rief der Vorsitzende in der Sitzung .
... , warf in die Tasche aus der Ecke .
C1: sagte, warf, rief
C2: Sprecher, Vorsitzende, Tasche
C3: in
C4: der, die
Part of speech tagging [Biemann 2006]
Word sense disambiguation [Mihalcea et al 2004]
Document indexing [Mihalcea et al 2004]
Subjectivity analysis [Pang and Lee 2004]
Semantic class induction [Widdows and Dorow 2002]
Passage retrieval [Otterbacher, Erkan, Radev 05]
[Figure: graph linking a query Q to passages, with edges labeled relevance and inter-similarity]
MavenRank – Centrality in Speech Graphs
Vertices
Speech transcripts from a given topic
Edges
tf-idf cosine similarity (with threshold)
Hypothesis
Key speakers will have speeches with high centrality.
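Building the speech graph from tf-idf cosine similarity with a threshold can be sketched as follows. The idf variant (log of N over document frequency), the whitespace tokenization, the threshold value 0.1 and the three toy speeches are all assumptions made for illustration, not details given on the slide.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def similarity_graph(docs, threshold=0.1):
    """Return {(i, j): similarity} for document pairs at or above the threshold."""
    vecs = tfidf_vectors(docs)
    edges = {}
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            s = cosine(vecs[i], vecs[j])
            if s >= threshold:
                edges[(i, j)] = s
    return edges


# Hypothetical speeches, tokenized on whitespace.
speeches = [
    "the budget bill cuts taxes".split(),
    "the budget bill raises taxes".split(),
    "the weather is nice today".split(),
]
edges = similarity_graph(speeches)
print(edges)
```

A word that occurs in every speech gets idf 0, so the third speech, which shares only "the" with the others, ends up with no edges.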
MavenRank: Example
[Figure: graph of eight speeches (nodes 1 to 8), grouped into Speaker 1 speeches, Speaker 2 speeches, and Speaker 3 speeches]
Speech Scores
Speech Score
1 0.13
2 0.13
3 0.10
4 0.19
5 0.10
6 0.14
7 0.08
8 0.13
Speaker Scores (mean speech score)
Speaker Score
1 0.12
2 0.15
3 0.12
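A speaker score is the mean of that speaker's speech scores. The sketch below uses the eight speech scores from this slide. The assignment of speeches to speakers is hypothetical, since the transcript does not preserve which speech belongs to whom; the grouping was picked because its means land near the reported speaker scores.

```python
from statistics import mean

# Speech scores as listed on the slide.
speech_scores = {1: 0.13, 2: 0.13, 3: 0.10, 4: 0.19, 5: 0.10, 6: 0.14, 7: 0.08, 8: 0.13}

# Hypothetical grouping of speeches by speaker; not recoverable from the transcript.
speeches_by_speaker = {"A": [1, 2, 3], "B": [4, 5], "C": [6, 7, 8]}


def speaker_scores(scores, by_speaker):
    """Average the centrality scores of each speaker's speeches."""
    return {spk: mean(scores[s] for s in ids) for spk, ids in by_speaker.items()}


result = speaker_scores(speech_scores, speeches_by_speaker)
for speaker, score in result.items():
    print(speaker, score)
```

Averaging rather than summing keeps a speaker with many low-scoring speeches from outranking one with a few highly central speeches.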
GIN: Gene Interaction Network
Motivation:
Biomedical literature is growing rapidly. Manually curated databases cover only a small portion of the available information.
Most protein interaction information remains uncurated in biomedical articles.
Approach: text mining and network analysis for
Automatic extraction of molecule interactions
Automatic article summarization
Interaction and citation networks
Inferring gene-disease associations
Feature Extraction from Dependency Trees
Path1: KaiC – nsubj – interacts – obj – SasA
Path2: KaiC – nsubj – interacts – obj – SasA – conj_and – KaiA
Path3: KaiC – nsubj – interacts – obj – SasA – conj_and – KaiB
Path4: SasA – conj_and – KaiA
Path5: SasA – conj_and – KaiB
Path6: KaiA – prep_with – SasA – conj_and – KaiB
“The results demonstrated that KaiC interacts rhythmically with KaiA, KaiB, and SasA.”
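Path features of this kind can be produced by a shortest-path search over the dependency tree. The sketch below is a Python illustration that assumes the tree consists of the four edges implied by Paths 1 to 5; the prep_with edge in Path 6 is not modeled. `EDGES`, `build_graph` and `dependency_path` are illustrative names.

```python
from collections import deque

# Dependency edges (head, label, dependent) read off Paths 1 to 5.
EDGES = [
    ("interacts", "nsubj", "KaiC"),
    ("interacts", "obj", "SasA"),
    ("SasA", "conj_and", "KaiA"),
    ("SasA", "conj_and", "KaiB"),
]


def build_graph(edges):
    """Treat the tree as undirected so paths can run through the head word."""
    graph = {}
    for head, label, dep in edges:
        graph.setdefault(head, []).append((label, dep))
        graph.setdefault(dep, []).append((label, head))
    return graph


def dependency_path(graph, start, goal):
    """Shortest path as alternating words and edge labels, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for label, nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [label, nxt])
    return None


graph = build_graph(EDGES)
print(" - ".join(dependency_path(graph, "KaiC", "KaiA")))
```

Each path between a pair of protein names then serves as one feature for the interaction classifier.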
Inferring Genes Related to Prostate Cancer
Hypothesis:
Genes that interact with many genes known to be related to prostate cancer are likely to be related to prostate cancer.
Approach:
Extract the interaction network of genes (seed genes) that are known to be related to prostate cancer automatically from the literature
Infer new genes related to prostate cancer from the network topology
Use eigenvector centrality to rank gene-prostate cancer associations
Hypothesis restatement: Genes central in the constructed network are most probably related to prostate cancer.
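Ranking genes by centrality in the extracted network can be sketched with a plain power iteration. The network below is a hypothetical toy with placeholder node names, not data from the study, and the function name and iteration count are assumptions.

```python
import math


def eigenvector_centrality(adj, iters=200):
    """adj maps each node to the set of its neighbors (connected, undirected)."""
    nodes = sorted(adj)
    score = {v: 1.0 for v in nodes}
    for _ in range(iters):
        # Iterate with A + I rather than A so that bipartite graphs do not
        # oscillate; both matrices share the same leading eigenvector.
        new = {v: score[v] + sum(score[u] for u in adj[v]) for v in nodes}
        norm = math.sqrt(sum(x * x for x in new.values()))
        score = {v: x / norm for v, x in new.items()}
    return score


# Hypothetical interaction edges between seed genes and candidate genes.
toy_edges = [
    ("seed1", "geneX"), ("seed2", "geneX"), ("seed3", "geneX"),
    ("seed3", "geneY"), ("seed1", "seed2"),
]
adj = {}
for a, b in toy_edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

centrality = eigenvector_centrality(adj)
ranking = sorted(centrality, key=centrality.get, reverse=True)
print(ranking)
```

In this toy network the candidate that interacts with all three seeds ranks first, while the one attached to a single seed ranks last, matching the hypothesis stated above.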
Approach
Corpus: PMCOA (PubMed Central Open Access), full-text articles
Articles in PMCOA are split into sentences, and the sentences are tagged with GeniaTagger
Compile a seed list of genes known to be related to prostate cancer
20 genes compiled from the OMIM (Online Mendelian Inheritance in Man) database
Extend the seed gene list with synonyms from the HGNC (HUGO Gene Nomenclature Committee) database
Use the automatic interaction extraction pipeline to extract the interaction network of the seed genes and their neighbors (genes interacting with the seed genes).
Seed Genes
Gene Description
AR androgen receptor (dihydrotestosterone receptor; testicular feminization; spinal and bulbar muscular atrophy; Kennedy disease)
BRCA2 breast cancer 2, early onset
MSR1 macrophage scavenger receptor 1
EPHB2 EPH receptor B2
KLF6 Kruppel-like factor 6
MAD1L1 MAD1 mitotic arrest deficient-like 1 (yeast)
TUSC3 tumor suppressor candidate 3
HIP1 huntingtin interacting protein 1
CBX8 chromobox homolog 8 (Pc class homolog, Drosophila)
CD82 CD82 molecule
ZFHX3 zinc finger homeobox 3
ELAC2 elaC homolog 2 (E. coli)
MXI1 MAX interactor 1
PTEN phosphatase and tensin homolog (mutated in multiple advanced cancers 1)
RNASEL ribonuclease L (2',5'-oligoisoadenylate synthetase-dependent)
HPC1 hereditary prostate cancer 1
CHEK2 CHK2 checkpoint homolog (S. pombe)
HPCX hereditary prostate cancer, X-linked
PCAP predisposing for prostate cancer
PRCA1 prostate cancer 1
20 genes that are reported in OMIM to be related to prostate cancer
Interactions of the seed genes (gene names normalized to their HGNC symbols)
Sample Extracted Interaction Sentences
A study by Jin et al. [20] indicated that the association of Tax with hsMAD1, a mitotic spindle checkpoint (MSC) protein, led to the translocation of both MAD1 and MAD2 to the cytoplasm.
PTEN is transcriptionally regulated by transcription factors such as p53, Egr-1, NFκB and SMADs, while protein levels and activity are modulated by phosphorylation, oxidation, subcellular localisation, phospholipid binding and protein stability [29].
Interestingly, one of these, HPC1, is linked to RNASEL [10,11].
In response to DNA damage, the cell-cycle checkpoint kinase CHEK2 can be activated by ATM kinase to phosphorylate p53 and BRCA1, which are involved in cell-cycle control, apoptosis, and DNA repair [1,2].
The interactions of RAD51 with TP53, RPA and the BRC repeats of BRCA2 are relatively well understood (see Discussion).
The interaction of BRCA2 with HsRad51 is significantly more different to both RadA and RecA (Figure 2c).
Max interactor protein, MXI1 (gene L07648) competes for MAX thus negatively regulates MYC function and may play a role in insulin resistance.
Mad2 binds to Cdc20, an activator of the anaphase-promoting complex (APC), to inhibit APC activity and arrest cells in metaphase in response to checkpoint activation.
Inferred Genes (evaluation of top-20 scoring genes)
6 are seed genes; 14 genes are inferred to be related to prostate cancer (check the GeneGo Pathway database; if no evidence there, check the PubMed literature)
9 genes: marked as being related to prostate cancer by the GeneGo Pathway database
1 gene: evidence found in PubMed that the gene is related to prostate cancer
4 genes: no evidence found
Gene Description Evidence
TP53 tumor protein p53 (Li-Fraumeni syndrome) GeneGo
BRCA1 breast cancer 1, early onset GeneGo
EREG epiregulin no
AKT1 v-akt murine thymoma viral oncogene homolog 1 GeneGo
MAPK1 mitogen-activated protein kinase 1 no
TNF tumor necrosis factor (TNF superfamily, member 2) GeneGo
CCND1 cyclin D1 GeneGo
MYC v-myc myelocytomatosis viral oncogene homolog (avian) GeneGo
APC adenomatosis polyposis coli PubMed
CDKN1B cyclin-dependent kinase inhibitor 1B (p27, Kip1) GeneGo
MAPK8 mitogen-activated protein kinase 8 GeneGo
NR3C1 nuclear receptor subfamily 3, group C, member 1 (glucocorticoid receptor) no
VEGFA vascular endothelial growth factor A GeneGo
MDM2 mouse double minute 2, human homolog of; p53-binding protein no
Other networks
Diabetes Type I
Diabetes Type II
Bipolar Disorder
Properties of lexical networks
Dependency network
Random network
Analyzing networks
Properties of networks
Clustering coefficient
Watts/Strogatz cc = #triangles/#triples
Power law coefficient
Diameter (longest shortest path)
Average shortest path (ASP)
Properties of nodes
Centrality: degree, closeness, betweenness, eigenvector
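Several of these network properties can be computed with breadth-first search and neighbor counting. The sketch below is a Python illustration on a hypothetical 4-node graph. It computes the Watts/Strogatz coefficient as the average over nodes of the fraction of neighbor pairs that are linked (the per-node form of the triangles-to-triples ratio), and it assumes a connected, undirected graph.

```python
from collections import deque
from itertools import combinations


def local_clustering(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = adj[v]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (len(nbrs) * (len(nbrs) - 1))


def watts_strogatz_cc(adj):
    """Average of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


def bfs_distances(adj, source):
    """Shortest path length from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def diameter_and_asp(adj):
    """Longest and average shortest path; assumes the graph is connected."""
    lengths = [d for s in adj for t, d in bfs_distances(adj, s).items() if t != s]
    return max(lengths), sum(lengths) / len(lengths)


# Hypothetical graph: a triangle a-b-c with a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(watts_strogatz_cc(toy))
print(diameter_and_asp(toy))
```

Running one breadth-first search per node is adequate for a graph of a few thousand nodes, such as the dependency network discussed in this lecture.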
Types of networks
Regular networks
Uniform degree distribution
Random networks
Memoryless
Poisson degree distribution: $\langle k \rangle = pN$, $P(k) = e^{-\langle k \rangle} \langle k \rangle^{k} / k!$
Characteristic value
Low clustering coefficient
Large ASP
Small world networks
High transitivity
Presence of hubs (memory)
High clustering coefficient (e.g., 1000 times higher than random)
Small ASP
Power law degree distribution: $P(k) \sim k^{-\gamma}$, where $\gamma$ is the power law coefficient (typical value between 2 and 3)
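The practical difference between the two degree distributions shows up in the tail, that is, in how likely high-degree hubs are. The sketch below compares a Poisson distribution with a power law truncated to degrees 1 through 100. The mean degree 5, the exponent 2.2, the cutoff 100 and the function names are illustrative assumptions.

```python
import math


def poisson_pmf(k, mean_degree):
    """Probability of degree k in a random (Poisson) network."""
    return math.exp(-mean_degree) * mean_degree ** k / math.factorial(k)


def power_law_pmf(k, exponent, k_max):
    """Power law truncated to degrees 1..k_max and normalized to sum to 1."""
    norm = sum(j ** -exponent for j in range(1, k_max + 1))
    return k ** -exponent / norm


def tail(pmf, k_min, k_max):
    """Total probability of degrees from k_min to k_max inclusive."""
    return sum(pmf(k) for k in range(k_min, k_max + 1))


K_MAX = 100
poisson_tail = tail(lambda k: poisson_pmf(k, 5.0), 50, K_MAX)
power_tail = tail(lambda k: power_law_pmf(k, 2.2, K_MAX), 50, K_MAX)
print(poisson_tail, power_tail)
```

Under the Poisson model a node of degree 50 or more is vanishingly unlikely, while under the power law it has small but non-negligible probability, which is why hubs are a signature of the second kind of network.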
Comparing the dependency graph to a random (Poisson) graph
Random Actual
n 5563 5584
M 14440 14472
Diameter 21 13
ASP 8.788 4.01
W/S cc 0.00062 0.092
Power law coefficient n/a 2.1947
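A few derived quantities make the comparison easier to read. The sketch below takes the node counts, edge counts and clustering coefficients from the table and computes the mean degree, the edge probability of a random graph of the same size, and the ratio of the actual clustering coefficient to the random one. The function and variable names are illustrative.

```python
def mean_degree(n, m):
    """Average degree of an undirected graph with n nodes and m edges."""
    return 2.0 * m / n


def edge_probability(n, m):
    """Fraction of the n(n-1)/2 possible edges that are present."""
    return m / (n * (n - 1) / 2.0)


# Values copied from the table above.
random_graph = {"n": 5563, "M": 14440, "cc": 0.00062}
actual_graph = {"n": 5584, "M": 14472, "cc": 0.092}

cc_ratio = actual_graph["cc"] / random_graph["cc"]
print(mean_degree(actual_graph["n"], actual_graph["M"]))
print(edge_probability(random_graph["n"], random_graph["M"]))
print(cc_ratio)
```

Both graphs have a mean degree a little above 5, so the two-orders-of-magnitude gap in clustering comes from how the edges are arranged rather than from how many there are.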
Properties of lexical networks
Entries in a thesaurus [Motter et al. 2002]
c/c0 = 260 (n=30,000)
Co-occurrence networks [Dorogovtsev and Mendes 2001, Sole and Ferrer i Cancho 2001]
c/c0 = 1,000 (n=400,000)
Mental lexicon [Vitevitch 2005]
c/c0 = 278 (n=19,340)
[Figure: word network fragment with nodes letter, actor, character, nature, universe, world]
[Figure: syntactic dependency degree distribution (log-log scale)]