Introduction to text mining
With a dive into structured and unstructured data
Sayali Kulkarni
October 23, 2010
Outline for today
I Quick refresher on data mining
I What is so special about text?
I Introduction to CSAW
I Annotation system
I Distributed indexing and retrieval system
I Future work
Data Mining I
I Data is useless if it does not make sense!
I Analyzing the data from different angles
I Important to know:
  I Data: What we get
  I Information: What we can use
  I Knowledge: How we use it
I Classes, clusters, association rules, patterns, sequences ...
Data Mining II
I Different kinds of data
  I Protein sequences
  I Genetic data
  I Network monitoring
  I Text data
  I Images/sound – multimedia data
I Different challenges in each case
I Scaling, noise, generalization, overfitting, incorporating domain knowledge
Text Mining I
I Sources
  I Textual data from the Web
  I Data collected within organizations
  I Survey data and feedback
I Representation
  I Using words as features
  I Data cleaning is a big task: spelling correction, stop word handling, stemming
  I The weight of a word depends on its importance in the document and its overall uniqueness
I Mining tasks
  I Summarization
  I Document clustering
  I Document labelling
  I Search
  I ...
Text Mining II
I Structure of the data is important
I Web data is diverse in nature
  I Completely unstructured data like news, blogs, mails, forums
  I Partly structured data like Wikipedia, PubMed, and other domain-specific encyclopedias and dictionaries
  I Data contained in the text in the form of lists and tables is much more structured
I Adding semantics to such data
I Linking the structured and unstructured data
I One of the major applications of this is semantic search
Search today: Impedance mismatch
Search Engine
Our vision of next-gen search
Search Engine
Curating and Searching the Annotated Web
CSAW search paradigm I
Data Model
I IR indexes – limited expressiveness
I Relational databases – intricate schema knowledge
I CSAW: IR index (unstructured) + annotation and catalog index (structured)
CSAW search paradigm II
Query Capabilities
Querying text with type annotations
Response
Tables of entities, quantities (a special type of entity), and text fields
High level block diagram
Figure: CSAW - high level block diagram
Annotation System
Figure: Annotation Engine in CSAW
Terminologies I
Figure: A plain page from unstructured data source
Terminologies II
Spots
Figure: A spot on a page
A spot is an occurrence of text on a page that can possibly be linked to a Wikipedia article.
Related notation:
S0  All candidate spots in a Web page
S ⊆ S0  An arbitrary set of spots
s ∈ S  One spot, including surrounding context
Terminologies III
Possible attachments
Figure: Possible attachments for a spot
Attachments are Wikipedia entities that can possibly be linked to a spot.
Related notation:
Γs  Candidate entity labels for spot s
Γ0 = ∪_{s∈S0} Γs, all candidate labels for the page
Γ ⊆ Γ0  An arbitrary set of entity labels
γ ∈ Γ  An entity label value, here, a Wikipedia entity
Entity Disambiguation
Figure: Disambiguation based on compatibility between spot and label
SemTag and Seeker [D+03] exploited this for entity disambiguation. It was the first Web-scale entity disambiguation system.
Collective Entity Disambiguation
Figure: Disambiguation based on local compatibility and topical coherence of spots
Example: Page with spots for Air Jordan, Michael Jordan, Chicago Bulls
I Cucerzan [Cuc07] was the first to recognize general interdependence between entity labels
I Work by Milne et al. [MW08] includes a limited form of collective disambiguation
Topical coherence based on entity catalog
Relatedness information from the entity catalog
I How related are two entities γ, γ′ in Wikipedia?
I Embed γ in some space using g : Γ → R^c
I Define relatedness r(γ, γ′) = g(γ) · g(γ′), or a related measure
I Cucerzan's proposal: c = number of categories; g(γ)[τ] = 1 if γ belongs to category τ, 0 otherwise; the length of g(γ) is c.

    r(γ, γ′) = g(γ)ᵀg(γ′) / ( √(g(γ)ᵀg(γ)) · √(g(γ′)ᵀg(γ′)) )

I Milne and Witten's proposal: c = number of Wikipedia pages; g(γ)[p] = 1 if page p links to page γ, 0 otherwise

    r(γ, γ′) = ( log |g(γ) ∩ g(γ′)| − log max{|g(γ)|, |g(γ′)|} ) / ( log c − log min{|g(γ)|, |g(γ′)|} )
Dataset for evaluation I
I Documents (IITB) crawled from popular sites
I Publicly available data from Cucerzan's experiments (CZ)

                              IITB      CZ
Number of documents            107      19
Total number of spots       17,200     288
Spots per 100 tokens            30    4.48
Average ambiguity per spot     5.3      18

Table: Corpus statistics.
Dataset for evaluation II
More on the IITB dataset
I Collected a total of about 19,000 annotations
I Done by 6 volunteers
I About 50 man-hours spent collecting the annotations
I Exhaustive tagging by volunteers
I About 40% of spots were labeled NA

#Spots tagged by more than one person   1390
#NA among these spots                    524
#Spots with disagreement                 278
#Spots with disagreement involving NA    218

Table: Inter-annotator agreement.
Human Supervision
I System identifies spots and mentions
I Shows pull-down list of (subset of) Γs for each s
I User selects γ∗ ∈ Γs ∪ {NA}
Our Approach
I Main contributions:
  I Refined node features (feature design)
  I Using inlink-based features for defining the coherence score (feature design)
  I Modified approach for collective inference (algorithm design)
Modeling local compatibility
I Feature vector fs(γ) ∈ R^d expresses local textual compatibility between (the context of) spot s and candidate label γ
I Components of fs(γ) are based on Wikipedia TFIDF vectors of:
  1. Snippet
  2. Full text
  3. Anchor text
  4. Anchor text with some tokens around it
and use the similarity measures:
  1. Dot product
  2. Cosine similarity
  3. Jaccard similarity
Sense probability prior
I What entity does “Intel” refer to?
  I Chip design and manufacturing company
  I Fictional cartel in a 1961 BBC TV serial
I Pr0(γ|s) is very high for the chip maker, low for the cartel
I Append the element log Pr0(γ|s) to fs(γ)
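The feature construction above can be sketched as follows. This is an illustrative sketch under my own assumptions: `tfidf_similarities` and `node_features` are hypothetical helper names, TFIDF vectors are sparse dicts, and the slide's four label representations (snippet, full text, anchor text, anchor with context) are passed in as a list.

```python
import math

def tfidf_similarities(vec_a, vec_b):
    """Dot product, cosine, and Jaccard similarity between sparse TFIDF dicts."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    cos = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
    keys_a, keys_b = set(vec_a), set(vec_b)
    jac = len(keys_a & keys_b) / len(keys_a | keys_b) if keys_a | keys_b else 0.0
    return [dot, cos, jac]

def node_features(context_vec, label_vecs, sense_prior):
    """Assemble f_s(γ): similarities of the spot context against each
    Wikipedia text representation of γ, plus log of the sense prior."""
    feats = []
    for vec in label_vecs:  # e.g. snippet, full text, anchor text, anchor+context
        feats.extend(tfidf_similarities(context_vec, vec))
    feats.append(math.log(sense_prior))
    return feats
```

With the four representations and three measures, this yields the 12 compatibility features plus the prior term.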
Components of the objective
Node score
I Node scoring model w ∈ R^d
I Node score defined as wᵀfs(γ)
I w is trained to give suitable weights to the different compatibility measures
I At test time, the greedy choice local to s would be arg max_{γ∈Γs} wᵀfs(γ)
Clique Score
I Use Milne’s relatedness formulation
Two-part objective to maximize
Node potential:

    NP(y) = ∏_s NPs(ys) = ∏_s exp( wᵀfs(ys) )

Clique potential:

    CP(y) = ∏_{s≠s′} exp( r(ys, ys′) ) = exp( ∑_{s≠s′} r(ys, ys′) )

After taking logs and rescaling terms:

    (1/|S0|) ∑_s wᵀfs(ys) + (1/C(|S0|,2)) ∑_{s≠s′} r(ys, ys′)
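The rescaled log-objective is straightforward to evaluate for a given assignment. A minimal sketch, with one assumption flagged: the sum here runs over unordered spot pairs and is divided by C(|S0|,2), so the clique term is the average pairwise relatedness; the slide's sum over s ≠ s′ counts ordered pairs, which differs only by a constant factor of 2.

```python
from math import comb

def objective(node_scores, labels, relatedness):
    """Average node score plus average pairwise relatedness.

    node_scores[s] : dict label -> w^T f_s(label)
    labels[s]      : label assigned to spot s
    relatedness    : symmetric function (label, label) -> float
    """
    n = len(labels)
    node_term = sum(node_scores[s][labels[s]] for s in range(n)) / n
    if n < 2:
        return node_term  # no pairs, no clique term
    clique = sum(relatedness(labels[s], labels[t])
                 for s in range(n) for t in range(s + 1, n))
    return node_term + clique / comb(n, 2)
```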
ILP formulation
I Cast as a 0/1 integer linear program
I Relax it to an LP
I Uses up to |Γ0| + |Γ0|² variables

Variables:

zsγ = [spot s is assigned label γ ∈ Γs]
uγγ′ = [both γ and γ′ are assigned to spots]
ILP formulation
Objective:

    max_{zsγ, uγγ′} (NP′) + (CP1′)

Node potential:

    (1/|S0|) ∑_{s∈S0} ∑_{γ∈Γs} zsγ wᵀfs(γ)   (NP′)

Clique potential:

    (1/C(|S0|,2)) ∑_{s≠s′∈S0} ∑_{γ∈Γs, γ′∈Γs′} uγγ′ r(γ, γ′)   (CP1′)

Subject to the constraints:

    ∀s, γ : zsγ ∈ {0, 1},  ∀γ, γ′ : uγγ′ ∈ {0, 1}   (1)
    ∀s, γ, γ′ : uγγ′ ≤ zsγ and uγγ′ ≤ zsγ′   (2)
    ∀s : ∑_γ zsγ = 1.   (3)
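For intuition about what the 0/1 program computes, tiny instances can be solved exactly by enumerating every joint assignment. This brute-force sketch is my own illustration (it is not the ILP solver used in CSAW, and is only feasible for a handful of spots); it maximizes the same two-term objective the program encodes.

```python
from itertools import product
from math import comb

def solve_by_enumeration(candidates, node_scores, relatedness):
    """Exact maximization of the collective objective over all joint
    label assignments.

    candidates[s]  : list of candidate labels for spot s
    node_scores[s] : dict label -> w^T f_s(label)
    relatedness    : symmetric function (label, label) -> float
    """
    n = len(candidates)
    best, best_val = None, float("-inf")
    for assignment in product(*candidates):
        node = sum(node_scores[s][assignment[s]] for s in range(n)) / n
        clique = 0.0
        if n > 1:
            clique = sum(relatedness(assignment[s], assignment[t])
                         for s in range(n)
                         for t in range(s + 1, n)) / comb(n, 2)
        if node + clique > best_val:
            best, best_val = assignment, node + clique
    return best, best_val
```

Even when node scores are tied, a coherent label pair wins through the clique term.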
LP relaxation for the ILP formulation
I Relax the constraints in the formulation as:

    ∀s, γ : 0 ≤ zsγ ≤ 1,  ∀γ, γ′ : 0 ≤ uγγ′ ≤ 1
    ∀s, γ, γ′ : uγγ′ ≤ zsγ and uγγ′ ≤ zsγ′
    ∀s : ∑_γ zsγ = 1.

I The margin between the objective of the relaxed LP and the rounded LP is quite thin

Figure: Total objective of LP1-relaxed vs. LP1-rounded across tuning parameter values
Hill climbing algorithm
I Initialization mechanisms
I Label updates
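The hill-climbing inference can be sketched as coordinate ascent. This is an illustrative sketch under my own assumptions (greedy local initialization, single-label updates accepted whenever they raise the joint objective); the CSAW system may use different initialization and update schedules.

```python
def hill_climb(candidates, node_scores, relatedness, max_iters=100):
    """Initialize each spot with its greedy local label, then repeatedly
    switch single labels while the joint objective improves."""
    n = len(candidates)

    def score(labels):
        node = sum(node_scores[s][labels[s]] for s in range(n)) / n
        pairs = n * (n - 1) // 2
        clique = sum(relatedness(labels[s], labels[t])
                     for s in range(n) for t in range(s + 1, n))
        return node + (clique / pairs if pairs else 0.0)

    # Initialization: greedy choice local to each spot
    labels = [max(cands, key=lambda g, s=s: node_scores[s][g])
              for s, cands in enumerate(candidates)]
    current = score(labels)
    for _ in range(max_iters):
        improved = False
        for s in range(n):  # label updates, one spot at a time
            for g in candidates[s]:
                trial = labels[:s] + [g] + labels[s + 1:]
                val = score(trial)
                if val > current + 1e-12:
                    labels, current, improved = trial, val, True
        if not improved:
            break  # local optimum reached
    return labels, current
```

Coordinate ascent can move away from the greedy initialization when coherence rewards outweigh small node-score losses.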
Backoff strategy I
I Allow backoff from tagging some spots
I Assign a special label “NA” to mark “no attachment”
I Reward a spot for attaching to NA: RNA
I Spots marked NA do not contribute to the clique potential
I The smaller the value of RNA, the more aggressive the tagging

How this affects our objective
N0 ⊆ S0 : spots assigned NA
A0 = S0 \ N0 : remaining spots
Final objective:

    max_y (1/|S0|) ( ∑_{s∈N0} RNA + ∑_{s∈A0} wᵀfs(ys) )   (NP)
          + (1/C(|A0|,2)) ∑_{s≠s′∈A0} r(ys, ys′)   (CP1)
Backoff strategy II
Issues
A0 depends on y, and hence the resulting optimization can no longer be written as an ILP.
Way around
I Treat NA as a label with zero topical coherence:

    r(NA, ·) = r(·, NA) = r(NA, NA) = 0

I Its contribution to NP is still equal to RNA

Modified objective:

    max_y (1/|S0|) ( ∑_{s∈N0} RNA + ∑_{s∈A0} wᵀfs(ys) )   (NP)
          + (1/C(|S0|,2)) ∑_{s≠s′∈A0} r(ys, ys′)   (CP1)
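The modified objective above can be sketched directly. A minimal illustration under my own assumptions: `backoff_objective` is a hypothetical name, NA spots earn the reward `r_na`, and the clique term stays normalized by C(|S0|,2) as in the modified objective, so the formulation remains ILP-expressible.

```python
from math import comb

NA = "NA"

def backoff_objective(labels, node_scores, relatedness, r_na):
    """Objective with backoff: NA spots earn r_na and are excluded
    from the clique term; normalization uses the full spot set."""
    n = len(labels)
    node = sum(r_na if labels[s] == NA else node_scores[s][labels[s]]
               for s in range(n)) / n
    attached = [s for s in range(n) if labels[s] != NA]  # A0
    clique = sum(relatedness(labels[s], labels[t])
                 for i, s in enumerate(attached) for t in attached[i + 1:])
    return node + (clique / comb(n, 2) if n > 1 else 0.0)
```

Raising `r_na` makes NA more attractive for spots with weak node scores, trading recall for precision.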
Multi-topic model
I The current clique potential encourages a single-cluster model
I The single-cluster hypothesis is not always true
I Partition the set of possible attachments as C = Γ1, . . . , ΓK
I Refined clique potential supporting the multi-topic model:

    (1/|C|) ∑_{Γk∈C} (1/C(|Γk|,2)) ∑_{s,s′: ys,ys′∈Γk} r(ys, ys′)   (CPK)

I Using C(|Γk|,2) instead of C(|S0|,2) rewards smaller coherent clusters
I The node score is not disturbed
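The CPK term can be sketched as follows. This is an illustrative sketch under my own assumptions: the partition is given as a list of label sets, and clusters whose C(|Γk|,2) normalizer is zero (single-label clusters) are simply skipped.

```python
from math import comb

def multitopic_clique_potential(labels, partition, relatedness):
    """CPK: average over clusters of the pairwise relatedness of spots
    whose labels fall in the cluster, each normalized by C(|Γ_k|, 2)."""
    total = 0.0
    for cluster in partition:
        members = [s for s, y in enumerate(labels) if y in cluster]
        denom = comb(len(cluster), 2)
        if denom == 0:
            continue  # single-label cluster contributes nothing
        pair_sum = sum(relatedness(labels[s], labels[t])
                       for i, s in enumerate(members) for t in members[i + 1:])
        total += pair_sum / denom
    return total / len(partition) if partition else 0.0
```

Because each cluster is normalized by its own pair count, a small tightly related cluster scores as well as a large one with the same average relatedness.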
System design of the annotation system
Evaluation of the annotation system
Evaluation measures:

Precision  Number of spots tagged correctly out of the total number of spots tagged
Recall     Number of spots tagged correctly out of the total number of spots in the ground truth
F1         2 × Recall × Precision / (Recall + Precision)
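The three measures above can be computed in a few lines; this sketch (the function name `evaluate` and the (spot, label)-pair representation are my assumptions) treats annotations as sets and guards the zero denominators.

```python
def evaluate(predicted, gold):
    """Precision over tagged spots, recall over ground-truth spots,
    and their harmonic mean F1. Annotations are (spot, label) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * recall * precision / (recall + precision)
          if recall + precision else 0.0)
    return precision, recall, f1
```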
Results summary
I Selection of NP features is important
I Collective inference adds value

Evaluation:

            Our system      CZ    Milne
Recall          70.7%   31.43%    66.1%
Precision       68.7%   53.41%   19.35%
F1             69.69%   39.57%   29.94%
Results summary I
Figure: Annotated page related to cricket
Results summary II
Figure: Annotated page related to finance
Query building blocks
Matcher: Word, phrase, or mention of a specific entity in the catalog, or a quantity
Target: Placeholder that the engine must instantiate, e.g., an entity of a given type or a quantity with a given unit (but possibly just an uninterpreted token sequence)
Context: Any token segment that contains specified matchers and instantiations of targets
Predicates: Constraints over targets and context, e.g., text proximity, membership of an entity in a category, containment of a quantity in a range, . . .
Aggregators: Collect evidence in favor of candidate target instantiations from multiple contexts
Query example: Category targets
Tabulate French films with the number of Academy Awards that each won
I ?f ∈+ Category:French Films,
I ?a ∈ Qtype:Number,
I InContext(?c; ?f, ?a, "academy awards", won)
I Evidence aggregator Consensus(?c)
I Resulting in an output table with two columns 〈?f, ?a〉

Tabulate physicists and the musical instruments they played
I ?p ∈+ Category:Physicist
I ?m ∈+ Category:Musical Instrument
I InContext(?c; ?p, ?m, played)
I Evidence aggregator Consensus(?c)
Subqueries and joins
I ?f ∈+ Category:French Film,
I ?a ∈ Qtype:Number, ?p ∈ Qtype:MoneyAmount,
I InContext(?c1; ?f, ?a, "academy awards", won),
I InContext(?c2; ?f, ?p, production cost, budget),
I Consensus(?c1, ?c2)
I Output 〈?f, ?a, ?p〉
I Note that the number of Academy Awards and the production cost may come from different Web pages
Distributed indexing and storage
I Hadoop used for distributed storage and processing
I Distributed index stored as Lucene posting lists
I Lucene payload carries additional data like annotation confidence and quantities
I Adapted Katta for distributed index retrieval
Distributed search
I The Local Ranking Engine (LRE) scores and ranks a document with respect to a user query
I The Query Consensus Engine (QCE) aggregates evidence from different pages
Distributed query processing
... hence completing the big picture of CSAW
Figure: Detailed CSAW system
Road ahead
Annotation system
I Extending collective inference beyond page-level boundaries
I Extending inference algorithms to multitopic models
I Associating confidence with annotations
Query system
I Enhancing the data model and query language
I Entity consensus algorithms
Others
I Ranking of entities in dropdown on the Annotation UI
I Alternative methods for storing annotations suitable for performing interesting mining tasks
Thank you all for your interest in this topic
References I
Somnath Banerjee, Soumen Chakrabarti, and Ganesh Ramakrishnan, Learning to rank for quantity consensus queries, SIGIR Conference, 2009.
S. Cucerzan, Large-scale named entity disambiguation based on Wikipedia data, EMNLP Conference, 2007, pp. 708–716.
S. Dill et al., SemTag and Seeker: Bootstrapping the semantic Web via automated semantic annotation, WWW Conference, 2003.
Michael I. Jordan (ed.), Learning in graphical models, MIT Press, 1999.
Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and Soumen Chakrabarti, Collective annotation of Wikipedia entities in Web text, SIGKDD Conference, 2009.
R. Mihalcea and A. Csomai, Wikify!: linking documents to encyclopedic knowledge, CIKM, 2007, pp. 233–242.
References II
Rada Mihalcea, Paul Tarau, and Elizabeth Figa, PageRank on semantic networks, with application to word sense disambiguation, COLING '04: Proceedings of the 20th International Conference on Computational Linguistics (Morristown, NJ, USA), Association for Computational Linguistics, 2004, p. 1126.
David Milne and Ian H. Witten, Learning to link with Wikipedia, CIKM, 2008.
Support slides
I Evaluation of our system in more detail
I CSAW search paradigm description
I Objective value comparison
I Scaling and performance measurement
I About Katta
I Comparison of Local, hill climbing, LP – training RNA
I Sample malformed dendrograms in category space
I Multi-topical model and dendrogram for the same
I Dendrogram-based algorithm
Effect of NP learning
Figure: F1 for single features (Wiki full page cosine, anchor text cosine, anchor text with context cosine, Wiki full page Jaccard) vs. all features with learned w; precision–recall curves for Local, Local+Prior, M&W, and Cucerzan

I Learning w is better than commonly used single features
I Enough to beat leave-one-out and anchor-based approaches
Benefits of collective annotation
Figure: Recall/Precision on the IITB dataset (Local, Local+prior, Hill1, Hill1+prior, LP1, LP1+prior) and on the CZ dataset (Local, Hill1, LP1, Milne, Cucerzan; F1 = 63% vs. F1 = 69%)

I Evaluated on two different data sets
I Can significantly push recall while preserving precision
Is our belief about the objective correct?
Figure: F1 versus objective (normalized) for doc1–doc6

I As the objective value increases, the F1 increases
I Validates our belief about the objective
Effect of tuning RNA I
Figure: F1 for Local, Hill1, and LP1 for different RNA values

I The best RNA for Local is smaller than the best RNA for Hill1 and LP1
Effect of tuning RNA II
Figure: Precision and Recall for Local, Hill1, and LP1 for different RNA values

I The smaller the value of RNA, the more aggressive the tagging
I Precision increases with increasing RNA
I Recall decreases with increasing RNA
CSAW search paradigm description
I Data model
  I Two extremes in currently available systems – IR systems and relational databases
  I Our goal: bridging the gap between the two
I Query capabilities
  I Two extremes in current systems – keyword queries and structured SQL-like queries
  I Our goal: allow composite representations and combine textual proximity with structured data (from some catalog)
I Response
  I Current search systems return URLs or highly structured data (as in SQL)
  I Our goal: return lists of entities, quantities (a special type of entity), or tables of entities and quantities
Objective value comparison for Local, hill climbing, LP
Figure: Total objective for Hill1, LP1-rounded, and LP1-relaxed for different RNA values
Scaling and performance measurement
Figure: Scaling of the annotation process with the number of spots being annotated

I Scaling is mildly quadratic w.r.t. |S0|
I Hill climbing takes about 2–3 seconds
I LP takes around 4–6 seconds
About Katta
I Salient features:
  I Scalable
  I Failure tolerant
  I Distributed
  I Indexed
  I Data storage
I Serves very large Lucene indexes as index shards on many servers
I Replicates shards on different servers for performance and fault tolerance
I Supports pluggable network topologies
I Master fail-over
I Plays well with Hadoop clusters
Comparison of Local, hill climbing, LP - training RNA
            Local   Hill1     LP1
no Prior   63.45%  64.87%  67.02%
+Prior     68.75%  67.46%  69.69%
Sample dendrograms I
Sample dendrograms II
Dendrogram with multitopic model
Multi-topical model
I The current clique potential encourages a single-cluster model
I The single-cluster hypothesis is not always true
I Refined clique potential supporting the multi-topic model:

    (1/|C|) ∑_{Γk∈C} (1/C(|Γk|,2)) ∑_{s,s′: ys,ys′∈Γk} r(ys, ys′)   (CPK)

I Using C(|Γk|,2) instead of C(|S0|,2) rewards smaller coherent clusters, as desired