proximity-based ranking of biomedical texts

Proximity-based Ranking of Biomedical

Texts

Rey-Long Liu* and Yi-Chih Huang*Dept. of Medical Informatics

Tzu Chi University

Taiwan

Outline

• Research background

• Problem definition

• The proposed approach: PRE

• Empirical evaluation

• Conclusion

A Proximity-based Ranker Enhancer 2

Research Background


Biomedical Information Need

• Biomedical research requires relevant evidences in the huge and ever-growing biomedical literature

• Retrieval of the evidences requires a system that – Accepts a natural language query for a biomedical

information need, and – Ranks relevant texts higher for access or processing


An Example Info Need

• Query: urinary tract infection, criteria for treatment and admission (from OHSUMED) – A disease as the target concept (i.e., urinary tract infection)

– Two concepts about the scenario of the information need (i.e., treatment and admission)

• Neither special nor related to any disease


Problem Definition


Goals• Explore how text rankers may be improved by

considering the completeness of query concepts appearing in a nearby area of the text being ranked

• Develop a technique PRE (Proximity-based Ranker Enhancer) that – Measures contextual completeness of query

concepts appearing in a nearby area in the text– Serves as a supplement to improve existing rankers


Related Work• Biomedical text ranking

– Using synonyms and considering diversity of passages, without considering term proximity

• Text ranking– Individual text scoring techniques (e.g., BM25)

and learning to rank techniques (e.g., Ranking SVM), without considering term proximity

• Improving ranking by term proximity– Term proximity is employed, but contextual

completeness was not consideredA Proximity-based Ranker Enhancer 8

The Proposed Approach: PRE


System Overview


Text Ranker Development

TrainingTesting

Underlying RankerPRE

Text Ranking TF in d

User

Query (q)

Text (d)

TF (Term Frequency) Assessment

Training Data

Ranked Texts

TF Assessment


• Three types of term proximity– Overall proximity (QTermTF)– Individual proximity (IndiP)– Collective proximity (CollP)

• A term t may get a large TF increment in d, if – Many query terms appear frequently in d– Query terms are individually near to t at some

places, and– Query terms collectively appear at a place near to t


•RTF(t,d,q) = TF(t,d)+TFincrement(t,d,q)•TFincrement(t,d,q) = QtermTF(d,q)IndiP(t,d,q)×CollP(t,d,q)•QtermTF(d,q) = Total TF of query terms in d•IndiP(t,d,q) =ΣmM －

{t}SigmoidWeight(Mindist(t,m))/ MaxIndiP•Mindist(x,y) = shortest distance between x and y in d•SigmoidWeight(dt) = 1/(1+e-((|q|-1)-dt))•CollP(t,d,q) = MaxkK{mM － {t}

SigmoidWeight(dist(t,k,m))}/MaxCollP, where K is the set positions at which t appears in d•dist(t,k,m) = Distance between t (at position k) and m

Empirical Evaluation


Experimental Data• OHSUMED

– A popular database of biomedical queries and references

– 106 queries– 348,566 references– 16,140 query-reference pairs

• Definitively relevant• Possibly relevant• Not relevant


Underlying Rankers


Baseline Ranker Enhancer

• Three state-of-the-art techniques that enhanced text rankers by term proximity– The t-function

• t() by [Tao & Zhai, 2007]

– The p-function • p() by [Cummins & O’Riordan, 2009]

– The proximity language model • PLM by [Zhao & Yun, 2009].


Evaluation Criteria

• Evaluating how relevant references are ranked higher for users to access– Mean average precision (MAP)– Normalized discount cumulative gain at x

(NDCG@X)


Results


Conclusion


• Term proximity may be comprehensively applied to improving various kinds of text rankers

• It is helpful to integrate three types of term proximity– Overall proximity– Individual proximity– Collective proximity

• Term proximity information may be encoded to re-assess TF of each term


proximity-based ranking of biomedical texts

Documents