Automated Identification of Similar Health Questions

DESCRIPTION

Presentation to the AMIA 2013 Summit on Clinical Research Informatics, March 20, 2013

TRANSCRIPT

Page 1: Automated Identification of Similar Health Questions


Automated Identification of Similar Health Questions

Geoffrey W. Rutledge, MD, PhD, Chief Medical Information Officer, HealthTap.com

Introduction

People are increasingly looking online for physician answers to their health questions. Given the repetitive nature of common questions, there is high value in identifying previously answered questions that are semantically similar (or identical) to each new question, so that an answer can be given immediately, without waiting for a new answer from a physician.

Background

Previous methods for evaluating question pairs were based on sentence similarity [1,2] and are not well suited to consumer health questions, which contain many consumer-health variations and frequent misspellings of medical concepts. We developed a method to identify questions with “high semantic similarity” within a corpus of consumer health questions and answers, in which questions and answers are limited to 150 and 400 characters, respectively.

Method

We compare the text of each new question to the closest matching question in the Q&A corpus. For a set of 1,000 questions and their closest matches, we evaluated the sensitivity and specificity of alternative similarity criteria for asserting “high semantic similarity.” We first identified the most similar question within the Q&A corpus using a search engine augmented with a semantic-weight-driven ontology of consumer health concepts, which includes a rich set of synonyms of consumer health terms and their frequent misspellings.
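The poster gives no implementation details for the concept lookup itself. The Python sketch below is only an illustration of how a semantic-weight ontology with synonyms and common misspellings might drive concept matching; the lexicon entries, the numeric weights, and the names CONCEPT_LEXICON and extract_concepts are assumptions made for this sketch, not the actual HealthTap system.

```python
import re

# Hypothetical miniature of the consumer-health ontology: surface forms
# (including synonyms and frequent misspellings) mapped to a canonical
# concept and a semantic weight. Entries and weights are made up.
CONCEPT_LEXICON = {
    "heart attack": ("myocardial infarction", 0.9),
    "hart attack": ("myocardial infarction", 0.9),  # frequent misspelling
    "congestive heart failure": ("congestive heart failure", 0.9),
    "penicillin": ("penicillin", 0.9),
    "antibiotics": ("antibiotic", 0.5),
    "heart disease": ("heart disease", 0.5),
    "sharp pain": ("sharp pain", 0.5),
}


def extract_concepts(question: str) -> set:
    """Return the set of (concept, weight) pairs found in a question."""
    text = question.lower()
    found = set()
    for surface, (concept, weight) in CONCEPT_LEXICON.items():
        if re.search(r"\b" + re.escape(surface) + r"\b", text):
            found.add((concept, weight))
    return found


print(extract_concepts("Can I take penicillin after a hart attack?"))
# -> {('myocardial infarction', 0.9), ('penicillin', 0.9)}  (set order may vary)
```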

The three similarity criteria tested are:

1. Lexical identity after removal of all non-alphanumeric characters

2. Sum of semantic weights of all matching health concepts

3. Sum of semantic weights of only the moderate- or high-weight matching health concepts
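A minimal sketch of the three criteria, reusing the hypothetical extract_concepts helper from the sketch above; the function names and the weight threshold are again illustrative assumptions rather than the published implementation.

```python
import re


def lexical_identity(q1: str, q2: str) -> bool:
    """Criterion 1: identical text after removing all non-alphanumeric
    characters (comparison assumed case-insensitive here)."""
    def strip(s: str) -> str:
        return re.sub(r"[^0-9a-z]", "", s.lower())
    return strip(q1) == strip(q2)


def matching_weight_sum(q1: str, q2: str, min_weight: float = 0.0) -> float:
    """Criteria 2 and 3: sum of the semantic weights of the health concepts
    the two questions share. Criterion 2 counts all matching concepts
    (min_weight=0.0); criterion 3 counts only moderate- or high-weight
    concepts (e.g. min_weight=0.5 with the weights assumed above)."""
    shared = extract_concepts(q1) & extract_concepts(q2)
    return sum(weight for _, weight in shared if weight >= min_weight)
```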

Examples of medical concepts:
Moderate weights: antibiotics, heart disease, sharp pain
High weights: penicillin, congestive heart failure, squeezing chest pain

Results

We compared the three similarity criteria against an expert assessment of question-pair similarity. The sensitivities and specificities of the three criteria are:

(1) sensitivity 0.47, specificity 1.00
(2) sensitivity 0.61, specificity 0.99
(3) sensitivity 0.63, specificity 0.97

These values are plotted on the ROC chart of false positive rate versus true positive rate. The criterion with the best performance was criterion (2), the sum of semantic weights of all matching health concepts.
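The poster reports only the resulting numbers. The short sketch below shows how sensitivity and specificity would be computed for each criterion, under the assumption that each of the 1,000 question pairs carries a boolean expert judgment of high semantic similarity.

```python
def sensitivity_specificity(predicted, expert):
    """predicted, expert: parallel lists of booleans, one per question pair
    (True = asserted / judged to have high semantic similarity)."""
    tp = sum(p and e for p, e in zip(predicted, expert))
    fn = sum(not p and e for p, e in zip(predicted, expert))
    tn = sum(not p and not e for p, e in zip(predicted, expert))
    fp = sum(p and not e for p, e in zip(predicted, expert))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity
```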

Discussion

The problem of identifying semantically similar health questions is complicated by the variability of consumer health language and the difficulty consumers have in spelling medical terms. A comprehensive ontology and synonym set of consumer health terms enabled accurate detection of a large fraction of the semantically similar consumer health questions entered on an online health site.

The automated identification of similar consumer health questions is challenging because of the common occurrence of complex, colloquial, and often misspelled medical terms in consumer health questions. We collected online health questions and their paired "nearest search result" matching questions to evaluate three question-similarity metrics. The best-performing metric was based on the sum of semantic weights for all matching health concepts from a comprehensive ontology of consumer health terms and common misspellings, with a measured sensitivity of 0.61 and specificity of 0.99.

References

[1] The Evaluation of Sentence Similarity Measures. In: Song I-Y, Eder J, Nguyen TM (eds.), DaWaK 2008, LNCS 5182, pp. 305–316, 2008.
[2] Jeon J, Croft WB, Lee JH. Finding Similar Questions in Large Question and Answer Archives. CIKM ’05, October 31–November 5, 2005.

We are hiring [email protected]

 

[Figure: ROC chart of true positive rate versus false positive rate, with points marking similarity criteria (1), (2), and (3)]