Exploiting NLP for Digital Disease Informatics

University of Warwick, October 15th 2015

Nigel Collier, Language Technology Lab, Department of Theoretical and Applied Linguistics

Really understanding natural language is the next grand challenge

• High throughput methods have transformed biomedicine into a data-rich science

• All genes in a genome, all proteins in a proteome, all transcripts in a cell, all metabolic processes in a tissue…




• A significant portion of human health data is ‘messy data’ existing only as unstructured text

• Biomedical literature, Clinical trials data, Lab notebooks, Clinical records, Diagnostic reports, News reports, Social media messages

• Represents the most contextually grounded, high precision information about an individual’s health, attitudes and behaviours




• Natural language processing (NLP) is a cornerstone technology to translate ‘messy data’ into structured forms that are systematically encoded, e.g. SNOMED-CT, ICD.

Experience from personal research

(1) Global infectious disease alerting and mapping

(2) Extracting a database of phenotype terms

(3) Understanding the voice of the patient

(4) Chemical cancer risk assessment

(5) Critical hypothesis generation from literature

Typical workflow from text to knowledge

raw text document

sentence segmentation

tokenization

lexical featurisation

entity recognition

trigger detection

relation extraction

event extraction

entity grounding

knowledge objects

syntactic parsing
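The first stages of this workflow can be sketched in a few lines of Python. This is a minimal illustration only, not the BioCaster or PhenoMiner implementation: the regular expressions, the toy `DISEASE_LEXICON` and all function names are assumptions invented for the example, and a dictionary lookup stands in for a trained entity recogniser.

```python
import re

# Hypothetical toy lexicon; a real system would use an ontology or a trained model.
DISEASE_LEXICON = {"anthrax", "influenza", "cholera"}

def segment_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Naive tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def recognise_entities(tokens):
    """Dictionary lookup standing in for entity recognition."""
    return [(tok, "DISEASE") for tok in tokens if tok.lower() in DISEASE_LEXICON]

def text_to_entities(text):
    """Run segmentation, tokenization and entity recognition over raw text."""
    results = []
    for sentence in segment_sentences(text):
        tokens = tokenize(sentence)
        results.append((sentence, recognise_entities(tokens)))
    return results

for sentence, entities in text_to_entities(
    "Anthrax was reported in Marrakech. No travel was involved."
):
    print(sentence, entities)
```

Later stages (trigger detection, relation and event extraction, grounding) would consume these entity annotations; they are omitted here because the slides do not describe their internals.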

Broad Research Objectives

• Extrinsic: Robust data collection from across health-related text types: literature, patient records, news, social media (public health alerts, developing disease profiles, etc.)

• Intrinsic: Understand how NLP/ML/Ontology techniques perform and can be improved in operational settings

BIOCASTER: GLOBAL INFECTIOUS DISEASE ALERTING AND MAPPING

Case study #1

[5] Collier, N. et al. (2008). BioCaster: detecting public health rumors with a Web-based text mining system. Bioinformatics, 24(24), 2940-2941.

[6] Collier, N., et al. (2011). OMG U got flu? Analysis of shared health messages for bio-surveillance. J. Biomedical Semantics, 2(S-5), S9.

[7] Hay, S. I., et al. (2013). Global mapping of infectious disease. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 368(1614), 20120250.

Infectious diseases spread rapidly

“We live in a world where threats to health arise from the speed and volume of air travel, the way we produce and trade food, the way we use and misuse antibiotics, and the way we manage the environment…” - Dr. Margaret Chan, DG WHO

SARS, 2003: HK, world

H5N1 flu, 2003-: PRC, Thailand, ROC, Vietnam

Foot & mouth, 2001: United Kingdom

Ebola, 2014-: Guinea, Liberia, Sierra Leone, Nigeria

Trend graphs

Event summaries

Event alerts

Ontology browsing, Email/GeoRSS alerting, Watchboard, etc.

Real time Twitter analysis

Up to date news in 12 languages

Event database search

GHSI partners

US, UK, FR, DE

WHO, IT, JP, CA

Digital epidemic surveillance with BioCaster

Example frame

<SLOT name="HAS_DISEASE" type="DISEASE" content="Anthrax" alt="" root_term="Anthrax" bid=""/>
<SLOT name="HAS_LOCATION.COUNTRY" type="LOCATION" content="Morocco" alt="" root_term="Morocco" bid=""/>
<SLOT name="HAS_LOCATION.PROVINCE" type="LOCATION" content="Marrakech" alt="" root_term="" bid=""/>
<SLOT name="HAS_AGENT" type="micro_organism" content="Bacillus anthracis" alt="" root_term="" bid=""/>
<SLOT name="HAS_SPECIES" type="animal" content="human" alt="" root_term="" bid=""/>
<SLOT name="TIME.relative" type="string" content=""/>
<SLOT name="INTERNATIONAL_TRAVEL" type="Boolean" content="false"/>
<SLOT name="DELIBERATE_RELEASE" type="Boolean" content="false"/>
<SLOT name="ZOONOSIS" type="Boolean" content="false"/>
<SLOT name="DRUG_RESISTANCE" type="Boolean" content="false"/>
<SLOT name="FOOD_CONTAMINATION" type="Boolean" content="false"/>
<SLOT name="HOSPITAL_WORKER" type="Boolean" content="false"/>
<SLOT name="FARM_WORKER" type="Boolean" content="false"/>
<SLOT name="MALFORMED_PRODUCT" type="Boolean" content="false"/>
<SLOT name="NEW_TYPE_AGENT" type="Boolean" content="false"/>
<SLOT name="SERVICE_DISRUPTION" type="Boolean" content="false"/>
<SLOT name="CATEGORY_A" type="Boolean" content="true"></EVENT>
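A frame like this can be turned into a simple dictionary for downstream alerting. The sketch below is an assumption about how one might read such a fragment, not BioCaster's own parser: `parse_slots` and the two patterns are hypothetical names, and regular expressions are used instead of an XML parser because the fragment shown has no opening EVENT tag and so is not well-formed XML on its own.

```python
import re

# Matches one SLOT element, self-closed or not, and captures its attribute text.
SLOT_PATTERN = re.compile(r'<SLOT\s+([^>]*?)/?>')
# Matches name="value" attribute pairs.
ATTR_PATTERN = re.compile(r'(\w+)="([^"]*)"')

def parse_slots(frame_text):
    """Return {slot name: attribute dict} for every SLOT element in the text."""
    slots = {}
    for match in SLOT_PATTERN.finditer(frame_text):
        attrs = dict(ATTR_PATTERN.findall(match.group(1)))
        slots[attrs.get("name", "")] = attrs
    return slots

# Shortened copy of the example frame.
frame = (
    '<SLOT name="HAS_DISEASE" type="DISEASE" content="Anthrax" alt="" root_term="Anthrax" bid=""/> '
    '<SLOT name="HAS_LOCATION.COUNTRY" type="LOCATION" content="Morocco" alt="" root_term="Morocco" bid=""/> '
    '<SLOT name="ZOONOSIS" type="Boolean" content="false"/> '
    '<SLOT name="CATEGORY_A" type="Boolean" content="true">'
)

slots = parse_slots(frame)
# Collect the Boolean slots as Python booleans.
flags = {
    name: attrs["content"] == "true"
    for name, attrs in slots.items()
    if attrs.get("type") == "Boolean"
}
print(slots["HAS_DISEASE"]["content"], slots["HAS_LOCATION.COUNTRY"]["content"], flags)
```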

Technical challenges

X0,000 news providers

REAL TIME SCALING 30,000-40,000 news items/day

900 on topic/day

200 events/day

4 alerts/day



MULTILINGUALITY

Avian Flu

Influenza aviaire

鳥インフルエンザ

조류인플루엔자

โรคไข้หวัดนก

Cúm gia cầm


Increased sensitivity and timeliness from multilingual news

News event counts for porcine foot-and-mouth outbreak in South Korea, 2010-2011




AMBIGUITY

“Obama fever builds as Americans await a new era”

Equine influenza in Camden

Camden (UK) Camden (AU) Camden (CA) + 19 others

Entity identification

Toponym grounding

Tajoura Tajura Tajoora…

Variant transliterations

Coreference

“Two British holidaymakers fell ill…”

“Two male pensioners died…”

2 or 4 victims?

Temporal identification

“The Spanish flu outbreak…”

Semantic pipeline

Source: BioCaster

Outbreak characteristics: Early surge vs multi-modal transmission

News event frequency over time

Looking for bursts of activity

Source: GENI-DB

[Figure: daily news event counts, 21/03/2009 to 14/06/2009; vertical axes 0-200 and 0-1; series: Ct, μ, μ+3σ, Gold]

Alerts with the C2 test statistic: $S_t = \max(0, (C_t - (\mu_t + 3\sigma_t)) / \sigma_t)$

First English-language reports (MMWR + AP)

Understanding norms and their violations

5 detection algorithms

1. Early aberration reporting system (EARS) C2 algorithm

• captures the number of standard deviations that the current count exceeds the history mean;

• $S_t = \max(0, (C_t - (\mu_t + k\sigma_t)) / \sigma_t)$

2. EARS C3 algorithm

• similar to C2 except that C3 uses a weighted sum of the previous 3 days for the current period;

3. W2 algorithm

• a modified version of C2 which ignores history counts on Saturdays and Sundays to compensate for day-of-week effects;

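4. F statistic

• compares the variance in the history window to the variance in the current window;

• $S_t = \sigma_t^2 / \sigma_b^2$, where $\sigma_t^2$ is the variance in the current window and $\sigma_b^2$ the variance in the history (baseline) window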

5. Exponential Weighted Moving Average (EWMA)

• provides less weight to days in the history that are further from the test day.

• $S_t = (Y_t - \mu_t) / [\sigma_t (\lambda / (2 - \lambda))^{1/2}]$, where $Y_1 = C_1$ and $Y_t = \lambda C_t + (1 - \lambda) Y_{t-1}$

Model parameters were estimated based on an additional 5 epidemic data sets from ProMED-mail (data not shown)
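The C2 and EWMA statistics listed above can be sketched directly from their formulas. This is a minimal illustration under stated assumptions, not the study's implementation: the history window length (`baseline=7`), the gap between history and test day (`guard=2`), the smoothing weight (`lam=0.4`) and the `min_sigma` floor are illustrative values, not the parameters estimated from the ProMED-mail data, and the function names are hypothetical. Days are indexed from 0.

```python
from statistics import mean, stdev

def history_stats(counts, t, baseline=7, guard=2, min_sigma=0.2):
    """Mean and standard deviation of the history window ending `guard` days before day t."""
    end = t - guard
    start = end - baseline
    if start < 0:
        raise ValueError("not enough history before day t")
    window = counts[start:end]
    # Floor sigma so a flat history does not cause division by zero (assumption).
    return mean(window), max(stdev(window), min_sigma)

def c2_statistic(counts, t, k=3.0, **window):
    """S_t = max(0, (C_t - (mu_t + k*sigma_t)) / sigma_t)."""
    mu, sigma = history_stats(counts, t, **window)
    return max(0.0, (counts[t] - (mu + k * sigma)) / sigma)

def ewma_statistic(counts, t, lam=0.4, **window):
    """S_t = (Y_t - mu_t) / (sigma_t * sqrt(lam / (2 - lam))).

    Y_1 = C_1 and Y_t = lam * C_t + (1 - lam) * Y_{t-1}.
    """
    mu, sigma = history_stats(counts, t, **window)
    y = counts[0]
    for c in counts[1:t + 1]:
        y = lam * c + (1 - lam) * y
    return (y - mu) / (sigma * (lam / (2 - lam)) ** 0.5)

# Made-up daily news counts with a spike on the final day.
daily_counts = [2, 3, 2, 4, 3, 2, 3, 3, 2, 3, 2, 30]
for day in (10, 11):
    print(day, c2_statistic(daily_counts, day), ewma_statistic(daily_counts, day))
```

An alert would be raised when the statistic exceeds a chosen threshold; the threshold itself is a tuning decision that trades sensitivity against the number of alarms.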

[8] Burkom H. S. (2005), “Accessible Alerting Algorithms for Biosurveillance”. National Syndromic Surveillance Conference.

[9] Jackson M. L. et al. (2007), “A simulation study comparing aberration detection algorithms for syndromic surveillance”, BMC Medical Informatics and Decision Making, 7(6), DOI: 10.1186/1472-6947-7-6.

[10] Madoff L. (2004), “ProMED-mail: An early warning system for emerging diseases”. Clin Infect Dis, 39(2): 227–232.

| # | Disease | Country | ProMED-alerts |
|---|---------|---------|---------------|
| 1 | Hand, foot, mouth | PR China | 9 |
| 2 | Ebola | Congo | 17 |
| 3 | Yellow fever | Brazil | 28 |
| 4 | Influenza | USA | 21 |
| 5 | Cholera | Iraq | 5 |
| 6 | Chikungunya | Singapore | 8 |
| 7 | Anthrax | USA | 15 |
| 8 | Yellow fever | Argentina | 5 |
| 9 | Ebola Reston | Philippines | 15 |
| 10 | Influenza | Egypt | 49 |
| 11 | Plague | USA | 8 |
| 12 | Dengue | Brazil | 27 |
| 13 | Dengue | Indonesia | 14 |
| 14 | Measles | UK | 13 |
| 15 | Chikungunya | Malaysia | 15 |
| 16 | Yellow fever | Senegal | 0 |
| 17 | Influenza | Indonesia | 35 |
| 18 | Influenza | Bangladesh | 3 |

14 countries and 11 infectious disease types. 366 days of news data were collected from BioCaster for each disease and country. The study period was 17th June 2008 to 17th June 2009.

Creating a benchmark data set

| Metric | C3 | C2 | W2 | F-statistic | EWMA |
|--------|----|----|----|-------------|------|
| Sensitivity | 0.74 (0.69-0.78) | 0.66 (0.61-0.72) | 0.66 (0.60-0.71) | 0.78 (0.74-0.82) | 0.73 (0.68-0.78) |
| Specificity | 0.96 (0.95-0.96) | 0.98 (0.98-0.98) | 0.98 (0.98-0.99) | 0.92 (0.91-0.92) | 0.95 (0.94-0.96) |
| PPV | 0.55 (0.98-0.99) | 0.64 (0.98-0.99) | 0.65 (0.98-0.99) | 0.46 (0.98-0.99) | 0.47 (0.98-0.99) |
| NPV | 0.98 (0.98-0.99) | 0.98 (0.98-0.99) | 0.98 (0.98-0.99) | 0.98 (0.98-0.98) | 0.98 (0.98-0.99) |
| Alarms/100 days | 6.48 | 4.52 | 4.17 | 12.34 | 7.85 |
| F-measure | 0.63 | 0.65 | 0.66 | 0.58 | 0.58 |

Results in parentheses show 95% confidence intervals
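The F-measure row appears consistent with the usual harmonic mean of sensitivity (recall) and PPV (precision), F = 2PR / (P + R). The short check below recomputes it from the reported point estimates; small differences from the reported values are expected because the inputs are already rounded to two decimals. The function name is hypothetical, and the assumption that the slides use this balanced F-measure is ours.

```python
def f_measure(sensitivity, ppv):
    """Balanced F-measure: harmonic mean of sensitivity (recall) and PPV (precision)."""
    if sensitivity + ppv == 0:
        return 0.0
    return 2 * sensitivity * ppv / (sensitivity + ppv)

# (sensitivity, PPV, reported F-measure) taken from the comparison table.
reported = {
    "C3": (0.74, 0.55, 0.63),
    "C2": (0.66, 0.64, 0.65),
    "W2": (0.66, 0.65, 0.66),
    "F-statistic": (0.78, 0.46, 0.58),
    "EWMA": (0.73, 0.47, 0.58),
}

for name, (sens, ppv, f_reported) in reported.items():
    print(name, round(f_measure(sens, ppv), 3), f_reported)
```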

[11] Collier, N. (2009), “What’s unusual in online disease outbreak news?”, in BMC Biomedical Semantics, 1(2).

Comparison of 5 aberration detection algorithms

Field evaluation

• (2006-2012) Global Health Security Initiative: a unique initiative by G7+WHO+EC to bring together end-users, system providers and stakeholders to test the feasibility of open source public health intelligence systems.

[12] Barboza, P., Vaillant, L., Le Strat, Y., Hartley, D. M., Nelson, N. P., Mawudeku, A., Madoff, L. C., Linge, J. P., Collier, N., Brownstein, J. S. and Astagneau, P. (2014). Factors Influencing Performance of Internet-Based Biosurveillance Systems Used in Epidemic Intelligence for Early Detection of Infectious Diseases Outbreaks. PloS one, 9(3), e90536.

[13] Barboza, P., Vaillant, L., Mawudeku, A., Nelson, N., Hartley, D., Madoff, L., Linge, J., Collier, N., Brownstein, J., Yangarber, R. and Astagneau, P. (2013), “Evaluation of epidemic intelligence systems integrated in the Early Alerting and Reporting project for the detection of A/H5N1 Influenza events”, PLoS One, 8(3):e57252.

Major findings for A/H5N1:
- Detection rates for individual systems from 31% to 38%
- Rising to 72% for the combined system
- PPV ranged from 3% to 24%
- F1 ranged from 6% to 27%
- Sensitivity ranged from 38% to 72%
- Average improvement in alerting over WHO or OIE was 10.2 days

User outcomes

• Used by WHO and Japanese MoH to detect early cases during the A(H1N1) pandemic;

• Used by ECDC to monitor diseases during the Shanghai Expo 2010, London Olympics 2012;

• Used by French Institute for Public Health to monitor for human-to-human A(H5N1) transmission;

• Used by GHSI members to monitor for suspected accidental or deliberate releases;

• Used by CDC to help monitor the health impact of the oil spill in the Gulf of Mexico.

PHENOMINER/PHENEBANK: EXTRACTING A DATABASE OF PHENOTYPE TERMS

Case study #2

[14] Collier, N., Groza, T., Smedley, D., Robinson, P., Oellrich, A. and Rebholz-Schuhmann, D. (2015). PhenoMiner: from text to a database of phenotypes associated with OMIM diseases. Database, Oxford University Press (in press).

[15] Collier, N., Oellrich, A. and Groza, T. (2013), “Toward knowledge support for analysis and interpretation of complex traits”, Genome Biology 14(9):214.

What is a phenotype?

Image courtesy of Washington, Haendel, Mungall, Ashburner, Westfield and Lewis (2009), “Linking human diseases to animal models using ontology-based phenotype annotation”, PLoS Biology, 7(11):e1000247.

“… patients were selected for FOXP2 screening only if they fulfilled the following criteria: presence of speech articulation problems diagnosed by a clinician …”

HPO: 0009088 Speech articulation difficulties

Image courtesy of Damian Smedley, Wellcome Trust Sanger Institute, Hinxton and Tudor Groza, University of Queensland, Brisbane

Coding personal terminology

SVM learn-to-rank (pairwise)

Maximum entropy

Priority list heuristic

“… patients were selected for FOXP2 screening only if they fulfilled the following criteria: presence of speech articulation problems diagnosed by a clinician”

Creating a benchmark data set

• Data from OMIM cited autoimmune literature (112 abstracts, 472 phenotypes, 1611 gene/gene products).

F-scores computed using ablation on various domain ontologies

F-scores using 3 hypothesis resolution strategies

[16] Collier, N., Tran, M., Le, H. Ha, Q., Oellrich, A. Rebholz-Schuhmann, D. (2013), “Learning to recognize phenotype candidates in the auto-immune literature using SVM re-ranking”, PLoS One 8(10): e72965.

Lesson learnt … sampling matters

| Resource | Size (records) |
|----------|----------------|
| PubMed | 23,765,575 |
| GENIA | 2,000 |
| PennBioIE | 1,414 |
| FSU-PRGE | 3,236 |
| Arizona corpus | 2,775 sentences |
| I2B2/VA 2010 | 826 |

M1: In domain approach

Sample B1

Learner

Knowledge

Evaluation (A)

Sample B2

Sample B

M2: Out domain approach

Sample A

Learner

Knowledge

Evaluation (B)

Sample B

M3: Mix-in approach

Sample A

Learner

Knowledge

Evaluation (B)

Sample B

Sample B

+

M4: stack approach

Learner

Knowledge

Evaluation (B)

Sample B

Sample B

Sample A

Learner

Knowledge

M5: binary class

Sample A

Learner

Knowledge

Evaluation (B)

Sample B

Sample B

+

Re-label PHE-1 and PHE-2 as PHE

M6: frustratingly simple

Sample A

Learner

Knowledge

Evaluation (B)

Sample B

Sample B

+

Re-label features as Sample A, Sample B and Joint
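The M6 re-labelling step resembles the feature-augmentation idea from Daumé III's "Frustratingly Easy Domain Adaptation" (2007): every feature is copied into a shared space and into a space specific to the sample it came from, so one learner can weight shared and domain-specific evidence separately. The sketch below assumes that reading of the slide; the function name and the `joint:`, `A:` and `B:` prefixes are hypothetical.

```python
def augment_features(features, domain):
    """Copy each feature into a shared 'joint' space and a domain-specific space.

    `features` maps feature names to values; `domain` is "A" or "B".
    """
    if domain not in ("A", "B"):
        raise ValueError("domain must be 'A' or 'B'")
    augmented = {}
    for name, value in features.items():
        augmented["joint:" + name] = value       # visible to both domains
        augmented[domain + ":" + name] = value   # visible to this domain only
    return augmented

# A token from Sample A and a token from Sample B with the same surface feature.
print(augment_features({"word=articulation": 1.0}, "A"))
print(augment_features({"word=articulation": 1.0}, "B"))
```

Both examples share the `joint:` copy, so evidence from Sample A can transfer to Sample B, while the prefixed copies let the learner correct for differences between the two samples.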


Near domain transfer results

How can we do domain adaptation better (with fewer annotations)?

[17] Collier, N., Paster, F., Campus, H., & Tran, A. M. V. (2014), “The impact of near domain transfer on biomedical named entity recognition”, Proc. 5th International Workshop on Health Text Mining and Information Analysis (LOUHI) at the European Conference on Computational Linguistics (EACL), Gothenburg, Sweden, pp. 11-20.

SIPHS: UNDERSTANDING THE VOICE OF THE PATIENT

Case study #3

[18] Limsopatham, N. and Collier, N. (2015), “Adapting phrase-based machine translation to normalise medical terms in social media messages”, in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17-21 September 2015, pp. 1675-1680.

[19] Limsopatham, N. and Collier, N. (2015), “Towards the semantic interpretation of personal health messages from social media”, in Proceedings of the 24th ACM International Conference on Information and Knowledge Management (CIKM 2015), Workshop on Understanding the City with Urban Informatics (UCUI 2015), Melbourne, Australia, 19-23 October 2015.

What do people talk about?

| Type | Tweet sample |
|------|--------------|
| Influenza confirmation | I got flu n coughed a lot. Now my voice is like monster’s voice. Rrr |
| Influenza symptoms | My day: flu-like symptoms (headache, body aches, cough, chills, 100.9 fever). Swine flu not ruled out. #H1N1 |
| Flu shots | I’m still getting flu shots, nothing is worth flu turning into bronchitis into pneumonia |
| Self protection | Cover your mouth if coughing, use a tissue, wash your hands often & get a flu shot - protect and defend your community from #H1N1 |
| Medication | Wondering why I didn’t take the flu shot, laying in bed with cough drops, medicine, and the remote |

Tracking anxiety indicators have moderate-strong correlation with CDC seasonal flu tracking

| Category | Spearman’s rho | P-value |
|----------|----------------|---------|
| A | 0.66 | 0.020 |
| S | 0.66 | 0.021 |
| I | 0.58 | 0.048 |
| P | 0.67 | 0.017 |
| A+I+P | 0.68 | 0.008 |
| A+I+P+S | 0.67 | 0.017 |

[Figure: weekly counts for weeks 46-52 and 1-5; vertical axes 0-3000 and 0-450; series: CDC, A, S, I, P, A+I+P, A+I+P+S]
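Spearman's rho, the rank correlation reported for each category, is the Pearson correlation computed on ranks. The sketch below implements it with the standard library only; the function names are hypothetical and the weekly series are made-up numbers used purely to exercise the code, not the CDC or Twitter data behind the slides.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined for a constant series")
    return cov / (var_x * var_y) ** 0.5

# Made-up weekly counts: official case reports vs. tweets in one category.
official = [120, 340, 560, 900, 700, 400]
tweets = [15, 30, 80, 70, 95, 40]
print(spearman_rho(official, tweets))
```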

Data source: CDC (2009-2010 flu season)

“Cover your mouth if coughing, use a tissue, wash your hands often & get a flu shot - protect and defend your community”

“I’m still getting flu shots, nothing is worth flu turning into bronchitis into pneumonia”

“I can ignore this sore throat no longer. And, um, maybe I should have gotten that H1N1 vaccine.”

Frustratingly simple models work better

Classifying respiratory syndrome: Turning 225,000 Tweets into a high correlation influenza tracker

[22] Doan, S., Ohno-Machado, L. and Collier, N. (2012), "Enhancing Twitter data analysis with simple semantic filtering: example in tracking Influenza-Like Illnesses", in the 2nd IEEE Conference on Healthcare Informatics, Imaging and Systems Biology: Analyzing Big Data for Healthcare and Biomedical Sciences, California, USA, September 27-28.

Coding the voice of the patient in SIPHS

• Integrate the language of Social Media and Lifescience Ontologies

• ‘Voice of the patient’ – real time public health mapping/risk analysis

• Code patient-centred vocabulary and links

• Generate public health summaries, e.g. infectious diseases, ADRs

| Twitter message | SNOMED preferred term | SNOMED ID |
|-----------------|-----------------------|-----------|
| No way I’m getting any sleep 2nite | Insomnia | 193462001 |
| Take _DRUG_ and can’t even focus forreal | Unable to concentrate | 60032008 |
| _DRUG_ makes u skinny | Weight loss | 89362005 |

“You shall know a word by the company it keeps” – (Firth, J. R. 1957)

• Existing work [23,24] used word vector similarity to measure the semantic similarity between texts

Performance seems to depend on the vector representation used (e.g. CBOW [23], GloVe [24])

[23] Mikolov et al. Distributed representations of words and phrases and their compositionality. NIPS 2013

[24] Pennington et al. GloVe: Global vectors for word representation. EMNLP 2014

• Recent advances in deep learning [23,24] allow learned representations of terms, i.e. distributed word representations (DWRs), that capture the semantic similarity of terms based on their co-occurrences, e.g. continuous bag-of-words (CBOW) [23] and Global Vectors (GloVe) [24]

One-hot
money [1 0 0 0 0 .. ..]
cash [0 1 0 0 0 .. ..]
april [0 0 1 0 0 .. ..]

DWR
money [1.251 0.751 0.008 .. ]
cash [1.100 0.830 0.010 .. ]
april [-5.256 0.004 2.526 .. ]
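The contrast between the two representations shows up as soon as cosine similarity is computed. The sketch below uses only the vector components visible on the slide; the real vectors are truncated there (".."), so the resulting numbers are illustrative rather than the true similarities. The `cosine` helper is a hypothetical name.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Visible components only; the slide truncates both sets of vectors.
one_hot = {
    "money": [1, 0, 0, 0, 0],
    "cash": [0, 1, 0, 0, 0],
    "april": [0, 0, 1, 0, 0],
}
dwr = {
    "money": [1.251, 0.751, 0.008],
    "cash": [1.100, 0.830, 0.010],
    "april": [-5.256, 0.004, 2.526],
}

for name, vectors in (("one-hot", one_hot), ("DWR", dwr)):
    print(name,
          "money~cash:", round(cosine(vectors["money"], vectors["cash"]), 3),
          "money~april:", round(cosine(vectors["money"], vectors["april"]), 3))
```

With one-hot vectors every pair of distinct words has similarity zero, whereas the dense vectors place "money" close to "cash" and far from "april".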


Related work – Phrase-based MT

• Phrase-based MT [25]: Translate between languages by learning local term dependencies from parallel corpora

We adapt phrase-based MT to translate from social media language to formal medical language

Can’t even focus forreal → no concentrate ???

[25] Koehn et al. Statistical phrase-based translation. NAACL 2003


Adapting Phrase-based MT for Twitter Normalisation

• We use phrase-based MT to translate social media text to formal medical text, then map the translated symptoms to a SNOMED-CT concept

Can’t even focus forreal → (translate) → unable to focus → (find semantic distance) → unable to concentrate (ID 60032008)

[18] Limsopatham, N. and Collier, N. (2015), “Adapting phrase-based machine translation to normalise medical terms in social media messages”, in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, September, pp. 1675-1680.

A Twitter phrase, e.g. ‘No way I’m getting any sleep 2nite’

Training pairs of Twitter phrases and SNOMED-CT terms, e.g. ‘no sleep week’ = ‘Insomnia’, ‘so unfocussed!!!’ = ‘Unable to concentrate’

A phrase-based MT model, using a phrase-based model such as Koehn et al. (2003)

Our mapping approach (i.e. Sim, rSim)

A ranking of mapped concepts, e.g. 1. Insomnia (193462001), 2. Productivity at work (224403006)

System Architecture

Experimental Setup

• Instantiations of our approach:

Sim(1): using only the best translation

Sim(5): using the top 5 translations

rSim(5): using the top 5 translations

• Baseline: cosine similarity of vector representations of the original tweet and the description of a concept

• Vector representations: one-hot, Continuous Bag-of-Words (CBOW), Global Vectors (GloVe)
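The one-hot baseline can be sketched as a bag-of-words cosine ranking over concept descriptions. This is an illustrative reconstruction under assumptions, not the authors' code: whitespace tokenisation, the three-concept `CONCEPTS` table (IDs and terms taken from the slides) and the function names are choices made for the example.

```python
import math
from collections import Counter

# SNOMED IDs and preferred terms as they appear on the slides.
CONCEPTS = {
    "193462001": "Insomnia",
    "60032008": "Unable to concentrate",
    "89362005": "Weight loss",
}

def bag_of_words(text):
    """One-hot style representation: lower-cased word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_concepts(text, concepts=CONCEPTS):
    """Return (score, concept id, description) tuples, best match first."""
    query = bag_of_words(text)
    scored = [(cosine(query, bag_of_words(desc)), cid, desc) for cid, desc in concepts.items()]
    return sorted(scored, key=lambda item: item[0], reverse=True)

# Ranking the raw tweet matches on surface words such as "unable" and "to".
print(rank_concepts("unable to sleep at all")[0])
# Ranking a translated phrase instead lets the medical term drive the match.
print(rank_concepts("insomnia of")[0])
```

Surface word overlap alone pulls the raw message towards the wrong concept, which is the motivation for translating into formal medical language before mapping.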


Experimental Results

• RQ1: Does our approach perform better than SOTA DWR baselines?

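[Figure: MRR-5 by approach and vector representation; vertical axis 0 to 0.3]

| Representation | Baseline | Sim(1) | Sim(5) | rSim(5) |
|----------------|----------|--------|--------|---------|
| One-hot | 0.1675 | 0.2232 | 0.2491 | 0.2458 |
| CBOW | 0.1896 | - | - | - |
| GloVe | 0.1869 | - | - | - |

Yes, all instances of our approach markedly outperformed the DWR baselines, by up to 33% (MRR-5)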
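MRR-5 is read here as mean reciprocal rank with a cutoff of five: each message scores 1/r if the correct concept appears at rank r within the top five candidates, and 0 otherwise. The sketch below implements that common definition; the function name and the toy rankings are hypothetical, and the exact evaluation protocol used in the study may differ.

```python
def mrr_at_k(rankings, gold, k=5):
    """Mean reciprocal rank of the gold concept within the top k of each ranking."""
    if len(rankings) != len(gold):
        raise ValueError("need one gold concept per ranking")
    if not rankings:
        return 0.0
    total = 0.0
    for ranked, target in zip(rankings, gold):
        for position, concept in enumerate(ranked[:k], start=1):
            if concept == target:
                total += 1.0 / position
                break
    return total / len(rankings)

# Toy example: gold concept at rank 1, at rank 2, and outside the top five.
rankings = [
    ["insomnia", "weight loss"],
    ["weight loss", "insomnia"],
    ["c1", "c2", "c3", "c4", "c5", "insomnia"],
]
gold = ["insomnia", "insomnia", "insomnia"]
print(mrr_at_k(rankings, gold))
```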


Twitter message: “unable to sleep at all”

Baseline:
Mapping: “unable to sleep at all” → ‘unable to concentrate’

Our approach:
Translation: “unable to sleep at all” → “insomnia of”
Mapping: “insomnia of” → ‘insomnia’


• RQ2: Which types of DWRs are effective for our approach?

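[Figure: MRR-5 by approach and vector representation; vertical axis 0 to 0.3]

| Representation | Baseline | Sim(1) | Sim(5) | rSim(5) |
|----------------|----------|--------|--------|---------|
| One-hot | 0.1675 | 0.2232 | 0.2491 | 0.2458 |
| CBOW | 0.1896 | 0.2070 | 0.2104 | 0.2109 |
| GloVe | 0.1869 | 0.2500 | 0.2638 | 0.2617 |

Both Sim and rSim outperform the baseline, regardless of the vector representation used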


• RQ3: Would the performance improve if we consider both original and translated text when mapping a concept?

[Figure: MRR-5 by approach and vector representation; vertical axis 0.15 to 0.27]

| Representation | Sim(1) | Sim(1)+ | Sim(5) | Sim(5)+ | rSim(5) | rSim(5)+ |
|----------------|--------|---------|--------|---------|---------|----------|
| One-hot | 0.2232 | 0.242 | 0.2491 | 0.2556 | 0.2458 | 0.2594 |
| CBOW | 0.207 | 0.1953 | 0.2104 | 0.2144 | 0.2109 | 0.207 |
| GloVe | 0.2500 | 0.2532 | 0.2638 | 0.2600 | 0.2617 | 0.2509 |

Performance improved when using the one-hot representation


Summary

• How we exploit the base of medical evidence is changing as access to unstructured ‘messy’ data opens up new opportunities

• Data access, bias and standards

• We can expect impact in epidemic detection, pharmacovigilance, translational health, disease mapping, risk communication, rare disease profiling and many other areas.

• Encoding the data increases value through data mining, exchange and integration

• Machine learning outperforms dictionaries and hand built rules

• Finding the right lexical representation and right target form is key

Thank you

Contributions by:

Nigel [email protected] [email protected] [email protected]

Further information at the Language Technology Lab: http://ltl.mml.cam.ac.uk/

Funding:

