NLP pipeline for protein mutation knowledgebase construction Jonas B. Laurila, Nona Naderi, René Witte, Christopher J.O. Baker

Post on 13-Dec-2015


Background

• Knowledge about mutations is crucial for many applications, e.g. protein engineering and biomedicine.

• Protein mutations are described in the scientific literature.

• The amount of information grows faster than manual database curation can handle.

• Automatic reuse of mutation impact information from documents is needed.

Example excerpts

"Haloalkane dehalogenase (DhlA) from Xanthobacter autotrophicus GJ10 hydrolyses terminally chlorinated and brominated n-alkanes to the corresponding alcohols."

"The W125F mutant showed only a slight reduction of activity (Vmax) and a larger increase of Km with 1,2-dibromoethane."

• Directionality of impact
• Protein property
• Mutation

• Protein name
• Gene name
• Organism name

Mutation impact ontology

NLP framework

Named entity recognition

• Protein, gene and organism names
  – Gazetteer lists based on SwissProt
  – Mappings encoded in the MGDB

• Mutation mentions
  – MutationFinder (~700 regular expressions)
  – Mentions normalized into wNm format (e.g. W125F)
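The normalization step can be illustrated with a minimal Python sketch. It uses a single illustrative pattern, not MutationFinder's actual rule set, and assumes wNm means wild-type residue, position, mutant residue in one-letter codes. The names `MUTATION_RE` and `normalize_mutations` are hypothetical.

```python
import re

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_LETTER = "".join(sorted(THREE_TO_ONE.values()))
THREE_LETTER = "|".join(THREE_TO_ONE)

# Illustrative pattern only (an assumption, not a MutationFinder rule):
# matches mentions such as "W125F" and "Trp125Phe".
MUTATION_RE = re.compile(
    rf"\b(?P<wt>[{ONE_LETTER}]|{THREE_LETTER})"
    rf"(?P<pos>[1-9]\d*)"
    rf"(?P<mt>[{ONE_LETTER}]|{THREE_LETTER})\b"
)


def normalize_mutations(text):
    """Return all mutation mentions in `text`, normalized to wNm format."""
    normalized = []
    for match in MUTATION_RE.finditer(text):
        # Map three-letter codes to one-letter codes; keep one-letter codes.
        wt = THREE_TO_ONE.get(match.group("wt"), match.group("wt"))
        mt = THREE_TO_ONE.get(match.group("mt"), match.group("mt"))
        normalized.append(f"{wt}{match.group('pos')}{mt}")
    return normalized
```

Under these assumptions, both "W125F" and "Trp125Phe" normalize to the same string, which is what makes mentions from different papers comparable.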

Named entity recognition

Protein properties

1. Protein functions
  – Noun phrases extracted with MuNPEx
  – Activity, binding, affinity, specificity as head nouns

2. Kinetic variables
  – JAPE rules to extract Km, kcat and kcat/Km in the current implementation
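The two kinds of protein property can be sketched in Python as follows. This is not the JAPE grammar or MuNPEx itself: it approximates the head noun as the last token of an already extracted noun phrase, and matches the three kinetic variables with a plain regular expression. All names are hypothetical.

```python
import re

# Head nouns listed on the slide for protein function mentions.
HEAD_NOUNS = {"activity", "binding", "affinity", "specificity"}

# Longest alternative first, so "kcat/Km" is not split into "kcat" and "Km".
KINETIC_RE = re.compile(r"\b(kcat/Km|kcat|Km)\b")


def is_function_phrase(noun_phrase):
    """Assumption: the head noun of an English noun phrase is its last token."""
    tokens = noun_phrase.lower().split()
    return bool(tokens) and tokens[-1].strip(".,;:()") in HEAD_NOUNS


def find_kinetic_variables(sentence):
    """Return kinetic variable mentions in order of appearance."""
    return [match.group(1) for match in KINETIC_RE.finditer(sentence)]
```

For example, `find_kinetic_variables("a larger increase of Km with 1,2-dibromoethane")` yields `["Km"]`.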

Mutation grounding

Linking mutations to the positionally correct residue in the target sequence

• Important for reuse of mutation mentions

• Levels of grounding:

1.

2.

3.
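One way to picture positional grounding is the consistency check below. It is a minimal sketch, not the authors' algorithm: it assumes mutations are already in wNm format and searches for a single numbering offset at which every wild-type residue matches the candidate sequence. The function names and the `max_offset` parameter are hypothetical.

```python
import re


def parse_mutation(mutation):
    """Split a wNm string such as 'W125F' into (wild type, position, mutant)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"not in wNm format: {mutation!r}")
    return match.group(1), int(match.group(2)), match.group(3)


def find_grounding_offset(mutations, sequence, max_offset=30):
    """Return the smallest-magnitude offset at which all wild-type residues
    match `sequence`, or None if the mutations cannot be grounded."""
    parsed = [parse_mutation(m) for m in mutations]
    # Try offset 0 first, then -1, +1, -2, +2, ...
    for offset in sorted(range(-max_offset, max_offset + 1), key=abs):
        if all(
            0 <= pos - 1 + offset < len(sequence)
            and sequence[pos - 1 + offset] == wild_type
            for wild_type, pos, _ in parsed
        ):
            return offset
    return None
```

Allowing an offset reflects the common situation where a paper numbers residues differently from the database sequence, for instance because of a signal peptide; that motivation is an assumption here rather than a statement from the slides.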

mSTRAPviz

Structure annotation visualization

Mutations extracted from text are visualized on the protein structure; mutation grounding is a prerequisite for this.

Protein function grounding

• Mentions of protein functions are linked to the correct Gene Ontology concepts.

• Previously grounded proteins and mutations provide hints.

• Each grounding is scored by string similarity (the score is later used during impact extraction).
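Scoring by string similarity can be sketched with Python's standard `difflib`. The slides do not say which similarity measure is used, so `SequenceMatcher.ratio()` is a stand-in, and the candidate identifiers in the usage example are placeholders rather than real Gene Ontology IDs.

```python
from difflib import SequenceMatcher


def score_groundings(mention, candidates):
    """Rank candidate concepts for a function mention.

    `candidates` maps a concept identifier to its label. Returns a list of
    (score, identifier, label) tuples, best score first.
    """
    scored = [
        (SequenceMatcher(None, mention.lower(), label.lower()).ratio(),
         concept_id, label)
        for concept_id, label in candidates.items()
    ]
    return sorted(scored, reverse=True)


# Placeholder identifiers, not real Gene Ontology accessions.
example_candidates = {
    "GO:EXAMPLE1": "haloalkane dehalogenase activity",
    "GO:EXAMPLE2": "hydrolase activity",
    "GO:EXAMPLE3": "DNA binding",
}
ranked = score_groundings("dehalogenase activity", example_candidates)
```

Keeping the score alongside the identifier matters because, as the slide notes, it is reused later during impact extraction.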

Relation detection

• Impacts
  – Words describing directionality + protein properties

• Mutants
  – Set of mutations giving rise to altered proteins

• Mutant-impact relations
  – The causal relation between mutants and their impacts
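The three relation types can be illustrated with a deliberately simple sentence-level heuristic. This is a sketch under strong assumptions, not the pipeline's actual rules: an impact is a directionality word followed by "of" or "in" and a property, the mutant is the set of all mutation mentions in the sentence, and every impact in the sentence is attributed to that mutant. The word lists and names are hypothetical.

```python
import re

# Hypothetical directionality vocabulary; the real lists are not in the slides.
DIRECTION = {"reduction": "decrease", "decrease": "decrease", "increase": "increase"}

# Simplified wNm pattern (any capital letter, not only amino acid codes).
MUTATION_RE = re.compile(r"\b[A-Z]\d+[A-Z]\b")
IMPACT_RE = re.compile(
    r"\b(reduction|decrease|increase)\s+(?:of|in)\s+(activity|kcat/Km|kcat|Km)\b"
)


def extract_mutant_impacts(sentence):
    """Return (mutant, direction, property) triples for one sentence."""
    mutant = tuple(MUTATION_RE.findall(sentence))
    if not mutant:
        return []  # no mutant in the sentence, so no relation to report
    return [
        (mutant, DIRECTION[match.group(1)], match.group(2))
        for match in IMPACT_RE.finditer(sentence)
    ]
```

Applied to the example excerpt "The W125F mutant showed only a slight reduction of activity (Vmax) and a larger increase of Km with 1,2-dibromoethane.", the sketch returns a decrease of activity and an increase of Km, both attributed to the mutant `("W125F",)`.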

OwlExporter

• Translates GATE Annotations to OWL instances

• Application independent

• Literature specifications added automatically

• Used here to populate our Mutation impact ontology to create a mutation knowledgebase

Example query

Retrieve mutations that do not have an impact on haloalkane dehalogenase activity (also retrieve the SwissProt identifier of the protein being mutated).

Example query

Retrieve mutations on haloalkane dehalogenase that do not have a negative impact on the Michaelis constant.
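The logic of this kind of query can be sketched in Python over flat records. The actual query language and knowledgebase schema are not shown in the transcript, so the record fields and the placeholder mutation identifiers below are assumptions made purely for illustration and carry no experimental meaning.

```python
def mutations_without_negative_impact(records, protein, prop):
    """Return mutations on `protein` that have a recorded impact on `prop`
    but no negative impact on it."""
    relevant = [
        r for r in records
        if r["protein"] == protein and r["property"] == prop
    ]
    negative = {r["mutation"] for r in relevant if r["direction"] == "negative"}
    return sorted({r["mutation"] for r in relevant} - negative)


# Placeholder records: the mutation names and directions are invented.
example_records = [
    {"mutation": "M1", "protein": "haloalkane dehalogenase",
     "property": "Michaelis constant", "direction": "negative"},
    {"mutation": "M2", "protein": "haloalkane dehalogenase",
     "property": "Michaelis constant", "direction": "positive"},
    {"mutation": "M3", "protein": "haloalkane dehalogenase",
     "property": "activity", "direction": "neutral"},
]
result = mutations_without_negative_impact(
    example_records, "haloalkane dehalogenase", "Michaelis constant"
)
```

With these placeholder records, only `M2` is returned: `M1` has a negative impact, and `M3` has no recorded impact on the Michaelis constant.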

Evaluation

Mutation grounding performance

What’s next?

• Modularize into a set of web services

• Database (re-)creation

• Reuse in phenotype prediction algorithms (SNAP)*

*Bromberg and Rost, 2007

Jonas B. Laurila, CSAS, UNB, Saint John

Nona Naderi, CSE, Concordia University, Montréal

René Witte, CSE, Concordia University, Montréal

Christopher J.O. Baker, CSAS, UNB, Saint John

Acknowledgement

This research was funded in part by:

• New Brunswick Innovation Foundation, New Brunswick, Canada

• NSERC, Discovery Grant, Canada

• Quebec-New Brunswick University Co-operation in Advanced Education - Research Program, Government of New Brunswick, Canada