MIE 2008

Assessment of Biomedical Knowledge According to Confidence Criteria

Ines Jilani: [email protected]

Natalia Grabar, Pierre Meneton, Marie-Christine Jaulent

Wednesday, 28th of May 2008

Context

• Increasing number of biomedical articles in PubMed*

• Follow-up work on automatic extraction of functional knowledge about genes/proteins from scientific articlesΔ indexed in PubMed
– Using lexico-syntactic patterns:
• Language-specific automaton (grammar)
o Syntactic elements (Verb, Noun, Adjective…)
o Semantic elements (Meaning of words…)

* http://www.ncbi.nlm.nih.gov/sites/entrez

Δ Jilani I, Grabar N & Jaulent M.-C. Fitting the finite-state automata platform for mining gene functions from biological scientific literature. In SMBM in Jena (Germany) 2006

Example of lexico-syntactic pattern

o (Sox2; sensory organ development)

o (Hint; murine development)
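
A minimal Python sketch of how a pattern of this kind might yield a (gene; function) pair. It assumes a single hypothetical surface form, "<gene> is required for <function>". The regular expression, the name `extract_pair` and the test sentence are illustrative assumptions, not the automaton used in the original system.

```python
import re

# Hypothetical pattern: "<gene> is required for <function>".
# The original system relies on language-specific automata; this regex only illustrates the idea.
REQUIRED_FOR = re.compile(
    r"\b(?P<gene>[A-Z][A-Za-z0-9-]*)\s+is\s+required\s+for\s+"
    r"(?P<function>[a-z][a-z\s-]*[a-z])"
)


def extract_pair(sentence):
    """Return a (gene, function) pair, or None when the pattern does not match."""
    match = REQUIRED_FOR.search(sentence)
    if match is None:
        return None
    return match.group("gene"), match.group("function")


print(extract_pair("Sox2 is required for sensory organ development."))
```

A real pattern set would need many more surface forms and a gene-name recognizer; the capitalized-token rule for genes is only a placeholder.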

Introduction

• Limits of the system
– Loss of context: reliability and confidence of the claim

• Solution
– Use some devices to “weight” the extracted knowledge
• In order to make more confident use of extracted knowledge
o Hedge, modifier, qualifier
o Confidence markers

Hedges, modifiers, qualifiers…

• Linguistic devices used by authors to qualify their assertions
– Different grammatical categories: verbs, adverbs, adjectives…
– “Copper deficiency is a plausible cause of Alzheimer disease (AD). This hypothesis should be tested with a lengthy trial of copper supplementation”*

• “hedge” was first used by LakoffΔ: “words whose job it is to make things more or less fuzzy”

• HylandΦ and others carried out qualitative studies of these qualifiers
– without modelling them
– or integrating their use for weighting any kind of information in a knowledge extraction system

* Quoted from the abstract of the article with PubMed identifier 17928161

Δ Lakoff, G. (1972). Hedges: A Study in Meaning Criteria and the Logic of Fuzzy Concepts. Chicago Linguistic Society, 8, pp. 183-228.

Φ Hyland, K. (1995). The Author in the Text: Hedging Scientific Writing. Hong Kong Papers in Linguistics and Language Teaching.

Objectives

• Work on confidence markers in scientific articles
– Their use
– Their significance
– Their classification
– Their automatic detection in texts for knowledge weighting purposes

• The main aim was to document the extracted information so that it could be used confidently
E.g.: (Sox2; sensory organ development)
– Sox2 is required for sensory organ development
– Sox2 might be required for sensory organ development
– Sox2 is probably required for sensory organ development
– Our findings suggest that Sox2 is required for sensory organ development
– Doe et al. have demonstrated that Sox2 is required for sensory organ development

Materials

• 3 corpora obtained by querying PubMed

• Lexical resource: WordNet®* is a large lexical database of general (lay) English: nouns, verbs, adjectives and adverbs
– Used to enrich the extracted confidence markers by identifying their synonyms

* WordNet, An Electronic Lexical Database, Christiane Fellbaum ed., (1998), The MIT Press, Cambridge, Mass

Corpus  Query                          Species  Source          Specificity    Number of sentences
CORP1   160 genes + Alzheimer disease  human    PubMed          355 abstracts  817
CORP2   160 genes + Alzheimer disease  human    PubMed Central  68 full texts  27,912
CORP3   160 genes + Alzheimer disease  worm     PubMed          348 abstracts  825

Methods

• Manual collection of confidence markers from CORP1, CORP2 and CORP3

• Enrichment of the list of confidence markers
– Using WordNet®

• Classification of confidence markers according to 2 types of classes

• Addition of the Impact Factor (IF) as another confidence criterion
– Hypothesis: the IF of a journal is subjectively related to the reliability of the biological and medical information published

• Modeling confidence criteria: develop a formula that consistently orders the triplets (representing annotations) by their confidence score
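
The enrichment step listed above can be sketched in Python as a simple set expansion. This is only an illustration: `TOY_SYNONYMS` is a hand-written stand-in for WordNet® lookups (its word groups are borrowed from the presentation's own synonym examples), and `enrich` is a hypothetical helper name.

```python
# Toy stand-in for WordNet synonym lookups; a real run would query WordNet instead.
TOY_SYNONYMS = {
    "anticipate": {"expect"},
    "hypothesise": {"speculate", "expect", "predict", "suspect"},
}


def enrich(markers, synonyms):
    """Return the manual markers plus every synonym found for them, sorted alphabetically."""
    enriched = set(markers)
    for marker in markers:
        # Markers without an entry contribute no extra words.
        enriched |= synonyms.get(marker, set())
    return sorted(enriched)


manual_markers = ["anticipate", "hypothesise", "suggest"]
print(enrich(manual_markers, TOY_SYNONYMS))
```

Using a set keeps a synonym shared by several markers (here "expect") from being counted twice in the enriched list.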

Results

• A list of 250 manually collected confidence markers was generated

• Enrichment using WordNet® increased the number of confidence markers listed to 478

• Classification
– 4 different categories in ascending order of confidence (Type 1)
– 10 distinct qualifiers modifying confidence levels within the Type 1 categories, characterizing subjectivity in texts (Type 2)

Results: Type 1 class

1 - Interrogation or trial and error of the author: knowledge that remains unproven and requires demonstration.
e.g.: “remain to be confirmed”, “has yet to be identified”, “?”

2 - Distance taken by the author from his or her assertions or from the knowledge presented in the text: it may also correspond to a restriction of the knowledge concerned to a specific context (e.g.: the context of the article or experiment).
e.g.: “our findings suggest that”, “in this case we conclude that”, “it is possible that”

3 - Studies by other researchers, references to other works, articles or methods: we assume that if an article is cited, the information it reports is accepted or, at worst, simply believed to be true.
e.g.: “previous observation”, “it is now believed that”, “it has been proposed that”

4 - Demonstration or proof given by the author: this corresponds to work carried out by the author and presented in the article concerned.
e.g.: “we reveal that”, “we show here that”, “our results indicate that”…
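
As a rough illustration of how the four Type 1 classes could be assigned automatically, the Python sketch below looks up the example markers quoted above. It makes several assumptions that are not part of the presentation: plain substring matching, the helper name `type1_class`, and returning the lowest matching class when several markers occur. The “?” marker is left out because a bare substring test would be too coarse for it.

```python
# Example markers copied from the four Type 1 classes; the lookup strategy is an assumption.
TYPE1_MARKERS = {
    1: ["remain to be confirmed", "has yet to be identified"],
    2: ["our findings suggest that", "in this case we conclude that", "it is possible that"],
    3: ["previous observation", "it is now believed that", "it has been proposed that"],
    4: ["we reveal that", "we show here that", "our results indicate that"],
}


def type1_class(sentence):
    """Return the lowest Type 1 class whose markers occur in the sentence, or None."""
    lowered = sentence.lower()
    # Classes are visited from 1 to 4, so the least confident match wins (assumed policy).
    for level, markers in TYPE1_MARKERS.items():
        if any(marker in lowered for marker in markers):
            return level
    return None


print(type1_class("Our findings suggest that Sox2 is required for sensory organ development"))
```

Returning `None` for unmarked sentences leaves the choice of a default class to the caller.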

Results: Type 2 class*

• 10 qualifiers representing probabilities from negation to affirmation, i.e. from the least probable to the most probable

[Figure: scale of the 10 qualifiers, from Confidence − − to Confidence + +]

* Work derived from: Ian Jacobs. 1995. English Modal Verbs

Results: Modeling

• Modeling confidence criteria for their automatic extraction
– Regular expressions are used
• “we anticipate” and “we expect”
we<have>*(<anticipate>+<expect>)
– Synonyms are used
• “we hypothesise” and “we suspect”
we<have>*(<hypothesise>+<speculate>+<expect>+<predict>+<suspect>)
• “have been previously confirmed”, “is now largely confirmed” and “is widely confirmed”
<have>*<be>(previously+now)*(largely+widely+extensively+generally)*<confirm>

Examples matched by we<have>*(<anticipate>+<expect>):

We had anticipated that…

We have anticipated that…

We expect that…

We have expected that…
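
For readers more familiar with standard regular expressions, here is a hedged Python rendering of the first pattern. It assumes that angle brackets denote any inflected form of a lemma, that `+` means alternation and that `*` means optional repetition, which is how the four example sentences read. The inflected forms are spelled out by hand, and the names `HAVE`, `ANTICIPATE`, `EXPECT` and `WE_ANTICIPATE` are illustrative.

```python
import re

# Inflected forms listed by hand; in the slide notation <have> stands for the whole lemma.
HAVE = r"(?:have|has|had)"
ANTICIPATE = r"(?:anticipate[sd]?|anticipating)"
EXPECT = r"(?:expect(?:s|ed|ing)?)"

# Rough equivalent of: we<have>*(<anticipate>+<expect>)
WE_ANTICIPATE = re.compile(
    rf"\bwe\s+(?:{HAVE}\s+)*(?:{ANTICIPATE}|{EXPECT})\b", re.IGNORECASE
)

examples = [
    "We had anticipated that",
    "We have anticipated that",
    "We expect that",
    "We have expected that",
]
for text in examples:
    print(text, "->", bool(WE_ANTICIPATE.search(text)))
```

A lemmatizer or a morphological dictionary would remove the need to enumerate forms by hand, at the cost of an extra dependency.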

Results: Application

• Context of apolipoprotein E gene

[Figure: points per triplet (Gene, Function, PMID)]

Results: Explanations

- ApoE allelic variability influences pupil response to cholinergic challenge and cognitive impairment. [1]

- The Apolipoprotein E (ApoE) epsilon4 allele role in LOD is controversial, while it is still unknown in vascular depression. [2]

- ApoE4 seems to facilitate HSV-1 latency in the brain much more so than ApoE3. [3]

Triplet (Gene / Function / PMID)                           Type 1  Type 2  IF
[1] ApoE / cognitive impairment / 16764677                 4       10      4.091
[2] ApoE epsilon4 allele / vascular depression / 17337010  1       10      2.035
[3] ApoE4 / HSV-1 latency / 16699018                       2       10      5.178

Triplets ordered in descending confidence order:

1 ; 3 ; 2
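
The presentation does not give the scoring formula, so the Python sketch below only illustrates one possible ordering rule: a lexicographic comparison on Type 1, then Type 2, then Impact Factor. That priority is an assumption, as are the names `Triplet` and `confidence_key`. The PMIDs are read from the table with the trailing sentence-reference digit dropped, and the IF decimal commas are read as decimal points.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    label: int            # sentence number used in the slide
    gene: str
    function: str
    pmid: str
    type1: int            # Type 1 class, 1 (lowest confidence) to 4 (highest)
    type2: int            # Type 2 qualifier, 1 to 10
    impact_factor: float


triplets = [
    Triplet(1, "ApoE", "cognitive impairment", "16764677", 4, 10, 4.091),
    Triplet(2, "ApoE epsilon4 allele", "vascular depression", "17337010", 1, 10, 2.035),
    Triplet(3, "ApoE4", "HSV-1 latency", "16699018", 2, 10, 5.178),
]


def confidence_key(t):
    # Assumed priority: Type 1 first, then Type 2, then the journal impact factor.
    return (t.type1, t.type2, t.impact_factor)


# Most confident triplet first.
ranked = sorted(triplets, key=confidence_key, reverse=True)
print([t.label for t in ranked])  # prints [1, 3, 2]
```

With this key the impact factor only breaks ties; a weighted sum of the three criteria would let a high IF offset a lower class, which is a different modeling choice.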

Discussion / Conclusion

• Confidence markers collected manually
– Abstracts
– Full-text articles

• They are extended with the WordNet® resource

• They are classified into 4 categories of Type 1 and 10 categories of Type 2

• This study constitutes preliminary work: the confidence markers will be easily added to the lexico-syntactic patterns already generated for the functional annotation of genes/proteins

• Annotations already present in databases could be additionally documented with confidence markers
– Gene Ontology Annotation files
– Swiss-Prot / UniProt

• The confidence markers can be used by curators to annotate genes/proteins through a system able to detect those qualifiers

Perspectives

• The users of the final system are potentially biologists, curators…

• Take into account, when scoring confidence, the type of study presented in an article
– Observational study (epidemiological)
– Controlled experiment
– Clinical trial…