PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

Post on 26-Aug-2014

DESCRIPTION

Presentation on Whatizit, LexEBI, IeXML, CALBC, SESL, PaperMaker

TRANSCRIPT

  • PaperMaker: Validation of biomedical scientific publications. January 19th, 2011. Workshop: BeyondThePdf. Dietrich Rebholz-Schuhmann, MD, PhD, Group Leader, Rebholz Group, European Bioinformatics Institute
  • Publishing is about ...
      • Agreeing / disagreeing about current science: only peer review can judge current science.
      • Bringing new results: conceptual results are more difficult than new data.
      • Gaining new knowledge: new data and new results can imply new knowledge of which even the author is still unaware.
      • Rewarding the scientist: count whatever you can count that could have an impact. Validating the scientist's claim is the key reward. Any scientist can fool any system, but (hopefully) only short-term.
    20.01.2011 Literature and Text Mining BioCreative III, Rebholz
  • Future of biomedical text mining. Working towards ...
      • Literature integration, to have it full-fledged as part of bioinformatics data resources
      • Cross-domain support, to deliver the content to different scientific communities
      • Provenance, to carry credit of findings into analytical biomedical research
      • Inference & reasoning, to make use of the full semantic support in the scientific literature
  • Literature content in the Semantic Web
  • Terminologies vs. Ontologies
      • Ontological resources: explicit semantics; manual generation; consistency, inference, reasoning; interoperability with all semantic resources; working towards a reasoning infrastructure
      • Database type resource building: terminologies, collection of terms; automatic generation; exploitation of terminological features; standardisation of TM solutions; interoperability with database resources
  • Efforts in the Rebholz group towards interoperability of literature with bioinformatics
      • Whatizit infrastructure: biomedical NER as a public, large-scale service
      • LexEBI / BioLexicon (collab. w. NaCTeM, Pisa-U): biomedical terminological resource, standardisation of semantics
      • IeXML (BioLink SIG 2006, Brazil): put the annotations into the document (inline annotations)
      • CALBC project: collaborative annotation of a large-scale biomedical corpus
      • UKPMC, U.K. PubMed Central (collab. w. NaCTeM, BL): use of Whatizit, BioLexicon, IeXML, CALBC alignments for the delivery of quality annotation services to the public
      • SESL project: joint project with pharma & publishers, literature content in a triple store
      • PaperMaker: validation of the scientific literature against the above
  • 1. Whatizit
  • Integrating biomedical literature and data. Rebholz-Schuhmann, D., et al. Text Processing through Web Services: Calling Whatizit. Bioinformatics 24, no. 2 (2008): 296-98.
  • 2. BioLexicon / LexEBI
  • LexEBI: content

      Group          Resource          # Labels   # Variants   Total       Total / Labels   # Unique terms   Uniq. T. / Labels
      Prot. / Gene   GP 7.0            516,113    4,005,040    4,521,153   8.76             1,726,853        3.35
      Prot. / Gene   GP 6.0            488,577    3,389,316    3,877,893   7.94             1,564,436        3.20
      Chemicals      Jochem            278,578    1,691,980    1,970,558   7.07             1,527,752        5.48
      Chemicals      ChEBI             19,645     94,748       114,393     5.82             101,307          5.16
      Chemicals      ChEBI (all)       549,838    1,187,322    1,737,160   3.16             -                -
      Other          Enzymes           4,905      8,082        12,987      2.65             12,377           2.52
      Other          Species           643,280    199,130      842,410     1.31             838,135          1.30
      Other          Interpro          20,671     0            20,671      1.00             20,671           1.00
      UMLS           Antineuro., Neo   4,718      6,488        11,206      2.38             -                -
      UMLS           Bio. Act.         54,148     87,209       141,357     2.61             -                -
      UMLS           Enzymes           26,065     56,332       82,397      3.16             -                -
      UMLS           Lipid, Carb.      11,518     9,770        21,288      1.85             -                -
      UMLS           Pharm. Act.       104,201    123,840      228,041     2.19             -                -
      UMLS           Vit., Horm.       6,877      10,258       17,135      2.49             -                -
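The derived columns of the LexEBI table appear to follow from the raw counts: the figures are consistent with Total = # Labels + # Variants, and with both ratios being divided by # Labels and rounded to two decimals. A minimal Python sketch of that arithmetic follows. The class and field names are illustrative assumptions, not part of LexEBI, and only three rows are reproduced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LexiconRow:
    """One row of the LexEBI content table (field names are hypothetical)."""
    resource: str
    labels: int
    variants: int
    unique_terms: Optional[int] = None  # not reported for every resource

    @property
    def total(self) -> int:
        # Total terms = preferred labels + term variants
        return self.labels + self.variants

    @property
    def total_per_label(self) -> float:
        return round(self.total / self.labels, 2)

    @property
    def unique_per_label(self) -> Optional[float]:
        if self.unique_terms is None:
            return None
        return round(self.unique_terms / self.labels, 2)


# Raw counts copied from the slide.
rows = [
    LexiconRow("GP 7.0", 516_113, 4_005_040, 1_726_853),
    LexiconRow("ChEBI", 19_645, 94_748, 101_307),
    LexiconRow("ChEBI (all)", 549_838, 1_187_322),
]

for row in rows:
    print(row.resource, row.total, row.total_per_label, row.unique_per_label)
```

Read this way, "Total / Labels" is the average number of surface forms per entry, so the gene and protein resources carry far more variants per label than, for example, the species list.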
  • 3. IeXML
  • IeXML: Annotating entities in text
      • Inline annotations can be attached to any part of the document
      • No hassle with character or byte counts or layout modifications to the document
      • Alignment of annotated documents to compare annotations, validate annotations, and harmonise annotations (SESL project)
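To illustrate why inline annotation makes alignment straightforward, here is a minimal Python sketch. It assumes entities are wrapped in an `e` element carrying an `id` attribute. The element name, the `id` values, the example sentence, and the `spans` helper are all illustrative assumptions, not taken from the slides or from the IeXML specification. Two annotated copies of the same sentence are reduced to plain text plus annotation spans, so they can be compared without either annotator tracking character offsets by hand.

```python
import xml.etree.ElementTree as ET


def spans(xml_text):
    """Return (plain_text, {(start, end, id), ...}) for an inline-annotated sentence."""
    root = ET.fromstring(xml_text)
    parts = []
    found = set()
    pos = 0

    def walk(node):
        nonlocal pos
        start = pos
        if node.text:
            parts.append(node.text)
            pos += len(node.text)
        for child in node:
            walk(child)
            if child.tail:  # text following the child, still inside this node
                parts.append(child.tail)
                pos += len(child.tail)
        if node.tag == "e":  # hypothetical entity element
            found.add((start, pos, node.get("id")))

    walk(root)
    return "".join(parts), found


# Two hypothetical annotators marking up the same sentence.
doc_a = ('<s>Mutations in <e id="gene:BRCA1">BRCA1</e> raise '
         '<e id="disease:breast_cancer">breast cancer</e> risk.</s>')
doc_b = ('<s>Mutations in <e id="gene:BRCA1">BRCA1</e> raise breast '
         '<e id="disease:cancer">cancer</e> risk.</s>')

plain_a, spans_a = spans(doc_a)
plain_b, spans_b = spans(doc_b)

# Alignment is only meaningful if both copies carry the same underlying text.
assert plain_a == plain_b

shared = spans_a & spans_b          # annotations both copies agree on
only_a = spans_a - spans_b          # candidates for validation or harmonisation
only_b = spans_b - spans_a
print(shared, only_a, only_b)
```

Because the tags live inside the text, removing them always recovers the original sentence, which is the property that lets differently annotated copies be aligned and their disagreements listed.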
  • 4. CALBC
  • The challenge: 150,000 documents or more ... Test set for all systems. Assessment, benchmarking.
  • CALBC Challenge II
      • (1) 75,000 documents training data
      • (2) 175,000 testing data
      • (3) Additional 700,000 testing data
      • September 13th, 2010: Second harmonized corpus available for CALBC Challenge II
      • December 15th, 2010: Challenge II closes
      • March 2011: CALBC Workshop II
      • June 30th, 2011: Final harmonized corpus available
  • 5. UKPMC / Elixir
  • UKPMC: ~10% the size of PubMed
  • 6. SESL
  • SESL Project: from publisher to pharma Multiple Consumers Disease Knowledge Dossier Applications Service Layer (RDF, Web 2.0) Std PublicOpen Common Assertions, SPARQL, Triple Store VocabulariesStan- Service Integration, Inf