ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

Extracting patterns of database and software usage from the bioinformatics literature. Geraint Duck, Goran Nenadic, Andy Brass, David L. Robertson and Robert Stevens. The University of Manchester, UK. http://www.cs.man.ac.uk/~duckg/ http://bionerds.sourceforge.net/networks/

Upload: geraintduck

Post on 16-Aug-2015


Category:

Science



TRANSCRIPT

Page 1: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

Extracting patterns of database and software usage from the bioinformatics literature

Geraint Duck, Goran Nenadic, Andy Brass, David L. Robertson and Robert Stevens

The University of Manchester, UK
http://www.cs.man.ac.uk/~duckg/
http://bionerds.sourceforge.net/networks/

Page 2: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

Introduction

• Methods are fundamental to science
   – Judgement
   – Replication
   – Extension
• Methods in bioinformatics:
   – In silico: data and tools
   – Workflows
• Objective representation
• Sharing and reuse

Page 3: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

Bioinformatics

• Resource-focused domain: “Resourceome”
   – Our research suggests:
      • Around 200,000 unique resources in the literature
      • Over 4 million mentions
      • … and still growing!
• Resource/method search and selection…
   – Best-practice
   – Common-practice
• What are the main patterns in bioinformatics resources, and associated methods?

Page 4: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

Approach

• Use the bioinformatics literature (to answer this question)
• Extract database and software mentions
• Combine resources to form pairs
• Combine pairs to form patterns
   – Common-practice
   – Method?

[Figure: example resource pairs; node labels ClustalW, PHYLIP, BLAST, Modeller, PROCHECK]

Page 5: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

Document Collection

• PubMed Central open-access full-text articles
• Bioinformatics[MeSH]
• 22,376 articles
• 67 journals
• 3 journals accounted for > 50% of the total documents

[Figure: chart over the years 1998 to 2014, y-axis from 0 to 4,500; axis labels not recoverable]

Page 6: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

bioNerDS

• bioNerDS
   – Bioinformatics named entity recogniser for databases and software
   – Full-text; mention level
   – Rule-based
   – F-score 63-91%
   – Previously compared resource usage in:
      • Genome Biology
      • BMC Bioinformatics
• Networks filter:
   – 702,937 total mentions
   – 167,697 document-level mentions
   – 31,053 unique names
   – 93% single mention
• Duck et al. (2013) BMC Bioinformatics

http://bionerds.sourceforge.net/

Page 7: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

bioNerDS

Genome Biology
• “Biological” focus
   – GenBank
   – Ensembl
   – GEO
   – GO

BMC Bioinformatics
• “Resource” focus
   – R
   – PDB
   – PubMed

http://bionerds.sourceforge.net/

Page 8: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

Mention Filtering

• Filter out resources mentioned in fewer than 2 documents
   – Removed 25% of mentions
   – Removes less likely names
• Generic resources
   – R
   – Bioconductor
• Categorise to database/software
   – Removed some ‘unknown’ resources
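A minimal sketch of the document-frequency filter described on this slide. The (document ID, resource name) tuple format and the function name are assumptions made for illustration; the slides do not say how bioNerDS output is stored.

```python
from collections import defaultdict


def filter_rare_resources(mentions, min_docs=2):
    """Keep only mentions of resources that occur in at least `min_docs` documents.

    `mentions` is a list of (doc_id, resource_name) tuples (hypothetical format).
    """
    docs_per_resource = defaultdict(set)
    for doc_id, name in mentions:
        docs_per_resource[name].add(doc_id)
    return [(doc_id, name) for doc_id, name in mentions
            if len(docs_per_resource[name]) >= min_docs]


# Toy input: "MyLabTool" is a made-up name that appears in a single document.
mentions = [
    ("doc1", "BLAST"), ("doc1", "BLAST"), ("doc2", "BLAST"),
    ("doc1", "MyLabTool"),
    ("doc2", "ClustalW"), ("doc3", "ClustalW"),
]
kept = filter_rare_resources(mentions)
```

The filter counts distinct documents, not raw mentions, so a name repeated many times inside one article is still dropped.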

Page 9: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

Methods Sections

• Removed resources not in the methods section
   – Method or non-method
• Regular expression based title detection
   – Tested on 100 articles
   – Precision: 97%; Recall: 79%
• Resulting in:
   – 69,466 database mentions (1,711 unique)
   – 65,451 software mentions (3,289 unique)
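A sketch of what regular-expression based title detection could look like. The pattern below is hypothetical; the slides do not give the expressions that were actually used, so the list of accepted titles is an assumption.

```python
import re

# Hypothetical pattern (not the one used in the study): an optional section
# number followed by a common methods-style heading, and nothing else.
METHODS_TITLE = re.compile(
    r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?"            # optional "2." or "2.1" prefix
    r"(?:materials?\s+and\s+methods?"
    r"|methods?(?:\s+and\s+materials?)?"
    r"|methodology"
    r"|experimental\s+procedures?)\s*$",
    re.IGNORECASE,
)


def is_methods_title(title):
    """Return True if a section title looks like a methods heading."""
    return METHODS_TITLE.match(title) is not None


def keep_method_mentions(mentions):
    """Keep mentions whose section title is a methods heading.

    `mentions` is a list of (section_title, resource_name) tuples (assumed format).
    """
    return [(title, name) for title, name in mentions if is_methods_title(title)]


sample = [
    ("Materials and Methods", "BLAST"),
    ("2. Methods", "ClustalW"),
    ("Results and Discussion", "GenBank"),
]
method_only = keep_method_mentions(sample)
```

Anchoring the pattern at both ends favours precision over recall: an unusual heading is missed, but a heading that merely contains the word "method" is not accepted.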

Page 10: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

Extracting Pairs

• Co-occurrence within text
• Two sets of pairs:
   – Software only pairs
   – Database and software pairs (any combination of)
• This provided us with:
   – 22,880 software pairs (13,965 unique)
   – 54,562 database/software pairs (29,066 unique)
• Removed pairs found only within a single document
   – 53% of the software pairs
   – 46% of the database/software pairs
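A minimal sketch of pair extraction and single-document filtering. The slides say only that pairs come from co-occurrence within text; pairing consecutive distinct mentions, and counting documents per unordered pair, are assumptions made here for illustration.

```python
from collections import defaultdict


def extract_pairs(doc_mentions):
    """Build ordered pairs from consecutive distinct mentions (assumed rule).

    `doc_mentions` maps doc_id -> list of resource names in order of mention.
    Returns a list of (doc_id, first, second) tuples.
    """
    pairs = []
    for doc_id, names in doc_mentions.items():
        for first, second in zip(names, names[1:]):
            if first != second:
                pairs.append((doc_id, first, second))
    return pairs


def drop_single_document_pairs(pairs):
    """Remove pairs whose two resources co-occur in only one document."""
    docs = defaultdict(set)
    for doc_id, first, second in pairs:
        docs[frozenset((first, second))].add(doc_id)
    return [p for p in pairs if len(docs[frozenset((p[1], p[2]))]) >= 2]


# Toy input using resource names from the slides; the documents are invented.
doc_mentions = {
    "d1": ["BLAST", "ClustalW", "PHYLIP"],
    "d2": ["BLAST", "ClustalW"],
    "d3": ["ClustalW", "BLAST", "MEGA"],
}
pairs = extract_pairs(doc_mentions)
kept = drop_single_document_pairs(pairs)
```

Direction is preserved in each tuple, so both BLAST then ClustalW and ClustalW then BLAST survive the filter and can be compared afterwards.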

Page 11: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

Common Pairs

• With sufficient data, the most common order of a pairing is the correct one…
• Binomial test: null hypothesis that each order is equally likely
• Two confidence thresholds:
   – 95%
      • 2,518 software pairs (145 unique)
      • 7,001 database/software pairs (297 unique)
   – 99%
      • 1,450 software pairs (55 unique)
      • 3,383 database/software pairs (95 unique)
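A sketch of the binomial test on pair order, written with the standard library only. Whether the original analysis used a one-sided or two-sided test is not stated; a two-sided exact test is assumed here, and the function names are illustrative.

```python
from math import comb


def binomial_two_sided_p(k, n):
    """Exact two-sided p-value for k successes in n trials with p = 0.5.

    With p = 0.5 the distribution is symmetric, so the p-value is the
    probability of an outcome at least as far from n/2 as k is.
    """
    distance = abs(k - n / 2)
    tail = sum(comb(n, i) for i in range(n + 1) if abs(i - n / 2) >= distance)
    return tail / 2 ** n


def dominant_order(a_then_b, b_then_a, confidence=0.95):
    """Return the significantly more common order, or None if neither is."""
    n = a_then_b + b_then_a
    if n == 0:
        return None
    p_value = binomial_two_sided_p(a_then_b, n)
    if p_value > 1 - confidence:
        return None
    return "a->b" if a_then_b > b_then_a else "b->a"


# Invented counts: 18 documents mention A before B, 2 mention B before A.
strong = dominant_order(18, 2, confidence=0.99)
weak = dominant_order(6, 4)
```

A pair with too few observations cannot reach significance regardless of how lopsided it is, which matches the slide's "with sufficient data" caveat.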

Page 12: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

Most Common Pairs

Software only pairs

Directed Pair              Count      %
BLAST → ClustalW             205   14.1
BLAST → PSI-BLAST            103    7.1
Phred → Phrap                 89    6.1
ClustalW → MEGA               77    5.3
Cluster → Tree View           75    5.2
Phrap → Consed                51    3.5
Modeller → PROCHECK           44    3.0
BLAST → ClustalX              43    3.0
ClustalW → PHYLIP             41    2.8
BLAST → MUSCLE                40    2.8

Software and database pairs

Directed Pair              Count      %
GO → KEGG                    350   10.3
BLAST → GO                   195    5.8
BLAST → ClustalW             150    4.4
GEO → GO                     129    3.8
Phred → Phrap                 89    2.6
BLAST → PSI-BLAST             87    2.6
PDB → Modeller                85    2.5
Swiss-Prot → TrEMBL           82    2.4
Ensembl → BioMart             82    2.4
ClustalW → MEGA               77    2.3
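The slide does not define the % column. As a hedged reading, the values are consistent with each count divided by the total number of pairs at the 99% threshold (1,450 software pairs; 3,383 database/software pairs). The check below is arithmetic on the slide's own numbers, not a statement of how the authors computed the column.

```python
# Totals at the 99% threshold, as given on the Common Pairs slide.
SOFTWARE_TOTAL = 1450
DATABASE_SOFTWARE_TOTAL = 3383


def share(count, total):
    """Percentage of `total` represented by `count`, to one decimal place."""
    return round(100 * count / total, 1)


blast_clustalw = share(205, SOFTWARE_TOTAL)        # table lists 14.1
go_kegg = share(350, DATABASE_SOFTWARE_TOTAL)      # table lists 10.3
cluster_treeview = share(75, SOFTWARE_TOTAL)       # table lists 5.2
```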

Page 13: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature


Page 14: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature


Page 15: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature


Page 16: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature


Page 17: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature


Page 18: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature


Page 19: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

Resource Patterns

Databases
• Data sources
• GO is an exception
   – Major sink
   – Data annotation
• Numerous ‘same’ links
   – Enumeration in text?

Software
• Data sinks
• Represents the primary in silico pipeline(s)
• Again, sequence alignment is central
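One way to make "source" and "sink" concrete is to compare each resource's outgoing and incoming pair weight in the directed network. The sketch below assumes that definition; the slides do not state how the roles were determined. The counts are a few rows from the Most Common Pairs table, chosen for illustration only, so the resulting labels should not be read as reproducing the full network analysis.

```python
from collections import Counter


def degree_weights(directed_pairs):
    """Return name -> (outgoing_weight, incoming_weight).

    `directed_pairs` maps (source, target) -> pair count.
    """
    out_w, in_w = Counter(), Counter()
    for (src, dst), count in directed_pairs.items():
        out_w[src] += count
        in_w[dst] += count
    return {name: (out_w[name], in_w[name]) for name in set(out_w) | set(in_w)}


def classify(out_weight, in_weight):
    """Label a resource by whichever direction carries more weight."""
    if out_weight > in_weight:
        return "source"
    if in_weight > out_weight:
        return "sink"
    return "balanced"


# Illustrative subset of the database/software table (not the whole network).
pairs = {
    ("BLAST", "GO"): 195,
    ("GEO", "GO"): 129,
    ("PDB", "Modeller"): 85,
    ("Ensembl", "BioMart"): 82,
}
roles = {name: classify(*w) for name, w in degree_weights(pairs).items()}
```

On this subset, the databases GEO, PDB and Ensembl come out as sources while GO collects incoming links, in line with the slide's description of GO as the exception.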

Page 20: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

Patterns through Time

2004 to 2006

2007 to 2009

Page 21: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

Patterns through Time

2010 to 2012

Page 22: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

Phylogenetics Patterns

• Case-study…
• Eales et al. (2008) BMC Bioinformatics, 9, 359
   – Mapped phylogenetics methods into 4 steps:
      • Sequence Alignment
      • Tree Inference
      • Statistical Testing
      • Tree Visualisation
   – Using the same corpus selection, we built a network…
      • PubMed search for “phylogen*” in titles or abstracts

Page 23: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

Phylogenetics Patterns

Page 24: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

Phylogenetics Patterns

• Our automated extraction can recreate these steps
   – Given some ambiguous resources
• Encouraging…
   – Viable in silico pattern extraction
   – “Common practice”
• Next step: Apply this to other (sub-)domains

Page 25: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

Conclusion

• Can extract patterns of resource usage
   – Can we describe the method through these?
• High-level overview of common-practice
   – With lower thresholds, can access resources specific (but “common”) to different subdomains
   – Not best-practice…
• Workflows?
   – Requires increased granularity
   – Could help inform their creation

Page 26: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature

Thank-you

• Acknowledgements
   – Co-authors
   – Manchester IT Services
      • Computational facilities
   – Funding:
   – Travel: