eccb 2014: extracting patterns of database and software usage from the bioinformatics literature

Extracting patterns of database and software usage from the bioinformatics literature

Geraint Duck, Goran Nenadic, Andy Brass, David L. Robertson and Robert Stevens

The University of Manchester, UK
http://www.cs.man.ac.uk/~duckg/
http://bionerds.sourceforge.net/networks/

Introduction

•  Methods are fundamental to science
   –  Judgement
   –  Replication
   –  Extension

•  Methods in bioinformatics:
   –  In silico: data and tools
   –  Workflows

•  Objective representation
•  Sharing and reuse


Bioinformatics

•  Resource-focused domain: “Resourceome”
   –  Our research suggests:
      •  Around 200,000 unique resources in the literature
      •  Over 4 million mentions
      •  … and still growing!

•  Resource/method search and selection…
   –  Best-practice
   –  Common-practice

•  What are the main patterns in bioinformatics resources, and associated methods?

Approach

•  Use bioinformatics literature (to answer this question)
•  Extract database and software mentions
•  Combine resources to form pairs
•  Combine pairs to form patterns
   –  Common-practice
   –  Method?


[Figure: example resource labels ClustalW, PHYLIP, BLAST, Modeller, PROCHECK]

Document Collection

•  PubMed Central open-access full-text articles
•  Bioinformatics[MeSH]
•  22,376 articles
•  67 journals
•  3 journals accounted for > 50% of the total documents


[Figure: chart with year on one axis (1998 to 2014, in two-year steps) and a count on the other (0 to 4,500, in steps of 500); axis labels not recoverable]

bioNerDS

•  bioNerDS
   –  Bioinformatics named entity recogniser for databases and software
   –  Full-text; mention level
   –  Rule-based
   –  F-score 63-91%
   –  Previously compared resource usage in:
      •  Genome Biology
      •  BMC Bioinformatics

•  Networks filter:
   –  702,937 total mentions
   –  167,697 document level mentions
   –  31,053 unique names
   –  93% single mention

•  Duck et al. (2013) BMC Bioinformatics

http://bionerds.sourceforge.net/

bioNerDS

Genome Biology
•  “Biological” focus
   –  GenBank
   –  Ensembl
   –  GEO
   –  GO

BMC Bioinformatics
•  “Resource” focus
   –  R
   –  PDB
   –  PubMed


Mention Filtering

•  Filter out resources mentioned in fewer than 2 documents
   –  Removed 25% of mentions
   –  Removes less likely names

•  Generic resources
   –  R
   –  Bioconductor

•  Categorise as database/software
   –  Removed some ‘unknown’ resources
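
The document-count filter above can be sketched roughly as follows. This is a minimal illustration, not the bioNerDS implementation: the `(doc_id, resource_name)` tuple layout, the function name and the example mentions (including the made-up name "MyLabTool") are all assumptions.

```python
from collections import defaultdict

def filter_by_document_count(mentions, min_docs=2):
    """Keep mentions whose resource name occurs in at least min_docs documents.

    mentions: iterable of (doc_id, resource_name) tuples (hypothetical layout).
    """
    mentions = list(mentions)
    docs_per_resource = defaultdict(set)
    for doc_id, name in mentions:
        docs_per_resource[name].add(doc_id)
    return [(doc_id, name) for doc_id, name in mentions
            if len(docs_per_resource[name]) >= min_docs]

# Invented example data: "MyLabTool" appears in one document only.
example_mentions = [
    ("doc1", "BLAST"), ("doc1", "BLAST"), ("doc1", "MyLabTool"),
    ("doc2", "BLAST"), ("doc2", "ClustalW"), ("doc3", "ClustalW"),
]
kept = filter_by_document_count(example_mentions)
```

Counting documents rather than raw mentions means a name repeated many times in a single article is still dropped, which matches the stated aim of removing less likely names.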


Methods Sections

•  Removed resources not in the methods section
   –  Method or non-method

•  Regular expression based title detection
   –  Tested on 100 articles
   –  Precision: 97%; Recall: 79%

•  Resulting in:
   –  69,466 database mentions (1,711 unique)
   –  65,451 software mentions (3,289 unique)
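
As an illustration of regular expression based title detection, a sketch might look like the one below. The slides do not give the expressions actually used, so the pattern, the list of accepted titles and the function name are hypothetical.

```python
import re

# Hypothetical pattern: the expressions used in the study are not shown in the slides.
METHODS_TITLE = re.compile(
    r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?"   # optional section number, e.g. "2." or "2.1"
    r"(?:materials?\s+and\s+methods?"
    r"|methods?(?:\s+and\s+materials?)?"
    r"|methodology"
    r"|experimental\s+procedures?)\s*$",
    re.IGNORECASE,
)

def is_methods_title(title):
    """Return True if a section title looks like a methods heading."""
    return METHODS_TITLE.match(title) is not None

titles = ["Background", "2. Methods", "Materials and Methods", "Results"]
methods_titles = [t for t in titles if is_methods_title(t)]
```

A strict whole-title match like this one favours precision over recall, which is in line with the reported figures (97% precision, 79% recall), although the real pattern may differ.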


Extracting Pairs

•  Co-occurrence within text
•  Two sets of pairs:
   –  Software only pairs
   –  Database and software pairs (any combination of)

•  This provided us with:
   –  22,880 software pairs (13,965 unique)
   –  54,562 database/software pairs (29,066 unique)

•  Removed pairs found only within a single document
   –  53% of the software pairs
   –  46% of the database/software pairs
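
One way to picture pair extraction and the single-document filter is the sketch below. The slides only say "co-occurrence within text", so treating a pair as two consecutive, distinct mentions in text order is an assumption, as are the data layout, the function names and the example documents.

```python
from collections import defaultdict

def extract_pairs(doc_mentions):
    """Return (doc_id, first, second) for consecutive distinct mentions.

    doc_mentions: dict of doc_id -> resource names in text order (assumed layout).
    Pairing consecutive mentions is an assumption, not the documented method.
    """
    pairs = []
    for doc_id, names in doc_mentions.items():
        for first, second in zip(names, names[1:]):
            if first != second:
                pairs.append((doc_id, first, second))
    return pairs

def drop_single_document_pairs(pairs):
    """Drop pairings (ignoring order) that occur in only one document."""
    docs = defaultdict(set)
    for doc_id, a, b in pairs:
        docs[frozenset((a, b))].add(doc_id)
    return [(d, a, b) for d, a, b in pairs if len(docs[frozenset((a, b))]) > 1]

# Invented example documents.
example_docs = {
    "doc1": ["BLAST", "ClustalW", "PHYLIP"],
    "doc2": ["BLAST", "ClustalW"],
    "doc3": ["Modeller", "PROCHECK"],
}
all_pairs = extract_pairs(example_docs)
shared_pairs = drop_single_document_pairs(all_pairs)
```

The filter ignores order on purpose, so a pairing seen as A then B in one article and B then A in another still counts as occurring in two documents. Direction is left to the later ordering test.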

Common Pairs

•  With sufficient data, the most common order of a pairing is the correct one…

•  Binomial test: null hypothesis that each order is equally likely
•  Two confidence thresholds:
   –  95%
      •  2,518 software pairs (145 unique)
      •  7,001 database/software pairs (297 unique)
   –  99%
      •  1,450 software pairs (55 unique)
      •  3,383 database/software pairs (95 unique)
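
The ordering test can be sketched with an exact binomial test against p = 0.5. Whether the study used a one-sided or two-sided test is not stated, so the two-sided form below is an assumption, and the counts in the example are invented for illustration.

```python
from math import comb

def order_p_value(n_ab, n_ba):
    """Two-sided exact binomial p-value under 'each order is equally likely'.

    n_ab: times A was mentioned before B; n_ba: times B was mentioned before A.
    With p = 0.5 the distribution is symmetric, so the two-sided p-value is
    twice the upper tail at the larger count, capped at 1.
    """
    n = n_ab + n_ba
    k = max(n_ab, n_ba)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)

def preferred_order(a, b, n_ab, n_ba, confidence=0.95):
    """Return the dominant ordered pair, or None if the order is not significant."""
    if n_ab == n_ba or order_p_value(n_ab, n_ba) > 1 - confidence:
        return None
    return (a, b) if n_ab > n_ba else (b, a)

# Invented counts: 6 occurrences in one order, none in the other.
at_95 = preferred_order("Phred", "Phrap", 6, 0, confidence=0.95)
at_99 = preferred_order("Phred", "Phrap", 6, 0, confidence=0.99)
```

With six occurrences all in one direction the two-sided p-value is 2/64 = 0.03125, so the order passes at 95% but not at 99%. This shows why the stricter threshold leaves far fewer unique pairs.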

Most Common Pairs

Software only pairs

Directed Pair              Count      %
BLAST → ClustalW             205   14.1
BLAST → PSI-BLAST            103    7.1
Phred → Phrap                 89    6.1
ClustalW → MEGA               77    5.3
Cluster → Tree View           75    5.2
Phrap → Consed                51    3.5
Modeller → PROCHECK           44    3.0
BLAST → ClustalX              43    3.0
ClustalW → PHYLIP             41    2.8
BLAST → MUSCLE                40    2.8

Software and database pairs

Directed Pair              Count      %
GO → KEGG                    350   10.3
BLAST → GO                   195    5.8
BLAST → ClustalW             150    4.4
GEO → GO                     129    3.8
Phred → Phrap                 89    2.6
BLAST → PSI-BLAST             87    2.6
PDB → Modeller                85    2.5
Swiss-Prot → TrEMBL           82    2.4
Ensembl → BioMart             82    2.4
ClustalW → MEGA               77    2.3


Resource Patterns

Databases
•  Data sources
•  GO is an exception
   –  Major sink
   –  Data annotation
•  Numerous ‘same’ links
   –  Enumeration in text?

Software
•  Data sinks
•  Represents the primary in silico pipeline(s)
•  Again, sequence alignment is central
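
The source and sink roles can be read off a directed pair network by comparing incoming and outgoing edge weights. The sketch below uses a handful of directed pairs and counts taken from the "Most Common Pairs" tables. The selection of edges, the function name and the strict definitions (a source has no incoming edges, a sink has no outgoing edges) are assumptions, and a subset this small is not representative of the full network.

```python
from collections import Counter

# (source, target, count): a small subset of the directed pairs listed in the slides.
edges = [
    ("BLAST", "GO", 195),
    ("GEO", "GO", 129),
    ("PDB", "Modeller", 85),
    ("Modeller", "PROCHECK", 44),
    ("Ensembl", "BioMart", 82),
]

def degree_weights(edges):
    """Sum edge counts flowing into and out of each resource."""
    in_w, out_w = Counter(), Counter()
    for source, target, count in edges:
        out_w[source] += count
        in_w[target] += count
    return in_w, out_w

in_w, out_w = degree_weights(edges)
nodes = set(in_w) | set(out_w)
sources = sorted(n for n in nodes if in_w[n] == 0)   # nothing points at them
sinks = sorted(n for n in nodes if out_w[n] == 0)    # they point at nothing
```

On this subset GO comes out as a sink with an incoming weight of 324, which echoes the "major sink" observation. Modeller is neither, since it has edges in both directions.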


Patterns through Time

2004 to 2006

2007 to 2009

Patterns through Time

2010 to 2012

Phylogenetics Patterns

•  Case-study…
•  Eales et al. (2008) BMC Bioinformatics, 9, 359
   –  Mapped phylogenetics methods into 4 steps:
      •  Sequence Alignment
      •  Tree Inference
      •  Statistical Testing
      •  Tree Visualisation
   –  Using the same corpus selection, we built a network…
      •  PubMed search for “phylogen*” in titles or abstracts

•  Our automated extraction can recreate these steps
   –  Given some ambiguous resources

•  Encouraging…
   –  Viable in silico pattern extraction
   –  “Common practice”

•  Next step: Apply this to other (sub-)domains


Conclusion

•  Can extract patterns of resource usage
   –  Can we describe the method through these?

•  High level overview of common-practice
   –  With lower thresholds, can access resources specific (but “common”) to different subdomains
   –  Not best-practice…

•  Workflows?
   –  Requires increased granularity
   –  Could help inform their creation


Thank-you

•  Acknowledgements
   –  Co-authors
   –  Manchester IT Services
      •  Computational facilities
   –  Funding:
   –  Travel:


