free software and bioinformatics

Free software and biomedical research

Alberto Labarga

[email protected]

When Craig Venter was asked, “What makes you think you can do a better job with life and genetics than God?”, he answered…

we have computers

and software too

free software!

biology is a data intensive science

Scientific information available in 2010 will double every 72 hours

data mining

my data is mine!

and your data is mine, too!

open source, open data, open access

open science

Comparative genomics

Sequence (DNA/RNA) & phylogeny

Regulation of gene expression; transcription factors & micro RNAs

Protein sequence analysis & evolution

Protein families, motifs and domains

Protein structure & function: computational crystallography

Protein interactions & complexes: modelling and prediction

Chemical biology

Pathway analysis

Systems modelling

Image analysis

Data integration & literature mining

The first Atlas of Protein Sequence and Structure presented information about 65 proteins.

In 1981 the EMBL Nucleotide Sequence Data Library was created. Version 2 was composed of 811 sequences, around 1 million bases, entered by hand.

Smith TF, Waterman MS (1981). "Identification of common molecular subsequences". J Mol Biol. 147 (1): 195‐7.

Altschul SF, et al. (1990). "Basic Local Alignment Search Tool". J Mol Biol. 215 (3): 403‐10. 15,306 citations

Thompson J, Gibson T, Higgins D (1994). "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment". Nucleic Acids Res. 22: 4673‐4680.

In 1995 the European Bioinformatics Institute (EBI) was created.

EMBOSS (The European Molecular Biology Open Software Suite) is a free, open source software analysis package specially developed for the needs of the molecular biology user community. It provides a comprehensive set of sequence analysis programs.

The first requirements were based on a list of long‐standing problems in existing commercial software (GCG) and the need for public source code.

Within EMBOSS you will find around 200 programs (applications).

Current version is 6.0.1

http://emboss.sourceforge.net/

Main Programs in EMBOSS

• Retrieve sequences from databases

• Sequence alignment

• Nucleic gene finding and translation

• Protein secondary structure prediction

• Rapid database searching with sequence patterns

• Protein motif identification, including domain analysis

• Nucleotide sequence pattern analysis, for example to identify CpG islands or repeats

• Codon usage analysis for small genomes

• Rapid identification of sequence patterns in large scale sequence sets

• Presentation tools for publication

open‐bio.org

• The Open Bioinformatics Foundation is a non‐profit, volunteer‐run organization focused on supporting open source programming in bioinformatics.

• Its main activities are:

– Underwriting and supporting the BOSC conferences

– Organizing and supporting developer‐centric "hackathon" events (Bio*)


O’Reilly Books and Conferences

http://www.ensembl.org

http://www.uniprot.org

Generic Model Organism Database project

http://gmod.org

DAS Concept

Reference server

Annotation server A

Annotation server B

Annotation server C

Client

http://www.biodas.org

DAS Server

• DAS request to retrieve features on a segment:

• http://das.ensembl.org/das/ens_36_omim_genes/features?segment=1:1,1000000

• Result:
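A DAS server answers a features request with an XML document (DASGFF). The sketch below rebuilds the request URL shown above and parses a features response. The sample response is made up for illustration (it is not real Ensembl output), and the function names `das_features_url` and `parse_features` are hypothetical helpers, not part of any DAS library.

```python
import xml.etree.ElementTree as ET


def das_features_url(server, source, segment_id, start, stop):
    """Build a DAS 'features' request for one segment of a reference sequence."""
    return f"{server}/{source}/features?segment={segment_id}:{start},{stop}"


# Illustrative response only: element names follow the DASGFF layout,
# but the features themselves are invented for this example.
SAMPLE_RESPONSE = """
<DASGFF>
  <GFF version="1.0" href="http://example.org/das/demo/features">
    <SEGMENT id="1" start="1" stop="1000000">
      <FEATURE id="gene_a" label="Gene A">
        <TYPE id="gene">gene</TYPE>
        <START>1500</START>
        <END>4200</END>
      </FEATURE>
      <FEATURE id="gene_b" label="Gene B">
        <TYPE id="gene">gene</TYPE>
        <START>90000</START>
        <END>97500</END>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>
"""


def parse_features(xml_text):
    """Return (id, label, start, end) for every FEATURE in a DASGFF document."""
    root = ET.fromstring(xml_text)
    features = []
    for feature in root.iter("FEATURE"):
        features.append((
            feature.get("id"),
            feature.get("label"),
            int(feature.findtext("START")),
            int(feature.findtext("END")),
        ))
    return features


url = das_features_url("http://das.ensembl.org/das", "ens_36_omim_genes", "1", 1, 1000000)
features = parse_features(SAMPLE_RESPONSE)
```

Because every annotation server returns the same layout, a client can merge features from several servers onto one reference sequence with the same parsing code.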

Das viewer

http://www.ebi.ac.uk/dasty/

Applied Biosystems ABI 3730XL

Illumina / Solexa Genome Analyzer

Applied Biosystems SOLiD

Roche / 454 Genome Sequencer

1 Mb/day 100 Mb/run 3000 Mb/run

• Sequencing: fragment assembly problem; the Shortest Superstring Problem; Velvet (Zerbino, 2008)

• Gene finding: Hidden Markov Models, pattern recognition methods; GenScan (Burge & Karlin, 1997)

• Sequence comparison: pairwise and multiple sequence alignments; dynamic programming algorithms, heuristic methods; PSI‐BLAST (Altschul et al., 1997), SSAHA (2001), MUMmerGPU (2008)
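To make the Shortest Superstring view of fragment assembly concrete, here is a minimal sketch of the classic greedy heuristic: repeatedly merge the two fragments with the longest suffix/prefix overlap. This is a textbook approximation chosen for illustration, not the method used by Velvet (which builds de Bruijn graphs), and the reads are invented.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for size in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_superstring(reads):
    """Greedy approximation of the shortest string containing every read."""
    unique = set(reads)
    # Reads fully contained in another read add nothing to the superstring.
    frags = sorted(r for r in unique
                   if not any(r != other and r in other for other in unique))
    while len(frags) > 1:
        best_size, best_i, best_j = -1, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    size = overlap(a, b)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        # Merge the best pair, keeping the overlapping part only once.
        merged = frags[best_i] + frags[best_j][best_size:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
assembly = greedy_superstring(reads)
print(assembly)
```

The pairwise overlap search is quadratic in the number of fragments at every step, which is one reason short‐read assemblers moved to graph‐based formulations.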

Genomes

Nucleotides

Proteins

Structures

Other molecules

Interactions

Experiments

Literature

Ontologies

http://www.ebi.ac.uk/Databases/

Practical course on databases and the integration of biological information

Challenges of Data Integration

• Different types of data (sequence, function, literature etc.)

• Different data formats (FASTA, EMBL, Genbank, tab delimited etc.)

• Different storage formats (ASCII flatfile, XML, RDBMS)

• No standard formats for common fields (citations, descriptions, dates etc.)

• Volume and size of data
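As a small illustration of the format problem, the sketch below reads FASTA text and rewrites it as tab‐delimited lines (identifier, description, sequence). The two records are invented, and the helper names `parse_fasta` and `to_tab_delimited` are hypothetical; real projects would normally rely on an established parser such as those in the Bio* toolkits.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def to_tab_delimited(records):
    """Render records as 'identifier<TAB>description<TAB>sequence' lines."""
    lines = []
    for header, sequence in records:
        identifier, _, description = header.partition(" ")
        lines.append("\t".join([identifier, description, sequence]))
    return "\n".join(lines)


# Invented example records, wrapped over several lines as FASTA allows.
EXAMPLE = """>seq1 first example protein
MKTAYIAKQR
QISFVKSHFS
>seq2 second example protein
MRCSISLVLG
"""

records = parse_fasta(EXAMPLE)
table = to_tab_delimited(records)
```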

BioMart is a simple and robust data integration system for large scale data querying, providing researchers with fast and flexible access to biological databases

http://www.biomart.org/

Web Services

http://www.ebi.ac.uk/Tools/

Challenges when using tools in unison

• Manually transfer data from one application to another

• Understand disparate data formats

• Convert file formats where appropriate

• Manage and understand disparate application environments e.g. web browser, desktop application

[Figure: dataflow workflow linking web services (ws); labels: curation, submission]

REST: REpresentational State Transfer

http://www.ebi.ac.uk/Tools/webservices/rest/dbfetch/uniprot/slpi_human

GET, POST

HTML,XML,PNG

RESTful web services

Any web page is a web service

http://www.ebi.ac.uk/cgi-bin/dbfetch?db=uniprot&id=alk1_human&style=html&format=default

Friendly URL and XML documents

• http://www.ebi.ac.uk/Tools/webservices/rest/dbfetch/uniprot/slpi_human

• http://www.ebi.ac.uk/Tools/webservices/rest/dbfetch/uniprot/slpi_human/xml

• http://www.ebi.ac.uk/Tools/webservices/rest/dbfetch/uniprot/slpi_human/fasta
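The friendly URLs above follow a simple pattern: base address, database, entry identifier and an optional format. A minimal sketch of a client built on that pattern follows. The helper names `dbfetch_url` and `fetch_entry` are hypothetical, and the addresses are the ones from these slides, so they may no longer be served.

```python
from urllib.request import urlopen

# Base address as shown on the slides (an assumption that it is still reachable).
BASE = "http://www.ebi.ac.uk/Tools/webservices/rest/dbfetch"


def dbfetch_url(db, entry_id, fmt=None):
    """Compose a friendly dbfetch URL: BASE/db/id[/format]."""
    parts = [BASE, db, entry_id]
    if fmt:
        parts.append(fmt)
    return "/".join(parts)


def fetch_entry(db, entry_id, fmt=None):
    """Retrieve an entry with a plain HTTP GET (requires network access)."""
    with urlopen(dbfetch_url(db, entry_id, fmt)) as response:
        return response.read().decode("utf-8")


default_url = dbfetch_url("uniprot", "slpi_human")
fasta_url = dbfetch_url("uniprot", "slpi_human", "fasta")
```

A call such as `fetch_entry("uniprot", "slpi_human", "fasta")` is all a script needs: no client library, only HTTP GET.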

Biomart query

<Query virtualSchemaName="central_server_1">
  <Dataset name="hsapiens_gene_ensembl">
    <Attribute name="ensembl_gene_id"/>
    <Attribute name="ensembl_transcript_id"/>
    <Filter name="chromosome_name" value="1"/>
    <Filter name="band_end" value="p36.33"/>
    <Filter name="band_start" value="q44"/>
  </Dataset>
  <Dataset name="msd">
    <Attribute name="pdb_id"/>
    <Attribute name="experiment_type"/>
    <Filter name="experiment_type" value="NMR"/>
  </Dataset>
</Query>
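A query document like this one can also be generated from a script instead of being typed by hand. The sketch below assembles the same structure with Python's standard XML library. The function name `build_query` and its argument layout are hypothetical, and sending the document to a BioMart server is left out.

```python
import xml.etree.ElementTree as ET


def build_query(datasets, virtual_schema="central_server_1"):
    """Build a BioMart-style query document.

    datasets: list of (dataset_name, attribute_names, filters) tuples,
    where filters maps a filter name to its value.
    """
    query = ET.Element("Query", virtualSchemaName=virtual_schema)
    for name, attributes, filters in datasets:
        dataset = ET.SubElement(query, "Dataset", name=name)
        for attribute in attributes:
            ET.SubElement(dataset, "Attribute", name=attribute)
        for filter_name, value in filters.items():
            ET.SubElement(dataset, "Filter", name=filter_name, value=value)
    return ET.tostring(query, encoding="unicode")


# Dataset, attribute and filter names are copied from the query on the slide.
xml_query = build_query([
    ("hsapiens_gene_ensembl",
     ["ensembl_gene_id", "ensembl_transcript_id"],
     {"chromosome_name": "1", "band_end": "p36.33", "band_start": "q44"}),
    ("msd",
     ["pdb_id", "experiment_type"],
     {"experiment_type": "NMR"}),
])
print(xml_query)
```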

SOAP: Simple Object Access Protocol

fetchData(uniprot, wap_rat, default, xml)

SOAP services

fetchData (db, id, format, style)

entry

wsdbfetch

Perl client

use SOAP::Lite;

# Build a client from the service description (WSDL)
my $WSDL = 'http://www.ebi.ac.uk/Tools/webservices/wsdl/WSDbfetch.wsdl';
my $soap = SOAP::Lite->service($WSDL);

# fetchData dbName:id <format> <style>
my $result = $soap->fetchData('uniprot:wap_rat', 'default', 'raw');
die $soap->call->faultstring if $soap->call->fault;

# Print the entry line by line
foreach my $i (@$result) { print "$i\n"; }

EBI web services (analysis tools)

jobid

getResults (jobid)

results available

checkStatus (jobid)

status

run(params, data)

poll (jobid, type)

result file
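The submit, check and retrieve cycle in this diagram can be sketched independently of SOAP. The example below uses a hypothetical in‐memory `FakeJobService` as a stand‐in for the remote tool (it is not the EBI API), so the polling logic can be run without a network; the status names RUNNING, NOT_FOUND and DONE are the ones used by the Perl client on the next slides.

```python
import time


class FakeJobService:
    """Hypothetical stand-in for a remote analysis service (not the EBI API)."""

    def __init__(self, checks_until_done=2):
        self._checks_until_done = checks_until_done
        self._remaining = {}
        self._counter = 0

    def run(self, params, data):
        """Submit a job and return its identifier immediately."""
        self._counter += 1
        job_id = f"job-{self._counter}"
        self._remaining[job_id] = self._checks_until_done
        return job_id

    def check_status(self, job_id):
        """Report RUNNING a few times, then DONE; unknown ids give NOT_FOUND."""
        if job_id not in self._remaining:
            return "NOT_FOUND"
        if self._remaining[job_id] > 0:
            self._remaining[job_id] -= 1
            return "RUNNING"
        return "DONE"

    def poll(self, job_id, result_type="output"):
        """Return the (fake) result file for a finished job."""
        return f"results of {job_id} ({result_type})"


def wait_for_result(service, job_id, interval=10.0, max_checks=100, sleep=time.sleep):
    """Check the job status until it is DONE, then fetch the result."""
    for _ in range(max_checks):
        status = service.check_status(job_id)
        if status == "DONE":
            return service.poll(job_id)
        if status != "RUNNING":
            raise RuntimeError(f"job {job_id} ended with status {status}")
        sleep(interval)
    raise TimeoutError(f"job {job_id} still running after {max_checks} checks")


service = FakeJobService()
job_id = service.run({"program": "fasta3", "database": "uniprot"}, "MRCSISLVLG")
result = wait_for_result(service, job_id, interval=0, sleep=lambda seconds: None)
```

Bounding the number of checks and treating any status other than RUNNING or DONE as an error keeps a failed job from leaving the client waiting forever.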

Perl client

use SOAP::Lite;

my $WSDL = 'http://www.ebi.ac.uk/Tools/webservices/wsdl/WSFasta.wsdl';
my $fasta_client = SOAP::Lite->service($WSDL);

# Job parameters; async = 1 submits the job without waiting for the results
my %params = ();
$params{'program'}  = 'fasta3';
$params{'database'} = 'uniprot';
$params{'email'}    = '[email protected]';
$params{'async'}    = 1;

# Input sequence
my $data = {type => "sequence", content => "MRCSISLVLGLLALEVALARNLQEHVFNSVQSMCSDDSFSEDTECI"};

# Alternatively, pass a database identifier:
# my $data = {type => "sequence", content => "uniprot:slpi_human"};

my $jobid = $fasta_client->runFasta(
    SOAP::Data->name('params')->type(map => \%params),
    SOAP::Data->name(content => [$data]));

print $fasta_client->poll($jobid);

Perl client (cont.)

# set a loop for checking job submission status
# RUNNING, NOT_FOUND, ERROR, DONE
my $status = $fasta_client->checkStatus($jobid);
while ($status eq "RUNNING") {
    sleep 10;
    $status = $fasta_client->checkStatus($jobid);
}

# when job is done, poll for the results
my $result;
$result = $fasta_client->poll($jobid) if ($status eq "DONE");

print $result if defined $result;

http://taverna.sourceforge.net/

http://www.myexperiment.org/users/471

high throughput genomics

data management

https://carmaweb.genome.tugraz.at/

http://base.thep.lu.se/

Why must we support standards?

• Unambiguous representation, description and communication

– Final results and metadata

• Interoperability

– Data management and analysis

• Integration of OMICS: systems biology

What to standardize?

• CONTENT: Minimal/Core Information to be reported ‐> MIBBI (http://www.mibbi.org)

• SEMANTIC: Terminology Used ‐> Ontologies, OBI (http://obi‐ontology.org)

• SYNTAX: Data Model, Data Exchange ‐> FuGE (http://fuge.sourceforge.net/), ISA‐TAB, MAGE‐TAB, PRIDE

MIBBI: Standard Content

Promoting Coherent Minimum Reporting Requirements for Biological and Biomedical Investigations: The MIBBI Project. Taylor et al., Nature Biotechnology

data analysis

Microarray

RT‐PCR

[Figure: microarray analysis workflow]

Biological question

Experimental design

Microarray experiment

Image analysis (expression quantification)

Normalization (pre‐processing)

Analysis: estimation, testing, clustering, prediction

Biological verification and interpretation

r‐project.org

• R is an open source implementation of the S language

• Many statistical and machine learning algorithms

• Good visualization capabilities

• Possible to write scripts that can be reused

• Sophisticated package creation and distribution system

• Supports many data technologies: XML, DBI, SOAP

• Interacts with other languages: C; Perl; Python; Java

• R is largely platform independent: Unix; Windows; OSX

• R has an active user community

cran.r‐project.org

BioConductor

• Access a wide range of powerful statistical and graphical tools

• Facilitate the integration of biological metadata in the analysis of experimental data

• Allow the rapid development of extensible, scalable, and interoperable software

• Promote high‐quality documentation and reproducible research

• Provide training in computational and statistical methods for the analysis of genomic data

http://www.bioconductor.org/

Bioconductor Packages/libraries

Two releases each year that follow the biannual releases of R

294 software packages

490 Metadata packages

>700 citations

[Chart: number of software packages per Bioconductor release, from Release 1.1 to 2.3 (294 packages)]

Bioconductor for Microarray Analysis

• Quickly becoming the accepted approach

• Open source

• Flexible

• (fairly) simple to use ‐ intuitive

• Wide applications – many packages

affy package

Pre-processing oligonucleotide chip data:

• diagnostic plots

• background correction

• probe-level normalization

• computation of expression measures
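As an illustration of what a normalization step does, here is a minimal sketch of quantile normalization, one common choice for making intensity distributions comparable across arrays. It is written in plain Python for clarity and is not the affy package's implementation; the function name `quantile_normalize` and the toy intensities are assumptions made for this example, and ties are not given any special treatment.

```python
def quantile_normalize(arrays):
    """Give every array the same distribution of values.

    arrays: list of equal-length lists, one list of intensities per array.
    Each value is replaced by the mean, across arrays, of the values
    that share its rank.
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(values) for values in arrays]
    # Reference distribution: mean of the k-th smallest value of each array.
    reference = [
        sum(values[k] for values in sorted_arrays) / len(arrays)
        for k in range(n_probes)
    ]
    normalized = []
    for values in arrays:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: values[i])
        result = [0.0] * n_probes
        for rank, probe in enumerate(order):
            result[probe] = reference[rank]
        normalized.append(result)
    return normalized


# Toy intensities for two arrays measuring the same three probes.
raw = [[2.0, 4.0, 6.0], [1.0, 9.0, 5.0]]
normalized = quantile_normalize(raw)
print(normalized)
```

After this step both arrays contain exactly the same set of values, so remaining differences between arrays reflect only the ranking of the probes.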

image

plotDensity

plotAffyRNAdeg

barplot.ProbeSet

heatmap

mva package

proteomics

http://www.agml.org/

Trans‐Proteomic Pipeline (TPP) is a collection of integrated tools for MS/MS proteomics

http://tools.proteomecenter.org

http://proteowizard.sourceforge.net

http://www.thegpm.org/TANDEM

Bioclipse

[Screenshot: Bioclipse workbench with two views, an editor, a console and a properties pane]

http://www.bioclipse.net/

Work with spectra: Spectrum plugin

Work with sequences: BioJava plugin

CMLRSS plugin: Chemistry on the web

cytoscape

http://www.cytoscape.org

PyMOL

http://www.pymol.org

image processing

Open Microscopy Environment

• OME is a multi‐site collaborative effort among academic laboratories and a number of commercial entities that produces open tools to support data management for biological light microscopy.

• The original OME server is an application written in Perl running under Apache. It is accessed using a Web User Interface, via a Java API, or using a plugin for ImageJ.

• The server can support images in a wide range of file formats. This model is also extendable allowing custom data to be stored in the server.

• It supports multiple users and provides appropriate security for private research and collaboration.

http://openmicroscopy.org

beyond software

At $150,000, the Polonator is the cheapest instrument on the market, says Harvard University's George Church, whose lab developed the technology in conjunction with Dover Systems. Plus, the tool uses five‐fold less reagents than other platforms, and is the smallest instrument available.

http://www.polonator.org/

http://www.igem.org

http://www.bioparts.org/

where is the stuff

http://bioinformatics.oxfordjournals.org

http://nar.oxfordjournals.org

http://www.biomedcentral.com/bmcbioinformatics/

http://genomebiology.com/software/

the future

Growth of open access

Scientists: digital natives, always online, hybrids

catalysts for change

[Phil Bourne]

• Making scientific research “re‐useful”—We help people and organizations open and mark their research and data for reuse.

• Enabling “one‐click” access to research materials—We help streamline the materials‐transfer process so researchers can easily replicate, verify and extend research.

• Integrating fragmented information sources—We help researchers find, analyze and use data from disparate sources by marking and integrating the information with a common, computer‐readable language.
