free software and bioinformatics
DESCRIPTION
An overview of the free software philosophy as it has been applied to the field of bioinformatics
TRANSCRIPT
When Craig Venter was asked, “What makes you think you can do a better job with life and genetics than God?”, he answered…
we have computers
and software too
free software!
biology is a data intensive science
Scientific information available in 2010 will double every 72 hours
data mining
my data is mine!
and your data is mine, too!
open source
open data
open access
open science
Comparative genomics
Sequence (DNA/RNA) & phylogeny
Regulation of gene expression; transcription factors & micro RNAs
Protein sequence analysis & evolution
Protein families, motifs and domains
Protein structure & function: computational crystallography
Protein interactions & complexes: modelling and prediction
Chemical biology
Pathway analysis
Systems modelling
Image analysis
Data integration & literature mining
The first Atlas of Protein Sequence and Structure presented information about 65 proteins.
In 1981 the EMBL Nucleotide Sequence Data Library was created. Version 2 was composed of 811 sequences, around 1 million bases, entered by hand.
Smith TF, Waterman MS (1981). "Identification of common molecular subsequences." J Mol Biol. 147 (1): 195‐7.
S.F. Altschul, et al. (1990), "Basic Local Alignment Search Tool," J. Molec. Biol., 215(3): 403‐10. 15,306 citations
J. Thompson, T. Gibson, D. Higgins (1994), CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment. Nuc. Acids. Res. 22, 4673 ‐ 4680
In 1995 the European Bioinformatics Institute was created.
EMBOSS (The European Molecular Biology Open Software Suite) is a free open source software analysis package, specially developed for the needs of the molecular biology user community, that provides a comprehensive set of sequence analysis programs.
The first requirements were based on a list of long‐standing problems in existing commercial software (GCG) and on the need for public source code.
Within EMBOSS you will find around 200 programs (applications).
Current version is 6.0.1
http://emboss.sourceforge.net/
Main Programs in EMBOSS
• Retrieve sequences from databases
• Sequence alignment
• Nucleic gene finding and translation
• Protein secondary structure prediction
• Rapid database searching with sequence patterns
• Protein motif identification, including domain analysis
• Nucleotide sequence pattern analysis, for example to identify CpG islands or repeats
• Codon usage analysis for small genomes
• Rapid identification of sequence patterns in large scale sequence sets
• Presentation tools for publication
open‐bio.org
• The Open Bioinformatics Foundation is a non profit, volunteer run organization focused on supporting open source programming in bioinformatics.
• Its main activities are:
– Underwriting and supporting the BOSC conferences
– Organizing and supporting developer‐centric "hackathon" events (Bio*)
O’Reilly Books and Conferences
http://www.ensembl.org
http://www.uniprot.org
Generic Model Organism Database project
http://gmod.org
DAS Concept
Reference server
Annotation server A
Annotation server B
Annotation server C
Client
http://www.biodas.org
DAS Server
• DAS request to retrieve features on a segment:
• http://das.ensembl.org/das/ens_36_omim_genes/features?segment=1:1,1000000
• Result:
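A request like the one above can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the helper names (`das_features_url`, `parse_features`) are hypothetical, the sample response is made up for illustration and only mimics the general DASGFF layout (FEATURE elements with START and END children), and the server on the slide may no longer answer.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def das_features_url(server, source, chromosome, start, end):
    """Build a DAS 'features' request for one segment (hypothetical helper)."""
    query = urlencode({"segment": f"{chromosome}:{start},{end}"}, safe=":,")
    return f"{server}/{source}/features?{query}"


def parse_features(dasgff_xml):
    """Return (id, label, start, end) for every FEATURE element in the response."""
    root = ET.fromstring(dasgff_xml)
    features = []
    for feature in root.iter("FEATURE"):
        features.append((
            feature.get("id"),
            feature.get("label"),
            int(feature.findtext("START")),
            int(feature.findtext("END")),
        ))
    return features


# Made-up response for illustration only; not real server output.
SAMPLE = """<DASGFF><GFF><SEGMENT id="1" start="1" stop="1000000">
  <FEATURE id="gene_a" label="Gene A"><START>1200</START><END>3400</END></FEATURE>
  <FEATURE id="gene_b" label="Gene B"><START>56000</START><END>78000</END></FEATURE>
</SEGMENT></GFF></DASGFF>"""

url = das_features_url("http://das.ensembl.org/das", "ens_36_omim_genes", 1, 1, 1000000)
print(url)
for row in parse_features(SAMPLE):
    print(row)
```

Keeping URL construction separate from parsing means the parser can be exercised on saved responses without any network access.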
DAS viewer
http://www.ebi.ac.uk/dasty/
Applied Biosystems ABI 3730XL
Illumina / Solexa Genome Analyzer
Applied Biosystems SOLiD
Roche / 454 Genome Sequencer
1 Mb/day 100 Mb/run 3000 Mb/run
Sequencing: fragment assembly problem; the Shortest Superstring Problem; Velvet (Zerbino, 2008)
Gene finding: Hidden Markov Models, pattern recognition methods; GenScan (Burge & Karlin, 1997)
Sequence comparison: pairwise and multiple sequence alignments; dynamic programming, heuristic methods; PSI‐BLAST (Altschul et al., 1997), SSAHA (2001), MUMmerGPU (2008)
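To make the dynamic programming idea behind pairwise comparison concrete, here is a minimal sketch of local alignment scoring in the style of Smith and Waterman (1981). The scoring values (match 2, mismatch -1, linear gap -1) and the function name are assumptions chosen for illustration; real tools use substitution matrices and affine gap penalties, and the heuristic programs listed above do not fill the whole matrix.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between a and b with a linear gap penalty."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("ACGT", "ACGT"))  # four matches: 8
print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Only two rows are kept in memory because the sketch returns the score alone; recovering the alignment itself would need the full matrix or a traceback structure.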
Genomes
Nucleotides
Proteins
Structures
Other molecules
Interactions
Experiments
Literature
Ontologies
http://www.ebi.ac.uk/Databases/
Practical course on databases and the integration of biological information
Challenges of Data Integration
• Different types of data (sequence, function, literature etc.)
• Different data formats (FASTA, EMBL, GenBank, tab delimited etc.)
• Different storage formats (ASCII flatfile, XML, RDBMS)
• No standard formats for common fields (citations, descriptions, dates etc.)
• Volume and size of data
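As a small illustration of the format problem, the sketch below converts FASTA text into tab-delimited rows. It is a minimal sketch, assuming that the first word of each header line is the identifier and the rest is a free-text description; the function names and the sample records are made up for the example.

```python
def parse_fasta(text):
    """Yield (identifier, description, sequence) tuples from FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


def _record(header, chunks):
    # Assumption: identifier is the first word, the rest is the description.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def fasta_to_tsv(text):
    """Render each FASTA record as one tab-delimited line."""
    return "\n".join("\t".join(record) for record in parse_fasta(text))


# Made-up records for illustration only.
SAMPLE = """>seq1 hypothetical example peptide
MRCSISLVLGLLALEVALARNLQ
EHVFNSVQSMCSDDSFSEDTECI
>seq2
ACGTACGT
"""

print(fasta_to_tsv(SAMPLE))
```

Even this simple case needs a convention for what counts as the identifier, which is exactly the "no standard formats for common fields" problem listed above.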
BioMart is a simple and robust data integration system for large scale data querying, providing researchers with fast and flexible access to biological databases
http://www.biomart.org/
Web Services
http://www.ebi.ac.uk/Tools/
Challenges when using tools in unison
• Manually transfer data from one application to another
• Understand disparate data formats
• Convert file formats where appropriate
• Manage and understand disparate application environments e.g. web browser, desktop application
[Diagram: a dataflow/workflow chaining several web services (ws), with submission and curation steps]
REST: REpresentational State Transfer
http://www.ebi.ac.uk/Tools/webservices/rest/dbfetch/uniprot/slpi_human
GET, POST
HTML,XML,PNG
RESTful web services
Any web page is a web service
http://www.ebi.ac.uk/cgi-bin/dbfetch?db=uniprot&id=alk1_human&style=html&format=default
Friendly URL and XML documents
• http://www.ebi.ac.uk/Tools/webservices/rest/dbfetch/uniprot/slpi_human
• http://www.ebi.ac.uk/Tools/webservices/rest/dbfetch/uniprot/slpi_human/xml
• http://www.ebi.ac.uk/Tools/webservices/rest/dbfetch/uniprot/slpi_human/fasta
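The friendly URLs above follow a simple pattern (base, database, entry, optional format), which can be sketched as follows. This is a minimal sketch: the helper names are hypothetical, the base URL is the one shown on the slides and may no longer be served, and `fetch` is defined but not called here because it needs network access.

```python
from urllib.request import urlopen

# Base URL as shown on the slides; assumed, it may have moved since.
BASE = "http://www.ebi.ac.uk/Tools/webservices/rest/dbfetch"


def dbfetch_url(db, entry_id, fmt=None):
    """Compose a dbfetch-style REST URL: base/db/id[/format]."""
    parts = [BASE, db, entry_id]
    if fmt:
        parts.append(fmt)
    return "/".join(parts)


def fetch(db, entry_id, fmt=None, timeout=30):
    """Retrieve the entry with a plain HTTP GET (requires network access)."""
    with urlopen(dbfetch_url(db, entry_id, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(dbfetch_url("uniprot", "slpi_human"))
print(dbfetch_url("uniprot", "slpi_human", "fasta"))
```

Because the resource is identified entirely by the URL, the same address works from a browser, a script, or a command-line HTTP client.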
Biomart query
<Query virtualSchemaName="central_server_1">
  <Dataset name="hsapiens_gene_ensembl">
    <Attribute name="ensembl_gene_id"/>
    <Attribute name="ensembl_transcript_id"/>
    <Filter name="chromosome_name" value="1"/>
    <Filter name="band_end" value="p36.33"/>
    <Filter name="band_start" value="q44"/>
  </Dataset>
  <Dataset name="msd">
    <Attribute name="pdb_id"/>
    <Attribute name="experiment_type"/>
    <Filter name="experiment_type" value="NMR"/>
  </Dataset>
</Query>
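A query document of this shape can also be generated programmatically rather than written by hand. The sketch below is one way to do it with Python's standard library; the function name is hypothetical, it handles a single dataset only (the query above links two), and it does not submit anything to a BioMart server.

```python
import xml.etree.ElementTree as ET


def biomart_query(dataset, attributes, filters, schema="central_server_1"):
    """Build a single-dataset BioMart-style query document as a string."""
    query = ET.Element("Query", virtualSchemaName=schema)
    dataset_element = ET.SubElement(query, "Dataset", name=dataset)
    for name in attributes:
        ET.SubElement(dataset_element, "Attribute", name=name)
    for name, value in filters.items():
        ET.SubElement(dataset_element, "Filter", name=name, value=value)
    return ET.tostring(query, encoding="unicode")


xml_text = biomart_query(
    "hsapiens_gene_ensembl",
    attributes=["ensembl_gene_id", "ensembl_transcript_id"],
    filters={"chromosome_name": "1"},
)
print(xml_text)
```

Building the document through an XML library guarantees well-formed output and correct quoting, which avoids the kind of quote-character errors that creep into hand-edited queries.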
SOAP: Simple Object Access Protocol
fetchData(uniprot, wap_rat, default, xml)
SOAP services
fetchData (db, id, format, style)
entry
wsdbfetch
Perl client
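use SOAP::Lite;
# Build a client from the WSDbfetch service description
my $WSDL = 'http://www.ebi.ac.uk/Tools/webservices/wsdl/WSDbfetch.wsdl';
my $soap = SOAP::Lite->service($WSDL);
# fetchData dbName:id <format> <style>
# (assumption: the slide omitted the entry; uniprot:wap_rat is the one used in the SOAP example above)
my $result = $soap->fetchData('uniprot:wap_rat', 'default', 'raw');
die $soap->call->faultstring if $soap->call->fault;
# Print the returned entry line by line
foreach my $i (@$result) { print "$i\n"; }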
EBI web services (analysis tools)
• run(params, data) -> jobid
• checkStatus(jobid) -> status
• getResults(jobid) -> results available
• poll(jobid, type) -> result file
Perl client
use SOAP::Lite;
my $WSDL = 'http://www.ebi.ac.uk/Tools/webservices/wsdl/WSFasta.wsdl';
my $fasta_client = SOAP::Lite->service($WSDL);
# Job parameters
my %params = ();
$params{'program'}  = 'fasta3';
$params{'database'} = 'uniprot';
$params{'email'}    = '[email protected]';   # replace with your own e-mail address
$params{'async'}    = 1;                    # submit asynchronously
# Input: a raw sequence, or alternatively a database entry identifier
my $data = {type => "sequence", content => "MRCSISLVLGLLALEVALARNLQEHVFNSVQSMCSDDSFSEDTECI"};
# my $data = {type => "sequence", content => "uniprot:slpi_human"};
# Submit the job and poll it
my $jobid = $fasta_client->runFasta(
    SOAP::Data->name('params')->type(map => \%params),
    SOAP::Data->name(content => [$data]));
print $fasta_client->poll($jobid);
Perl client (cont.)
# Loop while checking the job status: RUNNING, NOT_FOUND, ERROR, DONE
my $status = $fasta_client->checkStatus($jobid);
while ($status eq "RUNNING") {
    sleep 10;
    $status = $fasta_client->checkStatus($jobid);
}
# When the job is done, poll for the results
my $result;
$result = $fasta_client->poll($jobid) if $status eq "DONE";
print $result if defined $result;
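The submit, check and poll pattern of the Perl client is independent of the language. Below is a minimal Python sketch of the waiting loop, assuming a client object that exposes `checkStatus` and `poll` with the same names as on the slides. `FakeClient` is a stand-in written for this example, not a real SOAP client, and the limit on the number of checks is an addition that the Perl loop does not have.

```python
import time


def wait_for_job(client, job_id, interval=10, max_checks=360, sleep=time.sleep):
    """Check the job status until it leaves RUNNING, then return the results."""
    for _ in range(max_checks):
        status = client.checkStatus(job_id)
        if status != "RUNNING":
            break
        sleep(interval)
    else:
        raise TimeoutError(f"job {job_id} still running after {max_checks} checks")
    if status != "DONE":
        # NOT_FOUND or ERROR: there is nothing to retrieve.
        raise RuntimeError(f"job {job_id} ended with status {status}")
    return client.poll(job_id)


class FakeClient:
    """Stand-in for a real web service client, for demonstration only."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def checkStatus(self, job_id):
        return next(self._statuses)

    def poll(self, job_id):
        return f"results for {job_id}"


client = FakeClient(["RUNNING", "RUNNING", "DONE"])
# The sleep function is replaced so the demonstration does not actually wait.
print(wait_for_job(client, "job-1", sleep=lambda seconds: None))
```

Passing the sleep function as a parameter keeps the loop testable, and the cap on checks prevents a script from waiting forever on a job that never finishes.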
http://taverna.sourceforge.net/
http://www.myexperiment.org/users/471
high throughput genomics
data management
https://carmaweb.genome.tugraz.at/
http://base.thep.lu.se/
Why support standards?
• Unambiguous representation, description and communication – Final results and metadata
• Interoperability – Data management and analysis
• Integration of OMICS – Systems biology
What to standardize?
• CONTENT: Minimal/core information to be reported -> MIBBI (http://www.mibbi.org)
• SEMANTIC: Terminology used -> Ontologies, OBI (http://obi-ontology.org)
• SYNTAX: Data model, data exchange -> FuGE (http://fuge.sourceforge.net/), ISA‐TAB, MAGE‐TAB, PRIDE
MIBBI: Standard Content
Promoting Coherent Minimum Reporting Requirements for Biological and Biomedical Investigations: The MIBBI Project, Taylor et al., Nature Biotechnology
data analysis
[Workflow diagram: Biological question -> Experimental design -> Microarray experiment -> Image analysis -> Expression quantification (pre‐processing, normalization) -> Analysis (estimation, testing, clustering, prediction) -> Biological verification and interpretation. Techniques shown: Microarray, RT‐PCR]
r-project.org
• R is an open source implementation of the S language
• Many statistical and machine learning algorithms
• Good visualization capabilities
• Possible to write scripts that can be reused
• Sophisticated package creation and distribution system
• Supports many data technologies: XML, DBI, SOAP
• Interacts with other languages: C, Perl, Python, Java
• R is largely platform independent: Unix, Windows, OSX
• R has an active user community
cran.r-project.org
Bioconductor
• Access a wide range of powerful statistical and graphical tools
• Facilitate the integration of biological metadata in the analysis of experimental data
• Allow the rapid development of extensible, scalable, and interoperable software
• Promote high‐quality documentation and reproducible research
• Provide training in computational and statistical methods for the analysis of genomic data
http://www.bioconductor.org/
Bioconductor Packages/libraries
Two releases each year that follow the biannual releases of R
294 software packages
490 Metadata packages
>700 citations
[Chart: number of software packages per Bioconductor release, from 1.1 to 2.3 (294 packages)]
Bioconductor for Microarray Analysis
• Quickly becoming the accepted approach
• Open source
• Flexible
• (fairly) simple to use ‐ intuitive
• Wide applications – many packages
affy package
Pre-processing oligonucleotide chip data:
• diagnostic plots
• background correction
• probe-level normalization
• computation of expression measures
image
plotDensity
plotAffyRNAdeg
barplot.ProbeSet
heatmap
mva package
proteomics
http://www.agml.org/
Trans‐Proteomic Pipeline (TPP) is a collection of integrated tools for MS/MS proteomics
http://tools.proteomecenter.org
http://proteowizard.sourceforge.net
http://www.thegpm.org/TANDEM
Bioclipse
View
View
Editor
Console
Properties
http://www.bioclipse.net/
Work with spectra: Spectrum plugin
Work with sequences: BioJava plugin
CMLRSS plugin: Chemistry on the web
cytoscape
http://www.cytoscape.org
PyMOL
http://www.pymol.org
image processing
Open Microscopy Environment
• OME is a multi‐site collaborative effort among academic laboratories and a number of commercial entities that produces open tools to support data management for biological light microscopy.
• The original OME server is an application written in Perl running under Apache. It is accessed using a Web User Interface, via a Java API, or using a plugin for ImageJ.
• The server can support images in a wide range of file formats. This model is also extendable allowing custom data to be stored in the server.
• It supports multiple users and provides appropriate security for private research and collaboration.
http://openmicroscopy.org
OMERO
beyond software
At $150,000, the Polonator is the cheapest instrument on the market, says Harvard University's George Church, whose lab developed the technology in conjunction with Dover Systems. Plus, the tool uses five‐fold less reagents than other platforms, and is the smallest instrument available.
http://www.polonator.org/
http://www.igem.org
http://www.bioparts.org/
where is the stuff
http://bioinformatics.oxfordjournals.org
http://nar.oxfordjournals.org
http://www.biomedcentral.com/bmcbioinformatics/
http://genomebiology.com/software/
the future
Growth of open access
Scientists: digital natives, always online, hybrids
catalysts for change
[Phil Bourne]
• Making scientific research “re‐useful”—We help people and organizations open and mark their research and data for reuse.
• Enabling “one‐click” access to research materials—We help streamline the materials‐transfer process so researchers can easily replicate, verify and extend research.
• Integrating fragmented information sources—We help researchers find, analyze and use data from disparate sources by marking and integrating the information with a common, computer‐readable language.