Presentation of the
CRG Bioinformatics Core facility
Jean-François Taly
People in the BioCore
Jean-Francois, Luca, Toni, Sarah
• @CRG 2009 • @BioCore 2012 • Acting head • Structural bioinfo. • MSA • NGS analyst • Galaxy server • Training
• @BioCore 2010 • NGS analyst • Small ncRNA prediction • Motif analysis • Training
• @BioCore 2009 • Wikis • Web/DB dev. • DB mirrors • Structural bioinfo. • Training
• @BioCore 2014 • Micro-arrays • NGS analyst • Galaxy • Training
Our mission
• Expertise in bioinformatics
  • Service
  • Consultation
• Training
  • Internal and external
• Support in infrastructures
  • In collaboration with the SIT and TIC
• Part of the CRG bioinformaticians network
  • 83 @ bioinformatics retreat
  • Many more in PRBB/CNAG
Our services
Analysis: microarray, ChIP-seq, RNA-seq (DE and assembly), genome assembly, variant calling
Informatics support: wiki, web, server, API
Training: Galaxy, Perl, Linux, advanced bioinformatics
Fee per service
Item                                 PRBB fees       Public fees without VAT
Manual data analysis                 13.12 €/hour    39.36 €/hour
Automated data analysis (CPU time)   2.38 €/hour     7.16 €/hour
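The fee table above lends itself to a quick estimate. The sketch below is illustrative only: the rates are copied from the table, while the helper name `estimate_cost` and the example hour counts are assumptions made for this example, and public fees are computed without VAT as in the table.

```python
# Hourly rates in euros, copied from the fee table above.
RATES = {
    "prbb":   {"manual": 13.12, "automated": 2.38},
    "public": {"manual": 39.36, "automated": 7.16},  # without VAT
}


def estimate_cost(user_type, manual_hours=0.0, cpu_hours=0.0):
    """Return an estimated cost in euros (hypothetical helper, not an official quote)."""
    rate = RATES[user_type]
    total = manual_hours * rate["manual"] + cpu_hours * rate["automated"]
    return round(total, 2)


if __name__ == "__main__":
    # Example: 10 hours of manual analysis plus 20 hours of CPU time.
    print("PRBB user:  ", estimate_cost("prbb", manual_hours=10, cpu_hours=20))
    print("Public user:", estimate_cost("public", manual_hours=10, cpu_hours=20))
```

The same job costs roughly three times more at the public rate, which follows directly from the ratio of the two columns.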
Our contribution to projects
Project conception
Bioinfo exp. design
Bioinfo exp. realization
Bioinfo output interpretation
Project conclusions
Apply defined procedures
Customized analysis
CRG bioinformatics community
Big Data WG
• EGA initiative
• Data engineering
• NoSQL
• HPC
NGS Tech. Sem.
• RNA-seq
• Genome assembly
• Variant annot.
• Metagenomics
Other topics
• Integrated -omics
• Good practice in code dev.
• Galaxy dev.
• …
source: Creative Commons, Wikipedia
Micro-arrays
Gene expression array data analysis:
• Background correction and normalization
• Differential expression analysis
• Gene Ontology and pathway analysis
• Various graphics / plots
Additional array-based technologies the Bioinformatics unit supports include:
• qPCR arrays
• Comparative Genomic Hybridization arrays
Main tools are based on the R / Bioconductor environment
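As a rough illustration of the normalization and differential expression steps listed above, here is a minimal Python sketch. It is not the facility's pipeline, which relies on R / Bioconductor: median centering stands in for the real normalization methods, the function names and intensity values are invented for the example, and no statistical test is performed.

```python
import math
import statistics


def log2_median_center(sample):
    """Log2-transform raw intensities and subtract the sample median."""
    logged = [math.log2(value) for value in sample]
    median = statistics.median(logged)
    return [value - median for value in logged]


def log2_fold_changes(control, treated):
    """Per-gene difference (treated - control) of mean normalized log2 values.

    `control` and `treated` are lists of samples; each sample holds one
    intensity per gene, in the same gene order.
    """
    norm_control = [log2_median_center(s) for s in control]
    norm_treated = [log2_median_center(s) for s in treated]
    n_genes = len(norm_control[0])
    changes = []
    for g in range(n_genes):
        mean_c = statistics.mean([s[g] for s in norm_control])
        mean_t = statistics.mean([s[g] for s in norm_treated])
        changes.append(mean_t - mean_c)
    return changes


if __name__ == "__main__":
    # Made-up intensities: 2 control and 2 treated arrays, 3 genes each.
    control = [[100, 200, 400], [110, 220, 440]]
    treated = [[100, 200, 1600], [120, 240, 1920]]
    genes = ["geneA", "geneB", "geneC"]
    for gene, lfc in zip(genes, log2_fold_changes(control, treated)):
        print(gene, round(lfc, 2))
```

Centering each array on its own median removes the overall brightness difference between the two treated arrays, so only the third gene stands out.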
RNA-seq
DNA-seq
Pevzner P A et al. PNAS 2001;98:9748-9753
ChIP-seq
Growing to the next level
• From gene DE to transcript DE: users now have access to longer reads and deeper coverage
• Metagenomics: 16S ribosomal amplicon sequencing with MiSeq
• Data integration framework: combining different data types into one single analysis (RNA-seq DE, histone marks, metabolomics data, proteomics)
• Data analysis workflows on Galaxy: leave the basic processing to users and focus on advanced analysis
Databases mirroring
Biological file sources
• ENSEMBL
• UCSC
• NCBI BLAST DBs
• UniProt
• PDB
• iGenomes (Illumina; only Human for now, the rest is upcoming)
All indexed and formatted for
• NCBI BLAST+ (makeblastdb for proteins and nucleic acids)
• Bowtie & Bowtie2
• BWA
• Fastaindex (Exonerate)
• GEM
• faToTwoBit
Where are they stored?
In CRG common storage: /db
More information: http://biocore.crg.cat/wiki/Category:Mirrors
IMPORTANT: /db/seq (former /seq) IS DEPRECATED
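Since scripts may still point at the deprecated location, a small check can help. The sketch below only encodes what the slide states (mirrors live under /db, while /db/seq and the former /seq are deprecated); the function names and the example paths are hypothetical, and the real directory layout is documented on the wiki page above.

```python
from pathlib import PurePosixPath

MIRROR_ROOT = PurePosixPath("/db")
# Locations flagged as deprecated on the slide above.
DEPRECATED_ROOTS = (PurePosixPath("/db/seq"), PurePosixPath("/seq"))


def _is_under(path, root):
    """True when `path` equals `root` or lies below it."""
    return path == root or root in path.parents


def classify_path(path_str):
    """Return 'deprecated', 'mirror', or 'other' for a file path."""
    path = PurePosixPath(path_str)
    if any(_is_under(path, root) for root in DEPRECATED_ROOTS):
        return "deprecated"
    if _is_under(path, MIRROR_ROOT):
        return "mirror"
    return "other"


if __name__ == "__main__":
    # Hypothetical paths, used only to exercise the three cases.
    for example in ("/db/seq/genomes/example.fa",
                    "/db/ensembl/example.fa",
                    "/home/user/example.fa"):
        print(example, "->", classify_path(example))
```

The deprecated roots are tested first because /db/seq also sits under /db and would otherwise be reported as a valid mirror.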
WEB and Database services
Applications
• Data and project management
• Platforms for big data analysis and complex information querying
• Promotion and publication of scientific results
WEB and Database services: example
Superfly, for Yogi Jaëger
Visual catalogue of gene embryo development of different fly species.
WEB and Database services: example
PRGDB, with Walter Sanseverino
Wiki-based database of plant resistance genes.
Activity per category in 2014
Presentation of the Galaxy platform
Jean-François Taly
Bioinformatics Core Facility, CRG (Barcelona, Catalonia, Spain)
September 18th 2014
EMBO Global Exchange Course
Pasteur Institute of Tunis, Tunisia
Why Should I Use Galaxy?
Biologists:
• Linux-free data analysis with a graphical interface
Bioinformaticians:
• Ensure reproducibility when sharing analyses and workflows
• Teach their knowledge to a broad audience
• Get access to workflows for topics they are not familiar with
Software Developers:
• Distribute their tools on a standardized platform
The Galaxy Team
Galaxy is developed by:
• The Nekrutenko lab in the Center for Comparative Genomics and Bioinformatics at Penn State University
• The Taylor lab at Johns Hopkins University
• The community
https://wiki.galaxyproject.org/GalaxyTeam
Rationale behind Galaxy
From Goecks et al., Genome Biol. 2010.
“Computation has become an essential tool in life science research. This is exemplified in genomics, where first microarrays and now massively parallel DNA sequencing have enabled a variety of genome-wide functional assays, such as ChIP-seq and RNA-seq (and many others), that require increasingly complex analysis tools. However, sudden reliance on computation has created an 'informatics crisis' for life science researchers: computational resources can be difficult to use, and ensuring that computational experiments are communicated well and hence reproducible is challenging. Galaxy helps to address this crisis by providing an open, web-based platform for performing accessible, reproducible, and transparent genomic science.”
Makes bioinformatics accessible
From a command line …
… to a graphical interface
One step
Multi-step protocol (steps 1 to 5)
Workflow
Galaxy Tutorials
https://usegalaxy.org/u/jeremy/p/galaxy-rna-seq-analysis-exercise
https://wiki.galaxyproject.org/Learn
NGS on a laptop
• MinION brings NGS to your laptop
• http://youtu.be/UtXlr19xTh8
Reproducibility
Bioinformaticians suffer from this too!
• Results can change depending on:
  • Libraries and software versions
  • Genome annotations
• Results are published without the code
Want to share your findings with everybody?
• Freeze an environment in a virtual machine
• Use an application container (Docker)
• Prepare a Galaxy workflow
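Short of a full virtual machine or container, even a small record of versions and annotation checksums helps explain why two runs differ. The following sketch is one possible way to capture such a record with the Python standard library; it is not a substitute for the options listed above, and the function names `file_sha256` and `snapshot` as well as the example package names are assumptions made for illustration.

```python
import hashlib
import json
import platform
from importlib import metadata


def file_sha256(path, chunk_size=1 << 20):
    """Checksum a file (e.g. a genome annotation) without loading it whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(packages=(), annotation_files=()):
    """Collect interpreter, package, and annotation details for one analysis."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
        "annotations": {path: file_sha256(path) for path in annotation_files},
    }


if __name__ == "__main__":
    # Package names are examples only; list the ones your analysis imports.
    print(json.dumps(snapshot(packages=["numpy", "pysam"]), indent=2))
```

Storing this JSON next to the results makes it possible to tell later whether a changed output comes from the code, a library upgrade, or a new annotation release.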
Improve the visibility of a paper
“A Galaxy workflow and the corresponding wrappers are available to download at https://mylab.com. A virtual machine containing a pre-configured server can be downloaded at the same address.”
Why not have Galaxy workflows as well?
Wrapping software
Software
The wrapper prepares the command line
XML file
Simple wrapper example
venn_diagram.sh: wrappers can launch scripts
TopHat wrapper (1): XML file describing TopHat parameters
TopHat wrapper (2): XML file describing TopHat parameters
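To make the idea concrete, the sketch below shows how an XML description can be turned into a command line. It is a deliberately simplified illustration, not Galaxy's actual engine: the XML is a hypothetical, stripped-down tool file built around the venn_diagram.sh example, the parameter names are invented, and Python's `string.Template` stands in for the templating that Galaxy performs itself.

```python
import shlex
import xml.etree.ElementTree as ET
from string import Template

# Hypothetical, heavily simplified tool description (not the slide's real file).
TOOL_XML = """
<tool id="venn_diagram" name="Venn diagram">
  <command>venn_diagram.sh $input1 $input2 $output</command>
  <inputs>
    <param name="input1" type="data"/>
    <param name="input2" type="data"/>
  </inputs>
  <outputs>
    <data name="output" format="png"/>
  </outputs>
</tool>
"""


def build_command(tool_xml, values):
    """Fill the <command> template with shell-quoted parameter values."""
    root = ET.fromstring(tool_xml)
    declared = [p.get("name") for p in root.findall("./inputs/param")]
    declared += [d.get("name") for d in root.findall("./outputs/data")]
    missing = [name for name in declared if name not in values]
    if missing:
        raise ValueError("missing values for: " + ", ".join(missing))
    quoted = {name: shlex.quote(str(values[name])) for name in declared}
    return Template(root.find("command").text.strip()).substitute(quoted)


if __name__ == "__main__":
    print(build_command(TOOL_XML, {
        "input1": "a.bed", "input2": "my file.bed", "output": "venn.png",
    }))
```

The graphical form shown to the user is generated from the same parameter declarations, which is why one XML file is enough to expose a command-line program in the interface.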
Community Tools/Wrappers
Should I install Galaxy?
Galaxy public servers
Good points
• Free
• No IT tasks
• Come with reference genomes and workflows
Bad points
• Offer limited resources (disk/CPUs)
• Data transfer may be long
• Give access only to the tools their administrators choose
• Data security may not be guaranteed
Galaxy Public Servers https://wiki.galaxyproject.org/PublicGalaxyServers
Galaxy local server
Good points
• Total control over data and tools
• Limited only by your own disk and CPUs
• Some companies sell a ready-to-use infrastructure
• The Tool Shed helps to install wrappers and software
Bad points
• Cost of installation and maintenance
• Needs IT support for an advanced multi-user setup
Get Galaxy: https://wiki.galaxyproject.org/Admin/GetGalaxy
Can only be installed on Linux or Mac
[Diagram: architecture of the local Galaxy installation. Labels: User; FTP; Files > 2 GB; Galaxy server; Tools; Data; Software; HPC; NFS; NFS:/software; NFS:/db; Sequences; Indexes; /scratch; Files, back-up, tmp; 30 days max.]
Database engine
• The Galaxy team recommends PostgreSQL, but MySQL can also be used
• Stores user details and data information
Tools = wrappers
• File describing all possible parameters of a software
• Script preparing the correct command line
Apache server
Shared file system
• NFS (2 PB)
• 10 €/TB/group/month
• Access to the shared biological resources
  • Ensembl, UCSC
  • Genomes and indexes
  • UniProt, Pfam, SMART, PDB
• Access to the shared software repository
High Performance Computing
• 7 cores, 8 CPUs each (56 total)
• 47 GB memory
FTP server
• ProFTPD on the server side
• I recommend FileZilla for the client (multiplatform)
Upload from Galaxy
• Files are moved to the shared file system
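The choice between the two upload paths can be expressed as a simple rule. The sketch below assumes that the "Files > 2Gb" label in the diagram means files larger than 2 gigabytes should go through FTP and that smaller files can be uploaded from the Galaxy web interface; the function name and the exact byte threshold are assumptions for illustration.

```python
TWO_GB = 2 * 1024 ** 3  # threshold in bytes; assumes "2Gb" means 2 gigabytes


def upload_route(size_bytes):
    """Return 'ftp' for files above the threshold, otherwise 'browser'."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    return "ftp" if size_bytes > TWO_GB else "browser"


if __name__ == "__main__":
    # Hypothetical file sizes: a small table and a large sequencing file.
    for label, size in (("counts.tsv", 5 * 1024 ** 2),
                        ("reads.fastq.gz", 12 * 1024 ** 3)):
        print(label, "->", upload_route(size))
```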
Summary
Galaxy is an open, web-based platform for computational biomedical research.
• Accessible: users without programming experience can run tools and workflows
• Reproducible: Galaxy captures analysis details
• Transparent: users can share and publish analyses
WIKI: https://wiki.galaxyproject.org/FrontPage