Analysis of high-throughput sequencing data using Galaxy platform Centre for Digital Scholarship, the UQ library; May 9, 2018 Igor Makunin UQ RCC


Page 1

Analysis of high-throughput sequencing data using Galaxy platform

Centre for Digital Scholarship, the UQ Library; May 9, 2018

Igor Makunin, UQ RCC

Page 2

High-throughput sequencing, or NGS

Big-scale sequencing
• 100,000,000s of sequences, or reads, per experiment
• sequencing of a (random) library
• low cost per nucleotide

Popular technologies:
• Illumina
• Ion/Proton
• PacBio

Emerging technologies
• Oxford Nanopore MinION

Analysis of NGS data
• Big datasets
• Computationally intensive
• Dedicated tools and data types
• Extensive use of public data

Storage

Computational resources

Public data

Knowledge and skills

Tools

Page 3

Galaxy: what does it look like?

Galaxy is a web-based platform for analysis of genome-scale data.

Screenshot labels: Tools, Working window, History, Top menu, Upload, History menu

Page 4

Why Galaxy?

• Simple intuitive platform
• Public servers with pre-installed tools and storage
• Built-in public data, e.g. aligner indices
• Direct import from public repositories
• 1000s of tools are available
• Data visualisation options
• Data sharing
• Big community
• Easy registration

Page 5

Galaxy is a workflow engine

A Galaxy workflow is a series of tools and dataset actions that run in sequence as a batch operation.

Screenshot labels: Select tool or input dataset; Add name, comments; Toolbox; Noodle; Input; Email notification

Page 6

Galaxy toolshed

New tools can be installed by Galaxy admins from Galaxy toolsheds.

The main toolshed: toolshed.g2.bx.psu.edu
Test toolshed: testtoolshed.g2.bx.psu.edu

Page 7

Public Galaxy servers

Advantages of registration:
• access to histories over a long time
• multiple histories
• ability to use Galaxy from different devices
• bigger quotas (on some servers)
• FTP

• Independent registration on every Galaxy server
• Different tools, different user policy
• Data can be moved between Galaxy servers

Galaxy servers:
usegalaxy.org
usegalaxy.eu
galaxy-tut.genome.edu.au
galaxy-qld.genome.edu.au (Galaxy Australia)

Page 8

Genomics Virtual Laboratory

NeCTAR computer cloud and storage

GVL image

The GVL project was started in 2012.
Analysis of NextGen sequencing data is a bottleneck (infrastructure, skills).
Genomics Virtual Lab: take the IT out of Bioinformatics
- DIY bioinformatics environment (advanced users)
- web-based resources (biologist-friendly)
- tutorials and training materials: gvl.org.au

GVL advantages:
- public resource (no charges to users)
- available immediately to anyone

Afgan et al. Genomics Virtual Laboratory: a practical bioinformatics workbench for the cloud. PLoS One. 2015 Oct 26;10(10):e0140829. doi:10.1371/journal.pone.0140829

Page 9

GVL activities in Brisbane

Services, Data, User engagement

galaxy-qld.genome.edu.au

GVL RStudio server: gvl-rstudio.genome.edu.au

1000 Genomes Project Mirror: 1000genomes.genome.edu.au
FTP and SFTP access; browse repositories

RDS Beacon tutorial server: rdsbeacon.genome.edu.au
Share human data without sensitive information

Blog: genomicsvirtuallab.wordpress.com
FAQs, ‘how to’ info + news

Announcements and news: @GVL_QLD

Training, user support

Sponsors:

Page 10

Galaxy Australia

galaxy-qld.genome.edu.au

Plot: jobs per day (fewer jobs on weekends)

Master node: 16 CPUs, 64 GB RAM

Worker nodes: 16 CPUs, 64 GB RAM

49 TB volume storage (user data)
1 TB volume storage for indices

Page 11

Tools

GVL Galaxy in Queensland: galaxy-qld.genome.edu.au/galaxy

Tools:
- BWA, bowtie2
- Velvet, SPAdes
- Trinity
- tophat2, RNA_STAR, HiSAT2
- DESeq, edgeR, Cufflinks, StringTie
- GATK2, variant detection tools
- Metagenomics tools
- MACS2, SPP
- SAMtools
- Picard
- deepTools

Topics:
✓ RNA-Seq
✓ ChIP-Seq
✓ Variant detection
✓ Genome assembly
✓ Transcriptome
✓ Metagenomics

Page 12

User data and quotas

• Registered users: 100 GB
• Australian users: 600 GB
• UQ users: 1 TB

➢ No external backup for user data
➢ Download results as soon as convenient
➢ Delete and purge unneeded datasets and temporary files

We do not endorse:
• long-term data storage on the server
• multiple registrations

Page 13

Public data on Galaxy

Files imported from Data Libraries are not counted towards the user quota.
We add public data on request from users.

Genome indices and assemblies

SnpEff databases

Data libraries

BLAST database

Page 14

Support for Galaxy-qld

Igor Makunin, UQ RCC
User support, training

Derek Benson, UQ RCC & IMB
System administrator

GVL-Qld announcements: twitter.com/GVL_QLD
GVL-Qld blog: genomicsvirtuallab.wordpress.com

GVL FAQ page at gvl.org.au/faq
genomicsvirtuallab.wordpress.com/getting-started

Page 15

FASTQ format

Terminology: a read is a sequence with quality score values produced by a sequencing machine.

Common output format: FASTQ compressed with gzip, e.g. SRR3145_1.fq.gz

Multiple reads in a single FASTQ file
Each read is described by four lines

@SRR3145.19 ILLUMINA-C32_FC:3:1:80:12/1
TAGCAGCACATCATGGTTTACATCGTATGC
+
IIHIDIIIIIIIIIIIIIHIHIIIIIDGIB

Line 1: Name; always starts with @
Line 2: Sequence
Line 3: Always starts with +; may have name
Line 4: Encoded Phred quality score

single-end reads, paired-end reads
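The four-line structure can be illustrated with a minimal Python sketch. It assumes strictly four lines per read with no wrapped sequence lines; the names `FastqRead`, `parse_fastq` and `open_fastq` are hypothetical helpers introduced here, not part of Galaxy.

```python
import gzip
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRead(NamedTuple):
    name: str       # header line without the leading '@'
    sequence: str   # nucleotide sequence
    quality: str    # encoded Phred quality string, same length as the sequence


def parse_fastq(handle: TextIO) -> Iterator[FastqRead]:
    """Yield reads from a FASTQ stream, assuming exactly four lines per read."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of file (a blank line is also treated as the end)
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRead(header[1:], sequence, quality)


def open_fastq(path: str) -> TextIO:
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


# The read shown on the slide, supplied as an in-memory file.
example = io.StringIO(
    "@SRR3145.19 ILLUMINA-C32_FC:3:1:80:12/1\n"
    "TAGCAGCACATCATGGTTTACATCGTATGC\n"
    "+\n"
    "IIHIDIIIIIIIIIIIIIHIHIIIIIDGIB\n"
)
reads = list(parse_fastq(example))
print(reads[0].name, len(reads[0].sequence))
```

For a file such as SRR3145_1.fq.gz, `parse_fastq(open_fastq("SRR3145_1.fq.gz"))` would iterate over the reads one at a time without loading the whole file into memory.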

Page 16

FASTQ Phred quality score

A Phred quality score is a measure of the quality of the identification for every nucleotide.

Range: ~0 to ~40

Phred 10: accuracy 90%
Phred 20: accuracy 99%
Phred 30: accuracy 99.9%
Phred 40: accuracy 99.99%

Values are encoded by characters
Advantage: a single character is used instead of a two-digit number

Quality + Offset
39 + 33 = 72
ASCII(72): H

@S391 ILLUMINA_FC:3:80:12/1
TAGCAGCACATCATGGTTTAC
+
IIHIDIIIIIIIIIIIIIHIH
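The encoding arithmetic can be checked with a short Python sketch. The accuracy values on the slide are consistent with the standard Phred definition, where the error probability is 10^(-Q/10); the function names below are illustrative only.

```python
def phred_to_char(quality: int, offset: int = 33) -> str:
    """Encode a Phred score as a single ASCII character."""
    return chr(quality + offset)


def char_to_phred(char: str, offset: int = 33) -> int:
    """Decode a single quality character back to a Phred score."""
    return ord(char) - offset


def phred_to_accuracy(quality: int) -> float:
    """Probability that the base call is correct: 1 - 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-quality / 10.0)


# Quality string from the slide, decoded with the Sanger offset (33).
quality_string = "IIHIDIIIIIIIIIIIIIHIH"
scores = [char_to_phred(c) for c in quality_string]

print(phred_to_char(39))             # 39 + 33 = 72, ASCII 72 is 'H'
print(scores[:5])                    # first five decoded scores
print(round(phred_to_accuracy(30), 4))
```

Decoding the slide's quality string this way gives scores of 40 for `I`, 39 for `H` and 35 for `D`.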

Page 17

ASCII table

Page 18

Phred quality score encoding

Offset 33 - Sanger
Offset 64 - old Illumina

Source: https://en.wikipedia.org/wiki/FASTQ_format
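Because the two offsets use overlapping character ranges, the encoding of a file can often, but not always, be guessed from the characters it contains. The following is a heuristic sketch only: the thresholds assume scores of roughly 0 to 41, and `guess_phred_offset` is a hypothetical helper, not a Galaxy function.

```python
from typing import Iterable, Optional


def guess_phred_offset(quality_strings: Iterable[str]) -> Optional[int]:
    """Guess the quality offset (33 or 64); return None when ambiguous.

    Heuristic assumption: characters below '@' (ASCII 64) cannot occur with
    offset 64, and characters above 'J' (ASCII 74) are not expected with
    offset 33 when scores stay at or below 41.
    """
    lowest, highest = 255, 0
    for quality in quality_strings:
        if not quality:
            continue
        codes = [ord(char) for char in quality]
        lowest = min(lowest, min(codes))
        highest = max(highest, max(codes))
    if lowest < 64:
        return 33
    if highest > 74:
        return 64
    return None  # all characters fall in the range shared by both encodings


print(guess_phred_offset(["II5II", "IIHID"]))    # contains '5' (ASCII 53)
print(guess_phred_offset(["hhhhgfe"]))           # contains 'h' (ASCII 104)
print(guess_phred_offset(["IIHIDIIIIIIIIIIIIIHIH"]))
```

The last call returns `None`: a handful of high-quality reads is not enough to tell the encodings apart, so a real check should scan many reads.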

Page 19

FASTQ quality score in Galaxy

Many old Illumina datasets have a proprietary data encoding (offset 64).
Currently most NGS datasets use the Sanger encoding (offset 33).

Galaxy
By default Galaxy assigns the ‘fastq’ datatype to uploaded FASTQ files. In this case the offset is not specified, and many tools do not recognize the data.

fastqillumina – old Illumina quality score encoding (offset 64, Illumina 1.3+)
fastqsanger – new Illumina 1.8+ / Sanger quality score encoding (offset 33)
Some tools in Galaxy now work only with the fastqsanger datatype.

Solution:
- specify the fastqsanger or fastqillumina datatype during upload
- change the format via Attributes > Datatype
- use the NGS: QC and manipulation > FASTQ Groomer tool
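Conceptually, converting an offset-64 quality string to the Sanger encoding means shifting every character down by 31 (64 - 33). The sketch below illustrates only that re-encoding step; it is not the FASTQ Groomer implementation, and `illumina64_to_sanger` is a hypothetical name.

```python
def illumina64_to_sanger(quality: str) -> str:
    """Re-encode an offset-64 quality string with offset 33."""
    converted = []
    for char in quality:
        score = ord(char) - 64          # decode with the old Illumina offset
        if score < 0:
            raise ValueError(f"{char!r} is not a valid offset-64 quality character")
        converted.append(chr(score + 33))  # encode with the Sanger offset
    return "".join(converted)


print(illumina64_to_sanger("hhhgB"))
```

The check for negative scores catches the common mistake of applying the conversion to data that is already Sanger-encoded, which would otherwise silently corrupt the qualities.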

Page 20

Acknowledgments and useful links

Genomics Virtual Lab: gvl.org.au
Galaxy for tutorials: galaxy-tut.genome.edu.au
Galaxy Australia: galaxy-qld.genome.edu.au

Contributors and participants:

Page 21

Galaxy demo: RNA-Seq analysis

• Import from a data library
• Mapping RNA-Seq reads to a reference genome using the tophat2 aligner
• Alignment visualisation with Integrative Genomics Viewer
• Identification of differentially expressed genes using Cuffdiff
• Data filtering
• Galaxy workflow

Page 22

Differential gene expression analysis

NextGen sequencing data can be used for analysis of gene expression on a genome scale.

Assumption: the number of reads mapped to a gene correlates with the transcript abundance.

Reference-based analysis

Diagram labels: mRNA, Library, single-end reads, read count & stats

Gene expression
Red 4
Brown 2
Green 1
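The read-counting assumption can be sketched in Python with toy data. The gene coordinates and read positions below are invented so that the counts reproduce the slide's example (Red 4, Brown 2, Green 1); this is a simplified illustration, not how Cuffdiff works, and real tools also normalise for gene length and library size.

```python
from typing import Dict, List, Tuple

# Hypothetical gene intervals on one chromosome: name -> (start, end), half-open.
genes: Dict[str, Tuple[int, int]] = {
    "Red": (100, 500),
    "Brown": (700, 1000),
    "Green": (1200, 1500),
}

# Hypothetical alignment start positions of single-end reads.
read_starts: List[int] = [120, 180, 260, 410, 720, 950, 1300, 1600]


def count_reads(
    genes: Dict[str, Tuple[int, int]], read_starts: List[int]
) -> Dict[str, int]:
    """Count reads whose start position falls inside each gene interval."""
    counts = {name: 0 for name in genes}
    for position in read_starts:
        for name, (start, end) in genes.items():
            if start <= position < end:
                counts[name] += 1
                break  # genes do not overlap in this toy example
    return counts


counts = count_reads(genes, read_starts)
print(counts)
```

The read starting at position 1600 falls outside every gene and is left uncounted, as happens with intergenic reads in a real experiment.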

Page 23

RNA-Seq with the Cufflinks package

Visualise alignments

Filter