Working towards multi-omics integration:
Tools and workflows within Galaxy-P Platform
Pratik Jagtap, Galaxy-P Team
University of Minnesota
December 6, 2019
Minnesota Supercomputing Institute: James Johnson, Thomas McGowan, Michael Milligan
Ira Cooke and Maria Doyle, Melbourne, Australia
University of Minnesota
Timothy Griffin (PI), Praveen Kumar, Candace Guerrero, Subina Mehta, Adrian Hegeman (Co-I), Art Eschenlauer, Ray Sajulga, Caleb Easterly, Andrew Rajczewski
Biologists / collaborators: Laurie Parker, Joel Rudney, Maneesh Bhargava, Amy Skubitz, Chris Wendt, Brian Crooker, Steven Friedenberg, Kevin Viken, Kristin Boylan, Marnie Peterson, Somiah Afiuni, Brian Sandri, Alexa Pragman, Wanda Weber, Amy Treeful
Harald Barsnes, Marc Vaudel, University of Bergen, Norway
University of Freiburg, Freiburg, Germany
VIB, UGhent, Belgium
Judson Hervey, Naval Research Institute, Washington, D.C.
Matt Chambers, Nashville, TN
Alessandro Tanca, Porto Conte Ricerche, Italy
Carolin Kolmeder, University of Helsinki, Finland
Thilo Muth, Bernhard Renard, Robert Koch Institut
Thomas Doak, Jeremy Fisher, Haixu Tang, Sujun Li, Indiana University
Josh Elias, Stanford University
Brook Nunn, U of Washington
Lennart Martens (Co-I), Bart Mesuere, Robbert G Singh
Bjoern Gruening, Bérénice Batut
Lloyd Smith (Co-I), Michael Shortreed, UW-Madison
Anamika Krishanpal, Priyabrata Panigrahi, Persistent Systems Limited
Stephan Kang, Intero Life Sciences
Funding
Acknowledgements
Magnus Øverlie Arntzen, Francesco Delogu, NMBU, Oslo, Norway
galaxyp.org
Proteogenomics: A primer
[Figure: +TOF MS spectrum, 24 MCA scans from Myo_tryptic.wiff, max. 5191.0 counts. x-axis: m/z (amu), 1000 to 3000; y-axis: intensity (counts), 0 to 5191. Labelled peaks (m/z): 1001.4584, 1071.6147, 1271.6925, 1343.7703, 1360.7892, 1378.8696, 1506.9692, 1589.8688, 1606.8892, 1661.8925, 1798.9216, 1815.9397, 1886.0672, 1938.0629, 1959.0339, 1983.1071, 2298.2643, 2316.3092, 2505.3460, 2602.5045]
MS1
MS2
Matching amino acid sequences to MS/MS data
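As a rough illustration of the first step in sequence-to-spectrum matching, the sketch below digests a protein in silico with trypsin rules, computes monoisotopic peptide masses, and compares the resulting m/z values to observed peaks. It is a simplified precursor-mass match only, not the fragment-ion scoring that search engines perform; the function names and the tolerance value are assumptions made for this example.

```python
import re

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # one H2O per peptide (free N- and C-termini)
PROTON = 1.007276   # mass added per positive charge


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def peptide_mz(peptide, charge=1):
    """m/z of the peptide carrying `charge` protons."""
    return (peptide_mass(peptide) + charge * PROTON) / charge


def match_peaks(peptides, observed_mz, tol=0.01, charge=1):
    """Pair each peptide with observed peaks within `tol` (hypothetical tolerance)."""
    matches = []
    for pep in peptides:
        mz = peptide_mz(pep, charge)
        matches.extend((pep, peak) for peak in observed_mz if abs(peak - mz) <= tol)
    return matches
```

In practice the candidate list would come from digesting every entry in the search database, which is why database size matters for both run time and false-positive rate.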
Detecting protein variants via proteogenomics
Comprehensive Database (sample-specific, all possible sequences)
RNA sequences (e.g. RNA-seq): UCGAUCAGGGCAAU (3-frame translation)
DNA sequences: TCGATCAGGGCAAT / AGCTAGTCCCGTTA (6-frame translation)
In-silico translation
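A minimal sketch of in-silico translation, assuming the standard genetic code: RNA is read in the three forward frames, while DNA is read in three forward frames plus three frames on the reverse complement. The function names are chosen for this example and are not taken from any Galaxy tool.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate one reading frame; RNA input (U) is treated as T."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    # Codons with ambiguous bases become 'X'; '*' marks a stop codon.
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def reverse_complement(seq):
    """Reverse complement of a DNA (or RNA) sequence, returned as DNA."""
    return seq.upper().replace("U", "T").translate(COMPLEMENT)[::-1]


def three_frame(seq):
    """Translations of the three forward reading frames."""
    return [translate(seq[offset:]) for offset in range(3)]


def six_frame(seq):
    """Three forward frames followed by three reverse-complement frames."""
    return three_frame(seq) + three_frame(reverse_complement(seq))
```

For example, `three_frame("UCGAUCAGGGCAAU")` translates the RNA fragment shown on the slide, and `six_frame("TCGATCAGGGCAAT")` covers both strands of the DNA fragment.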
Proteogenomic outcomes
Confirms translation of variants
Direct evidence of potential functional variants
Applications in neoantigen discovery (immuno-oncology)
Nature Methods, Vol. 11, No. 11, November 2014
Bringing proteogenomics to the masses: informatics challenges
J. Proteome Res., 2014, 13, pp 5898–5908
• Many software tools, integration, automation...
• RNA-Seq assembly and analysis
• Customized protein dB generation
• Matching sequences to MS/MS data
• Filtering and QC!
• Interpretation! Beyond a list...
PROTEOGENOMICS & ITS CHALLENGES
Ruggles et al. Mol Cell Proteomics 2017;16:959-981
© 2017 by The American Society for Biochemistry and Molecular Biology, Inc.
Challenges
• Large search database sizes
• False-positive sources and their elimination
• Validation of novel peptide identification
• PSM quality evaluation
• Targeted proteomics of identified peptides
• Genomic localization
• Disparate tools and numerous processing steps
Galaxy Platform
• A web-based bioinformatics data analysis platform.
• Software accessibility and usability.
• Share-ability of tools, workflows and histories.
• Reproducibility and ability to test and compare results after using multiple parameters.
• Ability to assimilate disparate software into integrated workflows.
https://galaxyproject.org/
Solution: Galaxy Platform
For example, Protein Database Downloader downloads UniProt protein FASTA databases of various organisms.
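Outside Galaxy, a comparable download can be sketched against the public UniProt REST service. This is not the Protein Database Downloader tool itself; the `uniprotkb/stream` endpoint and query fields are used here as an assumption about the current UniProt API, and the helper names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed UniProt REST endpoint for streaming search results.
UNIPROT_STREAM = "https://rest.uniprot.org/uniprotkb/stream"


def uniprot_fasta_url(taxon_id, reviewed_only=True):
    """Build a FASTA download URL for one organism (NCBI taxonomy ID)."""
    query = f"organism_id:{taxon_id}"
    if reviewed_only:
        query += " AND reviewed:true"   # restrict to Swiss-Prot entries
    params = urlencode({"query": query, "format": "fasta"})
    return f"{UNIPROT_STREAM}?{params}"


def download_fasta(taxon_id, out_path, reviewed_only=True):
    """Fetch the FASTA database and write it to `out_path` (needs network access)."""
    url = uniprot_fasta_url(taxon_id, reviewed_only)
    with urlopen(url) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

Calling `download_fasta(9606, "human.fasta")` would request the reviewed human proteome; within Galaxy the same step is a tool in the workflow, so the download is recorded in the history along with its parameters.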
Software tools can be used in a sequential manner to generate analytical workflows that can be reused, shared and creatively modified.
Workflow #1: RNA-Seq to Variant FASTA database
Proteogenomics Workflows In Galaxy
RNASeq Data
GTF File
• HISAT: Alignment tool
• FreeBayes: Variant calling
• CustomProDB: Variant annotation & Genome mapping
• StringTie: RNA-Seq to Transcripts
• GFF Compare: Compares assembly with annotated transcripts
Genome Mapping Files
PROTEIN SEQUENCE FASTA
UniProt FASTA
10th Annual Meeting of Proteomics Society, India, 2018
Workflow #2: Database Searching Using MS/MS data
RAW Files
• SearchGUI and PeptideShaker
• Peptides for BLAST Search
• PSM Report
• mz to SQLite
Workflow #3: Identifying Novel Variants and Visualization
Summary of peptides
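The core idea of identifying novel variant peptides can be sketched as a filter: a peptide is a candidate if it does not occur anywhere in the reference proteome. The workflow on the slides uses a BLAST search for this comparison; the exact-substring check below is a simplified stand-in, and the function names are assumptions for this example.

```python
def normalize(seq):
    """Treat I and L as identical, since they have the same mass in MS/MS."""
    return seq.upper().replace("I", "L")


def find_novel_peptides(peptides, reference_proteins):
    """Return peptides with no exact match in any reference protein sequence."""
    references = [normalize(protein) for protein in reference_proteins]
    novel = []
    for peptide in peptides:
        key = normalize(peptide)
        if not any(key in ref for ref in references):
            novel.append(peptide)
    return novel
```

With a reference of `["MKTAYIAKQR"]`, the peptide `TAYLAK` is treated as known (it differs only by I/L), whereas `TAYVAK` is reported as a candidate variant peptide that would then go on to spectral quality checks and genomic localization.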
PROTEOGENOMICS WORKFLOW
Proteo-transcriptomics workflows within Galaxy are used to determine protein expression and detect expressed variant proteins.
Transcriptomics workflows within Galaxy are used to generate customized protein databases, estimate gene expression, and detect expressed variant genes.
Quantitative proteotranscriptomics
Kumar P, Panigrahi P, Johnson J, Weber WJ, Mehta S, Sajulga R, Easterly C, Crooker BA, Heydarian M, Anamika K, Griffin TJ, Jagtap P. J Proteome Res. 2019, 18:782-790.
Praveen Kumar (Krishanpal Anamika / Priyabrata Panigrahi)
QuanTP: interactive visualization of RNA-protein response
• Distribution (Transcriptome Data, Proteome Data)
• Differential Expression (Transcriptome Data, Proteome Data)
• Principal component analysis (Transcriptome Data, Proteome Data)
• Cluster Analysis: correlation of RNASeq and proteomics data
• Correlation: Cook’s Distance Analysis
• Influential Points: correlation of RNASeq and proteomics data
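The correlation and influential-point views can be illustrated with a small sketch that computes the Pearson correlation between paired RNA and protein values and the Cook's distance of each point under a simple linear fit. The slides do not describe how QuanTP implements these steps, so this is a generic textbook version with function names chosen for the example.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cooks_distance(x, y):
    """Cook's distance of each point for the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    n_params = 2                                    # intercept and slope
    mse = sum(e * e for e in residuals) / (n - n_params)
    leverage = [1 / n + (a - mx) ** 2 / sxx for a in x]
    return [
        (e * e / (n_params * mse)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverage)
    ]
```

Here `x` would hold transcript-level values (for example log fold changes) and `y` the matching protein-level values; genes with a large Cook's distance are the ones whose removal would most change the fitted RNA-protein relationship.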
Multi-Omics Visualization Platform:
Characterizing the nature of detected variants
• HTML-based Galaxy plugin
• Interactive reading of mzsqlite dB
https://www.biorxiv.org/content/10.1101/842856v2.abstract
Tom McGowan
MULTI-OMICS VISUALIZATION PLATFORM FOR VISUALIZING NOVEL PROTEOFORMS
SPECTRAL QUALITY VISUALIZATION (Lorikeet Viewer)
GENOMIC LOCALIZATION (Integrative Genomics Viewer)
CRAVAT-P: Assessing potential impact of variants
Sajulga R, Mehta S, Kumar P, Johnson JE, Guerrero CR, Ryan MC, Karchin R, Jagtap PD, Griffin TJ. J Proteome Res. 2018, 17:4329-4336.
Cancer-Related Analysis of Variants Toolkit (cravat.us) developed by Rachel Karchin and Michael Ryan
Assessing potential impact of protein-level variants: CRAVAT-P
• Intersection of transcript variants and confirmed protein variants
Ray Sajulga
Unleashing the power of CRAVAT on proteogenomic results
ndexbio.org
https://jraysajulga.github.io/cravatp-galaxy-docker/
• HTML-based Galaxy plugin
• Interactive viewer
COMING SOON
• PepQuery Tool uses a peptide-centric approach for validation by a) competitive filtering; b) statistical evaluation; c) unrestricted modification search and d) visualization of peptides corresponding to novel proteoforms.
Wen et al. Genome Res. 2019; 29(3): 485–493. doi: 10.1101/gr.235028.118
• Extend MVP, QuanTP and CRAVAT-P tools
• Integrate newer tools from our collaborators to extend the existing workflows.
Accessing the Multi-omic Workflows
PUBLIC INSTANCES
Proteogenomics Gateway: z.umn.edu/proteogenomicsgateway
Step-by-step instructions: z.umn.edu/pginnov18
Metaproteomics Gateway: z.umn.edu/metaproteomicsgateway
Step-by-step instructions: z.umn.edu/suppS1
Tools and workflows also available on: https://proteomics.usegalaxy.eu/
ALSO AVAILABLE ON:
GitHub: https://github.com/galaxyproteomics
Galaxy Toolshed: https://toolshed.g2.bx.psu.edu/
Docker: https://jraysajulga.github.io/cravatp-galaxy-docker/
Training workflows also available on: https://training.galaxyproject.org
Conclusions
• Proteogenomics workflows that generate quantitative peptide and protein-level values are available within the Galaxy platform.
• Post-search analysis tools such as QuanTP, MVP and CRAVAT-P help understand the biological context of the data. We plan to extend these tools.
• There is a need to integrate statistical tools and methods to offer a much more comprehensive perspective of proteogenomics data.
We can be reached at:
Published Manuscripts: z.umn.edu/galaxypreferences
Galaxy-P Presentations: http://galaxyp.org/conference-presentations
Contact: http://galaxyp.org/contact/
Twitter: twitter.com/usegalaxyp
The Galaxy-P Team at University of Minnesota