proteomicsmyriad
Bioinformatics Industrial Applications in High Throughput Proteomics
Alan F. James
Director of Software Development
Myriad Proteomics, Inc., Salt Lake City
What is Proteomics?
• Proteomics refers to the study of the protein constituents and protein activities of a cell, a tissue or an organism.
• Proteomics may be seen from several viewpoints:
– Protein Expression
– Protein Interaction (Interactome)
– …
Challenges in Proteomics
• Proteins are all different
– some degrade easily, some are sticky, many require accessory factors
• Proteins are more complex than DNA
– there are several protein forms per gene
– proteins are post-translationally modified
• There isn’t really ONE proteome in humans
• Proteins change:
– with cell type
– during differentiation
– during development
– in response to stimuli
– with cell cycles
• So which Proteome do you study?
Methods of Analyzing Proteomes
– Expression, Abundance, Distribution
– Structural Genomics
– Protein-Protein Interaction Analysis
• Yeast two-hybrid system
• Mass spectrometry of protein complexes
[Figure: Normal cell vs. Cancer cell comparison; labels: PDIRP5, OS-9, MPO-XYZ, novel, novel, NCALD, CASP3]
Methods of Analyzing Proteomes by Comprehensive Surveys of Protein-Protein Interactions
Mass Spectrometry
• Allows identification of the proteins in a complex of many proteins (2-100) that carry out some cellular function.
Yeast two-hybrid (Y2H)
• Measures association between two proteins.
• Allows very high throughput.
Y2H Background Information: Gene Activity in Yeast
Yeast transcription factors are composed of a DNA Binding Domain and a Transcriptional Activation Domain.
[Figure: a Transcription Factor (DNA Binding Domain plus Activation Domain) at a Yeast Gene, with activation]
The DNA Binding Domain recruits the Activation Domain to the yeast gene, which allows the yeast gene to be active.
Principles of the Yeast Two-Hybrid System
1. The DNA Binding Domain is separated from the Transcriptional Activation Domain of a transcription factor.
2. Libraries of human proteins are fused to both domains to create “hybrid” proteins.
3. The recruitment of the Activation Domain to the yeast gene is now mediated by interactions of the human proteins.
[Figure: (1) DNA Binding Domain and Activation Domain separated at the Yeast Gene; (2) Human Protein X fused to the DNA Binding Domain and Human Protein Y fused to the Activation Domain; (3) interaction of the human proteins at the Yeast Gene, with activation]
Yeast Two-Hybrid Screens: Assay for Interactions
Scenario A: Human Proteins X and Y do not Interact
• Bait: DNA Binding Domain fused to Human Protein X
• Prey: Activation Domain fused to Human Protein Y
• ( No Reporter Gene Activity )
• Readout: No growth of yeast colonies
Scenario B: Human Proteins X and Z do Interact
• Bait: DNA Binding Domain fused to Human Protein X
• Prey: Activation Domain fused to Human Protein Z
• Readout: Yeast colonies grow
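The two scenarios can be sketched as a tiny model: the reporter gene is active only when the bait and prey proteins interact. This is an illustration only; the `INTERACTIONS` set and the function names are hypothetical and not part of the original slides.

```python
from typing import FrozenSet, Set

# Hypothetical ground truth for illustration: only X and Z interact.
INTERACTIONS: Set[FrozenSet[str]] = {frozenset({"X", "Z"})}


def reporter_active(bait: str, prey: str) -> bool:
    """Return True when the bait and prey proteins interact.

    In the assay, an interaction brings the Activation Domain (fused to the
    prey) to the DNA Binding Domain (fused to the bait) at the reporter gene.
    """
    return frozenset({bait, prey}) in INTERACTIONS


def readout(bait: str, prey: str) -> str:
    """Translate reporter gene activity into the colony-growth readout."""
    if reporter_active(bait, prey):
        return "Yeast colonies grow"
    return "No growth of yeast colonies"


if __name__ == "__main__":
    print("Scenario A:", readout("X", "Y"))
    print("Scenario B:", readout("X", "Z"))
```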
Directed vs. Random Approach
Directed: selecting specific proteins as baits for Y2H analysis.
Random: using individual baits picked at random from libraries of baits.
The random approach can be used to rapidly generate large amounts of interaction data.
Random Two-Hybrid (R2H) Process Overview
1. Library Construction: Produce DNA Binding Domain (BD) and Activation Domain (AD) libraries from cDNA synthesized from mRNA libraries using random primers.
2. Pick BD-Colonies: Put yeast colonies containing BD-hybrid proteins into 96-well culture plates.
3. Mating w/AD-Library: Add yeast containing the AD-hybrid proteins to the 96-well plates with the yeast colonies picked in (2.); allow yeast mating to occur.
4. Selection Plating: Plate yeast matings onto dishes containing selective medium that allows yeast to grow only if the human hybrid proteins interact.
5. Incubation: Allow several days for yeast that contain interacting human proteins to grow.
6. Pick Growing Yeast: Pick yeast colonies containing interacting human proteins (“Positives”) and put them into 96-well culture plates.
7. Amplify Human DNA: Amplify the human DNA that encodes the interacting proteins by PCR.
8. DNA Sequencing: Sequence the amplified DNA and identify the interacting proteins.
Mass Spectrometry
Vital tasks in cells are often performed by Multi-Protein Complexes (MPC).
Mass Spectrometry
[Figure: workflow diagram. Stages: Gene Cloning, Cell Biology, Protein Purification, Mass Spectrometry. Labels: pENTR; pDEST1 to pDEST5; Protein “Baits” with Handles (Affinity Tags); Protein “Preys”]
Pulldown Assay
[Figure: pulldown assay diagram. Labels: Bait Protein with Purification Tag; Incubate with cell extract; Complex formation with Associated Proteins; Affinity Beads; Non-binding Proteins; Elute; Separate proteins; Identify by Mass Spectrometry]
Mass Spectrometry
[Figure: workflow diagram. Stages: cDNA Cloning, Cell Biology, Protein Purification, Mass Spectrometry. Labels: pENTR; pDEST1 to pDEST5; Protein “Baits” with Handles; Protein “Preys”; MPC]
Mass Spectrometry Procedure
1. Purified protein complex
2. Protein separation
3. Protein digestion
4. Mass Spec. analysis
5. Mass spectrum
6. Database Searching (Peptide Mass Fingerprint Search)
7. Protein ID
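A peptide mass fingerprint search can be sketched as follows: digest each database protein in silico, compute the peptide masses, and count how many observed peaks they explain. This is a minimal sketch under stated assumptions, not the search engine used in the talk: it assumes trypsin digestion (cleave after K or R unless followed by P), neutral monoisotopic masses rather than m/z values, no missed cleavages or modifications, and a plain match count as the score. The function names, toy sequences, and tolerance are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Monoisotopic residue masses in daltons.
MONO: Dict[str, float] = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_digest(sequence: str) -> List[str]:
    """Split after K or R, except when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def pmf_search(peaks: List[float], database: Dict[str, str],
               tol: float = 0.2) -> List[Tuple[str, int]]:
    """Rank proteins by the number of observed peaks they explain."""
    scores = {}
    for name, sequence in database.items():
        masses = [peptide_mass(p) for p in tryptic_digest(sequence)]
        scores[name] = sum(
            any(abs(mass - peak) <= tol for mass in masses) for peak in peaks
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy_db = {"protA": "GASPKVTCLR", "protB": "WWYYKFFHHR"}  # toy sequences
    observed = [peptide_mass(p) for p in tryptic_digest(toy_db["protA"])]
    print(pmf_search(observed, toy_db))
```

Production search engines use probabilistic scoring and handle missed cleavages and modifications; the match count here only illustrates the principle.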
Summary of Protein-Protein Interaction Analysis Methods
[Figure: example associations among proteins A through L]
Mass Spectrometry: Yields sets of n-ary associations among proteins (that may represent protein complexes).
Random Yeast Two-Hybrid: Yields sets of binary associations between protein fragments (that may represent protein-protein interactions).
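The difference between binary and n-ary associations shows up directly in the data structures. The sketch below stores Y2H results as unordered pairs and a pull-down result as a bait plus its set of co-purified proteins, then expands the n-ary set into candidate pairs in two ways. The "spoke" (bait to each prey) and "matrix" (all pairs) expansions are a common convention in the interaction-data literature, not something stated in the slides, and the protein names are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set

Pair = FrozenSet[str]


def spoke_pairs(bait: str, preys: Iterable[str]) -> Set[Pair]:
    """Pair the bait with each co-purified protein (spoke expansion)."""
    return {frozenset({bait, prey}) for prey in preys if prey != bait}


def matrix_pairs(bait: str, preys: Iterable[str]) -> Set[Pair]:
    """Pair every member of the complex with every other (matrix expansion)."""
    members = set(preys) | {bait}
    return {frozenset(pair) for pair in combinations(sorted(members), 2)}


if __name__ == "__main__":
    # Hypothetical data: Y2H binary associations and one pull-down result.
    y2h = {frozenset({"A", "B"}), frozenset({"C", "K"})}
    pulldown_bait, pulldown_preys = "A", {"B", "C"}

    spoke = spoke_pairs(pulldown_bait, pulldown_preys)
    matrix = matrix_pairs(pulldown_bait, pulldown_preys)
    print("Y2H pairs supported by the pull-down (spoke):", y2h & spoke)
    print("Number of candidate pairs (matrix):", len(matrix))
```

The spoke expansion is the more conservative choice; the matrix expansion assumes every member of a complex contacts every other, which inflates the number of candidate pairs.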
The Goal: Biological Relevance
[Figure: pathway diagram; Underlying Pathway Adopted from http://www.kegg.com. Labels: fibril formation, deposition; Amyloid Plaque, Neurofibrillary Tangle Formation; APOPTOSIS. Legend: New Protein-Protein Interaction; Known Protein-Protein Interaction; Transduction Pathway; Known Pathway Member; Identified Interactor; Novel Transcript; Traditional “Drugable” Enzyme; Other Enzymes]
Role of Bioinformatics in Proteomics
Data Collection, Analysis, and Interpretation
[Figure: diagram relating Data, Information, and Knowledge]
• Stages: Data Collection; Automated Data Reduction; Automated Data Analysis; Manual/Experimental Data Analysis
• Activities: LIMS; Base Calling; Mass Peak List Determination; BLAST/PMF Searches; Identification of Loci/Domains/Proteins; Identification of binary and n-ary interactions; Identification of participation in protein complexes; Identification of protein interaction networks; Identification of participation in disease pathway; Identification as potential drug target
• Disciplines: Biology; Computational Biology; Software Development; Data Warehousing
Bioinformatics Techniques Used in Proteomics
• Robot programming
• Software engineering
• Database modeling and design
• Data warehouses and Data Marts
• Database federation
• Grid Computing
• Information Visualization
• Graph analysis, graph layout and display
• Hidden Markov Models
• Bayesian networks
• Statistical models
• Signal Processing
• Algorithm development
• …
Objectives of Bioinformatics in Proteomics
1. Automate and manage high-throughput laboratory processes.
2. Retrieve, collect, and store experimental interaction data.
3. Analyze, reduce, and extend experimental interaction data.
4. Mine and visualize interaction analysis results.
Automate and Manage Laboratory Processes
Laboratory Automation
• High-throughput proteomics is not possible without a high degree of laboratory automation.
• Instruments and robotics must interact directly and reliably with LIMS (Laboratory Information Management System).
Automate and Manage Laboratory Processes
Laboratory Information Management System (LIMS)
• High-throughput proteomics is not possible without a sophisticated LIMS.
• The LIMS provides the foundation for all automated data collection, reduction, and analysis.
• Multiple LIMS systems are required (e.g., Y2H, Sequencing, Gene Cloning, Protein Pull-down, Mass Spec., etc.).
• May collect very large amounts of data.
• Fast runtime performance of the LIMS is essential to deal with the high volume of transactions and possible near real-time interactions between the LIMS and robotics and instruments.
• High availability of the LIMS and supporting computer systems is required to support production laboratories and time-critical operations.
• May be one of the most (if not the most) labor intensive (programming, database management, and system management) and expensive software systems in the enterprise.
Automate and Manage Laboratory Processes
Functions of the Laboratory Information Management System (LIMS)
• Track samples consistently through a protocol so that each sample:
– Is identified.
– Is linked to the appropriate results.
– Is linked to the protocol used to process the sample.
– Is linked to any related samples, reagents, etc.
– Can be located physically.
• Manage and enforce the protocol used to process a sample.
• Capture laboratory quality control information and provide displays, reports, statistical analyses, etc. to allow management and quality control of the laboratory.
• Provide interfaces for laboratory personnel, robotics, and instruments to support high-throughput operations.
• Capture results directly from laboratory instruments.
• Provide experimental results in a format suitable for analytical programs.
• Provide the interface between analytical systems and instruments (such as Mass Spectrometers) that require real-time (or near real-time) analysis during operation.
• Manage laboratory personnel work lists, incident alerting, reporting and correction, etc.
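The sample-tracking and protocol-enforcement functions listed above can be sketched as a small record type. This is a hypothetical in-memory illustration, not the LIMS described in the talk (which the architecture slide shows as Java applications backed by databases); the class, field names, and step names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Sample:
    """A tracked sample: identified, linked to protocol, results, and location."""
    sample_id: str
    protocol: List[str]                 # ordered step names
    location: str                       # e.g. plate barcode and well
    related: List[str] = field(default_factory=list)    # related samples, reagents
    completed: List[str] = field(default_factory=list)  # steps done so far
    results: Dict[str, str] = field(default_factory=dict)

    def next_step(self) -> Optional[str]:
        """Return the next protocol step, or None when the protocol is done."""
        done = len(self.completed)
        return self.protocol[done] if done < len(self.protocol) else None

    def record(self, step: str, result: str,
               new_location: Optional[str] = None) -> None:
        """Record a result, enforcing that steps run in protocol order."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(
                f"{self.sample_id}: expected step {expected!r}, got {step!r}"
            )
        self.completed.append(step)
        self.results[step] = result
        if new_location is not None:
            self.location = new_location


if __name__ == "__main__":
    sample = Sample("S-0001", ["pick", "mate", "plate"], "PLATE-A:A1")
    sample.record("pick", "colony picked", new_location="PLATE-B:C7")
    print(sample.next_step(), sample.location)
```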
Automate and Manage Laboratory Processes
LIMS Architecture
[Figure: LIMS architecture diagram. Components: LIMS Server (Java Socket Application); Lab Workstations (Java Application); Robots or Instruments; LIMS Database(s); LIMS Data Warehouse(s) (ODS); Analysis Databases; Web Application Server; Web-based Management Clients (Servlets, JSP, CGI Script). Connection labels: SQL Net; XML]
Collect, Store, and Retrieve Experimental Data
Yeast two-hybrid Data
• Electropherograms for sequence forward and reverse reads
• Sequences and sequence quality scores from base-calling
• Robot/Instrument Operational Parameters
• Quality control data
– Distributions of positive colonies within a search
– Distributions of sequencing reaction success/failure within a plate.
Yeast two-hybrid Data Collection Challenges
• Transmission of electropherograms from remote sequencing facility and associated error handling.
• Relating/correlating data received from remote sequencing facility with LIMS data.
• Archival of electropherograms.
• Retrieval of archived electropherograms.
Collect, Store, and Retrieve Experimental Data
Mass Spectrometry Data
• Spectrograms
– Multiple Instruments (MALDI-TOF, Electrospray/Ion Trap, etc.)
– Multiple spectrogram types (MS, MS/MS)
– Individual samples may be analyzed with multiple instruments, mass spectrogram types.
– False Positive/Contamination Control Sample Spectrograms
• Mass Peak Lists derived from spectrograms
• Mass Spectrometry Instrument Operational Parameters
Mass Spectrometry Data Collection Challenges
• Individual experiments will generate many spectrograms.
• Interfacing with instrument to retrieve spectrograms and mass peak lists.
• Archival of spectrograms and mass peak lists
• Retrieval of archived spectrograms and mass peak lists
Collect, Store, and Retrieve Experimental Data
External Data Sources
• NCBI LocusLink, RefSeq, GenBank, …
• SwissProt, PFAM, …
• Gene Ontology, …
• KEGG, …
• PubMed, Manually curated papers, …
External Data Sources Challenges
• Wide variety of data formats.
• Integrating or federating disparate data sources with internal databases.
• Sometimes questionable quality of data.
• Data sources frequently change/evolve
– Changes may invalidate previous analysis results.
– May require analysis databases to support versioning of results.
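One way to support versioning of results is to store, with every analysis result, the versions of the external sources it was computed against, so that results can be flagged when a source changes. The sketch below is a hypothetical illustration of that idea; the record layout, source names, and version strings are assumptions, not the schema used in the talk.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AnalysisResult:
    """An annotation plus the external source versions it depends on."""
    sequence_id: str
    annotation: str
    source_versions: Tuple[Tuple[str, str], ...]  # (source name, version)


def stale_results(results: List[AnalysisResult],
                  current_versions: Dict[str, str]) -> List[AnalysisResult]:
    """Return results computed against a source version that is no longer current."""
    return [
        result for result in results
        if any(current_versions.get(name) != version
               for name, version in result.source_versions)
    ]


if __name__ == "__main__":
    # Hypothetical version labels for illustration.
    results = [
        AnalysisResult("seq1", "locus 1234", (("RefSeq", "r1"),)),
        AnalysisResult("seq2", "locus 5678", (("RefSeq", "r2"),)),
    ]
    for result in stale_results(results, {"RefSeq": "r2"}):
        print("needs re-analysis:", result.sequence_id)
```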
Analyze, Reduce, and Extend Experimental Data
• The goal of data analysis is to extract or discover biological relevance from the raw data.
• Raw data must be “cleaned”, filtered, and transformed
– Vector/adaptor identification & clipping
– Sequence assembly
– Consensus sequence identification
– Peptide mass fingerprint (PMF) searching
– False positive detection/filtering.
• Data representations must be modeled and developed.
– How to represent interaction data?
• Sequences? Electropherograms? Mass Peak Lists?
• Interactions? Pathways? Sequence Annotations?
• Many other biological concepts / processes / functions?
– How to organize data structures to enable querying (analysis) involving
• Many tables
• >1 million rows in some tables
• filtering, aggregation, and computation of data
• Analysis algorithms must be developed/adapted.
• Statistical models must be developed/validated.
Example: Y2H Data Analysis Process Flow
Y2H Laboratory
1. Send/Receive Lab Sequence: Track Sequence Submitted; Versioning
2. Perform Basecalling: Sequence String; Quality Score; Quality Matrix
3. Perform QC and Clean Lab Sequence: Failed Requeue; Vector Clipping; Repeat Masking; Low Quality Filter
4. Annotate/Identify Lab Sequences: BLAST, Parameters, Version; Homologous Seqs, Splice Variants; Domain Search
5. Construct Interaction Pair: Frequency of Interaction; Confidence Level; Collect False Positive, Self Activators
6. Construct Interaction Map: Visualization; Query; Compare Difference
7. Integrate External Evidence: Gene Expression; Pathway; Disease
8. Perform Downstream Analysis
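Three steps of this flow (vector clipping, the low quality filter, and counting the frequency of interaction) can be sketched in a few functions. This is a simplified illustration under stated assumptions: the vector is matched as an exact prefix rather than by alignment, quality is judged by mean score and read length, and the thresholds, sequences, and locus names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def clip_vector(read: str, vector_prefix: str) -> str:
    """Remove a known vector/adaptor prefix when it matches exactly."""
    if vector_prefix and read.startswith(vector_prefix):
        return read[len(vector_prefix):]
    return read


def passes_quality(quality_scores: List[int], min_mean: float = 20.0,
                   min_length: int = 50) -> bool:
    """Keep reads that are long enough and have a sufficient mean quality."""
    if not quality_scores or len(quality_scores) < min_length:
        return False
    return sum(quality_scores) / len(quality_scores) >= min_mean


def interaction_frequencies(
        observations: Iterable[Tuple[str, str]]) -> Counter:
    """Count how often each (bait locus, prey locus) pair was observed."""
    return Counter(observations)


if __name__ == "__main__":
    print(clip_vector("GGATCCACGTACGT", "GGATCC"))
    print(passes_quality([30] * 60), passes_quality([10] * 60))
    pairs = [("loc21", "loc22"), ("loc21", "loc22"), ("loc23", "loc24")]
    print(interaction_frequencies(pairs).most_common(1))
```

Reads that fail the filter would go to the "Failed Requeue" branch of the flow; the pair counts feed the frequency and confidence fields of the interaction pair step.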
Dealing with False Positives
• False positives will always be generated.
– Y2H
• “Self-activating” baits.
• “Promiscuous” preys.
– Mass Spectrometry
• Proteins that interact directly with affinity beads.
• Proteins that interact directly with affinity tags.
• Contaminants.
• False positives are very hard to detect and distinguish from real positives.
• False positives must be addressed both biologically and informatically:
– Known false positives can be “subtracted” from Y2H AD/BD libraries before experiments.
– Mass spectrometry control experiments with affinity beads, affinity tags, and background contaminants can be “subtracted” from results.
– Known false positives can be “subtracted” during analysis.
– Statistical tests can be developed to help identify possible false positives during analysis.
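The informatic side of this can be sketched in two steps: subtract interactions that involve known false positives, then flag preys that appear with unusually many distinct baits as possibly promiscuous. The fixed cutoff below is a crude stand-in for the statistical tests mentioned above, and the function names, threshold, and protein names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Interaction = Tuple[str, str]  # (bait, prey)


def subtract_known(pairs: Iterable[Interaction],
                   known_false_positives: Set[str]) -> List[Interaction]:
    """Drop interactions whose bait or prey is a known false positive."""
    return [
        (bait, prey) for bait, prey in pairs
        if bait not in known_false_positives
        and prey not in known_false_positives
    ]


def promiscuous_preys(pairs: Iterable[Interaction],
                      max_baits: int = 3) -> Set[str]:
    """Flag preys seen with more than max_baits distinct baits."""
    baits_by_prey: Dict[str, Set[str]] = defaultdict(set)
    for bait, prey in pairs:
        baits_by_prey[prey].add(bait)
    return {prey for prey, baits in baits_by_prey.items()
            if len(baits) > max_baits}


if __name__ == "__main__":
    observed = [("b1", "sticky"), ("b2", "sticky"), ("b3", "sticky"),
                ("b4", "sticky"), ("b1", "p1"), ("selfact", "p2")]
    cleaned = subtract_known(observed, {"selfact"})
    print(cleaned)
    print(promiscuous_preys(cleaned))
```

Flagged preys are candidates for review rather than automatic removal, since a prey with many partners may also be a genuine hub protein.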
Mine and Visualize the Results of Analysis
• Proteomics-specific data mining tools are required to extract meaningful knowledge from massive amounts of data.
– Flexible searching capabilities.
– Flexible filters to reduce the amount of data.
– Multiple views of the data.
– Ad-hoc query tools for unanticipated data mining needs.
– Data warehouses and/or data marts are required to support data mining without impacting performance sensitive LIMS and analytic systems.
• Visualization tools are required to visually organize the data and reveal meaningful patterns.
– Quality control visualizations.
– Interaction network visualizations.
– Interaction network visualizations with experimental data overlays.
– Disease and metabolic pathway visualizations with interaction network overlays.
Quality Control Visualization (1)
[Figure: Scatter Plot; x-axis: SEARCHID (24148 to 39413); y-axis: 0 to 160]
Quality Control Visualization (2)
[Figure: Plate-by-plate Sequencing Purity Monitor; one panel per plate (plate IDs with RA and RB prefixes, e.g. RA0000055, RB0000136); axes labeled by well, with column ticks 1 to 24 and row ticks A to P]
Quality Control Visualization (3)
Y2H Interaction Map with Curated Promiscuous Protein Annotation
[Figure: interaction map plotted by prey; interacting preys and interacting baits highlighted with their pronet annotation; legend: Prey Annotated, Bait Annotated]
Interaction Network Sub-Graph Visualization
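A sub-graph for display is typically extracted around a protein of interest. The sketch below collects all proteins within a given number of interaction steps of a seed using breadth-first search, then keeps the edges among them. It is a minimal illustration; the function name, the radius parameter, and the example edges are hypothetical, and no layout or drawing is attempted.

```python
from collections import defaultdict, deque
from typing import Dict, FrozenSet, Iterable, Set, Tuple


def neighborhood(edges: Iterable[Tuple[str, str]], seed: str,
                 radius: int) -> Tuple[Set[str], Set[FrozenSet[str]]]:
    """Return the nodes within `radius` steps of `seed` and the edges among them."""
    edge_list = list(edges)
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edge_list:
        adjacency[a].add(b)
        adjacency[b].add(a)

    distance = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue  # do not expand beyond the requested radius
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)

    nodes = set(distance)
    sub_edges = {frozenset((a, b)) for a, b in edge_list
                 if a in nodes and b in nodes}
    return nodes, sub_edges


if __name__ == "__main__":
    network = [("loc21", "loc22"), ("loc22", "loc23"), ("loc23", "loc24")]
    print(neighborhood(network, "loc21", 2))
```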
Y2H Interaction Network Sub-Graph Visualization with Protein Pull-down Overlay
[Figure: network sub-graph; node labels: loc21, loc22, loc23, loc24, loc25]
Pathway with Interaction Network Annotation
[Figure: pathway diagram; Underlying Pathway Adopted from http://www.kegg.com. Labels: fibril formation, deposition; Amyloid Plaque, Neurofibrillary Tangle Formation; APOPTOSIS. Legend: New Protein-Protein Interaction; Known Protein-Protein Interaction; Transduction Pathway; Known Pathway Member; Identified Interactor; Novel Transcript; Traditional “Drugable” Enzyme; Other Enzymes]