distributed approach for peptide identification
![Page 1: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/1.jpg)
Distributed Approach for Peptide Identification
By Naga Venkata Krishna Abhinav Vedanbhatla
![Page 2: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/2.jpg)
Outline
• Background
• About C-Ranker
• Problem Statement
• Proposed Solution
• Architecture
• Implementation
• Execution Environment
• Results
• Conclusions & Future Scope
![Page 3: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/3.jpg)
Background
![Page 4: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/4.jpg)
Protein
• Biomolecules consisting of one or more chains of amino acids
• Proteins differ from one another primarily in their sequence of amino acids
• A protein is characterized by the sequence of amino acids as they occur in the protein
![Page 5: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/5.jpg)
Proteins (Cont.…)
• Proteins perform a vast array of functions within living organisms.
• A protein contains at least one long polypeptide.
• Proteins are involved in almost every biological process happening in an organism’s body.
• Important in drug development for targeting specific metabolic pathways.
![Page 6: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/6.jpg)
Peptide
• Short chain of amino acids
• Peptides are distinguished from proteins on the basis of size
• A protein is first digested into peptides and then each peptide is identified individually to infer the protein identity.
![Page 7: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/7.jpg)
Finding Peptide and Protein Relationship?
• Peptide Identification using mass spectrometry
![Page 8: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/8.jpg)
Determine the sequence of Peptides
• Peptide mass fingerprinting (PMF) in MS spectra
![Page 9: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/9.jpg)
Peptide Identification Using MS/MS Spectra
• Sequence database searching (for the large-scale dataset)
• de novo sequencing (new protein discovery)
• Post sequence database searching (extension of sequence database search)
![Page 10: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/10.jpg)
Peptide Identification (cont…)
• Mass Spectrometry (MS) strategy
• Sequence database searching
• Combination of both: dominant method for peptide identification
• Which results in more spectra from MS
![Page 11: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/11.jpg)
Sequence Database searching algorithms.
• SEQUEST
• Mascot
![Page 12: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/12.jpg)
Post Database Algorithms
• Machine learning algorithms are proposed to identify the peptide spectrum matches (PSM)
![Page 13: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/13.jpg)
Post Database Search Algorithms
• PeptideProphet: learns the distribution of scores and properties
• Percolator: search scores considered reliable for high values and low values
• CRanker: fuzzy SVM and silhouette index
![Page 14: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/14.jpg)
CRanker
![Page 15: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/15.jpg)
C-Ranker
• Its purpose is to identify the correct PSMs among those output by the peptide database search.
• Developed in Matlab and C
![Page 16: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/16.jpg)
Why CRanker?
• Based on research by Dr. Zhonghang Xia, it performs best among the compared algorithms.
• Easy to parallelize, so it can run on a network of computers rather than a single computer and work at a larger scale
![Page 17: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/17.jpg)
Why CRanker? (cont…)

| Data | PeptideProphet | CRanker | Overlap |
| --- | --- | --- | --- |
| UPS1 | 582 | 576 | 509 |
| PBMC | 34035 | 34273 | 32243 |

The overlap of the total PSMs identified by PeptideProphet and CRanker is 88.4% on UPS1 and 94.8% on PBMC.
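
The overlap percentages can be recomputed from the counts in the table. The slide does not say which tool's total is used as the denominator, so the sketch below prints the ratio against both totals; the function name `overlap_percent` is only illustrative.

```python
# PSM counts copied from the table on this slide.
counts = {
    "UPS1": {"PeptideProphet": 582, "CRanker": 576, "overlap": 509},
    "PBMC": {"PeptideProphet": 34035, "CRanker": 34273, "overlap": 32243},
}


def overlap_percent(overlap: int, total: int) -> float:
    """Share of `total` PSMs that were identified by both tools, in percent."""
    return 100.0 * overlap / total


for dataset, row in counts.items():
    for tool in ("PeptideProphet", "CRanker"):
        pct = overlap_percent(row["overlap"], row[tool])
        print(f"{dataset}: {pct:.1f}% of the {tool} PSMs are shared")
```

Which denominator the slide used is an assumption either way, so small differences from the quoted percentages are expected.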
![Page 18: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/18.jpg)
CRanker Execution: Step 1
• C-Ranker Read Stage
• Reads InputFileName.txt
• Loads raw PSM data into main memory
• Generates InputFileName.mat
![Page 19: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/19.jpg)
CRanker Execution: Step 2
• C-Ranker Solve Stage
• Reads InputFileName.txt
• Loads PSM records into main memory
• Creates InputFileName_score.mat
![Page 20: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/20.jpg)
CRanker Execution: Step 3
• C-Ranker Write Stage
• Reads InputFileName.txt and InputFileName_score.mat
• Loads PSM scores into main memory
• Creates OutputFile.txt
![Page 21: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/21.jpg)
Problem Statement
![Page 22: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/22.jpg)
Problem Statement
• C-Ranker needs a computer with high computation power
• For a dataset with about 400,000 PSM records, it may take about 5 to 8 hours on a normal PC
• Poor resource management
• Needs to handle future big data sets
![Page 23: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/23.jpg)
Can’t we change C-Ranker?
• Research is ongoing to optimize C-Ranker.
• Distributed approach of C-Ranker.
• Needs a rewrite of the complete code!
![Page 24: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/24.jpg)
Brainstorming
• Can we divide the 400,000 PSM records across 4 machines (and who will do the dividing) and get the job done?
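
As a rough illustration of the dividing question, the sketch below splits a record count into near-equal contiguous index ranges, one per machine. The function name `partition_ranges` and the idea of a coordinating node doing the split are assumptions for illustration, not details taken from the slides.

```python
def partition_ranges(n_records: int, n_workers: int) -> list[tuple[int, int]]:
    """Return half-open (start, stop) index ranges, one per worker.

    Sizes differ by at most one record, so no machine gets a much larger share.
    """
    base, extra = divmod(n_records, n_workers)
    ranges = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers take one additional record each.
        size = base + (1 if worker < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# The example from the slide: 400,000 PSM records across 4 machines.
print(partition_ranges(400_000, 4))
```

With 400,000 records and 4 machines each range holds 100,000 records; when the count does not divide evenly, the remainder is spread over the first few workers.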
• Increase computational power?
![Page 25: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/25.jpg)
Constraints!
• Restrictions on changing the C-Ranker design and code (I am not experienced enough to do so)
• Should not change the execution flow of C-Ranker.
![Page 26: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/26.jpg)
Shortlisted Approach
• Fundamental Distributed Approach
![Page 27: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/27.jpg)
Why Distributed Framework?
• It can handle bigger datasets than it would be able to in a centralized setting.
• Requires less memory per computer and each computer can have commodity hardware.
• Cheaper to have multiple commodity hardware computers than having a single high-performance high-end system capable of achieving similar goals.
![Page 28: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/28.jpg)
Job Execution in Distributed Approach
![Page 29: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/29.jpg)
Proposed Solution
![Page 30: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/30.jpg)
Proposed Solution
• A framework to execute C-Ranker on distributed nodes.
• Designed so that it can work with other post-database-search algorithms like C-Ranker with minimal changes
• Compare the time taken to generate the distributed C-Ranker output with that of the original C-Ranker output
• Make sure the C-Ranker algorithm executes correctly on the set of predefined nodes
![Page 31: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/31.jpg)
Architecture
![Page 32: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/32.jpg)
Implementation
![Page 33: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/33.jpg)
Data Flow Details in the Original Single-Threaded C-Ranker
![Page 34: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/34.jpg)
Data Flow Details in Distributed C-Ranker (Dividing)
![Page 35: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/35.jpg)
Data Flow For a Single Worker Host
![Page 36: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/36.jpg)
Data Flow Details in Distributed C-Ranker (Merging)
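
The divide, per-worker, and merge steps named on these slides can be sketched in a few lines. This is only a minimal single-machine illustration: threads stand in for the worker hosts, and `score_chunk` is a hypothetical placeholder for a worker running the unmodified C-Ranker on its share of the data. None of these names come from the original framework.

```python
from concurrent.futures import ThreadPoolExecutor


def divide(records, n_workers):
    """Split records into n_workers contiguous chunks of near-equal size."""
    base, extra = divmod(len(records), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        stop = start + base + (1 if i < extra else 0)
        chunks.append(records[start:stop])
        start = stop
    return chunks


def score_chunk(chunk):
    """Placeholder for one worker host running C-Ranker on its chunk.

    It only attaches a dummy score of 0.0 to each record.
    """
    return [(record, 0.0) for record in chunk]


def merge(results):
    """Concatenate the per-worker outputs in worker order."""
    merged = []
    for part in results:
        merged.extend(part)
    return merged


def run_distributed(records, n_workers=4):
    chunks = divide(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        results = list(pool.map(score_chunk, chunks))
    return merge(results)


scored = run_distributed([f"psm_{i}" for i in range(10)])
print(len(scored), scored[0])
```

Keeping the scoring step as an opaque call reflects the stated constraint that C-Ranker's code and execution flow stay unchanged; only the splitting and merging live outside it.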
![Page 37: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/37.jpg)
Execution Environment
![Page 38: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/38.jpg)
Execution Environment
• JAVA
• MATLAB MCR environment
• Apache Tomcat web server
![Page 39: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/39.jpg)
Input Data used
![Page 40: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/40.jpg)
Hardware Used to Observe Results

| Servers | Server_1 | Server_2 | Server_3 | Server_4 |
| --- | --- | --- | --- | --- |
| Memory | 8GB | 4GB | 4GB | 4GB |
| Processor | i5 | i5 | i5 | i5 |
| Operating System | Windows 7 | Windows Vista | Windows 7 | Windows 7 |

![Page 41: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/41.jpg)
Comparison of C-Ranker on the distributed approach with C-Ranker on an Apache Hadoop Framework (Cluster 1)

| PBMC data (KB) | C-Ranker execution time in hrs (Cluster 1 Hadoop) | Distributed approach for C-Ranker execution time in hrs |
| --- | --- | --- |
| 11221 | 6.5 | 3.56 |
| 12816 | 9.9 | 8.1 |
| 31422 | 10.2 | 8.25 |
| 48486 | 15.2 | 9.2 |

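
To make the comparison easier to read, the ratio of the two execution times can be computed per dataset. The sketch below uses only the values in the table; the helper name `speedup` is illustrative.

```python
# (PBMC size in KB, Hadoop Cluster 1 hours, distributed approach hours),
# copied from the table.
rows = [
    (11221, 6.5, 3.56),
    (12816, 9.9, 8.1),
    (31422, 10.2, 8.25),
    (48486, 15.2, 9.2),
]


def speedup(baseline_hours: float, new_hours: float) -> float:
    """How many times faster the new run is than the baseline run."""
    return baseline_hours / new_hours


for size_kb, hadoop_hours, distributed_hours in rows:
    ratio = speedup(hadoop_hours, distributed_hours)
    print(f"{size_kb} KB: distributed approach is {ratio:.2f}x faster")
```

A ratio above 1 means the distributed approach finished sooner than the Hadoop Cluster 1 run for that dataset.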
![Page 42: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/42.jpg)
Results
![Page 43: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/43.jpg)
Results for testData.xls (409 KB)
![Page 44: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/44.jpg)
Results for Pbmc_orbit_mips.xls (11221 KB)
![Page 45: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/45.jpg)
Results for Pbmc_orbit_nomips.xls (12816 KB)
![Page 46: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/46.jpg)
Results for Pbmc_velos_mips.xls (31422 KB)
![Page 47: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/47.jpg)
Results for Pbmc_velos_nomips.xls (48486 KB)
![Page 48: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/48.jpg)
Memory usage for testData.xls (409 KB)
![Page 49: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/49.jpg)
Memory Usage for Pbmc_orbit_mips.xls (11221 KB)
![Page 50: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/50.jpg)
Memory Usage for Pbmc_orbit_nomips.xls (12816 KB)
![Page 51: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/51.jpg)
Memory Usage for Pbmc_velos_mips.xls (31422 KB)
![Page 52: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/52.jpg)
Memory usage for Pbmc_velos_nomips.xls (48486 KB)
![Page 53: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/53.jpg)
Memory Usage
![Page 54: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/54.jpg)
Difference in Memory Usage
![Page 55: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/55.jpg)
Upgraded Hardware to Compare with Cluster 2 Hadoop

| Servers | Server_1 | Server_2 | Server_3 | Server_4 |
| --- | --- | --- | --- | --- |
| Memory | 12GB | 8GB | 12GB | 8GB |
| Processor | i7 | i5 | i7 | i7 |
| Operating System | Windows 8 | Windows 7 | Windows 7 | Windows 7 |

![Page 56: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/56.jpg)
Comparison of C-Ranker on the distributed approach with C-Ranker on an Apache Hadoop Framework (Cluster 2)

| PBMC data (KB) | Distributed approach for C-Ranker execution time in hrs (new results) | C-Ranker execution time in hrs (Cluster 2 Hadoop) |
| --- | --- | --- |
| 11221 | 1.7 | 1.3 |
| 12816 | 3.82 | 1.58 |
| 31422 | 4.1 | 3.4 |
| 48486 | 5.93 | 4.5 |

![Page 57: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/57.jpg)
Cost Calculation of Apache Hadoop Cluster 1 and Cluster 2

| PBMC data size (KB) | Cost of Hadoop Cluster 1 ($) | Cost of Hadoop Cluster 2 ($) |
| --- | --- | --- |
| 11221 | 3.4581 | 0.6916 |
| 12816 | 5.2668 | 0.8410 |
| 31422 | 5.4264 | 1.8088 |
| 48486 | 8.8064 | 2.3941 |

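
The slides do not state how the cost figures were derived. Assuming the cost scales with the Hadoop execution time (an assumption, not something the source confirms), dividing each cost by the corresponding Hadoop run time from the two comparison tables gives an implied hourly rate that can serve as a sanity check. The name `implied_hourly_rate` is illustrative.

```python
# (PBMC size in KB, Hadoop execution time in hrs, cost in $), combining the
# cost table with the Hadoop columns of the two comparison tables.
HADOOP_RUNS = {
    "Cluster 1": [(11221, 6.5, 3.4581), (12816, 9.9, 5.2668),
                  (31422, 10.2, 5.4264), (48486, 15.2, 8.8064)],
    "Cluster 2": [(11221, 1.3, 0.6916), (12816, 1.58, 0.8410),
                  (31422, 3.4, 1.8088), (48486, 4.5, 2.3941)],
}


def implied_hourly_rate(cost_usd: float, hours: float) -> float:
    """Cost per hour implied by one (cost, run time) pair."""
    return cost_usd / hours


for cluster, runs in HADOOP_RUNS.items():
    for size_kb, hours, cost_usd in runs:
        rate = implied_hourly_rate(cost_usd, hours)
        print(f"{cluster}, {size_kb} KB: ${rate:.3f} per hour")
```

If the assumption holds, the rows should give similar rates; a row that stands out may be worth re-checking against the original slide.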
![Page 58: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/58.jpg)
Conclusion and Future Scope
![Page 59: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/59.jpg)
Conclusion
• Reduces the execution time
• No additional hardware cost (no need for high-performance computing machines)
• No need to change the current structure of C-Ranker
![Page 60: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/60.jpg)
Conclusion (Cont.…)
• Better Resource Management. For example: Memory
• No need to change the implementation of CRanker
![Page 61: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/61.jpg)
Future Scope
• The same distributed approach can be used with Percolator and PeptideProphet to see how well they perform
• Additionally, one can use an ensemble method to combine the results of the three tools.
![Page 62: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/62.jpg)
Questions??
![Page 63: Distributed approach for Peptide Identification](https://reader036.vdocuments.us/reader036/viewer/2022062522/5883a0601a28ab2b568b7697/html5/thumbnails/63.jpg)
Thank you