Overcoming the Curse of Dimensionality in a Statistical Geometry Based
Computational Protein Mutagenesis
Majid Masso
Bioinformatics and Computational Biology
George Mason University, Manassas, Virginia, USA
BioDM Workshop, IEEE ICDM 2010
Delaunay Tessellation of Protein Structure
[Figure: residue labels D3, A22, S64, L6, F7, G62, C63, K4, R5; Aspartic Acid (Asp or D)]
Abstract every amino acid residue to a point
Atomic coordinates – Protein Data Bank (PDB)
center of mass (CM)
Delaunay tessellation: 3D “tiling” of space into non-overlapping, irregular tetrahedral simplices. Each simplex objectively identifies a quadruplet of nearest-neighbor amino acids at its vertices.
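The defining property of a Delaunay simplex (no other point lies inside its circumsphere) can be illustrated with a brute-force sketch. This is only a toy for a handful of points, not the tessellation code used in this work; the function names and the five example points are made up for illustration. For real structures a Qhull-based routine such as `scipy.spatial.Delaunay` would be the practical choice.

```python
from itertools import combinations

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def circumsphere(p, q, r, s):
    """Return (center, radius_squared) of the sphere through four 3D points,
    or None if the points are (nearly) coplanar."""
    # Subtracting |x - p|^2 = R^2 from the same equation for q, r, s gives
    # the linear system 2 (v - p) . x = |v|^2 - |p|^2 for v in (q, r, s).
    rows = [[2 * (v[i] - p[i]) for i in range(3)] for v in (q, r, s)]
    rhs = [sum(v[i] ** 2 - p[i] ** 2 for i in range(3)) for v in (q, r, s)]
    d = det3(rows)
    if abs(d) < 1e-12:
        return None
    center = []
    for col in range(3):  # Cramer's rule, one coordinate at a time
        m = [row[:] for row in rows]
        for k in range(3):
            m[k][col] = rhs[k]
        center.append(det3(m) / d)
    radius_sq = sum((center[i] - p[i]) ** 2 for i in range(3))
    return center, radius_sq

def delaunay_bruteforce(points, eps=1e-9):
    """Keep every tetrahedron whose circumsphere contains no other point."""
    simplices = []
    for quad in combinations(range(len(points)), 4):
        sphere = circumsphere(*(points[i] for i in quad))
        if sphere is None:
            continue
        center, radius_sq = sphere
        empty = all(
            sum((points[j][i] - center[i]) ** 2 for i in range(3)) >= radius_sq - eps
            for j in range(len(points)) if j not in quad
        )
        if empty:
            simplices.append(quad)
    return simplices

# Five hypothetical residue points (arbitrary units).
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (2, 2, 2)]
simplices = delaunay_bruteforce(points)
print(simplices)  # [(0, 1, 2, 3), (1, 2, 3, 4)]
```

Each returned tuple lists the four point indices at the vertices of one simplex, i.e. one nearest-neighbor quadruplet.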
Delaunay Tessellation of T4 Lysozyme
• Ribbon diagram (left) based on PDB file 3lzm (164 residues)
• Each amino acid residue represented as a CM point in 3D space
• Tessellation of the 164 CM points (right) performed using a 12 Å edge-length cutoff, so that simplices reflect “true” residue quadruplet interactions
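A minimal sketch of the edge-length cutoff, assuming it means that any simplex with at least one edge longer than 12 Å is discarded (the slide does not say whether the bound is inclusive; `<=` is used here). The function name and the toy coordinates are hypothetical.

```python
from itertools import combinations
from math import dist

def filter_simplices(points, simplices, cutoff=12.0):
    """Keep simplices whose six edges are all no longer than `cutoff` (in Å)."""
    kept = []
    for simplex in simplices:
        longest_edge = max(
            dist(points[i], points[j]) for i, j in combinations(simplex, 2)
        )
        if longest_edge <= cutoff:
            kept.append(simplex)
    return kept

# Toy CM coordinates in Å; the last point is 15 Å away from point 1.
points = [(0, 0, 0), (5, 0, 0), (0, 5, 0), (0, 0, 5), (20, 0, 0)]
simplices = [(0, 1, 2, 3), (1, 2, 3, 4)]
print(filter_simplices(points, simplices))  # [(0, 1, 2, 3)]
```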
Four-Body Statistical Potential
Training set: 1,375 diverse high-resolution x-ray structures from the PDB
Structures: 1bniA (barnase), 1jli (IL-3), 1efaB (lac repressor), … 1rtjA (HIV-1 RT)
Tessellate
Pool together all simplices from the tessellations, and compute observed frequencies of simplicial quadruplets
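The pooling step can be sketched as a simple count over all simplices. This sketch assumes quadruplets are treated as unordered (so A, C, D, G in any vertex order is one type), which gives 8,855 possible types for 20 amino acids. The toy sequences and simplices below are invented, not taken from the training set.

```python
from collections import Counter
from math import comb

def quadruplet_key(residues):
    """Order-independent label for the four residues at a simplex's vertices."""
    return "".join(sorted(residues))

def observed_frequencies(tessellations):
    """tessellations: iterable of (sequence, simplices) pairs, one per protein.
    Returns the relative frequency of each quadruplet type in the pooled set."""
    counts = Counter()
    for sequence, simplices in tessellations:
        for simplex in simplices:
            counts[quadruplet_key(sequence[i] for i in simplex)] += 1
    total = sum(counts.values())
    return {quad: n / total for quad, n in counts.items()}

# Number of unordered quadruplets drawn (with repetition) from 20 amino acids.
N_QUADRUPLET_TYPES = comb(20 + 4 - 1, 4)
print(N_QUADRUPLET_TYPES)  # 8855

# Two toy "proteins" with made-up simplices.
toy = [("ACDAG", [(0, 1, 2, 3), (1, 2, 3, 4)]), ("GADC", [(0, 1, 2, 3)])]
freqs = observed_frequencies(toy)
print(freqs)  # AACD occurs once, ACDG twice
```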
Computational Mutagenesis: Residual Profiles
[Figure panels: ribbon, CM trace, tessellation]
10 simplices share the N163 vertex and span 10 vertices in total; in the structure, N163 therefore has 9 neighbors
Environmental change (EC): nonzero components identify the mutated position 163 and its 9 neighbors
Computational Mutagenesis: Residual Profiles
• Nonzero ECs identify mutated position 163 and its 9 neighbors
• So, the 19 mutants (N163A, N163C, etc.) at 163 will have nonzero ECs at the same 10 positions only, but nonzero values will differ
• Each position has a different number of structural neighbors (min of 6, max of 19), which can be located throughout the sequence
• Number of neighbors and their locations (position numbers) are dependent on the position being mutated
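Why only the mutated position and its neighbors get nonzero ECs can be seen in a small sketch. It rests on assumptions not stated on these slides: each position's score is taken as the sum of the quadruplet scores of the simplices it belongs to, the mutation leaves the tessellation unchanged, and the EC vector is the mutant scores minus the native scores. `toy_score` is a placeholder with no biological meaning, standing in for the four-body potential; positions are 0-based.

```python
def toy_score(quad):
    """Placeholder quadruplet score (NOT the four-body potential)."""
    return sum(ord(c) for c in quad) / 100.0

def position_scores(sequence, simplices, score):
    """Assumed per-position score: sum over all simplices containing it."""
    scores = [0.0] * len(sequence)
    for simplex in simplices:
        s = score(sorted(sequence[i] for i in simplex))
        for i in simplex:
            scores[i] += s
    return scores

def residual_profile(sequence, simplices, position, new_residue, score=toy_score):
    """EC vector: mutant position scores minus native position scores."""
    mutant = sequence[:position] + new_residue + sequence[position + 1:]
    native_scores = position_scores(sequence, simplices, score)
    mutant_scores = position_scores(mutant, simplices, score)
    return [m - n for m, n in zip(mutant_scores, native_scores)]

# Toy 8-residue sequence with three made-up simplices.
sequence = "ACDEFGHK"
simplices = [(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7)]
profile = residual_profile(sequence, simplices, position=0, new_residue="W")
nonzero = [i for i, ec in enumerate(profile) if abs(ec) > 1e-12]
print(nonzero)  # [0, 1, 2, 3]
```

Position 0 appears in one simplex only, so the nonzero components are position 0 and the three residues sharing that simplex. Any other replacement at position 0 changes the values but not the set of nonzero positions.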
Experimental Data: Mutant T4 Lysozyme Activity
• 2015 mutants synthesized by introducing the same 13 amino acids as replacements at 163 positions (all except the first). Source: Rennell, D., Bouvier, S.E., Hardy, L.W. & Poteete, A.R. (1991) J. Mol. Biol. 222, 67-88.
• Each position yields either 12 or 13 mutants, depending on whether the native amino acid at that position is also one of the 13 replacements
• Mutant activity is based on plaque sizes on Petri dishes, in 2 classes: “unaffected” = large plaques (same as native T4 lysozyme); “affected” = medium, small, or no plaques
• 1377 “unaffected” and 638 “affected” T4 lysozyme mutants
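The counts above are mutually consistent, which a few lines of arithmetic confirm. The derived numbers (how many positions yield 12 versus 13 mutants, and the majority-class baseline) follow from the stated totals and are not quoted from the slides.

```python
positions = 163
replacements = 13
total_mutants = 2015

# If x positions have their native residue among the 13 replacements, they
# yield 12 mutants each and the others yield 13:
#   12*x + 13*(positions - x) = total_mutants
twelve_mutant_positions = positions * replacements - total_mutants
thirteen_mutant_positions = positions - twelve_mutant_positions
print(twelve_mutant_positions, thirteen_mutant_positions)  # 104 59

unaffected, affected = 1377, 638
# Accuracy of always predicting the larger class ("unaffected").
majority_baseline = unaffected / total_mutants
print(round(majority_baseline, 3))  # 0.683
```

The class imbalance (roughly 68% "unaffected") is one reason to report per-class measures in addition to overall accuracy.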
Computational Mutagenesis: Feature Vectors
• Approach 1 – represent mutants by 164D residual profile vectors; training set consists of all 2015 T4 lysozyme mutants
• Approach 2 (dimensionality reduction) – select and order the 6 closest neighbors to the mutated position; create 7D vector of nonzero EC scores for mutated position and 6 closest neighbors
• Approach 3 (subspace modeling) – segregate mutants by position number, consider each subset as a separate training set for classification, and combine the results; can be applied to 164D or 7D feature vectors
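Approaches 2 and 3 can be sketched as two small helpers. The sketch assumes that "closest" means Euclidean distance between CM points, that neighbors are ordered by increasing distance, and that neighbors are exactly the other positions with nonzero EC; the function names and toy data are hypothetical.

```python
from collections import defaultdict
from math import dist

def seven_d_vector(profile, coords, position, k=6):
    """Approach 2: EC of the mutated position, then the ECs of its k closest
    neighbors, ordered by increasing distance (assumed ordering)."""
    neighbors = [i for i, ec in enumerate(profile) if ec != 0.0 and i != position]
    neighbors.sort(key=lambda i: dist(coords[i], coords[position]))
    return [profile[position]] + [profile[i] for i in neighbors[:k]]

def segregate_by_position(mutants):
    """Approach 3: one training subset per mutated position.
    mutants: iterable of (position, feature_vector, label)."""
    subsets = defaultdict(list)
    for position, vector, label in mutants:
        subsets[position].append((vector, label))
    return dict(subsets)

# Toy profile: position 3 mutated, 7 neighbors (nonzero ECs at indices 0..7).
profile = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 0.0, 0.0]
coords = [(float(i), 0.0, 0.0) for i in range(10)]  # points on a line
vector = seven_d_vector(profile, coords, position=3)
print(vector)  # [4.0, 3.0, 5.0, 2.0, 6.0, 1.0, 7.0]

mutants = [(3, vector, "U"), (3, [0.5] * 7, "A"), (5, [0.1] * 7, "U")]
subsets = segregate_by_position(mutants)
print({pos: len(items) for pos, items in subsets.items()})  # {3: 2, 5: 1}
```

Since every position has at least 6 neighbors, the 7D vector is always fully populated; the farthest neighbor in the toy example (index 7) is the one left out.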
Supervised Classification
• Algorithms: decision tree (DT), neural network (NN), support vector machine (SVM), and random forest (RF)
• Testing: leave-one-out cross-validation (LOOCV)
• Evaluation of performance:
• Overall accuracy, or proportion of correct predictions: Q
• Sensitivity and precision for both classes: S(U), P(U), S(A), and P(A)
• Balanced error rate: BER
• Matthews correlation coefficient: MCC
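A minimal sketch of the testing protocol and the measures listed above, using the standard definitions (BER as the mean of the two per-class error rates). The 1-nearest-neighbor classifier is only a stand-in to keep the example self-contained; it is not one of the four algorithms used here, and the labels and vectors are toy data.

```python
from math import sqrt

def evaluate(true_labels, predicted_labels):
    """Q, S, P, BER, MCC for classes 'U' (unaffected) and 'A' (affected).
    Raises ZeroDivisionError if a class is absent or never predicted."""
    pairs = list(zip(true_labels, predicted_labels))
    uu = sum(t == "U" and p == "U" for t, p in pairs)
    ua = sum(t == "U" and p == "A" for t, p in pairs)
    au = sum(t == "A" and p == "U" for t, p in pairs)
    aa = sum(t == "A" and p == "A" for t, p in pairs)
    s_u, s_a = uu / (uu + ua), aa / (aa + au)   # sensitivities
    p_u, p_a = uu / (uu + au), aa / (aa + ua)   # precisions
    return {
        "Q": (uu + aa) / len(pairs),
        "S(U)": s_u, "P(U)": p_u, "S(A)": s_a, "P(A)": p_a,
        "BER": 1 - 0.5 * (s_u + s_a),
        "MCC": (uu * aa - ua * au)
               / sqrt((uu + ua) * (uu + au) * (aa + ua) * (aa + au)),
    }

def loocv(samples, fit_predict):
    """Leave-one-out: predict each sample from all the others.
    samples: list of (vector, label); fit_predict(train, vector) -> label."""
    return [fit_predict(samples[:i] + samples[i + 1:], vector)
            for i, (vector, _) in enumerate(samples)]

def nearest_neighbor(train, vector):
    """Stand-in classifier: label of the closest training vector."""
    closest = min(train, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], vector)))
    return closest[1]

metrics = evaluate("UUUUAA", "UUUAAU")
print(metrics["Q"], metrics["BER"], metrics["MCC"])  # Q = 4/6, BER = 0.375, MCC = 0.25

samples = [([0.0], "U"), ([0.1], "U"), ([0.2], "U"), ([1.0], "A"), ([1.1], "A")]
predicted = loocv(samples, nearest_neighbor)
print(predicted)  # ['U', 'U', 'U', 'A', 'A']
```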
Results
• With the full training set, 164D vectors outperform 7D vectors, because the 7D representation loses implicit structural information (i.e., the locations of the nonzeros in the 164D vector)
• Subspace modeling (SM) improves performance due to dramatic increase in S(A); 164D and 7D SM results are equal
• SM with 164D vectors amounts to dimensionality reduction that uses the entire neighborhood of mutated position (unlike 7D, which uses only the 6 closest neighbors)
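The last point can be made concrete: within one position's subset, every mutant has zeros in the same columns, so those columns carry no information and dropping them leaves one component per neighborhood member. The sketch below uses invented 8D toy vectors in place of 164D profiles, and the helper name is hypothetical.

```python
def drop_all_zero_columns(vectors):
    """Return (kept_column_indices, reduced_vectors), removing every column
    that is zero in all vectors of the subset."""
    n_columns = len(vectors[0])
    keep = [j for j in range(n_columns) if any(v[j] != 0.0 for v in vectors)]
    return keep, [[v[j] for j in keep] for v in vectors]

# Three toy mutants at the same position: nonzero ECs at columns 0..3 only.
subset = [
    [0.22, 0.22, 0.22, 0.22, 0.0, 0.0, 0.0, 0.0],
    [-0.10, -0.10, -0.10, -0.10, 0.0, 0.0, 0.0, 0.0],
    [0.05, 0.05, 0.05, 0.05, 0.0, 0.0, 0.0, 0.0],
]
keep, reduced = drop_all_zero_columns(subset)
print(keep)             # [0, 1, 2, 3]
print(len(reduced[0]))  # 4
```

A classifier trained on one such subset effectively sees only the kept columns, whose number varies by position (one per neighbor plus the mutated position itself).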
Conclusion and Future Directions
• Residual profile vectors provide a natural way to introduce subspace modeling and achieve improved performance
• Current work focused on inductive learning; a future project could apply transductive learning to the dataset
• Transduction allows us to also use vectors of all remaining mutants not classified experimentally – wet-lab collaborations can then validate our predictions
• These techniques could be applied to a similarly comprehensive experimental dataset: 4041 mutants of lac repressor protein
• Contact: [email protected] available at: http://binf.gmu.edu/mmasso