![Page 1: TRANSCRIPTION FACTOR BINDING POSITIONS ... - CAE Usershomepages.cae.wisc.edu/~ece539/project/f17/Hu_presentation.pdf · CHIP-SEQ DATA ChIP-sequencing is a method used to analyze protein](https://reader031.vdocuments.us/reader031/viewer/2022040910/5e832fca868a1b36326a807b/html5/thumbnails/1.jpg)
TRANSCRIPTION FACTOR BINDING POSITIONS PREDICTION WITH CNN
Bowen Hu
TRANSCRIPTION FACTORS (TF)
➤ Proteins that bind to specific DNA sequences.
➤ Key to regulating gene expression (turning genes on or off).
➤ Knowing where TFs bind helps us understand which genes are expressed in specific cell lines (skin tissue, brain tissue, …).
➤ Together with genotype and expression analysis, this takes us a step closer to understanding biological processes and disease states.
CHIP-SEQ DATA
➤ ChIP-sequencing is a method used to analyze protein interactions with DNA.
➤ Running ChIP-seq for every possible combination of TF and cell line is expensive and time-consuming.
➤ A method that accurately predicts whether a TF will bind to a given sequence is therefore needed.
DATA PREPARATION
➤ ChIP-seq data were downloaded from ENCODE (https://www.encodeproject.org), experiment ENCSR101FJT.
➤ The TF in the experiment is ZNF143-human, with a sample size of 21679.
➤ Data cleaning:
Remove missing values.
Generate negative samples.
Truncate each sequence to a fixed length of 60 so that it can be represented as an image.
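The cleaning steps above can be sketched in a few lines of Python. The slides do not say where each sequence was cut or how negative samples were generated, so the centre crop and the base-shuffling scheme below are assumptions chosen only for illustration, and the function names are hypothetical.

```python
import random

SEQ_LEN = 60  # fixed sequence length from the slide

def center_crop(seq, length=SEQ_LEN):
    """Return the central `length` bases of seq, or None if seq is too short."""
    if len(seq) < length:
        return None
    start = (len(seq) - length) // 2
    return seq[start:start + length]

def shuffled_negative(seq, rng):
    """Assumed negative-sampling scheme: permute the bases of a positive sequence."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def prepare(peaks, seed=0):
    """Drop unusable records, crop to SEQ_LEN, and pair each positive with a negative."""
    rng = random.Random(seed)
    samples = []
    for seq in peaks:
        # Treat empty or ambiguous (e.g. 'N') sequences as missing values.
        if not seq or set(seq.upper()) - set("ACGT"):
            continue
        cropped = center_crop(seq.upper())
        if cropped is None:
            continue
        samples.append((cropped, 1))                          # bound (positive)
        samples.append((shuffled_negative(cropped, rng), 0))  # assumed negative
    return samples
```

Shuffling keeps the base composition of each positive but destroys the motif order. It is one of several possible schemes and may itself produce the kind of misleading negatives mentioned in the discussion.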
DATA TRANSFORMATION
➤ One-hot encoding.
➤ DNA sequence “ACTA” will be expressed as:
➤ The first DNA sequence input, AAAGAATCCAGCTTAAATCGA, is shown on the next page to illustrate the CNN model.
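As a minimal sketch of one-hot encoding, the snippet below turns a DNA string into a 4 × L matrix of zeros and ones. The row order A, C, G, T and the rows-by-positions orientation are assumptions; the slides may use a different layout.

```python
ALPHABET = "ACGT"  # assumed row order; the slide does not state it

def one_hot(seq):
    """Return a 4 x len(seq) matrix (list of rows), one row per base in ALPHABET."""
    seq = seq.upper()
    return [[1 if base == letter else 0 for base in seq] for letter in ALPHABET]

for letter, row in zip(ALPHABET, one_hot("ACTA")):
    print(letter, row)
# A [1, 0, 0, 1]
# C [0, 1, 0, 0]
# G [0, 0, 0, 0]
# T [0, 0, 1, 0]
```

Each column contains exactly one 1, so a length-60 sequence becomes a 4 × 60 binary "image" that a convolution layer can scan.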
CONVOLUTIONAL NEURAL NETWORK
WHY CNN?
➤ Traditional methods for predicting TF binding positions are based on position weight matrices (PWMs), or motifs.
➤ A likelihood ratio test or score test is then used to make the decision.
➤ These methods do not use ChIP-seq information directly, only summary statistics.
➤ Other biological features also influence binding behavior.
➤ I would expect higher accuracy from a model built directly on ChIP-seq data (a CNN).
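For comparison, PWM-based scoring can be sketched as follows. This is a generic log-likelihood-ratio scan, not the specific baseline used in the project; the motif counts, the uniform background of 0.25, the pseudocount, and the threshold are all made-up illustrative values.

```python
import math

BASES = "ACGT"

def log_odds_pwm(counts, background=0.25, pseudocount=1.0):
    """Turn per-position base counts into log2(p / background) scores."""
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + pseudocount * len(BASES)
        pwm.append({b: math.log2((column[b] + pseudocount) / total / background)
                    for b in BASES})
    return pwm

def best_window_score(seq, pwm):
    """Slide the PWM along seq and return the highest log-likelihood-ratio score."""
    width = len(pwm)
    scores = [sum(pwm[i][seq[start + i]] for i in range(width))
              for start in range(len(seq) - width + 1)]
    return max(scores)

# Toy counts for a 3-base motif that favours "TAC" (made-up numbers).
counts = [
    {"A": 0, "C": 0, "G": 0, "T": 8},
    {"A": 8, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 8, "G": 0, "T": 0},
]
pwm = log_odds_pwm(counts)
threshold = 3.0  # hypothetical decision cut-off

print(best_window_score("GGTACGG", pwm) > threshold)  # True: contains the motif
print(best_window_score("GGGGGGG", pwm) > threshold)  # False: no motif match
```

The PWM treats each position independently and uses only the motif summary, which is the limitation the slide contrasts with a model trained on the ChIP-seq sequences themselves.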
CNN TRAINING
➤ Hyper-parameter selection.
Sequence length: 60;
Feature matrix (motif) length: 10;
Number of features: 600;
Window size of max pooling layer: 60;
Fully connected layer size: 50;
➤ Data separation:
70% for training, 20% for validation, 10% for testing.
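The 70/20/10 separation can be sketched as a shuffled split. The function name and the seed are hypothetical, and the figure 21679 is used only to show the resulting sizes; the actual total after adding negative samples is not given in the slides.

```python
import random

def split_70_20_10(samples, seed=0):
    """Shuffle samples and split them 70% / 20% / 10% (train / validation / test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 70 // 100  # integer arithmetic avoids float rounding
    n_val = n * 20 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, so no sample is dropped
    return train, val, test

train, val, test = split_70_20_10(range(21679))
print(len(train), len(val), len(test))  # 15175 4335 2169
```

Shuffling before splitting matters when positives and negatives are stored in blocks, since an unshuffled split could leave one class missing from the test set.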
CNN VISUALIZATION
[Figure panels: convolution layer; max pooling]
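What the convolution and max-pooling layers compute can be sketched in plain Python for a single filter. The width-3 kernel below is hand-written for illustration, whereas the model in the slides learns 600 kernels of length 10. Treating the pooling window of 60 as pooling over the whole sequence is an assumption, as is the use of ReLU and unpadded ("valid") convolution.

```python
ALPHABET = "ACGT"

def one_hot_columns(seq):
    """Encode seq as a list of 4-element columns, one per base (A, C, G, T order assumed)."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET] for base in seq]

def conv1d_relu(columns, kernel, bias=0.0):
    """Valid 1-D convolution of one kernel (width x 4) over the sequence, then ReLU."""
    width = len(kernel)
    out = []
    for start in range(len(columns) - width + 1):
        total = bias
        for i in range(width):
            for j in range(4):
                total += kernel[i][j] * columns[start + i][j]
        out.append(max(0.0, total))
    return out

def global_max_pool(activations):
    """Keep only the strongest response of the filter along the sequence."""
    return max(activations)

# Hypothetical kernel that responds to "TAC"; a trained model would learn its kernels.
kernel = [
    [0.0, 0.0, 0.0, 1.0],  # T
    [1.0, 0.0, 0.0, 0.0],  # A
    [0.0, 1.0, 0.0, 0.0],  # C
]
feature_map = conv1d_relu(one_hot_columns("GGTACGG"), kernel)
print(feature_map)                   # [0.0, 0.0, 3.0, 0.0, 0.0]
print(global_max_pool(feature_map))  # 3.0
```

Each kernel acts like a learned motif detector, and pooling reduces its responses to a single "is the motif present anywhere" value per filter. With a length-60 input and a length-10 kernel, unpadded convolution yields 51 positions per filter.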
RESULTS
➤ The CNN's prediction accuracy is 60%.
➤ Compared with the 90% accuracy of gkm-SVM, a currently prevailing method, this is not a desirable result.
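As a reminder of how such an accuracy figure is computed and put in context, the sketch below compares a classifier's accuracy with a majority-class baseline. The labels are made up and do not come from the project. The slides do not state the class balance, but if positives and negatives were equal in number, always guessing one class would already score 50%.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def majority_baseline(y_true):
    """Accuracy of always predicting the most common label."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)

# Made-up labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1, 1, 1]

print(accuracy(y_true, y_pred))   # 0.7
print(majority_baseline(y_true))  # 0.5
```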
DISCUSSION
➤ Possible reasons for the CNN's lower accuracy:
Misleading negative samples.
The CNN failed to capture non-linear features because it has too few layers.
Hyper-parameters can be improved.
Tissue information was not included.
➤ Improvement & future work:
Generate new negative samples.
Add more layers to the CNN.