Development of a Chicken Unigene Database
Project No. 9
Mentors: Dr. Wellington Martins - Dr. Joan Burnside
Animal Science Dept., University of Delaware
Jianshan Tang, Ruoming Jin
Department of CIS
University of Delaware
Lilian Lacoste
DBI - French National School of Aeronautics and Space
Results
Phrap clustering result:
17,090 ESTs -> Phrap -> 2,815 contigs + 6,390 singlets = 9,205 clusters
Second clustering method: using BLAST output
[Diagram: Contig 1 and Contig 2 each yield a BLAST output (BLAST output 1, BLAST output 2). The outputs are filtered and parsed, then compared with a similarity function to fill a similarity matrix.]
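The slides do not say which similarity function was used. The following is a minimal sketch under two assumptions: each contig's parsed BLAST output is reduced to a set of hit identifiers, and Jaccard overlap stands in for the similarity function. The function names and the example identifiers are hypothetical.

```python
from typing import Dict, List, Set


def jaccard(hits_a: Set[str], hits_b: Set[str]) -> float:
    """Stand-in similarity function (an assumption): shared hits / all hits."""
    union = hits_a | hits_b
    if not union:
        return 0.0
    return len(hits_a & hits_b) / len(union)


def similarity_matrix(hits: Dict[str, Set[str]]) -> List[List[float]]:
    """Compare every pair of contigs and fill a symmetric matrix.

    Rows and columns follow the sorted order of the contig names.
    """
    names = sorted(hits)
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0  # a contig is treated as fully similar to itself
        for j in range(i + 1, n):
            score = jaccard(hits[names[i]], hits[names[j]])
            matrix[i][j] = matrix[j][i] = score
    return matrix


# Hypothetical parsed BLAST hits, keyed by contig name (illustrative only).
example_hits = {
    "contig_a": {"hit_1", "hit_2", "hit_3"},
    "contig_b": {"hit_1", "hit_2", "hit_4"},
    "contig_c": {"hit_9"},
}
example_matrix = similarity_matrix(example_hits)
```

Here `contig_a` and `contig_b` share two of four distinct hits, so their entry is 0.5, while `contig_c` shares nothing with either.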
What's gbc?
Graph Based Clustering
- Clustering: the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
- Graph: the relations among the data can be expressed as a graph. If two nodes are related, an edge connects them.
Working in bioinformatics
- Protein sequence clustering
- EST clustering
- Many other applications
Objective of "gbc"
- Support different input formats
- Efficiently support clustering of very large sparse graphs
- Be flexible for the user
How to use gbc
Output
- Cluster number, and all the nodes belonging to the cluster
Clique clustering
- A clique is a completely connected subgraph
- Each maximal clique in the graph becomes a cluster
- Clusters may overlap
- Generally produces small but very tight clusters
Single-link clustering
- A maximal connected subgraph becomes a cluster
- Produces larger but weaker clusters
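The two definitions can be illustrated on a small graph. This is a sketch, not gbc's code: the graph representation (a dict of adjacency sets) and the function names are assumptions, and the clique search is the basic Bron-Kerbosch recursion without pivoting.

```python
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]  # undirected graph as adjacency sets (assumed layout)


def single_link_clusters(graph: Graph) -> List[Set[str]]:
    """Each maximal connected subgraph (connected component) becomes a cluster."""
    seen: Set[str] = set()
    clusters: List[Set[str]] = []
    for start in sorted(graph):
        if start in seen:
            continue
        component: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


def clique_clusters(graph: Graph) -> List[Set[str]]:
    """Each maximal clique becomes a cluster (basic Bron-Kerbosch)."""
    cliques: List[Set[str]] = []

    def expand(r: Set[str], p: Set[str], x: Set[str]) -> None:
        # r: current clique, p: candidates to extend it, x: already-explored nodes
        if not p and not x:
            if r:
                cliques.append(r)
            return
        for v in sorted(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


# Triangle a-b-c, plus edge c-d, plus isolated node e.
example = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}
single = single_link_clusters(example)
cliques = clique_clusters(example)
```

On this graph, single-link yields two clusters, `{a, b, c, d}` and `{e}`, while clique clustering yields `{a, b, c}`, `{c, d}`, and `{e}`. Node `c` appears in two clique clusters, which shows the overlap mentioned above.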
A little about the implementation
Two clustering algorithms
- Single-link
- Clique
Graph classes
- Efficiently support dense and sparse graphs
- Provide the same interface, so the clustering code needs no modification
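One way to get this separation is an abstract graph interface with a dense and a sparse implementation behind it. The sketch below is illustrative only: the class and method names (`Graph`, `DenseGraph`, `SparseGraph`, `add_edge`, `neighbors`) are hypothetical and are not taken from gbc.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Set


class Graph(ABC):
    """Common interface (hypothetical names) used by the clustering code."""

    def __init__(self, num_nodes: int) -> None:
        self.num_nodes = num_nodes

    @abstractmethod
    def add_edge(self, u: int, v: int) -> None:
        """Add an undirected edge between nodes u and v."""

    @abstractmethod
    def neighbors(self, u: int) -> Iterable[int]:
        """Return the nodes adjacent to u."""


class DenseGraph(Graph):
    """Adjacency matrix: memory grows with num_nodes squared."""

    def __init__(self, num_nodes: int) -> None:
        super().__init__(num_nodes)
        self._adj = [[False] * num_nodes for _ in range(num_nodes)]

    def add_edge(self, u: int, v: int) -> None:
        self._adj[u][v] = self._adj[v][u] = True

    def neighbors(self, u: int) -> Iterable[int]:
        return [v for v, linked in enumerate(self._adj[u]) if linked]


class SparseGraph(Graph):
    """Adjacency sets: memory grows with the number of edges."""

    def __init__(self, num_nodes: int) -> None:
        super().__init__(num_nodes)
        self._adj: Dict[int, Set[int]] = {}

    def add_edge(self, u: int, v: int) -> None:
        self._adj.setdefault(u, set()).add(v)
        self._adj.setdefault(v, set()).add(u)

    def neighbors(self, u: int) -> Iterable[int]:
        return self._adj.get(u, set())


def single_link(graph: Graph) -> List[Set[int]]:
    """Clustering code written once, against the interface only."""
    seen: Set[int] = set()
    clusters: List[Set[int]] = []
    for start in range(graph.num_nodes):
        if start in seen:
            continue
        component: Set[int] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(v for v in graph.neighbors(node) if v not in component)
        seen |= component
        clusters.append(component)
    return clusters


edges = [(0, 1), (1, 2), (3, 4)]
dense, sparse = DenseGraph(6), SparseGraph(6)
for u, v in edges:
    dense.add_edge(u, v)
    sparse.add_edge(u, v)
```

Calling `single_link(dense)` and `single_link(sparse)` runs the same function on both representations. Only the storage cost differs.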
Analysis program
[Screenshot of the analysis program with labeled controls: Reset BLAST output; Change matrix threshold; Reset semantics; Run analysis; New contig set; Number of contigs; Comparison algorithm; Clustering algorithm; Results output; Analysis tools; Process log output]
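The "Change matrix threshold" control suggests that similarity scores are cut at a threshold before clustering, although the slides do not describe the mechanism. A minimal sketch, assuming an edge is kept whenever a pairwise score is at or above the threshold (the function name and the example values are hypothetical):

```python
from typing import Dict, List, Set


def threshold_graph(
    names: List[str], matrix: List[List[float]], threshold: float
) -> Dict[str, Set[str]]:
    """Turn a symmetric similarity matrix into an undirected graph.

    Assumption: two contigs are linked when their score is >= threshold.
    """
    graph: Dict[str, Set[str]] = {name: set() for name in names}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if matrix[i][j] >= threshold:
                graph[names[i]].add(names[j])
                graph[names[j]].add(names[i])
    return graph


# Illustrative scores only, not results from the project.
names = ["c1", "c2", "c3"]
matrix = [
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.5],
    [0.3, 0.5, 1.0],
]
loose = threshold_graph(names, matrix, 0.4)
strict = threshold_graph(names, matrix, 0.6)
```

With the lower threshold, `c2` links to both `c1` and `c3`. Raising the threshold to 0.6 drops the `c2`-`c3` edge and leaves `c3` isolated, so the choice of threshold directly changes the clusters that any graph clustering step would find.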
Analysis tools: contig information
Display the BLAST output:
- sequence references
- sequence annotations
- percentage of matching base pairs
Display the list of contigs, sorted according to their best matching percentage in the BLAST output
Analysis tool: EST selector
Display:
- frequency vs. length (in ESTs) of contigs
- list of ESTs in a contig
Allows the user to select the best representative EST according to length and tissue type
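The slides do not give the selection rule. As a sketch under stated assumptions, the selector could prefer ESTs from a requested tissue when any exist and then take the longest one. The `Est` record, its fields, and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Est:
    """Hypothetical EST record; field names are assumptions."""
    name: str
    length: int
    tissue: str


def best_representative(ests: List[Est], preferred_tissue: Optional[str] = None) -> Est:
    """Pick the longest EST, restricted to the preferred tissue when possible."""
    if not ests:
        raise ValueError("contig has no ESTs")
    matching = [e for e in ests if e.tissue == preferred_tissue]
    pool = matching or ests  # fall back to all ESTs if the tissue is absent
    return max(pool, key=lambda e: e.length)


# Illustrative contig contents only.
contig_ests = [
    Est("est_1", 420, "liver"),
    Est("est_2", 610, "spleen"),
    Est("est_3", 550, "liver"),
]
longest = best_representative(contig_ests)
liver_pick = best_representative(contig_ests, preferred_tissue="liver")
```

Without a tissue preference the longest EST (`est_2`) is chosen. With `preferred_tissue="liver"` the pick becomes `est_3`, the longest liver EST.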
First results
On a set of 400 contigs representing 1000 ESTs
Contig number: 79
Contig size: 743
Best matching fraction: 0.43587786259541983
gb|AF178529.1|AF178529 Gallus gallus Rad54b (RAD54B) mRNA, compl... 571 e-160
gb|BC001965.1|BC001965 Homo sapiens, RAD54, S. cerevisiae, homol... 143 2e-31
ref|XM_005161.3| Homo sapiens RAD54, S. cerevisiae, homolog of, ... 143 2e-31
gb|AF112481.1|AF112481 Homo sapiens RAD54B protein (RAD54B) mRNA... 143 2e-31
ref|NM_012415.1| Homo sapiens RAD54, S. cerevisiae, homolog of, ... 143 2e-31
emb|AL133578.1|HSM801429 Homo sapiens mRNA; cDNA DKFZp434J1672 (... 143 2e-31
dbj|AP003534.1|AP003534 Homo sapiens genomic DNA, chromosome 8q2... 76 3e-11
gb|AC009623.6|AC009623 Homo sapiens chromosome 8, clone RP11-219... 40 1.7

Contig number: 133
Contig size: 740
Best matching fraction: 0.9413109756097561
gb|AF178529.1|AF178529 Gallus gallus Rad54b (RAD54B) mRNA, compl... 1235 0.0
gb|BC001965.1|BC001965 Homo sapiens, RAD54, S. cerevisiae, homol... 184 5e-44
ref|XM_005161.3| Homo sapiens RAD54, S. cerevisiae, homolog of, ... 184 5e-44
gb|AF112481.1|AF112481 Homo sapiens RAD54B protein (RAD54B) mRNA... 184 5e-44
ref|NM_012415.1| Homo sapiens RAD54, S. cerevisiae, homolog of, ... 184 5e-44
emb|AL133578.1|HSM801429 Homo sapiens mRNA; cDNA DKFZp434J1672 (... 184 5e-44
dbj|AP003534.1|AP003534 Homo sapiens genomic DNA, chromosome 8q2... 76 3e-11
gb|AC084633.1|CBRG45G04 Caenorhabditis briggsae cosmid G45G04, c... 44 0.11
dbj|AB018110.1|AB018110 Arabidopsis thaliana genomic DNA, chromo... 44 0.11