Xinbin Dai, Ph.D.
Affymetrix Probeset Mapping and Medicago Genome Annotation (Mt4.0 RC1)
Agenda
• About Affymetrix Medicago GeneChip
• Mapping Algorithm and Tool
• Bioinformatics Resources for Medicago truncatula
Affymetrix GeneChip Probes
[Figure: an mRNA target transcript (5’ UTR, EXON-I, EXON-II, EXON-III, 3’ UTR) covered by a probe set of 11 probes. Each probe is a 25-mer (positions 1 to 25) and comes as a pair: a perfect match (PM) probe and a mismatch (MM) probe.]
Probeset Types
• id_at: Designates probe sets that uniquely recognize target transcripts.
• id_a_at: Designates probe sets that recognize alternative transcripts from the same gene.
• id_s_at: Designates probe sets with common probes among multiple transcripts from different genes.
• id_x_at: Designates probe sets where it was not possible to select either a unique probe set or a probe set with identical probes among multiple transcripts. Rules for cross-hybridization were dropped in order to design the _x probe sets. These probe sets share some probes identically with two or more sequences and, therefore, may cross-hybridize in an unpredictable manner.
Source: GeneChip® Expression Analysis Data Analysis Fundamentals.
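The suffix rules above can be turned into a small lookup. The following is a minimal sketch, not part of the original slides: the function name `probeset_type` and the returned labels are made up for illustration, and it assumes the type can be read from the ID suffix alone.

```python
def probeset_type(probeset_id: str) -> str:
    """Classify an Affymetrix probe set ID by its suffix (illustrative labels)."""
    # Every ID ends in "_at", so the longer suffixes must be tested first.
    if probeset_id.endswith("_a_at"):
        return "alternative"        # alternative transcripts from the same gene
    if probeset_id.endswith("_s_at"):
        return "shared"             # common probes among transcripts of different genes
    if probeset_id.endswith("_x_at"):
        return "cross-hybridizing"  # cross-hybridization rules were dropped
    if probeset_id.endswith("_at"):
        return "unique"             # uniquely recognizes its target transcript
    return "unknown"
```

For example, `probeset_type("Mtr.10146.1.S1_s_at")` falls into the shared category, while `probeset_type("Mtr.10097.1.S1_at")` is classified as unique.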
About Medicago GeneChip
Type (example)                                 Num of probe sets   Percent in the Mtr. set   Notes
Unique, e.g. Mtr.10097.1.S1_at                 44182               86.80                     Unique to one gene
Alternative (_a_), e.g. Mtr.10267.1.S1_a_at    116                 0.23                      Alternative probe sets to one gene
Shared (_s_), e.g. Mtr.10146.1.S1_s_at         4793                9.42                      Common to multiple genes
Others (_x_), e.g. Mtr.10093.1.S1_x_at         1809                3.55                      Other probe sets with complicated mapping
Total                                          50900               100
Reference sequences: an early version of IMGAG, the DFCI Gene Index, and alfalfa ESTs
Mapping Algorithm and Tool
• Gene transcripts were matched to corresponding Affymetrix probe sets using a position-weighted scoring index in which mismatches near the middle of a probe were most heavily penalized. The weights for probe positions 1 to 25 are:
[1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,2,2,2,2,2,1,1,1,1,1]
• A perfect match for a probe yields a score of 45 (the sum of the weights).
• Matches were declared when at least 8 of 11 probes had scores of 43 or higher.
Cutoff for matching: 43 x 8 = 344
Originated from Affymetrix, Inc.
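The scoring rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the AffyProbeMapping implementation: it assumes a probe's score is the sum of the weights at matching positions, that probe and transcript are compared on the same strand, and that the best ungapped window along the transcript is used. The function names are hypothetical.

```python
# Position weights for a 25-mer probe, as given on the slide.
WEIGHTS = [1] * 5 + [2] * 5 + [3] * 5 + [2] * 5 + [1] * 5
PROBE_LEN = 25
PROBE_CUTOFF = 43   # minimum score for a probe to count as matching
MIN_PROBES = 8      # at least 8 of the 11 probes must pass


def probe_score(probe: str, window: str) -> int:
    """Sum the weights of all positions where probe and window agree (assumed rule)."""
    if len(probe) != PROBE_LEN or len(window) != PROBE_LEN:
        raise ValueError("probe and window must both be 25 bases long")
    return sum(w for w, p, t in zip(WEIGHTS, probe, window) if p == t)


def best_probe_score(probe: str, transcript: str) -> int:
    """Slide the probe along the transcript and keep the best ungapped score."""
    best = 0
    for i in range(len(transcript) - PROBE_LEN + 1):
        best = max(best, probe_score(probe, transcript[i:i + PROBE_LEN]))
    return best


def probeset_matches(probes: list[str], transcript: str) -> bool:
    """Declare a match when enough probes reach the per-probe cutoff."""
    passing = sum(1 for p in probes if best_probe_score(p, transcript) >= PROBE_CUTOFF)
    return passing >= MIN_PROBES
```

Under these assumptions, a single mismatch in one of the five central positions costs 3 points and drops a probe to 42, below the cutoff, whereas a single mismatch elsewhere (44 or 43 points) still passes.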
AffyProbeMapping: An Online Affymetrix Probeset Mapping Tool
http://bioinfo3.noble.org/affymap/
Input sequence:
• Transcript
• cDNA
• EST/Unigene
• CDS
[Figure: output of AffyProbeMapping]
AffyProbeMapping also supports Affymetrix chips for other species:
Lotus japonicus, Arabidopsis thaliana, rice, soybean, maize, Populus, cotton, and tomato
Bioinformatics & Data Resources for Medicago truncatula
Data Sources:
• Mt3.5v4 (2011, version for the Nature paper): optical mapping; 44,124 BAC-based gene loci + 18,264 Illumina (nr) gene models
• Mt3.5v5 (2012, minor changes): 45,859 BAC-based gene loci + 18,264 Illumina gene models
• Mt4 RC1 (2013, PAG 2013 conference): Illumina contigs anchored onto pseudochromosomes; 84,993 gene loci (BAC + Illumina). Chromosome sequences are frozen; some gene models might be removed.
• DFCI Gene Index Release 11: 294k ESTs/ETs, 68,814 Unigenes
Statistics on Mt3.5v4 vs. Probesets Mapping Results using AffyProbeMapping
Num of cDNAs   Matching probe sets   Percent
37,385         0                     59.92
18,354         1                     29.42
6,649          >=2                   10.66
62,388         Total                 100
Statistics on Mt4 RC1 vs. Probesets Mapping Results using AffyProbeMapping
Num of cDNAs   Matching probe sets   Percent
58,660         0                     69.02
20,257         1                     23.83
6,076          >=2                   7.15
84,993         Total                 100
Statistics on Gene Index R11 vs. Probesets Mapping Results using AffyProbeMapping
Num of cDNAs   Matching probe sets   Percent
29,722         0                     43.2
32,848         1                     47.7
6,244          >=2                   9.1
68,814         Total                 100
Mapping between the Medicago genome and the Affymetrix Medicago chip
http://bioinfo3.noble.org/affymap/Dataset.gy
Bioinformatics Tools For Medicago
• Sequence Search and Annotation
– DOBLAST (http://bioinfo3.noble.org/doblast/): a BLAST search tool accelerated by parallel computing
Features:
o Preloaded with many Medicago data resources
o Capable of handling large datasets
o The “Tab-delimited bioparser output format” works well with Excel
Bioinformatics Tools For Medicago
• Sequence Download and Cut by Coordinates
– “Sequence Download” page of DOBLAST: batch download sequences or cut sequences by coordinates
o Preloaded with many Medicago data resources
o Batch download
o Get a fragment of a sequence by coordinates
DOBLAST sequence download page
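Cutting a fragment by coordinates can be illustrated with a short Python sketch. It is not DOBLAST's code: the function name `cut_sequence` is hypothetical, and the 1-based, inclusive coordinate convention is an assumption (it is common in genome annotation, but the slides do not say which convention the tool uses).

```python
def cut_sequence(seq: str, start: int, end: int) -> str:
    """Return the fragment of seq from start to end.

    Assumption: coordinates are 1-based and inclusive on both ends.
    """
    if start < 1 or end > len(seq) or start > end:
        raise ValueError("coordinates fall outside the sequence or are reversed")
    # Python slices are 0-based and end-exclusive, so shift the start by one.
    return seq[start - 1:end]
```

With this convention, `cut_sequence(seq, 1, len(seq))` returns the whole sequence, and the fragment length is always `end - start + 1`.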
Bioinformatics Tools For Medicago
• LegumeIP: An Integrative Platform to Study Gene Function and Genome Evolution in Legumes
• Features:
– Synteny analysis among model legumes
– Phylogenetic analysis for gene families
– Gene-to-gene association analysis
– Gbrowser
o http://plantgrn.noble.org/LegumeIP/
o We are updating to Version 2
LegumeIP: Synteny analysis for Medicago genome
LegumeIP: Phylogenetic analysis for Medicago gene family
LegumeIP: Gene association network analysis for Medicago gene