Data Types in Bioinformatics
• Patient diagnostics
  – Karyotypes
  – Fluorescent In Situ Hybridization
  – Polymorphisms
• Microarrays, from pictures to interpretation
• Genetic sequence, from raw trace files to base-calls to protein
• Sample annotation
Family History
• Pedigree information will increasingly need to be stored
• Companies like Progeny offer client-server pedigree input, query and storage
• A pedigree can span multiple institutions and multiple consents
Karyotypes
• Karyotypes are used for diagnosing gross chromosomal abnormalities
• Typically not digitized
• Text reports saved
Fluorescent In Situ Hybridization
• Test performed to look for specific known areas of chromosomes
• Dyes are used to light up the pattern, if present
• Typically, images may not be saved
• Text report of diagnosis saved
Polymorphism
• Essentially, any arbitrary short region of a DNA sequence can be sequenced, costing on the order of $40
• Saving the raw trace files is necessary for quality analysis
• One human’s raw trace files (images) will occupy 300 terabytes
• A single base pair can be sequenced for $0.10 to $0.50, once the infrastructure is established
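The figures on this slide can be combined in a back-of-the-envelope calculation. The sketch below assumes a human genome of roughly 3 billion base pairs (a figure not stated on the slide), a single sequencing pass with no redundancy, and decimal terabytes; the constant and function names are illustrative only.

```python
# Rough sequencing cost and trace storage, using the slide's figures.
# ASSUMPTION (not on the slide): a human genome is ~3 billion base pairs.
GENOME_BP = 3_000_000_000
COST_PER_BP_CENTS = (10, 50)            # slide: $0.10 to $0.50 per base pair
TRACE_BYTES_PER_GENOME = 300 * 10**12   # slide: 300 terabytes (decimal TB assumed)


def genome_cost_range(genome_bp=GENOME_BP, cents_per_bp=COST_PER_BP_CENTS):
    """Return (low, high) cost in whole US dollars for one pass over the genome."""
    low_cents, high_cents = cents_per_bp
    # Integer cents avoid floating-point rounding in the multiplication.
    return genome_bp * low_cents // 100, genome_bp * high_cents // 100


def trace_bytes_per_bp(total_bytes=TRACE_BYTES_PER_GENOME, genome_bp=GENOME_BP):
    """Average raw-trace storage per base pair, in bytes."""
    return total_bytes / genome_bp


if __name__ == "__main__":
    low, high = genome_cost_range()
    print(f"One-pass cost: ${low:,} to ${high:,}")
    print(f"Trace storage per base pair: {trace_bytes_per_bp():,.0f} bytes")
```

Under these assumptions the raw traces cost about five orders of magnitude more storage than the one-letter base call they produce, which is why the base calls are kept separately from the traces.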
Genetic Sequence
• Though the trace files are large, the readings take up much less space
• FASTA format: simple text file format with base calls or amino acids
• Lowest common denominator between proprietary systems
• The entire genome can be downloaded in FASTA format
>TC30326 s1 TC63997 TC16407 TC21735 TC23192 TC30327 TC50687 TC59470
GAGCCTCTGGGTCCCGTCTAGGTACACTTTCTGCATTTCGAGCCCGGGCAGGTGAGGTGCGACAGGTAAATTTAACACAATGGATTTCTCCAAGCTACCCAAAATCCGAGATGAGGATAAAGAAAGTACATTTGGTTATGTGCATGGAGTCTCAGGGCCTGTGGTTACAGCCTGTGACATGGCGGGCGCTGCCATGTACGAGCTGGTGAGAGTGGGGCACAGCGAGCTGGTTGGAGAAATTATTCGATTGGAAGGTGACATGGCCACCATTCAGGTGTATGAAGAAACTTCTGGTGTCTCTGTTGGAGACCCCGTACTCCGCACTGGTAAACCTCTCTCGGTCGAGCTGGGTCCCGGGATTATGGGAGCCATTTTTGATGGTATACAGAGACCTCTGTCGGATATCAGCAGTCAGACCCAAAGTATCTACATCCCCAGAGGAGTCAATGTGTCTGCTCTCAGCAGAGATATCAAATGGGAGTTTATACCCAGCAAAAACCTACGGGTTGGTAGTCATATCACTGGTGGAGACATTTATGGGATTGTCAATGAGAACTCCCTCATCAAACACAAAATCATGTTGCCCCCACGTAACAGAGGAAGCGTGACTTACATCGCGCCGCCTGGGAATTATGATGCATCCGATGTCGTCCTGGAGCTTGAGTTTGAAGGTGTGAAGGAGAAGTTCAGCATGGTCCAAGTGTGGCCTGTGCGGCAGGT
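Because FASTA is plain text (a header line starting with ">" followed by one or more sequence lines), it can be read with a few lines of code. The following is a minimal sketch, not a full-featured parser: `parse_fasta` is a hypothetical helper name, and the example record is a shortened version of the one shown above, with the sequence wrapped across two lines.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record's header (without the leading '>') to its sequence.

    Sequence lines belonging to one record are concatenated; blank lines
    and any text before the first header are ignored.
    """
    records: Dict[str, str] = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Flush the final record.
    if header is not None:
        records[header] = "".join(chunks)
    return records


if __name__ == "__main__":
    # Shortened illustration of the record above (header and sequence truncated).
    example = """>TC30326 s1 TC63997
GAGCCTCTGGGTCCCGTCTAGG
TACACTTTCTGCATTTCGAGCC
"""
    for name, seq in parse_fasta(example.splitlines()).items():
        print(name, len(seq))
```

For production work an established library parser is usually a better choice; the sketch only illustrates how little structure the format imposes.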
Microarrays
• Raw TIFF images from a single chip can take 10-100 MB
• File of expression measurements is 0.5-1 MB
• MIAME: Minimum Information about a Microarray Experiment
• MGED: Microarray Gene Expression Database
• Affymetrix microarrays are made 40 chips per wafer
• A single wafer has 60 million probes; wafer imaging 5-10 TB
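The per-chip and per-wafer numbers on this slide can be related with simple arithmetic. The sketch below assumes probes are spread evenly across the 40 chips of a wafer (an assumption, not stated on the slide) and scales the per-chip file sizes to a wafer's worth of chips. It does not attempt to reproduce the slide's 5-10 TB wafer-imaging figure, and all names are illustrative.

```python
# Figures taken from the slide.
CHIPS_PER_WAFER = 40
PROBES_PER_WAFER = 60_000_000
TIFF_MB_PER_CHIP = (10, 100)          # raw image, low and high estimate
EXPRESSION_MB_PER_CHIP = (0.5, 1.0)   # derived expression measurements


def probes_per_chip(probes=PROBES_PER_WAFER, chips=CHIPS_PER_WAFER):
    """Probes on one chip, ASSUMING an even split across the wafer."""
    return probes // chips


def per_wafer_mb(per_chip_range, chips=CHIPS_PER_WAFER):
    """Scale a (low, high) per-chip size in MB to all chips from one wafer."""
    low, high = per_chip_range
    return low * chips, high * chips


if __name__ == "__main__":
    print("Probes per chip:", probes_per_chip())
    print("Raw TIFF per wafer of chips (MB):", per_wafer_mb(TIFF_MB_PER_CHIP))
    print("Expression files per wafer of chips (MB):",
          per_wafer_mb(EXPRESSION_MB_PER_CHIP))
```

The gap between the raw image size and the expression file size mirrors the trace-file versus base-call gap in sequencing: the interpreted measurements are small, while the raw evidence is what drives storage.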
Sample Annotations
• The least common denominator
• How to describe the context of the sample one is measuring?
• Equivalent to the medical records problem
IT Challenges
• High bandwidth data collection
  – We are used to physiological measurements with high sample rates, but these are not saved
  – Our challenge is higher density microarrays: 10-100 MB each
• Data storage
  – We are used to 15% of the US population getting imaging, which equals 200 million multi-GB images
  – Our challenge is raw sequencing trace files: one human = 300 terabytes
IT Challenges
• Measurement Noise
  – We are used to artifacts in physiological measures
  – Our challenge is poor expression measurement reproducibility
• Data Models
  – We are used to lack of standards in medical records
    • HL7, HIPAA
  – Our challenge is too many standards in bioinformatics
    • Gene Expression Markup Language (GEML)
    • Gene Expression Omnibus (GEO)
    • Microarray Markup Language (MAML)
  – Medical record as sample annotation
Common Challenges
• Comparing new signals to old
Common Challenges
• Continued development of controlled vocabularies
• Knowledge management
HL7
Common Challenges
• Security
HL7
• Privacy
• Ethics
American Medical Informatics Association
www.amia.org
Bio+medical Informatics: One Discipline
November 9-13, 2002
San Antonio, Texas
Bioinformatics and Integrative Genomics
big.chip.org
NIH Funded
New PhD training program in bioinformatics for quantitative individuals
Includes training in wet- and dry-biology, clinical medicine
First class Fall 2002
Microarrays for an Integrative Genomics
• Upcoming book in press at MIT Press
• Available Spring 2002
Collaborators and Support
• Collaborations
  – Scott Weiss / Channing Laboratory: NHLBI Program of Genomic Applications, Nurses Health Study, Physicians Health Study, Normative Aging Study
  – Seigo Izumo / Beth Israel: NHLBI Program of Genomic Applications, Framingham Heart Study
  – Jeff Drazen / Brigham and Women’s: NIGMS Pharmacogenetics
  – David Rowitch / Dana Farber: NINDS Innovative Technologies
  – Morris White / Joslin Diabetes Center, Howard Hughes Medical Institute
  – Tovia Libermann / Beth Israel: NIDDK Biotechnology Center
  – Terry Strom / Beth Israel: NIAID Immune Tolerance Network
  – Louis Kunkel / Children’s Hospital: Muscular Dystrophy
  – C. Ron Kahn / Joslin Diabetes Ctr: Diabetes Genomic Anatomy Project
  – Mary Elizabeth Patti / Joslin: Diabetes
  – Andrea Dunaif / Brigham and Women’s Hospital: Polycystic Ovarian Syndrome
• Support
  – NIH: NLM, NINDS, NHLBI, NIDDK, NIAID
  – Lawson Wilkins NovoNordisk Award
  – Merck / MIT Fellowship
  – Genentech Foundation Fellowship
  – Endocrine Fellow Foundation
Children’s Hospital Informatics Program
Bioinformatics
www.chip.org
• Isaac Kohane, Director
• Ling Bao
• Atul Butte
• Sangeeta Barnabas English
• Ade Dosunma
• Steven Greenberg
• Aaron Homer
• Janet Karlix
• Alvin Kho
• Ju Han Kim
• Winston Kuo
• Kyungjoon Lee
• Voichita Marinescu
• Ashish Nimgaonkar
• Peter Park
• Marco Ramoni
• Alberto Riva
• Yao Sun
• Zoltan Szallagi
• Alex Turchin
• Eric Tsung
• Mark Whipple
• Maneesh Yadav