
Page 1: Data Types in Bioinformatics - University of Hawaiistrev/ICS614/materials/Bioinformatics...• Maneesh Yadav Title Module Final Author Atul Butte Created Date 2/14/2002 6:29:40 PM

Data Types in Bioinformatics
• Patient diagnostics
  – Karyotypes
  – Fluorescent In Situ Hybridization
  – Polymorphisms
• Microarrays, from pictures to interpretation
• Genetic sequence, from raw trace files to base-calls to protein
• Sample annotation


Family History
• Pedigree information will increasingly need to be stored
• Companies like Progeny offer client-server pedigree input, query and storage
• Pedigree can span multiple institutions, multiple consents
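
One way to picture a pedigree that spans institutions and consents is as a set of linked records. The sketch below is only an illustration: the `PedigreeMember` fields, the `ancestors` helper, and the hospital and study names are all hypothetical and do not describe Progeny's or any other product's schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class PedigreeMember:
    """One person in a pedigree (hypothetical schema, for illustration only)."""
    member_id: str
    institution: str                                  # where this person's record is held
    consents: Set[str] = field(default_factory=set)   # consent forms on file
    mother_id: Optional[str] = None
    father_id: Optional[str] = None


def ancestors(pedigree: Dict[str, PedigreeMember], member_id: str) -> List[str]:
    """Return the IDs of all recorded ancestors, nearest generation first."""
    found: List[str] = []
    queue = [member_id]
    while queue:
        person = pedigree[queue.pop(0)]
        for parent_id in (person.mother_id, person.father_id):
            if parent_id is not None and parent_id in pedigree and parent_id not in found:
                found.append(parent_id)
                queue.append(parent_id)
    return found


# A made-up three-generation family held at two made-up institutions.
pedigree = {
    m.member_id: m
    for m in [
        PedigreeMember("grandmother", "Hospital A", {"study-1"}),
        PedigreeMember("mother", "Hospital A", {"study-1"}, mother_id="grandmother"),
        PedigreeMember("father", "Hospital B", {"study-2"}),
        PedigreeMember("child", "Hospital B", {"study-1", "study-2"},
                       mother_id="mother", father_id="father"),
    ]
}

lineage = ancestors(pedigree, "child")
institutions = {pedigree[m].institution for m in ["child"] + lineage}
print(lineage, sorted(institutions))
```

Even in this toy family, answering a question about one child touches records held at two institutions under different consents, which is the storage and query problem the slide points to.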


Karyotypes
• Karyotypes used for diagnosing gross chromosomal abnormalities
• Typically not digitized
• Text reports saved


Fluorescent In Situ Hybridization
• Test performed looking for specific known areas of chromosomes
• Dyes are used to light up pattern, if present
• Typically, images may not be saved
• Text report of diagnosis saved


Polymorphism
• Essentially, any arbitrary short region of a DNA sequence can be sequenced, costing on the order of $40
• Saving the raw trace files is necessary for quality analysis
• One human’s raw trace files (images) will occupy 300 terabytes
• A single base pair can be sequenced for $0.10 to $0.50, once the infrastructure is established
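
The figures on this slide can be turned into rough per-base numbers. The sketch below is back-of-the-envelope arithmetic only. It assumes a genome of roughly 3 billion base pairs and decimal terabytes, neither of which is stated on the slide, and the 400-base region is a hypothetical length chosen for illustration.

```python
# Figures quoted on the slide.
TRACE_BYTES_PER_HUMAN = 300 * 10**12   # 300 terabytes (decimal terabytes assumed)
COST_PER_BASE_CENTS = (10, 50)         # $0.10 to $0.50 per base pair

# Assumption, not from the slide: roughly 3 billion base pairs per human genome.
GENOME_BASES = 3_000_000_000


def trace_bytes_per_base() -> float:
    """Average raw-trace storage implied per base pair."""
    return TRACE_BYTES_PER_HUMAN / GENOME_BASES


def sequencing_cost_dollars(n_bases: int) -> tuple:
    """(low, high) cost in dollars for n_bases at the quoted per-base prices."""
    low_cents, high_cents = COST_PER_BASE_CENTS
    return (n_bases * low_cents / 100, n_bases * high_cents / 100)


per_base = trace_bytes_per_base()
region_cost = sequencing_cost_dollars(400)  # 400 bases: a hypothetical region length

print(f"{per_base:,.0f} bytes of trace data per base")
print(f"400-base region: ${region_cost[0]:.2f} to ${region_cost[1]:.2f}")
```

Under these assumptions the 300-terabyte figure works out to about 100 KB of raw trace data per base pair, which shows how much larger the trace files are than the base calls derived from them.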


Genetic Sequence
• Though the trace files are large, the readings take up much less space
• FASTA format: simple text file format with base calls or amino acids
• Lowest common denominator between proprietary systems
• The entire genome can be downloaded in FASTA format

>TC30326 s1 TC63997 TC16407 TC21735 TC23192 TC30327 TC50687 TC59470
GAGCCTCTGGGTCCCGTCTAGGTACACTTTCTGCATTTCGAGCCCGGGCAGGTGAGGTGCGACAGGTAAATTTAACACAATGGATTTCTCCAAGCTACCCAAAATCCGAGATGAGGATAAAGAAAGTACATTTGGTTATGTGCATGGAGTCTCAGGGCCTGTGGTTACAGCCTGTGACATGGCGGGCGCTGCCATGTACGAGCTGGTGAGAGTGGGGCACAGCGAGCTGGTTGGAGAAATTATTCGATTGGAAGGTGACATGGCCACCATTCAGGTGTATGAAGAAACTTCTGGTGTCTCTGTTGGAGACCCCGTACTCCGCACTGGTAAACCTCTCTCGGTCGAGCTGGGTCCCGGGATTATGGGAGCCATTTTTGATGGTATACAGAGACCTCTGTCGGATATCAGCAGTCAGACCCAAAGTATCTACATCCCCAGAGGAGTCAATGTGTCTGCTCTCAGCAGAGATATCAAATGGGAGTTTATACCCAGCAAAAACCTACGGGTTGGTAGTCATATCACTGGTGGAGACATTTATGGGATTGTCAATGAGAACTCCCTCATCAAACACAAAATCATGTTGCCCCCACGTAACAGAGGAAGCGTGACTTACATCGCGCCGCCTGGGAATTATGATGCATCCGATGTCGTCCTGGAGCTTGAGTTTGAAGGTGTGAAGGAGAAGTTCAGCATGGTCCAAGTGTGGCCTGTGCGGCAGGT
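
Because FASTA is plain text, reading it takes very little code. The following is a minimal sketch, assuming the usual convention that a line starting with `>` is a header and every following line up to the next header is sequence. The `parse_fasta` name and the short demo record are illustrative and not part of the slides.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA text supplied as an iterable of lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line       # sequence may be wrapped over many lines
    return records


# A made-up two-line record, wrapped the way FASTA files usually are.
example = """>demo_record hypothetical example
GAGCCTCTGG
GTCCCGTCTA
""".splitlines()

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence), sequence)
```

For real work an established parser such as Biopython's `SeqIO` is the safer choice. The point of the sketch is only that the format is simple enough to serve as a lowest common denominator.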


Microarrays
• Raw TIFF images from a single chip can take 10-100 MB
• File of expression measurements is 0.5-1 MB
• MIAME: Minimum Information about a Microarray Experiment
• MGED: Microarray Gene Expression Database
• Affymetrix microarrays are made 40 chips per wafer
• A single wafer has 60 million probes; wafer imaging 5-10 TB
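
The numbers above imply a probe count per chip and a size reduction from image to expression file. The sketch below simply does that arithmetic. It assumes probes are divided evenly among the chips on a wafer, which the slide does not state.

```python
# Figures quoted on the slide.
CHIPS_PER_WAFER = 40
PROBES_PER_WAFER = 60_000_000
RAW_IMAGE_MB = (10, 100)          # raw TIFF image per chip
EXPRESSION_FILE_MB = (0.5, 1.0)   # processed expression measurements per chip

# Assumption: probes are spread evenly across the chips on a wafer.
probes_per_chip = PROBES_PER_WAFER // CHIPS_PER_WAFER

# How much smaller the expression file is than the raw image,
# taking the least and most favourable pairings of the quoted ranges.
smallest_reduction = RAW_IMAGE_MB[0] / EXPRESSION_FILE_MB[1]
largest_reduction = RAW_IMAGE_MB[1] / EXPRESSION_FILE_MB[0]

print(f"{probes_per_chip:,} probes per chip")
print(f"expression file is {smallest_reduction:.0f}x to {largest_reduction:.0f}x smaller than the image")
```

On these figures, keeping only the expression measurements saves one to two orders of magnitude of storage per chip, at the price of losing the ability to re-analyse the image.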


Sample Annotations
• The least common denominator
• How to describe the context of the sample one is measuring?
• Equivalent to the medical records problem


IT Challenges
• High bandwidth data collection
  – We are used to physiological measurements with high sample rates, but these are not saved
  – Our challenge is higher density microarrays: 10-100 MB each
• Data storage
  – We are used to 15% US population getting imaging, equals 200 million multi-GB images
  – Our challenge is raw sequencing trace files for one human = 300 terabytes


IT Challenges
• Measurement Noise
  – We are used to artifacts in physiological measures
  – Our challenge is poor expression measurement reproducibility
• Data Models
  – We are used to lack of standards in medical records
    • HL7, HIPAA
  – Our challenge is too many standards in bioinformatics
    • Gene Expression Markup Language (GEML)
    • Gene Expression Omnibus (GEO)
    • Microarray Markup Language (MAML)
  – Medical record as sample annotation


Common Challenges
• Comparing new signals to old


Common Challenges
• Continued development of controlled vocabularies
• Knowledge management

[Figure label: HL7]


Common Challenges
• Security
• Privacy
• Ethics

[Figure label: HL7]


American Medical Informatics Association

www.amia.org

Bio+medical Informatics: One Discipline

November 9-13, 2002, San Antonio, Texas


Bioinformatics and Integrative Genomics

big.chip.org

NIH Funded

New PhD training program in bioinformatics for quantitative individuals

Includes training in wet- and dry-biology, clinical medicine

First class Fall 2002


Microarrays for an Integrative Genomics

• Upcoming book in press at MIT Press
• Available Spring 2002



Collaborators and Support
• Collaborations
  – Scott Weiss / Channing Laboratory: NHLBI Program of Genomics Applications, Nurses Health Study, Physicians Health Study, Normative Aging Study
  – Seigo Izumo / Beth Israel: NHLBI Program of Genomic Applications, Framingham Heart Study
  – Jeff Drazen / Brigham and Women’s: NIGMS Pharmacogenetics
  – David Rowitch / Dana Farber: NINDS Innovative Technologies
  – Morris White / Joslin Diabetes Center, Howard Hughes Medical Institute
  – Tovia Libermann / Beth Israel: NIDDK Biotechnology Center
  – Terry Strom / Beth Israel: NIAID Immune Tolerance Network
  – Louis Kunkel / Children’s Hospital: Muscular Dystrophy
  – C. Ron Kahn / Joslin Diabetes Ctr: Diabetes Genomic Anatomy Project
  – Mary Elizabeth Patti / Joslin: Diabetes
  – Andrea Dunaif / Brigham and Women’s Hospital: Polycystic Ovarian Syndrome
• Support
  – NIH: NLM, NINDS, NHLBI, NIDDK, NIAID
  – Lawson Wilkins NovoNordisk Award
  – Merck / MIT Fellowship
  – Genentech Foundation Fellowship
  – Endocrine Fellow Foundation


Children’s Hospital Informatics Program
Bioinformatics
www.chip.org

• Isaac Kohane, Director
• Ling Bao
• Atul Butte
• Sangeeta Barnabas English
• Ade Dosunma
• Steven Greenberg
• Aaron Homer
• Janet Karlix
• Alvin Kho
• Ju Han Kim
• Winston Kuo
• Kyungjoon Lee
• Voichita Marinescu
• Ashish Nimgaonkar
• Peter Park
• Marco Ramoni
• Alberto Riva
• Yao Sun
• Zoltan Szallagi
• Alex Turchin
• Eric Tsung
• Mark Whipple
• Maneesh Yadav