cse/beng/bimm 182: biological data analysis
DESCRIPTION
CSE/Beng/BIMM 182: Biological Data Analysis. Instructor: Vineet Bafna TA: Nitin Udpa. www.cse.ucsd.edu/classes/fa09/cse182. Is this on the test? Can I get an extension on my homework?. Today. We will explore the syllabus through a series of questions? Please ASK - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/1.jpg)
CSE/Beng/BIMM 182: Biological Data Analysis
Instructor: Vineet BafnaTA: Nitin Udpa
www.cse.ucsd.edu/classes/fa09/cse182
![Page 2: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/2.jpg)
Today
• We will explore the syllabus through a series of questions?
• Please ASK• All logistical information will be given at
the end Is this on the test?Can I get an extension on my homework?
![Page 3: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/3.jpg)
Introduction to the class:Databases
• Biological databases are diverse– Often, little more than large text files
• Database technology is about formally representing data and the inter-relationships among the data objects.
• This course is not about databases, but about the data itself.
• We will ‘look’ at many biological databases (keep a count!) but not at their formal structure. Instead, we will ask:– How can we represent the data?– How can we query this data?
• In order to understand the data, we need to know a little Biology.
![Page 4: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/4.jpg)
Life begins with Cell
• A cell is a smallest structural unit of an organism that is capable of independent functioning
• All cells have some common features
![Page 5: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/5.jpg)
All life depends on 3 critical molecules• Protein
– Form enzymes, send signals to other cells, regulate gene activity.
– Form body’s major components (e.g. hair, skin, etc.).• DNA
– Hold information on how cell works• RNA
– Act to transfer short pieces of information to different parts of cell
– Provide templates to synthesize into protein
![Page 6: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/6.jpg)
The molecules of Life and Bioinformatics
• DNA, RNA, and Proteins can all be represented as strings!
• DNA/RNA are string over a 4 letter alphabet(A,C,G,T/U).
• Protein Sequences are strings over a 20 letter alphabet.
• This allows us to store and query them as text.
![Page 7: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/7.jpg)
History of Genbank
• In 1982 Goad's efforts were rewarded when the National Institutes of Health funded Goad's proposal for the creation of GenBank, a national nucleic acid sequence data bank. By the end of 1983 more than 2,000 sequences (about two million base pairs) were annotated and stored in GenBank.
Walter Goad, 1942-2000
![Page 8: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/8.jpg)
Sequence data
![Page 9: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/9.jpg)
![Page 10: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/10.jpg)
How do we query a sequence database?
• By name• By sequence• ‘Relational’
queries are barely applicable
![Page 11: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/11.jpg)
Quiz:DNA sequence databases
Suppose you have a 100nt sequence, and you want to know if it is human, what will you do?
How much time will it take? Or, how many steps? (Query=m, Database = n)
• What if you were interested in identifying the human homolog of a mouse sequence ( 85% identical)? How much time will it take? What if the query was 10Kbp? What if it was the entire genome?
ACGGATCGGCGAATCGAATCGTGGGCCTTAdatabase
AATCGTquery
![Page 12: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/12.jpg)
BLAST
• Allows querying sequence databases with sequence queries.
• It is the prototypical search tool.
• The paper describing it was the most cited paper in the 90s.
![Page 13: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/13.jpg)
Quiz:BLAST
What do you do if BLAST does not return a ‘hit’?
What does it mean if BLAST returns a sequence that is 60% identical? Is that significant (are the sequences evolutionarily related)?
Suppose Protein sequences A & B are 40% identical, and A &C are 40% identical. If we know that A&B are evolutionarily related, what does that say about A & C?
![Page 14: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/14.jpg)
Non sequence based queries
• Biological databases are not limited to sequences.
![Page 15: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/15.jpg)
Protein Sequences have structure
Quiz: Can you search using a structure query?
![Page 16: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/16.jpg)
Ex2: Sequences have motifs
How to represent and query such motifs?
![Page 17: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/17.jpg)
Quiz: Protein Sequence Analysis
• You are interested in all protein sequences that have the following pattern:– [AC]-x-V-x(4)-{ED}
• This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
• How can you search a protein sequence database for any such pattern?
• What if the database was a collection of patterns ?
![Page 18: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/18.jpg)
Database of Protein Motifs
![Page 19: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/19.jpg)
Quiz: Protein Sequence Analysis
Proteins fold into a complex 3D shape. Can you predict the fold by looking at the sequence?
What is a domain? How can you represent a domain? How can you query?
![Page 20: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/20.jpg)
Quiz: Biology
• DNA is the only inherited material. Proteins do most of the work, so DNA must somehow contain information about the proteins.
• How is the information about proteins encoded in DNA? What is the region encoding this information called?
![Page 21: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/21.jpg)
DNA, RNA and flow of information
• A gene is expressed in two steps1) Transcription: RNA synthesis2) Translation: Protein synthesis
![Page 22: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/22.jpg)
DNA, RNA, and the Flow of Information
TranslationTranscription
Replication
![Page 23: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/23.jpg)
Quiz:
How would you find genes in genomic sequence?
What is splicing? Alternative splicing? How can you (computationally) tell if a gene has alternative splice forms?
What is a gene?
![Page 24: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/24.jpg)
Quiz:Transcription?
• What causes transcription to switch on or off? How can we find transcription factor binding sites?
• The number of transcripts of a gene is indicative of the activity of the gene. Can we count the number of transcripts? Can we tell if the number of copies is abnormally high, or abnormally low?
![Page 25: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/25.jpg)
Quiz: Translation• How is Protein
Sequencing done?
Many proteins are post-translationally modified. How can you identify those proteins?
• What is a mass spectrometer?
![Page 26: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/26.jpg)
Quiz: Translation
• Are all genes translated?
• Can you predict non-coding genes in the genome? Can you predict structure for RNA?
• What is special about RNA?
![Page 27: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/27.jpg)
RNA sequences have Structure
![Page 28: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/28.jpg)
Quiz:RNA• How can you predict secondary, and
tertiary structure of RNA?• Given an RNA query (sequence +
structure), can you find structural homologs in a database? EX: tRNA
![Page 29: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/29.jpg)
Packaging• All of the transcripts
are encoded in DNA, which is packaged into the genome.
• Many databases (much of sequence) are devoted to storing entire genomic sequences.
![Page 30: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/30.jpg)
Genome Sequencing• How is the genome sequence determined?
Sequences can only be read 500-1000bp at a time. How long is the human genome?
• If human genome is of length X(=3Gb), and each shotgun fragment is of length y, how many fragments do we need to get X
• What is shotgun sequencing?
![Page 31: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/31.jpg)
Quiz: Sequencing
• Suppose you have fragments, and you want to assemble them into the genome, how would you do it? – How would you determine the overlaps– Layout, Consensus?
![Page 32: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/32.jpg)
1997
What was the main point of the debate?
![Page 33: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/33.jpg)
2001
![Page 34: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/34.jpg)
Sequencing Populations
• It took a long time (10-15 yrs) to produce the draft sequence of the human genome.
• Soon (within 10-15 years), entire populations can have their DNA sequenced. Why do we care?
![Page 35: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/35.jpg)
April’08 Bafna
Personalized genomics
![Page 36: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/36.jpg)
23andMe
Sep’07 UCSD Bix
![Page 37: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/37.jpg)
Sep’07 UCSD Bix
![Page 38: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/38.jpg)
Quiz:Population genetics
• We are all similar, yet we are different. How substantial are the differences?– Why are some people more likely to get a
disease then others?– If you had DNA from many sub-populations,
Asian, European, African, can you separate them?
– How is disease gene mapping done?
![Page 39: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/39.jpg)
Variations in DNA• What is a SNP?• What is DNA
fingerprinting?• What can you
study with these variations?
![Page 40: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/40.jpg)
How do these individual differences occur?
• Mutation• Recombination
![Page 41: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/41.jpg)
Mutations
000001010111000110100101000101010010000000110001111000000101100110
Infinite Sites Assumption:Each site mutates at most once
![Page 42: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/42.jpg)
Recombination
1101010100010111101010001010110100
1101010101011010011010101010110100
![Page 43: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/43.jpg)
Genotypes and Haplotypes• Each individual has two “copies” of each
chromosome. • At each site, each chromosome has one of two
alleles
• Current Genotyping technology doesn’t give phase
0 1 1 1 0 0 1 1 0
1 1 0 1 0 0 1 0 0
01
1 01 1 0 0 1 0
1 0 Genotype for the individual
![Page 44: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/44.jpg)
SNP databases
• Quiz: Given a database of ‘variations’ in a population (EX: dbSNP), how do you use it to map disease genes?
• Given database from different ethnicities, how do we check the ethnicity of a specific individual?
![Page 45: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/45.jpg)
Summary
• Biological data is complex.• Hard to standardize representation, and
harder to query such data• Important to understand this diversity
and the variety of tools available for querying.
![Page 46: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/46.jpg)
Course Outline
• Informal description of various data repositories
• Tools for querying this data– Underlying algorithms– Implementation issues
• Assignments– Using & building simple versions of these
tools.
![Page 47: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/47.jpg)
Perl/Python
• Advanced programming skills are not required except in optional projects..
• Facility for handling and manipulating data is important and will be covered in this course.
• Perl/Python are appropriate scripting languages. You can do a lot by learning a little.
![Page 48: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/48.jpg)
Grading
• 40% assignments, 15% Mid-term, 15% Final, 30% Project
• For all assignments, you are free to discuss, and use web resources unless otherwise stated.– Cite all sources and collaborators!
• The final exam will be take home and no collaboration is allowed.
• Academic honesty is more important than grades!
![Page 49: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/49.jpg)
Assignment 1
• Will be given out Tuesday. • Due in class next week, but is fairly
simple to accomplish with a scripting language.
![Page 50: CSE/Beng/BIMM 182: Biological Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062323/568167d6550346895ddd2ea5/html5/thumbnails/50.jpg)
Project• You can team up (<= 3) to do the project.• Some project require more biology, others
require serious programming.• There are 3 checkpoints, after the first
midterm.• For the final project, you must make a 15min
presentation at the end of the class.