

Bi190 Advanced Genetics 2011 Lecture 10/ho9 Annotating Genome

 

Sternberg 2011 

Lecture 10. Genetic and Genomic Databases

Biological information resources that organize information around genomes and genes have become essential tools of life science research. The information comes from multiple sources, is organized so as to be computable, and is displayed for use. It is important to know where the information comes from, how it is organized using ontologies (standardized terms and their relationships), and how one can search it. Everyone also needs a sense of how complete the information is. We also discuss the importance of data standards.

The extent of genome-scale data surpasses a human's ability to comprehend it all at one time. We thus have to rely on databases of genomic information. Because genomic databases are interposed between primary data and the geneticist, it is crucial to understand how the information gets into such databases, how one assesses the quality of these data, and the concepts underlying their storage, query and integration.

There are many ways to organize knowledge. While humans are amazingly good at dealing with a hodge-podge of information, computers are notoriously bad at this task: computers need structured information! In the genome era, the sheer amount of information requires us to use computers to store and manipulate it, because computers are very fast and accurate at repetitive tasks. Information can be compiled in a standard form and assembled into defined structures. One way to organize biological information is to attach it to the linear structure of a genome. Another is by the anatomy of the organism. Yet another is by human disease or some other specific property.

There are now thousands of biological databases. These databases vary in size, complexity, purpose, and whether they serve humans, computers or both.

A. How data gets into databases

A genome database organizes information around a genome sequence

A genome database is typically organized around a genomic sequence. For example, the database might simply store the genome sequence and some description of features of that sequence. Such descriptions are called annotations. This is where it can get rather complex, with many thousands of specific types and sub-types of features.

…tctctctatatgatctgcagcaggtcatctctgcggcttatgcgttagcgcg…

What types of information do we want? We might want to know which regions of the genome are repetitive, or the extent of each gene.

Genomic databases have a variety of content

DNA sequence (Chapter 8) is submitted by the producer of the data to one of the large public DNA sequence databases (GenBank or ..). The sequence typically includes some annotations.

RNA or cDNA sequence is obtained by submission of the data upon publication.

Primary annotations to the genomic sequence:
    Extent of clones
    Genes
    Other elements
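As a rough illustration of what "annotations on a sequence" can look like to a computer, the sketch below stores each annotation as a typed interval on a reference sequence and asks which annotations overlap a region of interest. The class name, field names, feature names and coordinates are all invented for this example; real genome databases use far richer feature types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """One annotation: a typed interval on a reference sequence (1-based, inclusive)."""
    seqid: str
    start: int
    end: int
    ftype: str
    name: str


def overlapping(features, seqid, start, end, ftype=None):
    """Return the features on `seqid` that overlap [start, end], optionally of one type."""
    return [
        f for f in features
        if f.seqid == seqid
        and f.start <= end and f.end >= start
        and (ftype is None or f.ftype == ftype)
    ]


# Hypothetical annotations; names and coordinates are made up for illustration.
annotations = [
    Feature("chrI", 100, 900, "gene", "geneA"),
    Feature("chrI", 450, 520, "repeat", "repeat1"),
    Feature("chrI", 1500, 2400, "gene", "geneB"),
    Feature("chrI", 50, 1200, "clone", "cloneX"),
]

hits = overlapping(annotations, "chrI", 500, 1000)
print([f.name for f in hits])
```

Passing `ftype="gene"` or `ftype="repeat"` restricts the answer to one kind of annotation, which is the simplest form of the questions posed above (which regions are repetitive, what is the extent of each gene).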


Gene expression data relate a gene or specific DNA sequence to a level, time, place or condition of gene expression. They might be derived from direct assay of the mRNA or protein, or from a reporter gene construct. For large-scale gene expression data (see Chapter 11) such as microarray or RNA-seq, the data are submitted to standard databases. For other assays there is no requirement for submission of data upon publication, and most such data are curated by hand.

Genetic variants relative to the reference sequence.

Sequence conservation. A quantitative measure of the extent of conservation is calculated from aligning multiple sequences.

Genetic mapping data

Gene-gene interactions

Protein-protein and protein-RNA interactions

Association of genetic variants to phenotypes

Figure SGF-1082. Part of human RAB3A locus.

Figure. Zoomed-in view showing amino acids.
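The sequence conservation entry above can be made concrete with a small sketch. It assumes the sequences have already been aligned to equal length and scores each column by the fraction of sequences that share the most common base. This is only one simple measure chosen for illustration, not the specific score computed by any particular genome browser, and the alignment is invented.

```python
from collections import Counter


def column_conservation(aligned):
    """Fraction of sequences sharing the most common residue in each column.

    Gap characters ('-') are never counted as the consensus residue, but they
    still count in the denominator, so gapped columns score lower.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*aligned):
        counts = Counter(base for base in column if base != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(column))
    return scores


# Hypothetical four-sequence alignment.
alignment = ["ATGCA", "ATGTA", "ATG-A", "ACGCA"]
print(column_conservation(alignment))
```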


2. Stable, unique identifiers help maintain data integrity

Imagine a gene with 12 names. For example, the yeast SIN3 gene is also known as YOL004W, CPE1, GAM2, RPD1, SDI1, SDS16, and UME4. Databases store synonyms and make it easy to keep these straight. However, imagine two names that look alike but mean different things, such as cdc25 in S. pombe and CDC25. Imagine genes that change names because they merge or split. These nightmares happen frequently and can lead to great confusion. An excellent solution is to assign each gene a stable identifier.

The largest database of biomedical abstracts (PubMed) has historically indexed authors by surname and initials rather than by full first name. Many individual researchers have the same initials, and thus these names are ambiguous in the database. It would be an enormous task to disambiguate them.

As humans we would prefer to use whatever symbol we like, and computer programs can often help us translate our symbols into hidden unique identifiers. When there is ambiguity, we might get to choose. For example, if one searches PubMed for "elegans", the results include C. elegans the nematode and S. elegans the turtle. Searching for "C. elegans" might return C. elegans the flowering plant (Camellia elegans). NLM-NCBI uses a unique identifier for each taxon, so with discipline the user can find specifically the species of interest.

Results can be tied to reagents to keep track of an inference chain

Imagine an RNAi experiment done with a particular sequence that maps uniquely to the genome and is assigned to a gene, T. We associate T with Phenotype W using this RNAi reagent. Now, new sequencing of cDNA reveals that T, which had been predicted from the genomic sequence and had only partial cDNA support, is actually two genes, T and U. If the RNAi sequence now lies in U, then to keep the database straight someone has to notice this and change the association of the Phenotype to U. However, if the RNAi experiment is associated with a sequence that is continually remapped to the genome, then the Phenotype is associated correctly with U without manual intervention.

Many data are curated by humans

In many cases, data are entered into a genomic database after examination and processing by a professional curator. For example, a biologist reads a research paper and extracts information. Data in a table in a paper are in a reasonably standard format.

Researchers enter data

Data is extracted automatically from papers

B. Ontologies and their use

Ontologies organize information

Ontologies formalize some types of information. An ontology is a description of the relationships among defined terms. Both the terms and their relationships are defined, and this structured information allows computers to use the information effectively.

Ontologies cover many types of information

    Anatomy ontologies, e.g. mouse
    Phenotype, e.g. C. elegans
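To make "defined terms and defined relationships" concrete, here is a minimal sketch of an ontology stored as a graph and used in a query. The terms, the relationship names (is_a, part_of) and the gene annotations are a toy example invented for illustration, not an excerpt from any real anatomy or phenotype ontology. The sketch also assumes, for simplicity, that annotations propagate upward through every relationship type.

```python
# Hypothetical mini-ontology: each term maps to (relationship, parent term) pairs.
ONTOLOGY = {
    "cell": [],
    "neuron": [("is_a", "cell")],
    "motor neuron": [("is_a", "neuron")],
    "axon": [("part_of", "neuron")],
}

# Hypothetical annotations: (gene, term the gene is annotated to).
ANNOTATIONS = [("geneA", "motor neuron"), ("geneB", "axon"), ("geneC", "cell")]


def ancestors(term, ontology):
    """All terms reachable from `term` by following any relationship upward."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in ontology.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def annotated_to(term, annotations, ontology):
    """Genes annotated to `term` itself or to any term below it."""
    return sorted(
        gene for gene, gene_term in annotations
        if gene_term == term or term in ancestors(gene_term, ontology)
    )


print(annotated_to("neuron", ANNOTATIONS, ONTOLOGY))
```

Because the relationships are explicit, a search for "neuron" finds genes that were annotated only to the more specific terms, which a plain text match on the word "neuron" would not do reliably.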

The Gene Ontologies capture some basic information about genes and their products. Gene Ontologies (GO) include evidence for the associations.

MOD organism or group                 database URL
sea urchin                            http://www.spbase.org/SpBase/
cellular slime mold                   http://dictybase.org/
Drosophila melanogaster (fruitfly)    http://flybase.org/
C. elegans and other nematodes        wormbase.org
budding yeast
fission yeast
Gramene                               http://www.gramene.org/
mouse
rat
zebrafish
frog
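Databases like those listed above are where the synonym problem from section 2 is handled in practice. The sketch below shows the general idea of resolving whatever name a user types to a stable identifier, and of reporting ambiguity instead of guessing. The identifiers (GENE:0001 and so on), the record layout and the placeholder organism are hypothetical; only the SIN3 synonym list comes from the text.

```python
from collections import defaultdict

# Hypothetical stable identifiers and record layout.
GENES = {
    "GENE:0001": {
        "symbol": "SIN3",
        "organism": "S. cerevisiae",
        "synonyms": ["YOL004W", "CPE1", "GAM2", "RPD1", "SDI1", "SDS16", "UME4"],
    },
    "GENE:0002": {"symbol": "cdc25", "organism": "S. pombe", "synonyms": []},
    # Placeholder organism: the point is only that the symbol collides.
    "GENE:0003": {"symbol": "CDC25", "organism": "species X", "synonyms": []},
}


def build_index(genes):
    """Map every lower-cased symbol or synonym to the set of stable IDs using it."""
    index = defaultdict(set)
    for gene_id, record in genes.items():
        for name in [record["symbol"], *record["synonyms"]]:
            index[name.lower()].add(gene_id)
    return index


def resolve(name, index):
    """Return the stable IDs matching `name`; more than one means the name is ambiguous."""
    return sorted(index.get(name.lower(), set()))


INDEX = build_index(GENES)
print(resolve("UME4", INDEX))    # one gene, reached through a synonym
print(resolve("cdc25", INDEX))   # two candidates: the user has to choose
```

All other data (phenotypes, expression, interactions) would then be attached to the stable identifier, so renaming a gene changes one record and nothing else.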


The protein content of a genome can help define pathways

Figure. Genetic modules based on association in the genome.

Conceptual view of a biological database

Database table

organism                    URL
human                       http://genome.ucsc.edu/cgi-bin/hgGateway
mouse                       http://www.informatics.jax.org/
C. elegans                  http://www.wormbase.org/
Drosophila                  http://flybase.org/
S. cerevisiae               http://www.yeastgenome.org/
S. pombe                    http://www.pombase.org/ (not yet live)
                            http://old.genedb.org/genedb/pombe/
                            genome size 12.5 Mb (~14.1 Mb
rat                         http://rgd.mcw.edu/
zebrafish                   http://zfin.org/cgi-bin/webdriver?MIval=aa-ZDB_home.apg
Arabidopsis                 http://www.arabidopsis.org/
gramene                     http://www.gramene.org/
Gene Ontology Consortium    http://www.geneontology.org/
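The RNAi example in section 2 suggests one way a database can keep an inference chain intact: attach the phenotype to the reagent and its genome coordinates, and derive the gene association from the current gene models whenever it is needed. The sketch below illustrates that idea under invented assumptions; the reagent name, chromosome and all coordinates are hypothetical, and the gene names T and U follow the example in the text.

```python
# The phenotype is recorded against the reagent, not against a gene name.
REAGENT = {"name": "rnai_1", "seqid": "chrII", "start": 5200, "end": 5600}
PHENOTYPES = {"rnai_1": "Phenotype W"}

# Gene models (seqid, start, end) before and after new cDNA evidence splits T.
MODELS_OLD = {"T": ("chrII", 1000, 6000)}
MODELS_NEW = {"T": ("chrII", 1000, 4000), "U": ("chrII", 4500, 6000)}


def genes_hit(reagent, models):
    """Genes whose current model overlaps the reagent's genome coordinates."""
    return sorted(
        gene for gene, (seqid, start, end) in models.items()
        if seqid == reagent["seqid"]
        and start <= reagent["end"] and end >= reagent["start"]
    )


def phenotype_by_gene(reagent, models, phenotypes):
    """Derive gene-to-phenotype associations from the reagent at query time."""
    phenotype = phenotypes[reagent["name"]]
    return {gene: phenotype for gene in genes_hit(reagent, models)}


print(phenotype_by_gene(REAGENT, MODELS_OLD, PHENOTYPES))  # association with T
print(phenotype_by_gene(REAGENT, MODELS_NEW, PHENOTYPES))  # association moves to U
```

No record had to be edited by hand when the gene models changed, because the only stored facts are the reagent coordinates and the observed phenotype.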

Variation
