Ganesha Associates, 24 August 2012
Basic reading, writing and informatics skills for biomedical research
Segment 4. Other types of database and browser

Biological databases
• A database is an indexed collection of information.
• Some databases contain mainly text, but others contain image, sequence or structural data.
• A browser is a means of visualising this information and the relationships between data elements.
• There is a growing amount of information in publicly available databases.
• For example, in 2011 the Nucleic Acids Research online Molecular Biology Database Collection listed 1380 databases.
• The National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI) host some of the most important databases used for biomedical research.
• Wikipedia also contains a list of biological databases.
• Which databases are relevant to your project?

Data, data everywhere…
• “Rapid release of prepublication data has served the field of genomics well.”
• “With close to one million gene-expression data sets now in publicly accessible repositories, researchers can identify disease trends without ever having to enter a laboratory.”
• “Most researchers agree that open access to data is the scientific ideal, so what is stopping it happening [in other fields]?”
• “Earth scientists need better incentives, rewards and mechanisms to achieve free and open data exchange.”

The database problem
• Volume of digital data (both high throughput and text)
  – One second of HD video = 2000 pages of text
• Distributed systems and databases, lack of data standards, incompatible data formats
• Costs of creation, curation and maintenance
• Retrieval: semantic search, metadata, images…

The problem – biomedical research
[Diagram labels: Gene Expression Warehouse; Protein; Disease; SNP; Enzyme; Pathway; Known Gene; Sequence Cluster; Affy Fragment; Sequence; LocusLink; MGD; ExPASy SwissProt; PDB; OMIM; NCBI dbSNP; ExPASy Enzyme; KEGG; SPAD; UniGene; GenBank; NMR; Metabolite]

The problem – healthcare
Journal of the American Medical Association (JAMA), Vol 284, No 4, July 26th 2000
• 12,000 deaths/year from unnecessary surgery
• 7,000 deaths/year from medication errors in hospitals
• 20,000 deaths/year from other errors in hospitals
• 80,000 deaths/year from infections in hospitals
• 106,000 deaths/year from non-error, adverse effects of medications
These total 225,000 deaths per year in the US from iatrogenic causes, which ranks them as the #3 killer.
Iatrogenic is a term used when a patient dies as a direct result of treatment by a physician, whether from misdiagnosis of the ailment or from adverse reactions to the drugs used to treat the illness (drug reactions are the most common cause).

The problem – healthcare
• 17-year innovation adoption curve from discovery into accepted standards of practice
• Even if a standard is accepted, patients have a 50:50 chance of receiving appropriate care and a 5-10% probability of incurring a preventable, anticipatable adverse event
• Medical literature doubling every 19 years
  – Doubles every 22 months for AIDS care
• 2 million facts needed to practice
• Genomics and personalized medicine will increase the problem exponentially
• Typical drug order today with decision support accounts for, at best, Age, Weight, Height, Labs, Other Active Meds, Allergies, Diagnoses
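
To put the doubling times quoted on this slide on a common scale, the sketch below converts a doubling time into an equivalent yearly growth rate. It assumes smooth exponential growth, which the slide does not state, and the function name is purely illustrative.

```python
def annual_growth_rate(doubling_time_years: float) -> float:
    """Return the fractional yearly growth implied by a doubling time.

    Assumes steady exponential growth: size(t) = size(0) * 2 ** (t / T),
    where T is the doubling time in years.
    """
    return 2 ** (1 / doubling_time_years) - 1


# Doubling times quoted on the slide.
all_literature = annual_growth_rate(19)     # doubles every 19 years
aids_care = annual_growth_rate(22 / 12)     # doubles every 22 months

print(f"Medical literature overall: {all_literature:.1%} per year")
print(f"AIDS care literature: {aids_care:.1%} per year")
```

Under that assumption the overall literature grows by a few percent a year, while the AIDS care literature grows by nearly half each year.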

So how will we find things in databases?
• A search engine collects, parses, indexes and stores data to facilitate fast and accurate information retrieval.
• Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics (statistics), informatics, physics and computer science.
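
The collect, parse, index and retrieve steps can be illustrated with a minimal inverted index. This is a teaching sketch only: the documents are made up, the function names are hypothetical, and real search engines add ranking, stemming and much more.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Parse text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each token to the set of document ids that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return ids of documents that contain every query token (AND search)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Made-up records, for illustration only.
documents = {
    "doc1": "Mitochondrial P450 has monooxygenase activity",
    "doc2": "Toll regulated genes in the immune response",
    "doc3": "Monooxygenase activity in the immune response",
}
index = build_index(documents)
print(sorted(search(index, "monooxygenase activity")))
print(sorted(search(index, "immune response")))
```

Because the index is built once and each query is only a set intersection, lookups stay fast however long the individual documents are.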

Semantic levels
[Table. Rows: Keywords, Dictionary, Controlled vocabulary, Thesaurus, Taxonomy, Ontology. Columns: Definition, Synonyms, Classification (is_a), Properties (has_a), Other relations.]

The Gene Ontology organisation
• The objective of GO is to provide controlled vocabularies for the description of the molecular function, biological process and cellular component of gene products.
• These terms are to be used as attributes of gene products by collaborating databases, facilitating uniform queries across them.
• The controlled vocabularies of terms are structured to allow both attribution and querying to be at different levels of granularity.
• http://www.geneontology.org
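
Querying "at different levels of granularity" can be sketched with a toy is_a hierarchy. The hierarchy and gene names below are simplified, hypothetical stand-ins and not the actual GO graph, where a term can also have several parents.

```python
# Toy is_a hierarchy (child -> parent). Simplified for illustration;
# this is NOT the actual Gene Ontology graph.
IS_A = {
    "mitochondrial inner membrane": "mitochondrial membrane",
    "mitochondrial outer membrane": "mitochondrial membrane",
    "mitochondrial membrane": "organelle membrane",
    "organelle membrane": "membrane",
}

# Hypothetical gene products, each annotated with its most specific known term.
ANNOTATIONS = {
    "geneA": "mitochondrial inner membrane",
    "geneB": "mitochondrial outer membrane",
    "geneC": "organelle membrane",
}


def ancestors(term: str) -> list[str]:
    """Return the term followed by all of its is_a ancestors."""
    chain = [term]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def genes_annotated_to(term: str) -> list[str]:
    """Find gene products annotated to the term or to any more specific term."""
    return sorted(g for g, t in ANNOTATIONS.items() if term in ancestors(t))


print(genes_annotated_to("mitochondrial inner membrane"))  # fine-grained query
print(genes_annotated_to("mitochondrial membrane"))        # coarser query
print(genes_annotated_to("membrane"))                      # coarsest query
```

A curator can annotate at whatever level the evidence supports, and a broad query still finds the more specific annotations beneath it.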

An example of annotation
Mitochondrial P450 (CC24 PR01238; MITP450CC24)
GO cellular component term: mitochondrial inner membrane; GO:0005743
GO molecular function term: monooxygenase activity; GO:0004497
GO biological process term: electron transport; GO:0006118
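
One way to hold such an annotation in a program is as a small record per term. The class and function names below are assumptions made for illustration, not a format defined by GO; only the terms and identifiers come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GOAnnotation:
    aspect: str  # "cellular component", "molecular function" or "biological process"
    term: str
    go_id: str


# The three annotations shown on the slide for mitochondrial P450.
MITOCHONDRIAL_P450 = [
    GOAnnotation("cellular component", "mitochondrial inner membrane", "GO:0005743"),
    GOAnnotation("molecular function", "monooxygenase activity", "GO:0004497"),
    GOAnnotation("biological process", "electron transport", "GO:0006118"),
]


def terms_for(annotations: list[GOAnnotation], aspect: str) -> list[str]:
    """Return 'term (GO id)' strings for one of the three GO aspects."""
    return [f"{a.term} ({a.go_id})" for a in annotations if a.aspect == aspect]


print(terms_for(MITOCHONDRIAL_P450, "molecular function"))
```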

Microarray data analysis with GO
[Figure labels: attacked, control, time. Gene clusters are annotated with GO terms:]
• Puparial adhesion, molting cycle, hemocyanin
• Defense response, immune response, response to stimulus, Toll regulated genes, JAK-STAT regulated genes
• Immune response, Toll regulated genes
• Amino acid catabolism, lipid metabolism
• Peptidase activity, protein catabolism, immune response
Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL, and Eugene Schuster Group, EBI.

GoPubMed
• GoPubMed is a knowledge-based search engine for biomedical texts. The Gene Ontology (GO) and Medical Subject Headings (MeSH) serve as a "table of contents" to structure the millions of articles in the MEDLINE database.
• GoPubMed is one of the first Web 2.0 search engines.
• The system was developed at the Technical University of Dresden by Michael Schroeder and his team and at Transinsight.
• http://www.gopubmed.org

Medline Cognition
Cognition's Semantic NLP understands:
• Word stems: the roots of words
• Words and phrases, with the individual meanings of ambiguous words and phrases listed out
• The morphological properties of each word or phrase, e.g. what type of plural it takes, what type of past tense, and how it combines with affixes like "re" and "ation"
• How to disambiguate word senses, which allows Cognition's technology to pick the correct meaning of ambiguous words in context
• The synonym relations between word meanings
• The ontological relations between word meanings; one can think of this as a hierarchical grouping of meanings or a gigantic "family tree of English" with mothers, daughters, and cousins
• The syntactic and semantic properties of words. This is particularly useful with verbs: Cognition encodes the types of objects different verb meanings can occur with.

iHOP
Information Hyperlinked over Proteins. iHOP provides the network of genes and proteins as a natural way of accessing the millions of abstracts in PubMed.

iHOP
• The minimal information view contains general information, like the symbol, name and organism of a gene. Moreover it provides:
  – Useful links to external resources (e.g. UniProt, NCBI, OMIM, etc.)
  – Links to other iHOP views on this gene
  – Homologues
• Other views contain all sentences found in the literature:
  – For the main gene of a page and other genes (gene B) which interact.
  – That mention the main gene together with relevant biomedical terms such as lymphoma.
• Sentences are ranked by significance, so that screening a few sentences will usually be sufficient to gain an idea of a gene's function.
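
The idea of linking genes through the sentences that mention them can be sketched as a simple co-occurrence search. This is not iHOP's actual method: the gene symbols and sentences are hypothetical, matching is by exact symbol only, and no significance ranking is attempted.

```python
import re

# Hypothetical gene symbols and sentences, for illustration only.
GENE_SYMBOLS = {"GENA", "GENB", "GENC"}
SENTENCES = [
    "GENA interacts with GENB.",
    "GENA expression is altered in lymphoma.",
    "GENC is unrelated to the others.",
]


def genes_in(sentence: str) -> set[str]:
    """Return the known gene symbols that occur as whole words in a sentence."""
    words = set(re.findall(r"[A-Za-z0-9]+", sentence))
    return GENE_SYMBOLS & words


def sentences_linking(main_gene: str, sentences: list[str]) -> list[tuple[str, str]]:
    """List (other gene, sentence) pairs where main_gene co-occurs with another gene."""
    links = []
    for sentence in sentences:
        found = genes_in(sentence)
        if main_gene in found:
            for other in sorted(found - {main_gene}):
                links.append((other, sentence))
    return links


def sentences_with_term(main_gene: str, term: str, sentences: list[str]) -> list[str]:
    """List sentences that mention main_gene together with a biomedical term."""
    return [s for s in sentences
            if main_gene in genes_in(s) and term.lower() in s.lower()]


print(sentences_linking("GENA", SENTENCES))
print(sentences_with_term("GENA", "lymphoma", SENTENCES))
```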

GenMAPP
• GenMAPP is a free computer application designed to visualize gene expression and other genomic data on maps representing biological pathways and groupings of genes.
• Integrated with GenMAPP are programs to perform a global analysis of gene expression or genomic data in the context of hundreds of pathway MAPPs and thousands of Gene Ontology Terms.

Other ways to search – BLAST, PubChem, UCSC Genome Browser
By sequence – BLAST:
>DinoDNA from JURASSIC PARK p. 103 nt 1-1200
GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGCATCAGATACAGTTGGAGATAAGGACGGACGTGTGGCAGCTCCCGCAGAGGATTCACTGGAAGTGCATTACCTATCCCATGGGAGCCATGGAGTTCGTGGCGCTGGGGGGGCCGGATGCGGGCTCCCCCACTCCGTTCCCTGATGAAGCCGGAGCCTTCCTGGGGCTGGGGGGGGGCG
By structure – PubChem:
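
The sequence on this slide is in FASTA format: a header line starting with ">" followed by the sequence itself. Before submitting a sequence to a search tool it can help to check it locally; the sketch below parses FASTA text and reports length and GC content. The function names and the example header are assumptions for illustration, and the example sequence is just the first 20 bases shown on the slide.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs."""
    records = []
    header = None
    chunks: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence) / len(sequence)


# Short example record with a hypothetical header.
example = """>example_fragment illustrative only
GAATTCCGGA
AGCGAGCAAG
"""
for header, sequence in parse_fasta(example):
    print(header, len(sequence), f"GC={gc_content(sequence):.2f}")
```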

UCSC Genome Browser
• The Genome Browser zooms and scrolls over chromosomes, showing the work of annotators worldwide.
• The Gene Sorter shows expression, homology and other information on groups of genes that can be related in many ways.
• BLAT quickly maps your sequence to the genome.
• The Table Browser provides convenient access to the underlying database.
• VisiGene lets you browse through a large collection of in situ mouse and frog images to examine expression patterns.
• Genome Graphs allows you to upload and display genome-wide data sets.
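
When moving between the Genome Browser display and table data, coordinates need care: positions typed into the browser are 1-based and inclusive, while the underlying tables and BED files store 0-based, half-open intervals. A minimal sketch of the conversion follows; the function name is hypothetical and the position string is a made-up example.

```python
import re


def position_to_bed(position: str) -> tuple[str, int, int]:
    """Convert 'chrom:start-end' (1-based, inclusive) into
    (chrom, start, end) in 0-based, half-open form."""
    match = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", position.strip())
    if match is None:
        raise ValueError(f"not a chrom:start-end position: {position!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start < 1 or end < start:
        raise ValueError(f"invalid range in {position!r}")
    # Only the start shifts; the end is numerically unchanged.
    return chrom, start - 1, end


print(position_to_bed("chr1:1,001-2,000"))
```

The interval length is then simply end minus start, which is one reason the half-open form is convenient for stored data.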