![Page 1: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/1.jpg)
NCBI’s Bioinformatics Resources
Michele R. Tennant, Ph.D., M.L.I.S.
Health Science Center Libraries
U.F. Genetics Institute
January 2015
![Page 2: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/2.jpg)
Entrez Nucleotides
![Page 3: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/3.jpg)
Entrez Nucleotides (GenBank)
• Database of nucleotide sequences (ATGC)
• Actually contains data from several databases - GenBank, EMBL, DDBJ, RefSeq
• Hard to search because many submitting scientists send in redundant information and poorly annotated information
![Page 4: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/4.jpg)
Nucleotide Data Domain
• As of December 15, 2014
• Over 184,938,063,614 bases
• Over 179,295,769 sequence records
• Some complete genomes and chromosomes
![Page 5: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/5.jpg)
So Why So Hard to Search?
• No controlled vocabulary - lose power of MeSH - must OR synonyms. Often miss the records you want.
• Archival - quality of annotations depends on the submitter (especially features field); little to no quality control; spelling errors! Often miss the records you want.
• Redundant - lots of records for the same gene; partial records, etc. Often pull up records you don’t want.
![Page 6: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/6.jpg)
GenBank Sample Record
• Before searching, we will look at a GenBank sample record
• Note that the “Features” field provides useful biological information, and may be searched
![Page 7: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/7.jpg)
![Page 8: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/8.jpg)
Click any link in sample record
to access definition of
field and search tips
“Definition” field acts as record title – search
[titl]
Unique identifier; assigned by NCBI;
required by journals/grants
Link to PubMed citation/abstract
![Page 9: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/9.jpg)
The “Features” field provides the most
biological information; search as [fkey]
Numbers indicate
location on the nucleotide sequence
![Page 10: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/10.jpg)
…3158
![Page 11: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/11.jpg)
GenBank Identifiers
• Accession Number - U49845 [accn]
  • Unique identifier; does not change
  • Letter prefix no longer has significance
• Version - U49845.1
  • If any change to sequence, version U49845.2 created
• GenInfo Identifier (GI number) [uid]
  • Runs parallel to accession.version system; change in sequence changes number
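The accession.version scheme above can be illustrated with a minimal Python sketch. The names `SequenceId`, `parse_accession`, and `same_record` are hypothetical helpers invented for this example; they are not part of any NCBI tool.

```python
from typing import NamedTuple, Optional


class SequenceId(NamedTuple):
    """Hypothetical container for an accession and its optional version."""
    accession: str
    version: Optional[int]


def parse_accession(identifier: str) -> SequenceId:
    """Split an identifier such as 'U49845.1' into accession and version."""
    accession, sep, version = identifier.strip().partition(".")
    if not accession:
        raise ValueError("empty accession")
    if sep and not version.isdigit():
        raise ValueError(f"version must be a number: {identifier!r}")
    return SequenceId(accession, int(version) if sep else None)


def same_record(a: str, b: str) -> bool:
    """True when two identifiers share an accession, whatever the version."""
    return parse_accession(a).accession == parse_accession(b).accession


print(parse_accession("U49845.1"))
print(same_record("U49845.1", "U49845.2"))
```

The accession stays stable across sequence updates while the version increments, so comparing only the accession part groups all versions of one record.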
![Page 12: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/12.jpg)
Searching “Nucleotides”
• Database is difficult to search:
  • Redundant records
  • Archival - poor or missing annotation
• Best searches are done using commands; need a class to learn all
• Practice search – search for sequences for human presenilin 1
  • Is there anything odd about some of the retrieved results?
![Page 13: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/13.jpg)
Search for HUMAN presenilin 1
But end up with rat, mouse, etc.
Choose “nucleotide” from dropdown, then
click “search”
![Page 14: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/14.jpg)
Searching “Nucleotides”
• We retrieved the non-human and PSEN2 (rather than PSEN1) records because the computer looked for the terms “human” and “presenilin 1” ANYWHERE in the record (click on details tab to see how the computer parsed your search)
• Use complex boolean searching to clean this up: term [field] AND term [field]
![Page 15: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/15.jpg)
Searching “Nucleotides”
• How to get rid of non-human sequences?
  • Search human [orgn] (this works for any taxon)
• How to get rid of non-presenilin 1 sequences?
  • Another trick – search PSEN1 [gene]
  • Note – you may miss relevant sequences, but should not pick up irrelevant sequences
  • The sequences that you miss are the ones that have not been annotated with the current official gene symbol in the “gene” field
  • DO NOT use this method if you need to find every sequence for a particular gene
• Human [orgn] AND PSEN1 [gene]
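The fielded search above can also be assembled programmatically. The following is a minimal sketch, not something shown in the slides: it assumes NCBI's E-utilities ESearch endpoint and its `db`, `term`, and `retmax` parameters, and the helper names (`fielded`, `and_query`, `esearch_url`) are hypothetical. It only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities ESearch endpoint (not named in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded(term: str, field: str) -> str:
    """Attach an Entrez field tag, e.g. fielded('human', 'orgn') -> 'human[orgn]'."""
    return f"{term}[{field}]"


def and_query(*clauses: str) -> str:
    """Combine clauses with Boolean AND."""
    return " AND ".join(clauses)


def esearch_url(query: str, db: str = "nuccore", retmax: int = 20) -> str:
    """Build (but do not send) an ESearch request URL for the query."""
    return f"{ESEARCH}?{urlencode({'db': db, 'term': query, 'retmax': retmax})}"


query = and_query(fielded("human", "orgn"), fielded("PSEN1", "gene"))
print(query)
print(esearch_url(query))
```

Keeping the query builder separate from the URL builder means the same string can be pasted into the Entrez search box or sent to a script.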
![Page 16: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/16.jpg)
Use these filters to choose molecule type, confine to
RefSeq records
This is the search that was completed using fields (orgn, gene) and filters
![Page 17: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/17.jpg)
How Can I Find “Best” Sequences?
• RefSeq: a non-redundant, curated subset of the sequence data domains
• Contains one record for each gene or splice variant from each organism represented
• Records can be thought of as “review articles” for sequences
• “Best” (usually longest) sequence used as seed
• Value-added annotations provided by experts
• Easy – a tab now exists to limit retrieval to just RefSeq
![Page 18: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/18.jpg)
Click on the RefSeq link to retrieve only the “best” sequences (highly
annotated, complete, nonredundant)
The typical RefSeq accession number
format: 2 letters, an underscore, and
then numbers
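The accession format described above (two letters, an underscore, then numbers) can be checked with a short regular expression. This is a minimal sketch under that description only: the optional `.version` suffix is an added assumption, the function name `looks_like_refseq` is hypothetical, and the accession strings are illustrative. Some RefSeq accessions may not fit this "typical" pattern.

```python
import re

# Two capital letters, an underscore, digits, and an optional ".version" suffix.
REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_\d+(?:\.\d+)?$")


def looks_like_refseq(accession: str) -> bool:
    """Return True when the string matches the typical RefSeq accession shape."""
    return REFSEQ_PATTERN.match(accession.strip()) is not None


for candidate in ["NM_000021", "NM_000021.4", "U49845", "U49845.1"]:
    print(candidate, looks_like_refseq(candidate))
```

A check like this can help separate curated RefSeq records from archival GenBank records in a mixed list of accessions.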
![Page 19: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/19.jpg)
![Page 20: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/20.jpg)
Viewing Formats
• The “Default” view is the standard GenBank record
• Researchers often use the “FASTA” format for analysis
• Change the record format at the “Display” pull-down menu
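To show why FASTA is convenient for analysis, here is a minimal sketch of reading it in Python. It relies on the general FASTA convention (a header line starting with `>` followed by sequence lines); the function name `parse_fasta` and the example record are hypothetical, and the sequence is made up rather than taken from GenBank.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
        else:
            raise ValueError("sequence data found before the first header")
    return records


# Hypothetical record, for illustration only.
example = """>example_record hypothetical sequence
ATGCATGC
GGCC
"""
records = parse_fasta(example.splitlines())
print(records)
```

Because the format holds only a header and the raw sequence, it is easy to pass to analysis tools, unlike the full GenBank flat-file view.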
![Page 21: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/21.jpg)
Entrez Proteins
![Page 22: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/22.jpg)
Entrez Proteins
• Contains data from several databases:
• SwissProt, PIR, PRF, PDB
• Translations from annotated coding regions in GenBank and RefSeq
• Redundant archival data domain of publicly available protein sequences
![Page 23: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/23.jpg)
Searching Entrez Proteins
• Searched like Entrez Nucleotides
• “Filters” choices differ; includes molecular weight and sequence length filters
![Page 24: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/24.jpg)
Entrez Gene
![Page 25: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/25.jpg)
Entrez Gene
• Pulls together information (sequences, structures, literature, gene models, pathways, etc.) for genes
• Best place to start for “gene-centered” info
• One record per gene per organism
• Search by names, symbols, accessions, publications, GO terms, chromosome numbers, E.C. numbers, etc.
![Page 26: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/26.jpg)
Search using gene symbol
Could have searched under any of these
aliases (unlike GenBank where you
would have to try them all)
![Page 27: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/27.jpg)
Official gene symbol as
determined by the HUGO
Gene Nomenclature Committee (HGNC)
![Page 28: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/28.jpg)
Summary of protein, function
and disease-causing mutations; from RefSeq record
Links to PubMed records
that provide evidence of
function – any researcher can
add these
![Page 29: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/29.jpg)
Links to OMIM records of phenotype/
disease
Gene Ontology terms form a controlled vocabulary with three components – biological process, molecular function, and cellular
component
Links to homology
maps
Links to protein interactions
![Page 30: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/30.jpg)
Pathway info may be available from
the Kyoto Encyclopedia of
Genes and Genomes
Sequence and domain links
Links to GeneReviews – clinical resource
![Page 31: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/31.jpg)
Taxonomy Browser
![Page 32: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/32.jpg)
Search Taxonomy Browser
• How many genera from the family Iguanidae are represented by sequence data?
• How many nucleotide and protein sequences are available for the family?
![Page 33: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/33.jpg)
![Page 34: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/34.jpg)
![Page 35: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/35.jpg)
Entrez Searching Summary
![Page 36: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/36.jpg)
To Find Everything(?) Broaden Search
• OR together synonyms
• OR together related terms (gene name, gene symbol, protein name, alternate spellings, disorder)
• Don’t specify a field - search entire record
• Truncation - use * at end of word root
• Click “Related Records”
• Try using Taxonomy Browser to pick up all taxa in a particular group
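The broadening tips above (OR together synonyms, truncate word roots, leave off field tags) can be sketched as a small query builder. The helper names `or_together` and `truncate` are hypothetical, and quoting multi-word terms as phrases is an assumption about how the search box should treat them; the terms are the presenilin examples used earlier in this deck.

```python
from typing import Iterable


def truncate(root: str) -> str:
    """Add a trailing * so the search matches any word starting with the root."""
    return root.rstrip("*") + "*"


def or_together(terms: Iterable[str]) -> str:
    """Join terms with OR, quoting multi-word terms so they stay together."""
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    return "(" + " OR ".join(quoted) + ")"


# No field tags are attached, so the whole record is searched.
broad = or_together(["presenilin 1", "PSEN1", truncate("presenilin")])
print(broad)
```

The parentheses keep the OR group intact if it is later combined with another clause using AND.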
![Page 37: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/37.jpg)
Fewer/Best Records Narrow Search
• Search particular fields:
  • PubMed - MeSH Browser, subheadings, major MeSH
  • Nucleotide - features, title, gene, properties, organism
• Use “Filters”
• Search only the RefSeq database
![Page 38: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/38.jpg)
Will Entrez Find Every Sequence Record?
• No!!!
  • Entrez relies on annotation of records, so you are searching solely on “terminology”
  • Some records are not annotated, some records are poorly or incorrectly annotated
• To find all useful sequences – need to search on sequence itself
  • Related sequence link
  • BLAST
![Page 39: NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015](https://reader036.vdocuments.us/reader036/viewer/2022062801/56649e7e5503460f94b817d6/html5/thumbnails/39.jpg)
Entrez “Related Records”
• Will vary depending on data domain
• PubMed related articles
  • Based on a “word weight” algorithm – MeSH, title, abstract words
  • In order by weight (highest weight first)
• Nucleotide and protein related sequences
  • Based on basic BLAST search
  • In order by best BLAST score