
Upload: alice-ursula-mills

Post on 30-Dec-2015


TRANSCRIPT

Page 1: Introduction to Biology Saleet Jafri GMU Cholesterol Water seeking head group water fatty chains Presumed common progenitor of all extant organism Presumed

Introduction to Biology Saleet Jafri GMU

Cholesterol Water seeking head group

water fatty chains

Presumed common progenitor of all extant organisms

Presumed common progenitor of archaebacteria and eukaryotes

ARCHAEA

EUKARYA

BACTERIA

Page 2:

[Figure: bacterial cell. Labels: cell wall (outer membrane), cell wall (inner membrane), ribosome, DNA, RNA, nucleoid DNA, mesosome, septum, inner (plasma) membrane, cell wall, periplasmic space, outer membrane, nucleoid. Scale bar: 0.5 µm.]

Page 3:

Eukaryotic cell (organelles)

[Figure: eukaryotic cell. Labels: nuclear membrane, plasma (cell) membrane, Golgi vesicles, lysosome, secretory vesicle, nucleus, mitochondrion, peroxisome, rough endoplasmic reticulum. Scale bar: 1 µm.]

Cell membrane
Nucleus
Cytoplasm
Endoplasmic reticulum (ER), rough and smooth – a membranous organelle system in the cytoplasm. The outer surface may be ribosome-studded (rough) or not (smooth).
Golgi apparatus – receives newly formed proteins from the ER, modifies them, and directs them to their final destination.
Mitochondria – respiratory centers; have their own circular DNA, of bacterial origin.
Chromosomes – chromatin, histones, centromeres and arms (2 pair in eukaryotes).
Lysosomes – contain acid hydrolases: nucleases, proteases, glycosidases, lipases, phosphatases, sulfatases, phospholipases.
Peroxisomes – use oxygen to remove hydrogen from substrates, forming H2O2; abundant in kidney and liver (detoxification).
Cytoskeleton – an internal array of microtubules, microfilaments, and intermediate filaments that confers shape and the ability to move on a eukaryotic cell.

Page 4:

Eukaryote Membrane

[Figure: phospholipid bilayer. Labels: leaflets, fatty acyl tails, phospholipid bilayer, phospholipids, hydrophilic polar head, peripheral proteins, hydrophobic core, integral protein, oligosaccharide, glycoprotein, glycolipid, exterior, interior.]

2 kinds of nucleic acids (RNA = ribonucleic acid and DNA = deoxyribonucleic acid)
Nucleic acid structure:
purines: adenine = A, guanine = G
pyrimidines: uracil = U, thymine = T, cytosine = C
A always pairs with T and C always pairs with G (each pair is called a base pair in double-helix DNA)
DNA may consist of millions of base pairs
A short sequence (<100) is called an oligonucleotide
RNA: different sugar (ribose instead of 2'-deoxyribose)
Uracil (U) instead of thymine (U binds with A)
RNA is usually single-stranded and does not form a long, regular double helix the way DNA does
Protein = functional and structural units of the cell
Central Dogma: DNA → RNA → protein (flow of information is unidirectional)

Page 5:

Gene or DNA transcription
RNA molecules are synthesized by RNA polymerase.
RNA polymerase binds very tightly to the promoter region on DNA.
The promoter region contains the start site.
Transcription ends at the termination signal site.
Primary transcript: direct copy of the DNA into RNA.
RNA splicing: introns are removed to make mRNA.
mRNA has a codon sequence that codes for a protein.
Uracil replaces thymine.
Splicing and alternative splicing happen.

Translation
Transfer RNA (tRNA) makes the connection between specific codons in mRNA and amino acids.
As tRNA binds to the next codon in mRNA, its amino acid is bound to the last amino acid in the protein chain.
When a STOP codon is encountered, the ribosome releases the mRNA and synthesis ends.

tRNA links an amino acid to the codon on the mRNA via the anticodon.
rRNA = RNA found in ribosomes
Ribosomes = large and small subunit, made of protein and rRNA
Initiator tRNA always carries methionine
Initiation factors = proteins catalyzing the start of translation

[Figure: from gene to protein. Label: endoplasmic reticulum.]
Chromosomal DNA (gene): promoter | exon1 | intron1 | exon2 | intron2 | exon3 | intron3
    → transcription →
Nuclear RNA: exon1 | intron1 | exon2 | intron2 | exon3
    → post-transcriptional modification →
mRNA: exon1 | exon2 | exon3
    → translation

Pages 6-9:

In eukaryotes, 1 mRNA = 1 protein. (in bacteria, 1 mRNA can be polycistronic, or code for several proteins)

DNA in eukaryotes forms a stable, compacted complex with histones. (in bacteria, DNA is not in a permanently condensed state)

Eukaryotic DNA contains large regions of repetitive DNA. (in bacteria, DNA rarely contains any "extra" DNA)

Much of eukaryotic DNA does not code for proteins (~98% is non-coding in humans; in bacteria, often less than 5% of genome)

Sometimes, eukaryotes can use controlled gene rearrangement for increasing number of specific genes. (in bacteria, happens rarely)

Eukaryotic genes are split into exons and introns. (in bacteria, genes are almost never split)

In eukaryotes, mRNA is synthesized in the nucleus, then processed and exported to the cytoplasm. (in bacteria, transcription and translation can take place simultaneously off the same piece of DNA)

Central Dogma

DNA (deoxyribonucleic acid) → RNA (ribonucleic acid) → Protein

Proposed by Francis Crick in 1958 to describe the flow of information in a cell.

Information stored in DNA is transferred residue-by-residue to RNA, which in turn transfers the information residue-by-residue to protein.

The Central Dogma was proposed by Crick to help scientists think about molecular biology.

It has undergone numerous revisions in the past 45 years.

The concept of a gene is historically defined on the basis of genetic inheritance of a phenotype (Mendelian inheritance).

The DNA of an organism encodes its genetic information. It is a double-stranded helix with a backbone of deoxyribose sugars and phosphates carrying four bases: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T).

[Note that only 4 values need be encoded (A, C, G, T), which can be done using 2 bits. But to allow redundant letter combinations (like N, which means any of the 4 nucleotides), one usually resorts to a 4-bit alphabet.]
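The note on 2-bit versus 4-bit alphabets can be sketched in a few lines of Python. This is only an illustration: the particular bit assignments and the helper names `pack_2bit` and `matches` are assumptions made here, not a format defined in the slides. The 4-bit version uses one bit per base, so an ambiguity letter is just the OR of the bases it stands for.

```python
# Illustrative encodings (bit assignments are an assumption, not a standard
# taken from the slides): 2 bits per unambiguous base, and a 4-bit mask with
# one bit per base so that ambiguity letters can be expressed as well.
TWO_BIT = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

FOUR_BIT = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000}
FOUR_BIT.update({
    "R": FOUR_BIT["A"] | FOUR_BIT["G"],  # purine: A or G
    "Y": FOUR_BIT["C"] | FOUR_BIT["T"],  # pyrimidine: C or T
    "N": 0b1111,                         # any of the four bases
})


def pack_2bit(seq: str) -> int:
    """Pack an unambiguous sequence into an integer, 2 bits per base."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | TWO_BIT[base]  # raises KeyError on N, R, Y, ...
    return value


def matches(symbol: str, base: str) -> bool:
    """True if the (possibly ambiguous) symbol is compatible with the base."""
    return bool(FOUR_BIT[symbol] & FOUR_BIT[base])


print(bin(pack_2bit("ACGT")))  # 0b11011 (00 01 10 11 with leading zeros dropped)
print(matches("N", "G"), matches("Y", "A"))  # True False
```

The 2-bit form is the most compact but cannot represent N at all, which is why the wider alphabet is the usual choice for real sequence data.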

Page 10:

DNA: terminology

[Figure: a DNA strand drawn 5' to 3'. Labels: base: thymine (pyrimidine); base: adenine (purine); sugar: 2'-deoxyribose (no 2'-hydroxyl); sugar carbons 1', 2', 3', 4', 5'; monophosphate; 3' linkage; 5' linkage.]

base + sugar = nucleoside
base + sugar + phosphate(s) = nucleotides (nucleoside mono-, di-, and triphosphates)

Page 11:

DNA: structure

DNA is double stranded

DNA strands are antiparallel

G-C pairs have 3 hydrogen bonds

A-T pairs have 2 hydrogen bonds

One strand is the complement of the other

Major and minor grooves present different surfaces

Cellular DNA is almost exclusively B-DNA

B-DNA has ~10.5 bp/turn of the helix
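Because the strands are antiparallel and complementary, the second strand read 5' to 3' is the reverse complement of the first. A minimal Python sketch of that idea, together with the hydrogen-bond count implied by the pairing rules above (the function names are chosen here for illustration):

```python
# Watson-Crick partners and the number of hydrogen bonds each base forms
# with its partner (A-T: 2, G-C: 3), as listed on the slide.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
HYDROGEN_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def hydrogen_bonds(strand: str) -> int:
    """Total hydrogen bonds holding this strand to its complement."""
    return sum(HYDROGEN_BONDS[base] for base in strand.upper())


print(reverse_complement("ATGAGTAACGCG"))  # CGCGTTACTCAT
print(hydrogen_bonds("ATGAGTAACGCG"))      # 30
```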

Page 12:

RNA: terminology

base + sugar = nucleoside

Base      Nucleoside (RNA)     Deoxynucleoside (DNA)
Adenine   Adenosine            Deoxyadenosine
Guanine   Guanosine            Deoxyguanosine
Cytosine  Cytidine             Deoxycytidine
Uracil    Uridine              (not usually found)
Thymine   (not usually found)  (Deoxy)thymidine

RNA can be single or double stranded

G-C pairs have 3 hydrogen bonds

A-U pairs have 2 hydrogen bonds

Single-stranded, double-stranded, and loop RNA present different surfaces

Page 13:

Protein

20 amino acids

amino group

carboxyl group

Peptide bond

Page 14:

Protein structure

[Figure labels: α-helix, antiparallel β-sheet]

Page 15:

The Central Dogma

Replication: duplication of DNA using DNA as the template
Transcription: synthesis of RNA using DNA as the template
Translation: synthesis of proteins using RNA as the template (tRNA, ribosomes)

DNA (gene):  5' ATGAGTAACGCG 3'  (nontemplate, sense)
             3' TACTCATTGCGC 5'  (template, antisense)
Replication: one double-stranded copy gives two identical double-stranded copies.
mRNA:        5' AUGAGUAACGCG 3'
protein:     Met-Ser-Asn-Ala     (one codon, i.e. three bases, per amino acid)
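The slide's example (ATGAGTAACGCG → AUGAGUAACGCG → MetSerAsnAla) can be reproduced with a short Python sketch. The codon table below is the standard genetic code written in the compact T, C, A, G ordering used by NCBI translation table 1; the function names `transcribe` and `translate` are chosen here for illustration, and one-letter amino acid codes are used (M, S, N, A for Met, Ser, Asn, Ala).

```python
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_strand: str) -> str:
    """mRNA has the sequence of the nontemplate strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons 5' to 3' from the first base; stop at the first STOP codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3].replace("U", "T")]
        if amino_acid == "*":  # STOP codon: does not code for an amino acid
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGAGTAACGCG")
print(mrna, translate(mrna))  # AUGAGUAACGCG MSNA
```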

Page 16:

The Central Dogma

DNA → RNA → Protein

Replication; repair and recombination:
1. DNA pol and
2. DNA pol and

Transcription:
1. RNA pol I – ribosomal RNA (rRNA)
2. RNA pol II – messenger RNA (mRNA)
3. RNA pol III – 5S rRNA, snRNA, tRNA

RNA processing:
1. mRNA splicing
2. rRNA and tRNA processing
3. capping and polyadenylation

Translation

Post-translational modification:
1. phosphorylation
2. methylation
3. ubiquitination

Page 17:

Compartmentalization of processes (transport is important)

replication

Splicing out introns?

Page 18:

Regulation occurs at each step of a process

1. Initiation (starting)
- what is the signal that initiates the process?
- what are the factors involved in initiation (cis- and trans-acting)?

2. Elongation (continuation)
- how is the process maintained with high fidelity once initiated?
- what are the factors involved in elongation (cis- and trans-acting)?

3. Termination (ending)
- what is the signal that stops the process?
- what are the factors involved in termination (cis- and trans-acting)?

Other general regulatory considerations

1. How is the rate of a process regulated?
2. How are the steps regulated in a cell-, tissue-, or gene-specific manner?
3. Stability of biomolecules
4. Cellular localization of biomolecules

Page 19:

Exceptions to the Central Dogma

[Figure: DNA → RNA → Protein, annotated with the exceptions below. Names in parentheses are marked "Nobel Prizes".]

Retroviruses use reverse transcriptase to replicate their genome (David Baltimore and Howard Temin)

RNA viruses

mRNA introns (splicing) (Philip Sharp and Richard Roberts)

RNA editing (deamination of cytosine to yield uracil in mRNA)

RNA interference (RNAi), a mechanism of post-transcriptional gene silencing utilizing double-stranded RNA

RNAs (ribozymes) can catalyze an enzymatic reaction (Thomas Cech and Sidney Altman)

Prions are heritable proteins responsible for neurological infectious diseases (e.g. scrapie and mad cow) (Stanley Prusiner)

Epigenetic marks, such as patterns of DNA methylation, can be inherited and provide information other than the DNA sequence

Page 20:

The Flow of Biotechnology Information

> DNA sequence
AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACACTGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAATCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTAACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGGTTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAATTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTGGTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGACGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGCTACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGAACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGGTAAGAAGATCGCGAACATCTAGTAGA

> Protein sequence
MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNIDELLNEDILILGCSAMGDEVLEESEFEPFIEEISTKISGKKVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNEPDEAEQDCIEFGKKIANI

Gene Function

Page 21:

Prokaryotes (intronless protein-coding genes)

[Figure: prokaryotic gene. DNA: upstream (5'), promoter, gene region (ATG on one strand, TAC on the complementary strand), downstream (3'). mRNA: 5' UTR, CoDing Sequence (CDS) starting at ATG, 3' UTR. The mRNA is translated into protein.]

Transcription (the gene is encoded on the minus strand, and the reverse complement is read into mRNA)

Translation: tRNA reads off each codon (3 bases at a time), starting at the start codon, until it reaches a STOP codon.

Why does nature bother with mRNA? Why would the cell want to have an intermediate between DNA and the proteins it encodes?

• Gene information can be amplified by having many copies of an RNA made from one copy of DNA.
• Regulation of gene expression can be effected by having specific controls at each element of the pathway between DNA and proteins. The more elements there are in the pathway, the more opportunities there are to control it in different circumstances.
• In eukaryotes, DNA can then stay pristine and protected, away from the caustic chemistry of the cytoplasm.

[Figure: operon. Upstream, promoter, Gene 1, Gene 2, Gene 3, downstream.]

Prokaryote (operon structure): sometimes genes that are part of the same operational pathway are grouped together under a single promoter. They are then transcribed into a single polycistronic mRNA, from which the 3 proteins are translated separately.

Page 22:

Bacterial genomes have simple gene structure.

Bacterial Gene Structure of signals

- Transcription factor binding site.
- Translation binding site: Shine-Dalgarno sequence (AGGAGG), about 10 bp upstream of AUG
- Promoters
  - -35 sequence: T82 T84 G78 A65 C54 A45 (consensus TTGACA; the subscript is the percentage of promoters with that base), followed by a spacer of 15-20 bases
  - -10 sequence: T80 A95 T45 A60 A50 T96 (consensus TATAAT), followed by a spacer of 5-9 bases
  - Start of transcription (initiation start): a purine in 90% of cases (sometimes it's the "A" in CAT)
- Termination
- One or more Open Reading Frames
  - start codon (unless the sequence is partial)
  - until the next in-frame stop codon on that strand
  - separated by intercistronic sequences

Genetic Code: How does an mRNA specify an amino acid sequence?
It would be impossible for each amino acid to be specified by one nucleotide, because there are only 4 nucleotides and 20 amino acids. 2 nucleotides could specify 16; 3 can specify 64.
Each amino acid is specified by up to 6 different combinations of 3 nucleotides, called codons, each coding for one amino acid.
The 1st codon is START, and usually coincides with methionine (M, which has the codon 'ATG').
The last codon is STOP, and does NOT code for an amino acid. It is sometimes represented by '*'.
The CoDing region (CDS) starts at the START codon and ends at STOP.
Different organisms have different frequencies of codon usage.
A handful of species vary from this codon association and use different codons for different amino acids.
How do tRNAs recognize to which codon they should bring an amino acid? A tRNA has an anticodon on its mRNA-binding end, complementary to the codon on the mRNA. Each tRNA only binds the appropriate amino acid for its anticodon.
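The counting argument above (4, 16, 64 combinations; up to 6 codons per amino acid) can be checked with a small Python sketch. It assumes the standard genetic code, written in the compact T, C, A, G ordering; variable names such as `codon_to_aa` and `degeneracy` are chosen here for illustration.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codon_to_aa = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Words of length n over a 4-letter alphabet: 4, 16, 64.
for n in (1, 2, 3):
    print(n, "nucleotide(s):", 4 ** n, "combinations")

# How many codons specify each amino acid ('*' stands for STOP).
degeneracy = Counter(codon_to_aa.values())
stop_codons = sorted(c for c, aa in codon_to_aa.items() if aa == "*")

print(degeneracy["L"], degeneracy["M"])  # 6 1
print(stop_codons)                       # ['TAA', 'TAG', 'TGA']
```

Only words of length 3 give enough combinations for 20 amino acids plus STOP, and the surplus is what makes the code redundant.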

Page 23:

RNA

RNA has the same primary structure as DNA. It consists of a sugar-phosphate backbone, with bases attached to the 1' carbon of the sugar. The DNA/RNA differences are:

RNA has a hydroxyl group on the 2' carbon of the sugar (thus the difference between deoxyribonucleic acid and ribonucleic acid).

Instead of using the base thymine, RNA uses another base called uracil.

Because of the extra hydroxyl group on the sugar, RNA does not adopt the same (B-form) double helix as DNA, and it usually exists as a single-stranded molecule. However, regions of double helix can form where there is some base pair complementarity (U and A, G and C), resulting in hairpin loops. The RNA molecule with its hairpin loops is said to have a secondary structure.

Because the RNA molecule is not restricted to a rigid double helix, it can form many different stable three-dimensional tertiary structures.

tRNA (transfer RNA) is a small RNA that has a very specific secondary and tertiary structure such that it can bind an amino acid at one end, and mRNA at the other end. It acts as an adaptor to carry the amino acid elements of a protein to the appropriate place as coded for by the mRNA.

[Figures: tRNA secondary structure; 3-D tertiary structure]

Page 24:

Bacterial Gene Prediction

Most of the consensus sequences are known from E. coli studies, so for each bacterium the exact distribution of the consensus will change.

Most modern gene prediction programs need to be "trained", i.e. they find their own consensus and assembly rules given a few example genes.

A few programs find their own rules from a completely unannotated bacterial genome by trying to find conserved patterns. This is feasible because ORFs restrict the search space of possible gene candidates.

E.g. selfid program([email protected])

OPEN READING FRAME: On a given piece of DNA, there can be 6 possible frames. The ORF can be on either the plus or the minus strand, and in any of 3 possible frames.

Frame 1: 1st base of the start codon can be at base 1, 4, 7, 10, ...

Frame 2: 1st base of the start codon can be at base 2, 5, 8, 11, ...

Frame 3: 1st base of the start codon can be at base 3, 6, 9, 12, ...

(frames -1, -2, -3 are on the minus strand)

Some programs have other conventions for naming frames (0..5, 1-6, ...)

Gene finding in eukaryotic cDNA uses ORF finding + blastx as well.

http://www.ncbi.nlm.nih.gov/gorf/gorf.html

try with gi=41 (or your own piece of DNA)
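A six-frame ORF scan of the kind described above can be sketched in Python as follows. This is a deliberately naive illustration, not the algorithm of any particular gene finder: it takes the first ATG in a frame as the start, uses the three standard stop codons, and names frames +1..+3 and -1..-3 as on the slide. The function name `find_orfs` and its `min_codons` parameter are assumptions made here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Minus strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(dna: str, min_codons: int = 2):
    """Scan all 6 frames for ATG ... in-frame stop.

    Returns (frame, start, end, sequence) tuples. Frames are +1..+3 on the
    given strand and -1..-3 on its reverse complement; start and end are
    0-based offsets on the strand that was read, end exclusive, and the
    stop codon is included in the sequence.
    """
    dna = dna.upper()
    orfs = []
    for sign, strand in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            start = None
            for pos in range(offset, len(strand) - 2, 3):
                codon = strand[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (pos - start) // 3 >= min_codons:
                        frame = sign * (offset + 1)
                        orfs.append((frame, start, pos + 3, strand[start:pos + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATGA"))  # [(3, 2, 11, 'ATGAAATGA')]
```

Real gene finders add the signals from the previous slides (promoters, Shine-Dalgarno site, codon usage) on top of this raw search space.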

Page 25:

Eukaryotic Central Dogma

In eukaryotes (cells where the DNA is sequestered in a separate nucleus), the DNA does not contain a contiguous copy of the coding sequence; rather, the exons must be spliced together. (Many eukaryotic genes contain no introns! This is particularly true in 'lower' organisms.)

mRNA (messenger RNA) contains the assembled copy of the gene. The mRNA acts as a messenger to carry the information stored in the DNA in the nucleus to the cytoplasm, where the ribosomes can make it into protein.

Page 26:

Eukaryotic Nuclear Gene Structure

Gene prediction for Pol II transcribed genes.

• Upstream enhancer elements.
• Upstream promoter elements.
• GC box (-90 nt) (20 bp), CAAT box (-75 nt) (22 bp)
• TATA promoter (-30 nt) (70%, 15 nt consensus; Bucher et al. (1990))
• 14-20 nt spacer DNA
• CAP site (8 bp)
• Transcription initiation.
• Transcript region, interrupted by introns.
• Translation initiation (Kozak signal, 12 bp consensus), 6 bp prior to the initiation codon.
• polyA signal (AATAAA 99%, other)

Page 27:

Introns

• Transcript region, interrupted by introns. Each intron:
• starts with a donor site consensus (G100 T100 A62 A68 G84 T63 ..)
• has a branch site near the 3' end of the intron (one not very conserved consensus: UACUAAC)
• ends with an acceptor site consensus (12 Py .. N C65 A100 G100)

[Figure label: AGUACUAAC]

Page 28:

Exons

• The exons of the transcript region are composed of:
• 5' UTR (mean length of 769 bp, with a specific base composition that depends on the local G+C content of the genome)
• AUG (or other start codon)
• Remainder of coding region
• Stop codon
• 3' UTR (mean length of 457 bp, with a specific base composition that depends on the local G+C content of the genome)

Page 29:

~6-12% of human DNA encodes proteins(higher fraction in nematode)

~10% of human DNA codes for UTR

~90% of human DNA is non-coding.

Structure of the Eukaryotic Genome

Page 30:

Non-Coding Eukaryotic DNA

Untranslated regions (UTRs)

• introns (there can be genes within the introns of another gene!)
• intergenic regions
  - repetitive elements
  - pseudogenes (dead genes that may (or may not) have been retroposed back into the genome as a single-exon "gene")

Page 31:

Pseudogenes

Pseudogenes:

A DNA sequence that might code for a gene, but that is unable to result in a protein. This deficiency might be in transcription (lack of a promoter, for example), in translation, or both.

Processed pseudogenes:

A gene retroposed back into the genome after being processed by the splicing apparatus. Thus it is fully spliced and has a polyA tail.

The insertion process flanks the mRNA sequence with short direct repeats.

Thus it has no promoter, unless it is accidentally retroposed downstream of a promoter sequence.

Do not confuse with single-exon genes.

Page 32:

Repeats

Each repeat family has many subfamilies.

SINEs (Short INterspersed Elements)

- ALU: ~300 nt long; 600,000 elements in the human genome. Can cause false homology with mRNA. Many have an AluI restriction site.
- Retroposons (can get copied back into the genome)
- Telltale sign: direct or inverted repeats flank the repeated element. That repeat was the priming site for the RNA that was inserted.

LINEs (Long INterspersed Elements)

- L1: 1-7 kb long, 50,000 copies
- Have two ORFs! Will cause problems for gene prediction programs.

Page 33:

Low-Complexity Elements

• When analyzing sequences, one often relies on the fact that two stretches are similar to infer that they are homologous (and therefore related). But sequences with repeated patterns will match without there being any phylogenetic relation!

• Sequences like ATATATACTTATATA, which are mostly two letters, are called low-complexity.

• Triplet repeats (particularly CAG) have a tendency to make the replication machinery stutter, so they are amplified.

• The low-complexity sequence can also be hidden at the translated protein level.
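One simple way to put a number on "mostly two letters" is the Shannon entropy of the base composition: a uniform mix of A, C, G, T gives 2 bits per base, while a two-letter stretch gives about 1 bit. The Python sketch below is only an illustration of that idea; the window size and the 1.5-bit threshold are arbitrary assumptions made here, and this is not the method used by any specific masking program.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Entropy of the letter composition, in bits per letter."""
    counts = Counter(seq.upper())
    total = len(seq)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def low_complexity_windows(seq: str, window: int = 12, threshold: float = 1.5):
    """Start offsets of windows whose entropy falls below the threshold."""
    return [
        start
        for start in range(len(seq) - window + 1)
        if shannon_entropy(seq[start:start + window]) < threshold
    ]


print(shannon_entropy("ACGTACGTACGTACGT"))           # 2.0
print(round(shannon_entropy("ATATATACTTATATA"), 2))  # 1.29
```

Composition alone does not catch every repeat (ACGTACGT... has maximal entropy but is perfectly periodic), so it complements rather than replaces repeat detection.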

Page 34:

Masking

• To avoid finding spurious matches in alignment programs, you should always mask out the low-complexity and repeat regions of the query sequence.

• Before predicting genes it is a good idea to mask out repeats (at least those containing ORFs).

• Before running blastn against a genomic record, you must mask out the repeats.

• Most used programs:

CENSOR:

Repeat Masker:

http://ftp.genome.washington.edu/cgi-bin/RepeatMasker
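What masking does to a sequence can be shown with a few lines of Python. This sketch assumes the repeat coordinates are already known (for example from a repeat-finding program) and only illustrates the two common output styles: hard masking with N and soft masking with lowercase. The function name `mask` and the interval format are assumptions made here, not the interface of CENSOR or RepeatMasker.

```python
def mask(seq: str, intervals, soft: bool = False) -> str:
    """Mask the given regions of a sequence.

    intervals: iterable of (start, end) pairs, 0-based, end exclusive
               (hypothetical coordinates of repeats found elsewhere).
    soft=False replaces masked bases with 'N' (hard masking);
    soft=True lowercases them instead (soft masking).
    """
    chars = list(seq)
    for start, end in intervals:
        for i in range(start, min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)


print(mask("ACGTATATATATGCA", [(4, 12)]))             # ACGTNNNNNNNNGCA
print(mask("ACGTATATATATGCA", [(4, 12)], soft=True))  # ACGTatatatatGCA
```

Hard-masked bases can no longer seed or extend a match, which is what removes the spurious hits; soft masking keeps the original letters for tools that only want to skip those regions when seeding.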

Page 35:

More Non-Protein genes

rRNA – ribosomal RNA is one of the structural components of the ribosome. It has sequence complementarity to regions of the mRNA so that the ribosome knows where to bind to an mRNA it needs to make protein from.

snRNA – small nuclear RNA is involved in the machinery that processes RNAs as they travel between the nucleus and the cytoplasm.

hnRNA – heterogeneous nuclear RNA: the primary transcripts (pre-mRNA) found in the nucleus before processing.

Page 36:

Protein Processing & Localization

The protein as read off from the mRNA may not be in the final form that will be used in the cell. Some proteins contain:

• A signal peptide (located at the N-terminus, i.e. the beginning). This signal peptide is used to guide the newly made protein towards its final cellular localization. It is cleaved off at the cleavage site once the protein has reached (or is near) its final destination.

• Various post-translational modifications (e.g. phosphorylation)

The final protein is called the "mature peptide".

Page 37:

Convention for nucleotides in database

Because the mRNA is actually read off the minus (template) strand of the DNA, nucleotide sequences are always quoted as the complementary strand, which reads the same as the mRNA.

In bioinformatics the sequence format does NOT make a difference between uracil and thymine. There is no symbol for uracil; it is always represented by a 'T'.

Even genomic sequence follows that convention. A gene on the 'plus' strand is quoted so that it is on the same strand as its product mRNA.

Page 38:

Protein Engineering

Change DNA sequence → change RNA sequence → change amino acid sequence

mRNA Reading Direction Corresponds to Protein Chemical Directionality

mRNA: 5' → 3'
Protein: NH2-terminus → COOH-terminus

Page 39:

Backbone Torsion Angles Determine Secondary Structure

Page 40:

Protein Tertiary Structure Tied to Function

Biomolecular Energetics

Electrostatic interactions (figure: COO- ... +H3N)

Hydrophobic/van der Waals interactions (figure: CH3 ... H3C)

Hydrogen bonding interactions (figure: OH ... N)

Page 41:

Biology Information on the Internet

Page 42:

Biology Information on the Internet

• Introduction to Databases

• Searching the Internet for Biology Information.
– General search methods
– Biology web sites

• Introduction to Genbank file format.

• Introduction to Entrez and Pubmed

• Ref: Chapters 1, 2, 5, 6 of "Bioinformatics"

Page 43:

• Databases:
– A collection of records.
– Each record has many fields.
– Each field contains specific information.
– Each field has a data type.
» E.g. money, currency, text field, integer, date, address (text field), citation (text field)
– Each record has a primary key: a UNIQUE identifier that unambiguously defines this record.

Spreadsheet: flat-file version of a database.

gi       Accession  version  date      Genbank Division  taxid  organism      Number of Chromosomes
6226959  NM_000014  3        06/01/00  PRI               9606   homo sapiens  22 diploid + X+Y
6226762  NM_000014  2        10/12/99  PRI               9606   homo sapiens  22 diploid + X+Y
4557224  NM_000014  1        02/04/99  PRI               9606   homo sapiens  22 diploid + X+Y
41       X63129     1        06/06/96  MAM               9913   bos taurus    29+X+Y

Page 44:

gi       Accession  version  date        Genbank Division  taxid  organism      Number of Chromosomes
6226959  NM_000014  3        01/06/2000  PRI               9606   homo sapiens  22 diploid + X+Y
6226762  NM_000014  2        12/10/1999  PRI               9606   homo sapiens  22 diploid + X+Y
4557224  NM_000014  1        04/02/1999  PRI               9606   homo sapiens  22 diploid + X+Y
41       X63129     1        06/06/1996  MAM               9913   bos taurus    29+X+Y

gi = Genbank Identifier: unique key: primary key

The gi changes with each update of the sequence record.

Accession Number: secondary key: points to the same locus and sequence despite sequence updates.

Accession + Version Number is equivalent to the gi

Page 45:

Relational Database (normalizing a database for repeated sub-elements: splitting it into smaller databases and relating the sub-databases to the first one using the primary key)

Before:

gi       Accession  version  date        Genbank Division  taxid  organism      Number of Chromosomes
6226959  NM_000014  3        01/06/2000  PRI               9606   homo sapiens  22 diploid + X+Y
6226762  NM_000014  2        12/10/1999  PRI               9606   homo sapiens  22 diploid + X+Y
4557224  NM_000014  1        04/02/1999  PRI               9606   homo sapiens  22 diploid + X+Y
41       X63129     1        06/06/1996  MAM               9913   bos taurus    29+X+Y

After:

gi       Accession  version  date        Genbank Division  taxid
6226959  NM_000014  3        01/06/2000  PRI               9606
6226762  NM_000014  2        12/10/1999  PRI               9606
4557224  NM_000014  1        04/02/1999  PRI               9606
41       X63129     1        06/06/1996  MAM               9913

taxid  organism      Number of Chromosomes
9606   homo sapiens  22 diploid + X+Y
9913   bos taurus    29+X+Y
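The normalization step can be tried out with Python's built-in `sqlite3` module. The sketch below loads the two smaller tables into an in-memory database and joins them back on `taxid`; the table and column names (`sequence`, `taxon`, `chromosomes`, ...) are chosen here for illustration and are not the schema of any real NCBI database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE taxon (
    taxid       INTEGER PRIMARY KEY,
    organism    TEXT,
    chromosomes TEXT
);
CREATE TABLE sequence (
    gi        INTEGER PRIMARY KEY,
    accession TEXT,
    version   INTEGER,
    date      TEXT,
    division  TEXT,
    taxid     INTEGER REFERENCES taxon(taxid)
);
""")

# Organism details are stored once per taxid instead of once per record.
conn.executemany("INSERT INTO taxon VALUES (?, ?, ?)", [
    (9606, "homo sapiens", "22 diploid + X+Y"),
    (9913, "bos taurus", "29+X+Y"),
])
conn.executemany("INSERT INTO sequence VALUES (?, ?, ?, ?, ?, ?)", [
    (6226959, "NM_000014", 3, "01/06/2000", "PRI", 9606),
    (6226762, "NM_000014", 2, "12/10/1999", "PRI", 9606),
    (4557224, "NM_000014", 1, "04/02/1999", "PRI", 9606),
    (41, "X63129", 1, "06/06/1996", "MAM", 9913),
])

# Joining on taxid rebuilds the original flat view.
rows = conn.execute("""
    SELECT s.gi, s.accession, s.version, t.organism
    FROM sequence AS s JOIN taxon AS t ON s.taxid = t.taxid
    ORDER BY s.gi
""").fetchall()
for row in rows:
    print(row)
```

If the chromosome count for an organism ever needs correcting, it is now changed in one row of `taxon` rather than in every sequence record.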

Page 46:

Types of Relational databases.

• The Internet can be thought of as one enormous relational database.
– The "links"/URLs are the primary keys.

• SQL (Structured Query Language)
– Sybase; Oracle; Access (database systems)
• Sybase is used at NCBI.

– SRS (one type of database querying system in use in biology)

Page 47: Introduction to Biology Saleet Jafri GMU Cholesterol Water seeking head group water fatty chains Presumed common progenitor of all extant organism Presumed

Indexed searches.

• To allow easy searching of a database, make an index.

• An index is a list of primary keys corresponding to a key in a given field (or to a collection of fields)

Genbank division
PRI        6226959; 6226762; 4557224; …
MAM        41; …

Accession
NM_000014  6226959; 6226762; 4557224;
X63129     41;
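A minimal sketch of how such an index could be built, assuming the records are held as Python dictionaries keyed by field name. The helper build_index is a hypothetical name for this example; the gi plays the role of the primary key.

```python
from collections import defaultdict

# The four records from the example table; the gi is the primary key.
records = [
    {"gi": 6226959, "accession": "NM_000014", "division": "PRI"},
    {"gi": 6226762, "accession": "NM_000014", "division": "PRI"},
    {"gi": 4557224, "accession": "NM_000014", "division": "PRI"},
    {"gi": 41, "accession": "X63129", "division": "MAM"},
]

def build_index(records, field):
    """Map each value of `field` to the list of primary keys (gi) having it."""
    index = defaultdict(list)
    for record in records:
        index[record[field]].append(record["gi"])
    return dict(index)

division_index = build_index(records, "division")
accession_index = build_index(records, "accession")

print(division_index)
print(accession_index)
```

A lookup such as division_index["MAM"] then returns the matching primary keys without scanning every record.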


Indexed searches.

• Boolean Query: Merging and Intersecting lists:
– AND (in both lists) (e.g. human AND genome)
– +human +genome
– human && genome
– OR (in either list) (e.g. human OR genome)
– human || genome
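With an index in hand, AND and OR reduce to intersecting and merging lists of primary keys. The sketch below uses a toy index with made-up document numbers; query_and and query_or are hypothetical helper names, not part of any search engine.

```python
# Toy index: term -> set of document ids (the ids are invented for illustration).
index = {
    "human": {1, 2, 3, 5},
    "genome": {2, 3, 4},
}

def query_and(index, *terms):
    """Documents containing ALL terms: intersect the lists."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

def query_or(index, *terms):
    """Documents containing ANY term: merge the lists."""
    postings = [index.get(term, set()) for term in terms]
    return set().union(*postings)

print(sorted(query_and(index, "human", "genome")))
print(sorted(query_or(index, "human", "genome")))
```

AND can never return more documents than the shortest list, while OR returns at least as many as the longest one.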


Search strategies

• Search engines use complex strategies that go beyond Boolean queries.
– Phrase matching:
• human genome -> “human genome”
– Togetherness: documents with human close to genome are scored higher.
– Term expansion & synonyms:
• human -> homo sapiens
– Neighbours:
• human genome -> genome projects, chromosomes, genetics
– Frequency of links (www.google.com)

• To avoid this term mapping, enclose your queries in quotes: “human” AND “genome”


Search strategies

• Search engines use complex strategies that go beyond Boolean queries.

• To avoid this term mapping, enclose your queries in quotes: “human” AND “genome”

• To require that ALL the terms in your query be important, precede them with a “+”. This also prevents term mapping.

• To force the order of the words to be important, group the words into a quoted string: “biology of mammals”.


Indexed searches. Example

• Find the advanced query page at http://www.altavista.com
• Type human (and hit the Search button)
• Type genome
• Type human AND genome
• Type “human genome” (finds the fewest matches)
• Type human OR genome (finds the most matches)


• Search Engines:
– Web Spiders: Collect all web pages, but since web pages change all the time and new ones appear, they must constantly roam the web and re-index, or depend on people submitting their own pages.

• www.google.com (BEST!)

• www.infoseek.com

• www.lycos.com

• www.excite.com

• www.webcrawler.com

• www.looksmart.com (country specific)


• Search Engines:
• www.google.com (BEST!)

• Google ranks pages according to how many pages with those terms refer to the pages you are asking for. Not only must one document contain ALL the search terms, but other documents which refer to this one must also contain all the terms.

• Great when you know what you are looking for! You can also use “” to require immediate proximity and order of terms.

• E.g. type: Web server for the blast program.

But Google only indexes about 40% of the web, so you may have to use other web spiders.

(Disclaimer: I don’t own stock in that company, but I’d like to.)


• Search Engines:
– Curated Collections: Not comprehensive. They contain lists of the best sites for commonly requested topics, but miss important sites for more specialized topics (like biology).

• www.yahoo.com (Has travel maps too!)

– Answer-based curated collections: Easy to use, with English-like queries. They first look at a list of predefined answers, then refine the answers based on user interaction. They also answer new questions.

• www.askjeeves.com
• www.magellan.com
• www.altavista.com (has translation TOOLS)
• www.hotbot.com


• Search Engines:
– Meta-Search Engines: Poll several search engines and return the consensus of all results. They are likely to miss sites, but the sites they return are very relevant to the query.

– The other operating mode is to return the sum of all the results, which then becomes very sensitive to a very detailed query.

• www.metacrawler.com

• www.savvysearch.com

• www.1blink.com (fast)

• www.metafind.com

• www.dogpile.com


• Virtual Libraries: Curated collections of links for Biologists (by Biologists).
– Pedro’s BioMolecular Research Tools (1996)
• http://www.public.iastate.edu/~pedro/
– Virtual Library: Bio Sciences
• http://vlib.org/Biosciences.html
– Publications and abstract search
• http://www.ncbi.nlm.nih.gov/
– Expasy server
• http://www.expasy.ch
– EBI Biocatalog (software & databases list)
• http://www.ebi.ac.uk/biocat/


Biological Databases

• Nucleotide databases:
– Genbank: International Collaboration
• NCBI (USA), EMBL (Europe), DDBJ (Japan and Asia)
• A “bank”: no curation. Submission to these databases is required for publication in a journal.

– Organism specific databases (Exercise: Find URLs using search engines)

• FlyBase

• ChickGBASE

• pigbase

• wormpep

• YPD (Yeast Protein Database)

• SGD(Saccharomyces Genome Database)


• Protein Databases:
– NCBI:
– Swiss Prot: (Free for academic use, otherwise commercial. Licensing restrictions on discoveries made using the DB. 1998 version free of any licensing.)
• http://www.expasy.ch (latest pay version)
• NCBI has the latest free version.
• Translated Proteins from Genbank Submissions
– EMBL
• TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT
– PIR


• Structure databases:
– PDB: Protein structure database.
• http://www.rcsb.org/pdb/
– MMDB: NCBI’s version of PDB with Entrez links.
• http://www.ncbi.nlm.nih.gov

• Genome Mapping Information:
– http://www.il-st-acad-sci.org/health/genebase.html
– NCBI (Human)
– Genome Centers:
• Stanford, Washington University
– Research Centers and Universities


• Literature databases:
– NCBI: Pubmed: All biomedical literature.
• www.ncbi.nlm.nih.gov
• Abstracts and links to publisher sites for
– full text retrieval/ordering
– journal browsing.
– Publisher web sites.
– Biomednet: Commercial site for literature search.

• Pathways Database:
– KEGG: Kyoto Encyclopedia of Genes and Genomes: www.genome.ad.jp/kegg/kegg/html


• Database Identifiers: Primary keys
– GI (changes with each sequence update; NCBI only)
• Annotation may change without the gi changing!
– Accession (stable)
– Version (changes with each sequence update)
– “Version” also refers to Accession.version
– Secondary accession: Records may have been merged in the past, so the records which were not chosen as the primary were made secondary.


Primary Databases

• A primary database is a repository of data derived from experiments or from research knowledge.
– Genbank (Nucleotide repository)
– Protein DB, Swissprot
– PDB (MMDB) are primary databases.
– Pubmed (literature)
– Genome Mapping databases.
– Kegg Database (pathways)


Secondary Databases

• A secondary database contains information derived from other sources.
– Refseq (Curated collection of Genbank at NCBI)
– Unigene (Clustering of ESTs at NCBI)

• Organism-specific databases are often a mix between primary and secondary.


Genbank Records

• A Bank: No attempt at reconciliation.
• Submit a sequence -> Get an Accession Number!
– Cannot modify sequences without submitter’s consent.
– No attempt at reconciliation (not a unique collection per LOCUS/gene).
– Entries of various sequence quality and different sources ==> separated into various divisions based on:
• High-quality sequences in taxon-specific divisions.
• Low-quality sequences in usage-specific databases.

• A collaboration between NCBI, EMBL and DDBJ. They contain (nearly) the same information; only the data format differs.

EMBL does not differentiate between the different types of RNA records, while NCBI (and DDBJ) do. In Entrez, EMBL records are patched up to add that information.


Refseq and LocusLink

• Attempt to produce 1 mRNA, 1 protein, and 1 genomic gene for each frequently occurring allele of a protein-expressing gene.

• www.ncbi.nlm.nih.gov/LocusLink
• Special non-Genbank Accession numbers
– NM_nnnnnn mRNA refseq
– NP_nnnnnn protein refseq
– NC_nnnnnn refseq genomic contig
– NT_nnnnnn temporary genomic contig
– NX_nnnnnn predicted gene
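Because the record type is encoded in the accession prefix, a short lookup can tell a Refseq accession from an ordinary Genbank one. The sketch below only knows the five prefixes listed above; classify_accession is a hypothetical helper, and the optional ".version" suffix is assumed to follow the Accession.version convention.

```python
import re

# Prefix -> record type, exactly as listed on the slide.
REFSEQ_PREFIXES = {
    "NM": "mRNA refseq",
    "NP": "protein refseq",
    "NC": "refseq genomic contig",
    "NT": "temporary genomic contig",
    "NX": "predicted gene",
}

# Two-letter prefix, underscore, digits, optional ".version".
_REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")

def classify_accession(accession):
    """Return the Refseq record type, or a fallback label for other accessions."""
    match = _REFSEQ_PATTERN.match(accession.strip())
    if match and match.group(1) in REFSEQ_PREFIXES:
        return REFSEQ_PREFIXES[match.group(1)]
    return "not a listed Refseq prefix"

for acc in ("NM_000014", "NM_000014.3", "X63129"):
    print(acc, "->", classify_accession(acc))
```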


Genbank divisions

Sequences in genbank are split into various categories based on

1) The quality and type of sequences

2) The high-quality nucleotide sequences are divided into organism-dependent divisions.


• Genbank Entry type: (and query to restrict to that field)
– mRNA (1/10000 errors)
• biomol_mRNA [PROP]
– cDNA (EST, 95-99% accuracy, single pass)
• gbdiv_EST [PROP]
– genomic (biomol_genomic [PROP])
• in HTGS division: >99% accuracy
– gbdiv_HTG [PROP]
• GSS (low-quality genome survey sequences)
– gbdiv_GSS [PROP]
• rest of Genbank: 1/10000 errors
– Human gbdiv_PRI [PROP]
– mouse gbdiv_ROD [PROP]
– bovine gbdiv_MAM [PROP]
– STS (EST or cDNA used in mapping)
• gbdiv_STS [PROP]


FASTA Format

>identifier descriptive text
nucleotide or amino-acid sequence on multiple lines if needed.

Example:
>gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
CATCACGCGGGGCCTTCTGCTGCTGGC ….

MOST important data format!!!
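Reading FASTA takes only a few lines of code, which is part of why the format is so widely used. The following is a minimal sketch, not a replacement for an established parser: parse_fasta is a hypothetical helper, and the example sequence is truncated just as it is on the slide.

```python
def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new defline closes the previous record, if any.
            if header is not None:
                records.append(_finish(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_finish(header, chunks))
    return records

def _finish(header, chunks):
    # The identifier is everything up to the first space; the rest is free text.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)

example = """>gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
CATCACGCGGGGCCTTCTGCTGCTGGC
"""

fasta_records = parse_fasta(example)
for identifier, description, sequence in fasta_records:
    print(identifier, "|", description, "|", len(sequence), "residues")
```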


Modified FASTA Format

1) A few tools follow the convention that lower case sequences are masked. (repeat masker, some versions of blast, megablast, blastz)

2) A few analysis tools (like CLUSTAL) want a simplified identifier on the defline.. So they can have a short string for the alignment.

>X63129.1
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTCCATCACGCGGGGCCTTCTGCTGCTGGC ….
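Shortening a defline for tools that want a simple identifier can be sketched as below. It assumes the gi|number|database|accession.version|locus layout seen in the example defline; simplify_defline is a hypothetical helper, and deflines in other layouts are simply cut at the first space.

```python
def simplify_defline(defline):
    """Reduce an NCBI-style defline to '>accession.version' when possible."""
    header = defline.strip().lstrip(">")
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    fields = identifier.split("|")
    # Assumed layout: gi|<number>|<db>|<accession.version>|<locus>
    if len(fields) >= 4 and fields[0] == "gi":
        return ">" + fields[3]
    return ">" + identifier

print(simplify_defline(">gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin"))
```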


• Wim will now talk about GCG …


Feature table(NCBI;EMBL/DDBJ)

• http://www.ncbi.nlm.nih.gov/collab/FT/index.html


Genbank Data format

• LOCUS BTA1AT 1380 bp mRNA MAM 30-APR-1992

• DEFINITION B.taurus mRNA for alpha-1-antitrypsin.

• ACCESSION X63129

• NID g41

• VERSION X63129.1 GI:41

• KEYWORDS alpha-1 antitrypsin; serine protease inhibitor; serpin.

• SOURCE Bos taurus.

• ORGANISM Bos taurus

• Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;

• Artiodactyla; Ruminantia; Pecora; Bovoidea; Bovidae; Bovinae; Bos.



Genbank References

• LOCUS BTA1AT 1380 bp mRNA MAM 30-APR-1992

• ...

• REFERENCE 1 (bases 1 to 1380)

• AUTHORS Sinha,D.

• TITLE Direct Submission

• JOURNAL Submitted (22-OCT-1991) D. Sinha, Dept of Biochemistry, Temple University, 3400 North Broad Street, Philadelphia, PA 19140, USA

• REFERENCE 2 (bases 1 to 1380)

• AUTHORS Sinha,D., Bakhshi,M.R. and Kirby,E.P.

• TITLE Complete cDNA sequence of bovine alpha 1-antitrypsin

• JOURNAL Biochim. Biophys. Acta 1130 (2), 209-212 (1992)

• MEDLINE 92223096

• FEATURES Location/Qualifiers


Genbank Source Qualifier

• LOCUS BTA1AT 1380 bp mRNA MAM 30-APR-1992

• ...

• FEATURES Location/Qualifiers

• source 1..1380

• /organism="Bos taurus"

• /db_xref="taxon:9913"

• /tissue_type="liver"

• /cell_type="hepatocyte"

• /clone_lib="lambda gt11"

• /clone="2f-Ic"

• mRNA <1..>1380

• sig_peptide 33..104

• ...


Genbank mRNA+CDS features

• mRNA <1..>1380

• sig_peptide 33..104

• CDS 33..1283

• /codon_start=1

• /product="alpha-1-antitrypsin"

• /protein_id="CAA44840.1"

• /db_xref="PID:g42"

• /db_xref="GI:42"

• /db_xref="SWISS-PROT:P34955"

• /translation="MALSITRGLLLLAALCCLAPISLAGVLQGHAVQETDDTSHQEAACHKIAPNLANFAFSIYHHLAHQSNTSNIFFSPVSIASAFAMLSLGAKGNTHTEILKGLGFNLTELAEAEIHKGFQHLLHTLNQPNHQLQLTTGNGLFINESAKLVDTFLEDVKNLYHSEAFSINFRDAEEAKKKINDYVEKGSHGKIVELVKVLDPNTVFALVNYISFKGKWEKPFEMKHTTERDFHVDEQTTVKVPMMNRLGMFDLHYCDKLASWVLLLDYVGNVTACFILPDLGKLQQLEDKLNNELLAKFLEKKYASSANLHLPKLSISETYDLKSVLGDVGITEVFSDRADLSGITKEQPLKVSKALHKAALTIDEKGTEAVGSTFLEAIPMSLPPDVEFNRPFLCILYDRNTKSPLFVGKVVNPTQA"

• mat_peptide 105..1280

• /product="alpha-1-antitrypsin"

• polyA_signal 1343..1348

• polyA_site 1368
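The locations in the feature table are 1-based and inclusive, and a "<" or ">" marks an end that extends beyond the sequenced region. A minimal sketch of reading the simple locations on this slide follows; parse_location and span_length are hypothetical helpers, and compound locations such as join(...) or complement(...) are deliberately not handled.

```python
import re

# Handles 'start..end', '<start..>end' and single positions such as '1368'.
_LOCATION = re.compile(r"^(<?)(\d+)(?:\.\.(>?)(\d+))?$")

def parse_location(location):
    """Return (start, end, start_is_partial, end_is_partial), 1-based inclusive."""
    match = _LOCATION.match(location.strip())
    if match is None:
        raise ValueError("unsupported location: %r" % location)
    start = int(match.group(2))
    end = int(match.group(4)) if match.group(4) else start
    return start, end, match.group(1) == "<", match.group(3) == ">"

def span_length(location):
    start, end, _, _ = parse_location(location)
    return end - start + 1

# Feature locations copied from the record above.
features = {
    "mRNA": "<1..>1380",
    "sig_peptide": "33..104",
    "CDS": "33..1283",
    "mat_peptide": "105..1280",
    "polyA_signal": "1343..1348",
    "polyA_site": "1368",
}
lengths = {name: span_length(loc) for name, loc in features.items()}

for name, length in lengths.items():
    print(name, length, "bases")
```

As a consistency check, the signal peptide (72 bases) plus the mature peptide (1176 bases) plus one stop codon (3 bases) add up to the 1251 bases of the CDS.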


Genbank Sequence format

• ...
• BASE COUNT 357 a 413 c 322 g 288 t
• ORIGIN
• 1 gaccagccct gacctaggac agtgaatcga taatggcact ctccatcacg cggggccttc
• 61 tgctgctggc agccctgtgc tgcctggccc ccatctccct ggctggagtt ctccaaggac
• 121 acgctgtcca agagacagat gatacatccc accaggaagc agcgtgccac aagattgccc
• 181 ccaacctggc caactttgcc ttcagcatat accaccattt ggctcatcag tccaacacca
• 241 gcaacatctt cttctccccc gtgagcatcg cttcagcctt tgcgatgctc tccctgggag
• 301 ccaagggcaa cactcacact gagatcctga agggcctggg tttcaacctc actgagctcg
• 361 cagaggctga gatccacaaa ggctttcagc atcttctcca caccctgaac cagccaaacc
• ...
• 1321 gtccccccac tccctccatg gcattaaagg atgactgacc tagccccgaa aaaaaaaaaa
• //
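Turning the ORIGIN block back into a plain sequence only requires dropping the position numbers and the spaces. The sketch below runs on the first two sequence lines of the record, so its counts cover 120 bases rather than the full 1380; origin_to_sequence and base_count are hypothetical helpers.

```python
def origin_to_sequence(lines):
    """Join ORIGIN lines into one string, dropping position numbers and spaces."""
    blocks = []
    for line in lines:
        parts = line.split()
        # Each sequence line starts with the position of its first base.
        if not parts or not parts[0].isdigit():
            continue
        blocks.extend(parts[1:])
    return "".join(blocks)

def base_count(sequence):
    """Count each base, as in the BASE COUNT line."""
    return {base: sequence.count(base) for base in "acgt"}

origin_lines = [
    "1 gaccagccct gacctaggac agtgaatcga taatggcact ctccatcacg cggggccttc",
    "61 tgctgctggc agccctgtgc tgcctggccc ccatctccct ggctggagtt ctccaaggac",
]

partial_sequence = origin_to_sequence(origin_lines)
print(len(partial_sequence), base_count(partial_sequence))
```

Run on the complete record, the same count should reproduce the BASE COUNT line, whose four numbers sum to the 1380 bp given in the LOCUS line.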


EMBL DATA FORMAT

• EMBL: http://www.ebi.ac.uk/Databases/

• http://www.ebi.ac.uk/cgi-bin/emblfetch

• Use Accession X63129


DDBJ DATA FORMAT

• DDBJ: http://www.ddbj.nig.ac.jp/

• http://ftp2.ddbj.nig.ac.jp:8000/getstart-e.html

• Use Accession X63129

• Flat file format same as NCBI/Genbank format.


Entrez

• Index-based search system. Each field in the database is searchable individually or as an aggregate.
– (e.g. CDS [FKEY])
– default is aggregate [ALL FIELDS] *

• All primary databases are interlinked as one big relational database.
– (e.g. Pubmed links in Genbank records)

• Phrase matching.
– Human genome -> “human genome”


Entrez

• Available neighbours (related documents or related sequences)

• In Pubmed searches: Term mapping to neighbouring documents and neighbouring terms.

• Term mapping to chemical names.
– In Pubmed: term [All Fields] is term mapped to chemical names + MeSH terms + Text Fields.

– ... unless “term” is within double quotes.


SWISSPROT

1. Core data: protein sequence data; the citation information and the taxonomic data
2. Annotation
• Function(s) of the protein
• Domains and sites. For example calcium binding regions, ATP-binding sites, zinc fingers, homeobox, kringle, etc.
• Post-translational modification(s). For example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.
• Secondary structure
• Quaternary structure. For example homodimer, heterotrimer, etc.
• Similarities to other proteins
• Disease(s) associated with deficiencie(s) in the protein
• Sequence conflicts, variants, etc.

http://www.expasy.ch/sprot/sprot_details.html


SWISSPROT

http://www.expasy.ch/cgi-bin/get-random-entry.pl?S


REBASE (Restriction enzymes dataBASE)

Restriction enzymes have a recognition sequence (a pattern), and the actual cutting site lies within that pattern or a few bases away from it.
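The relation between recognition sequence and cutting site can be sketched as a pattern search plus an offset. The example uses EcoRI, which recognizes GAATTC and cuts after the first G (G^AATTC). The helper names and the toy sequence are invented for illustration, only the top strand is considered, and ambiguous bases in a recognition sequence are not handled.

```python
def find_cut_sites(sequence, recognition, cut_offset):
    """Return 0-based cut positions on the top strand.

    cut_offset is counted from the start of the recognition sequence,
    so it may fall inside the pattern or a few bases beyond it.
    """
    sequence = sequence.upper()
    recognition = recognition.upper()
    sites = []
    start = sequence.find(recognition)
    while start != -1:
        sites.append(start + cut_offset)
        start = sequence.find(recognition, start + 1)
    return sites

def digest(sequence, recognition, cut_offset):
    """Split the sequence into fragments at every cut site."""
    fragments, previous = [], 0
    for cut in find_cut_sites(sequence, recognition, cut_offset):
        fragments.append(sequence[previous:cut])
        previous = cut
    fragments.append(sequence[previous:])
    return fragments

# Hypothetical toy sequence containing two EcoRI sites.
toy_sequence = "AAGAATTCTTTTGAATTCGG"
ecori_fragments = digest(toy_sequence, "GAATTC", 1)
print(find_cut_sites(toy_sequence, "GAATTC", 1))
print(ecori_fragments)
```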

http://rebase.neb.com/rebase/rebase.html

I prefer the Bairoch format (SWISSPROT format): http://rebase.neb.com/rebase/rebase.f19.html

ID  enzyme name
ET  enzyme type
OS  microorganism name
PT  prototype
RS  recognition sequence, cut site
MS  methylation site (type)
CR  commercial sources for the restriction enzyme
CM  commercial sources for the methylase
RN  [count]
RA  authors
RL  jour, vol, pages, year, etc.