
CS 6293 Advanced Topics: Translational Bioinformatics

Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology

Outline

• Course overview

• Short introduction to molecular biology

Course Info

• Time: TR 4:00-5:15pm
• Location: MB 1.01.03
• Instructor: Dr. Jianhua Ruan
  – Office: S.B. 4.01.48
  – Phone: 458-6819
  – Email: jianhua.ruan@utsa.edu
  – Office hours: W 2-3pm or by appointment

• Web: http://www.cs.utsa.edu/~jruan, follow link to teaching, then to cs6293

Survey

• Help me better design lectures and assignments

• Form available on course webpage
  – Your name
  – Email
  – Academic preparation
  – Interests

Course description

• Review of the “most recent” developments & research problems in bioinformatics
  – Some overlap with CS5263: (Introduction to) Bioinformatics and CS6293 Fall 2010
• Prerequisite:
  – CS5263
  – Strong background in algorithms and data structures
  – Solid knowledge of statistics and probability
  – Desire and ability to learn by yourself

Reading materials

• No textbooks

• Reading materials
  – Slides
  – Book chapters
  – Journal / conference papers
  – Posted on course website usually a week before discussion

Covered topics

• Biology
• (Next-generation) sequence analysis algorithms
• Gene expression data mining
• Translational bioinformatics
  – Use the PLoS Computational Biology collection: http://www.ploscollections.org/article/browseIssue.action?issue=info:doi/10.1371/issue.pcol.v03.i11
• TBD
• You are expected to read a lot of papers and give multiple presentations

Grading

• Attendance: 10%
  – Up to 3 classes may be missed without affecting the grade; more only if approved by the instructor
• Homeworks and presentations: 40%
  – 3-5 assignments
    • Combination of theoretical and programming exercises
    • Presenting and discussing papers
    • Scribing
  – No late submission accepted
  – Read the collaboration policy!
• Midterm project / exam: 20%
• Final project / exam: 30%

Why bioinformatics

• The advance of experimental technology has resulted in a huge amount of data
  – The human genome is “finished”
  – Even if it were, that’s only the beginning…
• The bottleneck is how to integrate and analyze the data
  – Noisy
  – Diverse

Growth of GenBank vs Moore’s law

Genome annotations

Meyer, Trends and Tools in Bioinfo and Compt Bio, 2006

What is bioinformatics

• National Institutes of Health (NIH):
  – Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

What is bioinformatics

• National Center for Biotechnology Information (NCBI):
  – The field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.

[Figure: Bioinformatics at the intersection of Chemistry, Mathematics, Statistics, Computer Science, Informatics, Physics, Medicine, Biology, and Molecular Biology]

Computer Scientists vs Biologists

(courtesy Serafim Batzoglou, Stanford)

Biologists vs computer scientists

• (almost) Everything is true or false in computer science

• (almost) Nothing is ever true or false in Biology

• Biologists seek to understand the complicated, messy natural world

• Computer scientists strive to build their own clean and organized virtual world

• Computer scientists are obsessed with being the first to invent or prove something

• Biologists are obsessed with being the first to discover something

Some examples of central role of CS in bioinformatics

1. Genome sequencing

• Genome: 3x10^9 nucleotides
  – e.g. AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT
• Sequencing reads: ~500 nucleotides each
• A big puzzle: ~60 million pieces
• Computational Fragment Assembly
  – Introduced ~1980
  – 1995: assemble up to 1,000,000 long DNA pieces
  – 2000: assemble whole human genome
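The puzzle view of fragment assembly can be illustrated with a toy greedy merger: repeatedly join the two reads with the longest suffix-prefix overlap. This is only a sketch under simplifying assumptions (error-free reads, a single strand, no repeats); the function names and the `min_len` threshold are made up for illustration, and real assemblers work very differently at scale.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Greedily merge the pair of reads with the largest overlap until none remain."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: the remaining reads stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping reads taken from the start of the example sequence.
reads = ["ACTACGACGAG", "AGTAGCACAG", "CACAGACTAC"]
print(greedy_assemble(reads))
```

The pairwise comparison is quadratic in the number of reads, which is one reason why ~60 million pieces call for much better algorithms.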

2. Gene Finding

Where are the genes?

• In humans: ~22,000 genes (~1.5% of human DNA)
• Gene structure, 5’ to 3’: start codon (ATG), Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, stop codon (TAG/TGA/TAA)
• Splice sites mark the exon-intron boundaries
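As a first approximation of gene finding, one can scan a DNA string for open reading frames: an ATG followed, in the same reading frame, by one of the stop codons. The sketch below assumes a single strand and no introns, so it is closer to the prokaryotic case than to human genes; `find_orfs` and `min_codons` are illustrative names, not part of any standard tool.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ORFs found in the three forward frames.

    An ORF here runs from the first ATG in a frame up to and including the next
    in-frame stop codon. min_codons counts codons before the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


dna = "CCATGGCTTGTTAGAA"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

Scanning the reverse complement as well would cover genes encoded on the other strand.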

Hidden Markov Models

(Well studied for many years in speech recognition)

3. Protein Folding

• The amino-acid sequence of a protein determines the 3D fold
• The 3D fold of a protein determines its function
• Can we predict the 3D fold of a protein given its amino-acid sequence?
  – Holy grail of computational biology, a 40-year-old problem
  – Molecular dynamics, computational geometry, machine learning

4. Sequence Comparison—Alignment

AGGCTATCACCTGACCTCCAGGCCGATGCCC

TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
 || |||||||  ||||x|  ||x|||  |||||
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Sequence Alignment

• Introduced ~1970
• BLAST: 1990, one of the most cited papers in history
• Still very active area of research
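A gapped alignment like the one above can be computed by dynamic programming. The sketch below is a plain global alignment in the style of Needleman-Wunsch; the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, and the alignment it returns for the two example sequences need not be identical to the one drawn on the slide.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


s, x, y = global_align("AGGCTATCACCTGACCTCCAGGCCGATGCCC",
                       "TAGCTATCACGACCGCGGTCGATTTGCCCGAC")
print(s)
print(x)
print(y)
```

The table has one cell per pair of positions, so time and memory grow with the product of the sequence lengths; heuristics such as BLAST exist precisely because this is too slow for database search.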

Efficient string matching algorithms

Fast database index techniques

…, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC).

Lipman & Pearson, 1985

Database size today: 10^12 (a 2-million-fold increase).

BLAST search: 1.5 minutes


5. Microarray data analysis

Example: Clinical prediction of Leukemia type

• 2 types of leukemia
  – Acute lymphoid (ALL)
  – Acute myeloid (AML)
• Different treatments & outcomes
• Predict type before treatment?

Bone marrow samples: ALL vs AML

Measure amount of each gene

Some goals of biology for the next 50 years

• List all molecular parts that build an organism
  – Genes, proteins, other functional parts
• Understand the function of each part
• Understand how parts interact physically and functionally
• Study how function has evolved across all species
• Find genetic defects that cause diseases
• Design drugs rationally
• Sequence the genome of every human, use it for personalized medicine
• Bioinformatics is an essential component for all the goals above

A short introduction to molecular biology

• Two categories of organisms:
  – Prokaryotes (e.g. bacteria)
    • Unicellular
    • No nucleus
  – Eukaryotes (e.g. fungi, plants, animals)
    • Unicellular or multicellular
    • Have a nucleus

Prokaryote vs Eukaryote

• A eukaryote has many membrane-bounded compartments inside the cell
  – Different biological processes occur at different cellular locations

Organism, Organ, Cell

Chemical contents of cell

• Water
• Macromolecules (polymers): “strings” made by linking monomers from a specified set (alphabet)
  – Protein
  – DNA
  – RNA
  – …
• Small molecules
  – Sugar
  – Ions (Na+, K+, Ca2+, Cl-, …)
  – Hormone
  – …

DNA

• DNA: forms the genetic material of all living organisms
  – Can be replicated and passed to descendants
  – Contains information to produce proteins
• To computer scientists, DNA is a string made from alphabet {A, C, G, T}
  – e.g. ACAGAACGTAGTGCCGTGAGCG
• Each letter is a nucleotide
• Length varies from hundreds to billions

RNA

• Historically thought to be information carrier only
  – DNA => RNA => Protein
  – New roles have been found for RNAs
• To computer scientists, RNA is a string made from alphabet {A, C, G, U}
  – e.g. ACAGAACGUAGUGCCGUGAGCG
• Each letter is a nucleotide
• Length varies from tens to thousands

Protein

• Protein: the actual “worker” for almost all processes in the cell
  – Enzymes: speed up reactions
  – Signaling: information transduction
  – Structural support
  – Production of other macromolecules
  – Transport
• To computer scientists, protein is a string made from 20 kinds of characters
  – E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP
• Each letter is called an amino acid
• Length varies from tens to thousands

DNA/RNA zoom-in

• Commonly referred to as Nucleic Acid
• DNA: Deoxyribonucleic acid
• RNA: Ribonucleic acid
• Found mainly in the nucleus of a cell (hence “nucleic”)
• Contain phosphoric acid as a component (hence “acid”)
• They are made up of a string of nucleotides

Nucleotides

• A nucleotide has 3 components
  – Sugar ring (ribose in RNA, deoxyribose in DNA)
  – Phosphoric acid
  – Nitrogen base
    • Adenine (A)
    • Guanine (G)
    • Cytosine (C)
    • Thymine (T) in DNA and Uracil (U) in RNA

Units of RNA: ribo-nucleotide

• A ribonucleotide has 3 components
  – Sugar: Ribose
  – Phosphate group
  – Nitrogen base
    • Adenine (A)
    • Guanine (G)
    • Cytosine (C)
    • Uracil (U)

Units of DNA: deoxy-ribo-nucleotide

• A deoxyribonucleotide has 3 components
  – Sugar: Deoxy-ribose
  – Phosphate group
  – Nitrogen base
    • Adenine (A)
    • Guanine (G)
    • Cytosine (C)
    • Thymine (T)

Polymerization: Nucleotides => nucleic acids

[Figure: nucleotides linked through their phosphate groups, each carrying a nitrogen base; the chain runs from the free phosphate at the 5 prime end to the 3 prime end]

DNA: 5’-AGCGACTG-3’, or simply AGCGACTG

• Often recorded from 5’ to 3’, which is the direction of many biological processes, e.g. DNA replication, transcription, etc.

RNA: 5’-AGUGACUG-3’, or simply AGUGACUG

• Often recorded from 5’ to 3’, which is the direction of many biological processes, e.g. translation.

Base-pair:

• DNA usually exists in pairs.
• Forward (+) strand: 5’-AGCGACTG-3’
• Backward (-) strand: 3’-TCGCTGAC-5’
• One strand is said to be reverse-complementary to the other

DNA double helix

G-C pair is stronger than A-T pair

Reverse-complementary sequences

• 5’-ACGTTACAGTA-3’

• The reverse complement is:

3’-TGCAATGTCAT-5’

5’-TACTGTAACGT-3’

• Or simply written as

TACTGTAACGT
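Treating DNA as a string, the reverse complement is a two-step operation: complement every base, then reverse the result so that it again reads 5’ to 3’. A minimal sketch (the function name is illustrative) that reproduces the example above:

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


print(reverse_complement("ACGTTACAGTA"))  # TACTGTAACGT
```

Applying the function twice returns the original sequence, which reflects the fact that neither strand is privileged.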

Orientation of the double helix

• Double helix is anti-parallel
  – 5’ end of one strand pairs with 3’ end of the other
  – 5’ to 3’ motion in one strand is 3’ to 5’ in the other
• Double helix has no orientation
  – Biology has no “forward” and “reverse” strand
  – Relative to any single strand, there is a “reverse complement” or “reverse strand”
  – Information can be encoded by either strand or both strands

5’ TTTTACAGGACCATG 3’
3’ AAAATGTCCTGGTAC 5’

• RNAs are normally single-stranded

• Form complex structure by self-base-pairing

• A=U, C=G

• Can also form RNA-DNA and RNA-RNA double strands.
  – A=T/U, C=G

Protein zoom-in

• Protein is the actual “worker” for almost all processes in the cell
• A string built from 20 kinds of chars
  – E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKH
• Each letter is called an amino acid

Units of Protein: Amino acid

Generic chemical form of amino acid (amino group H2N, carboxyl group COOH, side chain R):

     R
     |
H2N--C--COOH
     |
     H

• 20 amino acids, only differ at side chains
  – Each can be expressed by three letters
  – Or a single letter: A-Y, except B, J, O, U, X, Z
  – Alanine = Ala = A
  – Histidine = His = H

Amino acids => peptide

     R             R
     |             |
H2N--C--COOH  H2N--C--COOH
     |             |
     H             H

     R          R
     |          |
H2N--C--CO--NH--C--COOH
     |          |
     H          H

Peptide bond: the CO--NH link between the two amino acids

Protein

• Has orientations
• Usually recorded from N-terminal to C-terminal
• Peptide vs protein: basically the same thing
• Conventions
  – Peptide is shorter (< 50aa), while protein is longer
  – Peptide refers to the sequence, while protein has 2D/3D structure

[Figure: a chain of residues (R R R R R) running from N-terminal to C-terminal]

Protein structure

• Linear sequence of amino acids folds to form a complex 3-D structure.

• The structure of a protein is intimately connected to its function.

Genome and chromosome

• Genome: the complete DNA sequences in the cell of an organism
  – May contain one (in most prokaryotes) or more (in eukaryotes) chromosomes
• Chromosome: a single large DNA molecule in the cell
  – May be circular or linear
  – Contains genes as well as “junk DNAs”
  – Highly packed!

Formation of chromosome

50,000 times shorter than extended DNA

The total length of DNA present in one adult human is the equivalent of nearly 70 round trips from the earth to the sun

• Gene: unit of heredity in living organisms
  – A segment of DNA with information to make a protein or a functional RNA

Some statistics

Organism           Chromosomes   Bases         Genes
Human              46            3 billion     20k-25k
Dog                78            2.4 billion   ~20k
Corn               20            2.5 billion   50-60k
Yeast              16            20 million    ~7k
E. coli            1             4 million     ~4k
Marbled lungfish   ?             130 billion   ?

Human genome

• 46 chromosomes: 22 pairs + 2 sex chromosomes (X, Y)
  – In each pair: 1 from mother, 1 from father

• Female: X + X

• Male: X + Y

Human genome

• Every cell contains the same genomic information
  – Except sperm and eggs, which only contain half of the genome
    • Otherwise your children would have 46 + 46 chromosomes …

Cell division: mitosis

• A cell duplicates its genome and divides into two identical cells
• These cells build up different parts of your body

Cell division: meiosis

• A reproductive cell divides into four cells, each containing only half of the genome
  – Diploid => haploid
• Two haploid cells (sperm + egg) form a zygote
  – Which will then develop into a multi-cellular organism by mitosis

Central dogma of molecular biology

DNA replication is critical in both mitosis and meiosis

DNA Replication

• The process of copying a double-stranded DNA molecule
  – Semi-conservative: each daughter molecule keeps one parental strand and gains one newly made strand

Parent:
5’-ACATGATAA-3’
3’-TGTACTATT-5’

Daughters:
5’-ACATGATAA-3’     5’-ACATGATAA-3’
3’-TGTACTATT-5’     3’-TGTACTATT-5’

• Nucleotide triphosphates (dNTPs) are the building blocks of the new strands
• Mutation: changes in DNA base-pairs
• Proofreading and error-correcting mechanisms exist to ensure extremely high fidelity (one mistake per 10^9 - 10^11 nucleotides)

Central dogma of molecular biology

Transcription

• The process by which a DNA sequence is copied to produce a complementary RNA
  – Called messenger RNA (mRNA) if the RNA carries instructions on how to make a protein
  – Called non-coding RNA if the RNA does not carry instructions on how to make a protein
  – Only consider mRNA for now
• Similar to replication, but
  – Only one strand is copied
  – No proofreading, so a relatively higher error rate

Transcription

Coding strand (where genetic information is stored): 5’-ACGTAGACGTATAGAGCCTAG-3’

Template strand (for making mRNA): 3’-TGCATCTGCATATCTCGGATC-5’

mRNA: 5’-ACGUAGACGUAUAGAGCCUAG-3’

Coding strand and mRNA have the same sequence, except that T’s in DNA are replaced by U’s in mRNA.

DNA-RNA pair: A=U, C=G, T=A, G=C
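In string terms, transcription can be written in two equivalent ways: replace T with U in the coding strand, or pair each base of the template strand with its RNA partner. The sketch below (function names are illustrative) checks that both routes give the mRNA of the example; the template is passed 3’ to 5’, as written above.

```python
# DNA base on the template strand -> RNA base that pairs with it.
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe_from_coding(coding_5to3):
    """mRNA (5' to 3') from the coding strand: same sequence with T replaced by U."""
    return coding_5to3.replace("T", "U")


def transcribe_from_template(template_3to5):
    """mRNA (5' to 3') built by base-pairing along a template written 3' to 5'."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in template_3to5)


coding = "ACGTAGACGTATAGAGCCTAG"
template = "TGCATCTGCATATCTCGGATC"
print(transcribe_from_coding(coding))
print(transcribe_from_template(template))
```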

Translation

• The process of making proteins from mRNA
• A gene uniquely encodes a protein
• There are four bases in DNA (A, C, G, T), and four in RNA (A, C, G, U), but 20 amino acids in protein
• How many nucleotides are required to encode an amino acid in order to ensure correct translation?
  – 4^1 = 4
  – 4^2 = 16
  – 4^3 = 64
• The actual genetic code used by the cell is a triplet.
  – Each triplet is called a codon
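The counting argument is simply a search for the smallest word length k with 4^k >= 20. A short sketch of that arithmetic (the function name and defaults are illustrative):

```python
def min_codon_length(alphabet_size=4, n_amino_acids=20):
    """Smallest k such that alphabet_size**k distinct words cover all amino acids."""
    k = 1
    while alphabet_size ** k < n_amino_acids:
        k += 1
    return k


for k in (1, 2, 3):
    print(k, 4 ** k)
print(min_codon_length())  # 3
```

With 64 codons for 20 amino acids, several codons necessarily map to the same amino acid.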

The Genetic Code

[Table: the 64 codons and the amino acids they encode, indexed by first, second, and third letter]

Translation

• The sequence of codons is translated to a sequence of amino acids

• Gene: -GCT TGT TTA CGA ATT-
• mRNA: -GCU UGU UUA CGA AUU-
• Peptide: -Ala-Cys-Leu-Arg-Ile-
• Start codon: AUG
  – Also codes for methionine
• Stop codons: UGA, UAA, UAG
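The codon-by-codon reading can be sketched with a lookup table. The table below is the standard genetic code written compactly (bases ordered U, C, A, G; `*` marks a stop codon); the function name and the choice to return one-letter amino-acid codes are illustrative. It reads from the first base and does not search for a start codon.

```python
from itertools import product

BASES = "UCAG"
# Amino acids for UUU, UUC, UUA, UUG, UCU, ... in U/C/A/G order; '*' = stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(mrna):
    """Translate an mRNA string codon by codon until a stop codon or the end."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)


print(translate("GCUUGUUUACGAAUU"))  # ACLRI = Ala-Cys-Leu-Arg-Ile
```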

Translation

• Transfer RNA (tRNA): a different type of RNA.
  – Floats freely in the cell.
  – Every amino acid has its own type of tRNA that binds to it alone.
• Binding between anti-codon and codon is crucial.

[Figure: tRNA-Leu and tRNA-Pro pair their anti-codons with codons on the mRNA while the nascent peptide grows]

Transcriptional regulation

[Figure: a gene with its promoter and transcription starting site; RNA polymerase and a transcription factor bind to the promoter]

• RNA polymerase binds to a certain location on the promoter to initiate transcription
• Transcription factors bind to specific sequences on the promoter to regulate transcription
  – Recruit RNA polymerase: induce
  – Block RNA polymerase: repress
  – Multiple transcription factors may coordinate

Splicing

[Figure: a gene (promoter, transcription starting site) is transcribed into pre-mRNA made of 5’ UTR, exon, intron, exon, intron, exon, 3’ UTR; splicing removes the introns to give mature mRNA, in which the open reading frame (ORF) runs from the start codon to the stop codon]

• Pre-mRNA needs to be “edited” to form mature mRNA

Summary

• DNA: a string made from {A, C, G, T}
  – Forms the basis of genes
  – Has 5’ and 3’ ends
  – Normally forms a double strand with its reverse complement
• RNA: a string made from {A, C, G, U}
  – mRNA: messenger RNA
  – tRNA: transfer RNA
  – Other types of RNA: rRNA, miRNA, etc.
  – Has 5’ and 3’ ends
  – Normally single-stranded, but can form secondary structure
• Protein: made from 20 kinds of amino acids
  – Actual worker in the cell
  – Has N-terminal and C-terminal
  – Sequence uniquely determined by its gene via the use of codons
  – Sequence determines structure, structure determines function
• Central dogma: DNA is transcribed to RNA, RNA is translated to protein
  – Both steps are regulated

Experimental techniques to manipulate DNA

DNA synthesis

• Creating DNA synthetically in a laboratory
• Chemical synthesis
  – Chemical reactions
  – Arbitrary sequences
  – Typically around 15-25 bases, single stranded
  – Maximum length 160-200
• Cloning: make copies based on a DNA template
  – Biological reactions
  – Requires template
  – Utilizes same mechanisms as in DNA replication
  – Many copies of a long DNA in a short time

in vitro DNA Cloning

• Polymerase chain reaction (PCR)

[Figure: the double-stranded DNA is denatured; a primer (< 30 bases) binds to each single strand; DNA polymerase extends each primer from its 5’ end toward 3’]
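The reason PCR yields many copies quickly is that each cycle can use every existing molecule as a template. Under the idealized assumption that every cycle exactly doubles the copy number (an assumption added here, with a hypothetical `efficiency` parameter to relax it), the growth is geometric:

```python
def copies_after(initial_copies, cycles, efficiency=1.0):
    """Expected copy number after PCR cycles.

    efficiency=1.0 means perfect doubling per cycle; lower values model the
    fraction of templates that are actually copied in each cycle.
    """
    return initial_copies * (1.0 + efficiency) ** cycles


for n in (10, 20, 30):
    print(n, int(copies_after(1, n)))
```

Starting from a single molecule, 30 ideal cycles already give about a billion copies.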

in vivo DNA Cloning

• Connect a piece of DNA to bacterial DNA, which can then be replicated together with the host DNA

bacterial DNA

DNA sequencing technology

• Read out the letters from a DNA sequence
• Chain-termination method (Sanger method)
  – 1977, Frederick Sanger

GTGAGGCGCTGC

DNA sequencing: Basic idea

• PCR primer extension

5’-TTACAGGTCCATACTA
3’-AATGTCCAGGTATGATACATAGG-5’

• We need to supply A, C, G, T for the synthesis to continue
• Besides A, C, G, T, we add some A*, C*, G*, and T*
  – Very similar to ACGT in all aspects, except that
  – The extension will stop if used
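The read-out logic can be simulated on the primer and template above: every position where a terminator (A*, C*, G*, T*) could be incorporated yields one fragment, and sorting the fragments by length and reading each one's last base spells out the newly synthesized sequence. This is a deterministic sketch that assumes every possible termination product is present; the function names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def terminated_fragments(template_3to5, primer_5to3):
    """All extension products, one per position where a terminator could stop synthesis."""
    fragments = []
    product = primer_5to3
    for base in template_3to5[len(primer_5to3):]:
        product += COMPLEMENT[base]  # base added opposite the template
        fragments.append(product)    # a labeled terminator ends the chain here
    return fragments


def read_sequence(fragments):
    """Sort fragments by length (as on a gel) and read the last base of each."""
    return "".join(frag[-1] for frag in sorted(fragments, key=len))


template = "AATGTCCAGGTATGATACATAGG"  # written 3' to 5'
primer = "TTACAGGTCCATACTA"           # written 5' to 3'
print(read_sequence(terminated_fragments(template, primer)))  # TGTATCC
```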

DNA sequencing, cont

Basecalling

Sequencing speed

• Current methods can directly sequence only relatively short (<1000bp long) DNA fragments in a single reaction

• Automated DNA-sequencing instruments (using gel-filled capillaries) can sequence up to 384 DNA samples in a single batch (run) in up to 24 runs a day: ~ 3,000,000 bases per day

Advances in DNA sequencing

• 1969: three years to sequence 115nt DNA
• 1979: three years to sequence ~1650nt
• 1989: one week to sequence ~1650nt
• 1995: Haemophilus genome sequenced at TIGR - 1,830,138nt
• 2000: Human Genome - working draft sequence, 3 billion bases
• 2004: 454 Life Sciences invented the first new-generation sequencer

The bioinformatics landmark

• Completion of human genome sequencing is a success made possible by
  – Advancement in sequencing technology
  – Speed of computation
  – Algorithm development in bioinformatics
• HGP (Human Genome Project) strategy
  – Hierarchical sequencing
  – Estimated 15 years (1990-2005), completed in 13 years
  – $3 billion
• Celera strategy
  – Whole-genome shotgun sequencing
  – Three years (1998-2001)
  – $300 million

Prior to year 2007

• Over 300 genomes have been sequenced

• ~10^11 - 10^12 nt

Year 2007

• Genomes of three individual humans were sequenced
  – James Watson
  – Craig Venter
  – Yang Huanming
• Cost for sequencing Watson’s genome
  – $3 million, 2 months
  – Compared to $3 billion, 13 years for HGP
• These were achieved without the new-generation sequencing technology!

• June 3 2010: “Illumina Drops Personal Genome Sequencing Price to Below $20,000”

• Sequencing speed has been tremendously improved

• High efficiency and relatively low cost make it possible to sequence the genome of any individual from any species

What’s next?

• Continue to sequence more species? Genome 10K project
• More individuals? 1000 Genomes Project

What to do with those sequences?

Coming next: biological sequence analysis
