CS5263 Bioinformatics
Lecture 1: Introduction
Outline
• Administrivia
• What is bioinformatics
• Why bioinformatics
• Topics in bioinformatics
• What you will & will not learn
• Introduction to molecular biology
Student info
• Your name
• Enrollment status
• Academic background
• Interests
Course Info
• Instructor: Jianhua Ruan
Office: S.B. 4.01.48
Phone: 458-6819
Email: [email protected]
Office hours: Tues 6:30-7:30, Wed 3-4pm
• Web: http://www.cs.utsa.edu/~jruan/teaching/cs5263_fall_2007/
Course description
• A survey of algorithms and methods in bioinformatics, approached from a computational viewpoint.
• Discussions balanced between algorithmic analyses and biological applications
• Prerequisites:
– Knowledge of algorithms and data structures
– Programming experience
– Basic understanding of statistics and probability
– Appetite to learn some biology
Textbooks
• Required:
– An Introduction to Bioinformatics Algorithms, by Jones and Pevzner
• Recommended:
– Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, by Durbin, Eddy, Krogh and Mitchison
• Additional resources
– See course website
Grading
• Attendance: 10%
– At most 2 classes missed without affecting grade
• Homeworks: 50%
– No late submission accepted
– Read the collaboration policy!
• Final project and presentation: 40%
What is bioinformatics
• National Institutes of Health (NIH):
– Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
What is bioinformatics
• National Center for Biotechnology Information (NCBI):
– The field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.
What is bioinformatics
• Wikipedia:
– Bioinformatics refers to the creation and advancement of algorithms, computational and statistical techniques, and theory to solve formal and practical problems posed by or inspired from the management and analysis of biological data.
Why bioinformatics
• Modern biology generates huge amounts of data
– Human genome sequence has 3 billion bases
• Complex relationships among different types of data
– Challenges to integrate and analyze data
• Algorithmic challenges
– Biologists trained in programming are probably not sufficient
• Tremendous needs in both academia and industry
– Job opportunities
• You get the chance to learn something different
Some examples of the central role of CS in bioinformatics
1. Genome sequencing
AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT
3 × 10^9 nucleotides
~500 nucleotides
1. Genome sequencing
AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT
3 × 10^9 nucleotides
Computational Fragment Assembly
• Introduced ~1980
• 1995: assemble up to 1,000,000 long DNA pieces
• 2000: assemble whole human genome
A big puzzle: ~60 million pieces
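The "big puzzle" of fragment assembly can be illustrated with a toy greedy merger: repeatedly join the two fragments with the longest suffix-prefix overlap. This is only a sketch under simplifying assumptions (error-free reads, a single strand, no repeats); the function names and the `min_len` parameter are made up for illustration, and real assemblers are far more elaborate.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Greedily merge the best-overlapping pair until no overlaps remain.

    Returns the remaining contigs (a single string if everything merged).
    """
    # Drop exact duplicates and fragments fully contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more: several contigs remain
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Three overlapping reads taken from the start of the sequence on the slide.
reads = ["ACAGACT", "AGTAGCA", "AGCACAG"]
print(greedy_assemble(reads))
```

The quadratic pair search is fine for three reads but hopeless for ~60 million, which is why assembly is an algorithmic problem rather than a scripting exercise.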
2. Gene Finding
Where are the genes?
In humans:
• ~22,000 genes
• ~1.5% of human DNA
2. Gene Finding
[Figure: gene structure from 5’ to 3’: start codon ATG, Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, stop codon TAG/TGA/TAA, with splice sites at the exon-intron boundaries]
Hidden Markov Models
(Well studied for many years in speech recognition)
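To make the start and stop codon signals concrete, here is a naive scan for open reading frames (ATG followed in-frame by TAG, TGA or TAA). It is not the Hidden Markov Model approach named on the slide: it ignores introns, splice sites and the reverse strand, and the function name and `min_codons` threshold are illustrative assumptions.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int, str]]:
    """Return (start, end, sequence) for each ATG...stop stretch on the given strand.

    Positions are 0-based and `end` is exclusive. All three reading frames
    of the forward strand are scanned.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate gene
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # look for the next start codon in this frame
    return orfs


print(find_orfs("CCATGGCTTGACC"))
```

In eukaryotic genomes such a scan is far too crude, because exons are interrupted by introns; that gap is what probabilistic models such as HMMs are meant to fill.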
3. Protein Folding
• The amino-acid sequence of a protein determines the 3D fold
• The 3D fold of a protein determines its function
• Can we predict the 3D fold of a protein given its amino-acid sequence?
– Holy grail of compbio: a 40-year-old problem
– Molecular dynamics, computational geometry, machine learning, robotics
4. Sequence Comparison—Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Seq 1: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
Match:  || |||||||  ||||x|  ||x|||  |||||
Seq 2: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Sequence Alignment
• Introduced ~1970
• BLAST: 1990, one of the most cited papers in history
• Still a very active area of research
BLAST (query against DB)
• Efficient string matching algorithms
• Fast database index techniques
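A gapped alignment like the one above can be computed with dynamic programming. The sketch below uses the classic Needleman-Wunsch global alignment recurrence, which the slide does not name; the scoring values (+1 match, -1 mismatch, -1 gap) are illustrative assumptions, and BLAST itself is a faster heuristic rather than this exact algorithm.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Needleman-Wunsch global alignment. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


s, top, bottom = global_align("ACGT", "ACT")
print(s)
print(top)
print(bottom)
```

The table has (n+1) × (m+1) cells, so time and memory grow with the product of the sequence lengths, which is why database search relies on indexing and heuristics.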
Sequence comparison is key to
• Finding genes
• Determining function
• Uncovering the evolutionary processes
Sequence conservation implies function
5. Evolution
More than 200 complete genomes have been sequenced
5. Evolution
6. Microarray analysis
Clinical prediction of Leukemia type
• 2 types
– Acute lymphoid (ALL)
– Acute myeloid (AML)
• Different treatment & outcomes
• Predict type before treatment?
Bone marrow samples: ALL vs AML
Measure the expression level of each gene
Some goals of biology for the next 50 years
• List all molecular parts that build an organism
– Genes, proteins, other functional parts
• Understand the function of each part
• Understand how parts interact
• Study how function has evolved across all species
• Find genetic defects that cause diseases
• Design drugs rationally
• Sequence the genome of every human, use it for personalized medicine
• Bioinformatics is an essential component for all the goals above
Major conferences
• ISMB (Summer every year)
• RECOMB (and its satellites) (Spring every year)
• PSB (Jan every year, Hawaii)
• ECCB (Europe)
• CSB (July every year, Stanford)
• Conferences in computer science
– ICDM (conference on data mining)
– ICML (conference on machine learning)
– AAAI (conference on AI)
Major journals
• Bioinformatics
• Journal of Computational Biology
• PLoS Computational Biology
• BMC Bioinformatics
• Genome Biology
• Genome Research
• Nucleic Acids Research
• IEEE Trans on Computational Biology
• Science, Nature, PNAS, Cell, Nature Genetics, Nature Biotech, …
Major Bioinfo research topics
Covered topics
• Sequence analysis
– Alignment
– Motif finding
– Pattern matching
– Phylogenetic tree
• Sequence-based predictions
– Gene components
– RNA structure
• Functional Genomics
– Microarray analysis
– Biological networks
What you will learn
• Basic concepts in molecular biology and genetics
• Selected topics in bioinformatics and challenges
• Algorithms:
– DP, graph, string algorithms
– Statistical learning algorithms: HMM, EM, Gibbs sampling
– Data mining: clustering / classification
What you will not learn
• Existing tools / databases
• Design / perform biological experiments
• Protein structure prediction (commonly avoided by most bioinfo researchers…)
• Building bioinformatics software tools (GUI, database, Perl / Python, …)
Goals
• Basis of sequence analysis and other computational biology algorithms
• Overall picture about the field
• Read / criticize research articles
• Think about the sub-field that best suits your background to explore
• Communicate and exchange ideas with (computational) biologists
Computer Scientists vs Biologists
(courtesy Serafim Batzoglou, Stanford)
Biologists vs computer scientists
• (almost) Everything is true or false in computer science
• (almost) Nothing is ever true or false in Biology
Biologists vs computer scientists
• Biologists seek to understand the complicated, messy natural world
• Computer scientists strive to build their own clean and organized virtual world
Biologists vs computer scientists
• Computer scientists are obsessed with being the first to invent or prove something
• Biologists are obsessed with being the first to discover something
Biologists vs computer scientists
• Biologists are comfortable with the idea that all data have errors, and every rule has exceptions
• Computer scientists are not
Biologists vs computer scientists
• Computer scientists get high-paid jobs after graduation
• Biologists typically have to complete one or more 5-year post-docs...
Molecular biology 101
• Cell
• DNA, RNA, Protein
• Genome, chromosome, gene
• Central dogma
Life
• Categories
– Prokaryotes (e.g. bacteria)
  • Unicellular
  • No nucleus
– Eukaryotes (e.g. fungi, plants, animals)
  • Unicellular or multicellular
  • Has nucleus
• The most important distinction among groups of organisms
Prokaryote vs Eukaryote
• Eukaryotes have many membrane-bounded compartments inside the cell
– Different biological processes occur at different cellular locations
Chemical contents of cell
• Small molecules
– Sugar
– Ions (Na+, K+, Ca2+, Cl-, …)
– …
• Macromolecules (polymers):
– DNA
– RNA
– Protein
– …
• Polymers: “strings” made by linking monomers from a specified set (alphabet)
Polymer Monomer
DNA Deoxyribonucleotides
RNA Ribonucleotides
Protein Amino Acid
DNA
• DNA: forms the genetic material of all living organisms
– Can be replicated and passed to descendants
– Contains information to produce proteins
• To computer scientists, DNA is a string made from alphabet {A, C, G, T}
– e.g. ACAGAACGTAGTGCCGTGAGCG
• Each letter is called a base
– A deoxyribonucleotide
• Length varies, from hundreds to billions
RNA
• Historically thought to be an information carrier only
– DNA => RNA => Protein
– New roles have since been found for RNA
• To computer scientists, RNA is a string made from alphabet {A, C, G, U}
– e.g. ACAGAACGUAGUGCCGUGAGCG
• Each letter is called a base
– A ribonucleotide
• Length varies, from tens to thousands
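Treating DNA and RNA as strings over {A, C, G, T} and {A, C, G, U} makes the DNA => RNA step easy to sketch in code. The snippet below assumes the DNA string is the coding strand, so the RNA is the same sequence with T replaced by U; the helper names are made up for illustration.

```python
DNA_ALPHABET = set("ACGT")


def is_dna(seq: str) -> bool:
    """True if seq is a non-empty string over the DNA alphabet {A, C, G, T}."""
    return bool(seq) and set(seq.upper()) <= DNA_ALPHABET


def dna_to_rna(dna: str) -> str:
    """Write a DNA coding-strand sequence as RNA by replacing T with U."""
    if not is_dna(dna):
        raise ValueError(f"not a DNA sequence: {dna!r}")
    return dna.upper().replace("T", "U")


# The DNA example string from the slides maps to the RNA example string.
print(dna_to_rna("ACAGAACGTAGTGCCGTGAGCG"))
```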
Protein
• Protein: the actual “worker” for almost all processes in the cell
– Enzymes: speed up reactions
– Signaling: information transduction
– Structural support
– Production of other macromolecules
– Transport
• To computer scientists, protein is a string built from 20 letters
– E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP
• Each letter is called an amino acid
• Lengths: from tens to thousands
Central dogma of molecular biology
DNA/RNA zoom-in
• Commonly referred to as Nucleic Acid
• DNA: Deoxyribonucleic acid
• RNA: Ribonucleic acid
• Found mainly in the nucleus of a cell (hence “nucleic”)
• Contain phosphoric acid as a component (hence “acid”)
• They are made up of nucleotides
Nucleotides
• A nucleotide has 3 components
– Sugar (ribose in RNA, deoxyribose in DNA)
– Phosphoric acid
– Nitrogen base
  • Adenine (A)
  • Guanine (G)
  • Cytosine (C)
  • Thymine (T) or Uracil (U)
Monomers of RNA
• A ribonucleotide has 3 components
– Sugar: ribose
– Phosphate group
– Nitrogen base
  • Adenine (A)
  • Guanine (G)
  • Cytosine (C)
  • Uracil (U)
Monomers of DNA
• A deoxyribonucleotide has 3 components
– Sugar: deoxyribose
– Phosphoric acid
– Nitrogen base
  • Adenine (A)
  • Guanine (G)
  • Cytosine (C)
  • Thymine (T)
Polymerization: Nucleotides => nucleic acids
[Figure: a chain of three nucleotide units, each consisting of a phosphate, a sugar, and a nitrogen base]
DNA
[Figure: a single DNA strand, 5’-AGCGACTG-3’ (or simply AGCGACTG), with the phosphate, sugar, and base labeled and the sugar carbons numbered 1 to 5]
Many biological processes go from 5’ to 3’, e.g. DNA replication, transcription, etc.
RNA
[Figure: a single RNA strand, 5’-AGUGACUG-3’ (or simply AGUGACUG), with the phosphate, sugar, and base labeled and the sugar carbons numbered 1 to 5]
Many biological processes go from 5’ to 3’, e.g. transcription.
[Figure: two anti-parallel DNA strands joined by base pairs]
Base-pair:
A = T
G = C
Forward (+) strand: 5’-AGCGACTG-3’
Backward (-) strand: 3’-TCGCTGAC-5’
One strand is said to be reverse-complementary to the other
Reverse-complementary sequences
• 5’-ACGTTACAGTA-3’
• The reverse complement is:
3’-TGCAATGTCAT-5’
=>
5’-TACTGTAACGT-3’
• Or simply written as
TACTGTAACGT
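The two steps above (complement each base, then read the result 5’ to 3’) can be sketched in a few lines. The function name is an illustrative choice, and the sketch assumes the input contains only A, C, G and T.

```python
# Map each base to its pairing partner: A <-> T, C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse so the result reads 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ACGTTACAGTA"))  # the example sequence from the slide
```

Applying the function twice returns the original sequence, which matches the statement that each strand is the reverse complement of the other.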
DNA double helix
Orientation of the double helix
• Double helix is anti-parallel
– 5’ end of each strand at 3’ end of the other
– 5’ to 3’ motion in one strand is 3’ to 5’ in the other
• Double helix has no orientation
– Biology has no “forward” and “reverse” strand
– Relative to any single strand, there is a “reverse complement” or “reverse strand”
– Information can be encoded by either strand or both strands
5’ TTTTACAGGACCATG 3’
3’ AAAATGTCCTGGTAC 5’
RNA Secondary structures
• RNAs are normally single-stranded
• Can form complex structure by self-base-pairing
• A=U, C=G
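Self-base-pairing can be explored with a small dynamic program that counts the maximum number of nested A=U and C=G pairs in a single strand. This is a Nussinov-style sketch, which the slides do not name; the `min_loop` parameter (minimum number of unpaired bases enclosed by a pair) is an illustrative assumption, and only the two pairings listed above are allowed.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs the strand can form with itself."""
    rna = rna.upper()
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within rna[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with some k, leaving at least min_loop bases between
            for k in range(i, j - min_loop):
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


# A hairpin-like toy sequence: three G-C pairs can close around the AAAU loop.
print(max_base_pairs("GGGAAAUCCC"))
```

Counting pairs is only a crude stand-in for structural stability, but it shows how a folding question reduces to a table-filling algorithm over substrings.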