Post on 15-Jan-2016
Plan for day 1
1. The course
   1. Registration
   2. Layout
   3. Expectations
   4. Evaluation and exam
2. What is bioinformatics?
3. Setup and connect computers
LUNCH
1. 13:00. Setup and connect computers
2. Software overview
3. CLC Combined Workbench (presentation, installation, demo)
4. Install and play with general computer tools
What is bioinformatics?
Anders Krogh & Morten Lindow
The Bioinformatics Centre
Dept of Biology
University of Copenhagen
A big change in biology has taken place
Before: Measure the expression of a single gene in a single sample
Now: Measure the expression of all genes in many samples
Mutations
Before you mapped mutations in bacteria (lots of work, I think)
Now you sequence the whole genome with ”next generation sequencing”
Protein interactions
Before: Find interaction partners for one protein
After: Find interaction partners for all proteins
Biology has become an information science
Genome sequencing is just the beginning
[Slide graphic: several screens of raw genomic DNA sequence, starting GGCAAACCCTGTGATTCAGTTTGTCTGTGATTTGCTTAACC...]
Where are the genes?
How are they regulated?
What do they do?
How do they interact?
How did they evolve?
What about the rest of the genome?
Experimental labs need informatics
In many labs bioinformatics is the bottleneck.
Example:
• You want to study differences in miRNA expression in cancer vs. normal tissue.
• Short RNAs are extracted. They are mailed to a company for sequencing.
• You get a hard disk in return, full of short sequences.
• The rest is bioinformatics.
Definition
The book:”bioinformatics involves the technology that uses computers for analysis, storage, retrieval, manipulation and distribution of information related to biological macromolecules such as DNA, RNA and proteins.”
Definition
Wikipedia: ”… The terms bioinformatics and computational biology are often used interchangeably.
However bioinformatics more properly refers to the creation and advancement of algorithms, computational and statistical techniques, and theory to solve formal and practical problems posed by or inspired from the management and analysis of biological data.
Computational biology, on the other hand, refers to hypothesis-driven investigation of a specific biological problem using computers, carried out with experimental and simulated data, with the primary goal of discovery and the advancement of biological knowledge.”
Bioinformatics?
Search for homologs to a protein sequence
Retrieve information about a genome segment
Predict the structure of an RNA molecule
Build a phylogenetic tree connecting a set of proteins
Find differentially expressed genes using microarrays
Make a model of protein-protein interactions
Analyze experimental data in a spreadsheet
Make an equation describing a neuronal action potential
Construct cardiac blood-flow model
Differential equations to describe prey/predator dynamics
Some challenges in bioinformatics
How to fully decipher the digital content of the genome
How to analyze expression data
How to extract regulatory networks from the above
How to integrate multiple high-throughput data types
How to visualize and explore large scale multi-dimensional data
How to predict protein structure and function ab initio
How to identify signatures for cellular states (healthy vs diseased)
How to build hierarchical models across multiple scales of time and space
How to reduce complex multi-dimensional models to underlying principles
Inspired by Leroy Hood
Example
In which you will learn a bit about:
accessing and searching for information in bio-databases
what microRNAs are
Prediction of RNA structure
Imagine
You are studying the oncogene c-Myc (a transcription factor)
You have isolated a complex containing the mRNA for c-Myc
In this complex you find a small RNA
You get excited!
You manage to clone and sequence it
• caaagugcuuacagugcagguagu
Now what?
Finding it in the genome
Is this a known molecule?
Since the human genome has been fully sequenced:
• We must be able to find out where it is encoded
• Go to a genome browser
Wow! It is a microRNA
Sidestep: What are microRNAs?
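The lookup on this slide is done in a genome browser. As a rough illustration of what such a search does, here is a minimal Python sketch. It assumes exact string matching on both strands, which is far simpler than a real search tool, and the toy "genome" and its flanks are made up for the example. Only the small RNA sequence comes from the slide.

```python
# Toy stand-in for a genome-browser sequence search: exact matching only.

def rna_to_dna(rna: str) -> str:
    """Convert an RNA read to the DNA alphabet (U -> T), upper-cased."""
    return rna.upper().replace("U", "T")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(dna))


def find_exact_hits(query_rna: str, genome: str):
    """Return (strand, 0-based start) for every exact occurrence of the query."""
    query = rna_to_dna(query_rna)
    genome = genome.upper()
    hits = []
    for strand, pattern in (("+", query), ("-", reverse_complement(query))):
        start = genome.find(pattern)
        while start != -1:
            hits.append((strand, start))
            start = genome.find(pattern, start + 1)
    return hits


small_rna = "caaagugcuuacagugcagguagu"  # sequence from the slide
# Hypothetical toy "genome": made-up flanks around the DNA version of the read.
toy_genome = "GATTACAGG" + rna_to_dna(small_rna) + "CCTTGA"
print(find_exact_hits(small_rna, toy_genome))
```

A real genome search also has to tolerate mismatches and index billions of bases, which is why a dedicated tool is used in practice.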
What are miRNAs?
The RNA revolution
Biology’s Big Bang
• 10 years ago: RNA was considered an uninteresting messenger for the proteins

• The non-coding part of the genome (98%) was considered junk
The Economist, June 2007
Beware of the RNA!
• It is your RNA that separates you from a worm – not your proteins!
• It is the RNA that regulates your genes – as much as proteins!
• New types of RNA are discovered every month
• Most of a genome is transcribed
• 98% of the genome is probably important (my guess)
The RNA operating system
Genome
Transcriptome
Proteome
Regulation by proteins
siRNA/miRNA
Massive regulation by RNA
Imprinting – methylationSplicing
Ribozymes
MicroRNA
• Small (20-22 nt) RNAs
• Pre-miRNA forms hairpin structure
• Involved in post-transcriptional regulation and gene silencing (methylation)
• Important in development, brain, cancer, etc.
• Evolutionarily conserved (?)
miRNA logic
ATCTGCCACCCTACAGAGTTTGACTTTTACCTCTGTAGTCATGCTGGTATTCAGGGCACTTCTCGACCTGCTCATTACCACGTTCTTTGGGATGAGAACAACTTTACTGCAGATGGACTTCAATCTCTGACCAATAACTTATGTTACACGTATGCAAGAT
miRNA gene
Pri-miRNA
A microRNA
Inhibit mRNA translation
Drosha
Dicer
Export
Pre-miRNA
Animal & Plant miRNA
Some miRNAs occur in clusters
miRNA targets
• Very few experimentally validated targets, mostly in fly and worm
• We have to rely on bioinformatic target predictions. Probably very noisy.
– 10% of all genes regulated by miRNAs (Enright et al., 2003)
– 30% of all genes regulated by miRNAs (Lewis et al., 2004)
– ~all genes regulated to some degree (others)
Three main types of target sites
(a) Canonical sites: good or perfect complementarity. Characteristic bulge in the middle.
(b) Dominant seed sites: perfect seed (bases 2-8) match, but poor 3’ end complementarity.
(c) Compensatory sites: mismatch or wobble in seed region. Compensate at the 3’ end.
From Mazière, P. and Enright, A., Drug Discovery Today, 2007
Principal criteria to predict miRNA targets
• Seed complementarity: seed regions (bases 2-8) of miRNA sequences are complementary to the 3’ UTR.
• Target sites are conserved in other genomes. (May miss targets of recently evolved miRNAs.)
• Target multiplicity: multiple binding sites for miRNA
• Thermodynamics of RNA-RNA duplex
• Target structure: lack of strong secondary structure at miRNA-target binding site may be an important feature
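The first and third criteria above (seed complementarity and target multiplicity) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it requires a perfect Watson-Crick match to bases 2-8 and ignores conservation, thermodynamics and target structure. The 3’ UTR string is made up for the example; the miRNA is the sequence from the earlier slide.

```python
def seed_site_pattern(mirna: str) -> str:
    """What a perfect seed site looks like on the mRNA (5'->3'):
    the reverse complement of miRNA bases 2-8, in RNA letters."""
    seed = mirna.upper().replace("T", "U")[1:8]  # bases 2-8 (1-based)
    complement = {"A": "U", "U": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(seed))


def find_seed_sites(mirna: str, utr: str):
    """Return 0-based start positions of perfect seed sites in a 3' UTR."""
    pattern = seed_site_pattern(mirna)
    utr = utr.upper().replace("T", "U")
    last_start = len(utr) - len(pattern)
    return [i for i in range(last_start + 1) if utr.startswith(pattern, i)]


mirna = "caaagugcuuacagugcagguagu"         # sequence from the slide
toy_utr = "AAUCGCACUUUAGGCCUUGCACUUUCC"    # hypothetical 3' UTR, made up
sites = find_seed_sites(mirna, toy_utr)
print(seed_site_pattern(mirna), sites, "multiplicity:", len(sites))
```

Counting the hits gives a crude multiplicity measure; real predictors combine it with the remaining criteria on the slide.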
Overlap between methods
Hammell et al., Nature Methods 5:813-819
Finding it in the genome
Is this a known molecule?
Since the human genome has been fully sequenced:
• We must be able to find out where it is encoded
• Go to a genome browser
Wow! It is a microRNA
Sidestep: What are microRNAs?
Let’s assume this was NOT known already
RNA folding
Can it fold as a hairpin?
• Get the sequence with flanks (from genome browser)
• Fold it at Vienna RNA
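To give a feel for what a folding program computes, here is a minimal Python sketch of Nussinov-style base-pair maximisation. This is a deliberately simpler technique than the minimum free energy folding that dedicated RNA folding software performs, so it only hints at whether a hairpin is possible. The test sequence and the minimum loop length of 3 are assumptions made for the example.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov_fold(seq: str, min_loop: int = 3):
    """Maximise the number of nested base pairs; return (pairs, dot-bracket)."""
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    best = [[0] * n for _ in range(n)]  # best[i][j]: max pairs within seq[i..j]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case: base j stays unpaired
            for k in range(i, j - min_loop):  # case: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score

    # Trace back one optimal structure.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # too short to contain a pair
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = best[i][k - 1] if k > i else 0
                if left + 1 + best[k + 1][j - 1] == best[i][j]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return (best[0][n - 1] if n else 0), "".join(structure)


# Hypothetical toy hairpin, not the real miRNA precursor.
pairs, structure = nussinov_fold("GGGAAAACCC")
print(pairs, structure)
```

A long run of matched brackets around a short unpaired loop is the hairpin shape expected of a pre-miRNA.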
RNA
?
More details: RNA-lecture
Summary so far
ATCTGCCACCCTACAGAGTTTGACTTTTACCTCTGTAGTCATGCTGGTATTCAGGGCACTTCTCGACCTGCTCATTACCACGTTCTTTGGGATGAGAACAACTTTACTGCAGATGGACTTCAATCTCTGACCAATAACTTATGTTACACGTATGCAAGAT
miRNA gene
RNA
protein
A microRNA
Prediction of precursor structure
Identify transcript
What controls the controller?
Find the transcription start site
• Use and integrate existing data
  • Genome browser: Known transcripts, genome annotation (from cDNA data)
  • Auxiliary information (not yet in genome browser)
    • Known 5’ ends (from RIKEN CAGE-tags)
    • Known RNA polymerase II binding sites (from ChIP)
• Use or construct predictive models
  • Machine learning / Inference (HMM, Neural Nets, SVM, GLM)
  • You need a bioinformatician for this!!
ATCTGCCACCCTACAGAGTTTGACTTTTACCTCTGTAGTCATGCTGGTATTCAGGGCACTTCTCGACCTGCTCATTACCACGTTCTTTGGGATGAGAACAACTTTACTGCAGATGGACTTCAATCTCTGACCAATAACTTATGTTACACGTATGCAAGAT
?-- miR-17 --?
?
Summary so far
ATCTGCCACCCTACAGAGTTTGACTTTTACCTCTGTAGTCATGCTGGTATTCAGGGCACTTCTCGACCTGCTCATTACCACGTTCTTTGGGATGAGAACAACTTTACTGCAGATGGACTTCAATCTCTGACCAATAACTTATGTTACACGTATGCAAGAT
miRNA gene
RNA
protein
A microRNA
Prediction of binding sites
Prediction of precursor structure
Identify transcript
Prediction of transcription factor binding sites
CCCTACAGAGTTTGACTTTTACCTCTGTAGTCATGCTGGTATTCAGGGCACTTCTCGACCTGCTCATTACCACGTTCTTTGGGATGAGAACAACTTTACTGCAGATGGACTTCAATCTCTGACCAATAACTTATGTTACACGTATGCAAGAT
Do certain combinations of TFs occur together?
In certain groups of genes?
Is this significant?
What biological sense does it make?
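One common way to approach the "Is this significant?" question is a hypergeometric tail test on how many genes carry sites for both factors. The Python sketch below is an illustration only. It assumes that, by chance, sites for the two factors fall independently across genes, and all the counts in the example are made up.

```python
from math import comb


def hypergeom_tail(total: int, with_a: int, with_b: int, both: int) -> float:
    """P(overlap >= both) when `with_b` genes are drawn at random from
    `total` genes, of which `with_a` carry a site for factor A."""
    upper = min(with_a, with_b)
    favourable = sum(
        comb(with_a, x) * comb(total - with_a, with_b - x)
        for x in range(both, upper + 1)
    )
    return favourable / comb(total, with_b)


# Hypothetical counts: 20000 genes, 400 with a site for TF A,
# 300 with a site for TF B, 25 with sites for both.
p_value = hypergeom_tail(total=20000, with_a=400, with_b=300, both=25)
print(f"P(overlap >= 25 by chance) = {p_value:.3g}")
```

A small probability says the co-occurrence is unlikely under independence. It does not say what the combination means biologically, which is the last question on the slide.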
In another lecture: Motif Search
UCSC transcription factor track
Prediction of microRNA targets
ATCTGCCACCCTACAGAGTTTGACTTTTACCTCTGTAGTCATGCTGGTATTCAGGGCACTTCTCGACCTGCTCATTACCACGTTCTTTGGGATGAGAACAACTTTACTGCAGATGGACTTCAATCTCTGACCAATAACTTATGTTACACGTATGCAAGAT
Transcriptional unit
RNA
?
Prediction of microRNA targets ?
RNAs interact by forming base pairs (A-U, C-G, G-U)
Align microRNA and target (more details in Alignment-lecture)
Build in biology:
• Some part of the miRNA is more important than others
• Binding sites conserved in evolution tend to be more functional
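The two ideas above, pairing rules that include G-U and a part of the miRNA that counts more, can be sketched as a toy ungapped scoring scheme in Python. This is not the scoring used by any published target predictor. The double weight for bases 2-8 and the example UTR are assumptions chosen for illustration, and conservation is not modelled.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def duplex_score(mirna: str, site: str, seed_weight: float = 2.0) -> float:
    """Score a miRNA (5'->3') against an equal-length mRNA site (5'->3').
    Each base pair scores 1; pairs at miRNA bases 2-8 score `seed_weight`."""
    mirna = mirna.upper().replace("T", "U")
    site = site.upper().replace("T", "U")
    if len(mirna) != len(site):
        raise ValueError("miRNA and site must have the same length")
    score = 0.0
    for i, base in enumerate(mirna):
        # Strands are antiparallel: miRNA base i faces the site read from its 3' end.
        partner = site[len(site) - 1 - i]
        if (base, partner) in PAIRS:
            score += seed_weight if 1 <= i <= 7 else 1.0
    return score


def best_site(mirna: str, utr: str, seed_weight: float = 2.0):
    """Slide the miRNA along the UTR; return (best score, 0-based start)."""
    n = len(mirna)
    if len(utr) < n:
        raise ValueError("UTR is shorter than the miRNA")
    scored = [
        (duplex_score(mirna, utr[start:start + n], seed_weight), start)
        for start in range(len(utr) - n + 1)
    ]
    return max(scored, key=lambda item: item[0])


mirna = "caaagugcuuacagugcagguagu"  # sequence from the slide
# Hypothetical UTR: made-up flanks around a perfectly complementary site.
toy_utr = "GGAUCC" + "ACUACCUGCACUGUAAGCACUUUG" + "AAUAAA"
print(best_site(mirna, toy_utr))
```

Real alignment-based predictors also allow gaps and bulges and add duplex energy and conservation filters on top of a score like this.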
miRanda predictions
Regulatory systems
A microRNA
Regulates other RNA (prevents them from being translated to proteins)

ATCTGCCACCCTACAGAGTTTGACTTTTACCTCTGTAGTCATGCTGGTATTCAGGGCACTTCTCGACCTGCTCATTACCACGTTCTTTGGGATGAGAACAACTTTACTGCAGATGGACTTCAATCTCTGACCAATAACTTATGTTACACGTATGCAAGAT
Transcriptional unit
RNA
Maybe feedback regulation?
A feedback loop? miR-155 and Bach
ATCTGCCACCCTACAGAGTTTGACTTTTACCTCTGTAGTCATGCTGGTATTCAGGGCACTTCTCGACCTGCTCATTACCACGTTCTTTGGGATGAGAACAACTTTACTGCAGATGGACTTCAATCTCTGACCAATAACTTATGTTACACGTATGCAAGAT
BIC - mir-155
Bach2-binding sites (repressor)
Bach2-proteins miR-155
“Lack of BIC and microRNA miR-155 expression in primary cases of Burkitt lymphoma.” Genes Chromosomes Cancer. 2006 Feb;45(2):147-53
“These results indicate that BACH2 plays important roles in regulation of B cell development.” Oncogene. 2000 Aug 3;19(33):3739-49
Bioinformatics is like LEGO®
Build using different bricks to get biological knowledge
• Databases of experimental data (sequence, genome annotation, molecule interactions, etc.)
• Scan for transcription factor binding sites
• RNA folding and classification
• miRNA target prediction
Or design your own LEGO bricks!
• Enter the master’s program
Masters of Bioinformatics
What you have seen
Database
• UCSC human genome browser
• Using known information to find likely transcription start site
• The horror of ids/names
Alignment and sequence search
• Sequence search with BLAT against human genome
• miRanda - Alignment to find miRNA targets
RNA
• RNA folding of miRNA-precursor
Promoter analysis
• Predicted transcription factor binding sites