Transcript
Page 1: Introduction to Bioinformatics

Introduction to Bioinformatics

Part 0: So You Want To Be a Computational Biologist?

Leighton Pritchard and Peter Cock

Page 2: Introduction to Bioinformatics

Bertrand Russell

Page 3: Introduction to Bioinformatics

Table of Contents

Introduction

Recording Your Work

Conclusion

Page 4: Introduction to Bioinformatics

What is this “bioinformatics” thing, anyway?

• Bioinformatics: biology using computational and mathematical tools

• A discipline within biology

  • Loman & Watson (2013) “So you want to be a computational biologist?” http://dx.doi.org/10.1038/nbt.2740

  • Welch et al. (2014) “Bioinformatics Curriculum Guidelines: Toward a Definition of Core Competencies” http://dx.doi.org/10.1371/journal.pcbi.1003496

• Watson (2014) “The only core competency you need” http://bit.ly/1fS4iDJ (blog)

Page 5: Introduction to Bioinformatics

Some uncomfortable truths

• This one-day course will not make you a bioinformatician

• But practice will. . .

• The best way to learn is to do (“I don’t know how to do this yet, but I will find out.”)

• http://bit.ly/Rq0D61 (“Bioinformatics is a way of life”)

• Most bioinformatics is problem-solving

• We will introduce some useful tools and concepts

Page 8: Introduction to Bioinformatics

What it takes to be a bioinformatician

• Patience (problem-solving)

• Suspicion (statistics)

• Biological knowledge

• Social skills (no-one knows everything: ask!)

• Lots of practice

• Self-confidence (challenge results and dogma)

• Core domain skills: biology, computer science, statistics

• Deliver results (qualified, honest)

• Watson (2014) “What it takes to be a bioinformatician” http://bit.ly/1jDuQsO (blog)

Page 9: Introduction to Bioinformatics

More general advice?

• Ask us (we do this a lot)

• BioStars (https://www.biostars.org)

• SeqAnswers (http://seqanswers.com/)

• PLoS Comp Biol collections (http://www.ploscollections.org/static/pcbiCollections)

Page 11: Introduction to Bioinformatics

Why Do It?

• Doing bioinformatics is doing science: keep a lab book!

• You will not remember multiple files, analysis details, etc. in a week/month/six months/a year/three years

• Noble (2009) http://dx.doi.org/10.1371/journal.pcbi.1000424

• Baggerly & Coombes (2009) http://arxiv.org/pdf/1010.1092.pdf

Page 12: Introduction to Bioinformatics

How To Do It? I

• There is no one correct way, but. . .

• Think about data/docs/project structure before you start
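One way to make "think about structure before you start" concrete is to script the project skeleton, so that every project begins with the same layout. The sketch below is only an illustration: the directory names in `LAYOUT` and the helper `create_project` are assumptions, not a layout prescribed by the course.

```python
from pathlib import Path

# Hypothetical layout: these directory names are illustrative assumptions.
LAYOUT = ("data/raw", "data/processed", "docs", "scripts", "results")


def create_project(root):
    """Create the skeleton directories and a README.md stub under root."""
    root = Path(root)
    for sub in LAYOUT:
        (root / sub).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text(
            f"# {root.name}\n\nDescribe the project, data sources and analyses here.\n"
        )
    # Return the directories that now exist, relative to the project root.
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir())


if __name__ == "__main__":
    import tempfile

    # Demonstrate in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for name in create_project(Path(tmp) / "my_project"):
            print(name)
```

Keeping the layout in a script also documents the decision itself, which fits the lab-book approach.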

Page 13: Introduction to Bioinformatics

How To Do It? II

• Use plain text where possible

• Use version control

• Keep backups

• Different tools suit different purposes: code vs. data vs. analysis vs. ...

• Find a way that works for you.

Page 14: Introduction to Bioinformatics

How To Do It? III

• Reproducibility is key!

• Scripts and pipelines are better for this than notes of what you did

• Also better for version control, and reuse

• Avoid unnecessary duplication

  • Someone else may have solved your problem

  • One (backed up) read-only copy of raw data, keep analyses separate
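The "read-only copy of raw data" advice can be enforced rather than merely remembered. The following is a minimal sketch, assuming a POSIX-style file system; the helper name `make_read_only` is hypothetical, and backups still need to be handled separately.

```python
import stat
from pathlib import Path

# Write permission bits for user, group and others.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_read_only(path):
    """Clear the write bits on a file, leave other bits alone, return the new mode."""
    path = Path(path)
    mode = stat.S_IMODE(path.stat().st_mode)
    path.chmod(mode & ~WRITE_BITS)
    return stat.S_IMODE(path.stat().st_mode)
```

Applied to every file under a raw-data directory, this makes an accidental overwrite by an analysis script fail loudly instead of silently.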

Page 15: Introduction to Bioinformatics

Plain Text Files

• README.txt/README.md in each directory/folder

• Plain text is always human-readable

  • Markdown (https://daringfireball.net/projects/markdown/basics)

  • RST (http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html)

Page 16: Introduction to Bioinformatics

Galaxy workflows

• Use through browser, graphical interface

• Reproducible, shareable, documented, reusable analyses

• Wraps standard bioinformatics tools

• Local instance (http://ppserver/galaxy) uses JHI cluster

Page 17: Introduction to Bioinformatics

script

• Writes your terminal activity to a plain text file

• Saves effort copy/pasting and typing commands into a lab book, as you go

• Easy to use with other tools

• Use man script at your terminal to find out more

Page 18: Introduction to Bioinformatics

MediaWiki

• Useful for shared projects/data

• Automatic version control and attribution

• Many local instances at JHI (ask around)

Page 19: Introduction to Bioinformatics

A language notebook

• e.g. IPython Notebook, Mathematica, MATLAB cells

• Integrates live code and analysis with lab-book

Page 20: Introduction to Bioinformatics

LaTeX

• Powerful, versatile typesetting system (e.g. these slides)

• Similar to markup/markdown

• Pros: great for mathematical/computing work, writing a thesis

• Cons: not easy to pick up

Page 22: Introduction to Bioinformatics

In Conclusion

• Bioinformatics is just biology using computers and mathematics

• You still need to “do science” in the same way:

  • Keep accurate records

  • Plan and conduct experiments (with controls)

  • Follow the literature

  • Professional development

Page 23: Introduction to Bioinformatics

An Introduction to Bioinformatics Tools

Part 1: Golden Rules of Bioinformatics

Leighton Pritchard and Peter Cock

Page 24: Introduction to Bioinformatics

On Confidence

“Ignorance more frequently begets confidence than does knowledge: it is those who know little, not those who know much, who so positively assert. . .” - Charles Darwin

Page 25: Introduction to Bioinformatics

Table of Contents

Rule 0

Rule 1

Rule 2

Rule 3

Conclusions

Page 26: Introduction to Bioinformatics

Zeroth Golden Rule of Bioinformatics

• No-one knows everything about everything - talk to people!

  • local bioinformaticians, mailing lists, forums, Twitter, etc.

• Keep learning - there are lots of resources

• There is no free lunch - no method works best on all data

• The worst errors are silent - share worries, problems, etc.

• Share expertise (see first item)

Page 28: Introduction to Bioinformatics

First Golden Rule of Bioinformatics

• Always inspect the raw data (trends, outliers, clustering)

• What is the question? Can the data answer it?

• Communicate with data collectors! (don’t be afraid of pedantry)

  • Who? When? How?

  • You need to understand the experiment to analyse it (easier if you helped design it).

  • Be wary of block effects (experimenter, time, batch, etc.)

Page 30: Introduction to Bioinformatics

Second Golden Rule of Bioinformatics

• Do not trust the software: it is not an authority

  • Software does not distinguish meaningful from meaningless data

  • Software has bugs

  • Algorithms have assumptions, conditions, and applicable domains

  • Some problems are inherently hard, or even insoluble

• You must understand the analysis/algorithm

• Always sanity test

• Test output for robustness to parameter (including data) choice

Page 32: Introduction to Bioinformatics

Third Golden Rule of Bioinformatics

• Everyone has expectations of their data/experiment

  • Beware cognitive errors, such as confirmation bias!

  • System 1 vs. System 2 ≈ intuition vs. reason

• Think statistically!

  • Large datasets can be counterintuitive and appear to confirm a large number of contradictory hypotheses

  • Always account for multiple tests.

  • Avoid “data dredging”: intensive computation is not an adequate substitute for expertise

• Use test-driven development of analyses and code

  • Use examples that pass and fail
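As an illustration of "always account for multiple tests", the sketch below applies a Bonferroni correction, one of the simplest (and most conservative) adjustments. This is a swapped-in example, since the slides do not name a method, and the function name `bonferroni` and the p-values are invented for demonstration.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted_p, is_significant) pairs using Bonferroni correction.

    Each p-value is multiplied by the number of tests and capped at 1.0.
    """
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return [(p_adj, p_adj <= alpha) for p_adj in adjusted]


if __name__ == "__main__":
    # Invented p-values: all four look "significant" at 0.05 before correction.
    for p_adj, significant in bonferroni([0.01, 0.04, 0.03, 0.005]):
        print(f"{p_adj:.3f} {significant}")
```

In the spirit of test-driven development, a check for this function should include both a p-value that stays significant after correction and one that does not.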

Page 34: Introduction to Bioinformatics

In Conclusion

• Always communicate!

  • worst errors are silent

• Don’t trust the data

  • formatting/validation/category errors - check!

  • suitability for scientific question

• Don’t trust the software

  • software is not an authority

  • always benchmark, always validate

• Don’t trust yourself

  • beware cognitive errors

  • think statistically

  • biological “stories” can be constructed from nonsense

Page 35: Introduction to Bioinformatics

An Introduction to Bioinformatics Tools

Part 2: BLAST

Leighton Pritchard and Peter Cock

Page 36: Introduction to Bioinformatics

Table of Contents

Introduction

Alignment

BLAST

BLAST Statistics

Using BLAST

Page 37: Introduction to Bioinformatics

Learning Outcomes

• How BLAST searches work

• How the way BLAST searches work affects your results

• Why search parameters matter

• Setting search parameters

Page 38: Introduction to Bioinformatics

About Bioinformatics Tools

Page 39: Introduction to Bioinformatics

A Recent Twitter Conversation

Page 41: Introduction to Bioinformatics

Why So Much Detail?

• You’re going to go away and do lots of BLAST searches

• Everyone uses BLAST - not everyone uses it well

• Easier to fix problems if you know how it works

• Understanding what’s going on helps avoid misuse/abuse

• Understanding what’s going on helps use the tool more effectively

• Not so much detail, really

  • like knowing about Tm and ion concentration effects, not molecular orbitals or thermodynamics (but ask if you’re interested ;) )

Page 43: Introduction to Bioinformatics

What BLAST Is

• BLAST:

  • Basic (it’s actually sophisticated)

  • Local Alignment (what it does: local sequence alignment)

  • Search Tool (what it does: search against a database)

• The most important software package in bioinformatics?

• Fast, robust, sequence similarity search tool

• Does not necessarily produce optimal alignments

• Not foolproof.

Page 45: Introduction to Bioinformatics

What A BLAST Search Is

• Every BLAST search is an in silico hybridisation experiment

• BLAST search = identification of similar sequences in a given database

• Results depend on:

  • query sequence

  • BLAST program (including version and BLAST vs BLAST+)

  • database

  • parameters

Page 46: Introduction to Bioinformatics

Alignment Search Space

Consider two biological sequences to be aligned. . .

• One sequence on the x-axis, the other on the y-axis

• Each point in space is a pairing of two letters

• Ungapped alignments are diagonal lines in the search space, gapped alignments have short “breaks”

• There may be one or more “optimal” alignments

Page 47: Introduction to Bioinformatics

Global vs Local Alignment

• Global alignment: sequences are aligned along their entire lengths

• Local alignment: the best subsequence alignment is found

• Consider an alignment of the same gene from two distantly-related eukaryotes, where:

  • Exons are conserved and small in relation to gene locus size

  • Introns are not well-conserved but large in relation to gene locus size

• Local alignment will align the conserved exon regions

• Global alignment will align the whole (mostly unrelated) locus

Page 49: Introduction to Bioinformatics

Our Goal

• We aim to align the words

  • COELACANTH

  • PELICAN

• Each identical letter (match) scores +1

• Each different letter (mismatch) scores -1

• Each gap scores -1

• All sequence alignment is maximisation of an alignment score - a mathematical operation.

Page 52: Introduction to Bioinformatics

Initialise the matrix

Page 53: Introduction to Bioinformatics

Fill the cells

Page 54: Introduction to Bioinformatics

Fill the matrix – represents all possible alignments & scores

Page 55: Introduction to Bioinformatics

Traceback
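The initialise / fill / traceback steps for the COELACANTH and PELICAN example can be sketched in a few lines of Python. This is a minimal illustration of global (Needleman-Wunsch style) alignment with the slide's scoring scheme (match +1, mismatch -1, gap -1); the function name and the tie-breaking order in the traceback are assumptions, so where several alignments share the optimal score it reports just one of them.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1

    # Initialise: first row and column hold cumulative gap penalties.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        """Score for aligning a[i-1] with b[j-1]."""
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each cell is the best of diagonal (pair), up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Traceback: walk from the bottom-right corner back to the origin.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    best, row_a, row_b = needleman_wunsch("COELACANTH", "PELICAN")
    print(best)
    print(row_a)
    print(row_b)
```

Changing the `match`, `mismatch` or `gap` values changes which alignment is "optimal", which is a quick way to see that the scoring scheme carries all of the biological assumptions.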

Page 56: Introduction to Bioinformatics

Algorithms

• Global: Needleman-Wunsch (as in example)

• Local: Smith-Waterman (differs from example)

• Biological information encapsulated only in the scoring scheme (matches, mismatches, gaps)

• NW/SW are guaranteed to find the optimal match with respect to the scoring system being used

• BUT the optimal alignment is a biological approximation: no scoring scheme encapsulates biological “truth”

• Any pair of sequences can be aligned: finding meaning is up to you

Page 60: Introduction to Bioinformatics

BLAST Is A Heuristic

• BLAST does not use Needleman-Wunsch or Smith-Waterman

• BLAST approximates dynamic programming methods

• BLAST is not guaranteed to give a mathematically optimal alignment

• BLAST does not explore the complete search space

• BLAST uses heuristics (loosely-defined rules) to refine High-scoring Segment Pairs (HSPs)

• BLAST reports only “statistically-significant” alignments (dependent on parameters)

Page 64: Introduction to Bioinformatics

Steps in the Algorithm

1. Seeding

2. Extension

3. Evaluation

Page 65: Introduction to Bioinformatics

Word Hits

• A word hit is a short sequence and its neighbourhood

• neighbourhood: words of same length whose aligned score is greater than or equal to a threshold value T

• Three parameters: scoring matrix, word size W, and T
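The idea of a word's neighbourhood can be shown with a brute-force sketch: enumerate every word of length W and keep those scoring at least T against the query word. The toy scoring function below (match +2, mismatch -1 on a DNA alphabet) is an assumption chosen to keep the example self-contained. It is not a real substitution matrix, and real BLAST implementations build neighbourhoods far more efficiently.

```python
from itertools import product


def neighbourhood(word, alphabet, score, threshold):
    """Return all words of len(word) whose aligned score with word is >= threshold."""
    hits = []
    for letters in product(alphabet, repeat=len(word)):
        candidate = "".join(letters)
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append(candidate)
    return hits


def toy_score(a, b):
    """Toy scoring scheme (an assumption): +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


if __name__ == "__main__":
    # W = 3, T = 3: the exact word plus every single-mismatch variant qualifies.
    words = neighbourhood("ACG", "ACGT", toy_score, threshold=3)
    print(len(words), words)
```

Lowering `threshold` or shortening the word enlarges the neighbourhood, which is how W and T trade sensitivity against search time.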

Page 66: Introduction to Bioinformatics

Seeding

• BLAST assumption: significant alignments have words in common

• BLAST finds word (neighbourhood) hits in the database index

• Word hits are used to seed alignments

Page 67: Introduction to Bioinformatics

Seeding Controls Sensitivity

• Word size W controls number of hits (smaller words ⇒ more hits)

• Threshold score T controls number of hits (lower threshold ⇒ more hits)

• Scoring matrix controls which words match

Page 68: Introduction to Bioinformatics

The Two-Hit Algorithm

• BLAST assumption: word hits cluster on the diagonal for significant alignments

• The acceptable distance A between words on the diagonal is a parameter of your model

• Smaller distances isolate single words, and reduce search space

Page 69: Introduction to Bioinformatics

Extension

• The best-scoring seeds are extended in each direction

• BLAST does not explore the complete search space, so a rule (heuristic) to stop extension is needed

• Two-stage process:

  • Extend, keeping alignment score, and drop-off score

  • When drop-off score reaches a threshold X, trim alignment back to top score

Page 70: Introduction to Bioinformatics

Example

• Consider two sentences (match=+1, mismatch=-1)

  • The quick brown fox jumps over the lazy dog.

  • The quiet brown cat purrs when she sees him.

• Extend to the right from the seed T

  The quic
  The quie
  123 4565 <- score
  000 0001 <- drop-off score

Page 72: Introduction to Bioinformatics

Example

• Consider two sentences (match=+1, mismatch=-1)

  • The quick brown fox jumps over the lazy dog.

  • The quiet brown cat purrs when she sees him.

• Extend to drop-off threshold

  The quick brown fox jump
  The quiet brown cat purr
  123 45654 56789 876 5654 <- score
  000 00012 10000 123 4345 <- drop-off score

Page 73: Introduction to Bioinformatics

Example

• Consider two sentences (match=+1, mismatch=-1)

  • The quick brown fox jumps over the lazy dog.

  • The quiet brown cat purrs when she sees him.

• Trim back from drop-off threshold to get optimal alignment

  The quick brown
  The quiet brown
  123 45654 56789 <- score
  000 00012 10000 <- drop-off score
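The extend-then-trim procedure in this example can be reproduced with a short ungapped extension routine. This is a sketch of the general X-drop idea rather than BLAST's actual implementation. It follows the slide in not scoring spaces (they are stripped before extension) and uses a drop-off threshold of 5, which is the value at which the slide's example stops. The function name `extend_right` is hypothetical.

```python
def extend_right(query, subject, x_drop, match=1, mismatch=-1):
    """Ungapped rightward extension with an X-drop stopping rule.

    Returns (best_score, query_segment, subject_segment), trimmed back to
    the position where the running score was highest.
    """
    score = best = best_len = 0
    for length, (a, b) in enumerate(zip(query, subject), start=1):
        score += match if a == b else mismatch
        if score > best:
            best, best_len = score, length
        # Drop-off is how far the running score has fallen below the best so far.
        if best - score >= x_drop:
            break
    return best, query[:best_len], subject[:best_len]


if __name__ == "__main__":
    # Spaces are removed because the slide does not score them.
    q = "The quick brown fox jumps over the lazy dog.".replace(" ", "")
    s = "The quiet brown cat purrs when she sees him.".replace(" ", "")
    print(extend_right(q, s, x_drop=5))  # (9, 'Thequickbrown', 'Thequickbrown'[:0] + 'Thequietbrown')
```

With a smaller `x_drop` (for example 2) the extension stops at the first mismatched pair of letters in "quick"/"quiet" and never reaches "brown", which shows how X limits what BLAST can find.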

Page 74: Introduction to Bioinformatics

Notes on implementation

• X controls termination of alignment extension, but dependent on:

  • substitution matrix

  • gap opening and extension parameters

Page 75: Introduction to Bioinformatics

Evaluation

• The principle is easy: use a score threshold S to determine strong and weak alignments

• S is monotonic with E, so an equivalent threshold can be calculated

• Score S is independent of database size and search space. E values are not.

• Alignment consistency of HSPs is also a factor in the report

Page 76: Introduction to Bioinformatics

Table of Contents

Introduction

Alignment

BLAST

BLAST Statistics

Using BLAST

Page 77: Introduction to Bioinformatics

Log-odds Matrices

• Substitution matrices are your model of evolution

• Substitution matrices are log-odds matrices

  • Positive numbers indicate likely substitutions/similarity

  • Negative numbers indicate unlikely substitutions/dissimilarity

BLOSUM62

Page 78: Introduction to Bioinformatics

Choice of Matrix

• Substitution matrix determines the raw alignment score S

  • S is the sum of pairwise scores in an alignment

• BLAST provides, for proteins:

  • BLOSUM45 BLOSUM50 BLOSUM62 BLOSUM80 BLOSUM90

  • PAM30 PAM70 PAM250

• BLOSUM matrices empirically defined from multiple sequence alignments of ≥ n% identity, for BLOSUMn

• For nucleotides: ‘matrix’ defined by match/mismatch (reward/penalty) parameters

Page 79: Introduction to Bioinformatics

Definition

• The Karlin-Altschul equation

$$E = kmn\,e^{-\lambda S}$$

• Symbols:

  • k: minor constant, adjusts for correlation between alignments

  • m: number of letters in query sequence

  • n: number of letters in the database

  • λ: scoring matrix scaling factor

  • S: raw alignment score

Page 80: Introduction to Bioinformatics

Interpretation

• The Karlin-Altschul equation

$$E = kmn\,e^{-\lambda S}$$

• E is the number of alignments of a similar score expected by chance when querying a database of the same size and letter frequency, where the letters in that database are randomly-ordered

• Small changes in score S can produce large changes in E

• BUT biological sequence databases are not random!
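A quick numerical sketch shows why small changes in S produce large changes in E, and why E (unlike S) depends on database size. The values of k and λ used below are illustrative placeholders only, since the true values depend on the scoring matrix and gap penalties in use. The function name `expect_value` is hypothetical.

```python
import math


def expect_value(raw_score, query_len, db_len, k, lam):
    """Karlin-Altschul expectation: E = k * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    # Placeholder parameters (assumptions, not taken from any specific matrix).
    k, lam = 0.1, 0.3
    m, n = 300, 1_000_000_000
    for s in (60, 70, 80, 90, 100):
        print(s, f"{expect_value(s, m, n, k, lam):.3g}")
```

Because S sits in the exponent, adding a fixed amount to S divides E by a fixed factor, while doubling the database size n simply doubles E.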

Page 82: Introduction to Bioinformatics

Multiple BLAST tools

• BLASTN vs MEGABLAST vs TBLASTX vs ...?

• Korf et al. (2003) “BLAST” is really good for the theory part, but its practical examples are dated due to changes with BLAST+

Page 83: Introduction to Bioinformatics

Multiple flavours of BLAST

• NCBI “legacy” BLAST

  • Now obsolete and not being updated

  • Spawned offshoots including:

    • WU-BLAST aka AB-BLAST (commercial)

    • MPI-BLAST for use on clusters

    • Versions to run on graphics cards

• NCBI BLAST+

  • Re-written in 2009 using C++ instead of C

  • Many improvements

  • Slightly different output

  • Different commands used to run it

Page 84: Introduction to Bioinformatics

Multiple ways to run BLAST

• BLAST+ at the command line (today)

• Via a script or programming language

• Via a graphical tool like BioEdit, CLCbio, Blast2GO

• Via the NCBI website

• Via a genome consortium website

• Via a Galaxy web server

• etc

• Offers flexibility but different settings/options/versions

Page 85: Introduction to Bioinformatics

Multiple places to run BLAST

• On the NCBI servers, e.g. via website or tool

• On 3rd party servers, e.g. via websites

• On your own computer

• On our Linux cluster

Page 86: Introduction to Bioinformatics

Core BLAST tools: Query sequences vs Database

• Nucleotide vs Nucleotide:

  • blastn (covering blastn, megablast, dc-megablast)

• Translated nucleotide vs Protein:

  • blastx

• Protein vs Translated nucleotide:

  • tblastn

• Protein vs Protein:

  • blastp, psiblast, phiblast, deltablast

See http://blast.ncbi.nlm.nih.gov/ for a reminder ;)

Page 87: Introduction to Bioinformatics

The BLAST tools have built in help

$ blastp -h
USAGE
  blastp [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-entrez_query entrez_query]
    [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
    [-subject subject_input_file] [-subject_loc range] [-query input_file]
    [-out output_file] [-evalue evalue] [-word_size int_value]
    [-gapopen open_penalty] [-gapextend extend_penalty]
    [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value] [-max_hsps int_value]
    [-sum_statistics] [-seg SEG_options] [-soft_masking soft_masking]
    [-matrix matrix_name] [-threshold float_value] [-culling_limit int_value]
    ...
    [-max_target_seqs num_sequences] [-num_threads int_value] [-ungapped]
    [-remote] [-comp_based_stats compo] [-use_sw_tback] [-version]

DESCRIPTION
   Protein-Protein BLAST 2.2.29+

Use '-help' to print detailed descriptions of command line arguments

Page 88: Introduction to Bioinformatics

Minimal example of BLAST+ at the command line

$ blastp -query my_input.fasta -db my_database -out my_output.txt

• Replace blastp with the appropriate tool, e.g. blastn

• Replace my_input.fasta with your actual filename

• Replace my_database with your actual database, e.g. nr

• Replace my_output.txt with your desired output filename

• Best to avoid spaces in your folder and filenames!

e.g.

$ blastp -query query.fasta -db dbA -out my_output.txt
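The same command can be launched from a script, which leaves a reproducible record of exactly what was run. The sketch below is one possible approach using Python's standard `subprocess` module. The helper names `blast_command` and `run_blast` are hypothetical, and the only flags used (`-query`, `-db`, `-out`, `-evalue`, `-outfmt`) are those listed in the BLAST+ help output.

```python
import shutil
import subprocess


def blast_command(program, query, database, output, evalue=None, outfmt=None):
    """Build the argument list for a BLAST+ search (nothing is executed here)."""
    cmd = [program, "-query", query, "-db", database, "-out", output]
    if evalue is not None:
        cmd += ["-evalue", str(evalue)]
    if outfmt is not None:
        cmd += ["-outfmt", str(outfmt)]
    return cmd


def run_blast(cmd):
    """Run the command, failing clearly if the BLAST+ tool is not installed."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    # check=True raises CalledProcessError if BLAST exits with an error.
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = blast_command("blastp", "query.fasta", "dbA", "my_output.txt")
    print(" ".join(cmd))  # record the exact command in your lab book
    # run_blast(cmd)      # uncomment where BLAST+ and the database are available
```

Passing the arguments as a list rather than a single shell string avoids quoting problems, although avoiding spaces in file names remains good practice.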

Page 89: Introduction to Bioinformatics

Setting the BLAST+ output format

$ blastp -help
USAGE
...

 *** Formatting options
 -outfmt <String>
   alignment view options:
     0 = pairwise,
     1 = query-anchored showing identities,
     2 = query-anchored no identities,
     3 = flat query-anchored, show identities,
     4 = flat query-anchored, no identities,
     5 = XML Blast output,
     6 = tabular,
     7 = tabular with comment lines,
     8 = Text ASN.1,
     9 = Binary ASN.1,
    10 = Comma-separated values,
    11 = BLAST archive format (ASN.1)

...
   Default = '0'
...

Page 90: Introduction to Bioinformatics

Setting the BLAST+ output format

Default is plain text pairwise alignments, for humans:

$ blastp -query query.fasta -db dbA -out my_output.txt
...

XML output can be useful (e.g. for BLAST2GO):

$ blastp -query query.fasta -db dbA -out my_output.xml -outfmt 5
...

Tabular output is easiest to filter, sort, etc:

$ blastp -query query.fasta -db dbA -out my_output.tab -outfmt 6
...
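To show why tabular output is convenient, here is a small sketch that reads `-outfmt 6` output and keeps the highest bit-score hit for each query. It assumes the default twelve columns of BLAST+ tabular output (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function names and the example rows are invented for illustration.

```python
import csv
import io

# Default column order for BLAST+ "-outfmt 6" when no custom fields are given.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
FLOAT_FIELDS = ("pident", "evalue", "bitscore")
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")


def read_blast_tabular(handle):
    """Yield one dict per hit, skipping blank lines and '#' comment lines."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(COLUMNS, row))
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        yield hit


def best_hits(hits):
    """Keep the hit with the highest bit score for each query sequence."""
    best = {}
    for hit in hits:
        query = hit["qseqid"]
        if query not in best or hit["bitscore"] > best[query]["bitscore"]:
            best[query] = hit
    return best


if __name__ == "__main__":
    # Invented example rows, standing in for the contents of my_output.tab.
    example = io.StringIO(
        "q1\tsA\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t380\n"
        "q1\tsB\t60.0\t150\t60\t2\t10\t159\t20\t169\t1e-20\t95.5\n"
        "q2\tsC\t75.0\t120\t30\t1\t1\t120\t1\t120\t1e-40\t160\n"
    )
    for query, hit in best_hits(read_blast_tabular(example)).items():
        print(query, hit["sseqid"], hit["evalue"], hit["bitscore"])
```

Skipping lines that start with `#` means the same reader also copes with `-outfmt 7`, which adds comment lines to the tabular format.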

Page 91: Introduction to Bioinformatics

Setting the e-value threshold

Check the built in help:

$ blastp -help
USAGE
...
 -evalue <Real>
   Expectation value (E) threshold for saving hits
   Default = '10'
...

Example using 0.00001, i.e. 1 × 10^-5, written in scientific notation as 1e-5:

$ blastp -query query.fasta -db dbA -out my_output.txt -evalue 1e-5
...

Page 92: Introduction to Bioinformatics

In Conclusion

• Every BLAST search is an experiment

• Badly-designed searches can give you bad results

• Knowing how BLAST works helps improve search design

• BLAST results still require inspection and interpretation

Page 93: Introduction to Bioinformatics

An Introduction to Bioinformatics Tools

Part 3: Workshop

Leighton Pritchard and Peter Cock

Page 94: Introduction to Bioinformatics

Table of Contents

Introduction

Workshop Data

Gene Prediction

Genome Comparisons

Gene Comparisons

Conclusions

Page 95: Introduction to Bioinformatics

Learning Outcomes

• Workshop example: bacterial genome annotation (because they’re small and data easy to handle)

• The role of biological insight in a bioinformatics workflow

  • Visual interaction with sequence data

  • Using alternative tools

  • Comparison of tools and outputs

  • Online tools for automated function prediction

Page 96: Introduction to Bioinformatics

What You Will Be Doing

Illustrative example of concepts: Functional annotation of a draft bacterial genome

1. Gene prediction

2. Genome comparisons

3. Gene comparisons

Page 98: Introduction to Bioinformatics

Locate your data

• You are in group A, B, C or D - this decides your chromosome sequence: chrA.fasta, chrB.fasta, chrC.fasta, chrD.fasta

• Each sequence represents a single stitched, ordered draft bacterial genome comprising a number of contigs.

• You will use your sequence as the basis of the exercises in the workshop.

Page 99: Introduction to Bioinformatics

Locate your data

• You are in group A, B, C or D - this decides your dataset: chrA.fasta, chrB.fasta, chrC.fasta, chrD.fasta

• You also have a GFF file describing the location of assembled contigs: chrA_contigs.gff, chrB_contigs.gff, chrC_contigs.gff, chrD_contigs.gff

Page 100: Introduction to Bioinformatics

Inspect the data

$ head -n 3 chrA.fasta
>chrA
ttttcttgattgaccttgttcgagtggagtccgccgtgtcactttcgctttggcagcagt
gtcttgcccgtttgcaggatgagttacctgccacagaattcagtatgtggatacgcccgt
$ head -n 3 chrA_contigs.gff
##gff-version 3
chrA stitching contig 1 154993 . . . ID=contig00005_b;Name=contig00005_b
chrA stitching contig 155036 241491 . . . ID=contig00018;Name=contig00018
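GFF3 is a plain-text format with nine tab-separated columns per feature, so it is easy to inspect programmatically as well as by eye. The following is a minimal parsing sketch, assuming well-formed feature lines. The function name `parse_gff_line` is hypothetical, and a dedicated GFF library would be a better choice for real analyses.

```python
def parse_gff_line(line):
    """Split one GFF3 feature line into a dict; attributes become a sub-dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, found {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    attributes = dict(item.split("=", 1) for item in attrs.split(";") if item)
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # GFF coordinates are 1-based and inclusive
        "end": int(end),
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


if __name__ == "__main__":
    line = "chrA\tstitching\tcontig\t1\t154993\t.\t.\t.\tID=contig00005_b;Name=contig00005_b"
    feature = parse_gff_line(line)
    length = feature["end"] - feature["start"] + 1
    print(feature["attributes"]["ID"], length)
```

Comparing the end of one contig with the start of the next in this way is also a quick route to locating the stitching sequence between them.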

Page 101: Introduction to Bioinformatics

Inspect the data

Starting Artemis

$ art &

Page 102: Introduction to Bioinformatics

Load the chromosome sequence

Select the sequence for your group

Page 105: Introduction to Bioinformatics

Load the contig GFF

Select the file for your group

Page 107: Introduction to Bioinformatics

Find the stitching sequence

The contigs are stitched with a specific sequence: see if you can find and identify it.

Page 109: Introduction to Bioinformatics

Lines of Evidence

• ab initio gene calling:

  • Unsupervised methods - not trained on a dataset

  • Supervised methods - trained on a dataset

• homology matches

  • alignment to genes from related organisms (annotation transfer)

  • from known gene products (e.g. proteins, ncRNA)

  • from transcripts/other intermediates (e.g. ESTs, cDNA, RNAseq)

Page 110: Introduction to Bioinformatics

Consensus Methods

• Combine weighted evidence from multiple sources, using linear combination or graph theoretical methods

• For eukaryotes:

  • EVM http://evidencemodeler.sourceforge.net/

  • Jigsaw http://www.cbcb.umd.edu/software/jigsaw/

  • GLEAN http://sourceforge.net/projects/glean-gene/

Page 111: Introduction to Bioinformatics

Basic Gene Finding

• We could use Artemis to identify the longest coding region in each ORF, lots of manual steps

• This is the most basic gene finding, and can easily be automated, e.g. EMBOSS getorf

• Dedicated gene finders usually more appropriate...

Page 112: Introduction to Bioinformatics

Finding Open Reading Frames

• ORF finding is naive, does not consider:

  • Start codon

  • Splicing

  • Promoter/RBS motifs

  • Wider context (e.g. overlapping genes)
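To make the naivety concrete, the sketch below finds stop-to-stop open reading frames in all six frames using nothing but the positions of stop codons. It deliberately ignores start codons, splicing, motifs and context, exactly as described above. The function names and the `min_codons` cut-off are assumptions for illustration, and the stop codons are those of the standard genetic code.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Yield (frame, start, end) for stop-free runs of at least min_codons.

    Coordinates are 0-based, end-exclusive, relative to the strand given.
    """
    for frame in range(3):
        run_start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOPS:
                if (pos - run_start) // 3 >= min_codons:
                    yield frame, run_start, pos
                run_start = pos + 3
        # A run may also reach the end of the sequence without meeting a stop.
        last = frame + ((len(seq) - frame) // 3) * 3
        if (last - run_start) // 3 >= min_codons:
            yield frame, run_start, last


def find_orfs(seq, min_codons=30):
    """Scan both strands; yield (strand, frame, start, end) tuples."""
    seq = seq.upper()
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame, start, end in orfs_in_strand(strand_seq, min_codons):
            yield strand, frame, start, end


if __name__ == "__main__":
    for orf in find_orfs("ATGAAATTTTAGCCC", min_codons=3):
        print(orf)
```

Even on this 15-base toy sequence, several frames report a "hit", which illustrates why dedicated gene finders add statistical models on top of simple ORF detection.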

Page 113: Introduction to Bioinformatics

Prokaryotic Prediction Methods

• Prokaryotes “easier” than eukaryotes for gene prediction

• Less uncertainty in predictions (isoforms, gene structure)

  • Very gene-dense (over 90% of chromosome is coding sequence)

  • No intron-exon structure

  • Problem is: “which possible ORF contains the true gene, and which start site is correct?”

  • Still not a solved problem

Page 114: Introduction to Bioinformatics

Two ab initio Prokaryotic Prediction Methods

You will be using two tools

• Glimmer
  • Interpolated Markov models
  • Can be trained on “gold standard” datasets

• Prodigal
  • Log-likelihood model based on GC frame plots, followed by dynamic programming
  • Can be trained on “gold standard” datasets

Page 115: Introduction to Bioinformatics

Using Glimmer

Supervised - we train on a related complete genome sequence, then run glimmer3

$ build-icm -r NC_004547.icm < NC_004547.ffn
$ glimmer3 -o 50 -g 110 -t 30 chrA.fasta NC_004547.icm chrA_glimmer3

• -o 50: max overlap bases

• -g 110: min gene length

• -t 30: threshold score

Page 116: Introduction to Bioinformatics

Using Glimmer

glimmer3 output is not standard GFF format:

$ head -n 4 chrA_glimmer3.predict
>chrA
orf00001 36 1430 +3 8.81
orf00002 1435 2535 +1 11.51
orf00005 2676 3761 +3 8.63

We could Google for help, or use the provided conversion script:

$ python glimmer_to_gff.py chrA_glimmer3.predict

Page 117: Introduction to Bioinformatics

Using Glimmer

We now have output in GFF

$ head -n 3 chrA_glimmer3.gff
chrA Glimmer CDS 36 1430 8.81 + 0 ID=orf00001;Name=orf00001
chrA Glimmer CDS 1435 2535 11.51 + 0 ID=orf00002;Name=orf00002
chrA Glimmer CDS 2676 3761 8.63 + 0 ID=orf00005;Name=orf00005
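The workshop's glimmer_to_gff.py is not reproduced in these slides, so the following is only a sketch of what such a conversion might look like. The function name `glimmer_predict_to_gff` is hypothetical, and the sketch assumes that reverse-strand predictions carry a negative frame and are listed with start greater than end; genes that wrap around the origin of a circular sequence are not handled.

```python
# Sketch of a glimmer3 .predict to GFF3 conversion (not the workshop's script).
def glimmer_predict_to_gff(lines, source="Glimmer"):
    """Return GFF3-style, tab-separated lines for glimmer3 .predict input lines."""
    out = []
    seqid = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line names the sequence that the following ORFs belong to
            seqid = line[1:].split()[0]
            continue
        orf_id, start, end, frame, score = line.split()[:5]
        start, end = int(start), int(end)
        strand = "+" if frame.startswith("+") else "-"
        # GFF requires start <= end regardless of strand
        low, high = min(start, end), max(start, end)
        attributes = "ID={0};Name={0}".format(orf_id)
        out.append("\t".join([seqid, source, "CDS", str(low), str(high),
                              score, strand, "0", attributes]))
    return out
```

Applied to the first lines of chrA_glimmer3.predict shown earlier, this gives the same columns as the GFF output above, with tabs as separators.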

Page 118: Introduction to Bioinformatics

Using Prodigal

Unsupervised (i.e. untrained) mode

$ prodigal -f gff -o chrA_prodigal.gff -i chrA.fasta

Page 119: Introduction to Bioinformatics

Using Prodigal

Prodigal GFF output is correctly formatted and informative

$ head -n 6 chrA_prodigal.gff
##gff-version 3
# Sequence Data: seqnum=1;seqlen=4727782;seqhdr="chrA"
# Model Data: version=Prodigal.v2.50;run_type=Single;model="Ab initio";gc_cont=54.48;transl_table=11;uses_sd=1
chrA Prodigal_v2.50 CDS 3 1430 188.5 + 0 ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;score=188.54;cscore=185.37;sscore=3.18;rscore=0.00;uscore=3.18;tscore=0.00
chrA Prodigal_v2.50 CDS 1435 2535 185.6 + 0 ID=1_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;score=185.61;cscore=184.24;sscore=1.36;rscore=-7.73;uscore=3.48;tscore=4.37
chrA Prodigal_v2.50 CDS 2676 3761 146.2 + 0 ID=1_3;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;score=146.19;cscore=149.82;sscore=-3.63;rscore=-7.73;uscore=-0.28;tscore=4.37

Page 120: Introduction to Bioinformatics

Comparing predictions in Artemis

Page 123: Introduction to Bioinformatics

Comparing predictions in Artemis

Do ORF (orange) and CDS (green, blue) prediction methods agree?

Page 124: Introduction to Bioinformatics

Comparing predictions in Artemis

Do glimmer (green) and prodigal (blue) CDS prediction methods agree?

How do we know which (if either) is best?

Page 125: Introduction to Bioinformatics

Using a “Gold Standard”

A general approach for all predictive methods

• Define a known, “correct” set of true/false, positive/negative etc. examples - the “gold standard”

• Evaluate your predictive method against that set for
  • sensitivity, specificity, accuracy, precision, etc.

Many methods are available; coverage is beyond the scope of this introduction

Page 126: Introduction to Bioinformatics

Contingency Tables

                      Condition (Gold standard)
                      True              False
Test outcome
  Positive            True Positive     False Positive
  Negative            False Negative    True Negative

Sensitivity = TPR = TP/(TP + FN)
Specificity = TNR = TN/(FP + TN)
FPR = 1 − Specificity = FP/(FP + TN)

If you don’t have this information, you can’t interpret predictive results properly.
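As a worked illustration of these formulas, the short Python sketch below computes the rates from the four cells of a contingency table. The function name `contingency_metrics` is hypothetical, and PPV and FDR are included only because they appear in the results tables later in the deck.

```python
# Rates derived from a 2x2 contingency table (illustrative sketch).
def contingency_metrics(tp, fp, fn, tn):
    """Return common performance metrics from true/false positive/negative counts."""
    return {
        "sensitivity": tp / (tp + fn),   # TPR: fraction of true cases detected
        "specificity": tn / (fp + tn),   # TNR: fraction of false cases rejected
        "fpr": fp / (fp + tn),           # 1 - specificity
        "ppv": tp / (tp + fp),           # fraction of positive calls that are correct
        "fdr": fp / (tp + fp),           # 1 - PPV
    }
```

For example, with illustrative counts `contingency_metrics(tp=190, fp=198, fn=10, tn=19602)`, sensitivity is 0.95 and FPR is 0.01, yet fewer than half of the positive calls are correct.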

Page 127: Introduction to Bioinformatics

Why Performance Metrics Matter

• You go for a checkup, and are tested for disease X

• The test has sensitivity = 0.95 (predicts disease where there is disease)

• The test has FPR = 0.01 (predicts disease where there is no disease)

• Your test is positive

• What is the probability that you have disease X?
  • 0.01, 0.05, 0.50, 0.95, 0.99?

Page 129: Introduction to Bioinformatics

Why Performance Metrics Matter

• What is the probability that you have disease X?

• Unless you know the baseline occurrence of disease X, you cannot know.

• Baseline occurrence: fX
  • fX = 0.01 ⇒ P(disease|+ve) = 0.490 ≈ 0.5
  • fX = 0.8 ⇒ P(disease|+ve) = 0.997 ≈ 1.0

Page 131: Introduction to Bioinformatics

Why Performance Metrics Matter

• Imagine a predictor for protein functional class

• Predictor has sensitivity = 0.95, FPR = 0.01

• You run the predictor on 20,000 proteins in an organism

• We estimate ≈ 200 members of the class in the protein complement, so fX = 200/20,000 = 0.01

• fX = 0.01 ⇒ P(member|+ve) = 0.490 ≈ 0.5

Page 133: Introduction to Bioinformatics

Bayes’ Theorem

• May seem counter-intuitive: 95% sensitivity, 99% specificity ⇒ 50% chance of any prediction being incorrect (at a baseline occurrence of fX = 0.01)

• Probability given by Bayes’ Theorem

• $P(X \mid +) = \dfrac{P(+ \mid X)\,P(X)}{P(+ \mid X)\,P(X) + P(+ \mid \bar{X})\,P(\bar{X})}$

• This is commonly overlooked in the literature (confirmation bias?)

• e.g. in a paper describing a novel TTSS predictor:
  “The surprisingly high number of (false) positives in genomes without TTSS exceeds the expected false positive rate”
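The numbers quoted on the previous slides can be checked directly from Bayes’ Theorem. The sketch below is a plain restatement of the formula in Python; the function name `posterior` and its argument names are chosen for this example only.

```python
# Bayes' Theorem for a binary test (illustrative sketch).
def posterior(sensitivity, fpr, prevalence):
    """Return P(X | +) given P(+ | X), P(+ | not X) and the baseline occurrence P(X)."""
    true_pos = sensitivity * prevalence        # P(+ | X) P(X)
    false_pos = fpr * (1.0 - prevalence)       # P(+ | not X) P(not X)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Same test, two different baseline occurrences
    for f_x in (0.01, 0.8):
        print(f_x, round(posterior(0.95, 0.01, f_x), 3))
```

With sensitivity 0.95 and FPR 0.01, this gives about 0.490 at fX = 0.01 and about 0.997 at fX = 0.8, which is why the baseline occurrence has to be known before a positive result can be interpreted.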

Page 134: Introduction to Bioinformatics

Interpreting Performance Metrics

• Use Bayes’ Theorem!

• Predictions apply to groups, not individual members of the group. e.g.
  • Test for airport smugglers has P(smuggler|+) = 0.9
  • Test gives 100 positives

• Which specific individuals are truly smugglers?

• The test does not allow you to determine this - you need more evidence for each individual

• Same principle applies to all other tests (including protein functional class prediction) - you should not ‘cherry-pick’ for publication without other evidence

Page 136: Introduction to Bioinformatics

“Gold Standard” results

• Tested glimmer and prodigal on two “gold standards”
  • Manually annotated (>3 expert person years) close relative
  • Community-annotated close relative

• Both methods trained directly on the annotated genes in each organism!

Page 137: Introduction to Bioinformatics

“Gold Standard” results

genecaller            glimmer      prodigal
predicted             4752         4287
missed                284 (6%)     407 (9%)

Exact Prediction
  sensitivity         62%          71%
  FDR                 41%          25%
  PPV                 59%          75%

Correct ORF
  sensitivity         94%          91%
  FDR                 10%          3%
  PPV                 90%          97%

Page 138: Introduction to Bioinformatics

“Gold Standard” results

genecaller            glimmer      prodigal
predicted             4679         4467
missed                112 (3%)     156 (3%)

Exact Prediction
  sensitivity         62%          86%
  FDR                 31%          14%
  PPV                 69%          86%

Correct ORF
  sensitivity         97%          97%
  FDR                 7%           3%
  PPV                 93%          97%
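The slides do not define how “Exact Prediction” and “Correct ORF” were scored, so the following Python sketch rests on stated assumptions: a prediction is “exact” if start, end and strand all match a gold-standard CDS, and is in the “correct ORF” if it shares the stop position and strand (so only the start site may differ). The function names are hypothetical, and this is not the evaluation code used to produce the tables above.

```python
# Scoring CDS predictions against a gold standard (sketch under assumed definitions).
def stop_key(cds):
    """Return (stop position, strand) for a (start, end, strand) tuple, start <= end."""
    start, end, strand = cds
    return (end if strand == "+" else start, strand)


def _summarise(true_pos, n_predicted, n_gold):
    """Sensitivity, PPV and FDR from a true-positive count and the two set sizes."""
    ppv = true_pos / n_predicted
    return {"sensitivity": true_pos / n_gold, "ppv": ppv, "fdr": 1.0 - ppv}


def score_predictions(predicted, gold):
    """Compare predicted and gold-standard CDS features at two levels of strictness."""
    pred, ref = set(predicted), set(gold)
    pred_stops = {stop_key(c) for c in pred}
    ref_stops = {stop_key(c) for c in ref}
    return {
        "exact": _summarise(len(pred & ref), len(pred), len(ref)),
        "correct_orf": _summarise(len(pred_stops & ref_stops),
                                  len(pred_stops), len(ref_stops)),
    }
```

Under these definitions PPV and FDR always sum to 1, as they do in both tables, and the gap between the two levels reflects how often a method finds the right ORF but picks a different start site.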

Page 139: Introduction to Bioinformatics

Gene/CDS Prediction

• Many alternative methods, all perform differently

• To assess/choose methods, performance metrics are required

• Even on (relatively simple) prokaryotes, current best methods are imperfect

• Manual assessment and intervention is essential, and usually the longest part of the process

Page 140: Introduction to Bioinformatics

Table of Contents

Introduction

Workshop Data

Gene Prediction

Genome Comparisons

Gene Comparisons

Conclusions

Page 141: Introduction to Bioinformatics

Run a megaBLAST Comparison

BLAST your chromosome against the comparator sequence. Put results in chrA_megablast_Pba.tab

$ blastn -query chrA.fasta -subject NC_004547.fna -out chrA_megablast_Pba.tab -outfmt 6
$ head -n 3 chrA_megablast_Pba.tab
chrA gi|50118965|ref|NC_004547.2|:10948-12453 80.34 1511 287 10 4579450 4580955 1506 1 0.0 1136
chrA gi|50118965|ref|NC_004547.2|:c33859-32447 82.04 1409 253 0 4563151 4564559 1 1409 0.0 1201
chrA gi|50118965|ref|NC_004547.2|:c34917-33868 82.48 1050 184 0 4562093 4563142 1 1050 0.0 920

Note this defaults to using MEGABLAST...

Page 142: Introduction to Bioinformatics

Run a BLASTN Comparison

BLAST your chromosome against the comparator sequence. Put results in chrA_blastn_Pba.tab

$ blastn -query chrA.fasta -subject NC_004547.fna -out chrA_blastn_Pba.tab -outfmt 6 -task blastn
$ head -n 3 chrA_blastn_Pba.tab
chrA gi|50118965|ref|NC_004547.2|:5629-7497 79.68 1865 379 0 4584915 4586779 1865 1 0.0 1654
chrA gi|50118965|ref|NC_004547.2|:5629-7497 92.59 27 2 0 4479367 4479393 1254 1280 0.004 41.0
chrA gi|50118965|ref|NC_004547.2|:5629-7497 100.00 17 0 0 4613022 4613038 52 36 2.1 31.9

Note we added -task blastn

Page 143: Introduction to Bioinformatics

Do BLASTN and megaBLAST comparisons agree?

Check the number of alignments returned with wc

$ wc chrA_megablast_Pba.tab
2675 32100 242539 chrA_megablast_Pba.tab
$ wc chrA_blastn_Pba.tab
31792 381504 2850953 chrA_blastn_Pba.tab

What is this telling us?
Why do the results differ?
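wc only counts lines, so it cannot say how many of the extra BLASTN alignments are short or weak. One way to look further is to parse the tabular output and count hits above chosen thresholds. The Python sketch below assumes the default 12-column `-outfmt 6` layout; the function names and the threshold defaults are hypothetical values picked for illustration.

```python
# Parsing BLAST+ tabular (-outfmt 6) output and counting stronger hits (sketch).
BLAST6_FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}
TEXT_FIELDS = {"qseqid", "sseqid"}


def parse_blast6(lines):
    """Yield one dict per alignment from tab-separated BLAST tabular lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        row = dict(zip(BLAST6_FIELDS, line.split("\t")))
        for key in BLAST6_FIELDS:
            if key in FLOAT_FIELDS:
                row[key] = float(row[key])
            elif key not in TEXT_FIELDS:
                row[key] = int(row[key])
        yield row


def count_strong_hits(lines, min_pident=75.0, min_length=100):
    """Count alignments at or above the identity and length thresholds."""
    return sum(1 for hit in parse_blast6(lines)
               if hit["pident"] >= min_pident and hit["length"] >= min_length)
```

Running `count_strong_hits(open("chrA_blastn_Pba.tab"))` and the same on the megaBLAST file gives a fairer comparison than the raw line counts.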

Page 144: Introduction to Bioinformatics

BLASTN vs megaBLAST

• Legacy BLASTN uses the BLAST algorithm, megaBLAST does not
  • (though BLAST+ BLASTN now uses megaBLAST by default)

• megaBLAST uses a fast, greedy algorithm due to Zhang et al. (2000) http://www.ncbi.nlm.nih.gov/pubmed/10890397

• megaBLAST is optimised for
  • genome-level searches
  • queries on large sequence sets (automatic query packing)
  • long alignments of similar sequences, with SNPs/sequencing errors

• A discontinuous mode (dc-megaBLAST) is recommended for more divergent sequences

Page 146: Introduction to Bioinformatics

Viewing alignments in ACT

Start ACT from the command line:

$ act &

Page 147: Introduction to Bioinformatics

Use the “File”, “Open...” menu item

Page 148: Introduction to Bioinformatics

Increase the Number of Comparisons

Use more files ...

Page 149: Introduction to Bioinformatics

Select chromosome sequences

Page 150: Introduction to Bioinformatics

Add BLAST/megaBLAST results

Page 151: Introduction to Bioinformatics

Zoom Out

Page 152: Introduction to Bioinformatics

Remove Weak Matches

Use filter sliders

Page 153: Introduction to Bioinformatics

MUMmer

• MUMmer is a suite of alignment programs and scripts
  • mummer, promer, nucmer, etc.

• Very different to BLAST (suffix tree alignment) - very fast

• Extremely flexible

• Used for genome comparisons, assemblies, scaffolding, repeat detection, etc.

• Forms the basis for other aligners/assemblers

Page 154: Introduction to Bioinformatics

Run a MUMmer Comparison

Create a new sub-directory for MUMmer output.

$ pwd
.../data/workshop/chromosomes
$ mkdir nucmer_out

Run nucmer to create chrA_NC_004547.delta

$ nucmer --prefix=nucmer_out/chrA_NC_004547 chrA.fasta NC_004547.fna

Then filter this file to generate a coordinate table for visualisation

$ delta-filter -q nucmer_out/chrA_NC_004547.delta > nucmer_out/chrA_NC_004547.filter
$ show-coords -rcl nucmer_out/chrA_NC_004547.filter > nucmer_out/chrA_NC_004547_filtered.coords

Page 155: Introduction to Bioinformatics

Run a MUMmer Comparison

MUMmer output is very different from BLAST output

$ head nucmer_out/chrA_NC_004547_filtered.coords
...

Page 156: Introduction to Bioinformatics

Run a MUMmer Comparison

Use a one-line shell command to convert to ACT-friendly format:

$ tail -n +6 nucmer_out/chrA_NC_004547_filtered.coords | awk '{print $7" "$10" "$1" "$2" "$12" "$4" "$5" "$13}' > chrA_mummer_NC_004547.crunch
$ head chrA_mummer_NC_004547.crunch
2526 82.49 15 2540 4727782 4985117 4982588 5064019
2944 82.29 2676 5619 4727782 4982544 4979600 5064019
85 95.29 11092 11176 4727782 758690 758774 5064019
1356 81.69 17446 18801 4727782 77639 78994 5064019
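For readers less comfortable with awk, the same column shuffle can be written in Python. This sketch mirrors the one-liner above: it skips the first five header lines and picks the same whitespace-separated fields (the `|` separators in the show-coords output count as fields, which is why the indices look uneven). The function name `coords_to_crunch` is hypothetical.

```python
# Python equivalent of the tail/awk one-liner above (illustrative sketch).
def coords_to_crunch(lines, header_lines=5):
    """Convert show-coords -rcl output lines to the space-separated crunch layout."""
    out = []
    for line in list(lines)[header_lines:]:
        fields = line.split()
        if len(fields) < 13:
            continue  # skip blank or truncated lines
        # awk's $7 $10 $1 $2 $12 $4 $5 $13 are 0-based indices 6 9 0 1 11 3 4 12:
        # alignment length, % identity, ref start/end, ref length,
        # query start/end, query length
        picked = [fields[i] for i in (6, 9, 0, 1, 11, 3, 4, 12)]
        out.append(" ".join(picked))
    return out
```

Writing `"\n".join(coords_to_crunch(open("nucmer_out/chrA_NC_004547_filtered.coords")))` to chrA_mummer_NC_004547.crunch should give the same file as the shell command.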

Page 157: Introduction to Bioinformatics

Select Files

Select your chromosome, and the megaBLAST/MUMmer output

Page 158: Introduction to Bioinformatics

View Basic Alignment

Page 159: Introduction to Bioinformatics

Filter Weak BLAST Matches

Page 160: Introduction to Bioinformatics

Genome Alignments

• Alignment result depends on algorithm, and parameter choices

• Some algorithms/parameter sets more sensitive than others

• Appropriate visualisation is essential

Much more detail at http://www.slideshare.net/leightonp/comparative-genomics-and-visualisation-part-1

Page 161: Introduction to Bioinformatics

Table of Contents

Introduction

Workshop Data

Gene Prediction

Genome Comparisons

Gene Comparisons

Conclusions

Page 162: Introduction to Bioinformatics

Reciprocal Best BLAST Hits (RBBH)

• To compare our genecall proteins to the NC_004547.faa reference set...

• BLAST reference proteins against our proteins

• BLAST our proteins against reference proteins

• Pairs with each other as best BLAST Hit are called RBBH

Page 163: Introduction to Bioinformatics

One-way BLAST vs RBBH

One-way BLAST includes many low-quality hits

Page 164: Introduction to Bioinformatics

One-way BLAST vs RBBH

Reciprocal best BLAST hits remove many low-quality matches

Page 165: Introduction to Bioinformatics

Reciprocal Best BLAST Hits (RBBH)

• Pairs with each other as best BLAST hit are called RBBH

• Should filter on percentage identity and alignment length

• RBBH pairs are candidate orthologues
  • (most orthologues will be RBBH, but the relationship is complicated)
  • Outperforms OrthoMCL, etc. (why and how is beyond the scope of this course. . .)
    http://dx.doi.org/10.1093/gbe/evs100
    http://dx.doi.org/10.1371/journal.pone.0018755

(We have a tool for this on our in-house Galaxy server)
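The RBBH idea is simple enough to sketch in a few lines of Python. This is not the Galaxy tool mentioned above: it is a minimal illustration that assumes two BLAST tabular files in the default 12-column `-outfmt 6` layout, takes the highest bit score as the “best” hit, and breaks ties by keeping the first hit seen. The function names and the optional identity and length filters are hypothetical.

```python
# Reciprocal best BLAST hits from two tabular BLAST outputs (illustrative sketch).
def best_hits(lines, min_pident=0.0, min_length=0):
    """Map each query ID to the subject ID of its highest bit-score alignment."""
    best = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue
        query, subject = fields[0], fields[1]
        pident, length, bits = float(fields[2]), int(fields[3]), float(fields[11])
        if pident < min_pident or length < min_length:
            continue  # optional filter on percentage identity and alignment length
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, min_pident=0.0, min_length=0):
    """Return sorted (a, b) pairs where each is the other's best hit."""
    forward = best_hits(a_vs_b, min_pident, min_length)
    reverse = best_hits(b_vs_a, min_pident, min_length)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
```

Here `a_vs_b` would be the lines from BLASTing our proteins against the reference proteins, and `b_vs_a` the lines from the reverse search.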

Page 166: Introduction to Bioinformatics

Table of Contents

Introduction

Workshop Data

Gene Prediction

Genome Comparisons

Gene Comparisons

Conclusions

Page 167: Introduction to Bioinformatics

In Conclusion

• The tools you will need to use will be task-dependent, but some things are universal. . .
  • Good experimental design (including BLAST searches, etc.)
  • Keeping accurate records for reproduction/replication
  • Validation/sanity checking of results
  • Comparison and benchmarking of methods
  • (Cross-)validation of predictive methods

Remember: everything gets easier with practice, so practice lots!

