Introduction to Bioinformatics Part 0: So You Want To Be a Computational Biologist? Leighton Pritchard and Peter Cock

Upload: leighton-pritchard

Post on 10-May-2015


DESCRIPTION

Slides for the afternoon session on "Introduction to Bioinformatics", delivered at the James Hutton Institute, 29th, 20th May and 5th June 2014, by Leighton Pritchard and Peter Cock. Slides cover introductory guidance and links to resources, theory and use of BLAST tools, and a workshop featuring some common tools and tasks.

TRANSCRIPT

Page 1: Introduction to Bioinformatics

Introduction to Bioinformatics Part 0: So You Want To Be a Computational Biologist?

Leighton Pritchard and Peter Cock

Page 2: Introduction to Bioinformatics

Bertrand Russell

Page 3: Introduction to Bioinformatics

Table of Contents

Introduction

Recording Your Work

Conclusion

Page 4: Introduction to Bioinformatics

What is this “bioinformatics” thing, anyway?

• Bioinformatics: biology using computational and mathematical tools

• A discipline within biology

• Loman & Watson (2013) “So you want to be a computational biologist?” http://dx.doi.org/10.1038/nbt.2740

• Welch et al. (2014) “Bioinformatics Curriculum Guidelines: Toward a Definition of Core Competencies” http://dx.doi.org/10.1371/journal.pcbi.1003496

• Watson (2014) “The only core competency you need” http://bit.ly/1fS4iDJ (blog)

Page 5: Introduction to Bioinformatics

Some uncomfortable truths

• This one-day course will not make you a bioinformatician

• But practice will...

• The best way to learn is to do (“I don’t know how to do this yet, but I will find out.”)

• http://bit.ly/Rq0D61 (“Bioinformatics is a way of life”)

• Most bioinformatics is problem-solving

• We will introduce some useful tools and concepts

Page 8: Introduction to Bioinformatics

What it takes to be a bioinformatician

• Patience (problem-solving)

• Suspicion (statistics)

• Biological knowledge

• Social skills (no-one knows everything: ask!)

• Lots of practice

• Self-confidence (challenge results and dogma)

• Core domain skills: biology, computer science, statistics

• Deliver results (qualified, honest)

• Watson (2014) “What it takes to be a bioinformatician” http://bit.ly/1jDuQsO (blog)

Page 9: Introduction to Bioinformatics

More general advice?

• Ask us (we do this a lot)

• BioStars (https://www.biostars.org)

• SeqAnswers (http://seqanswers.com/)

• PLoS Comp Biol collections (http://www.ploscollections.org/static/pcbiCollections)

Page 11: Introduction to Bioinformatics

Why Do It?

• Doing bioinformatics is doing science: keep a lab book!

• You will not remember multiple files, analysis details, etc. in a week/month/six months/a year/three years

• Noble (2009) http://dx.doi.org/10.1371/journal.pcbi.1000424

• Baggerly & Coombes (2009) http://arxiv.org/pdf/1010.1092.pdf

Page 12: Introduction to Bioinformatics

How To Do It? I

• There is no one correct way, but. . .

• Think about data/docs/project structure before you start

Page 13: Introduction to Bioinformatics

How To Do It? II

• Use plain text where possible

• Use version control

• Keep backups

• Different tools suit different purposes: code vs. data vs. analysis vs. ...

• Find a way that works for you.

Page 14: Introduction to Bioinformatics

How To Do It? III

• Reproducibility is key!

• Scripts and pipelines are better for this than notes of what you did

• Also better for version control, and reuse

• Avoid unnecessary duplication
  • Someone else may have solved your problem
  • One (backed up) read-only copy of raw data, keep analyses separate
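The advice above (plan the project structure, keep a README in each directory, keep one read-only copy of the raw data) can be sketched in a few lines of Python. This is only one possible layout: the directory names `raw_data`, `analysis`, `scripts` and `docs`, and the helper names `make_project` and `make_read_only`, are assumptions for illustration and are not prescribed by the slides.

```python
import os
import stat
import tempfile
from pathlib import Path


def make_project(root):
    """Create a simple project skeleton with a README.md in each directory."""
    root = Path(root)
    # Hypothetical directory names: choose whatever structure suits your work.
    for name in ("raw_data", "analysis", "scripts", "docs"):
        directory = root / name
        directory.mkdir(parents=True, exist_ok=True)
        readme = directory / "README.md"
        if not readme.exists():
            readme.write_text(f"# {name}\n\nDescribe the contents of this directory.\n")
    return root


def make_read_only(path):
    """Remove write permission (user, group, other) from a single file."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = make_project(Path(tmp) / "my_project")
        raw = project / "raw_data" / "reads.fasta"
        raw.write_text(">example\nACGT\n")
        make_read_only(raw)  # analyses should read from, never write to, raw data
        print(sorted(p.name for p in project.iterdir()))
```

Because the skeleton is plain text and directories, it can be placed under version control from the first day of the project.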

Page 15: Introduction to Bioinformatics

Plain Text Files

• README.txt/README.md in each directory/folder

• Plain text is always human-readable

• Markdown (https://daringfireball.net/projects/markdown/basics)

• RST (http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html)

Page 16: Introduction to Bioinformatics

Galaxy workflows

• Use through browser, graphical interface

• Reproducible, shareable, documented, reusable analyses

• Wraps standard bioinformatics tools

• Local instance (http://ppserver/galaxy) uses JHI cluster

Page 17: Introduction to Bioinformatics

script

• Writes your terminal activity to a plain text file

• Saves effort copy/pasting and typing commands into a lab book, as you go

• Easy to use with other tools

• Use man script at your terminal to find out more

Page 18: Introduction to Bioinformatics

MediaWiki

• Useful for shared projects/data

• Automatic version control and attribution

• Many local instances at JHI (ask around)

Page 19: Introduction to Bioinformatics

A language notebook

• e.g. IPython Notebook, Mathematica, MATLAB cells

• Integrates live code and analysis with lab-book

Page 20: Introduction to Bioinformatics

LaTeX

• Powerful, versatile typesetting system (e.g. these slides)

• Similar to markup/markdown

• Pros: great for mathematical/computing work, writing a thesis

• Cons: not easy to pick up

Page 22: Introduction to Bioinformatics

In Conclusion

• Bioinformatics is just biology using computers and mathematics

• You still need to “do science” in the same way:
  • Keep accurate records
  • Plan and conduct experiments (with controls)
  • Follow the literature
  • Professional development

Page 23: Introduction to Bioinformatics

An Introduction to Bioinformatics Tools Part 1: Golden Rules of Bioinformatics

Leighton Pritchard and Peter Cock

Page 24: Introduction to Bioinformatics

On Confidence

“Ignorance more frequently begets confidence than does knowledge: it is those who know little, not those who know much, who so positively assert...” - Charles Darwin

Page 25: Introduction to Bioinformatics

Table of Contents

Rule 0

Rule 1

Rule 2

Rule 3

Conclusions

Page 26: Introduction to Bioinformatics

Zeroth Golden Rule of Bioinformatics

• No-one knows everything about everything - talk to people!
  • local bioinformaticians, mailing lists, forums, Twitter, etc.

• Keep learning - there are lots of resources

• There is no free lunch - no method works best on all data

• The worst errors are silent - share worries, problems, etc.

• Share expertise (see first item)

Page 28: Introduction to Bioinformatics

First Golden Rule of Bioinformatics

• Always inspect the raw data (trends, outliers, clustering)

• What is the question? Can the data answer it?

• Communicate with data collectors! (don’t be afraid of pedantry)
  • Who? When? How?
  • You need to understand the experiment to analyse it (easier if you helped design it).
  • Be wary of block effects (experimenter, time, batch, etc.)

Page 30: Introduction to Bioinformatics

Second Golden Rule of Bioinformatics

• Do not trust the software: it is not an authority
  • Software does not distinguish meaningful from meaningless data
  • Software has bugs
  • Algorithms have assumptions, conditions, and applicable domains
  • Some problems are inherently hard, or even insoluble

• You must understand the analysis/algorithm

• Always sanity test

• Test output for robustness to parameter (including data) choice

Page 32: Introduction to Bioinformatics

Third Golden Rule of Bioinformatics

• Everyone has expectations of their data/experiment
  • Beware cognitive errors, such as confirmation bias!
  • System 1 vs. System 2 ≈ intuition vs. reason

• Think statistically!
  • Large datasets can be counterintuitive and appear to confirm a large number of contradictory hypotheses
  • Always account for multiple tests.
  • Avoid “data dredging”: intensive computation is not an adequate substitute for expertise

• Use test-driven development of analyses and code
  • Use examples that pass and fail
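To make "always account for multiple tests" concrete, here is a minimal sketch of why uncorrected testing misleads, together with the simplest correction (Bonferroni). The function names are illustrative, the tests are assumed to be independent, and Bonferroni is only one of several corrections (it is conservative); the slides do not prescribe a particular method.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive across n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni(p_values, alpha=0.05):
    """Return True for each p-value that survives Bonferroni correction."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]


if __name__ == "__main__":
    # With 20 independent tests at alpha = 0.05, a false positive is more
    # likely than not.
    print(round(family_wise_error_rate(20), 3))
    # Only the first p-value passes the corrected threshold of 0.05 / 3.
    print(bonferroni([0.001, 0.02, 0.04]))
```

In the spirit of the last bullet, the example uses inputs that should pass and inputs that should fail, so the check itself is tested.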

Page 34: Introduction to Bioinformatics

In Conclusion

• Always communicate!
  • worst errors are silent

• Don’t trust the data
  • formatting/validation/category errors - check!
  • suitability for scientific question

• Don’t trust the software
  • software is not an authority
  • always benchmark, always validate

• Don’t trust yourself
  • beware cognitive errors
  • think statistically
  • biological “stories” can be constructed from nonsense

Page 35: Introduction to Bioinformatics

An Introduction to Bioinformatics Tools Part 2: BLAST

Leighton Pritchard and Peter Cock

Page 36: Introduction to Bioinformatics

Table of Contents

Introduction

Alignment

BLAST

BLAST Statistics

Using BLAST

Page 37: Introduction to Bioinformatics

Learning Outcomes

• How BLAST searches work

• How the way BLAST searches work affects your results

• Why search parameters matter

• Setting search parameters

Page 38: Introduction to Bioinformatics

About Bioinformatics Tools

Page 39: Introduction to Bioinformatics

A Recent Twitter Conversation

Page 41: Introduction to Bioinformatics

Why So Much Detail?

• You’re going to go away and do lots of BLAST searches

• Everyone uses BLAST - not everyone uses it well

• Easier to fix problems if you know how it works

• Understanding what’s going on helps avoid misuse/abuse

• Understanding what’s going on helps use the tool more effectively

• Not so much detail, really
  • like knowing about Tm and ion concentration effects, not molecular orbitals or thermodynamics (but ask if you’re interested ;) )

Page 43: Introduction to Bioinformatics

What BLAST Is

• BLAST:
  • Basic (it’s actually sophisticated)
  • Local Alignment (what it does: local sequence alignment)
  • Search Tool (what it does: search against a database)

• The most important software package in bioinformatics?

• Fast, robust, sequence similarity search tool

• Does not necessarily produce optimal alignments

• Not foolproof.

Page 45: Introduction to Bioinformatics

What A BLAST Search Is

• Every BLAST search is an in silico hybridisation experiment

• BLAST search = identification of similar sequences in a given database

• Results depend on:
  • query sequence
  • BLAST program (including version and BLAST vs BLAST+)
  • database
  • parameters

Page 46: Introduction to Bioinformatics

Alignment Search Space

Consider two biological sequences to be aligned. . .

• One sequence on the x-axis, the other on the y-axis

• Each point in space is a pairing of two letters

• Ungapped alignments are diagonal lines in the search space, gapped alignments have short “breaks”

• There may be one or more “optimal” alignments

Page 47: Introduction to Bioinformatics

Global vs Local Alignment

• Global alignment: sequences are aligned along their entire lengths

• Local alignment: the best subsequence alignment is found

• Consider an alignment of the same gene from two distantly-related eukaryotes, where:
  • Exons are conserved and small in relation to gene locus size
  • Introns are not well-conserved but large in relation to gene locus size

• Local alignment will align the conserved exon regions

• Global alignment will align the whole (mostly unrelated) locus

Page 49: Introduction to Bioinformatics

Our Goal

• We aim to align the words
  • COELACANTH
  • PELICAN

• Each identical letter (match) scores +1

• Each different letter (mismatch) scores -1

• Each gap scores -1

• All sequence alignment is maximisation of an alignment score - a mathematical operation.

Page 52: Introduction to Bioinformatics

Initialise the matrix

Page 53: Introduction to Bioinformatics

Fill the cells

Page 54: Introduction to Bioinformatics

Fill the matrix – represents all possible alignments & scores

Page 55: Introduction to Bioinformatics

Traceback

Page 56: Introduction to Bioinformatics

Algorithms

• Global: Needleman-Wunsch (as in example)

• Local: Smith-Waterman (differs from example)

• Biological information encapsulated only in the scoring scheme (matches, mismatches, gaps)

• NW/SW are guaranteed to find the optimal match with respect to the scoring system being used

• BUT the optimal alignment is a biological approximation: no scoring scheme encapsulates biological “truth”

• Any pair of sequences can be aligned: finding meaning is up to you
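The matrix steps from the worked example (initialise, fill, traceback) can be sketched as a short Needleman-Wunsch implementation using the same scores: match +1, mismatch -1, gap -1. This is a minimal teaching sketch with a linear gap penalty, not how production aligners are written; the function name and the tie-breaking order in the traceback are choices made here, so other equally optimal alignments may exist.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1

    # Initialise: the first row and column hold cumulative gap penalties.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill: each cell is the best of a diagonal step (match/mismatch)
    # or a step that introduces a gap in one of the sequences.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback: walk from the bottom-right cell back to the origin.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    best, row_a, row_b = needleman_wunsch("COELACANTH", "PELICAN")
    print(best)
    print(row_a)
    print(row_b)
```

Changing the three score parameters changes which alignment is "optimal", which is the point of the bullets above: the result is optimal only with respect to the scoring scheme.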

Page 60: Introduction to Bioinformatics

BLAST Is A Heuristic

• BLAST does not use Needleman-Wunsch or Smith-Waterman

• BLAST approximates dynamic programming methods

• BLAST is not guaranteed to give a mathematically optimal alignment

• BLAST does not explore the complete search space

• BLAST uses heuristics (loosely-defined rules) to refine High-scoring Segment Pairs (HSPs)

• BLAST reports only “statistically-significant” alignments (dependent on parameters)

Page 64: Introduction to Bioinformatics

Steps in the Algorithm

1. Seeding

2. Extension

3. Evaluation

Page 65: Introduction to Bioinformatics

Word Hits

• A word hit is a short sequence and its neighbourhood

• neighbourhood: words of same length whose aligned score is greater than or equal to a threshold value T

• Three parameters: scoring matrix, word size W, and T

Page 66: Introduction to Bioinformatics

Seeding

• BLAST assumption: significant alignments have words in common

• BLAST finds word (neighbourhood) hits in the database index

• Word hits are used to seed alignments

Page 67: Introduction to Bioinformatics

Seeding Controls Sensitivity

• Word size W controls number of hits (smaller words ⇒ more hits)

• Threshold score T controls number of hits (lower threshold ⇒ more hits)

• Scoring matrix controls which words match
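The effect of W and T on seeding can be illustrated with a toy neighbourhood calculation. This sketch assumes a four-letter alphabet and a simple match = +1 / mismatch = -1 score in place of a real substitution matrix, purely to keep the enumeration small; the function names are hypothetical and this is not BLAST's implementation.

```python
from itertools import product


def word_score(a, b, match=1, mismatch=-1):
    """Score two equal-length words position by position."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def neighbourhood(word, threshold, alphabet="ACGT"):
    """All words of the same length scoring >= threshold against `word`."""
    candidates = ("".join(chars) for chars in product(alphabet, repeat=len(word)))
    return sorted(c for c in candidates if word_score(word, c) >= threshold)


def query_words(query, word_size):
    """All overlapping words of length `word_size` in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


if __name__ == "__main__":
    # Lowering T enlarges the neighbourhood, so more database words can seed.
    for t in (3, 1):
        print(t, len(neighbourhood("ACG", t)))
    # Shorter words give more words per query (and each matches more often by chance).
    for w in (3, 2):
        print(w, query_words("ACGTAC", w))
```

With T equal to the maximum score only the exact word seeds an alignment; relaxing T by one mismatch already admits ten words for a three-letter word in this toy scheme.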

Page 68: Introduction to Bioinformatics

The Two-Hit Algorithm

• BLAST assumption: word hits cluster on the diagonal for significant alignments

• The acceptable distance A between words on the diagonal is a parameter of your model

• Smaller distances isolate single words, and reduce search space

Page 69: Introduction to Bioinformatics

Extension

• The best-scoring seeds are extended in each direction

• BLAST does not explore the complete search space, so a rule (heuristic) to stop extension is needed

• Two-stage process:
  • Extend, keeping alignment score, and drop-off score
  • When drop-off score reaches a threshold X, trim alignment back to top score

Page 70: Introduction to Bioinformatics

Example

• Consider two sentences (match=+1, mismatch=-1)
  • The quick brown fox jumps over the lazy dog.
  • The quiet brown cat purrs when she sees him.

• Extend to the right from the seed T

  The quic
  The quie
  123 4565 <- score
  000 0001 <- drop-off score

Page 72: Introduction to Bioinformatics

Example

• Consider two sentences (match=+1, mismatch=-1)
  • The quick brown fox jumps over the lazy dog.
  • The quiet brown cat purrs when she sees him.

• Extend to drop-off threshold

  The quick brown fox jump
  The quiet brown cat purr
  123 45654 56789 876 5654 <- score
  000 00012 10000 123 4345 <- drop-off score

Page 73: Introduction to Bioinformatics

Example

• Consider two sentences (match=+1, mismatch=-1)
  • The quick brown fox jumps over the lazy dog.
  • The quiet brown cat purrs when she sees him.

• Trim back from drop-off threshold to get optimal alignment

  The quick brown
  The quiet brown
  123 45654 56789 <- score
  000 00012 10000 <- drop-off score
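The extend-then-trim procedure in this example can be written as a short function. This is an ungapped, rightward-only sketch under the slide's scoring (match +1, mismatch -1); as in the slide, spaces are not scored, so they are removed first. The drop-off threshold X = 5 is inferred from where the slide's extension stops and is an assumption, as is the function name.

```python
def extend_right(a, b, x_drop, match=1, mismatch=-1):
    """Ungapped extension from position 0; returns (best_score, best_length).

    Extension stops once the running score has fallen x_drop below the best
    score seen so far; the alignment is then trimmed back to that best point.
    """
    score = best = best_len = 0
    for k, (x, y) in enumerate(zip(a, b), start=1):
        score += match if x == y else mismatch
        if score > best:
            best, best_len = score, k
        if best - score >= x_drop:  # drop-off score has reached threshold X
            break
    return best, best_len


if __name__ == "__main__":
    s1 = "The quick brown fox jumps over the lazy dog.".replace(" ", "")
    s2 = "The quiet brown cat purrs when she sees him.".replace(" ", "")
    best, length = extend_right(s1, s2, x_drop=5)
    print(best, s1[:length], s2[:length])
```

A larger X lets the extension push through longer mismatched stretches before giving up, at the cost of exploring more of the search space.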

Page 74: Introduction to Bioinformatics

Notes on implementation

• X controls termination of alignment extension, but is dependent on:
  • substitution matrix
  • gap opening and extension parameters

Page 75: Introduction to Bioinformatics

Evaluation

• The principle is easy: use a score threshold S to determine strong and weak alignments

• S is monotonic with E, so an equivalent threshold can be calculated

• Score S is independent of database size and search space. E-values are not.

• Alignment consistency of HSPs is also a factor in the report

Page 77: Introduction to Bioinformatics

Log-odds Matrices

• Substitution matrices are your model of evolution

• Substitution matrices are log-odds matrices
  • Positive numbers indicate likely substitutions/similarity
  • Negative numbers indicate unlikely substitutions/dissimilarity

[Figure: BLOSUM62 matrix]
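A log-odds score compares how often two residues are observed aligned with how often they would pair by chance. The sketch below computes such a score for made-up frequencies; the numbers are hypothetical, and the scale factor of 2 (half-bit units, as commonly described for BLOSUM62) is stated here as an assumption rather than taken from the slides.

```python
import math


def log_odds_score(q_ij, p_i, p_j, scale=2.0):
    """Rounded, scaled log2 ratio of observed to expected pair frequency.

    q_ij: observed frequency of residues i and j aligned together
    p_i, p_j: background frequencies of residues i and j
    """
    return round(scale * math.log2(q_ij / (p_i * p_j)))


if __name__ == "__main__":
    # Hypothetical frequencies: both residues have background frequency 0.05,
    # so they would pair by chance with frequency 0.0025.
    print(log_odds_score(0.01, 0.05, 0.05))      # seen 4x more than chance: positive
    print(log_odds_score(0.000625, 0.05, 0.05))  # seen 4x less than chance: negative
```

Because the scores are logarithms, summing them along an alignment corresponds to multiplying the underlying odds ratios.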

Page 78: Introduction to Bioinformatics

Choice of Matrix

• Substitution matrix determines the raw alignment score S
  • S is the sum of pairwise scores in an alignment

• BLAST provides, for proteins:
  • BLOSUM45 BLOSUM50 BLOSUM62 BLOSUM80 BLOSUM90
  • PAM30 PAM70 PAM250

• BLOSUMn matrices are defined empirically from conserved blocks in multiple sequence alignments, with sequences sharing ≥ n% identity clustered together

• For nucleotides: ‘matrix’ defined by match/mismatch (reward/penalty) parameters

Page 79: Introduction to Bioinformatics

Definition

• The Karlin-Altschul equation

$$E = k\,m\,n\,e^{-\lambda S}$$

• Symbols:
  • k: minor constant, adjusts for correlation between alignments
  • m: number of letters in query sequence
  • n: number of letters in the database
  • λ: scoring matrix scaling factor
  • S: raw alignment score

Page 80: Introduction to Bioinformatics

Interpretation

• The Karlin-Altschul equation

$$E = k\,m\,n\,e^{-\lambda S}$$

• E is the number of alignments of a similar score expected by chance when querying a database of the same size and letter frequency, where the letters in that database are randomly-ordered

• Small changes in score S can produce large changes in E

• BUT biological sequence databases are not random!
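The behaviour of the Karlin-Altschul equation is easy to explore numerically. In this sketch the default values k = 0.041 and λ = 0.267 are illustrative assumptions (values of this order are often quoted for gapped BLOSUM62 searches), and real BLAST uses effective rather than raw sequence lengths, so the numbers show trends only.

```python
import math


def expected_hits(raw_score, query_len, db_len, k=0.041, lam=0.267):
    """E = k * m * n * exp(-lambda * S), using raw (not effective) lengths."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    m = 300  # letters in a hypothetical query
    for n in (1e6, 1e9):  # two hypothetical database sizes
        for s in (40, 50, 60):
            print(f"n={n:.0e} S={s} E={expected_hits(s, m, n):.3g}")
```

Two features stand out: each 10-point increase in S divides E by exp(10λ), roughly 14-fold with these parameters, and the same alignment score gives a thousand-fold larger E against a thousand-fold larger database.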

Page 82: Introduction to Bioinformatics

Multiple BLAST tools

• BLASTN vs MEGABLAST vs TBLASTX vs ...?

• Korf et al. (2003) “BLAST” is really good for the theory part, but its practical examples are dated due to changes with BLAST+

Page 83: Introduction to Bioinformatics

Multiple flavours of BLAST

• NCBI “legacy” BLAST
  • Now obsolete and not being updated
  • Spawned offshoots including:
    • WU-BLAST aka AB-BLAST (commercial)
    • MPI-BLAST for use on clusters
    • Versions to run on graphics cards

• NCBI BLAST+
  • Re-written in 2009 using C++ instead of C
  • Many improvements
  • Slightly different output
  • Different commands used to run it

Page 84: Introduction to Bioinformatics

Multiple ways to run BLAST

• BLAST+ at the command line (today)

• Via a script or programming language

• Via a graphical tool like BioEdit, CLCbio, Blast2GO

• Via the NCBI website

• Via a genome consortium website

• Via a Galaxy web server

• etc

• Offers flexibility but different settings/options/versions

Page 85: Introduction to Bioinformatics

Multiple places to run BLAST

• On the NCBI servers, e.g. via website or tool

• On 3rd party servers, e.g. via websites

• On your own computer

• On our Linux cluster

Page 86: Introduction to Bioinformatics

Core BLAST tools: Query sequences vs Database

• Nucleotide vs Nucleotide:
  • blastn (covering blastn, megablast, dc-megablast)

• Translated nucleotide vs Protein:
  • blastx

• Protein vs Translated nucleotide:
  • tblastn

• Protein vs Protein:
  • blastp, psiblast, phiblast, deltablast

See http://blast.ncbi.nlm.nih.gov/ for a reminder ;)

Page 87: Introduction to Bioinformatics

The BLAST tools have built in help

$ blastp -h
USAGE
  blastp [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-entrez_query entrez_query]
    [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
    [-subject subject_input_file] [-subject_loc range] [-query input_file]
    [-out output_file] [-evalue evalue] [-word_size int_value]
    [-gapopen open_penalty] [-gapextend extend_penalty]
    [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value] [-max_hsps int_value]
    [-sum_statistics] [-seg SEG_options] [-soft_masking soft_masking]
    [-matrix matrix_name] [-threshold float_value] [-culling_limit int_value]
    ...
    [-max_target_seqs num_sequences] [-num_threads int_value] [-ungapped]
    [-remote] [-comp_based_stats compo] [-use_sw_tback] [-version]

DESCRIPTION
   Protein-Protein BLAST 2.2.29+

Use '-help' to print detailed descriptions of command line arguments

Page 88: Introduction to Bioinformatics

Minimal example of BLAST+ at the command line

$ blastp -query my_input.fasta -db my_database -out my_output.txt

• Replace blastp with the appropriate tool, e.g. blastn

• Replace my_input.fasta with your actual filename

• Replace my_database with your actual database, e.g. nr

• Replace my_output.txt with your desired output filename

• Best to avoid spaces in your folder and filenames!

e.g.

$ blastp -query query.fasta -db dbA -out my_output.txt
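Running BLAST+ from a script, rather than typing commands by hand, makes a search easier to record and repeat. The sketch below only assembles the command line from the options shown on these slides (`-query`, `-db`, `-out`, `-evalue`, `-outfmt`); the helper names `build_blast_command` and `run_blast` are hypothetical, and `run_blast` assumes BLAST+ is installed and on your PATH.

```python
import shutil
import subprocess


def build_blast_command(program, query, db, out, evalue=None, outfmt=None):
    """Return the BLAST+ command as a list of arguments (no shell quoting needed)."""
    cmd = [program, "-query", query, "-db", db, "-out", out]
    if evalue is not None:
        cmd += ["-evalue", str(evalue)]
    if outfmt is not None:
        cmd += ["-outfmt", str(outfmt)]
    return cmd


def run_blast(cmd):
    """Run the command, raising an error if BLAST+ is missing or the search fails."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = build_blast_command("blastp", "query.fasta", "dbA", "my_output.txt")
    print(" ".join(cmd))  # log the exact command in your lab book
    # run_blast(cmd)      # uncomment once query.fasta and dbA exist
```

Passing the arguments as a list avoids shell-quoting problems, although avoiding spaces in file names remains good practice.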

Page 89: Introduction to Bioinformatics

Setting the BLAST+ output format

$ blastp -help
USAGE
...

 *** Formatting options
 -outfmt <String>
   alignment view options:
     0 = pairwise,
     1 = query-anchored showing identities,
     2 = query-anchored no identities,
     3 = flat query-anchored, show identities,
     4 = flat query-anchored, no identities,
     5 = XML Blast output,
     6 = tabular,
     7 = tabular with comment lines,
     8 = Text ASN.1,
     9 = Binary ASN.1,
    10 = Comma-separated values,
    11 = BLAST archive format (ASN.1)

...
   Default = ‘0’
...

Page 90: Introduction to Bioinformatics

Setting the BLAST+ output format

Default is plain text pairwise alignments, for humans:

$ blastp -query query.fasta -db dbA -out my_output.txt
...

XML output can be useful (e.g. for BLAST2GO):

$ blastp -query query.fasta -db dbA -out my_output.xml -outfmt 5
...

Tabular output is easiest to filter, sort, etc:

$ blastp -query query.fasta -db dbA -out my_output.tab -outfmt 6
...
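As an illustration of why tabular output is convenient, the sketch below reads `-outfmt 6` rows, filters them by E-value and sorts them by bit score. It assumes the standard twelve default columns of BLAST+ tabular output; the example rows are invented for the demonstration and the function names are hypothetical.

```python
import csv
import io

# Default column order for BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}
TEXT_COLUMNS = {"qseqid", "sseqid"}


def read_blast_tabular(handle):
    """Parse tab-separated BLAST hits into a list of dictionaries."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):  # skip blanks and -outfmt 7 comments
            continue
        hit = {}
        for name, value in zip(COLUMNS, row):
            if name in TEXT_COLUMNS:
                hit[name] = value
            elif name in FLOAT_COLUMNS:
                hit[name] = float(value)
            else:
                hit[name] = int(value)
        hits.append(hit)
    return hits


def best_hits(hits, max_evalue=1e-5):
    """Keep hits at or below the E-value cutoff, highest bit score first."""
    kept = [h for h in hits if h["evalue"] <= max_evalue]
    return sorted(kept, key=lambda h: h["bitscore"], reverse=True)


# Invented example rows, for illustration only.
EXAMPLE = (
    "q1\ts2\t35.0\t60\t39\t0\t10\t69\t1\t60\t2.5\t25.4\n"
    "q1\ts3\t70.0\t150\t45\t0\t1\t150\t1\t150\t3e-40\t160.0\n"
    "q1\ts1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t380.0\n"
)

if __name__ == "__main__":
    for hit in best_hits(read_blast_tabular(io.StringIO(EXAMPLE))):
        print(hit["sseqid"], hit["evalue"], hit["bitscore"])
```

For a real search, replace `io.StringIO(EXAMPLE)` with `open("my_output.tab")`.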

Page 91: Introduction to Bioinformatics

Setting the e-value threshold

Check the built in help:

$ blastp -help
USAGE
...
 -evalue <Real>
   Expectation value (E) threshold for saving hits
   Default = ‘10’
...

Example using 0.00001, i.e. 1 × 10^-5, written in scientific notation as 1e-5:

$ blastp -query query.fasta -db dbA -out my_output.txt -evalue 1e-5
...

Page 92: Introduction to Bioinformatics

In Conclusion

• Every BLAST search is an experiment

• Badly-designed searches can give you bad results

• Knowing how BLAST works helps improve search design

• BLAST results still require inspection and interpretation

Page 93: Introduction to Bioinformatics

An Introduction to Bioinformatics Tools Part 3: Workshop

Leighton Pritchard and Peter Cock

Page 94: Introduction to Bioinformatics

Table of Contents

Introduction

Workshop Data

Gene Prediction

Genome Comparisons

Gene Comparisons

Conclusions

Page 95: Introduction to Bioinformatics

Learning Outcomes

• Workshop example: bacterial genome annotation (because they’re small and data easy to handle)

• The role of biological insight in a bioinformatics workflow
  • Visual interaction with sequence data
  • Using alternative tools
  • Comparison of tools and outputs
  • Online tools for automated function prediction

Page 96: Introduction to Bioinformatics

What You Will Be Doing

Illustrative example of concepts: Functional annotation of a draft bacterial genome

1. Gene prediction

2. Genome comparisons

3. Gene comparisons

Page 98: Introduction to Bioinformatics

Locate your data

• You are in group A, B, C or D - this decides your chromosome sequence: chrA.fasta, chrB.fasta, chrC.fasta, chrD.fasta

• Each sequence represents a single stitched, ordered draft bacterial genome comprising a number of contigs.

• You will use your sequence as the basis of the exercises in the workshop.

Page 99: Introduction to Bioinformatics

Locate your data

• You are in group A, B, C or D - this decides your dataset: chrA.fasta, chrB.fasta, chrC.fasta, chrD.fasta

• You also have a GFF file describing the location of assembled contigs: chrA_contigs.gff, chrB_contigs.gff, chrC_contigs.gff, chrD_contigs.gff

Page 100: Introduction to Bioinformatics

Inspect the data

$ head -n 3 chrA.fasta
>chrA
ttttcttgattgaccttgttcgagtggagtccgccgtgtcactttcgctttggcagcagt
gtcttgcccgtttgcaggatgagttacctgccacagaattcagtatgtggatacgcccgt
$ head -n 3 chrA_contigs.gff
##gff-version 3
chrA stitching contig 1 154993 . . . ID=contig00005_b;Name=contig00005_b
chrA stitching contig 155036 241491 . . . ID=contig00018;Name=contig00018

Page 101: Introduction to Bioinformatics

Inspect the data

Starting Artemis

$ art &

Page 102: Introduction to Bioinformatics

Load the chromosome sequence

Select the sequence for your group

Page 105: Introduction to Bioinformatics

Load the contig GFF

Select the file for your group

Page 107: Introduction to Bioinformatics

Find the stitching sequence

The contigs are stitched with a specific sequence: see if you can find, and identify it.
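One way to locate the stitching regions without scrolling through Artemis is to compute the gaps between consecutive contigs in the GFF file. The sketch below works on the two example GFF lines shown earlier; the function names are hypothetical, and it assumes all features lie on a single sequence and that attribute fields contain no spaces.

```python
def parse_gff_features(lines):
    """Return (seqid, start, end) for each feature line, skipping comments."""
    features = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # GFF3 columns are tab-separated; splitting on any whitespace also
        # works here because none of the fields contain spaces.
        fields = line.split()
        features.append((fields[0], int(fields[3]), int(fields[4])))
    return features


def gaps_between(features):
    """Return 1-based inclusive (start, end) of regions between features."""
    ordered = sorted(features, key=lambda f: f[1])
    gaps = []
    for (_, _, prev_end), (_, next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:
            gaps.append((prev_end + 1, next_start - 1))
    return gaps


EXAMPLE_GFF = [
    "##gff-version 3",
    "chrA stitching contig 1 154993 . . . ID=contig00005_b;Name=contig00005_b",
    "chrA stitching contig 155036 241491 . . . ID=contig00018;Name=contig00018",
]

if __name__ == "__main__":
    for start, end in gaps_between(parse_gff_features(EXAMPLE_GFF)):
        print(start, end, end - start + 1)
```

The coordinates it prints can be pasted into Artemis to jump straight to the region between the first two contigs.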

Page 109: Introduction to Bioinformatics

Lines of Evidence

• ab initio gene calling:
  • Unsupervised methods - not trained on a dataset
  • Supervised methods - trained on a dataset

• homology matches
  • alignment to genes from related organisms (annotation transfer)
  • from known gene products (e.g. proteins, ncRNA)
  • from transcripts/other intermediates (e.g. ESTs, cDNA, RNAseq)

Page 110: Introduction to Bioinformatics

Consensus Methods

• Combine weighted evidence from multiple sources, using linear combination or graph theoretical methods

• For eukaryotes:
  • EVM http://evidencemodeler.sourceforge.net/
  • Jigsaw http://www.cbcb.umd.edu/software/jigsaw/
  • GLEAN http://sourceforge.net/projects/glean-gene/

Page 111: Introduction to Bioinformatics

Basic Gene Finding

• We could use Artemis to identify the longest coding region in each ORF, but that takes lots of manual steps

• This is the most basic gene finding, and can easily be automated, e.g. EMBOSS getorf

• Dedicated gene finders usually more appropriate...

Page 112: Introduction to Bioinformatics

Finding Open Reading Frames

• ORF finding is naive, and does not consider:
  • Start codon
  • Splicing
  • Promoter/RBS motifs
  • Wider context (e.g. overlapping genes)
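
To make the naivety concrete, here is a minimal Python sketch of stop-to-stop ORF finding on one strand. It is an illustration only, not how EMBOSS getorf is implemented; the function names, the `min_codons` cut-off and the standard stop codon set are assumptions made for this example. Note that it ignores start codons, splicing, motifs and context, exactly as listed above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard stop codons


def reverse_complement(seq):
    """Return the reverse complement, so the other strand can be scanned too."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def stop_to_stop_orfs(seq, min_codons=30):
    """Yield (frame, start, end) for stop-free stretches on one strand.

    Coordinates are 0-based and end-exclusive, relative to the string given.
    """
    seq = seq.upper()
    for frame in range(3):
        orf_start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - orf_start) // 3 >= min_codons:
                    yield frame, orf_start, pos
                orf_start = pos + 3
        # A stretch that runs off the end of the sequence is still reported
        last = frame + ((len(seq) - frame) // 3) * 3
        if (last - orf_start) // 3 >= min_codons:
            yield frame, orf_start, last
```

Calling `stop_to_stop_orfs(reverse_complement(seq))` covers the remaining three frames; the coordinates it returns are then relative to the reverse-complemented string.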

Page 113: Introduction to Bioinformatics

Prokaryotic Prediction Methods

• Prokaryotes “easier” than eukaryotes for gene prediction

• Less uncertainty in predictions (isoforms, gene structure)

• Very gene-dense (over 90% of chromosome is coding sequence)

• No intron-exon structure

• Problem is: “which possible ORF contains the true gene, and which start site is correct?”

• Still not a solved problem

Page 114: Introduction to Bioinformatics

Two ab initio Prokaryotic Prediction Methods

You will be using two tools

• Glimmer
  • Interpolated Markov models
  • Can be trained on “gold standard” datasets

• Prodigal
  • Log-likelihood model based on GC frame plots, followed by dynamic programming
  • Can be trained on “gold standard” datasets

Page 115: Introduction to Bioinformatics

Using Glimmer

Supervised - we train on a related complete genome sequence, then run glimmer3

$ build-icm -r NC_004547.icm < NC_004547.ffn
$ glimmer3 -o 50 -g 110 -t 30 chrA.fasta NC_004547.icm chrA_glimmer3

• -o 50: max overlap bases

• -g 110: min gene length

• -t 30: threshold score

Page 116: Introduction to Bioinformatics

Using Glimmer

glimmer3 output is not standard GFF format:

$ head -n 4 chrA_glimmer3.predict
>chrA
orf00001 36 1430 +3 8.81
orf00002 1435 2535 +1 11.51
orf00005 2676 3761 +3 8.63

We could Google for help, or use the provided conversion script:

$ python glimmer_to_gff.py chrA_glimmer3.predict
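
The provided glimmer_to_gff.py is not reproduced in these slides. As a rough sketch of what such a conversion involves, the Python below turns `.predict` lines into GFF feature lines. The function name, the "Glimmer" source label and the fixed phase of 0 are assumptions chosen to match the GFF output shown in this workshop, and genes that wrap around the origin of a circular sequence are not handled.

```python
def glimmer_predict_to_gff(lines):
    """Convert glimmer3 .predict lines into tab-separated GFF feature lines."""
    seqid = None
    gff = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line names the sequence that the following ORFs belong to
            seqid = line[1:].split()[0]
            continue
        orf_id, start, end, frame, score = line.split()[:5]
        start, end = int(start), int(end)
        strand = "+" if frame.startswith("+") else "-"
        # GFF needs start <= end; reverse-strand calls list the higher coordinate first
        low, high = min(start, end), max(start, end)
        attributes = f"ID={orf_id};Name={orf_id}"
        gff.append("\t".join([seqid, "Glimmer", "CDS", str(low), str(high),
                              score, strand, "0", attributes]))
    return gff
```

Writing `"##gff-version 3"` followed by these lines to a file would give something Artemis can load alongside the Prodigal output.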

Page 117: Introduction to Bioinformatics

Using Glimmer

We now have output in GFF

$ head -n 3 chrA_glimmer3.gff
chrA Glimmer CDS 36 1430 8.81 + 0 ID=orf00001;Name=orf00001
chrA Glimmer CDS 1435 2535 11.51 + 0 ID=orf00002;Name=orf00002
chrA Glimmer CDS 2676 3761 8.63 + 0 ID=orf00005;Name=orf00005

Page 118: Introduction to Bioinformatics

Using Prodigal

Unsupervised (i.e. untrained) mode

$ prodigal -f gff -o chrA_prodigal.gff -i chrA.fasta

Page 119: Introduction to Bioinformatics

Using Prodigal

Prodigal GFF output is correctly formatted and informative

$ head -n 6 chrA_prodigal.gff
##gff-version 3
# Sequence Data: seqnum=1;seqlen=4727782;seqhdr="chrA"
# Model Data: version=Prodigal.v2.50;run_type=Single;model="Ab initio";gc_cont=54.48;transl_table=11;uses_sd=1
chrA Prodigal_v2.50 CDS 3 1430 188.5 + 0 ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;score=188.54;cscore=185.37;sscore=3.18;rscore=0.00;uscore=3.18;tscore=0.00
chrA Prodigal_v2.50 CDS 1435 2535 185.6 + 0 ID=1_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;score=185.61;cscore=184.24;sscore=1.36;rscore=-7.73;uscore=3.48;tscore=4.37
chrA Prodigal_v2.50 CDS 2676 3761 146.2 + 0 ID=1_3;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;score=146.19;cscore=149.82;sscore=-3.63;rscore=-7.73;uscore=-0.28;tscore=4.37

Page 120: Introduction to Bioinformatics

Comparing predictions in Artemis

Page 123: Introduction to Bioinformatics

Comparing predictions in Artemis

Do ORF (orange) and CDS (green, blue) prediction methods agree?

Page 124: Introduction to Bioinformatics

Comparing predictions in Artemis

Do glimmer (green) and prodigal (blue) CDS prediction methods agree?

How do we know which (if either) is best?

Page 125: Introduction to Bioinformatics

Using a “Gold Standard”

A general approach for all predictive methods

• Define a known, “correct” set of true/false, positive/negative etc. examples - the “gold standard”

• Evaluate your predictive method against that set for
  • sensitivity, specificity, accuracy, precision, etc.

Many methods are available; covering them is beyond the scope of this introduction

Page 126: Introduction to Bioinformatics

Contingency Tables

                          Condition (Gold standard)
                          True             False
Test outcome   Positive   True Positive    False Positive
               Negative   False Negative   True Negative

Sensitivity = TPR = TP/(TP + FN)

Specificity = TNR = TN/(FP + TN)

FPR = 1 − Specificity = FP/(FP + TN)

If you don’t have this information, you can’t interpret predictive results properly.
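
The formulas above translate directly into code. The following is a small Python sketch; the function name and the example counts are made up for illustration, and precision (PPV) and accuracy are included because they are mentioned elsewhere in this section.

```python
def performance_metrics(tp, fp, fn, tn):
    """Compute common metrics from the four cells of a contingency table."""
    return {
        "sensitivity": tp / (tp + fn),               # TPR
        "specificity": tn / (fp + tn),               # TNR
        "fpr": fp / (fp + tn),                       # 1 - specificity
        "precision": tp / (tp + fp),                 # PPV
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical counts, chosen only to exercise the function
metrics = performance_metrics(tp=90, fp=5, fn=10, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```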

Page 127: Introduction to Bioinformatics

Why Performance Metrics Matter

• You go for a checkup, and are tested for disease X

• The test has sensitivity = 0.95 (predicts disease where there is disease)

• The test has FPR = 0.01 (predicts disease where there is no disease)

• Your test is positive

• What is the probability that you have disease X?
  • 0.01, 0.05, 0.50, 0.95, 0.99?

Page 129: Introduction to Bioinformatics

Why Performance Metrics Matter

• What is the probability that you have disease X?

• Unless you know the baseline occurrence of disease X, you cannot know.

• Baseline occurrence: fX
  • fX = 0.01 =⇒ P(disease|+ve) = 0.490 ≈ 0.5
  • fX = 0.8 =⇒ P(disease|+ve) = 0.997 ≈ 1.0

Page 131: Introduction to Bioinformatics

Why Performance Metrics Matter

• Imagine a predictor for protein functional class

• Predictor has sensitivity = 0.95, FPR = 0.01

• You run the predictor on 20,000 proteins in an organism

• We estimate ≈ 200 members of the class in the protein complement, so fX = 0.01

• fX = 0.01 =⇒ P(member|+ve) = 0.490 ≈ 0.5
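
Counting expected outcomes can make this result feel less surprising. The Python sketch below uses the numbers from this slide (20,000 proteins, fX = 0.01, sensitivity 0.95, FPR 0.01); the function name is an assumption made for illustration.

```python
def expected_positives(n_total, prevalence, sensitivity, fpr):
    """Return the expected (true positive, false positive) counts."""
    n_members = n_total * prevalence
    n_non_members = n_total - n_members
    return sensitivity * n_members, fpr * n_non_members


tp, fp = expected_positives(20000, 0.01, 0.95, 0.01)
print(f"expected true positives:  {tp:.0f}")   # 190
print(f"expected false positives: {fp:.0f}")   # 198
print(f"fraction of positives that are real: {tp / (tp + fp):.3f}")  # 0.490
```

There are so many more non-members than members that even a 1% false positive rate produces about as many false positives as true ones.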

Page 133: Introduction to Bioinformatics

Bayes’ Theorem

• May seem counter-intuitive: 95% sensitivity and 99% specificity =⇒ ≈50% chance of any positive prediction being incorrect (when fX = 0.01)

• Probability given by Bayes’ Theorem

• P(X|+) = P(+|X)P(X) / [P(+|X)P(X) + P(+|X̄)P(X̄)]

• This is commonly overlooked in the literature (confirmation bias?)

• e.g. in a paper describing a novel TTSS predictor: “The surprisingly high number of (false) positives in genomes without TTSS exceeds the expected false positive rate”
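
As a check on the numbers quoted in this section, Bayes' Theorem can be evaluated directly. This is a minimal Python sketch; the function and argument names are chosen for illustration, with P(+|X) as the sensitivity, P(+|X̄) as the FPR and P(X) as the baseline occurrence fX.

```python
def posterior(prior, sensitivity, fpr):
    """P(X|+) from Bayes' Theorem."""
    true_positive_mass = sensitivity * prior        # P(+|X) P(X)
    false_positive_mass = fpr * (1.0 - prior)       # P(+|not X) P(not X)
    return true_positive_mass / (true_positive_mass + false_positive_mass)


for f_x in (0.01, 0.8):
    print(f"fX = {f_x}: P(X|+) = {posterior(f_x, 0.95, 0.01):.3f}")
# fX = 0.01: P(X|+) = 0.490
# fX = 0.8: P(X|+) = 0.997
```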

Page 134: Introduction to Bioinformatics

Interpreting Performance Metrics

• Use Bayes’ Theorem!

• Predictions apply to groups, not individual members of the group, e.g.
  • Test for airport smugglers has P(smuggler|+) = 0.9
  • Test gives 100 positives

• Which specific individuals are truly smugglers?

• The test does not allow you to determine this - you need more evidence for each individual

• Same principle applies to all other tests (including protein functional class prediction) - you should not ‘cherry-pick’ for publication without other evidence

Page 136: Introduction to Bioinformatics

“Gold Standard” results

• Tested glimmer and prodigal on two “gold standards”
  • Manually annotated (>3 expert person years) close relative
  • Community-annotated close relative

• Both methods trained directly on the annotated genes in each organism!

Page 137: Introduction to Bioinformatics

“Gold Standard” results

genecaller           glimmer     prodigal
predicted            4752        4287
missed               284 (6%)    407 (9%)
Exact Prediction
  sensitivity        62%         71%
  FDR                41%         25%
  PPV                59%         75%
Correct ORF
  sensitivity        94%         91%
  FDR                10%         3%
  PPV                90%         97%

Page 138: Introduction to Bioinformatics

“Gold Standard” results

genecaller           glimmer     prodigal
predicted            4679        4467
missed               112 (3%)    156 (3%)
Exact Prediction
  sensitivity        62%         86%
  FDR                31%         14%
  PPV                69%         86%
Correct ORF
  sensitivity        97%         97%
  FDR                7%          3%
  PPV                93%         97%
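
The slides do not show how these figures were calculated. A hedged Python sketch of one way to score gene calls against a gold standard follows. It assumes that each gene is a (start, end, strand) tuple with start < end, that an "exact prediction" has identical coordinates, and that a "correct ORF" shares the stop codon position and strand but may differ at the start; these definitions, the function names and the toy coordinates are assumptions for illustration.

```python
def stop_position(gene):
    """Identify a gene by its stop end: high coordinate on '+', low on '-'."""
    start, end, strand = gene
    return (end if strand == "+" else start, strand)


def compare_to_gold(predicted, reference):
    """Return sensitivity, PPV and FDR for exact and stop-only matching."""
    comparisons = {
        "exact": (set(predicted), set(reference)),
        "correct_orf": ({stop_position(g) for g in predicted},
                        {stop_position(g) for g in reference}),
    }
    results = {}
    for label, (pred, ref) in comparisons.items():
        matched = len(pred & ref)
        ppv = matched / len(pred)
        results[label] = {"sensitivity": matched / len(ref),
                          "ppv": ppv,
                          "fdr": 1.0 - ppv}
    return results


# Toy coordinates, not workshop results
reference = [(3, 1430, "+"), (1435, 2535, "+"), (2676, 3761, "+"), (4000, 5000, "-")]
predicted = [(36, 1430, "+"), (1435, 2535, "+"), (2676, 3761, "+"), (6000, 6500, "+")]
print(compare_to_gold(predicted, reference))
```

In this scheme FDR and PPV always sum to 1, which is also true of every FDR/PPV pair in the tables above.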

Page 139: Introduction to Bioinformatics

Gene/CDS Prediction

• Many alternative methods, all perform differently

• To assess/choose methods, performance metrics are required

• Even on (relatively simple) prokaryotes, the current best methods are imperfect

• Manual assessment and intervention is essential, and usually the longest part of the process

Page 140: Introduction to Bioinformatics

Table of Contents

Introduction

Workshop Data

Gene Prediction

Genome Comparisons

Gene Comparisons

Conclusions

Page 141: Introduction to Bioinformatics

Run a megaBLAST Comparison

BLAST your chromosome against the comparator sequence.

Put results in chrA_megablast_Pba.tab

$ blastn -query chrA.fasta -subject NC_004547.fna -out chrA_megablast_Pba.tab -outfmt 6
$ head -n 3 chrA_megablast_Pba.tab
chrA gi|50118965|ref|NC_004547.2|:10948-12453 80.34 1511 287 10 4579450 4580955 1506 1 0.0 1136
chrA gi|50118965|ref|NC_004547.2|:c33859-32447 82.04 1409 253 0 4563151 4564559 1 1409 0.0 1201
chrA gi|50118965|ref|NC_004547.2|:c34917-33868 82.48 1050 184 0 4562093 4563142 1 1050 0.0 920

Note this defaults to using MEGABLAST...

Page 142: Introduction to Bioinformatics

Run a BLASTN Comparison

BLAST your chromosome against the comparator sequence.

Put results in chrA_blastn_Pba.tab

$ blastn -query chrA.fasta -subject NC_004547.fna -out chrA_blastn_Pba.tab -outfmt 6 -task blastn
$ head -n 3 chrA_blastn_Pba.tab
chrA gi|50118965|ref|NC_004547.2|:5629-7497 79.68 1865 379 0 4584915 4586779 1865 1 0.0 1654
chrA gi|50118965|ref|NC_004547.2|:5629-7497 92.59 27 2 0 4479367 4479393 1254 1280 0.004 41.0
chrA gi|50118965|ref|NC_004547.2|:5629-7497 100.00 17 0 0 4613022 4613038 52 36 2.1 31.9

Note we added -task blastn

Page 143: Introduction to Bioinformatics

Do BLASTN and megaBLAST comparisons agree?

Check the number of alignments returned with wc

$ wc chrA_megablast_Pba.tab
2675 32100 242539 chrA_megablast_Pba.tab
$ wc chrA_blastn_Pba.tab
31792 381504 2850953 chrA_blastn_Pba.tab

What is this telling us?

Why do the results differ?
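
One way to look beyond the raw line counts is to ask how many of the reported alignments are short. The Python sketch below parses `-outfmt 6` output, assuming the twelve default tab-separated columns; the function name and the 50-base cut-off are arbitrary choices for illustration.

```python
import csv

# Default column order for BLAST+ tabular output (-outfmt 6)
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def summarise_blast_tab(lines, short_cutoff=50):
    """Count alignments, and how many are shorter than short_cutoff bases."""
    total = 0
    short = 0
    for row in csv.DictReader(lines, fieldnames=FIELDS, delimiter="\t"):
        total += 1
        if int(row["length"]) < short_cutoff:
            short += 1
    return {"alignments": total, "short": short}
```

Passing an open file handle for chrA_megablast_Pba.tab and then for chrA_blastn_Pba.tab would show whether the extra BLASTN rows are mostly short matches, like the 27 and 17 base alignments in the BLASTN output shown earlier.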

Page 144: Introduction to Bioinformatics

BLASTN vs megaBLAST

• Legacy BLASTN uses the BLAST algorithm, megaBLAST does not

• (though BLAST+ BLASTN now uses megaBLAST by default)

• megaBLAST uses a fast, greedy algorithm due to Zhang et al. (2000) http://www.ncbi.nlm.nih.gov/pubmed/10890397

• megaBLAST is optimised for
  • genome-level searches
  • queries on large sequence sets (automatic query packing)
  • long alignments of similar sequences, with SNPs/sequencing errors

• A discontinuous mode (dc-megaBLAST) is recommended for more divergent sequences

Page 146: Introduction to Bioinformatics

Viewing alignments in ACT

Start ACT from the command line:

$ act &

Page 147: Introduction to Bioinformatics

Use the “File”, “Open...” menu item

Page 148: Introduction to Bioinformatics

Increase the Number of Comparisons

Use more files ...

Page 149: Introduction to Bioinformatics

Select chromosome sequences

Page 150: Introduction to Bioinformatics

Add BLAST/megaBLAST results

Page 151: Introduction to Bioinformatics

Zoom Out

Page 152: Introduction to Bioinformatics

Remove Weak Matches

Use filter sliders

Page 153: Introduction to Bioinformatics

MUMmer

• MUMmer is a suite of alignment programs and scripts
  • mummer, promer, nucmer, etc.

• Very different to BLAST (suffix tree alignment) - very fast

• Extremely flexible

• Used for genome comparisons, assemblies, scaffolding, repeat detection, etc.

• Forms the basis for other aligners/assemblers

Page 154: Introduction to Bioinformatics

Run a MUMmer Comparison

Create a new sub-directory for MUMmer output.

$ pwd
.../data/workshop/chromosomes
$ mkdir nucmer_out

Run nucmer to create chrA_NC_004547.delta

$ nucmer --prefix=nucmer_out/chrA_NC_004547 chrA.fasta NC_004547.fna

Then filter this file to generate a coordinate table for visualisation

$ delta-filter -q nucmer_out/chrA_NC_004547.delta > nucmer_out/chrA_NC_004547.filter
$ show-coords -rcl nucmer_out/chrA_NC_004547.filter > nucmer_out/chrA_NC_004547_filtered.coords

Page 155: Introduction to Bioinformatics

Run a MUMmer Comparison

MUMmer output is very different from BLAST output

$ head nucmer_out/chrA_NC_004547_filtered.coords
...

Page 156: Introduction to Bioinformatics

Run a MUMmer Comparison

Use a one-line shell command to convert to ACT-friendly format:

$ tail -n +6 nucmer_out/chrA_NC_004547_filtered.coords | awk '{print $7" "$10" "$1" "$2" "$12" "$4" "$5" "$13}' > chrA_mummer_NC_004547.crunch
$ head chrA_mummer_NC_004547.crunch
2526 82.49 15 2540 4727782 4985117 4982588 5064019
2944 82.29 2676 5619 4727782 4982544 4979600 5064019
85 95.29 11092 11176 4727782 758690 758774 5064019
1356 81.69 17446 18801 4727782 77639 78994 5064019
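
For anyone more comfortable in Python than awk, the same reordering can be sketched as below. The function name is an assumption; the field positions simply mirror the awk one-liner, which splits on whitespace so that the "|" separators in the show-coords table count as fields.

```python
def coords_to_crunch(lines, header_lines=5):
    """Mimic the tail/awk one-liner: skip the header, then reorder fields."""
    crunch = []
    for line in list(lines)[header_lines:]:
        fields = line.split()
        if len(fields) < 13:
            continue  # skip blank or truncated lines
        # awk fields are 1-based, so $7 is fields[6], $10 is fields[9], etc.
        crunch.append(" ".join([fields[6], fields[9], fields[0], fields[1],
                                fields[11], fields[3], fields[4], fields[12]]))
    return crunch
```

With a file handle opened on nucmer_out/chrA_NC_004547_filtered.coords, joining the returned list with newlines should give the same content as chrA_mummer_NC_004547.crunch.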

Page 157: Introduction to Bioinformatics

Select Files

Select your chromosome, and the megaBLAST/MUMmer output

Page 158: Introduction to Bioinformatics

View Basic Alignment

Page 159: Introduction to Bioinformatics

Filter Weak BLAST Matches

Page 160: Introduction to Bioinformatics

Genome Alignments

• Alignment result depends on algorithm, and parameter choices

• Some algorithms/parameter sets more sensitive than others

• Appropriate visualisation is essential

Much more detail at http://www.slideshare.net/leightonp/comparative-genomics-and-visualisation-part-1

Page 161: Introduction to Bioinformatics

Table of Contents

Introduction

Workshop Data

Gene Prediction

Genome Comparisons

Gene Comparisons

Conclusions

Page 162: Introduction to Bioinformatics

Reciprocal Best BLAST Hits (RBBH)

• To compare our genecall proteins to the NC_004547.faa reference set...

• BLAST reference proteins against our proteins

• BLAST our proteins against reference proteins

• Pairs with each other as best BLAST Hit are called RBBH

Page 163: Introduction to Bioinformatics

One-way BLAST vs RBBH

One-way BLAST includes many low-quality hits

Page 164: Introduction to Bioinformatics

One-way BLAST vs RBBH

Reciprocal best BLAST hits remove many low-quality matches

Page 165: Introduction to Bioinformatics

Reciprocal Best BLAST Hits (RBBH)

• Pairs with each other as best BLAST hit are called RBBH

• Should filter on percentage identity and alignment length

• RBBH pairs are candidate orthologues
  • (most orthologues will be RBBH, but the relationship is complicated)
  • Outperforms OrthoMCL, etc. (why and how is beyond the scope of this course...)
    http://dx.doi.org/10.1093/gbe/evs100
    http://dx.doi.org/10.1371/journal.pone.0018755

(We have a tool for this on our in-house Galaxy server)
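
The in-house Galaxy tool is not shown here, but the idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not that tool's implementation: both inputs are BLAST tabular (`-outfmt 6`) rows, the best hit is the one with the highest bit score (first seen wins ties), and the identity and length filters are applied before best hits are chosen. The function and parameter names are hypothetical.

```python
def best_hits(rows, min_identity=0.0, min_length=0):
    """Map each query to its highest-bitscore subject from tabular BLAST rows."""
    best = {}
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        identity, length, bitscore = float(fields[2]), int(fields[3]), float(fields[11])
        if identity < min_identity or length < min_length:
            continue  # drop weak matches, as the slide recommends
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_rows, reverse_rows, **filters):
    """Return (query, subject) pairs that are each other's best hit."""
    forward = best_hits(forward_rows, **filters)
    reverse = best_hits(reverse_rows, **filters)
    return sorted((q, s) for q, s in forward.items() if reverse.get(s) == q)
```

Here `forward_rows` would come from BLASTing our proteins against the reference proteins, and `reverse_rows` from the reference proteins against ours.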

Page 166: Introduction to Bioinformatics

Table of Contents

Introduction

Workshop Data

Gene Prediction

Genome Comparisons

Gene Comparisons

Conclusions

Page 167: Introduction to Bioinformatics

In Conclusion

• The tools you will need to use will be task-dependent, but some things are universal...
  • Good experimental design (including BLAST searches, etc.)
  • Keeping accurate records for reproduction/replication
  • Validation/sanity checking of results
  • Comparison and benchmarking of methods
  • (Cross-)validation of predictive methods

Remember: everything gets easier with practice, so practice lots!