Quality Control of Illumina Data Mick Watson Director of ARK-Genomics The Roslin Institute

Post on 01-Apr-2015


Page 1: Quality Control of Illumina Data Mick Watson Director of ARK-Genomics The Roslin Institute

Quality Control of Illumina Data

Mick Watson

Director of ARK-Genomics

The Roslin Institute

Page 2

QUALITY SCORES

Page 3

Quality scores

• The sequencer outputs a base call at each position of a read
• It also outputs a quality value at each position
  – This relates to the probability that the base call is incorrect
• The most common quality value is the Sanger Q score, or Phred score
  – Qsanger = -10 * log10(p)
  – Where p is the probability that the call is incorrect
  – If p = 0.05, there is a 5% chance, or 1 in 20 chance, it is incorrect
  – If p = 0.01, there is a 1% chance, or 1 in 100 chance, it is incorrect
  – If p = 0.001, there is a 0.1% chance, or 1 in 1000 chance, it is incorrect
• Using the equation:
  – p = 0.05, Qsanger = 13 (rounded from 13.01)
  – p = 0.01, Qsanger = 20
  – p = 0.001, Qsanger = 30

Page 4

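For the geeks….

• In R, you can investigate this:

# Convert an error probability p into a Sanger (Phred) Q score
sangerq <- function(x) {
  return(-10 * log10(x))
}
sangerq(0.05)
sangerq(0.01)
sangerq(0.001)

# Q score as a function of error probability
plot(seq(0,1,by=0.00001), sangerq(seq(0,1,by=0.00001)), type="l")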

Page 5

The plot

Page 6

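For the geeks….

• And the other way round….

# Convert a Sanger (Phred) Q score back into an error probability
qtop <- function(x) {
  return(10^(x/-10))
}
qtop(30)
qtop(20)
qtop(13)

# Error probability as a function of Q score
plot(seq(40,1,by=-1), qtop(seq(40,1,by=-1)), type="l")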

Page 7

The important stuff

• Q30 – 1 in 1000 chance base is incorrect
• Q20 – 1 in 100 chance base is incorrect

Page 8

QUALITY ENCODING

Page 9

Quality Encoding

• Bioinformaticians do not like to make your life easy!
• Q scores of 20, 30 etc. take two digits
• Bioinformaticians would prefer they only took one character
• In computers, every character has a corresponding ASCII code
• Therefore, to save space, we convert the Q score (two digits) to a single character using this scheme

Page 10

The process in full

• p (probability base is wrong) : 0.001
• Q (-10 * log10(p)) : 30
• Add 33 : 63
• Encode as character : ?

P      Q   Code
0.05   13  .
0.01   20  5
0.001  30  ?
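The steps above can be sketched in R as a single function. This is a minimal illustration, not code from the slides: the name p2code is hypothetical, and rounding Q to the nearest integer before adding 33 is an assumption (it is what turns 13.01 into 13).

```r
# Hypothetical helper (not in the original slides): probability -> Phred+33 character.
p2code <- function(p) {
  q <- round(-10 * log10(p))      # Q score, rounded to a whole number (assumption)
  intToUtf8(as.integer(q + 33))   # add 33, then look up the ASCII character
}

p2code(0.05)                      # the "." row of the table
p2code(0.01)                      # the "5" row of the table
p2code(0.001)                     # the "?" row of the table

# A vector of probabilities is collapsed into one quality string.
p2code(c(0.05, 0.01, 0.001))
```

Passing several probabilities at once gives one character per base, which is how a quality line in a FASTQ file is laid out.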

Page 11

For the geeks….

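# Decode a quality character into its Q score (subtract the offset of 33)
code2Q <- function(x) {
  return(utf8ToInt(x)-33)
}
code2Q(".")
code2Q("5")
code2Q("?")

# Decode a quality character straight into an error probability
code2P <- function(x) {
  return(10^((utf8ToInt(x)-33)/-10))
}
code2P(".")
code2P("5")
code2P("?")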
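The same decoding works on a whole quality string, because utf8ToInt returns one code per character. The sketch below assumes the +33 offset used above; the quality string is made up for illustration.

```r
# Made-up quality string for a 6-base read (illustration only).
qual <- "??55.."

q <- utf8ToInt(qual) - 33   # one Q score per base
p <- 10^(q / -10)           # per-base probability that the call is wrong

q
p
sum(p)                      # expected number of incorrect bases in this read
```

Summing the per-base probabilities gives the expected number of wrong calls in the read, which can be a handy single number when comparing reads.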

Page 12

QC OF ILLUMINA DATA

Page 13

FastQC

• FastQC is a free piece of software
• Written by Babraham Bioinformatics group
• http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• Available on Linux, Windows etc.
• Command-line or GUI

Page 14

Read the documentation

Follow the course notes

Page 15

Per base sequence quality

• One of the most important plots from FastQC
• Plots a box at each position
• The box shows the distribution of quality values at that position across all reads
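To see what such a plot is built from, here is a minimal sketch in base R. It is not how FastQC itself is implemented; the quality strings are made up, and it assumes Phred+33 encoding and reads of equal length.

```r
# Made-up quality strings for four 6-base reads (illustration only).
quals <- c("????5.", "???55.", "??555.", "?????5")

# Decode to Q scores: one row per read, one column per position in the read.
qmat <- t(sapply(quals, function(s) utf8ToInt(s) - 33, USE.NAMES = FALSE))

# Median quality at each position.
apply(qmat, 2, median)

# One box per position, similar in spirit to the FastQC plot.
boxplot(qmat, xlab = "Position in read", ylab = "Q score")
```

In this toy data the boxes drop towards the end of the read, the pattern to look for in real runs.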

Page 16

Obvious problems

Page 17

Less obvious problems

Page 18

Really bad problems

Page 19

Other useful plots

• Per base N content
  – May identify cycles that are unreliable
• Over-represented sequences
  – May identify Illumina adapters and primers
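Both ideas can be tried on a toy example in base R. This is a rough sketch under stated assumptions, not FastQC's own method: the reads are made up, they are all the same length, and "over-represented" here just means "the exact sequence that occurs most often".

```r
# Made-up reads, all the same length (illustration only).
reads <- c("ACGTNA", "ACGTNA", "TTGANC", "ACGTNA", "GGCATC")

# Split each read into single bases: one row per read, one column per cycle.
bases <- do.call(rbind, strsplit(reads, ""))

# Fraction of reads with an N at each cycle; a spike points at an unreliable cycle.
n_frac <- colMeans(bases == "N")
n_frac

# Count how often each exact sequence occurs, most frequent first.
seq_counts <- sort(table(reads), decreasing = TRUE)
seq_counts
```

Here cycle 5 stands out for its N content, and one sequence accounts for three of the five reads; in real data such a sequence would be worth checking against known adapters and primers.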