bionf/beng 203: functional genomicscseweb.ucsd.edu/classes/sp12/cse283-a/lecturenotes/...bionf/beng...


BIONF/BENG 203:

Functional Genomics

Trey Ideker and Vineet Bafna

TA: Martin Smith

Topic 1: Next generation sequencing


Bafna/Ideker Bix 3

The Dynamic nature of the cell

• The molecules in the body, such as RNA and proteins, are constantly turning over.

– New ones are ‘created’ through transcription and translation

– Proteins are modified post-translationally

– Active molecules interact with each other in functional networks

– ‘Old’ molecules are degraded

April 12


Classes of biological measurements

1) Molecular States

– DNA sequence / genotype: Next-gen sequencing, SNP & CNV arrays

– Gene expression: DNA microarrays, mRNA sequencing

– Protein levels, locations, mods: Mass spectrometry, fluorescence microscopy, protein arrays

2) Molecular Networks

– Protein-protein interactions: Two-hybrid system, coIP, protein antibody array

– Protein-DNA interactions: Chromatin IP (ChIP) sequencing

– Protein-compound

3) Phenotypic traits

– Physiological or disease state, binary or quantitative

– Growth rate, response to stimulus or stress

– Behaviors


Topics Covered By This Course

①Signal detection in bioinformatics

②Large-scale data generation platforms

③Understanding next-gen sequencing data

④Understanding mass spectrometry data

⑤Clustering and Classification

⑥Genotype-phenotype association

⑦Understanding physical & genetic networks

⑧Gene network inference and evolution


Grading

• 40% Problem Sets (best 4 of 5)

• 30% Midterm

• 30% Final Project


Dynamic aspects of cellular function

• A key part of functional genomics is to observe (changes in) molecular states via the abundance of functional molecules

• Expressed transcripts

– Microarray hybridization to ‘count’ the number of copies of RNA

– RNA-seq

• Expressed proteins

– Mass spectrometry is used to ‘count’ the number of copies of a protein sequence


Expression analysis

• Preprocessing functional genomics data: our goal is to create an expression matrix

– Rows are molecules (transcripts, proteins, peptides…)

– Columns are experiments (samples)

– Entries are normalized abundance values.

• Week 1: Creating the matrix for transcripts

• Week 3: creating the matrix for proteins

• Week 4-7: Analysis of expression data

[Figure: expression matrix; rows: transcripts/proteins, columns: samples/experiments]
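The expression matrix described above can be sketched in a few lines. This is only an illustration: the slides do not say which normalization is used, so counts-per-million is an assumption here, and the transcript names and counts are made up.

```python
def counts_per_million(raw_counts):
    """Scale each sample (column) so that its counts sum to one million.

    raw_counts maps a transcript name to a list of raw read counts,
    one entry per sample. Assumes every sample has a non-zero total.
    """
    transcripts = list(raw_counts)
    n_samples = len(raw_counts[transcripts[0]])
    # Total number of reads observed in each sample (column sums).
    totals = [sum(raw_counts[t][j] for t in transcripts) for j in range(n_samples)]
    return {
        t: [1e6 * raw_counts[t][j] / totals[j] for j in range(n_samples)]
        for t in transcripts
    }

# Hypothetical raw counts: rows are transcripts, columns are two samples.
raw = {"T1": [10, 40], "T2": [30, 40], "T3": [60, 120]}
matrix = counts_per_million(raw)
```

After normalization, T3 has the same value in both samples even though its raw count doubled, because the second sample was sequenced twice as deeply.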


Sequencing By Synthesis

(Illumina GenomeAnalyzer or HiSeq)


Bridge Amplification


Next Generation sequencing: mapping


NGS and expression

• Sample RNA, sequence via the RNA-seq protocol

• Map fragments to the genome

• Normalize and create an expression matrix

– Rows are transcripts

– Columns are samples/experiments.

• Use clustering/classification ideas

[Figure: transcripts T1, T2, T3, T4]

The computational challenge

• Input: m bp sequenced from the sample (usually as short reads), a database of length n.

• Output: the mapping coordinates for each of the short reads.


The computational challenge

• A single Illumina sequencing run can produce billions (~3B) of reads (length 100bp).

• Each read must be aligned to the reference human assembly (3Gb).


Alignment

• Recall that local alignment of two strings of size n,m requires ~nm steps.

• Typically: n = 3×10^9, m ≈ 3×10^12

[Figure: RNA-seq reads aligned against the human reference]

On the other hand…

• While alignment is prohibitively expensive, the situation changes if we match without any errors (indels/substitutions).

• Typically n+m steps are needed

• Two ideas:

– Preprocess and index the queries (sampled reads) in O(m) steps, then scan the database in O(n) steps for ALL queries.

– Preprocess and index the database in O(n) steps, then search each query in time proportional to its length.


Reconciling

• Can we use the ideas of exact matching to speed up approximate matching in practice?


The Pigeonhole principle

• True or False: No two persons in San Diego have exactly the same number of hairs.

• Not counting bald people


The Pigeonhole principle of Combinatorics

• If there are n pigeonholes and n+1 pigeons, then any assignment will place at least 2 pigeons in some pigeonhole


Observation 1

• Applying the pigeonhole principle

– Suppose we are looking for a database string with greater than 90% identity to the query (length 100).

– Partition the query into size 10 substrings.

– Greater than 90% identity means fewer than 10 errors, so at least one of the 10 substrings must match the database string exactly.
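The observation above can be turned into a small seed-and-verify sketch. It assumes substitutions only (no indels), so a seed found at some database position fixes the start of the whole query; the function names and the toy strings are illustrative, not taken from any mapping tool.

```python
def seed_candidates(query, db, max_mismatches):
    """Candidate start positions of query in db, found via exact seed hits.

    The query is cut into max_mismatches + 1 pieces; with at most
    max_mismatches substitutions, at least one piece is error-free.
    Assumes len(query) >= max_mismatches + 1.
    """
    pieces = max_mismatches + 1
    size = len(query) // pieces
    candidates = set()
    for p in range(pieces):
        offset = p * size
        # The last piece absorbs any leftover characters.
        seed = query[offset:] if p == pieces - 1 else query[offset:offset + size]
        hit = db.find(seed)
        while hit != -1:
            start = hit - offset
            if 0 <= start <= len(db) - len(query):
                candidates.add(start)
            hit = db.find(seed, hit + 1)
    return sorted(candidates)

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def map_read(query, db, max_mismatches):
    """Verify each candidate by a full comparison around the seed hit."""
    return [s for s in seed_candidates(query, db, max_mismatches)
            if hamming(query, db[s:s + len(query)]) <= max_mismatches]

# The query differs from db[3:9] = "acaacg" by one substitution.
hits = map_read("acatcg", "ttgacaacgtacgtt", 1)
```

Here `str.find` stands in for the fast exact-matching index; only the candidate positions are compared in full.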


Expected number of exact Matches is small!

• Idea: Do an exact match for keywords of a small size (k=20), and a full alignment only around the exact match.

– Pigeonhole principle suggests that the true mapping will not be missed.

– What about speed?

• Expected number of matches = mn × 0.25^k

– If n = 3×10^9, m = 2×10^11, k = 30

– Then, expected number of matches ≈ 520

• Number of computations: time for scanning + time for alignment

– Here, the time for alignment is minimal, and the total time is around 2×10^11
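The arithmetic above is easy to check numerically. The sketch below assumes uniformly random, independent bases (probability 0.25 per symbol), which is the model behind the 0.25^k factor; the function name is illustrative.

```python
def expected_random_hits(n, m, k):
    """Expected number of chance exact matches of k-mers.

    n: database length, m: total sequenced bases, k: keyword size.
    Assumes each base is uniform over {a, c, g, t} and independent.
    """
    return n * m * 0.25 ** k

n, m = 3e9, 2e11
for k in (16, 25, 30):
    print(k, expected_random_hits(n, m, k))
```

With these values, k = 30 gives roughly 520 chance hits, while k = 16 gives on the order of 10^11, which is why longer keywords make the alignment step cheap.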


String matching to speed up computations

• Idea: Find exact matches of query substrings in the database. Do an alignment only near those matches.

[Figure: RNA-seq reads matched against the human reference]

Fast mapping

• All mapping algorithms use the same basic strategy

1. Use exact matching to identify db locations where a query string might match

– Speedup by indexing databases (EX: bwa)

– Speedup by indexing query strings (EX: Blast)

– Some tools also do approximate matching

2. Use fast alignment techniques to quickly extend the alignments

– Alignment is only applied to the few locations where an exact match has occurred

– Engineering plays as important a role as algorithms


Exploiting NGS properties

• In most NGS technologies, the sequence has very high quality at the beginning of the read.

• In bwa for example, we allow for at most 2 errors in the first 32 bp.


Indexing sequence databases

• String hash tables

[Figure: hash table with one entry per possible k-mer (aaa, aat, aag, aac, …) over the database string a c a a c g of length m; e.g., aac → 3]

Pros:

• fast search time O(k) per query string

Cons:

• large memory requirement (4^k + m)

• 4^30 ≈ 10^18
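A minimal sketch of a k-mer hash index on the slide's toy string follows. It uses a Python dictionary that stores only the k-mers actually present, which is an assumption that differs from a direct-addressed table with all 4^k entries; positions are 0-based here, so the slide's entry aac → 3 appears as 2.

```python
from collections import defaultdict

def build_kmer_index(db, k):
    """Map every k-mer occurring in db to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(db) - k + 1):
        index[db[i:i + k]].append(i)
    return index

index = build_kmer_index("acaacg", 3)
print(dict(index))
```

Looking up a query k-mer costs one hash computation over its k symbols, which is the O(k) search time listed under Pros.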


Suffix trees

• Trie of all suffixes

• Pros:

– Low theoretical memory requirement O(m), as redundancies get merged into the trie.

– Fast search (linear in size of query)

• Cons:

– Memory requirement large in practice. The constant in O(m) is ~60

– Suffix arrays might help a bit.

[Figure: suffix tree of a c a a c g $, built from the suffixes acaacg$, caacg$, aacg$, acg$, cg$, g$, $]

The BW transform

• Clever, memory efficient index

• Add a special symbol

• Rotate m times to get m strings

a c a a c g $
$ a c a a c g
g $ a c a a c
c g $ a c a a
a c g $ a c a
a a c g $ a c
c a a c g $ a


The BW transform

• Rotate m times to get m strings

• Sort lexicographically

• Use

– last column,

– first column

– positions

[Figure: the rotations of a c a a c g $ and the Occ table over the last column of the sorted rotations]

Occ

Row  a  c  g
0    0  0  1
1    0  1  1
2    0  1  1
3    1  1  1
4    2  1  1
5    3  1  1
6    3  2  1
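The construction above (rotate, sort, read off the last column) can be sketched directly. This is the quadratic-memory textbook construction, shown only to make the steps concrete; it assumes '$' sorts before the letters, which holds for ASCII, and the function names are illustrative.

```python
def bwt_matrix(text):
    """All rotations of text, sorted lexicographically."""
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    return sorted(rotations)

def bwt(text):
    """Last column of the sorted rotation matrix."""
    return "".join(row[-1] for row in bwt_matrix(text))

def occ_table(b, alphabet):
    """Row i holds the number of occurrences of each symbol in b[0..i]."""
    table, running = [], {ch: 0 for ch in alphabet}
    for ch in b:
        if ch in running:
            running[ch] += 1
        table.append(tuple(running[a] for a in alphabet))
    return table

for row in bwt_matrix("acaacg$"):
    print(row)
print(bwt("acaacg$"))
```

For the slide's string the last column is `gc$aaac`, and `occ_table` over the symbols a, c, g reproduces the Occ table shown above.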


Lexicographic sort

• Lexicographic: Dictionary Sort

• Note the connection to the suffix tree

[Figure: the rotations of a c a a c g $ shown next to the suffix tree]

Maintaining the First column

• For every symbol σ, define Next(σ), and Prev(σ)

– Next(a)=c

– Prev(a)=$

• Note that the first column is sorted

– All we need to do is to keep the index of the first occurrence of each symbol

– Pos(a) = 1

• Memory used?

– O(|Σ|) (alphabet size)


The BW transform: (index on the last column)

• B[i]: last symbol in the i-th string (in sorted order)

– B[3]=a

• Space requirement

– 2n bits (assuming 2 bits per symbol)


The BWT properties I

• In each row, the first character is preceded by the last character in the actual string.


BWT property 2: LF property

• The i-th occurrence of a in the last column and the i-th occurrence of a in the first column correspond to the same character of the original string


LF property proof

• Number each symbol by its occurrence in the string.

• Lemma: σi precedes σj in the first column iff σi precedes σj in the last column.

• Proof:

– Suppose σi precedes σj in the first column

– Let σi x and σj y denote the corresponding suffixes.

– As the first symbol is the same, x < y, and so Loc(x) < Loc(y)

– Then B[Loc(x)] appears before B[Loc(y)] in the last column

– However, B[Loc(x)] = σi and B[Loc(y)] = σj
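One way to see the LF property at work is to use it to recover the original string from the last column alone. The sketch below is a standard inversion routine written for this note, not code from the slides; it assumes the string ends in a unique '$' that sorts before all other symbols.

```python
def inverse_bwt(b):
    """Recover the original text from its BW transform b using the LF property."""
    # rank[i]: how many times b[i] has already appeared in b[0..i-1].
    seen, rank = {}, []
    for ch in b:
        rank.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    # pos[ch]: row of the first occurrence of ch in the (sorted) first column.
    pos, total = {}, 0
    for ch in sorted(seen):
        pos[ch] = total
        total += seen[ch]
    # Row 0 starts with '$'; its last symbol precedes '$' in the text.
    row, out = 0, []
    for _ in range(len(b) - 1):
        ch = b[row]
        out.append(ch)
        # LF step: the k-th ch in the last column is the k-th ch in the first.
        row = pos[ch] + rank[row]
    return "".join(reversed(out)) + "$"

print(inverse_bwt("gc$aaac"))
```

Each step jumps from a symbol in the last column to the same symbol in the first column, so the text is read off from right to left.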

[Figure: the rotations of a c a a c g $, with symbols numbered by occurrence: a1 c1 a2 a3 c2 g1]

Using the BW transform to query

• Given word q, does q exist in the database string D (represented by the BWT transform)?

• Note that all occurrences of q are next to each other in the sorted rotations.

• We only need to find the range (F,L) of positions.

• We proceed recursively

[Figure: the rows beginning with q form a contiguous range from F to L]

Matching single symbol strings

• If (|q|=1) Then

– F = Pos(q)

– L = Pos(Next(q))-1


The general case

• Let q = σw

• Recursively,

– let F = first position of w

– L = last position of w

– Can we find the first and last positions of σw?

[Figure: the rows beginning with w span positions F to L]

The general case

• Consider the occurrences of σ in the first column.

• Clearly, σw is within the range, but we cannot tell where.

• If we knew two values, we’d be done

– o1 (number of occurrences of σ before σw), and

– o2 (number of occurrences of σw)

[Figure: first column with the σ block starting at Pos(σ); o1 rows precede the o2 rows that begin with σw]

The general case

• Next, consider the occurrences of σ in the BW Transform (last column)

• Claim:

– o2 is the number of strings that start with w and end in σ (Proof: BW property 1)

– o1 is the number of occurrences of σ in the last column before the first occurrence of w (Proof: BW property 2)

[Figure: occurrences of σ in the last column (o1 before row F, o2 within rows F to L) matched to the σ block of the first column]

The BW transform: the final data structure

• Pos

• BW transform B

• Occ(σ,i): number of occurrences of σ in B[0]….B[i]

• Space = 2n+|Σ|n log n bits (2n+4n log n bits for DNA)

• Now

– o1=Occ(σ,F-1)

– o2=Occ(σ,L)-o1

Exact matching of string

GetRange(σw) //Ex: σw=ac

(F,L)=GetRange(w) //constant time F=4, L=5

o1=Occ[σ,F-1]

p1=Occ[σ,L]

//here o1=1, p1=3

F1=Pos[σ]+o1

L1=Pos[σ]+p1-1

return(F1,L1)
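The pseudocode above can be made runnable as follows. This is a minimal sketch rather than how any production mapper stores its index: Occ is kept as full per-row lists, the index is built by sorting suffixes directly, and the names `build_index` and `get_range` are illustrative. It assumes the text ends in a unique '$' that sorts before the other symbols.

```python
def build_index(text):
    """Return (B, Pos, Occ) for a text that ends in a unique '$'."""
    # With a unique, smallest '$', sorting suffixes equals sorting rotations.
    order = sorted(range(len(text)), key=lambda i: text[i:])
    first = [text[i] for i in order]
    b = "".join(text[i - 1] for i in order)  # symbol preceding each suffix
    pos = {}
    for row, ch in enumerate(first):
        pos.setdefault(ch, row)  # first row starting with ch
    occ = {ch: [] for ch in pos}
    running = {ch: 0 for ch in pos}
    for ch in b:
        running[ch] += 1
        for sym in occ:
            occ[sym].append(running[sym])  # occurrences of sym in b[0..i]
    return b, pos, occ

def get_range(q, b, pos, occ):
    """Rows (F, L), inclusive, whose rotations start with q; None if absent."""
    first_row, last_row = 0, len(b) - 1
    for sigma in reversed(q):  # extend the match one symbol to the left
        if sigma not in pos:
            return None
        o1 = occ[sigma][first_row - 1] if first_row > 0 else 0
        p1 = occ[sigma][last_row]
        if p1 == o1:  # no row in the range is preceded by sigma
            return None
        first_row = pos[sigma] + o1
        last_row = pos[sigma] + p1 - 1
    return first_row, last_row

b, pos, occ = build_index("acaacg$")
print(get_range("ac", b, pos, occ))
```

For the query ac this reproduces the slide's numbers: the range for c is (4, 5), o1 = 1, p1 = 3, and the returned range is (2, 3). The loop replaces the recursion by processing the query from its last symbol to its first.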

Rotations of a c a a c g $, each prefixed with its rank in sorted order:

2. a c a a c g $
0. $ a c a a c g
6. g $ a c a a c
5. c g $ a c a a
3. a c g $ a c a
1. a a c g $ a c
4. c a a c g $ a

Occ

Row  a  c  g
0    0  0  1
1    0  1  1
2    0  1  1
3    1  1  1
4    2  1  1
5    3  1  1
6    3  2  1

The BW transform

• With some tricks, the BW transform becomes a memory efficient data structure to query for exact matches.

• It has many other properties


The Pigeonhole principle revisited

• Suppose we are looking to match a 96bp string with up to 5 errors

• How would we use exact matching so as to guarantee sensitivity?

– Break up the string into 6 pieces of size 16bp. Sensitivity is 100%.


The Pigeonhole principle revisited

• Break up the string into 6 pieces of size 16bp. Sensitivity is 100%.

• What about speed?

– Number of random hits?

• 3×10^9 × 2×10^11 × 4^-16 = 1.4×10^11

– If we did an alignment (10^4 steps) around each hit, the total computation is 1.4×10^15 steps.

– If we could only choose larger words, we could gain in speed. For 25-mers, number of hits is very small: ~530K hits only

– To maintain speed-sensitivity tradeoffs, should we try and look for approximate matches?


Approximate matching

• Consider query aag. Find all matches with at most one mismatch

• Consider the suffix tree first:

[Figure: the rotations of a c a a c g $ and the corresponding suffix tree]

Approximate matching

• We do a BFS, maintaining errors seen in reaching a node.

• Worst case time is exponential (4^w)
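The BFS over the tree can be sketched on the slide's example (query aag, at most one mismatch). To keep the code short, this sketch uses an uncompressed suffix trie, which takes quadratic memory and is only suitable for toy inputs, and it counts substitutions only; the data layout and function names are assumptions made for illustration.

```python
from collections import deque

def build_suffix_trie(text):
    """Uncompressed trie of all suffixes; each node records the start
    positions of the suffixes passing through it."""
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start)
    return root

def approx_search(root, query, max_mismatches):
    """BFS over the trie, carrying the number of mismatches seen so far.
    Returns sorted (start position, mismatches) pairs."""
    hits = []
    queue = deque([(root, 0, 0)])  # (node, symbols matched, errors)
    while queue:
        node, depth, errors = queue.popleft()
        if depth == len(query):
            hits.extend((s, errors) for s in node["starts"])
            continue
        for ch, child in node["children"].items():
            if ch == "$":  # the terminator never matches a query symbol
                continue
            e = errors + (ch != query[depth])
            if e <= max_mismatches:  # abandon branches over budget
                queue.append((child, depth + 1, e))
    return sorted(hits)

trie = build_suffix_trie("acaacg$")
print(approx_search(trie, "aag", 1))
```

On the toy string the search reports aac (position 2) and acg (position 3), each with one mismatch; positions are 0-based.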

[Figure: BFS over the suffix tree of a c a a c g $ for the query aag; nodes are annotated with the number of errors (1) seen in reaching them]

Speedup 1: Branch and bound

• Goal is to match q with <= z errors

• Pre-compute D[i]: minimum number of errors needed to match the i-suffix of the query

• Suppose in the context of BFS, we reach a node with e errors, after matching the first i-1 symbols.

• If (e+D[i]>z), no match is possible and we can stop.

• This pruning reduces search space considerably in practice.
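A possible way to pre-compute such a bound is sketched below, for substitutions only. It is not BWA's actual routine: here D[i] is obtained by greedily cutting the remaining query into pieces that do not occur anywhere in the text, each of which must contain at least one error, and substring tests use Python's `in` instead of an index. Indices are 0-based, so D[i] refers to the part of the query that remains after i symbols have been matched.

```python
def lower_bounds(query, text):
    """D[i]: lower bound on the errors needed to match query[i:] in text."""
    n = len(query)
    d = [0] * (n + 1)
    for i in range(n):
        errors, piece_start = 0, i
        for k in range(i, n):
            # A piece absent from the text must contain at least one error.
            if query[piece_start:k + 1] not in text:
                errors += 1
                piece_start = k + 1  # pieces are disjoint
        d[i] = errors
    return d

def can_prune(errors_so_far, i, d, z):
    """True if no match with at most z errors can extend this search node."""
    return errors_so_far + d[i] > z

d = lower_bounds("ggt", "acaacg")
print(d, can_prune(0, 0, d, 1))
```

For the query ggt, D[0] = 2, so with z = 1 the search can stop at the root before exploring any branch.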

[Figure: query split at position i, with e errors in the part already matched]

Speedup 2

• BWA requires that in the prefix (high quality region) we have a tighter match.

• In the first 32 bp (of 70bp queries), at most 2 errors allowed.

• Other BWA heuristics:

– Score appropriately for indels versus mismatches.


Fast search for exact matching

• There are two strategies.

• Build an index on the reference

– O(n) preprocessing, O(m) search + time to create alignments

– Ex: suffix trees, suffix arrays, Burrows-Wheeler transforms

• Automaton on queries; search genome with those queries.

– O(m) preprocessing time, O(n) search time + time to create alignments.

– Ex: Aho-Corasick tries


Dictionary Matching

• Q: Given k words (s_i has length l_i), and a database of size n, find all matches to these words in the database string.

• How fast can this be done?

Dictionary: 1:POTATO 2:POTASSIUM 3:TASTE

Database: P O T A S T P O T A T O


Dict. Matching & string matching

• How fast can you do it, if you only had one word of length m?

– Trivial algorithm O(nm) time

– Pre-processing O(m), Search O(n) time.

• Dictionary matching

– Trivial algorithm: (l1+l2+l3+…)·n

– Using a keyword tree: lp·n (lp is the length of the longest pattern)

– Aho-Corasick: O(n) after preprocessing O(l1+l2+…)

• We will consider the most general case


Direct Algorithm

[Figure: the pattern P O T A T O slid one position at a time along the database P O T A S T P O T A T O]

Observations:

• When we mismatch, we (should) know something about where the next match will be.

• When there is a mismatch, we (should) know something about other patterns in the dictionary as well.


The Trie Automaton

• Construct an automaton A from the dictionary

– A[v,x] describes the transition from node v to a node w upon reading x.

– A[u,’T’] = v, and A[u,’S’] = w

– Special root node r

– Some nodes are terminal, and labeled with the index of the dictionary word.

Dictionary: 1:POTATO 2:POTASSIUM 3:TASTE

[Figure: trie of the dictionary with root r, nodes u, v, w, and terminal nodes labeled 1, 2, 3]

An O(lp·n) algorithm for keyword matching

• Start with the first position in the db, and the root node.

• If successful transition

– Increment current pointer

– Move to a new node

– If terminal node “success”

• Else

– Retract ‘current’ pointer

– Increment ‘start’ pointer

– Move to root & repeat
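The steps above can be sketched with a nested-dictionary trie. The representation (a reserved "#" key holding the word index at terminal nodes) is an assumption made for brevity and requires that "#" never appears in the words or the database; start positions are 0-based.

```python
def build_trie(words):
    """Keyword tree as nested dicts; the key '#' marks a terminal node."""
    root = {}
    for idx, word in enumerate(words, 1):
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["#"] = idx  # label with the index of the dictionary word
    return root

def naive_dictionary_match(words, db):
    """Restart at the root for every start position: O(lp * n) time."""
    root = build_trie(words)
    hits = []
    for start in range(len(db)):  # 'start' pointer
        node, current = root, start  # 'current' pointer retracts to start
        while current < len(db) and db[current] in node:
            node = node[db[current]]
            current += 1
            if "#" in node:
                hits.append((start, node["#"]))
    return hits

words = ["POTATO", "POTASSIUM", "TASTE"]
print(naive_dictionary_match(words, "POTASTPOTATO"))
```

Each start position walks at most lp symbols down the trie, which gives the lp·n bound; the retraction of the current pointer is the cost that the failure function removes.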

[Illustration: the trie is run against the database P O T A S T P O T A T O, with pointers l and c into the database and current node v]

Idea for improving the time

P O T A S T P O T A T O

• Suppose we have partially matched pattern i (indicated by l and c), but fail subsequently. If some other pattern j is to match

– Then prefix(pattern j) = suffix(first c-l characters of pattern i)

Dictionary: 1:POTATO 2:POTASSIUM 3:TASTE

[Figure: pattern i = P O T A S S I U M aligned with pattern j = T A S T E]

Failure function

• Every node v corresponds to a string sv that is a prefix of some pattern.

• Define F[v] to be the node u such that su is the longest proper suffix of sv

• If we fail to match at v, we should jump to F[v], and commence matching from there

• Let lp[v] = |su|

[Figure: trie of the dictionary with nodes labeled n1 to n10]
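A compact sketch of the automaton with its failure function follows. It is a generic Aho-Corasick implementation written for this note, with nodes stored as integers and the arrays `goto`, `fail`, and `out` named by convention rather than taken from the slides. Instead of tracking the pointer l explicitly, it reports the 0-based start of each match from the current position c and the word length.

```python
from collections import deque

def build_automaton(words):
    """Trie (goto), failure links (fail), and words ending at each node (out)."""
    goto, fail, out = [{}], [0], [[]]
    for idx, word in enumerate(words, 1):
        v = 0
        for ch in word:
            if ch not in goto[v]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[v][ch] = len(goto) - 1
            v = goto[v][ch]
        out[v].append((idx, len(word)))
    # Breadth-first: F[w] is computed from the failure link of w's parent.
    queue = deque(goto[0].values())  # depth-1 nodes fail to the root
    while queue:
        v = queue.popleft()
        for ch, w in goto[v].items():
            f = fail[v]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[w] = goto[f].get(ch, 0)
            out[w] = out[w] + out[fail[w]]  # inherit words that end here too
            queue.append(w)
    return goto, fail, out

def search(words, db):
    """Scan db once; return (start position, word index) for every match."""
    goto, fail, out = build_automaton(words)
    v, hits = 0, []
    for c, ch in enumerate(db):
        while v and ch not in goto[v]:
            v = fail[v]  # jump to F[v] instead of retracting c
        v = goto[v].get(ch, 0)
        for idx, length in out[v]:
            hits.append((c - length + 1, idx))
    return hits

print(search(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO"))
```

On the slide's database, the scan follows POTAS, fails over to the node for TAS, continues to TAST, and eventually reports POTATO at position 6 without ever moving the current pointer backwards.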

Illustration

[Figure: trie of the dictionary with nodes labeled n1 to n10]

• What is F(n10)?

• What is F(n5)?

• F(n3)?

• Lp(n10)?

Illustration: scanning the database P O T A S T P O T A T O with the trie; values of the pointers l and c on successive slides

l = 1, c = 1
l = 1, c = 2
l = 1, c = 6
l = 3, c = 6
l = 3, c = 7
l = 7, c = 7
l = 7, c = 8
l = 7, c = 12

Time analysis

• In each step, either c is incremented, or l is incremented

• Neither pointer is ever decremented: following a failure link from v moves l forward, because lp[v] < c-l.

• l and c do not exceed n

• Total time <= 2n steps for a text of length n

[Figure: the text P O T A S T P O T A T O with the pointers l and c marked]
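The scan analyzed above can be sketched in Python. This is a minimal sketch under stated assumptions, not the lecture's own code: the function names build_trie and scan, the list-of-dicts node layout, and the three keywords in the usage line are all hypothetical. Positions are 0-based here, c is the text position being read, and l is not stored but can be recovered as c - depth[v].

```python
from collections import deque

def build_trie(patterns):
    """Build a keyword trie with failure links F and per-node output lists."""
    children = [{}]   # children[v]: dict mapping a character to a child node
    depth = [0]       # depth[v]: length of the string spelled from root to v
    out = [[]]        # out[v]: patterns that end at v
    for p in patterns:
        v = 0
        for ch in p:
            if ch not in children[v]:
                children.append({})
                depth.append(depth[v] + 1)
                out.append([])
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        out[v].append(p)
    # Breadth-first pass: F[w] spells the longest proper suffix of w's
    # string that is also a prefix of some pattern.
    F = [0] * len(children)
    queue = deque(children[0].values())
    while queue:
        v = queue.popleft()
        for ch, w in children[v].items():
            u = F[v]
            while u and ch not in children[u]:
                u = F[u]
            F[w] = children[u].get(ch, 0)
            out[w] = out[w] + out[F[w]]   # inherit matches ending at F[w]
            queue.append(w)
    return children, depth, F, out

def scan(text, patterns):
    """Return ([(start, pattern), ...], number_of_steps)."""
    children, depth, F, out = build_trie(patterns)
    hits, v, c, steps = [], 0, 0, 0
    while c < len(text):
        steps += 1
        ch = text[c]
        if ch in children[v]:
            v = children[v][ch]   # match: c is incremented
            c += 1
            hits += [(c - len(p), p) for p in out[v]]
        elif v:
            v = F[v]              # failure link: l = c - depth[v] increases
        else:
            c += 1                # mismatch at the root: l and c both advance
    return hits, steps

# Hypothetical keywords, chosen only to exercise the failure links.
hits, steps = scan("POTASTPOTATO", ["POTATO", "TAT", "AST"])
print(sorted(hits), steps)
```

Every iteration of the loop either advances c or follows a failure link, which shortens the matched prefix and so advances l. That is why the step counter stays at or below twice the text length.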


Reviewing

• Steps for mapping of reads:

1. Build an index on the reads (e.g. an Aho-Corasick trie), or on the database (e.g. the Burrows-Wheeler (BW) transform).

2. Search for exact matches to k-mers (k ≈ 25).

3. When an exact match is found, extend using a Smith-Waterman alignment

4. Report matches with good scores.
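The four steps above can be sketched as a toy seed-and-extend mapper. This is an illustrative sketch only: a plain dictionary of k-mers stands in for a real database index such as a BW transform, the function names are hypothetical, and the scoring values (match +1, mismatch -1, gap -2), the 2 bp padding of the extension window, and the tiny k used in the usage lines are assumptions made for the example.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Step 1: map every k-mer of the database to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a against b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

def map_read(read, reference, index, k, min_score):
    """Steps 2-4: seed with exact k-mer hits, extend, report good scores."""
    seen, results = set(), []
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = max(0, pos - offset)   # implied start of the read
            if start in seen:
                continue                   # candidate already scored
            seen.add(start)
            window = reference[max(0, start - 2):start + len(read) + 2]
            score = smith_waterman(read, window)
            if score >= min_score:
                results.append((start, score))
    return sorted(results)

# Toy data: the read is a copy of reference[8:19] with one substitution.
reference = "ACGTACGTTAGCCGATAGGCTA"
index = build_kmer_index(reference, 4)
print(map_read("TAGCCGTTAGG", reference, index, k=4, min_score=8))
```

Several seeds of the same read usually point at the same candidate location, so the sketch scores each implied start position only once.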


Using NGS for measuring RNA expression

• The mapping gives us raw counts at a locus

• What is an appropriate measure?

– We must normalize for the number of reads sequenced, as well as the length of the transcript

– FPKM: fragments per kilobase (1000 bp) of transcript per million mapped reads

– What about bias in mapping location?
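As a small worked example of the FPKM normalization, the sketch below divides a raw count by the transcript length in kilobases and by the number of mapped fragments in millions. The function name and the numbers are hypothetical.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    per_kb = fragment_count / (transcript_length_bp / 1_000)
    return per_kb / (total_mapped_fragments / 1_000_000)

# Hypothetical numbers: 500 fragments on a 2,000 bp transcript,
# in an experiment with 10 million mapped fragments.
print(fpkm(500, 2_000, 10_000_000))   # 25.0
```

Doubling either the transcript length or the sequencing depth at a fixed raw count halves the FPKM, which is the point of the two normalizations.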


• There is a huge variation in true abundance values.

• The top 10% of the expressed genes contribute 60% of the reads.

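• Small changes in high abundance genes lead to large changes in the expression values of low abundance genes, because every value is divided by the total number of reads

• Perhaps better to normalize by the 75th percentile of the counts instead of the total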
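The effect can be seen on a toy example. In the sketch below, two hypothetical samples differ only in one highly abundant gene. The function names are made up for illustration, and taking the percentile over nonzero counts with linear interpolation is an assumption, not something the slide specifies.

```python
def percentile(values, q):
    """q-th percentile by linear interpolation between sorted values."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def normalize(counts, by="total"):
    """Scale raw counts by the total, or by the 75th percentile of nonzero counts."""
    if by == "total":
        scale = sum(counts)
    else:
        scale = percentile([c for c in counts if c > 0], 75)
    return [c / scale for c in counts]

# Hypothetical counts: only the most abundant gene changes between samples.
sample_a = [10, 20, 30, 40, 10_000]
sample_b = [10, 20, 30, 40, 20_000]

print(normalize(sample_a, "total")[0], normalize(sample_b, "total")[0])
print(normalize(sample_a, "uq")[0], normalize(sample_b, "uq")[0])
```

With total-count scaling the first gene appears to drop by about half between the samples, although its raw count is unchanged. With 75th-percentile scaling its normalized value is identical in both.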


Multiple isoforms and bias

• Assume that each of the exons is 500bp

• We have 2 transcripts of length 1K each

– Expression(t1) ≅ 1+1 = 2 FPKM

– Expression(t2) ≅ 4+3 = 7 FPKM

• What if sequence-specific bias suggests that exon 1 is sampled three times as often as exon 2?

• Bias in length distribution can also help when the reads are mapped in a paired-end fashion


Modeling for bias correction

• Use empirical fragment length distribution

• The parameter to be estimated is the expression value ρt for all transcripts t.

• G: set of loci

• βg = relative abundance of locus g

• γt = relative abundance of t within its locus (multiple spliced isoforms exist at any locus)

• ρt = βg · γt

• F: set of fragments

• Xg: set of fragments mapping to locus g.


A generative model for a fragment

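• Consider a fragment f that maps to position (i,j) of transcript t, at locus g

[Figure: diagram of the generative model with nodes g, t, l, (i,j) and f, annotated with the parameters βg, γt, D and b]

\[ \Pr(f \mid t, \beta, \gamma) = \beta_g \, \gamma_t \, D(l) \, b_t(i,j) \]

\[ \Pr(f \mid \beta, \gamma) = \beta_g \sum_{t \in g} \gamma_t \, D(l) \, b_t(i,j) \]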


Likelihood

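• Assuming that fragments are generated independently

[Figure: diagram of the generative model with nodes g, t, l, (i,j) and f, annotated with the parameters βg, γt, D and b]

\[ \Pr(F \mid \beta, \gamma) = \Bigl( \prod_{g \in G} \beta_g^{|X_g|} \Bigr) \Bigl( \prod_{g \in G} \prod_{f \in X_g} \sum_{t \in g} \gamma_t \, D(l) \, b_t(i,j) \Bigr) \]

where l and (i,j) are the length and position of fragment f on transcript t.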
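To make the likelihood concrete, the sketch below evaluates its logarithm on toy inputs. Everything about the data layout is an assumption for illustration: fragments are given as (locus, {transcript: (i, j)}) pairs, the fragment length is taken to be l = j - i + 1, and D and the bias b are passed in as plain Python functions.

```python
import math

def log_likelihood(fragments, beta, gamma, D, bias):
    """log Pr(F | beta, gamma) for independently generated fragments.

    fragments: list of (g, alignments); alignments maps each transcript t
    of locus g that is compatible with the fragment to its position (i, j).
    """
    total = 0.0
    for g, alignments in fragments:
        # Sum over the transcripts of the locus that could have produced f.
        inner = sum(gamma[t] * D(j - i + 1) * bias(t, i, j)
                    for t, (i, j) in alignments.items())
        total += math.log(beta[g]) + math.log(inner)
    return total

# Hypothetical toy model: two loci, three transcripts, flat D and flat bias.
beta = {"g1": 0.25, "g2": 0.75}
gamma = {"t1": 0.5, "t2": 0.5, "t3": 1.0}
fragments = [
    ("g1", {"t1": (1, 2), "t2": (3, 4)}),   # ambiguous between t1 and t2
    ("g2", {"t3": (5, 6)}),
]
ll = log_likelihood(fragments, beta, gamma,
                    D=lambda l: 0.5, bias=lambda t, i, j: 1.0)
print(ll)
```

Working in log space turns the products over loci and fragments into sums, which avoids numerical underflow when the number of fragments is large.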


Estimating Bias

• Bias towards a location is computed as a function of the sequence, and the position relative to the 3’ or 5’ end of the transcript.


Estimating parameters

• We have to estimate ρt = βg γt, as well as D and bt, for all genes (loci) g and all transcripts t.

• The length distribution D can be measured empirically.

• Note that we cannot estimate bias unless we have a gold standard where we know ρ, and we cannot estimate ρ unless we know the bias.

• Roberts et al. use 2-step iteration to get an ML estimate

– Use uniform bias to get an initial estimate of ρ

– Use initial ρ to estimate bias

– Reestimate ρ
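The alternation between ρ and bias can be illustrated with a deliberately simplified toy. This is not the estimator of Roberts et al.: it assumes that each transcript position belongs to one of a few hypothetical bias classes, that the expected count at a position is ρt times the bias of its class, and that simple ratio estimates are good enough. All names and numbers are made up.

```python
def estimate_rho(counts, classes, bias):
    """rho_t = observed count on t / bias-weighted number of positions of t."""
    return [sum(c) / sum(bias[k] for k in cls)
            for c, cls in zip(counts, classes)]

def estimate_bias(counts, classes, rho, n_classes):
    """bias[k] = observed count in class k / count expected from rho alone."""
    observed = [0.0] * n_classes
    expected = [0.0] * n_classes
    for c, cls, r in zip(counts, classes, rho):
        for x, k in zip(c, cls):
            observed[k] += x
            expected[k] += r
    return [o / e for o, e in zip(observed, expected)]

# Hypothetical data generated with rho = (10, 20) and class biases (1, 3):
# per-position counts for two transcripts and the bias class of each position.
counts = [[10, 10, 30, 30], [20, 60, 60, 60]]
classes = [[0, 0, 1, 1], [0, 1, 1, 1]]

rho0 = estimate_rho(counts, classes, bias=[1.0, 1.0])      # uniform bias
bias = estimate_bias(counts, classes, rho0, n_classes=2)   # bias from rho
rho1 = estimate_rho(counts, classes, bias)                 # re-estimate rho

print(rho0[1] / rho0[0], rho1[1] / rho1[0])
```

Only relative abundances are identifiable in this toy, so the comparison is on the ratio of the two estimates. The uniform-bias ratio is 2.5, and after one round of bias estimation it moves close to the value of 2 used to generate the data.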


Correlation between quantitative PCR and RNA-seq

Roberts, Genome Biol. 2011


Correcting across platforms


Conclusion

• The processing of mapped RNA reads allows us to generate a column of the transcript abundance matrix

[Figure: transcript abundance matrix, with one axis labeled "transcript"]