Single nucleotide polymorphisms and applications

Usman Roshan

BNFO 601

• DNA sequence variations that occur when a single nucleotide is altered.

• Must be present in at least 1% of the population to be a SNP.

• Occur every 100 to 300 bases along the 3 billion-base human genome.

• Many have no effect on cell function but some could affect disease risk and drug response.

Toy example

[Figure: SNPs on the chromosome]

Bi-allelic SNPs

• Most SNPs have one of two nucleotides at a given position

• For example:

– A/G denotes that the varying nucleotide is either A or G. We call each of these an allele

– Most SNPs have two alleles (bi-allelic)

SNP genotype

• We inherit two copies of each chromosome (one from each parent)

• For a given SNP the genotype defines the type of alleles we carry

• Example: for the SNP A/G one’s genotype may be

– AA if both copies of the chromosome have A

– GG if both copies of the chromosome have G

– AG or GA if one copy has A and the other has G

– The first two cases are called homozygous and the latter two are heterozygous
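The genotype cases above can be sketched in a few lines of Python. This is only an illustration: the function names `zygosity` and `count_alleles` and the sample genotypes are hypothetical and do not come from the slides.

```python
from collections import Counter

def zygosity(genotype):
    """Classify a two-letter genotype such as 'AG' as homozygous or heterozygous."""
    if len(genotype) != 2:
        raise ValueError("a genotype must consist of exactly two alleles")
    first, second = genotype.upper()
    return "homozygous" if first == second else "heterozygous"

def count_alleles(genotypes):
    """Count how often each allele occurs across a list of genotypes."""
    counts = Counter()
    for genotype in genotypes:
        counts.update(genotype.upper())
    return counts

# Hypothetical sample of four subjects genotyped at one A/G SNP.
sample = ["AA", "AG", "GA", "GG"]
labels = [zygosity(g) for g in sample]
totals = count_alleles(sample)
print(labels)
print(dict(totals))
```

Each subject contributes two alleles to the totals, which is the counting used later when building a case-control table.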

SNP genotyping

Real SNPs

• SNP consortium: snp.cshl.org

• SNPedia: www.snpedia.com

Application of SNPs: association with disease

• Experimental design to detect cancer-associated SNPs:

– Pick random humans with and without cancer (say breast cancer)

– Perform SNP genotyping

– Look for associated SNPs

– Also called a genome-wide association study

Case-control example

• Study of 100 people:

– Case: 50 subjects with cancer

– Control: 50 subjects without cancer

• Count the number of alleles and form a contingency table (each subject carries two alleles, so each row sums to 100)

            #Allele1    #Allele2

Case           10          90

Control         2          98
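One way to quantify the association in the table above is a chi-square statistic together with an odds ratio. The slides do not name a specific test, so the choice of test here is an assumption, and the function names `chi_square_2x2` and `odds_ratio` are hypothetical. The sketch uses no continuity correction.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table of allele counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if allele and disease status were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def odds_ratio(table):
    """Odds of carrying allele 1 in cases divided by the odds in controls."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

# Rows: case, control. Columns: #Allele1, #Allele2 (values from the slide).
table = [[10, 90], [2, 98]]
chi2 = chi_square_2x2(table)
ratio = odds_ratio(table)
print(round(chi2, 2), round(ratio, 2))
```

For this table the statistic is about 5.67, which exceeds 3.84, the 5% critical value of the chi-square distribution with one degree of freedom. In a genome-wide study many SNPs are tested at once, so a much stricter threshold would normally be applied.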

Effect of population structure on genome-wide association studies

• Suppose our sample is drawn from a population of two groups, I and II

• Assume that group I mostly carries the first allele and group II mostly carries the second allele

• Further assume that most case subjects belong to group I and most control subjects to group II

• This leads to a false association: the first allele appears to be associated with the disease only because it is more common in group I

Effect of population structure on genome-wide association studies

• We can correct this effect if cases and controls are equally sampled from all sub-populations

• To do this we need to know the population structure
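A small numeric sketch may make the confounding concrete. All numbers are hypothetical: the first allele is assumed to have frequency 0.9 in group I and 0.1 in group II, and the allele is assumed to have no effect on disease within either group. The helper `expected_allele_counts` is an illustrative name, not something from the slides.

```python
def expected_allele_counts(n_group1, n_group2, freq1, freq2):
    """Expected (allele 1, allele 2) counts for a sample of diploid subjects.

    freq1 and freq2 are the allele 1 frequencies in groups I and II.
    """
    total = 2 * (n_group1 + n_group2)  # two alleles per subject
    allele1 = 2 * (n_group1 * freq1 + n_group2 * freq2)
    return allele1, total - allele1

# Assumed allele 1 frequencies in the two groups (hypothetical values).
FREQ_I, FREQ_II = 0.9, 0.1

# Unbalanced sampling: cases mostly from group I, controls mostly from group II.
confounded_cases = expected_allele_counts(40, 10, FREQ_I, FREQ_II)
confounded_controls = expected_allele_counts(10, 40, FREQ_I, FREQ_II)

# Balanced sampling: cases and controls drawn equally from both groups.
balanced_cases = expected_allele_counts(25, 25, FREQ_I, FREQ_II)
balanced_controls = expected_allele_counts(25, 25, FREQ_I, FREQ_II)

print("unbalanced:", confounded_cases, confounded_controls)
print("balanced:  ", balanced_cases, balanced_controls)
```

With unbalanced sampling the expected counts are roughly 74 versus 26 copies of allele 1 in cases and controls, although the allele plays no role in the disease. With balanced sampling both rows are expected to hold about 50 copies and the apparent association disappears.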

Population structure prediction

• Treated as an unsupervised learning problem (i.e. clustering)

Clustering

• Suppose we want to cluster n vectors in $\mathbb{R}^d$ into two groups. Define $C_1$ and $C_2$ as the two groups.

• Our objective is to find $C_1$ and $C_2$ that minimize

$$\sum_{i=1}^{2} \sum_{x_j \in C_i} \| x_j - m_i \|^2$$

where $m_i$ is the mean of class $C_i$

K-means algorithm for two clusters

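Input: $x_i \in \mathbb{R}^d$, $i = 1 \ldots n$

Algorithm:

1. Initialize: assign $x_i$ to $C_1$ or $C_2$ with equal probability and compute means:

$$m_1 = \frac{1}{|C_1|} \sum_{x_i \in C_1} x_i, \qquad m_2 = \frac{1}{|C_2|} \sum_{x_i \in C_2} x_i$$

2. Recompute clusters: assign $x_i$ to $C_1$ if $\|x_i - m_1\| < \|x_i - m_2\|$, otherwise assign to $C_2$

3. Recompute means $m_1$ and $m_2$

4. Compute objective

$$\sum_{i=1}^{2} \sum_{x_j \in C_i} \| x_j - m_i \|^2$$

5. Compare with the objective of the previous clustering. If the difference is smaller than a threshold $\epsilon$ then stop, otherwise go to step 2.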
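The five steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `two_means`, the parameters `eps`, `seed`, and `max_iter`, the redraw when the random initialization leaves a cluster empty, and the toy data are all hypothetical additions.

```python
import random

def sq_dist(x, m):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, m))

def mean(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    return [sum(p[k] for p in points) / len(points) for k in range(len(points[0]))]

def objective(points, labels, means):
    """Sum of squared distances from each point to its cluster mean."""
    return sum(sq_dist(x, means[c]) for x, c in zip(points, labels))

def two_means(points, eps=1e-9, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: random assignment (redrawn if one cluster is empty), then means.
    while True:
        labels = [rng.randrange(2) for _ in points]
        if 0 < sum(labels) < len(points):
            break
    means = [mean([x for x, c in zip(points, labels) if c == i]) for i in (0, 1)]
    current = objective(points, labels, means)
    for _ in range(max_iter):
        previous = current
        # Step 2: assign each point to the closer mean.
        labels = [0 if sq_dist(x, means[0]) < sq_dist(x, means[1]) else 1
                  for x in points]
        # Step 3: recompute means (an empty cluster keeps its old mean).
        for i in (0, 1):
            members = [x for x, c in zip(points, labels) if c == i]
            if members:
                means[i] = mean(members)
        # Steps 4 and 5: compute the objective and stop once it stops improving.
        current = objective(points, labels, means)
        if previous - current < eps:
            break
    return labels, means, current

# Toy data: two well-separated groups of three points each.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, means, final_objective = two_means(points)
print(labels, final_objective)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful. On data that is less clearly separated, different seeds may end in different local optima.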

K-means

• Is it guaranteed to find the clustering which optimizes the objective?

• It is only guaranteed to find a local optimum

• We can prove that the objective never increases with subsequent iterations

Proof sketch of convergence of k-means

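$$\sum_{i=1}^{2} \sum_{x_j \in C_i} \| x_j - m_i \|^2 \;\ge\; \sum_{i=1}^{2} \sum_{x_j \in C_i^*} \| x_j - m_i \|^2 \;\ge\; \sum_{i=1}^{2} \sum_{x_j \in C_i^*} \| x_j - m_i^* \|^2$$

Here $C_i^*$ are the clusters after reassignment and $m_i^*$ are their recomputed means.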

Justification of first inequality: by assigning xj to the closest mean the objective decreases or stays the same

Justification of second inequality: for a given cluster its mean minimizes squared error loss
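The second justification can be checked with a short derivation. As a sketch, take any cluster $C$ with mean $m$ and any other candidate center $c$ (the symbol $c$ is introduced here for illustration only):

```latex
\begin{aligned}
\sum_{x_j \in C} \| x_j - c \|^2
  &= \sum_{x_j \in C} \| (x_j - m) + (m - c) \|^2 \\
  &= \sum_{x_j \in C} \| x_j - m \|^2
     + 2 (m - c)^{\top} \sum_{x_j \in C} (x_j - m)
     + |C| \, \| m - c \|^2 \\
  &= \sum_{x_j \in C} \| x_j - m \|^2 + |C| \, \| m - c \|^2
  \;\ge\; \sum_{x_j \in C} \| x_j - m \|^2 .
\end{aligned}
```

The cross term vanishes because $\sum_{x_j \in C} (x_j - m) = 0$ by the definition of the mean. Taking $C = C_i^*$, $m = m_i^*$, and $c = m_i$ gives the second inequality for each cluster.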
