
Finding Allelic Frequencies Using MapReduce/Hadoop

Mahmoud Parsian, Ph.D. in Computer Science

Senior Architect @ Illumina (1)

2014 Hadoop Summit, Amsterdam, Netherlands

April 3, 2014

(1) www.illumina.com

Table of Contents

1 Biography

2 Overview

3 Basic Definitions

4 Source of Data for Allelic Frequency

5 Allelic Frequency Analysis

6 Allelic Frequency Using MapReduce/Hadoop

7 Running Allelic Frequency Analysis


Biography

Who am I?

Name: Mahmoud Parsian

Education: Ph.D. in Computer Science

Work: Senior Architect at Illumina, Inc.

  Lead the Big Data team at Illumina
  Develop scalable regression algorithms
  Develop DNA-Seq and RNA-Seq workflows
  Use Java/MapReduce/Hadoop/HBase

Author of two books:

  JDBC Recipes (Apress)
  JDBC MetaData Recipes (Apress)



Overview

Genetic variants in a patient's germline DNA are identified through next-generation sequencing technology.

Patient Sample → ... → VCF

The magnitude of this data makes it challenging to store and analyze:

  several million variants per patient
  several billion variants for groups of patients

The group comparison estimates allelic or genotypic frequency differences between groups for all variants present in any individual in the analysis cohort.

Use Fisher's Exact Test to determine whether a difference in frequency is statistically significant.

Find allelic frequencies (using MapReduce/Hadoop)

Find the top-100 p-values for two groups of variants (using MapReduce/Hadoop)


Basic Definitions

Some Basic Definitions

Chromosome

Bioset

Bioset Record

Allele

Allelic Frequency


Chromosome

The term chromosome comes from the Greek words for color (chroma) and body (soma).

A chromosome is an organized structure of DNA, protein, and RNA found in cells.

Human cells have 23 pairs of chromosomes, labeled {1, 2, ..., 22, X, Y}, for a total of 46 chromosomes.

How are chromosomes inherited? In humans, one copy of each chromosome is inherited from the female parent and the other from the male parent.


[Figure: Chromosome in Picture]

[Figure: Cells to DNA]


What is a Bioset?

Individually analyzed data signatures are referred to as "biosets". Biosets encompass data in the form of experimental sample comparisons as well as genotype signatures.

A bioset is most commonly referred to as a "gene signature". A sample record of a bioset contains a chromosome, its start and stop positions, two alleles, and other related information.

A germline bioset can have 4.3 million records.

A patient may have any number of biosets.

Each bioset has a set of genes.


VCF to Bioset

Sample → FASTQ (1) Data

FASTQ Data → DNA-Seq

DNA-Seq → VCF (2)

VCF → Bioset

Bioset → Ready for Analysis

(1) FASTQ is a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores.

(2) VCF = Variant Call Format, a text file format used in bioinformatics for storing gene sequence variations.

Sample Record of a Bioset

A bioset can have 4.3 million records.

A sample record of a bioset contains:

  a chromosome (chromosomeID: 1, 2, 3, ...)
  start position
  stop position
  two alleles: allele1, allele2
  reference genome
  other related information such as mutation class, ...


What is an Allele?

An allele is a viable DNA coding that occupies a given locus (position) on a chromosome. There are two alleles per chromosome position, called allele1 and allele2.

Allelic frequency is defined as "the percentage of a population of a species that carries a particular allele on a given chromosome locus."

Alternatively, "allele frequency" can be defined as the frequency of an allele relative to that of other alleles of the same gene in a population.

Fisher's Exact Test is used to calculate the p-value for allelic frequency.


Two Alleles: allele1, allele2

An allele is one of two or more versions of a gene. An individual inherits two alleles for each gene, one from each parent.


Source of Data for Allelic Frequency


VCF to Bioset

Sample → FASTQ Data → DNA-Seq → VCF → Bioset

Bioset Record Elements:

1. chromosomeID

2. startPosition

3. stopPosition

4. allele1

5. allele2

6. referenceGenome

7. mutationClass

...



Size of Data for Analysis

One Bioset = 4.3 million records

For allelic frequency, form two groups: Group-A and Group-B

Keep two sets of the same data:

  one set for Group-A
  one set for Group-B

Group-A = 6,000 Biosets

Group-B = 9,000 Biosets

6,000 + 9,000 = 15,000

15,000 Total Biosets to analyze

15,000 x 4.3M = 64.5 Billion records


Allelic Frequency Analysis


Given

  Group-A = set of biosets = {A1, A2, ..., An}
  Group-B = set of biosets = {B1, B2, ..., Bm}

Find

Allelic frequency for every (chromosomeID, start, stop, allele)

The p-value for every (chromosomeID, start, stop, allele)

The top-100 p-values



Allelic Frequency by Example

Group-A: 6 biosets

Bioset-ID  Allele-1  Allele-2
1          A         C
2          A         A
3          A         C
4          G         G
5          A         A
6          AC        T



Allelic Frequency by Example (continued)

Group-B: 5 biosets

Bioset-ID  Allele-1  Allele-2
7          A         A
8          C         C
9          A         C
10         A         A
11         A         A




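Create a frequency table for Group-A and Group-B. For each allele, "Known" is the number of times that allele occurs in the group and "Others" is the number of occurrences of all other alleles in the group:

Allele  Group-A Known  Group-A Others  Group-B Known  Group-B Others
A       6              6               7              3
C       2              10              3              7
G       2              10              0              10
AC      1              11              0              10
T       1              11              0              10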




Create a contingency table for each allele. For allele A:

         Known  Others
Group-A  6      6
Group-B  7      3

Now we can apply Fisher's Exact Test, or other tests, for analysis.
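The counting behind the frequency and contingency tables can be sketched in plain Java. This is a minimal illustration rather than the presenter's code: the class name `AlleleCounts` and its methods are hypothetical, and the input arrays are copied from the Group-A and Group-B example tables.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch (hypothetical names): count alleles per group and derive a 2x2 table. */
final class AlleleCounts {

    // Example data copied from the Group-A and Group-B tables.
    static final String[][] GROUP_A = {
        {"A", "C"}, {"A", "A"}, {"A", "C"}, {"G", "G"}, {"A", "A"}, {"AC", "T"}
    };
    static final String[][] GROUP_B = {
        {"A", "A"}, {"C", "C"}, {"A", "C"}, {"A", "A"}, {"A", "A"}
    };

    /** Counts how often each allele occurs across all (allele1, allele2) pairs. */
    static Map<String, Integer> countAlleles(String[][] biosets) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] pair : biosets) {
            for (String allele : pair) {
                counts.merge(allele, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns {knownA, othersA, knownB, othersB} for one allele. */
    static int[] contingency(String allele, String[][] groupA, String[][] groupB) {
        int knownA = countAlleles(groupA).getOrDefault(allele, 0);
        int knownB = countAlleles(groupB).getOrDefault(allele, 0);
        int totalA = 2 * groupA.length; // two alleles per bioset record
        int totalB = 2 * groupB.length;
        return new int[] {knownA, totalA - knownA, knownB, totalB - knownB};
    }

    public static void main(String[] args) {
        // Prints the contingency row values for allele A.
        System.out.println(Arrays.toString(contingency("A", GROUP_A, GROUP_B)));
    }
}
```

For allele A this yields 6, 6, 7, 3, the same four cells as the contingency table.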



Fisher’s Exact Test Using R

# R (version 2.15.1)
> mytable = rbind( c(6, 6), c(7, 3) );
> mytable
     [,1] [,2]
[1,]    6    6
[2,]    7    3
> fisher.test(mytable)

        Fisher's Exact Test for Count Data

data:  mytable
p-value = 0.4149



Fisher’s Exact Test Definition

Here a, b, c, and d refer to the values of the 2 × 2 contingency table shown below:

               Known   Others   Row Totals
Group-A        a       b        a + b
Group-B        c       d        c + d
Column Totals  a + c   b + d    n = a + b + c + d

$$
p = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}}
  = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}
$$
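The formula gives the probability of one particular table with fixed row and column totals. A two-sided p-value is then commonly obtained by summing this probability over every table with the same totals that is no more likely than the observed one. The sketch below follows that convention; it is an assumption about how the test is evaluated, not the presenter's implementation. The class and method names are hypothetical, and the direct log-factorial summation is only suitable for small counts.

```java
/** Sketch (hypothetical names) of a two-sided Fisher's Exact Test for a 2x2 table. */
final class FisherExactSketch {

    /** ln(n!) by direct summation; adequate for small example counts. */
    static double logFactorial(int n) {
        double sum = 0.0;
        for (int i = 2; i <= n; i++) {
            sum += Math.log(i);
        }
        return sum;
    }

    /** Probability of one 2x2 table with fixed margins (the factorial formula). */
    static double tableProbability(int a, int b, int c, int d) {
        int n = a + b + c + d;
        double logP = logFactorial(a + b) + logFactorial(c + d)
                + logFactorial(a + c) + logFactorial(b + d)
                - logFactorial(a) - logFactorial(b)
                - logFactorial(c) - logFactorial(d) - logFactorial(n);
        return Math.exp(logP);
    }

    /** Sums the probabilities of all tables no more probable than the observed one. */
    static double twoSidedPValue(int a, int b, int c, int d) {
        int row1 = a + b;
        int row2 = c + d;
        int col1 = a + c;
        double observed = tableProbability(a, b, c, d);
        double pValue = 0.0;
        int lo = Math.max(0, col1 - row2); // smallest feasible top-left cell
        int hi = Math.min(row1, col1);     // largest feasible top-left cell
        for (int x = lo; x <= hi; x++) {
            int c2 = col1 - x;
            double p = tableProbability(x, row1 - x, c2, row2 - c2);
            // small relative tolerance so that ties with the observed table count
            if (p <= observed * (1.0 + 1e-7)) {
                pValue += p;
            }
        }
        return Math.min(1.0, pValue);
    }

    public static void main(String[] args) {
        // Contingency table for allele A: Group-A (6, 6), Group-B (7, 3).
        System.out.println(twoSidedPValue(6, 6, 7, 3));
    }
}
```

For the allele A table this evaluates to approximately 0.4149, in agreement with the R session.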


Allelic Frequency Using MapReduce/Hadoop


MapReduce PHASE-1:

Eliminate Duplicate Bioset Records

MapReduce PHASE-2:

Allelic Frequency using Fisher’s Exact Test

MapReduce PHASE-3:

Find Top-100



MapReduce PHASE-1: Eliminate Duplicate Records

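Mapper:

// key = chrID:start:stop:group:allele1:allele2:reference
// group = {a, b}
// value = mutationClass
map(key, value) {
    emit(key, value);
}

Reducer:

// key = chrID:start:stop:group:allele1:allele2:reference
// values = List<mutationClass>
reduce(key, values) {
    maxMC = max(values); // max. mutationClass
    // split the composite input key into its fields
    (chrID, start, stop, group, allele1, allele2, reference) = split(key, ":");
    outputKey = chrID:start:stop
    outputValue = group:allele1:allele2:reference:maxMC
    emit(outputKey, outputValue);
}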
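The Phase-1 logic can be simulated without a cluster. The sketch below is plain Java, not Hadoop code: `shuffle` stands in for the framework's grouping by key, and `reduce` applies the max-mutationClass rule. The class name, the tab separator in the output, and the treatment of mutationClass as an integer are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch (hypothetical names) simulating the Phase-1 duplicate elimination. */
final class DedupSketch {

    /** Mimics the shuffle: groups mutationClass values by the full composite key. */
    static Map<String, List<Integer>> shuffle(List<String[]> mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] kv : mapOutput) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(kv[1])); // assumes an integer mutationClass
        }
        return grouped;
    }

    /** Mimics the reducer: one output line per distinct key, keeping the max mutationClass. */
    static List<String> reduce(Map<String, List<Integer>> grouped) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            // key = chrID:start:stop:group:allele1:allele2:reference
            String[] f = e.getKey().split(":", 7);
            int maxMC = Collections.max(e.getValue());
            String outputKey = f[0] + ":" + f[1] + ":" + f[2];
            String outputValue = f[3] + ":" + f[4] + ":" + f[5] + ":" + f[6] + ":" + maxMC;
            out.add(outputKey + "\t" + outputValue);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = new ArrayList<>();
        mapOutput.add(new String[] {"1:100:100:a:A:C:ref", "8"});
        mapOutput.add(new String[] {"1:100:100:a:A:C:ref", "15"}); // duplicate record
        mapOutput.add(new String[] {"1:100:100:b:A:A:ref", "8"});
        for (String line : reduce(shuffle(mapOutput))) {
            System.out.println(line);
        }
    }
}
```

Because the group and both alleles are part of the Phase-1 key, only records that agree on all of those fields collapse into one.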





MapReduce PHASE-2: Allelic Frequency using Fisher's Exact Test: Mapper

Mapper:

// key = chrID:start:stop
// value = group:allele1:allele2:reference:mutationClass
// group = {a, b}
map(key, value) {
    emit(key, value);
}



MapReduce PHASE-2: Allelic Frequency using Fisher's Exact Test: Reducer

Reducer:

// key = chrID:start:stop
// values = List<group:allele1:allele2:reference:mutationClass>
// group = {a, b}
reduce(key, values) {
    setOfAlleles = all alleles in group A and group B;
    freqTableA = (allele, known, others);
    freqTableB = (allele, known, others);
    for (String allele : setOfAlleles) {
        contingencyTable = (allele, N11, N12, N21, N22);
        pvalue = FishersExactTest(contingencyTable);
        emit(pvalue, entireRecord);
    }
}
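The table-building part of the Phase-2 reducer might look like the following for the values of a single chrID:start:stop key. This is a sketch under assumptions: the class name is hypothetical, allele strings are assumed to contain no ':' characters, and N11, N12 are taken to be the known and others counts for group a, with N21, N22 the same for group b. The p-value call is left out so that the block stays self-contained.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Sketch (hypothetical names) of the per-locus table construction in Phase 2. */
final class Phase2ReducerSketch {

    /** Builds allele -> {N11, N12, N21, N22} from the values of one locus. */
    static Map<String, int[]> contingencyTables(List<String> values) {
        Map<String, Integer> knownA = new LinkedHashMap<>();
        Map<String, Integer> knownB = new LinkedHashMap<>();
        int totalA = 0;
        int totalB = 0;
        for (String value : values) {
            // value = group:allele1:allele2:reference:mutationClass
            String[] f = value.split(":");
            boolean inA = f[0].equals("a");
            Map<String, Integer> known = inA ? knownA : knownB;
            known.merge(f[1], 1, Integer::sum);
            known.merge(f[2], 1, Integer::sum);
            if (inA) {
                totalA += 2;
            } else {
                totalB += 2;
            }
        }
        // setOfAlleles = all alleles seen in group a or group b
        TreeSet<String> setOfAlleles = new TreeSet<>(knownA.keySet());
        setOfAlleles.addAll(knownB.keySet());
        Map<String, int[]> tables = new LinkedHashMap<>();
        for (String allele : setOfAlleles) {
            int n11 = knownA.getOrDefault(allele, 0);
            int n21 = knownB.getOrDefault(allele, 0);
            tables.put(allele, new int[] {n11, totalA - n11, n21, totalB - n21});
        }
        return tables;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("a:A:C:ref:8", "a:A:A:ref:8", "b:C:C:ref:8");
        contingencyTables(values).forEach(
            (allele, t) -> System.out.println(allele + " " + Arrays.toString(t)));
    }
}
```

In a real reducer each table would be passed to the Fisher's Exact Test routine and the resulting p-value emitted with the record.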



MapReduce PHASE-3: Find Top-100

Now that we have records of the form:

p-value:chrID:start:stop:allele

how can we find the top-100 p-values (those closest to 0.00)?

SQL solution:

SELECT *
FROM allele_frequency_table
ORDER BY pvalue
LIMIT 100;



MapReduce PHASE-3: Find Top-100

top100() is defined as:

Let P = {p1, p2, ..., pn}.
Then top100(P) = {s1, s2, ..., s100},
where si ∈ P, s1 ≤ s2 ≤ ... ≤ s100, and s100 ≤ p for every other p ∈ P.

NOTE: top-100 for allelic frequency means the 100 smallest p-values, those closest to 0.00.



Find Top-100 p-values

1 Mapper:

Each mapper finds its local top-100 p-values and sends that top-100 list to the reducer. We will use many mappers.

2 Reducer:

The reducer finds the final top-100 p-values from the top-100 lists sent from the mappers. We will use a single reducer for the final top-100.



Top-100 p-values Form a Monoid

Associativity: top100(x, top100(y, z)) = top100(top100(x, y), z)

Identity: top100(x, {}) = top100({}, x) = top100(x)

Therefore, we can use a combiner as well: the combiner finds the top-100 p-values from the top-100 lists sent from the mappers.
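The two monoid laws can be checked on a small example. The sketch below uses n = 3 instead of 100 to keep the data readable. `TopNMonoidSketch`, `topN`, and `combine` are hypothetical names, and the lists hold bare p-values rather than full records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Sketch (hypothetical names): top-N of the smallest values as a monoid operation. */
final class TopNMonoidSketch {

    /** Returns the n smallest values of xs in ascending order. */
    static List<Double> topN(int n, List<Double> xs) {
        List<Double> sorted = new ArrayList<>(xs);
        Collections.sort(sorted);
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    /** The monoid operation: merge two partial results, then keep the n smallest. */
    static List<Double> combine(int n, List<Double> x, List<Double> y) {
        List<Double> merged = new ArrayList<>(x);
        merged.addAll(y);
        return topN(n, merged);
    }

    public static void main(String[] args) {
        int n = 3;
        List<Double> x = Arrays.asList(0.30, 0.01, 0.20);
        List<Double> y = Arrays.asList(0.05, 0.50);
        List<Double> z = Arrays.asList(0.02, 0.40);
        List<Double> empty = Collections.emptyList();

        // Associativity: the grouping of partial merges does not matter.
        List<Double> left = combine(n, combine(n, x, y), z);
        List<Double> right = combine(n, x, combine(n, y, z));
        System.out.println(left.equals(right));

        // Identity: merging with the empty list changes nothing.
        System.out.println(combine(n, x, empty).equals(topN(n, x)));
    }
}
```

Associativity is what allows mappers, combiners, and the final reducer to apply the same merge in any grouping and still arrive at the same final list.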



MapReduce for Top-100 p-values: Mapper

public class Top100Mapper ... {

    private SortedMap<Double, String> top100 =
        new TreeMap<Double, String>();

    // key is the pvalue (a double in the range 0.00 to 1.00)
    // entireRecord is the entire allelic frequency output record
    // (it includes the pvalue)
    map(Double key, String entireRecord) {
        // sorted by pvalue; note that a map keeps only one record per distinct pvalue
        top100.put(key, entireRecord);
        if (top100.size() > 100) {
            // remove the greatest pvalue
            top100.remove(top100.lastKey());
        }
    }

    // called once at the end of the mapper task
    cleanup() { ... }
}



MapReduce for Top-100 p-values: Mapper (continued)

public class Top100Mapper ... {

    private SortedMap<Double, String> top100 =
        new TreeMap<Double, String>();

    map(Double key, String entireRecord) {...}

    // called once at the end of the mapper task
    cleanup() {
        for (Map.Entry<Double, String> entry : top100.entrySet()) {
            Double pvalue = entry.getKey();
            String entireRecord = entry.getValue();
            String outputValue = pair(pvalue, entireRecord);
            // a NULL key sends all key-value pairs to a single reducer
            emit(NULL, outputValue);
        }
    }
}



MapReduce for Top-100 p-values: Reducer

reduce(NullWritable key, Iterable<pair<Double, String>> values) {
    SortedMap<Double, String> finalTop100 =
        new TreeMap<Double, String>();
    for (pair<Double, String> value : values) {
        Double pvalue = value.pvalue;
        String entireRecord = value.entireRecord;
        finalTop100.put(pvalue, entireRecord);
        if (finalTop100.size() > 100) {
            // remove the greatest pvalue
            finalTop100.remove(finalTop100.lastKey());
        }
    }
    // now we have the final top-100 list
    emitFinalTop100();
}



MapReduce for Top-100 p-values: Reducer (continued)

reduce(NullWritable key, Iterable<pair<Double, String>> values) {
    ...
    // now we have the final top-100 list
    // emitFinalTop100();
    for (Map.Entry<Double, String> entry : finalTop100.entrySet()) {
        Double pvalue = entry.getKey();
        String entireRecord = entry.getValue();
        emit(pvalue, entireRecord);
    }
}
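One property of keying a TreeMap on the p-value is that two records with exactly the same p-value cannot both be retained, because the second put replaces the first. If ties need to survive, a bounded max-heap of (pvalue, record) pairs is one possible alternative. The sketch below swaps in that technique; it is not the approach shown on the slides, and `TieSafeTopN` and `Scored` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch (hypothetical names): keeps the n records with the smallest p-values, ties included. */
final class TieSafeTopN {

    /** A (pvalue, record) pair. */
    static final class Scored {
        final double pvalue;
        final String record;

        Scored(double pvalue, String record) {
            this.pvalue = pvalue;
            this.record = record;
        }
    }

    private final int n;

    // Max-heap on pvalue: the head is always the largest retained p-value.
    private final PriorityQueue<Scored> heap = new PriorityQueue<>(
        Comparator.comparingDouble((Scored s) -> s.pvalue).reversed());

    TieSafeTopN(int n) {
        this.n = n;
    }

    /** Adds one record, evicting the largest p-value once more than n are held. */
    void add(double pvalue, String record) {
        heap.add(new Scored(pvalue, record));
        if (heap.size() > n) {
            heap.poll();
        }
    }

    /** Returns the retained records ordered by ascending p-value. */
    List<String> recordsAscending() {
        List<Scored> copy = new ArrayList<>(heap);
        copy.sort(Comparator.comparingDouble((Scored s) -> s.pvalue));
        List<String> out = new ArrayList<>();
        for (Scored s : copy) {
            out.add(s.record);
        }
        return out;
    }

    public static void main(String[] args) {
        TieSafeTopN top = new TieSafeTopN(3);
        top.add(0.5, "x");
        top.add(0.1, "r1");
        top.add(0.1, "r2"); // same p-value as r1; both are kept
        top.add(0.9, "y");
        System.out.println(top.recordsAscending());
    }
}
```

The same class could serve in the mapper, the combiner, and the reducer, since each of them applies the same bounded merge.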


Running Allelic Frequency Analysis


Sample Run

$ cat allelic_freq_test_100_by_100.sh

#!/bin/bash

client=AllelicFrequencyClient

groupA=bioset_ids.txt.100.a

groupB=bioset_ids.txt.100.b

$client interactive 0 $groupA $groupB

$ wc -l bioset_ids.txt.100.a bioset_ids.txt.100.b

100 bioset_ids.txt.100.a

100 bioset_ids.txt.100.b

$ cat bioset_ids.txt.100.a

427033

427039

...

$ cat bioset_ids.txt.100.b

656714

656720

...



Sample Run

$ ./allelic_freq_test_100_by_100.sh

Wed Feb 12 15:27:10 PST 2014

Feb 12 2014 15:27:10 [INFO ] [AllelicFrequencyClient] - executionType: interactive

Feb 12 2014 15:27:10 [INFO ] [AllelicFrequencyClient] - requestID: 0

Feb 12 2014 15:27:10 [INFO ] [AllelicFrequencyClient] - GroupA: bioset_ids.txt.100.a

Feb 12 2014 15:27:10 [INFO ] [AllelicFrequencyClient] - GroupB: bioset_ids.txt.100.b

...

Feb 12 2014 15:27:12 [main] [INFO ] [JobClient] - Running job: job_201401170112_0644

Feb 12 2014 15:27:13 [main] [INFO ] [JobClient] - map 0% reduce 0%


...



Feb 12 2014 15:28:45 [main] [INFO ] [JobClient] - Job complete: job_201401170112_0644

...

Feb 12 2014 15:28:45 [main] [INFO ] [JobClient] - Map-Reduce Framework

Feb 12 2014 15:28:45 [main] [INFO ] [JobClient] - Map output materialized bytes=134376521

Feb 12 2014 15:28:45 [main] [INFO ] [JobClient] - Map input records=9,352,649

Feb 12 2014 15:28:45 [main] [INFO ] [JobClient] - Reduce input groups=134,894

Feb 12 2014 15:28:45 [main] [INFO ] [JobClient] - Reduce output records=53,557

Feb 12 2014 15:28:45 [main] [INFO ] [AllelicFrequencyDriver] - run(): jobSucceeded=true

Feb 12 2014 15:28:45 [main] [INFO ] [AllelicFrequencyDriver] - run(): Job Finished in 94.423 seconds

Feb 12 2014 15:28:45 [main] [INFO ] [AllelicFrequencyDriver] - submitJob(): runStatus=0



Sample Run

$ hadoop fs -cat /biomarker/output/germline/0/part* | sort -g | head

3.9437604668735787E-115:1:2483112:2483112:32872,20773539,20774078,:8:C:C:198:2:0:200:147567

3.9437604668735787E-115:12:51372604:51373968:100191,:15:1365BP:1365BP:198:2:0:200:null

7.770768062434234E-115:13:113588869:113588869:10323,:8:G:G:1:199:199:1:40972240

2.668611249251343E-113:13:113587440:113587440:10323,:8:G:G:197:3:0:200:10286004

5.206158192319811E-111:13:113587440:113587440:10323,:8:C:C:1:199:197:3:10286004

7.693839401259585E-111:1:16682451:16684181:79290,:15:1,731BP:1,731BP:2:198:198:2:null

5.580066122186588E-110:13:113588869:113588869:10323,:8:C:C:195:5:0:200:null

2.6288489416374975E-109:17:36760271:36779253:15243,15247,:15:18,983BP:18,983BP:1:199:196:4:null

1.915822701950223E-108:17:36760271:36779253:15243,15247,:15:18983BP:18983BP:194:6:0:200:null

5.665361418625481E-107:1:2483112:2483112:32872,20773539,20774078,:8:G:G:0:200:193:7:147567



References

Wikipedia. Allele Frequency. http://en.wikipedia.org/wiki/Allele_frequency

Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer, 2013.
