seqpig script language for large bioinformatic datasets

SeqPig: A simple and scalable scripting language for large sequencing data sets in Hadoop. Arian Pasquali, June 6, 2014


Post on 09-Jun-2015


DESCRIPTION

presenting alternatives for processing large bioinformatics datasets

TRANSCRIPT

Page 1: Seqpig   script language for large bioinformatic datasets

SeqPig

A simple and scalable scripting language for large sequencing data sets in Hadoop

Arian Pasquali
June 6, 2014

Page 2

/me

Arian Pasquali
Master's student in Data Mining
Data engineer at Semasio

background
- engineering
- cloud computing
- data mining on big data
- social networks

Page 4

but first, some background

● Real-world bioinformatics datasets are huge
● Gigabytes to petabytes of data are hard to handle on a single computer
● To handle big data sets we have to master parallel programming models

Page 5

Parallel programming models

some high-performance programming models
- Serial (doesn't scale)
- MPI (expensive)
- MapReduce
  - Hadoop (cheap and scalable)

Page 6

hadoop

Hadoop is an open source implementation of MapReduce; it enables you to run MapReduce programs.

It is designed to process huge volumes of data, on the order of terabytes or petabytes, which fits many bioinformatics scenarios well.

http://hadoop.apache.org/

Page 7

how mapreduce works on hadoop

Hadoop provides a framework for MapReduce, a fault-tolerant parallel programming model
- easier to write programs than in other paradigms
- easier means cheaper
- runs on clusters of commodity hardware
- scales horizontally
  - need more power? just add more nodes
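
To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, counting bases across a handful of reads. It is not Hadoop code: the function names (`map_read`, `shuffle`, `reduce_counts`, `base_counts`) are made up for illustration, and a real cluster would run the map and reduce calls on different nodes.

```python
from collections import defaultdict


def map_read(read):
    """Map step: emit one (base, 1) pair per base in a read."""
    for base in read.upper():
        yield base, 1


def shuffle(pairs):
    """Shuffle step: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def base_counts(reads):
    """Run the three steps in sequence on an in-memory list of reads."""
    pairs = (pair for read in reads for pair in map_read(read))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(base_counts(["ACGT", "aacg"]))
```

Because each `map_read` call looks at one read only and each `reduce_counts` call looks at one key only, both steps can be spread over as many nodes as are available, which is where the horizontal scaling comes from.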

Page 8

an application: BLAST algorithm

MapReduce Tasks
- load data
- map sequences
- partition
- reduce (merge)
- output results
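
The five tasks above can be sketched as a toy sequence search. This is not BLAST: as a stand-in for real seeding and scoring, the map step simply counts k-mers shared between a query and a subject. The seed length `K`, the function names and the CRC-based partitioner are all assumptions chosen for illustration.

```python
import zlib
from collections import defaultdict

K = 4  # hypothetical seed length, not a BLAST default


def kmers(seq, k=K):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def load(records):
    """Load data: the 'database' here is just a list of (id, sequence) pairs."""
    return list(records)


def map_sequence(query, record):
    """Map sequences: emit (query_id, (subject_id, shared_kmers)) for subjects sharing seeds."""
    (qid, qseq), (sid, sseq) = query, record
    shared = len(kmers(qseq) & kmers(sseq))
    if shared:
        yield qid, (sid, shared)


def partition(pairs, n_reducers):
    """Partition: route every key to exactly one reducer bucket."""
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for key, value in pairs:
        buckets[zlib.crc32(key.encode()) % n_reducers][key].append(value)
    return buckets


def reduce_hits(hits, top=2):
    """Reduce (merge): keep the best-scoring subjects for one query."""
    return sorted(hits, key=lambda h: (-h[1], h[0]))[:top]


def search(queries, database, n_reducers=2):
    """Chain the stages and return {query_id: [(subject_id, score), ...]} as the output."""
    db = load(database)
    pairs = [p for q in queries for rec in db for p in map_sequence(q, rec)]
    results = {}
    for bucket in partition(pairs, n_reducers):
        for qid, hits in bucket.items():
            results[qid] = reduce_hits(hits)
    return results
```

Partitioning by query id guarantees that all hits for one query reach the same reducer, so the merge step never needs to look across buckets.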

Page 9

MapReduce is easier, but not trivial

Page 10

Apache Pig tries to solve that

Apache Pig addresses that. Under the hood it applies the MapReduce paradigm, and it hides many of the pitfalls of writing MapReduce code by hand.

Page 11

Pig version of the same code

Page 12

Apache Pig in Bioinformatics

It is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.

It can be easier

Page 13

SeqPig

Scalable scripting language based on Apache Pig for large-scale sequence analysis

Page 14

SeqPig

● a script language,
● a library,
● and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner

http://seqpig.sourceforge.net/

Page 15

SeqPig and data format support

Currently it supports
- BAM, SAM, FastQ and Qseq input and output
- FASTA input

Page 16

possible use cases

● converting data formats
● filtering regions of a chromosome
● computing base frequencies
● alignments
● collecting read mapping quality statistics
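
As a small illustration of the first use case, the sketch below converts FASTQ records to FASTA in plain Python. It does not use SeqPig, and it assumes the common unwrapped layout of exactly four lines per FASTQ record; the function names are hypothetical.

```python
def fastq_records(lines):
    """Yield (name, sequence, quality) from 4-line FASTQ records."""
    it = iter(line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if qual is None or not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:], seq, qual


def fastq_to_fasta(lines):
    """Yield FASTA lines for each FASTQ record, dropping the quality string."""
    for name, seq, _qual in fastq_records(lines):
        yield f">{name}"
        yield seq


if __name__ == "__main__":
    sample = ["@r1\n", "ACGT\n", "+\n", "IIII\n"]
    print("\n".join(fastq_to_fasta(sample)))
```

Each record is converted independently of the others, which is what makes this kind of job easy to split across many nodes.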

Page 17

code example

-- import the filter definitions used below
run scripts/filter_defs.pig
-- load the reads from a BAM file
A = load 'input.bam' using BamLoader('yes');
-- keep only mapped, non-duplicate reads
B = FILTER A BY not ReadUnmapped(flags) and not IsDuplicate(flags);
-- split each read into per-base records, then flatten to one row per base
C = FOREACH B GENERATE ReadSplit(name,start,read,cigar,basequal,flags,mapqual,refindex,refname,attributes#'MD');
D = FOREACH C GENERATE FLATTEN($0);
base_stats_data = FOREACH D GENERATE refbase, basepos, UPPER(readbase) AS readbase;
-- count each (refbase, basepos, readbase) combination
base_stats_grouped = GROUP base_stats_data BY (refbase, basepos, readbase);
base_stats_grouped_count = FOREACH base_stats_grouped GENERATE group.$0 AS refbase, group.$1 AS basepos, group.$2 as readbase, COUNT($1) AS bcount;
-- regroup by (refbase, basepos) and order the read bases by count, descending
base_stats_grouped = GROUP base_stats_grouped_count by (refbase, basepos);
base_stats = FOREACH base_stats_grouped {
  TMP1 = FOREACH base_stats_grouped_count GENERATE readbase, bcount;
  TMP2 = ORDER TMP1 BY bcount desc;
  GENERATE group.$0, group.$1, TMP2;
}
-- write the result
STORE base_stats into 'outputfile_readstats.txt';
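
For readers who do not know Pig Latin, the grouping logic of the script can be mirrored in a few lines of plain Python. This is only a sketch of the two GROUP steps and the nested ORDER: it assumes the input is already a list of (refbase, basepos, readbase) tuples, which is what the script has after the ReadSplit and FLATTEN steps, and it leaves out loading and filtering the BAM file.

```python
from collections import Counter, defaultdict


def base_stats(per_base_rows):
    """Mirror the script's group/count/order steps on (refbase, basepos, readbase) tuples."""
    # First GROUP + COUNT: occurrences of each (refbase, basepos, readbase)
    counts = Counter((ref, pos, read.upper()) for ref, pos, read in per_base_rows)

    # Second GROUP: collect (readbase, bcount) pairs per (refbase, basepos)
    grouped = defaultdict(list)
    for (ref, pos, read), n in counts.items():
        grouped[(ref, pos)].append((read, n))

    # Nested ORDER ... BY bcount desc: most frequent read base first
    return {key: sorted(bag, key=lambda rb: -rb[1]) for key, bag in grouped.items()}


if __name__ == "__main__":
    rows = [("A", 0, "A"), ("A", 0, "a"), ("A", 0, "G"), ("C", 1, "C")]
    for (ref, pos), bag in base_stats(rows).items():
        print(ref, pos, bag)
```

The Pig version expresses the same steps declaratively and leaves it to Pig to turn them into MapReduce jobs.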

Page 18

results

A 0 {(A,19),(G,2)}
A 1 {(A,10)}
A 2 {(A,18)}
A 3 {(A,16)}
A 4 {(A,14)}
A 5 {(A,15)}
A 6 {(A,16),(G,2)}
...
A 98 {(A,7)}
A 99 {(A,14)}
C 0 {(C,6)}
C 1 {(C,11)}
C 2 {(C,9)}
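
Each result line holds the reference base, the position within the read, and a bag of (read base, count) pairs. A short sketch of how such lines could be parsed to get the fraction of read bases that differ from the reference is shown below. It assumes the whitespace-separated layout printed on the slide; the helper names are hypothetical.

```python
import re

# refbase, basepos, then a {...} bag of (base,count) pairs
LINE = re.compile(r"^(\S+)\s+(\d+)\s+\{(.*)\}$")
PAIR = re.compile(r"\(([^,()]+),(\d+)\)")


def parse_line(line):
    """Return (refbase, basepos, {readbase: count}) for one result line."""
    match = LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected result line: {line!r}")
    refbase, basepos, bag = match.group(1), int(match.group(2)), match.group(3)
    counts = {base: int(n) for base, n in PAIR.findall(bag)}
    return refbase, basepos, counts


def mismatch_fraction(line):
    """Fraction of read bases at this position that differ from the reference base."""
    refbase, _pos, counts = parse_line(line)
    total = sum(counts.values())
    return (total - counts.get(refbase, 0)) / total if total else 0.0


if __name__ == "__main__":
    print(mismatch_fraction("A 0 {(A,19),(G,2)}"))
```

For the first line of the results, 2 of the 21 read bases differ from the reference A.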

Page 19

results plotted

Page 20

scalability test

● 61 GB dataset
● running some FastQC stats

* speed in minutes

Page 21

related work

Biodoop: Bioinformatics on Hadoop
http://dl.acm.org/citation.cfm?id=1679817

BioPig: A Hadoop-based Analytic Toolkit for Large-Scale Sequence Data, Oxford Journals
http://bioinformatics.oxfordjournals.org/content/early/2013/09/10/bioinformatics.btt528

Page 22

some cloud computing solutions

Amazon AWS, general purpose
http://aws.amazon.com/

Mortar Data, focused on data science
http://www.mortardata.com/

CloudGene, focused on bioinformatics users
http://cloudgene.uibk.ac.at/

Page 23

cloudgene, mapreduce for bioinformatics

Page 24

conclusions

Bioinformatics has produced innovative algorithms and solutions that are sometimes adopted in other fields of computer science.

Neural networks in artificial intelligence and machine learning are one example.

Now, large-scale approaches from data mining are helping bioinformatics move forward, faster and cheaper.