seqpig script language for large bioinformatic datasets

SeqPig

A simple and scalable scripting language for large sequencing data sets in Hadoop

Arian Pasquali

June 6, 2014

/me

Arian Pasquali
Master's student in Data Mining
Data engineer at Semasio

background
- engineering
- cloud computing
- data mining on big data
- social networks

case study

SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop.
Schumacher A, Pireddu L, Niemenmaa M, Kallio A, Korpelainen E, Zanetti G, Heljanko K.

Bioinformatics. 2014 Jan 1;30(1):119-20. doi: 10.1093/bioinformatics/btt601. Epub 2013 Oct 22.

http://www.ncbi.nlm.nih.gov/pubmed/24149054


but first, some background

● Real-world bioinformatics datasets are huge
● Gigabytes/Petabytes are hard to handle on a single computer
● In order to handle big data sets we have to master parallel programming models

Parallel programming models

some high-performance programming models
- Serial (doesn’t scale)
- MPI (expensive)
- MapReduce
  - Hadoop (cheap and scalable)

hadoop

Hadoop is an open-source framework that enables you to run MapReduce programs.

It is designed to process huge volumes of data, on the order of terabytes or petabytes, which fits many bioinformatics scenarios well.

http://hadoop.apache.org/

how mapreduce works on hadoop

Hadoop provides a framework for MapReduce, a fault-tolerant parallel programming model
- easier to write programs than in other paradigms
- easier means cheaper
- runs on clusters with commodity hardware
- scales horizontally
  - need more power? just add more nodes
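The "just add more nodes" idea can be illustrated on a single machine. The following is a minimal sketch, not Hadoop code: threads stand in for cluster nodes, and the function names (`count_bases`, `parallel_base_counts`) and the toy reads are assumptions made for illustration. It shows that the merged result does not depend on how many workers share the input.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_bases(chunk):
    """Work done by one worker: count the bases in its share of the reads."""
    counts = Counter()
    for read in chunk:
        counts.update(read.upper())
    return counts


def parallel_base_counts(reads, n_workers):
    """Split the reads into n_workers chunks, count each chunk, merge the partial counts."""
    chunks = [reads[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_bases, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Toy input (hypothetical reads, not real sequencing data).
reads = ["ACGT", "AACC", "ggtt", "ACGA"]

# The answer is the same with one worker or with four.
print(parallel_base_counts(reads, 1) == parallel_base_counts(reads, 4))  # prints True
```

On a real cluster the chunks would be blocks of a distributed file and the workers separate machines, but the split, compute and merge structure is the same.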

an application: BLAST algorithm

MapReduce Tasks
- load data
- map sequences
- partition
- reduce (merge)
- output results
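The five stages listed above can be sketched in plain Python. This is not BLAST and not Hadoop code: it runs the same stages on a much simpler toy task (counting k-mers), and every function name here is an assumption chosen for illustration.

```python
import zlib
from collections import defaultdict


def load(lines):
    """Load: keep non-empty lines as upper-cased sequences."""
    return [ln.strip().upper() for ln in lines if ln.strip()]


def map_kmers(seq, k):
    """Map: emit a (k-mer, 1) pair for every window of length k."""
    return [(seq[i:i + k], 1) for i in range(len(seq) - k + 1)]


def partition(pairs, n_reducers):
    """Partition: route each key to one reducer using a stable hash."""
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for key, value in pairs:
        idx = zlib.crc32(key.encode()) % n_reducers
        buckets[idx][key].append(value)
    return buckets


def reduce_counts(bucket):
    """Reduce: merge the values collected for each key."""
    return {key: sum(values) for key, values in bucket.items()}


def run_job(lines, k=3, n_reducers=2):
    """Run all stages in order and return the merged output."""
    seqs = load(lines)
    pairs = [pair for seq in seqs for pair in map_kmers(seq, k)]
    result = {}
    for bucket in partition(pairs, n_reducers):
        result.update(reduce_counts(bucket))
    return result


# Output: print the k-mer counts for two toy sequences.
counts = run_job(["ACGTAC", "", "cgtacg"])
for kmer in sorted(counts):
    print(kmer, counts[kmer])
```

Because every occurrence of a key lands in the same bucket, each reducer can finish its keys without talking to the others, which is what lets the reduce stage run in parallel.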

MapReduce is easier, but not trivial

Apache Pig tries to solve that

Apache Pig solves that.
Under the hood it applies the MapReduce paradigm.
It hides the pitfalls of writing MapReduce code.

Pig version of the same code

Apache Pig in Bioinformatics

It is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.

It can be easier

SeqPig

Scalable scripting language based on Apache Pig for large-scale sequence analysis

SeqPig

● a script language,
● a library,
● and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner

http://seqpig.sourceforge.net/

SeqPig and data format support

Currently it supports
● BAM, SAM, FastQ and Qseq (input and output)
● FASTA (input)

possible use cases

● converting data formats
● filtering regions of a chromosome
● computing base frequencies
● alignments
● collecting read-mapping-quality statistics
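To make the first use case concrete, here is a minimal single-machine sketch of a format conversion (FASTQ to FASTA) plus a simple per-read quality statistic. It is not SeqPig code. The function names are hypothetical, the parser assumes well-formed 4-line FASTQ records, and the quality offset of 33 (Phred+33) is an assumption about the input encoding.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) from 4-line FASTQ records."""
    lines = [ln.rstrip("\n") for ln in lines]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        yield header[1:], seq, qual


def fastq_to_fasta(lines):
    """Convert FASTQ text lines to a FASTA string, dropping the qualities."""
    return "".join(f">{name}\n{seq}\n" for name, seq, _ in parse_fastq(lines))


def mean_quality(qual, offset=33):
    """Mean Phred score of one quality string (assumes Phred+33 encoding)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


# Two toy records (hypothetical data).
fastq = ["@read1", "ACGT", "+", "IIII", "@read2", "GGCA", "+", "!!II"]

print(fastq_to_fasta(fastq), end="")
for name, _, qual in parse_fastq(fastq):
    print(name, mean_quality(qual))
```

SeqPig does this kind of work through its loader and storer functions on Hadoop, so the same conversion scales to files that do not fit on one machine.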

code example

-- load the filter definitions used below
run scripts/filter_defs.pig
-- load the reads from a BAM file
A = load 'input.bam' using BamLoader('yes');
-- keep only mapped, non-duplicate reads
B = FILTER A BY not ReadUnmapped(flags) and not IsDuplicate(flags);
-- split each read into its individual bases, then flatten to one row per base
C = FOREACH B GENERATE ReadSplit(name,start,read,cigar,basequal,flags,mapqual,refindex,refname,attributes#'MD');
D = FOREACH C GENERATE FLATTEN($0);
-- count how often each read base occurs per (reference base, position)
base_stats_data = FOREACH D GENERATE refbase, basepos, UPPER(readbase) AS readbase;
base_stats_grouped = GROUP base_stats_data BY (refbase, basepos, readbase);
base_stats_grouped_count = FOREACH base_stats_grouped GENERATE group.$0 AS refbase, group.$1 AS basepos, group.$2 as readbase, COUNT($1) AS bcount;
-- regroup by (reference base, position) and order the read bases by count
base_stats_grouped = GROUP base_stats_grouped_count by (refbase, basepos);
base_stats = FOREACH base_stats_grouped {
    TMP1 = FOREACH base_stats_grouped_count GENERATE readbase, bcount;
    TMP2 = ORDER TMP1 BY bcount desc;
    GENERATE group.$0, group.$1, TMP2;
}
STORE base_stats into 'outputfile_readstats.txt';
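For readers unfamiliar with Pig's GROUP and nested FOREACH, the grouping logic of the script can be mirrored in a few lines of Python. This is only a sketch of the counting steps: it starts from ready-made (refbase, basepos, readbase) tuples, which in the script come from ReadSplit and FLATTEN, and the function name `base_stats` and the sample tuples are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def base_stats(observations):
    """Group (refbase, basepos, readbase) tuples and rank read bases by count.

    Returns a dict mapping (refbase, basepos) to a list of
    (readbase, count) pairs sorted by descending count.
    """
    # First GROUP BY: count each (refbase, basepos, readbase) combination.
    counts = Counter((ref, pos, read.upper()) for ref, pos, read in observations)

    # Second GROUP BY: collect the per-readbase counts under (refbase, basepos).
    grouped = defaultdict(list)
    for (ref, pos, read), n in counts.items():
        grouped[(ref, pos)].append((read, n))

    # ORDER ... BY bcount desc inside each group.
    return {
        key: sorted(pairs, key=lambda p: p[1], reverse=True)
        for key, pairs in grouped.items()
    }


# Toy observations (hypothetical, not taken from a real BAM file).
obs = [("A", 0, "A"), ("A", 0, "a"), ("A", 0, "G"), ("A", 1, "A"), ("C", 0, "C")]
stats = base_stats(obs)
for (ref, pos), pairs in sorted(stats.items()):
    print(ref, pos, pairs)
```

Each printed row has the same shape as a row of the script's output: a reference base, a position within the read, and the observed read bases ranked by frequency.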

results

A 0 {(A,19),(G,2)}

A 1 {(A,10)}

A 2 {(A,18)}

A 3 {(A,16)}

A 4 {(A,14)}

A 5 {(A,15)}

A 6 {(A,16),(G,2)}

...

A 98 {(A,7)}

A 99 {(A,14)}

C 0 {(C,6)}

C 1 {(C,11)}

C 2 {(C,9)}

results plotted

scalability test
● 61 GB dataset
● running some FastQC stats

* speed in minutes

related work

Biodoop: Bioinformatics on Hadoop
http://dl.acm.org/citation.cfm?id=1679817

BioPig: A Hadoop-based Analytic Toolkit for Large-Scale Sequence Data, Oxford Journals
http://bioinformatics.oxfordjournals.org/content/early/2013/09/10/bioinformatics.btt528

some cloud computing solutions

Amazon AWS, general purpose
http://aws.amazon.com/

Mortar Data, focused on data science
http://www.mortardata.com/

CloudGene, focused on bioinformatics users
http://cloudgene.uibk.ac.at/

cloudgene, mapreduce for bioinformatics

conclusions

Bioinformatics has produced innovative algorithms and solutions that are sometimes adopted in other fields of computer science. Neural networks in artificial intelligence and machine learning are one example.

Now, scalable approaches from data mining are helping bioinformatics move forward, faster and cheaper.

thank you

[email protected]
