efficient data handling in large-scale sequence database...

14
1 1 Efficient Data Handling in Large-Scale Sequence Database Searches H. Lin , X. Ma , W. Feng * , A. Geist , and N. Samatova NCSU * Virginia Tech ORNL 2 Outline Sequence database search Parallel BLAST background mpiBLAST & pioBLAST New release: mpiBLAST-pio GreenGene: search NT against NT practice (SC|05, StorCloud Demo)

Upload: others

Post on 14-Mar-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Efficient Data Handling in Large-Scale Sequence Database ...people.cs.vt.edu/feng/pubs/talks/lin-pp06-datahandling_talk.pdf · 3 5 BLAST: At the Core of Sequence DB Search Widely

1

1

Efficient Data Handling in Large-ScaleSequence Database Searches

H. Lin†, X. Ma†, W. Feng*,A. Geist‡, and N. Samatova‡

†NCSU *Virginia Tech ‡ORNL

2

Outline

Sequence database search Parallel BLAST background mpiBLAST & pioBLAST New release: mpiBLAST-pio GreenGene: search NT against NT practice

(SC|05, StorCloud Demo)

Page 2: Efficient Data Handling in Large-Scale Sequence Database ...people.cs.vt.edu/feng/pubs/talks/lin-pp06-datahandling_talk.pdf · 3 5 BLAST: At the Core of Sequence DB Search Widely

2

3

Sequence Database Search is Criticalfor Biomedical Science

Routinely used in biomedical research Search similarities between query sequences and sequence

database Predict structures and functions of new sequences

Analogous to web search engines (e.g. Google)

Score (Similarity)Closeness & rankSorted by

DB sequences similar tothe query

Related web pagesOutput

Known sequence databaseInternetSearch spaceQuery sequence(s)Keyword(s)Input

Sequence DB SearchWeb Search Engine

4

Challenge for Sequence DB Search

Sequence databases aregrowing exponentially in size

Sequence DB Search is Hampered by the Growing Gapbetween Sequence Growth and Processor Memory

Because of this gap: there is a lot of repeated I/O introduced by loadingsequence data back and forth from the file system to the memory. Thisadversely affects the performance.

Page 3: Efficient Data Handling in Large-Scale Sequence Database ...people.cs.vt.edu/feng/pubs/talks/lin-pp06-datahandling_talk.pdf · 3 5 BLAST: At the Core of Sequence DB Search Widely

3

5

BLAST: At the Core of Sequence DBSearch

Widely used search tool: Approximately 75%-90% of all compute cycles in life

sciences are devoted to BLAST searches But, it is:

Computationally demanding, O(n2) Requires huge database to be stored in memory Generates gigabytes of output file for large database

searches Parallel BLAST as a means to address

computational challenge

6

BLAST Parallelization:Query Segmentation

>gi|3123744|dbj|AB013447.1|AB013447TTGGTATCCACGGAAGAGAGAGAAAATGTTGGGAATTTTCAGCGGACGTATAGTATCATTGCCGGAAGAGCTGGTGGCTGCCGGGAACC

>gi|221778|dbj|D00026.1|HS2HSV2P4GGAGGGTGGCTGGTGGGTATTGGCGGCCCGACCGATCTGCCCCGACCGACGGCTCCTGCCACCCGAACATG

>gi|7328961|dbj|AB032155.1|AB032154S2TTTTTTTCTTGATGCTGAAATCTATCCAAACATCACCAGTCCTCACGAGTCCTTGACCAAATTCTTGCTTTCTGGCACAATCTGAAGCCCAAAGGC

Database

>Perilla Frutescens CDS 0001TTGGTATCCACGGAAGAGAGAGAAAATGTTGGGAATTTTCAGCGGACGTATAGTATCATTGCCGGAAGAGCTGGTGGCTGCCGGGAACC

>Perilla Frutescens CDS 0002GGAGGGTGGCTGGTGGGTATTGGCGGCCCGACCGATCTGCCCCGACCGACGGCTCCTGCCACCCGAACATGTGATAGAAAGGAQQQQQQQQ

>Perilla Frutescens CDS 0003TTTTTTTCTTGATGCTGAAATCTATCCAAACATCACCAGTCCTCACGAGTCCTTGACCAAATTCTTGCTTTCTGGCACAATCTGAAGCCCAAAGGC

Queries

Worker Nodes

W. Feng at. el. “mpiBLAST on the GreenGene DistributedSupercomputer”, SC|05

Page 4: Efficient Data Handling in Large-Scale Sequence Database ...people.cs.vt.edu/feng/pubs/talks/lin-pp06-datahandling_talk.pdf · 3 5 BLAST: At the Core of Sequence DB Search Widely

4

7

Pros and Cons of Query Segmentation

Advantages Low parallelization overhead Linear speedup when database fits into single

processor memory Disadvantages

Suffers repeated I/O when database cannot fit intomain memory

Resource under-utilization / load imbalance when#queries smaller than or comparable to #processors

8

BLAST Parallelization:Database Segmentation

>gi|3123744|dbj|AB013447.1|AB013447TTGGTATCCACGGAAGAGAGAGAAAATGTTGGGAATTTTCAGCGGACGTATAGTATCATTGCCGGAAGAGCTGGTGGCTGCCGGGAACC

>gi|221778|dbj|D00026.1|HS2HSV2P4GGAGGGTGGCTGGTGGGTATTGGCGGCCCGACCGATCTGCCCCGACCGACGGCTCCTGCCACCCGAACATG

>gi|7328961|dbj|AB032155.1|AB032154S2TTTTTTTCTTGATGCTGAAATCTATCCAAACATCACCAGTCCTCACGAGTCCTTGACCAAATTCTTGCTTTCTGGCACAATCTGAAGCCCAAAGGC

Database

>Perilla Frutescens CDS 0001TTGGTATCCACGGAAGAGAGAGAAAATGTTGGGAATTTTCAGCGGACGTATAGTATCATTGCCGGAAGAGCTGGTGGCTGCCGGGAACC

>Perilla Frutescens CDS 0002GGAGGGTGGCTGGTGGGTATTGGCGGCCCGACCGATCTGCCCCGACCGACGGCTCCTGCCACCCGAACATGTGATAGAAAGGAQQQQQQQQ

>Perilla Frutescens CDS 0003TTTTTTTCTTGATGCTGAAATCTATCCAAACATCACCAGTCCTCACGAGTCCTTGACCAAATTCTTGCTTTCTGGCACAATCTGAAGCCCAAAGGC

Worker NodesQueries

W. Feng at. el. “mpiBLAST on the GreenGene DistributedSupercomputer”, SC|05

Page 5: Efficient Data Handling in Large-Scale Sequence Database ...people.cs.vt.edu/feng/pubs/talks/lin-pp06-datahandling_talk.pdf · 3 5 BLAST: At the Core of Sequence DB Search Widely

5

9

Pros and Cons of DatabaseSegmentation

Advantages Fitting large database into aggregate memory Able to utilize large machines regardless of

#queries Disadvantages

Higher parallel search overhead, local results needto be merged globally

Challenge Reduce result merging & processing overhead

10

mpiBLAST: A Specific Implementationof Database Segmentation

Open-source parallel BLAST developed atLANL: http://mpiblast.lanl.gov or http://www.mpiblast.org

Increasingly popular: more than 10,000downloads in less than 2 years

Integrated with NCBI BLAST Based on database segmentation Performance

Achieves super linear speedup when using small #processors

Problem: overhead in data handling limits scalability

Page 6: Efficient Data Handling in Large-Scale Sequence Database ...people.cs.vt.edu/feng/pubs/talks/lin-pp06-datahandling_talk.pdf · 3 5 BLAST: At the Core of Sequence DB Search Widely

6

11

mpiBLAST System Design

Master-slave model: one master, p-1 workers Searching done in workers

Search all queries against a subset of DB frags Generate partial results – meta data of alignments

(ASN.1 format, include seq id, scores, etc.) Output processing done in master

Merge partial results from all workers Fetch correspondent result sequence data Compute and output alignments

12

mpiBLAST 1.2 Input Databases partitioned statically before search

Inflexible execution time sensitive to # fragments re-partitioning required to use different # procs

Management overhead generating large number of small files, hard to manage, migrate and

share

0

500

1000

1500

2000

2500

3000

3500

4000

4500

0 50 100 150 200

Number of Fragments

To

tal

Exe

cuti

on

Tim

e

# Fragments sensitivity test

- Search 150k queries against nrdatabase

- Using 32 processors

Execution Time Vs. # Fragment

Page 7: Efficient Data Handling in Large-Scale Sequence Database ...people.cs.vt.edu/feng/pubs/talks/lin-pp06-datahandling_talk.pdf · 3 5 BLAST: At the Core of Sequence DB Search Widely

7

13

mpiBLAST 1.2 Output

Master

result 1result 2result 3

….

Alignment1 Alignment2 Alignment3

Worker1

Worker2

Worker3

Seq idSeq data

Serialized by the masterGlobal output file

Master must cache all results

result 1

result 2

result 3

DB Frag

DB Frag

DB Frag

Seq data sent over network

14

mpiBLAST 1.2 Scalability

Consequence of inefficient data handling:rapidly growing non-search overhead as No. of procs increases Output data size increases

0

500

1000

1500

2000

2500

3000

3500

4000

4500

4 8 16 32 64

Number of Processors

Tim

e (S

econ

ds)

Other time

Search time

Execution Time vs. # Procs

-Search 150k queries against nr

- Vary number of processors

- Database evenly partitionedaccording to # processors

Page 8: Efficient Data Handling in Large-Scale Sequence Database ...people.cs.vt.edu/feng/pubs/talks/lin-pp06-datahandling_talk.pdf · 3 5 BLAST: At the Core of Sequence DB Search Widely

8

15

pioBLAST

Research prototype of efficient parallel BLASTdeveloped at ORNL & NCSU

Built on top of mpiBLAST1.2 Apply parallel/collective I/O techniques

Enable dynamic partitioning Parallel database input and result output

Highly efficient result processing Workers compute alignments in parallel Workers buffer and write local output in parallel Enhanced worker-master communication for reducing

data transfer volume

16

Dynamic Partitioning of pioBLAST No pre-partitioning

One single database image to search against Virtual fragments generated dynamically at run time

Workers read inputs in parallel with MPI-IO interface Fragment size configurable at run time

Easily supports dynamic load balancing

Worker1

Frag1

Frag1 Frag2

Global Sequence Data

Worker2

Frag2

Worker3

Frag3

Worker n

FragN

Frag3 FragN

Page 9: Efficient Data Handling in Large-Scale Sequence Database ...people.cs.vt.edu/feng/pubs/talks/lin-pp06-datahandling_talk.pdf · 3 5 BLAST: At the Core of Sequence DB Search Widely

9

17

Output Processing of pioBLAST

Master

Worker1result 1.1result 1.2result 1.3

1.1 1.2 1.3

Worker2result 2.1result 2.2result 2.3

2.1 2.2 2.3

Worker3result 3.1result 3.2result 3.3

3.1 3.2 3.3

1.1 1.22.1 2.2 2.33.1 3.2

scores, sizes output offsets

Global output file

Reducecommunication

DB Frag DB Frag DB FragProcessing and

caching results inparallel

Collective writing: I/Oin parallel

18

mpiBLAST 1.2 vs. pioBLAST:Node Scalability

Platform: SGI Altix at ORNL 256 processors (1.5GHz Itanium2), 8GB memory/proc, XFS

Database: nr (1GB) Node scalability

mpiBLAST: non-search overhead increases fast pioBLAST: non-search time remains low

0

500

1000

1500

2000

2500

3000

3500

4000

4500

mpi-4

pio

-4

mpi-8

pio

-8

mpi-16

pio

-16

mpi-32

pio

-32

mpi-64

pio

-64

Program-No. of processes

Execu

tio

n t

ime (

s)

Search time Other time

Search 150k NR queries on different #procs

Page 10: Efficient Data Handling in Large-Scale Sequence Database ...people.cs.vt.edu/feng/pubs/talks/lin-pp06-datahandling_talk.pdf · 3 5 BLAST: At the Core of Sequence DB Search Widely

10

19

mpiBLAST 1.2 vs. pioBLAST:Output Scalability

Same platform and database Varied query size to generate different output size

0

500

1000

1500

2000

2500

3000

3500

4000

mpi-1

1M

pio-1

1M

mpi-4

7M

pio-4

7M

mpi-9

6M

pio-9

6M

mpi-1

53M

pio-1

53M

Program-Output Size

Exe

cuti

on

Tim

e (s

)

Search Other

Search different #query seqs on 64 procs

20

mpiBLAST Evolves: v1.4

Exact e-value statistics Improved result processing

Reduce worker-master communication by packingpartial biosequences along with ASN.1 results

Alleviate master bottleneck with query pipe-lining Not ready for the large DB search

Output processing still serialized Partial results and result sequences data for a single

query could be huge (gigabytes) Performance

Efficient in handling queries with small output Hang or perform slow for queries with large output

Page 11: Efficient Data Handling in Large-Scale Sequence Database ...people.cs.vt.edu/feng/pubs/talks/lin-pp06-datahandling_talk.pdf · 3 5 BLAST: At the Core of Sequence DB Search Widely

11

21

mpiBLAST + pioBLAST =mpiBLAST-pio

Highly efficient, open source parallel BLAST (availableat http://mpiblast.lanl.gov/)

Joint effort between mpiBLAST and pioBLASTresearch teams

Current release based on mpiBLAST 1.4 Exact e-value statistics Keep scheduling (query pipelining) and data distribution

Efficient parallel output processing from pioBLAST Worker compute and buffer local output in parallel Non-collective parallel write to better support query pipelining Modifications on NCBI BLAST less than 30 lines Support all but anchor output formats

22

mpiBLAST-pio Meet The GrandChallenge: searching NT vs. NT

SC|05 StorCloud demo (Nov. 13 - Nov. 17) Team

Institutions: LANL, NCSU, U. Utah, and Virginia Tech Vendors: Intel, Panta Systems, and Foundry Networks

Sequencing NT against itself (16GB raw size)

Why? Provide insightful knowledge to catalog NT database Demonstrate scalability of mpiBLAST(pio) to larger problem Meet the computation challenge with power of distributed

parallel computing How?

GreenGene Distributed Supercomputer > 3000 processors from 4 distributed sites of super computers

Page 12: Efficient Data Handling in Large-Scale Sequence Database ...people.cs.vt.edu/feng/pubs/talks/lin-pp06-datahandling_talk.pdf · 3 5 BLAST: At the Core of Sequence DB Search Widely

12

23

GreenGene Distributed Supercomputer

How?

Intel(Dupont)

SC2005Showroom

Floor

U.Utah

Va Tech

W. Feng et al., “mpiBLAST on the GreenGene Distributed Supercomputer”, SC|05

24

Combine query segmentation anddatabase segmentation

NT Replica NT Replica

GroupMaster GroupMaster GroupMaster

SuperMaster

NT Replica

Page 13: Efficient Data Handling in Large-Scale Sequence Database ...people.cs.vt.edu/feng/pubs/talks/lin-pp06-datahandling_talk.pdf · 3 5 BLAST: At the Core of Sequence DB Search Widely

13

25

Lessons Learned from NT vs NT Search

Results Finish 526,000 sequences (1/7 NT) in one day

Single supercomputer not enough Database segmentation is necessary to deal with

“Hard Queries” – parallelize the computation Case 1: 122k single query, take 64 procs 7 hours to

finish, 1.8G output size (448hrs on single processor) Case 2: 2M single query, not finished on 128 procs

within 12 hours mpiBLAST-pio demonstrate capability of

conducting large database against databasesequence alignment

26

Acknowledgements The work of pioBLAST was funded in part or in full

by the US Department of Energy’s Genomes to Lifeprogram under the ORNL-PNNL project,”Exploratory Data Intensive Computing for ComplexBiological Systems”.

The work of integrating data access optimizations ofpioBLAST into mpiBLAST-pio was supported throughLos Alamos National Laboratory contract W-7405-ENG-36.

Other mpiBLAST-pio development contributorsJeremy Archuleta (LANL), Avery Ching (Northwestern),Pavan Balaji (OSU)

Page 14: Efficient Data Handling in Large-Scale Sequence Database ...people.cs.vt.edu/feng/pubs/talks/lin-pp06-datahandling_talk.pdf · 3 5 BLAST: At the Core of Sequence DB Search Widely

14

27

References

“Efficient Data Access for Parallel BLAST,”19th Int’l Parallel & Distributed ProcessingSymp., April 2005.

“The Design, Implementation, and Evaluationof mpiBLAST” Best Paper: Applications Track,4th Int’l Conf. on Linux Clusters, Jun. 2003.

“mpiBLAST: Delivering Super-Linear Speedupwith an Open-Source Parallelization ofBLAST,” Pacific Symp. on Biocomputing, Jan.2003.