
Page 1: Grid parallelization and tests


Grid parallelization and tests

CERN

GRACE Final Review, Amsterdam, 15-16 February 2005

Page 2: Contents

1. Two GRACE Grid integration models: M1, M2

2. Pre-conditions for the tests

3. Work performed

4. General test results

5. Model 1 test results

6. Simulation of Model 2

7. Model 2 tests results

8. Comparison

9. Conclusions


Page 3: Application workflow

[Workflow diagrams: the application workflow and the single-search Grid workflow for models M1 and M2]

Approach used: M1 and M2

Page 4: Pre-conditions

• Content and Categorization Engines release 4.45 was adopted; these components were later improved and optimized by the partners

• A suitable testing corpus of documents was selected (English documents, correct PDF-to-text conversion, small and large sizes)

• Configuration problems of the GILDA Replica Manager were solved (with the intervention of the site administrators)

• The search result set size is assumed to average between 0.1 and 4 MB of text

• Use of the DAG job model in GILDA was discarded

Page 5: Work performed

• Preparation of a test plan and report template

• Creation of the testing corpus of documents

• Verification of testing pre-conditions

• Creation of test scripts for semi-automatic testing (a sketch follows at the end of this slide)

• Testing on the GILDA testbed

• Creation of scripts for validation of output and parsing of logging

• Collection and analysis of the results

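The deck does not include the test scripts themselves. As a rough illustration of what a semi-automatic submission driver could look like, here is a minimal Python sketch. It assumes a UI machine with the EDG command-line tools that GILDA exposed at the time (edg-job-submit, edg-job-status); the JDL file name is hypothetical, and only bare invocations are used since exact options varied by middleware release.

# Minimal sketch of a semi-automatic test driver (illustration only, not the
# actual GRACE scripts). Assumes the EDG UI tools are installed on the path.
import subprocess

def submit(jdl_path: str) -> str:
    """Submit one job and return the job ID printed by edg-job-submit."""
    out = subprocess.run(["edg-job-submit", jdl_path],
                         capture_output=True, text=True, check=True).stdout
    # The job identifier is the https://... contact string in the output.
    return next(line.strip() for line in out.splitlines()
                if line.strip().startswith("https://"))

def status(job_id: str) -> str:
    """Query the job status (an F2-style check) and return the raw report."""
    return subprocess.run(["edg-job-status", job_id],
                          capture_output=True, text=True).stdout

job_id = submit("grace_test_0.5MB.jdl")   # hypothetical JDL file name
print(status(job_id))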

Page 6: Testing: job submission

                                 M1    M2     General   Total
Total number of jobs submitted   58    727    395       1180

• General tests (RM, RB, functional, etc.) started in October 2004
• Main testing period: November 2004
• More than 1000 jobs were submitted

[Chart: share of submitted jobs by type — general tests, Model 1, Model 2]

Page 7: Variable parameters

V1 Input data size

V1 ID   0       1       2       3       4       5       6
Size    0.1 MB  0.5 MB  1.0 MB  1.5 MB  2.0 MB  3.0 MB  4.0 MB

V2 Worker node specifications

V2 ID        Specifications             Comment
0 / "Spec1"  PIV 2.4 GHz, 512 MB RAM    The fastest machine in the GILDA testbed
1 / "Spec2"  PIII 800 MHz, 1 GB RAM     The slowest machine in the GILDA testbed
2 / "Spec3"  PIII 1000 MHz, 2 GB RAM    The most common machine in the GILDA testbed

V3 Number of parallel jobs

V3 ID   0  1  2  3  4  5  6  7  8  9  10  11  12  13
JobsN   1  2  3  4  5  6  7  8  9  10 11  12  14  16
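For a sense of scale, here is a small Python sketch enumerating the full test matrix these parameters span. It assumes, for illustration only, that every V1 × V2 × V3 combination is a candidate test point; the deck does not state which subset was actually run.

# Hypothetical enumeration of the V1 x V2 x V3 test matrix (illustration only).
from itertools import product

SIZES_MB = [0.1, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0]             # V1: input data size
SPECS = ["Spec1", "Spec2", "Spec3"]                         # V2: worker node spec
N_JOBS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 16]   # V3: parallel jobs

matrix = list(product(SIZES_MB, SPECS, N_JOBS))
print(len(matrix), "candidate combinations")                # -> 294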

Page 8: Graphs

G1 Total execution time: execution time (P4) as a function of input data size (V1) on worker nodes with different specifications (V2)

G2 Detailed execution time: execution time (P4) as a function of input data size (V1), split into text normalization and categorization; V2 is fixed

G3 Output size: output size (P9) as a function of input data size (V1)

G4 UI waiting time: UI waiting time (P7) as a function of the number of sub-jobs (V3) with fixed input data size (V1)

G5 Spent computing time: spent computing time (P8) as a function of the number of sub-jobs (V3) with fixed input size (V1)

G6 Optimal number of jobs: optimal number of jobs (FN1, FN2) as a function of the input size (V1)

G7 Optimal UI waiting time: UI waiting time (P7) as a function of the input size (V1) when applying the optimal splitting (FN1, FN2)

G8 Optimal spent computing time: spent computing time (P8) as a function of the input size (V1) when applying the optimal splitting (FN1, FN2)

Page 9: Results

[Charts (results overview):
• M1 - G1: execution time (hours) vs. input size (MB) for Spec1, Spec2, Spec3
• M1 - G2: execution time (hours) vs. input size (MB), split into normalization and categorization
• M1 - G3: output size (KB) vs. input size (MB) for categories, OutputSandbox (compressed), and index files
• Jobs per day (triggered, probably executed later): number of jobs per day, 5-26 November, for M1 jobs, M2 jobs, and general jobs
• M2 - G2 (V2=Spec3, TT3B): time (hours) vs. input size (MB) for ContentEngine and CategorizationEngine
• M2 - G6 (V2=Spec3, TT3B): optimal number of jobs vs. input size (MB)
• M2 - G7: UI waiting time (hours) vs. input size (MB) for V2=Spec2, V2=Spec3, V2=*
• M2: spent computing time (hours) with the optimal number of jobs vs. input size (MB), for V2=Spec2, V2=Spec3, V2=*
• Comparing P8 (M1 vs. M2, V2=Spec3): spent computing time (hours) vs. input size (MB)]

Results: collected and published in a study and test report

Page 10: General tests

Page 11: Functional tests

F1 Job submission: submission of a job to the Grid

F2 Job status check: status checking while the job is running

F3 Results retrieval: retrieving the output sandbox after successful execution

F4 Results validation: validate that the results are complete; the output files (indexes, NDF, categories) exist and are not empty (a sketch follows at the end of this slide)

F5 Error testing: testing that error conditions return the proper error messages: input data not available, GRACE application not available, ContentEngine failure, CategorizationEngine failure

The functional tests were successful. Problems related to the configuration of Grid nodes were encountered and fixed:

• RB configuration problems
• RM/SE configuration problems
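As an illustration of the F4 check, here is a minimal Python sketch. The expected file names are hypothetical stand-ins; the deck only says that the indexes, NDF, and categories outputs must exist and be non-empty.

# Hypothetical F4-style results validation (illustration only; file names
# are stand-ins for the real index, NDF, and categories outputs).
import os

def validate_results(output_dir: str,
                     expected=("index.dat", "ndf.out", "categories.out")) -> bool:
    """Results are complete if every expected output file exists and is non-empty."""
    paths = (os.path.join(output_dir, name) for name in expected)
    return all(os.path.isfile(p) and os.path.getsize(p) > 0 for p in paths)

print(validate_results("./job_output"))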

Page 12: Performance tests (I)

                          M1 [sec]        M2 [sec]

P1 Job submission time    66.9 ± 23.9     34.9 ± 7.3
P2 Job brokering time     28.5 ± 5.1      26.6 ± 3.9
P3 Job queuing time       72.5 ± 19.0     68.6 ± 21.8    (on empty queues)
P4 Job execution time     0.69 + 7.88·I   see graphs     (variable; depends on input data size and on GRACE performance; I = input size in MB, time in hours)
P5 Job retrieving time    18.1 ± 5.9      17.6 ± 1.4     (depends on output data size)

Average Grid overhead     3.1 min         2.5 min
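To make the overhead figure concrete, here is a small Python sketch (my own illustration, not from the deck) that recomputes the average M1 Grid overhead from the P1, P2, P3, and P5 means above and evaluates the P4 fit.

# Reconstructing the average M1 Grid overhead and the P4 execution-time fit
# from the table above (means only; the +/- spreads are ignored).
P1_SUBMIT, P2_BROKER, P3_QUEUE, P5_RETRIEVE = 66.9, 28.5, 72.5, 18.1  # seconds

def m1_execution_hours(input_mb: float) -> float:
    """P4 fit for Model 1: execution time in hours, I = input size in MB."""
    return 0.69 + 7.88 * input_mb

overhead_s = P1_SUBMIT + P2_BROKER + P3_QUEUE + P5_RETRIEVE
print(f"Average Grid overhead: {overhead_s / 60:.1f} min")   # -> 3.1 min
print(f"P4 at 2 MB: {m1_execution_hours(2.0):.2f} h")        # -> 16.45 h

The overhead sum (186 s ≈ 3.1 min) matches the table, and it is small compared with execution times that run into hours, which is why parallelizing the execution pays off despite the per-job overhead.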

Page 13: Grid overhead

Grid overhead is about 3 minutes on average.

[Chart: breakdown of the Grid overhead into submission, brokering, queuing, and retrieving]

Page 14: Performance tests (II)

                       M1                            M2

P6 Job failure rate    19.0 %                        15.3 %
                       (11 failed out of 58 jobs)    (22 failed out of 144 jobs)

The failed jobs were aborted at the broker. Failures due to one CE which broke are not counted here.

We identified the main cause of failure as misbehavior of the resource broker (RB), which needed re-initialization (performed by the GILDA team). After re-initialization, 23 jobs were executed, all successfully.

The Grid performed well: job success rate > 80%.

Page 15: Model 1

Page 16: M1 performance (execution time vs. input size)

[Chart: execution time (hours) vs. input size (MB) for Spec1, Spec2, and Spec3, split into normalization and categorization]

Tests were performed on machines with different specifications.

The normalization job is the most demanding.

Page 17: Model 2

Page 18: M2 description

• Search results are split outside the Grid

• Grid parallel jobs execute text normalization

• Jobs are monitored for status

• Results are stored on the Grid (Replica Manager)

• The Grid categorization job executes:
  – merging of the normalized documents from the SEs
  – categorization processing

• The job is monitored and the results are retrieved
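Expressed as a runnable Python sketch, the M2 flow looks roughly like this. Every function below is a local stub standing in for the GRACE scripts and Grid middleware (job submission, Replica Manager registration); none of them are real GILDA APIs.

# Illustrative sketch of the M2 flow; all functions are local stubs, not
# real Grid middleware calls.
from dataclasses import dataclass

@dataclass
class Job:
    kind: str
    payload: object
    output: str = ""

def split_results(text: str, n_jobs: int) -> list[str]:
    """Split the search result set into n_jobs chunks outside the Grid."""
    step = max(1, -(-len(text) // n_jobs))   # ceiling division
    return [text[i:i + step] for i in range(0, len(text), step)]

def submit_job(kind: str, payload) -> Job:
    """Stub for Grid job submission and execution."""
    job = Job(kind, payload)
    job.output = f"{kind} done"               # stand-in for the real job output
    return job

def run_m2(search_results: str, n_jobs: int) -> str:
    chunks = split_results(search_results, n_jobs)            # outside the Grid
    norm_jobs = [submit_job("normalize", c) for c in chunks]  # parallel jobs
    replicas = [j.output for j in norm_jobs]                  # stored via the RM
    cat_job = submit_job("merge+categorize", replicas)        # single Grid job
    return cat_job.output                                     # retrieved at the UI

print(run_m2("a small corpus of search results", n_jobs=3))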

Page 19: M2 simulation

Kopt: the ideal optimal splitting number
– assumes an infinite-worker-node Grid where any splitting is possible
– the function minimizes the UI waiting time, with a resource-saving parameter α

Keff: the real optimal splitting number
– Kopt considering the constraints: available worker nodes, input data file size, and the splitting sequence

[Chart: UI waiting time and computing time as functions of the number of jobs; the optima Kopt1 and Kopt2 are marked, shifted by α, with the increase due to job submission overhead visible]
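The deck does not give the cost function itself, so the following Python sketch is an assumption-laden illustration of the Kopt/Keff idea: splitting into k jobs divides the computing work but adds per-job Grid overhead, and α penalizes occupying extra worker nodes. The overhead and α values below are made up; the splitting sequence is the V3 sequence from the test parameters.

# Hypothetical model of the optimal splitting number (not the actual GRACE
# cost function): exec time shrinks with k; overhead and alpha grow with k.
def ui_waiting_time(k: int, input_mb: float,
                    overhead_h: float = 0.05, alpha: float = 0.1) -> float:
    """Modelled UI waiting time (hours) for k parallel normalization jobs."""
    exec_h = 0.69 + 7.88 * (input_mb / k)   # P4-style linear fit per sub-job
    return exec_h + k * overhead_h + alpha * k

def k_opt(input_mb: float, k_max: int = 64) -> int:
    """Ideal optimum: infinite worker nodes, any splitting possible."""
    return min(range(1, k_max + 1), key=lambda k: ui_waiting_time(k, input_mb))

def k_eff(input_mb: float, available_nodes: int,
          splitting_sequence=(1, 2, 3, 4, 5, 6, 7, 8,
                              9, 10, 11, 12, 14, 16)) -> int:
    """Real optimum: Kopt restricted to the available worker nodes and the
    splitting sequence supported by the setup."""
    feasible = [k for k in splitting_sequence if k <= available_nodes]
    return min(feasible, key=lambda k: ui_waiting_time(k, input_mb))

print(k_opt(2.0), k_eff(2.0, available_nodes=10))   # -> 10 10 under this model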

Page 20: M2 performances

[Charts: execution time vs. number of parallel jobs (input size = 2 MB), with the Grid overhead and UI waiting time marked; execution time vs. input size (splitting parameter = 9), split into normalization and categorization]

Page 21: Comparison M1 and M2

[Charts: execution time vs. input size and computing time vs. input size, each for Model 1 and Model 2]

Page 22: Conclusions

• The Grid performed well: low failure rate, prompt responses from the Grid administrators to problems, and good coordination with the GILDA team

• Parallelization proved to improve application performance and lower the query failure rate