BaBar MC production BaBar MC production software Farm @ VU (Amsterdam University) A lot of computers EDG testbed (NIKHEF) Jobs Results The simple question: How can we run BaBar software on EDG grid sites?

Upload: gervais-gray

Post on 12-Jan-2016


Page 1: BaBar MC production

[Diagram labels: BaBar MC production software, Farm @ VU (Amsterdam University), A lot of computers, EDG testbed (NIKHEF), Jobs, Results.]

The simple question: How can we run BaBar software on EDG grid sites?

Page 2: Introduction of Parrot

[Diagram labels: Parrot, Chirp, BaBar MC production software, Farm @ VU (Amsterdam University), A lot of computers, EDG testbed (NIKHEF), Jobs, Results.]

We need transparent access to the Objectivity Database (requires local file access)

Page 3: Parrot functionality

[Diagram: BaBar MC production on top of the Parrot Virtual File System (POSIX interface, ptrace trap). Protocol drivers: HTTP, FTP, RFIO, NeST, Chirp. Other labels: Local Cache; Whole File I/O (get/put); Partial File I/O (open, close, read, write, lseek); HTTP Server; FTP Server; RFIO Server; NeST Server; Chirp Server; Condor Proxy; Secure Remote RPC; Condor Shadow; Traditional I/O Services; Integration with Castor; Allocation and Mgmt; Full UNIX Semantics; Integration with Condor; x509; Optimize; Not yet.]
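The layering on this slide can be pictured with a small Python sketch: a virtual file system that routes POSIX-style paths to per-protocol drivers, with whole-file drivers (get/put through a local cache) kept separate from partial-file drivers (read and seek against the remote side). This is not Parrot's code. The class names, the `/protocol/host/path` naming used here, and the in-memory dictionaries standing in for servers are all assumptions made for the illustration.

```python
import io


class WholeFileDriver:
    """Get/put style protocol (e.g. HTTP, FTP): fetch once, then serve from a local cache."""

    def __init__(self, remote_files):
        self.remote_files = remote_files  # hypothetical stand-in for a remote server
        self.cache = {}                   # local cache: path -> bytes
        self.fetches = 0

    def open(self, path):
        if path not in self.cache:
            self.cache[path] = self.remote_files[path]  # whole-file "get"
            self.fetches += 1
        return io.BytesIO(self.cache[path])


class RemoteHandle:
    """File handle whose reads are served by the remote side, one byte range at a time."""

    def __init__(self, driver, path):
        self.driver, self.path, self.offset = driver, path, 0

    def seek(self, offset):
        self.offset = offset

    def read(self, size):
        data = self.driver.remote_files[self.path][self.offset:self.offset + size]
        self.offset += len(data)
        self.driver.bytes_transferred += len(data)
        return data


class PartialFileDriver:
    """Protocol with remote open/read/lseek (e.g. RFIO, NeST, Chirp)."""

    def __init__(self, remote_files):
        self.remote_files = remote_files  # hypothetical stand-in for a remote server
        self.bytes_transferred = 0

    def open(self, path):
        return RemoteHandle(self, path)


class VirtualFileSystem:
    """Routes a path such as /chirp/host/dir/file to the driver mounted for 'chirp'."""

    def __init__(self):
        self.drivers = {}

    def mount(self, protocol, driver):
        self.drivers[protocol] = driver

    def open(self, path):
        _, protocol, rest = path.split("/", 2)
        return self.drivers[protocol].open(rest)


if __name__ == "__main__":
    vfs = VirtualFileSystem()
    chirp = PartialFileDriver({"server/db/events": b"0123456789"})
    vfs.mount("chirp", chirp)
    handle = vfs.open("/chirp/server/db/events")
    handle.seek(4)
    print(handle.read(3))  # only the three requested bytes cross the "network"
```

The point of the split is that a database file read in small random chunks is better served by a partial-file driver, while small static files can be pulled once and cached.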

Page 4: The introduction of GCB

[Diagram labels: BaBar MC production software, Farm @ VU (Amsterdam University), Some computers, NFS, Chirp, Condor-G, Jobs, Results, GCB, Relay, Private network, Parrot, EDG testbed (NIKHEF), A lot of computers.]

Page 5: GCB functionality

[Diagram labels: GCB Server, Central Manager, A, B, P, Private network, Persistent connection, Relay, NAT.]
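The labels on this slide (persistent connection, relay, private network, NAT) suggest a brokering pattern that can be sketched in a few lines of Python: a node behind NAT dials out to a broker and keeps that connection open, and the broker forwards traffic for it. This is a toy in-memory model, not GCB's protocol or API. The `Broker` and `PrivateNode` classes and their methods are hypothetical names, and a callback stands in for the open connection.

```python
class Broker:
    """Toy connection broker that is reachable from both sides of the NAT."""

    def __init__(self):
        # node name -> callback standing in for an open outbound connection
        self.persistent = {}

    def register(self, node_name, deliver):
        # The private node initiates this connection, so the NAT allows it.
        self.persistent[node_name] = deliver

    def relay(self, sender, target, payload):
        # The sender cannot reach the target directly; the broker forwards
        # the payload over the connection the target opened earlier.
        if target not in self.persistent:
            raise ConnectionError(f"{target} has no persistent connection to the broker")
        return self.persistent[target](sender, payload)


class PrivateNode:
    """Node on a private network that registers itself with the broker at start-up."""

    def __init__(self, name, broker):
        self.name = name
        self.inbox = []
        broker.register(name, self.on_message)

    def on_message(self, sender, payload):
        self.inbox.append((sender, payload))
        return f"{self.name} received {len(payload)} bytes"


if __name__ == "__main__":
    broker = Broker()
    node_a = PrivateNode("A", broker)
    print(broker.relay("central-manager", "A", b"hello"))
```

Because every connection is opened from the private side, nothing in this model requires the NAT box to accept inbound traffic.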

Page 6: The introduction of GlideIn

72 hour jobs. Can’t wait for queues.

[Diagram labels: BaBar MC production software, Farm @ VU (Amsterdam University), Some computers, NFS, Queue, Batch job, Condor-G Job, GlideIn, Jobs, Results, GCB, Relay, Chirp, Parrot, PBS job manager, Private network, EDG testbed (NIKHEF), A lot of computers.]

Page 7: GlideIn functionality

[Diagram labels. Job Submission Machine: End User Requests, Condor-G Scheduler, Persistent Job Queue, Condor-G GridManager, GASS Server, Condor-G Collector, Condor Shadow Process for Job X. Job Execution Site: Globus Daemons + Local Site Scheduler [See Figure 1], Condor Daemons, Job X, Condor System Call Trapping & Checkpoint Library. Arrows: Job, Transfer Job X, Resource Information, Redirected System Call Data.]

Page 8: Overview of complete setup

72 hour jobs. Can’t wait for queues.

[Diagram labels: BaBar MC production software, Farm @ VU (Amsterdam University), Some computers, NFS, Queue, Batch job, Condor-G Job, GlideIn, Jobs, Results, GCB, Relay, Chirp, Parrot, PBS job manager, Private network, EDG testbed (NIKHEF), A lot of computers.]

Page 9: Leave only the components

[Diagram labels: BaBar MC production software, Farm @ VU (Amsterdam University), Some computers, NFS, Queue, GlideIn, GCB, Chirp, Parrot, PBS job manager, Private network (twice), EDG testbed (NIKHEF), A lot of computers.]

Page 10: The interesting dependencies

[Diagram labels: BaBar MC production software, Farm @ VU (Amsterdam University), Some computers, NFS, Queue, GlideIn, GCB, Chirp, Parrot, PBS job manager, Private network (twice), EDG testbed (NIKHEF), A lot of computers.]

NAT box
• Dropping UDP packets
• Timeout 2 minutes
• Inactive sockets
• Inactive file I/O

Different MDS scheme

Objectivity database
• LOCK server sockets
• NFS problems
• UID / hostname checks

Page 11: Consequences

• Different MDS scheme
  – Implemented EDG scheme for GlideIn

• Objectivity
  – A lot of debugging
  – Made Parrot mimic hostname and uid
  – Tricked Objectivity to use standard NFS libraries

• Aggressive NAT box
  – Changed GCB to use TCP instead of UDP
  – Used Parrot to keep sockets alive
  – Parrot recovers File I/O when TCP connection is lost

• We are the first to run Objectivity cross-domain
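The idea of recovering file I/O after a lost TCP connection can be sketched as follows: the client keeps the file offset itself, so after a reconnect it can replay the failed read at the same position and the application never sees the interruption. This is only an illustration of that general technique, not Parrot's implementation. `FlakyRemoteFile` is a hypothetical stand-in for a remote server whose connection drops once, and `RecoveringReader` and `max_retries` are names invented for the example.

```python
class FlakyRemoteFile:
    """Hypothetical remote file whose connection drops on one chosen read call."""

    def __init__(self, data, fail_on_read):
        self.data = data
        self.fail_on_read = fail_on_read
        self.reads = 0
        self.connects = 0

    def connect(self):
        self.connects += 1

    def read_at(self, offset, size):
        self.reads += 1
        if self.reads == self.fail_on_read:
            raise ConnectionError("TCP connection lost")
        return self.data[offset:offset + size]


class RecoveringReader:
    """Tracks the offset on the client side so a read can be replayed after reconnecting."""

    def __init__(self, remote, max_retries=3):
        self.remote = remote
        self.max_retries = max_retries
        self.offset = 0
        self.remote.connect()

    def read(self, size):
        for _ in range(self.max_retries + 1):
            try:
                data = self.remote.read_at(self.offset, size)
            except ConnectionError:
                self.remote.connect()  # reopen, then retry at the same offset
                continue
            self.offset += len(data)
            return data
        raise ConnectionError("giving up after repeated connection losses")


if __name__ == "__main__":
    remote = FlakyRemoteFile(b"abcdefgh", fail_on_read=2)
    reader = RecoveringReader(remote)
    print(reader.read(4), reader.read(4), remote.connects)
```

In the example the second read hits the dropped connection, the reader reconnects, and the caller still receives the bytes in order.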

Page 12: Performance

[Plot: time (minutes, axis ticks 500 to 3000) versus events (axis ticks 500 to 2000), comparing production on the local machine with production on the EDG testbed. Annotations: application initializes 10 times slower; production 3 times slower.]
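To see how the two slowdown factors combine, one can assume a simple linear cost model: a fixed initialization time plus a constant time per event. The slide does not state this model, so it is an assumption, and the numbers in the usage example are made up for illustration rather than measured values.

```python
def overall_slowdown(n_events, t_init, t_event, init_factor=10.0, prod_factor=3.0):
    """Ratio of grid run time to local run time under an assumed linear model.

    local: t_init + n_events * t_event
    grid:  init_factor * t_init + prod_factor * n_events * t_event
    """
    local = t_init + n_events * t_event
    grid = init_factor * t_init + prod_factor * n_events * t_event
    return grid / local


if __name__ == "__main__":
    # Hypothetical inputs: 10 minutes to initialize, 1 minute per event.
    for n in (0, 10, 1000):
        print(n, overall_slowdown(n, t_init=10.0, t_event=1.0))
```

Under this model the overall slowdown starts at the initialization factor for very short jobs and approaches the production factor as the number of events grows, which is one reason long jobs amortize the slow start-up.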

Page 13: Possible improvements

[Diagram labels: BaBar MC production software, Farm @ VU (Amsterdam University), Some computers, NFS, Queue, GlideIn, GCB, Chirp, Parrot, PBS job manager, Private network (twice), EDG testbed (NIKHEF), A lot of computers.]

Parrot: Caching
• On per directory basis
• Requires debugging

Create more sophisticated tool to acquire resources
• Resource planning, distribution, etc.
• Maybe something fancy already exists?

Page 14: Move chirp servers to private nodes

[Diagram labels: BaBar MC production software, Farm @ VU (Amsterdam University), Some computers, NFS, Queue, GlideIn, GCB, Chirp, Parrot, PBS job manager, Private network (twice), EDG testbed (NIKHEF), A lot of computers.]

Use Condor/GCB machinery for chirp server
• Solves security issues
• Allows chirp server to be on private nodes
• Requires new chirp-condor implementation

Page 15: Move GCB to head node

[Diagram labels: BaBar MC production software, Farm @ VU (Amsterdam University), Some computers, NFS, Queue, GlideIn, GCB, Chirp, Parrot, PBS job manager, Private network (twice), EDG testbed (NIKHEF), A lot of computers.]

Move GCB to same machine as Central Manager
• Solution required for port conflicts
• Temporary solution: Move CM to a private node

Page 16: Use EDG data storage

[Diagram labels: BaBar MC production software, Farm @ VU (Amsterdam University), Some computers, NFS, Queue, GlideIn, GCB, Chirp, Parrot, PBS job manager, Private network (twice), EDG testbed (NIKHEF), A lot of computers, EDG data storage.]

Write events to EDG data storage (gsiFTP)
• Requires debugging

Page 17: Use more sites

[Diagram labels: BaBar MC production software, Farm @ VU (Amsterdam University), Some computers, NFS, Queue, GlideIn, GCB, Chirp, Parrot, PBS job manager, Private network (three times), EDG testbed (NIKHEF), A lot of computers (twice), Other testbed, EDG data storage.]

Let GCB manage several private networks at the same time
• Requires solution for conflicting private addresses
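The address conflict arises because two sites can use the same private range, so a bare private IP no longer identifies one node. One conceivable way around it, shown here purely as an illustration and not as anything the slides or GCB prescribe, is to key every node by a (site, private address) pair. The `MultiSiteRegistry` class, the site names, and the addresses are all hypothetical.

```python
class MultiSiteRegistry:
    """Identifies nodes by (site, private IP) so equal IPs at different sites do not clash."""

    def __init__(self):
        self.nodes = {}  # (site, private_ip) -> node name

    def register(self, site, private_ip, node_name):
        key = (site, private_ip)
        if key in self.nodes:
            raise ValueError(f"{private_ip} is already registered at site {site}")
        self.nodes[key] = node_name

    def lookup(self, site, private_ip):
        return self.nodes[(site, private_ip)]


if __name__ == "__main__":
    registry = MultiSiteRegistry()
    # The same private address appears at two different (hypothetical) sites.
    registry.register("nikhef", "192.168.0.10", "worker-a")
    registry.register("other-testbed", "192.168.0.10", "worker-b")
    print(registry.lookup("other-testbed", "192.168.0.10"))
```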

Page 18: Conclusions

• It works
  – BaBar MC production runs successfully on NIKHEF EDG testbed
  – All this experimental software actually works when used together

• It looks easy
  – Our GRID setup is complicated, but…
  – Parrot hides problems related to local file access
  – GCB hides problems related to network configurations
  – GlideIn hides complications with resource gathering
  – The user can just submit his/her jobs to a local batch system

• There is some work to do
  – Performance could be better
    • Initialization 10 times slower
    • Production 3 times slower
  – Caching and (semi-) local event storage should improve this
  – Usability could be improved
    • GlideIn should have a tool to acquire resources
    • Several improvements proposed for GCB/Parrot

• The improvements are done at the level of the “grid” tools
  – The user benefits without rewriting code