Roadmap to AliEn v2-20
Experiment Support
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
DBES
A. Abramyan, L. Betev, D. Goyal, A. Grigoras,
C. Grigoras, M. Litmaath, N. Manukyan,
M. Martinez, J. Porter, P. Saiz,
S. Sankar, S. Schreiner
Roadmap to AliEn v2-20
9 Mar 2012, Pablo Saiz, ALICE offline week
What’s new
• Plenty of new improvements
  – Catalogue simplification
  – Client UI
  – Extreme Job Brokering
  – Removal of PackMan
  – New JDL fields
  – Proxy renewal
  – Job Memory checkup
• And baseline for new development
Catalogue Simplification
• Up to now, catalogue divided in multiple DB:
  – Simplifies scalability
  – Logic slightly more complicated
• Changing username/userid
  – Smaller tables
Thanks Dushyant, Subho
PackMan
• Removing the PackMan/PackManMaster services
• Functionality stays in client UI/JA
  – JA can install packages directly
  – Very powerful if combined with torrent
• Speeds up most of the PackMan operations
Thanks Narine, Armenuhi
New JDL fields
• MaxWaitingTime: amount of time that a job can stay in ‘WAITING’
  – If the time is exceeded, the job ends up in error
  – New state: ERROR_EW (Expired Waiting)
• Retrial:
  – Number of times that a single job can be resubmitted
  – Resubmission done by central services
• Reusing JobId in resubmission
• Direct removal of KILLED jobs
Thanks Miguel
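The behaviour of the two new fields can be sketched in a few lines of Python. This is only an illustration, not AliEn code: the `Job` class, the function names, the use of seconds for MaxWaitingTime, and the rule that any `ERROR*` state may be resubmitted are all assumptions made for the sketch. Only the field names, the WAITING state and ERROR_EW come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; attribute names are illustrative only."""
    job_id: int
    max_waiting_time: int      # assumed to be seconds (JDL MaxWaitingTime)
    retrial: int               # resubmissions still allowed (JDL Retrial)
    status: str = "WAITING"
    waiting_since: float = 0.0


def check_waiting(job: Job, now: float) -> str:
    """Move a job to ERROR_EW once it has waited longer than MaxWaitingTime."""
    if job.status == "WAITING" and now - job.waiting_since > job.max_waiting_time:
        job.status = "ERROR_EW"
    return job.status


def resubmit_if_allowed(job: Job, now: float) -> bool:
    """Central-services side: requeue a failed job, reusing the same job id.

    Assumption: any state starting with ERROR is eligible for resubmission.
    """
    if job.status.startswith("ERROR") and job.retrial > 0:
        job.retrial -= 1
        job.status = "WAITING"
        job.waiting_since = now
        return True
    return False


job = Job(job_id=42, max_waiting_time=3600, retrial=1)
state_early = check_waiting(job, now=1800)         # still within the limit
state_late = check_waiting(job, now=7200)          # limit exceeded
resubmitted = resubmit_if_allowed(job, now=7200)   # one retrial available
id_after_resubmission = job.job_id                 # the id is reused
check_waiting(job, now=7200 + 4000)                # expires a second time
resubmitted_again = resubmit_if_allowed(job, now=7200 + 4000)
```

In this sketch the job keeps its id across the resubmission, and the second expiry is final because the Retrial budget is used up.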
Extreme Brokering
• Postpone splitting of job until last moment
• Decide data to be analyzed based on current location of JA & files not analyzed yet
• Can define Max/Min number of files to be analyzed
  – Even if the files are not local
• Fewer subjobs:
  – Easier merging
Thanks Pablo
Current situation
[Figure: JobManager splitting a job into subjobs, shown for three cases]
• Works nicely if one replica per file
• A bit more complex with 3 SE and 2 replicas
• And a lot more with 50 SE and 3 replicas
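One way to read the growth shown in the three diagrams, offered as an interpretation rather than something the slides state: if subjobs are grouped by the set of SEs holding a file's replicas, the number of possible groups is the number of ways to choose those SEs. The function name below is hypothetical.

```python
from math import comb


def possible_replica_sets(num_se: int, replicas_per_file: int) -> int:
    """Count the distinct sets of SEs that could hold the replicas of one file."""
    return comb(num_se, replicas_per_file)


few = possible_replica_sets(3, 2)     # 3 SE, 2 replicas per file
many = possible_replica_sets(50, 3)   # 50 SE, 3 replicas per file
print(few, many)
```

Under that reading, 3 SE with 2 replicas allow 3 distinct location sets, while 50 SE with 3 replicas allow 19600.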
Example
[Figure: File 1 to File 5 with replicas spread over Site A, Site B and Site C]
Current schema
• Submit 4 jobs (covering File 1, File 4, File 2, File 3, File 5)
Broker per file
• Submit 3 empty subjobs
• When a job starts, analyze as much as possible (in the figure: File 1, 2, 4, 5 for one subjob, File 3 for another)
• If nothing left, just exit
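The two schemas can be compared with a small Python sketch. The replica layout below is hypothetical, since the one in the figure is not preserved here. It was chosen so that the outcome matches the numbers on the slide. The function names and the order in which the job agents start are assumptions too, and this is not the AliEn broker itself.

```python
from collections import defaultdict

# Hypothetical layout: file name -> sites holding a replica.
REPLICAS = {
    "File1": {"A"},
    "File2": {"A", "C"},
    "File3": {"B"},
    "File4": {"A", "B"},
    "File5": {"A", "C"},
}


def split_by_location(replicas):
    """Current schema: one subjob per distinct set of replica sites."""
    groups = defaultdict(list)
    for name in sorted(replicas):
        groups[frozenset(replicas[name])].append(name)
    return list(groups.values())


def broker_per_file(replicas, start_order):
    """Late binding: each starting job claims every local file not yet analyzed."""
    pending = set(replicas)
    assignments = []
    for site in start_order:
        claimed = sorted(f for f in pending if site in replicas[f])
        pending -= set(claimed)
        assignments.append((site, claimed))  # empty list: the subjob just exits
    return assignments


static_jobs = split_by_location(REPLICAS)
late_jobs = broker_per_file(REPLICAS, ["A", "B", "C"])

print(len(static_jobs), "subjobs with the current schema")
for site, files in late_jobs:
    print("Site", site, files if files else "nothing left, exit")
```

With this layout the current schema needs 4 subjobs. With per-file brokering, the job starting at Site A takes File 1, 2, 4 and 5, the one at Site B takes File 3, and the one at Site C finds nothing left and exits.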
Proxy renewal system
• Replaces vobox-proxy-renewal service
• Can receive ‘validity’ or proxies
  – Simplifies CREAM-CE job submission
• No corruption of proxies
• Can be started by non-root user
• Already deployed at CERN
  – And for some CMS sites…
• Can already be deployed
Thanks Maarten
New development
• More than 1 year since last major update
• Some backward incompatible changes
  – Change of catalogue schema
• What to do with new requests, bugs:
  – Debug current system?
  – Debug in new version?
  – Both!
AliEn deployment for ALICE
[Figure: AliEn deployment for ALICE, with the AliEn version running at each level. Components shown: catalogue, Central Services (TaskQueue, Transfers, LDAP), Api, aliensh, ROOT, vobox, JA, BACKUP]
• 3 machines (+1 slave, backups): AliEn v2-17
• 12 machines: AliEn v2-19**, v2-17
• 8 machines: AliEn v2-19**
• 80 sites: AliEn v2-19.(80-163)
• 40.000 wn: AliEn v2-19.(80-163)
How to test new versions…
• Build system:
  – Multiple platforms
  – Integration & basic functionality tests
    • No API/access from ROOT tests
  – Similar to the AliROOT, ROOT build systems
  – Running the whole system on a single machine
  – http://alienbuild.cern.ch:8888
Already deployed for PANDA
• Running since September
  – 12th PANDA Grid Workshop and 2nd AliEn Developers Week
• Multiple sites, smaller load than ALICE
  – No API services
  – ‘Old’ v2.20 version
Thanks PANDA
Previous major update
• Stopping the whole system
  – 1 week to redeploy
  – 1 month ironing out details
Not an option!
Second set of services:
[Figure: two parallel copies of the stack, each with catalogue, Central Services (TaskQueue, Transfers, LDAP), Api, aliensh, CE, ROOT and JA]
Second set of services
• Copy of the catalogue
• 3 different central machines, 3 voboxes, same SE
• What to do with output
  – Throw away (easiest)
  – Incorporate back (easy if output in a different directory)
Timeline (Mar, Apr, May)
• Now:
  – 1 week: Investigate test system
  – 1 week: Test Catalogue migration
  – 1 week: Define New VO
  – 1 week: Verify quotas
• 1 month:
  – New hardware for CS
  – 2 days: Central deployment from backup
  – 3 days: First site working (CERN)
  – 2 weeks: At least 2 external sites (CCIN2P3, ?)
  – After that works, keep adding sites
• 2 months:
  – 1 day: Switch VO
  – 1 day: Overall site upgrade
Summary
• AliEn v2.20 ready for deployment
  – With plenty of new features and bug fixes
• Minimize upgrade downtime
  – Create testing setup with several sites, and with all the SE
  – More effort on testing (also from site admins)
• Deploy Test VO with ALICE sites
• And say goodbye to v2-19 in two months
Thank you!!
Job execution
[Figure: job execution flow. Central side: JobManager, TASKQUEUE, Job Broker and the file catalogue (LFN, GUID, Metadata). Site A, Site B and Site C each with CE, JA, MonALISA and xrootd, running the JOBs]