Experiment Support
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
DBES
A. Abramyan, L. Betev, D. Goyal, A. Grigoras,
C. Grigoras, M. Litmaath, N. Manukyan,
M. Martinez, J. Porter, P. Saiz,
S. Sankar, S. Schreiner
AliEn v2-20
5 Oct 2012 Pablo Saiz ALICE offline week
Content
• New features on v2.20
  – TaskQueue
  – Catalogue
  – Service communication
• Deployment
• Summary
Database Layout
• Single DB
• InnoDB tables
  – Row locking
  – Foreign keys
  – Transactions
    • not used…
• Lookup tables
• 2 JDLs per job
• JDL fields mapped to columns
• Link to full graph
Brokering
• Avoid ClassAd matching
  – Fewer fields to parse
• Match in a single SQL statement
• Two attempts at matching:
  – With packages already installed
  – With any packages
  – (Add a third attempt with remote data??)
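The two-attempt match can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions, not the AliEn schema: the table `job_queue`, its columns, the package names `pkgA`/`pkgB` and the helper `match_job` are all hypothetical, and each job is assumed to require a single package.

```python
import sqlite3


def make_queue():
    # Hypothetical, simplified queue: JDL requirements are assumed to be
    # already mapped to plain columns, so no ClassAd parsing is needed.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE job_queue ("
        " job_id INTEGER PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " site TEXT,"              # NULL means the job may run at any site
        " ttl INTEGER NOT NULL,"   # seconds of run time the job needs
        " package TEXT NOT NULL)"  # package the job requires
    )
    db.executemany(
        "INSERT INTO job_queue VALUES (?, ?, ?, ?, ?)",
        [
            (1, "WAITING", "SiteA", 3600, "pkgA"),
            (2, "WAITING", None, 7200, "pkgB"),
            (3, "WAITING", "SiteB", 3600, "pkgB"),
        ],
    )
    return db


def match_job(db, site, ttl, installed):
    """Return the id of a waiting job that fits the agent, or None."""
    base = (
        "SELECT job_id FROM job_queue"
        " WHERE status = 'WAITING'"
        " AND (site IS NULL OR site = ?)"
        " AND ttl <= ?"
    )
    order = " ORDER BY job_id LIMIT 1"
    # First attempt: only jobs whose package is already installed.
    if installed:
        marks = ",".join("?" for _ in installed)
        row = db.execute(
            base + f" AND package IN ({marks})" + order,
            [site, ttl, *installed],
        ).fetchone()
        if row:
            return row[0]
    # Second attempt: any package (the agent would have to install it).
    row = db.execute(base + order, [site, ttl]).fetchone()
    return row[0] if row else None
```

Each attempt is one SELECT, so the database does the filtering and no per-job requirement expression has to be evaluated in the broker.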
File brokering
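[Figure: Files 1 to 5 stored across Site A, Site B and Site C; the file-to-site mapping is not recoverable from this transcript]
• Current schema: submit 4 jobs
  – Figure labels: File 1, File 4; File 2, File 3, File 5
• Broker per file: submit 3 empty subjobs
  – When a job starts, analyze as much as possible (figure label: File 1, 2, 4, 5)
  – Figure label: File 3
  – If nothing left, just exit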
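The broker-per-file idea can be sketched as follows. The replica map `REPLICAS`, the start order and the function names are assumptions made for illustration; the slide's actual file-to-site layout is not preserved in this transcript.

```python
# Hypothetical replica map: which sites hold a copy of each file.
REPLICAS = {
    "File1": {"SiteA", "SiteB"},
    "File2": {"SiteA"},
    "File3": {"SiteC"},
    "File4": {"SiteA", "SiteB"},
    "File5": {"SiteA", "SiteC"},
}


def files_for_job(site, replicas, pending):
    """Pick every still-pending file that has a replica at `site`."""
    return sorted(f for f in pending if site in replicas[f])


def run_subjobs(start_order, replicas):
    """Simulate empty subjobs starting at the given sites, in order.

    Each subjob takes as many pending files as it can read locally.
    A subjob that finds nothing left gets an empty list and just exits.
    """
    pending = set(replicas)
    processed = []
    for site in start_order:
        taken = files_for_job(site, replicas, pending)
        processed.append((site, taken))
        pending -= set(taken)
    return processed, pending


processed, pending = run_subjobs(["SiteA", "SiteC", "SiteB"], REPLICAS)
for site, taken in processed:
    print(site, taken if taken else "nothing left, exit")
```

With this assumed layout, the subjob that starts first at SiteA analyzes four files, the one at SiteC picks up the remaining file, and the last subjob exits without work.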
More TaskQueue
• MaxWaitingTime: amount of time that a job can stay in ‘WAITING’
  – If time exceeded, job ends up in error
  – New state: ERROR_EW (Expired Waiting)
• Retrial:
  – Number of times that a single job can be resubmitted
  – Resubmission done by central services
• Reusing JobId in resubmission
• Direct removal of KILLED jobs
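A minimal sketch of how MaxWaitingTime and retrial could interact. The `Job` class, its field names and both functions are hypothetical; which error states are eligible for resubmission is not stated on the slide, so the `retriable` default below is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    status: str
    waiting_since: float     # epoch seconds when the job entered WAITING
    max_waiting_time: float  # seconds the job may stay in WAITING
    retries_left: int = 0    # remaining resubmissions


def expire_waiting(jobs, now):
    """Move jobs that waited longer than MaxWaitingTime to ERROR_EW."""
    expired = []
    for job in jobs:
        waited = now - job.waiting_since
        if job.status == "WAITING" and waited > job.max_waiting_time:
            job.status = "ERROR_EW"
            expired.append(job.job_id)
    return expired


def resubmit_failed(jobs, now, retriable=("ERROR_E",)):
    """Central resubmission that keeps the same job id.

    `retriable` is an assumed set of states; adjust it to the real policy.
    """
    resubmitted = []
    for job in jobs:
        if job.status in retriable and job.retries_left > 0:
            job.retries_left -= 1
            job.status = "WAITING"
            job.waiting_since = now
            resubmitted.append(job.job_id)
    return resubmitted
```

Because the job object is updated in place rather than replaced, the resubmitted job keeps its original id, which mirrors the "Reusing JobId" bullet.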
Some results…
• DB time to insert a job and apply 8 status changes (chart not preserved in this transcript)
• Time to process all 230M ALICE jobs: 4.8 days
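As a rough reading of the quoted figure, the total can be converted into a per-job cost. The sketch assumes the 230M jobs are replayed one after another, which the slide does not state; the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_job_seconds(total_days, total_jobs):
    """Average database time per job (insert plus its status changes)."""
    return total_days * SECONDS_PER_DAY / total_jobs


per_job = per_job_seconds(4.8, 230e6)
print(f"{per_job * 1e3:.2f} ms per job")
print(f"{1 / per_job:.0f} jobs per second")
```

Under that assumption, the figure corresponds to about 1.8 ms of database time per job, or roughly 555 jobs per second.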
Service communication
• Replacing SOAP with JSON
  – Less overhead (no XML encoding)
  – Easier to interact with other clients
  – And even from a web browser
• Backward incompatible change
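To give a feel for the encoding overhead, the sketch below builds the same call as a generic SOAP 1.1 envelope and as a JSON object. The method name `getJobStatus`, the `arg` element and the JSON field names are made up for illustration; the actual AliEn message formats are not shown in the slides.

```python
import json
from xml.sax.saxutils import escape


def soap_request(method, params):
    """Build a generic SOAP 1.1 envelope (illustrative layout only)."""
    args = "".join(
        f'<arg xsi:type="xsd:string">{escape(str(p))}</arg>' for p in params
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<soap:Envelope"
        ' xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"'
        ' xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'
        ' xmlns:xsd="http://www.w3.org/2001/XMLSchema">'
        f"<soap:Body><{method}>{args}</{method}></soap:Body>"
        "</soap:Envelope>"
    )


def json_request(method, params):
    """Build the same call as a small JSON object (hypothetical fields)."""
    return json.dumps({"method": method, "params": params})


soap = soap_request("getJobStatus", [12345])
js = json_request("getJobStatus", [12345])
print(len(soap), "bytes as SOAP;", len(js), "bytes as JSON")
```

The JSON form is also something a browser or any HTTP client can produce and parse directly, which is the interoperability point made on the slide.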
SOAP vs JSON
• Apache web server
• 32 hosts for clients
  – 16 cores
  – 8000 calls per client
Catalogue
• InnoDB tables
  – Row locking
  – Transactions
  – Foreign keys
Deployment
• All the features already deployed on ALICE_TEST
• Instead of one single big-bang release, divide it into three:
  – TaskQueue
  – JSON
  – Catalogue
• Reduces amount of downtime
  – Increases complexity of deployment…
Central Services
[Diagram: AliEn components with the number of machines and the deployed version at each layer]
• Components shown: Central Services (catalogue, TaskQueue, Transfers, LDAP), Api (×4), aliensh, vobox, ROOT, JA, BACKUP
• 3 machines (+1 slave, backups): AliEn v2-17
• 12 machines: AliEn v2-19**, v2-17
• 8 machines: AliEn v2-19**
• 80 sites: AliEn v2-19.(80-163)
• 40.000 wn: AliEn v2-19.(80-163)
Deployment of TaskQueue
• Only needed on the central services
• Database migration of 1 hour (24 GB)
• Already done!
  – Monday, 1st Oct
• Downtime of 12 hours
• Method:
  – Install new version
  – Stop services
  – Convert DB
  – Start services
Deployment of JSON
• Full deployment
  – Once Central Services updated, old installation won’t be able to connect
• No database migration
• Plan:
  – Install new version everywhere
  – Stop all services
  – Restart everything with new version
• When:
  – ?
Deployment of catalogue
• Only needed on central services
• Very delicate operation
• Database migration of 24 hours
  – 430 GB, 290 big tables
• Plan:
  – Prepare a hybrid version
  – Install v2-20 and hybrid
  – Restart services with hybrid
  – Convert DB
  – Restart services with v2-20
• When:
  – ?
Summary
• Parts of AliEn v2.20 already deployed!
• TaskQueue speed improved drastically
  – 40 times insertion rate
  – 20 times resubmission time
  – Improved concurrency
• Need to schedule 2 more upgrades
  – JSON: Improve service communication
  – New catalogue layout