
Page 1: 27th Weekly Operation Report on DIRAC Distributed Computing
YAN Tian
From 2015-07-01 to 2015-07-08

Page 2: Weekly Running Jobs by User

Notes:
1. CEPC production user weiyq keeps running jobs.
2. BES user zhus runs by-run sra jobs.

item                   value
active users           2
max running jobs       758
average running jobs   333
total executed jobs    23.6 k

Page 3: Final Status of Running Jobs

Failed reason            percent
upload/download failed   36.9%
stalled                  1.05%
application error        1.29%
other                    1.41%

Note: the StoRM SE backend was down; it has since been fixed.

Page 4: Output Data Generated and Transferred

Total: 3.33 TB (~0.476 TB/day)

Quality: good, except during the IHEP-STORM downtime; WHU-USER acted as the failover SE.
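As a quick sanity check on the daily rate quoted above (assuming the full 7-day reporting window from 2015-07-01 to 2015-07-08), a one-line Python calculation:

    # 3.33 TB produced over the 7-day reporting window (dates taken from the report title).
    total_tb = 3.33
    days = 7.0
    print("%.3f TB/day" % (total_tb / days))   # prints 0.476 TB/day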

Page 5: Running Jobs by Site

• 7 sites in production:
  – OpenStack, OpenNebula
  – WHU, USTC, UMN
  – CLOUD.TORINO, GRID.JINR

Page 6: Job Final Status at Each Site (inputSandbox errors and pending requests ignored)

OpenStack, 2314 jobs, 97.2% done

OpenNebula, 3259 jobs, 91.2% done, 7.1% stalled due to an SE problem

WHU, 5356 jobs, 98.5% done

UMN, 1968 jobs, 94.8% done

Page 7: Job Final Status at Each Site (cont.)

USTC, 1242 jobs, 99.6% done

CLOUD.TORINO, 128 jobs, 89.1% done, errors due to randomtrg download

JINR, 120 jobs, 94.4% done

Page 8: Failed Types at Site: Description

• All sites are good this week.
• The StoRM SE's backend server crashed; the reason is unknown. Fixed by restart.
• CLOUD.TORINO had some randomtrg download errors.
• Some jobs in OpenNebula stalled because of inputSandbox download errors; the reason was the SE being down.

Page 9: Cumulative User Jobs

• Total user jobs: 23.6 k
• weiyq: 84.1%
• zhus: 15.9%

Page 10: Operations Log for This Week

• 7.3: Around 4-5 a.m. the StoRM SE backend server failed; it was restored by a restart at 10:39. The cause of the failure is unknown. Starting from 5:10 the BE & FE logs stopped abruptly, and at 8:37 the FE reported "acceptRequest : Error in soap_socket. Error 24". After the failure, the 325 jobs that were running uploaded their results to the WHU SE and the transfers were normal, but because the Request Manager System was not working properly the data could not be transferred back, so these jobs stayed in Pending Request status and were finally rescheduled.
• 7.6: The UMN file system recovered. 赵祥虎 tested 721 jobs with a 99.86% success rate.
• 7.6: dCache was mounted on lxslc601-616.
• 7.6: StoRM was added to ganglia monitoring, in the BES-SRM group.
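The "Error 24" in the soap_socket message above is most likely errno 24 (EMFILE, "too many open files") on Linux, which would be consistent with the 1024 socket/descriptor limit discussed on the next page. A minimal sketch for checking, and where permitted raising, the per-process descriptor limit from Python; the target value 4096 is only illustrative, not the value used in production:

    import resource

    # Current per-process limit on open file descriptors (sockets count against it).
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft={0} hard={1}".format(soft, hard))

    # Raise the soft limit toward the hard limit; 4096 is an assumed, illustrative target.
    target = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
    if soft < target:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        print("soft limit raised to {0}".format(target))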

Page 11: Andrei's Reply on the CS 1024 Problem

First, the 1024 limit is certainly low, so you are right to have increased it. Second, if even this limit is not enough, then the problem is most likely elsewhere. We had cases in the past where faulty hardware on one of the clients caused connections to the DIRAC services to drop and then more and more new connections to be made, resulting in too many unclosed sockets. You can try to analyse the logs to spot the IPs which are dropping connections to the CS service, and either ban those IPs or try to understand the reason for the failures there.

Cheers, Andrei
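Following Andrei's suggestion, one way to spot the client IPs that keep dropping connections to the CS service is to count how often each IP appears next to connection-error lines in the service log. This is only a sketch: the default log file name, the error phrases, and the assumption that client IPs appear in those lines are all assumptions that would need to be adapted to the actual DIRAC CS log format.

    import re
    import sys
    from collections import Counter

    # Path to the CS service log; taken from the command line, with an assumed default name.
    log_path = sys.argv[1] if len(sys.argv) > 1 else "ConfigurationServer.log"

    # Phrases that hint at a dropped or failed connection (assumed, not exact DIRAC wording).
    error_hints = ("soap_socket", "Connection reset", "Broken pipe", "Error 24")
    ip_pattern = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")

    counts = Counter()
    with open(log_path) as log:
        for line in log:
            if any(hint in line for hint in error_hints):
                counts.update(ip_pattern.findall(line))

    # Noisiest clients first: candidates to ban or to investigate further.
    for ip, n in counts.most_common(20):
        print("{0:6d}  {1}".format(n, ip))

An IP that dominates this list would be a candidate either for a temporary ban or for checking the client host's hardware and network, as suggested in the reply.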