27th Weekly Operation Report on DIRAC Distributed Computing
YAN Tian
From 2015-07-01 to 2015-07-08
Post on 19-Jan-2016
Weekly Running Jobs by User
Notes:
1. CEPC production user weiyq kept jobs running.
2. BES user zhus ran by-run sra jobs.
Item                  Value
active users          2
max running jobs      758
average running jobs  333
total executed jobs   23.6 k
Final Status of Running Jobs
Failed reason            Percent
upload/download failed   36.9%
stalled                  1.05%
application error        1.29%
other                    1.41%
StoRM SE backend was down; fixed.
Output Data Generated and Transferred
Total: 3.33 TB (~0.476 TB/day)
Quality: good except during IHEP-STORM downtime; WHU-USER acted as failover SE.
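As a sanity check, the quoted daily rate follows from dividing the weekly total by the seven-day reporting window; the figures below are copied from the slide, and the snippet itself is just an illustration:

```python
# Verify the average daily transfer rate quoted above.
total_tb = 3.33   # total output data for the week (TB), from the report
days = 7          # reporting window: 2015-07-01 to 2015-07-08

rate_tb_per_day = total_tb / days
print(f"{rate_tb_per_day:.3f} TB/day")  # 0.476 TB/day
```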
Running Jobs by Site
• 7 sites in production:
– OpenStack, OpenNebula
– WHU, USTC, UMN
– CLOUD.TORINO, GRID.JINR
Job Final Status at Each Site (inputSandbox errors and pending requests ignored)
OpenStack, 2314 jobs, 97.2% done
OpenNebula, 3259 jobs, 91.2% done, 7.1% stalled due to SE problem
WHU, 5356 jobs, 98.5% done
UMN, 1968 jobs, 94.8% done
Job Final Status at Each Site
USTC, 1242 jobs, 99.6% done
CLOUD.TORINO, 128 jobs, 89.1% done, errors due to randomtrg download
GRID.JINR, 120 jobs, 94.4% done
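A weighted overall completion rate can be derived from the per-site figures above. This is a minimal sketch: the site names and numbers are copied from the slides, but the calculation itself is not part of the report:

```python
# Per-site (total jobs, done fraction) as reported on the slides.
sites = {
    "OpenStack":    (2314, 0.972),
    "OpenNebula":   (3259, 0.912),
    "WHU":          (5356, 0.985),
    "UMN":          (1968, 0.948),
    "USTC":         (1242, 0.996),
    "CLOUD.TORINO": (128,  0.891),
    "GRID.JINR":    (120,  0.944),
}

total = sum(n for n, _ in sites.values())
done = sum(n * frac for n, frac in sites.values())
print(f"{total} jobs, overall {done / total:.1%} done")  # 14387 jobs, overall 96.1% done
```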
Failed Types at Site: Description
• All sites were good this week.
• The StoRM SE's backend server crashed; reason unknown. Fixed by a restart.
• CLOUD.TORINO had some randomtrg download errors.
• Some jobs at OpenNebula stalled because of inputSandbox download errors; the cause was the SE being down.
Cumulative User Jobs
• Total user jobs: 23.6 k
• weiyq: 84.1%
• zhus: 15.9%
Weekly Operations Log
• 7.3: Around 4-5 AM, the StoRM SE backend server failed; it was restored by a restart at 10:39. The cause of the failure is unknown. From 5:10 the BE & FE logs stopped abruptly, and at 8:37 the FE reported "acceptRequest : Error in soap_socket. Error 24". After the failure, the 325 running jobs uploaded their output to the WHU SE and the transfers were normal, but because the Request Manager System was not working the data could not be transferred back, leaving the jobs in Pending Request status; these jobs were eventually rescheduled.
• 7.6: The UMN file system recovered. ZHAO Xianghu tested 721 jobs with a 99.86% success rate.
• 7.6: dCache was mounted on lxslc601-616.
• 7.6: StoRM was added to Ganglia monitoring, in the BES-SRM group.
Andrei's reply on the CS 1024 problem
First, the 1024 limit is certainly low, so you were right to have increased it. Second, if even this limit is not enough, then the problem is most likely elsewhere. We had cases in the past where faulty hardware on one of the clients caused dropped connections to the DIRAC services and then more and more new connections, resulting in too many unclosed sockets. You can try to analyse the logs to spot the IPs that are dropping connections to the CS service and either ban those IPs or try to understand the reason for the failures there.
Cheers, Andrei
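Andrei's suggestion of scanning the logs for IPs that repeatedly drop connections can be sketched as a small counting pass. The log line format below is hypothetical, not the actual DIRAC Configuration Service log layout; only the approach (extract client IP, count per IP) is what his reply describes:

```python
import re
from collections import Counter

# Hypothetical sample of CS log lines; the real DIRAC log format
# will differ, but the approach is the same: pull the client IP out
# of each dropped-connection line and count occurrences per IP.
log_lines = [
    "2015-07-03 05:12:01 ConfigurationServer ERROR: connection dropped by 192.168.1.17",
    "2015-07-03 05:12:05 ConfigurationServer ERROR: connection dropped by 192.168.1.17",
    "2015-07-03 05:13:40 ConfigurationServer ERROR: connection dropped by 10.0.0.9",
]

ip_pattern = re.compile(r"connection dropped by (\d+\.\d+\.\d+\.\d+)")
drops = Counter(
    m.group(1) for line in log_lines if (m := ip_pattern.search(line))
)

# The IPs with the highest drop counts are candidates to ban or investigate.
for ip, count in drops.most_common():
    print(ip, count)
```

Running the same scan over the real service log would point at the client (or clients) exhausting the socket limit.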