D0 Grid Data Production Initiative: Coordination Mtg


Page 1: D0 Grid Data Production Initiative: Coordination Mtg


D0 Grid Data Production Initiative: Coordination Mtg

Version 1.0 (meeting edition), 08 January 2009

Rob Kennedy and Adam Lyon

Attending: …

Page 2: D0 Grid Data Production Initiative: Coordination Mtg


Overview

• Summary

– System ran smoothly over holidays. SUCCESS!

– Resource Utilization metrics are all high, 95%+. SUCCESS!

– Events/day increased, but short of goal level. Higher luminosity data?

– Still some tasks to follow up on: documentation, packaging, new SAM-Grid state feature

• News

– Exec Mtg w/D0 Spokespeople and Vicky on Dec 12. Much positive feedback.

• Requested a Phase 1 close-out executive meeting in late Jan/early Feb after more operational experience, investigation into events/day = f(luminosity), etc.

– No Coordination Meeting next week (1/15/2009), resumes weekly on 1/22/2009.

• Agenda

– Phase 1 Open Tasks, Close-out

– Review of Metrics and (value) Goals

– Understanding: nEvents/day = f(cpu, L, …)

Page 3: D0 Grid Data Production Initiative: Coordination Mtg


Phase 1 Close-Out

Status of Open Tasks

Current Configuration

Close-out

Page 4: D0 Grid Data Production Initiative: Coordination Mtg


Phase 1 Follow-up Notes

• Assign to: January 2009, Phase 2 (Feb-Apr), or Future Wish List

• Deployment 1: Split Data/MC Production Services – Completed, with follow-up on:

– 1. Queuing nodes auto-update code in place (v0.3), not enabled to avoid possible format confusion.

• Defer downtime needed to cleanly enable auto-update. Hand edit gridmap files until then.

– 2. AL’s Queuing Node monitoring still being productized, not running on QUE2 yet.

– 3. Switch assignments: QUE1 = MC, and QUE2 = DATA PROD

• Keep Data Production information on QUE1 for expected time.

• Remote MC users no longer need to change their usage with this switch, simpler to implement.

– 4. FWD3 not rebooted yet, so have not picked up ulimit-FileMaxPerProcess… No hurry.

– 5. Integrating experience into installation procedures (see Deployment 1 Review notes)

• Deployment 2: Optimize Data and MC Production Configurations – DEFERRED to January 2009

– 1. Config: Optimize Configurations separately for Data and MC Production

• Increase Data Production “queue” length to reduce the number of “touches” per day and avoid empty-queue conditions

– 2. New SAM-Grid Release with support for new Job status value at Queuing node

• Defect in new Condor (Qedit, old version OK) prevents feature from working.

• Kluge alternative or wait for Condor fix? Get schedule estimate for fixed release before deciding.

– 3. Uniform OS’s: Upgrade FWD1-3 and QUE1 to latest SLF 4.0, same as FWD4-5

– 4. Formalize transfer of QUE1 (samgrid.fnal.gov) to FEF from FGS (before an OS upgrade)

Page 5: D0 Grid Data Production Initiative: Coordination Mtg


Deployment Configuration (Green = now, Blue = in progress, Yellow = future)

• Reco

– FWD1: GridMgrMaxSubmitJobs/Resource = 1250 (was 750, default 100)

– FWD5: 1250

• MC, MC Merge

– FWD2: 1250 (was 750, default 100)

– FWD4: 1250

• Reco Merge

– FWD3: 750/300 grid each

• QUE1: Reco, Reco Merge – keep here to maintain history

• QUE2: MC, MC Merge – not used by MC Prod at first, now is.

– Future: Switch these to simplify transition... Remote MC and test users then make no change since default in jim client = QUE1.

• SAM Station: All job types

• Jim Client: submit to QUE1 or QUE2 depending on qualifier, QUE1 default

Page 6: D0 Grid Data Production Initiative: Coordination Mtg


Phase 1 Close-Out Discussion

• Anything else left to do from Phase 1?

– 1. …

• Comments before “Phase 1 close-out”

– 1. …

• Onward!

Page 7: D0 Grid Data Production Initiative: Coordination Mtg


Metrics and Goals

CPU Utilization (2)

Job Slot Utilization

Unmerged Events/Day Produced

Page 8: D0 Grid Data Production Initiative: Coordination Mtg


Before/After plots on slides

Metrics Relationships

[Diagram: metrics relationships across levels – Top-most Customer (D0) View, Grid/Batch Level, Compute Level, Infrastructure Level – relating Events Produced/Day (for given N job slots), Effort to Coordinate Production/Day (for given level of production), Job Slot Utilization, CPU Utilization, Job Processing Stability, and Timely Input Data Delivery.]

Page 9: D0 Grid Data Production Initiative: Coordination Mtg


Metric: CPU Utilization (1)

• Metric: CPU/Wall time ratio (computation sketched below)

– CPU time / wall clock time used by “d0farm” as reported by CAB accounting.

• Before Dec. 11: Unsteady and falling off

– Fell from ~95% to less than 80%

– 87% in Oct seems inconsistent with low job slot utilization at that time. Interpretation issue?

• After Dec. 11: Steady and high

– Since deployment, fcpd stabilized: >95%!

– SUCCESS. Goal = consistent > 95%

• Note: CPU/Job increase recently

– Implies increased CPU/event? May help explain the underwhelming nEvents/day increase.

– Or side effect of high job success rate?

• Source Link: CAB Accounting

[Chart: Avg CPU Time per Job [hours], May-08 through Jan-09 – roughly 10.8–12.0 h from May-08 through Nov-08, then 13.43, 16.71, and 19.36 h for early Dec, mid/late Dec, and Jan-09. Annotation: “CPU/Job Climbing!”]

[Chart: CPU/Wall Time Ratio by month – May-08 94.9%, Jun 86.0%, Jul 94.2%, Aug 92.1%, Sep 91.6%, Oct 86.5%, Nov 83.9%, early Dec 75.0%, mid/late Dec 95.5%, Jan-09 96.3%. Annotation: “Deployment, fcpd fixed”]
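
As a worked illustration of the metric above (not the actual CAB accounting query), here is a minimal Python sketch. The record layout and values are hypothetical; the real numbers come from CAB accounting.

```python
# Sketch of the two quantities plotted above, computed from hypothetical
# per-job accounting records (month, wall_hours, cpu_hours) for "d0farm".
from collections import defaultdict

jobs = [
    ("Nov-08", 14.0, 10.5),   # hypothetical values for illustration
    ("Nov-08", 12.0, 10.1),
    ("Jan-09", 12.2, 11.8),
    ("Jan-09", 11.5, 11.1),
]

wall = defaultdict(float)
cpu = defaultdict(float)
njobs = defaultdict(int)
for month, wall_h, cpu_h in jobs:
    wall[month] += wall_h
    cpu[month] += cpu_h
    njobs[month] += 1

for month in wall:
    ratio = cpu[month] / wall[month]              # CPU/Wall time ratio (goal: consistently > 95%)
    avg_cpu_per_job = cpu[month] / njobs[month]   # Avg CPU time per job [hours]
    print(f"{month}: CPU/Wall = {ratio:.1%}, avg CPU/job = {avg_cpu_per_job:.2f} h")
```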

Page 10: D0 Grid Data Production Initiative: Coordination Mtg


Metric: CPU Utilization (2)

• Before: Occupancy & Load Dips

– Metric: convolutes Job Slot & CPU utilization. Looking for CPU inefficiency when load dips but job slot use does not

– Top: dips only in load (black trace) are due to a file transfer daemon failure mode (fixed)

– Side effect of more stable system: Easier to see low-level issues AND debug larger issues. Less entanglement.

• After: Occupancy & Load Steady

– Bottom: Steady Load (little 4am bump is OK)

– SUCCESS. Consistently ~100% Utilization

• Source Link

[Plots: CAB occupancy and load, before and after the mid-Dec ‘08 deployment. Annotations: “fcpd fails. Jobs wait for data”; “deployment”; “a few minor empty queue instances”.]

Page 11: D0 Grid Data Production Initiative: Coordination Mtg


Metric: Job Slot Utilization

• Before: Job Slot Inefficiency

– Bad: dips in green trace

– 10% unused job slots in Sep ‘08 ... Part of resource utilization problem.

– Smaller effect: job queue going empty (blue trace hits bottom)

• After: Steady and efficient!

– Bottom: negligible dips in green trace.

– Issue: Few instances of empty queue. Treatable partly via config tweaks.

– SUCCESS. Consistently ~100% w/treatable issue.

• Source Link

– See plot near page bottom.


Page 12: D0 Grid Data Production Initiative: Coordination Mtg


[Chart: Unmerged Events/day Produced, 4/29/2008 – 1/6/2009; y-axis 0 to 12,000,000.]

[Chart: Unmerged Events/day Produced, 12/2/2008 – 1/5/2009; y-axis 0 to 10,000,000.]

Metric: Unmerged Events/Day

• Before: Wild swings, “low” average

– Top-level Metric. Dependent on all...

– Production output wildly varying

– Includes >10% “known” utilization inefficiency from job slot utilization

– May ‘08: 5.8 MEvts/day

– Sep-Nov ‘08: 5.2 MEvts/day

• After: Not as high as expected

– Dec 2-Jan 6: 5.7 MEvts/day

• Low days still, just no ~0 days.

– Eventual goal: 7.5 – 8 MEvts/day with existing system (node count, etc.) and after addressing more subtleties.

– Ops stable. Resources well used. But, Production output not much greater.

• Why not much more? And…

– Why large day-to-day variations?

– Is CPU/event increasing over “time”?

• Luminosity effect?

– Can Production keep up w/Raw?

• Source Link

[Chart annotations: Sep-Nov ‘08 average 5.2 MEvts/day; recent average 5.7 MEvts/day. A windowed-average sketch follows below.]
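
A minimal sketch of the windowed averages quoted above, assuming a hypothetical per-day event-count mapping; the real counts come from the production monitoring behind the plots.

```python
# Average unmerged events/day over a date window, counting only days with data.
from datetime import date

daily_events = {                      # hypothetical values for illustration
    date(2008, 12, 2): 6_100_000,
    date(2008, 12, 3): 5_400_000,
    date(2009, 1, 5): 5_600_000,
    date(2009, 1, 6): 5_700_000,
}

def window_average(events, start, end):
    vals = [n for d, n in events.items() if start <= d <= end]
    return sum(vals) / len(vals) if vals else 0.0

recent = window_average(daily_events, date(2008, 12, 2), date(2009, 1, 6))
print(f"Recent average: {recent / 1e6:.1f} MEvts/day (goal: 7.5-8 MEvts/day)")
```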

Page 13: D0 Grid Data Production Initiative: Coordination Mtg


Understanding nEvents/day

Identify Dependencies

Measure Dependencies

Automate Measurement: Monitoring

Page 14: D0 Grid Data Production Initiative: Coordination Mtg


NEvents/day = f(cpu, L, …)

• Best case: nEvents/day = (CPU-sec in system) / (CPU-sec/event) * (sec/day) (see the sketch at the end of this slide)

– CPU-sec in system = (max CPU-sec in system)

• Standard unit of processing at D0 = ?

• Estimate processing power available to D0Farm on CAB. GHz or bogo-mips measurable by Ganglia?

– CPU-sec/event = (average CPU-sec/event on benchmark machine and data)

• Benchmark machine and data? Benchmark values for each code release?

• A Little Reality Never Hurt: Overall CPU Utilization Efficiency

– CPU-sec in system = (max CPU-sec in system) * (CPU-sec/Wall-sec)

• CPU-sec/Wall-sec > 95% and steady now. Track this, but no longer a large effect.

• (But Full) Reality Bites: CPU/event is not a static value

– Dependent on: Luminosity, Data stream (min-bias vs. high-pt), Reconstruction code, …

• Data stream differences average out as all are processed per run, but a source of variation?

– Dependence on Luminosity a concern as Tevatron breaks record after record

• Do we have a measure of Luminosity which we can correlate to cpu/event?

– Can we adapt CPU-hrs/Job from CAB to help fill potential measurement gap here?

• Rough average events/job? Job defined as an integrated luminosity increment?
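
To make the relation above concrete, here is a back-of-the-envelope Python sketch. All inputs (slot count, CPU-sec/event, the linear luminosity dependence) are placeholder assumptions, not measured D0 values; how to measure them is exactly what this slide leaves open.

```python
# Sketch of: nEvents/day = (CPU-sec in system) / (CPU-sec/event) * (sec/day),
# with the CPU/Wall efficiency factor from the earlier slides folded in.
SEC_PER_DAY = 86_400

def events_per_day(cpu_capacity, cpu_sec_per_event, cpu_per_wall=1.0):
    """Best-case events/day.

    cpu_capacity      -- CPU-seconds the farm delivers per wall-clock second
                         (roughly the number of usable job slots; an assumption)
    cpu_sec_per_event -- average CPU-seconds to reconstruct one event
    cpu_per_wall      -- overall CPU/Wall utilization efficiency (now > 0.95)
    """
    return cpu_capacity * cpu_per_wall * SEC_PER_DAY / cpu_sec_per_event

def cpu_sec_per_event_vs_L(L, base=20.0, slope=0.05):
    """Hypothetical linear growth of CPU-sec/event with luminosity L."""
    return base + slope * L

# Hypothetical inputs: ~2000 usable slots, 96% CPU/Wall efficiency.
for L in (100.0, 250.0):
    rate = events_per_day(2000, cpu_sec_per_event_vs_L(L), cpu_per_wall=0.96)
    print(f"L = {L:5.0f}: {rate / 1e6:.1f} MEvts/day")
```

With the same slot count and efficiency, a CPU/event that grows with L drops the best-case rate from about 6.6 to about 5.1 MEvts/day in this toy example – the shape of effect the slide suggests investigating.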

Page 15: D0 Grid Data Production Initiative: Coordination Mtg


NEvents/day Discussion

• Identify Dependencies

– Resource Utilization is largely addressed

– CPU/event in general, and CPU/event = f(L)

• Measure Dependencies

– What single-point measurements are available?

– What metrics are available to help identify cause?

– Other experiences, wisdom to apply?

• Automate Measurement: Monitoring (a correlation-fit sketch follows this slide’s bullets)

– What can be done to “build this into the system” so even if a component is not a problem now, it can be watched just in case.

• Next Steps

– Meetings/tasks before Jan 22 and next D0 Spokes+Vicky meeting?
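
One way to start on “Measure Dependencies” is a simple least-squares fit of CPU-sec/event against a luminosity measure. The sample pairs below are hypothetical; which real luminosity quantity to use, and whether CAB’s CPU-hrs/job can stand in for CPU/event, are the open questions from the previous slide.

```python
# Ordinary least-squares fit of CPU-sec/event vs. a luminosity proxy L.
samples = [                     # (L proxy, CPU-sec/event), hypothetical values
    (100.0, 22.0), (150.0, 25.5), (200.0, 28.0), (250.0, 32.5),
]

n = len(samples)
sx = sum(L for L, _ in samples)
sy = sum(c for _, c in samples)
sxx = sum(L * L for L, _ in samples)
sxy = sum(L * c for L, c in samples)

slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(f"CPU-sec/event ~= {intercept:.1f} + {slope:.3f} * L")
```

Run periodically over fresh accounting data, a fit like this is the sort of thing that could be “built into the system” as monitoring, even before CPU/event is known to be a problem.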

Page 16: D0 Grid Data Production Initiative: Coordination Mtg


Background Slides

Original 16+4 Issues List

Page 17: D0 Grid Data Production Initiative: Coordination Mtg


Issues List (p.1/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes)

• 1) Unreliable state information returned by SAM-Grid: SAM-Grid under some circumstances does not return correct state information for jobs. Fixing this may entail adding some logic to SAM-Grid.

– SAM-Grid Job Status development (see discussion on earlier slides). Delayed by Condor defect.

• 2) Cleanup of globus-job-managers on forwarding nodes, a.k.a. “stale jobs”: The globus job managers on the forwarding nodes are sometimes left running long after the jobs have actually terminated. This eventually blocks new jobs from starting.

– AL: Improved script to identify them and treat symptoms. Has not happened recently, but why is it happening at all?

– Not specific to SAM-Grid Grid Production

• 3) Scriptrunner on samgrid needs to be controlled, a.k.a. the Periodic Expression problem: This is now locking us out of all operation for ~1 hour each day. This is due to a feature in Condor 6 which we do not use, but which cannot be fully disabled either. Developers say this is fixed in Condor 7, but this has not been proven yet.

– Condor 7 Upgrade – RESOLVED!

• 4) CORBA communication problems with SAM station: The actual source of all CORBA problems is hard to pin down, but at least some of them seem to be associated with heavy load on samgfwd01 where the SAM station runs. Since the forwarding nodes are prone to getting bogged down at times, the SAM station needs to be moved to a separate node.

– Move SAM station off of FWD1 – DONE! Context Server move as well – DONE!

• 5) Intelligent job matching to forwarding nodes: SAM-Grid appears to assign jobs to the forwarding nodes at random without regard to the current load on the forwarding nodes. It will assign jobs to a forwarding node that has reached CurMatch max even if another forwarding node has job slots available.

– Nothing in Phase 1. Later Phase may include a less effort-intensive approach to accomplish same result.

Page 18: D0 Grid Data Production Initiative: Coordination Mtg


Issues List (p.2/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes)

• 6) Capacity of durable location servers: Merge jobs frequently fail due to delivery timeouts of the unmerged thumbnails. We need to examine carefully what functions the durable location servers are providing and limit activity here to production operations. Note that when we stop running Recocert as part of the merge this problem will worsen.

– Nothing in Phase 1. Later Phase may include decoupling of durable location servers?

– No automatic handling of hardware failure. System keeps trying even if a storage server is down.

• 7) CurMatch limit on forwarding nodes: We need to increase this limit, which probably implies adding more forwarding nodes. We would also like to have MC and data production separated on different forwarding nodes so response is more predictable.

– Decouple FWD nodes between Data and MC Production and tune separately for each – DONE.

– Can now tune to optimize for Data Production.

• 8) Job slot limit on forwarding nodes: The current limit of 750 job slots handled by each forwarding node has to be increased. Ideally this would be large enough that one forwarding node going down only results in slower throughput to CAB rather than a complete cutoff of half the processing slots. Could be addressed by optimizing fwd node config for data production.

– Decouple FWD nodes between Data and MC Production and tune separately for each – DONE.

– Can now tune to optimize for Data Production.

• 9) Input queues on CAB: We have to be able to fill the input queues on CAB to their capacity of ~1000 jobs. The configuration coupling between MC and data production that currently limits this to ~200 has to be removed. Could be addressed by optimizing fwd node config for data production.

– Decouple FWD nodes between Data and MC Production and tune separately for each – DONE.

– Can now tune to optimize for Data Production.

Page 19: D0 Grid Data Production Initiative: Coordination Mtg


Issues List (p.3/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes)

• 10) 32,001 Directory problem: Acceptable band-aid is in place, but we should follow up with Condor developers to communicate the scaling issue of storing job state in a file system given the need to retain job state for tens of thousands of jobs in a large production system.

– Already a cron job in place to move information into sub-directories to avoid this (a directory-sharding sketch follows this issues page).

• 11) Spiral of Death problem: See for example reports from 19-21 July 2008. Rare, but stops all processing. We do not understand the underlying cause yet. The only known way to address this situation is to do a complete kill/cold-stop and restart of the system.

– Condor 7 Upgrade? May be different causes in other episodes... Only one was understood.

– Decouple FWD nodes between Data and MC Production and tune separately for each (mitigation only).

• 12) Various Globus errors: We have repeated episodes where a significant number of jobs lose all state information and fall into a "Held" state due to various Globus errors. These errors are usually something like "Job state file doesn't exist", "Couldn't open std out or std err", "Unspecified job manager error". Mike doesn't think we have ever clearly identified the source of these errors. His guess is they have a common cause. The above errors tend to occur in clusters (about half a dozen showed up last night, that's what brought it to mind). They usually don't result in the job failing, but such jobs have to be tracked by hand until complete and in some cases all log information is lost.

– Later Phase to include more detailed debugging with more modern software in use.

– At least some issues are not SAM-Grid specific and known not fixed by VDT 1.10.1m (KC).

• For example: GAHP server... Part of Condor

• 13) Automatic restart of services on reboot: Every node in the system (samgrid, samgfwd, d0cabosg, etc.) needs to be set up to automatically restart all necessary services on reboot. We have lost a lot of time when nodes reboot and services do not come back up. SAM people appear to not get any notification when some of these nodes reboot.

– Done during Evaluation Phase. Make sure this is set up on new nodes as well. – DONE!
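
For issue 10, the band-aid described above is a cron job that keeps any single directory from accumulating too many entries. A minimal sketch of that idea follows; the paths, naming, and bucket count are hypothetical rather than the actual SAM-Grid/Condor layout, and the motivation is that ext3 caps a directory at roughly 32,000 subdirectories.

```python
# Move per-job state directories into hashed buckets so no single parent
# directory approaches the ~32,000-subdirectory limit.
import os
import shutil
import zlib

STATE_ROOT = "/var/sam-grid/job-state"   # hypothetical location
N_BUCKETS = 256

def shard(root=STATE_ROOT, n_buckets=N_BUCKETS):
    for name in os.listdir(root):
        src = os.path.join(root, name)
        if not os.path.isdir(src) or name.startswith("bucket-"):
            continue                      # skip files and already-created buckets
        bucket = "bucket-%02x" % (zlib.crc32(name.encode()) % n_buckets)
        dst_dir = os.path.join(root, bucket)
        os.makedirs(dst_dir, exist_ok=True)
        shutil.move(src, os.path.join(dst_dir, name))

if __name__ == "__main__":
    shard()                               # intended to be run from cron
```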

Page 20: D0 Grid Data Production Initiative: Coordination Mtg


Issues List (p.4/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes)

• 14) SRM needs to be cleanly isolated from the rest of the operation: This might come about as a natural consequence of some of the other decoupling actions. A more global statement would be that we have to ensure that problems at remote sites cannot stop our local operations (especially if the problematic interaction has nothing to do with the data processing operation).

– Nothing in Phase 1. Later Phase to include decoupling of SAM stations, 1 each for Data and MC Production.

• 15) Lack of Transparency: No correlation between the distinct grid and PBS IDs and inadequate monitoring mean it is very difficult to track a single job through the entire grid system, especially important for debugging.

– Tool identified in Evaluation Phase to help with this. Consider refinement in later Phase.

• 16) Periods of Slow Fwd node to CAB Job transitions: related to Spiral of Death issue?

– Condor 7 Upgrade and increase ulimit-OpenFileMaxPerProcess to high value used elsewhere.

– Cures all observed cases? Not yet sure.

• MC-specific Issue #1) File Delivery bottlenecks: use of SRM at site helps

– Out of scope for Phase 1. SRM specification mechanism inadequate. Should go by the site name or something more specific.

• MC-specific 2) Redundant SAM caches needed in the field

– Out of scope for Phase 1

• MC-specific 3) ReSS improvements needed, avoid problem sites, …

– Out of scope for Phase 1. PM sent doc, met with Joel.

• MC-specific 4) Get LCG forwarding nodes up and running reliably

– Out of scope for Phase 1. This is being worked on outside of Initiative Phase 1, though.