ORNL is managed by UT-Battelle for the US Department of Energy
Robinhood
Operational Preparation for Large-Scale Deployment
Overview of OLCF / NCCS
• National Center for Computational Sciences
– Focus on at-scale HPC challenges
– Support for projects like SNS, NCRC
• Oak Ridge Leadership Computing Facility
– Largest project of NCCS
– Home of Titan/Atlas. Future home of Summit/Alpine
Overview of Robinhood
• Policy Engine for POSIX file systems
• Extra hooks for Lustre
• Allows for near-real-time file system information
Making Robinhood Fit OLCF Production Standards
Reproducible Builds
• Using GitLab Runners
– A binary that executes builds as part of a CI pipeline (registration sketch below)
– Settings -> General -> Enable pipelines
– Settings -> Pipelines
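A hedged sketch of registering such a runner with a shell executor; the URL, token, description, and tag are placeholders, not our production values:

    # Register a project runner; all values below are examples only
    gitlab-ci-multi-runner register \
      --non-interactive \
      --url "https://gitlab.example.com/" \
      --registration-token "PROJECT_TOKEN" \
      --executor "shell" \
      --description "storage-util1 build runner" \
      --tag-list "storage-util1"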
Building Lustre
• Current setup:
– A GitLab runner runs as a bot build user on the storage-util1 node
– The build script checks out a copy of the Lustre repo
– Uses the current build system to create Lustre RPMs
– Stores them in a staging area for manual signing/approval
– Only used for Robinhood testing currently
• Future setup:
– “Bring your own build host”
– Using runner “tags” to route jobs to it (see the pipeline sketch below)
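A minimal .gitlab-ci.yml sketch of such a build job, assuming a runner tagged for the build host; the job name, tag, repository URL, and staging path are illustrative:

    # Hypothetical pipeline definition; names and paths are examples
    stages:
      - build

    build-lustre-rpms:
      stage: build
      tags:
        - storage-util1                 # route the job to the tagged build host
      script:
        - git clone git://git.whamcloud.com/fs/lustre-release.git
        - cd lustre-release
        - sh autogen.sh
        - ./configure
        - make rpms                     # build RPMs with the stock build system
        - cp ./*.rpm /staging/lustre/   # hold for manual signing/approval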
Building Robinhood
• Similar setup to Lustre
– Lustre RPMs are installed manually
– Kick off the pipeline build
– Robinhood RPMs are built against the installed Lustre client
– RPMs are placed in a staging area for testing/signing/deployment/installation (build sketch below)
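A hedged shell sketch of the Robinhood build step, assuming the Lustre client packages (including headers) are already installed on the host; the make rpm target and output path reflect our reading of the upstream build system, and the staging path is a placeholder:

    # Build Robinhood RPMs against the installed Lustre client
    git clone https://github.com/cea-hpc/robinhood.git
    cd robinhood
    sh autogen.sh
    ./configure                                      # detects the installed Lustre client
    make rpm                                         # build the Robinhood RPMs
    cp rpms/RPMS/x86_64/*.rpm /staging/robinhood/    # stage for testing/signing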
Puppet Setup
• NCCS uses Puppet’s role and profile design workflow
• https://docs.puppet.com/pe/2017.2/r_n_p_full_example.html
• No existing module on Puppet Forge
• A robinhood module is a work in progress
Puppet Robinhood Module
• Basic 1-to-1 mapping between Robinhood config options and Puppet parameters (sketch below)
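A sketch of what that mapping could look like for the WIP module; the class name and parameters are illustrative, not a published API:

    # Hypothetical usage of the WIP robinhood module
    class { 'robinhood':
      fs_path                => '/lustre/atlas',   # maps 1-to-1 to fs_path in robinhood.conf
      fs_type                => 'lustre',
      nb_threads             => 24,
      max_pending_operations => 200000,
    }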
Testing Environment
Testing Setup
• Tested against older hardware
• Used the AtlasTDS file system
• A partition of a NetApp E5500 with 48x 900GB 10k SAS drives, over 6Gb/s SAS
Testing Hardware
• Storage-util1 Node:
– Dell PowerEdge R620
– 2x Intel® Xeon® CPU E5-2640 @ 2.50GHz
– 16x 16GB DIMM DDR3 1333 MHz
– Hyperthreading Disabled
– Diskless provisioning
MariaDB tuning
• Mostly the same settings as recommended by Robinhood's starting page
• The innodb_additional_mem_pool_size setting is not used in MariaDB 10.3
• For stock RHEL installs, the slow-query settings (log_slow_queries, long_query_time, and log_queries_not_using_indexes) can show whether the database is a bottleneck (fragment below)
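A hedged my.cnf fragment collecting those knobs; the buffer pool value is an example sized to the test host, not our exact production number, and slow_query_log is the current name for the older log_slow_queries:

    [mysqld]
    # Buffer pool sizing per the Robinhood guidance; value is an example
    innodb_buffer_pool_size       = 64G
    # Slow-query visibility to spot database bottlenecks
    slow_query_log                = 1
    long_query_time               = 2
    log_queries_not_using_indexes = 1
    # innodb_additional_mem_pool_size intentionally omitted (removed upstream)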
Robinhood Tuning
• Set nb_threads to twice the number of physical cores
• Changed max_pending_operations from 10000 to 200000
• Set nb_threads_scan to twice the number of physical cores; this may be too many
• Changed queue_max_size from 1000 to 10000 and queue_max_age from 5s to 10s
• Trade-off between consistency/recovery time and speed (settings collected in the sketch below)
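The values above collected as a Robinhood config fragment; thread counts assume the test node's 12 physical cores, and the block placement follows the Robinhood v3 layout as we understand it, so verify against your version's template:

    EntryProcessor {
        nb_threads             = 24;       # 2x the 12 physical cores
        max_pending_operations = 200000;   # raised from 10000
    }
    FS_Scan {
        nb_threads_scan = 24;              # 2x physical cores; may be too many
    }
    ChangeLog {
        queue_max_size = 10000;            # raised from 1000
        queue_max_age  = 10s;              # raised from 5s
    }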
Disk Utilization
Bottlenecks?
• The file system backend limited metadata performance
• Under certain metadata-intensive workloads:
– There is no easy solution
– Mentioned in https://jira.hpdd.intel.com/browse/LU-8047
Issues with RHEL7
• Stock MariaDB is dated
• systemd ulimit settings (service limits come from the unit file, not limits.conf)
Testing Summary
• The current testing hardware can only process so quickly – we appear to have hit this limit
• This moved the bottleneck towards Lustre
• GET_FID is typically the highest-latency command
• Bursts of metadata traffic cause spikes of “Wait”-state commands; in our testing, the spikes shift between GET_INFO_DB, DB_APPLY, and CHGLOG_CLR
Daemon vs. One-shot
• Split use-case
• Daemon:
– File system scanning
– Changelog consumption
– RBH_OPT="--readlog --scan"
• One-shot (“manual” process / cron job):
– Policy application (e.g., purging); see the invocation sketch below
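A hedged sketch of the two invocations; the --detach flag and the "cleanup" policy name are assumptions, while RBH_OPT comes from the sysconfig file as noted above:

    # Daemon mode: continuous scanning + changelog consumption
    # (equivalent to RBH_OPT="--readlog --scan" in /etc/sysconfig/robinhood)
    robinhood --readlog --scan --detach

    # One-shot from cron: apply a policy once and exit
    robinhood --run=cleanup --once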
Comparison to Existing Tools
PCircle
• Suite of file system tools for parallel data copying, checksumming, and profiling
• Currently used for ~weekly file system profiling
• Includes directory count, sym/hard link counts, file count, average file size, and max files within a directory, among other statistics
• Reports file size histograms, and top files (by size)
• https://github.com/olcf/pcircle
fprof
• Able to reproduce fprof-like reporting by setting up fileclass buckets (sketch below)
• Built-in reports like ‘top x’ files/directories provide similar functionality
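A sketch of fprof-style size buckets as Robinhood fileclasses; the bucket boundaries are illustrative:

    # Each bucket becomes a fileclass; rbh-report --class-info then gives
    # per-bucket counts and volumes, approximating fprof's size histogram
    fileclass size_0_4k    { definition { size <= 4KB } }
    fileclass size_4k_1M   { definition { size > 4KB and size <= 1MB } }
    fileclass size_1M_1G   { definition { size > 1MB and size <= 1GB } }
    fileclass size_over_1G { definition { size > 1GB } }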
Output: rbh-report --class-info
LustreDU
• Provides directory-level usage for users/projects
• Populated by:
– Parsing Lester output
– Contacting inode query daemons running on OSS nodes
– Populating/updating MySQL database
• Only updated daily
• Issues with running as a privileged user
rbh-du output
• Provides a quick du option
• Could provide a smart wrapper that chooses between du and rbh-du based on the file path (sketch below)
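A hypothetical wrapper along those lines; the /lustre/atlas prefix is an example mount point:

    #!/bin/sh
    # "Smart du": use rbh-du for paths on the Robinhood-tracked file
    # system, plain du everywhere else
    target=$(readlink -f "${1:-.}")
    case "$target" in
        /lustre/atlas*) exec rbh-du "$@" ;;
        *)              exec du "$@" ;;
    esac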
Purging Policies
• Non-Robinhood workflow:
– User submits request
– RUC approval
– UAO team member enters exemption into RATS
– Purge config is generated using those exemptions
• Robinhood pieces are still WIP
• Example: see the policy sketch following the integration slide below
Purging – Integration with Robinhood
• Want to keep the same workflow for users and other groups
• Current thoughts:
– Pull the list of purge exemptions
– Generate a purge configuration file using multiple “tree” statements in a cleanup rule
– Run Robinhood with --once against that policy
– Log and remove the configuration (sketch below)
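A sketch of what a generated policy could look like; the paths, age threshold, and file name are placeholders, with each exempted tree becoming an ignore statement:

    # purge_generated.conf -- regenerated from RATS exemptions each run
    cleanup_rules {
        # One ignore block per exempted project tree
        ignore { tree == "/lustre/atlas/proj-shared/prj001" }
        ignore { tree == "/lustre/atlas/proj-shared/prj002" }

        rule default {
            condition { last_access > 14d }   # example purge threshold
        }
    }

Run once against that configuration, then log and remove it:

    robinhood --run=cleanup --once -f /etc/robinhood.d/purge_generated.conf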
Future Work
Hardware upgrades
• Transition to a setup similar to the current MDS nodes
• Single socket, faster clock speed
• SSD / NVMe storage target
Clustering
• Move processes to multiple nodes
– Multiple physical nodes vs namespaced mounts / VMs
– Set up a MariaDB/MySQL cluster
– Millions of SQL statements per second
– https://www.mysql.com/why-mysql/benchmarks/mysql-cluster/
CEA’s Lustre Changelogs Aggregate & Publish (lcap) Integration
• Ability for multiple changelog readers
• Redirect a copy of the changelog to our Kafka instances while still using a single reader
Lustre Jobstats
Jobstats Integration
Jobstats Integration - continued
• Database schema changes
– Add new columns to the database: creation_job, last_access_job, last_mod_job, and last_mdchange_job (sketch below)
– Parse job_id (partial support exists currently) to populate these fields
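A hedged SQL sketch of those schema changes; placing the columns on Robinhood's main ENTRIES table and the VARCHAR width are our assumptions:

    -- Proposed columns for per-entry job attribution
    ALTER TABLE ENTRIES
        ADD COLUMN creation_job      VARCHAR(64),
        ADD COLUMN last_access_job   VARCHAR(64),
        ADD COLUMN last_mod_job      VARCHAR(64),
        ADD COLUMN last_mdchange_job VARCHAR(64);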
Jobstats Integration – potential wins
• File system usage heuristics
• Security triggers / auditing
• File-level history
References
• https://github.com/cea-hpc/robinhood/wiki/Documentation
• https://dev.mysql.com/doc
• https://mariadb.com
• https://github.com/fwang2/ioutils
• https://cug.org/proceedings/cug2014_proceedings/includes/files/pap157.pdf
• https://gitlab.com/gitlab-org/gitlab-ci-multi-runner
• http://wiki.lustre.org/images/0/02/LUG-2011-Aurelien_Degremont-Robinhood_Quick_Tour.pdf
• https://github.com/cea-hpc/lcap
• http://syst.univ-brest.fr/per3s/wp-content/uploads/2017/02/robinhood-Per3S.pdf
• https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml