Slurm Support

From HPCWIKI

  • HPCMATE Co., LTD provides thermally optimized HPC and GPGPU hardware and HPC cluster solutions, and covers frontline customer support as an authorized system integration partner of SchedMD.
  • SchedMD is the core company behind the Slurm workload manager. The owners and employees of SchedMD have written over 90% of the Slurm code base. SchedMD also reviews and integrates contributions from others, distributes the Slurm code base and maintains the canonical version of Slurm.

What does Support include?

HPCMATE and SchedMD offer Level 3 Slurm support; see "Overview of Slurm Support". HPCMATE covers all levels of Slurm support, with direct support from SchedMD, especially for Level 3 issues. When your cluster encounters a Level 3 Slurm workload manager issue or bug, a support contract allows you to submit the issue to the HPCMATE and SchedMD team for resolution. When you submit a request, we ask for the following information to ensure timely resolution:

  1. The steps or script to reproduce the issue or bug.
  2. A copy of your Slurm configuration files.
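
As an illustration of the second item, the following is a minimal Python sketch that bundles the configuration files from a Slurm configuration directory into a single archive for attachment to a support request. The function name `bundle_slurm_config`, the `*.conf` file pattern, and the list of keys to redact are assumptions made for this example and are not part of Slurm or of the support process; `StoragePass` and `AccountingStoragePass` are redacted because they can hold database passwords. Adjust the directory and the pattern to match your site.

```python
import io
import re
import tarfile
from pathlib import Path

# Assumption: configuration keys whose values should not leave the site.
# Extend this pattern to cover any other secrets in your configuration.
SECRET_KEYS = re.compile(
    r"^([ \t]*(?:StoragePass|AccountingStoragePass)[ \t]*=).*$",
    re.IGNORECASE | re.MULTILINE,
)


def redact(text):
    """Replace the value of every secret key with a placeholder."""
    return SECRET_KEYS.sub(r"\1<redacted>", text)


def bundle_slurm_config(conf_dir, out_path):
    """Write each *.conf file in conf_dir, redacted, into a gzip tarball.

    Returns the list of file names that were added to the archive.
    """
    added = []
    with tarfile.open(out_path, "w:gz") as tar:
        for path in sorted(Path(conf_dir).glob("*.conf")):
            data = redact(path.read_text(errors="replace")).encode("utf-8")
            info = tarfile.TarInfo(name=path.name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
            added.append(path.name)
    return added
```

A call such as `bundle_slurm_config("/etc/slurm", "slurm-config.tar.gz")` would archive the files from `/etc/slurm`, which is a common but site-specific location. Review the archive before sending it, since the redaction only covers the keys listed in the pattern.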

After we review the issue, the engineering team:

  1. Generates a fix, which can be anything from a configuration change to a patch.
  2. Validates the fix by conducting regression testing.
  3. Develops additional test cases, where applicable, as a result of discovery of root cause.
  4. Communicates steps of action or resolution along with any code changes and testing results to the client.
  5. Submits the final resolution to the client and for inclusion in the Slurm tree.

Slurm support also includes configuration assistance for each supported cluster. This assistance is valuable when the cluster is initially being configured to use Slurm or when the cluster needs to be modified as requirements change. For each supported cluster you will be able to review your cluster requirements, operating environment and the organizational goals for the machine with a Slurm engineer and work with the engineer to optimize the configuration to your needs. Slurm support also will allow you to receive detailed answers directly from the Slurm Development team when you encounter complex Slurm questions.

Why should I sign up?

High performance computers typically come with expectations of high utilization from end users and management in order to ensure a strong return on the investment. When a bug or complex technical issue is encountered, it can take days or even weeks to resolve the issue in house. However, when your system is covered by a support contract, you can reach out to the Slurm engineering experts for help resolving these complex issues.

What should I expect?

You should expect to make comments like these any time you work with the Slurm engineering team.

  • James Cuff, Assistant Dean for Research Computing at Harvard
  • Colin McMurtrie, Head of Systems, Swiss National Supercomputing Centre
    • "When we embarked upon our mission to port SLURM to our Cray XT and XE systems, we discovered firsthand the high quality software engineering that has gone into the creation of this product. From its very core SLURM has been designed to be extensible and flexible. Moreover, as our work progressed, we discovered the high level of technical expertise possessed by SchedMD who was very quick to respond to our questions with insightful advice, suggestions and clarifications. In the end we arrived at a solution that more than satisfied our needs. The project was so successful we have now migrated all our production science systems to Slurm, including our 20 cabinet Cray XT5 system. The ease with which we have made this transition is testament to the robustness and high quality of the product but also to the no-fuss installation and configuration procedure and the high quality documentation. We have no qualms about recommending SLURM to any facility, large or small, who wish to make the break from the various commercial options available today"

Overview of Slurm Support

  • Level 1 Support (HPCMATE)
Activities performed in response to an initial notification or awareness of a suspected problem.
  1. Problem and/or bug validation as a Slurm related issue.
  2. Review of a symptoms/solutions database for known resolutions.
  3. Research to determine if problem is already reported in a Slurm issue tracking database.
  4. Develop a complete and well-described report of the problem.
  • Level 2 Support (HPCMATE)
Activities performed following the completion of Level 1 support if resolution is not achieved.
  1. Make best efforts to reproduce and diagnose the problem.
  2. Make best efforts to resolve or reduce severity of the problems.
  • Level 3 Support (HPCMATE & SchedMD)
Activities following the completion of Level 1 and 2 support without successful resolution.
  1. Supply successful problem resolution such as a bug fix or configuration change, where problem is reproducible.
  2. Validate any fixes made by conducting regression testing.
  3. Develop additional test cases, where applicable, as a result of discovery of root cause.
  4. Communicate steps of action or resolution along with any code changes and testing results to client.
  5. Submit final resolution to the client and for inclusion in the Slurm tree.

Severity Level and Response Commitments

When a problem or question is submitted, the client specifies a severity level based on the following criteria.

Severity 1 – Major Impact

A Severity 1 issue occurs when there is a continued system outage that affects a large set of end users. The system is down and nonfunctional due to Slurm problem(s) and no procedural workaround exists.

  • Initial Response: 2 Hours (during standard work hours)
  • Status Updates: Daily
  • Work Schedule: Continuous

Severity 2 – High Impact

A Severity 2 issue is a high-impact problem that is causing sporadic outages or is consistently encountered by end users with adverse impact to end user interaction with the system.

  • Initial Response: 1 Business Day
  • Status Updates: Weekly
  • Work Schedule: Workday

Severity 3 – Medium Impact

A Severity 3 issue is a medium-to-low impact problem that includes partial noncritical loss of system access or which impairs some operations on the system but allows the end user to continue to function on the system with workarounds.

  • Initial Response: 3 Business Days
  • Status Updates: Monthly
  • Work Schedule: Workday

Severity 4 – Minor Issues

A Severity 4 issue is a minor issue with limited or no loss in functionality within the customer environment. Severity 4 issues may also be used for recommendations for future product enhancements or modifications.

  • Initial Response: As available (during normal work hours)
  • Status Updates: As available
  • Work Schedule: As available
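
For sites that track support requests in their own tooling, the four severity levels and their response commitments can be captured as a small lookup table. The following Python sketch is one possible encoding; the names `Commitment`, `COMMITMENTS`, and `describe` are hypothetical and chosen for this example, while the values are taken from the commitments listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commitment:
    """Response commitment for one severity level (illustrative type)."""
    label: str
    initial_response: str
    status_updates: str
    work_schedule: str


# Values copied from the severity definitions on this page.
COMMITMENTS = {
    1: Commitment("Major Impact", "2 Hours (during standard work hours)",
                  "Daily", "Continuous"),
    2: Commitment("High Impact", "1 Business Day", "Weekly", "Workday"),
    3: Commitment("Medium Impact", "3 Business Days", "Monthly", "Workday"),
    4: Commitment("Minor Issues", "As available (during normal work hours)",
                  "As available", "As available"),
}


def describe(severity):
    """Return a one-line summary of the commitment for a severity level."""
    c = COMMITMENTS[severity]
    return (f"Severity {severity} ({c.label}): "
            f"initial response {c.initial_response}, "
            f"status updates {c.status_updates}, "
            f"work schedule {c.work_schedule}")
```

Calling `describe(2)` returns a summary line for a Severity 2 request; an unknown severity raises `KeyError`, which keeps invalid levels from passing silently.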