Site Maintenance

This document outlines how to maintain your OSG site, including steps to take if you suspect that OSG jobs are causing issues.

Handle Misbehaving Jobs

In rare instances, you may experience issues at your site caused by misbehaving jobs (e.g., over-utilization of memory) from an OSG community or Virtual Organization (VO). If this occurs, you should immediately stop accepting job submissions from the OSG and remove the offending jobs:

  1. Configure your batch system to stop accepting jobs from the VO:

    • For HTCondor batch systems, add the following to a configuration file in /etc/condor/config.d/ on your HTCondor-CE or on the Access Point that accepts jobs from an OSG Hosted CE:

      SUBMIT_REQUIREMENT_Ban_OSG = (Owner != "<OFFENDING VO USER>")
      SUBMIT_REQUIREMENT_Ban_OSG_REASON = "OSG pilot job submission temporarily disabled"
      SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) Ban_OSG
      

      Replace <OFFENDING VO USER> with the name of the local Unix account corresponding to the problematic VO.

    • For Slurm batch systems, disable the relevant Slurm partition:

      [root@host] # scontrol update PartitionName=<OSG PARTITION> State=DOWN
      

      Replace <OSG PARTITION> with the name of the partition where you are sending OSG jobs.

  2. Remove the VO's jobs:

    • For HTCondor batch systems, run the following command on your HTCondor-CE or on the Access Point that accepts jobs from an OSG Hosted CE:

      [root@access-point] # condor_rm <OFFENDING VO USER>
      

      Replace <OFFENDING VO USER> with the name of the local Unix account corresponding to the problematic VO.

    • For Slurm batch systems, run the following command:

      [root@host] # scancel -u <OFFENDING VO USER>
      

      Replace <OFFENDING VO USER> with the name of the local Unix account corresponding to the problematic VO.

  3. Let us know so that we can track down the offending software or user: the same issue that you're experiencing may also be affecting other sites!
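
The first two steps above can be sketched as a pair of small shell helpers. This is a minimal sketch, not an official OSG tool: the function names are hypothetical, and the `condor_reconfig` line is an addition that is not part of the steps above (HTCondor generally needs a reconfiguration before edited configuration takes effect). The second helper only prints the commands so that you can review them before running anything as root.

```shell
# Sketch only: function names are hypothetical and nothing here is run automatically.

# Print the HTCondor submit-requirement stanza that rejects jobs from one local account.
# $1 = local Unix account corresponding to the problematic VO
ban_osg_config() {
    vo_user="$1"
    cat <<EOF
SUBMIT_REQUIREMENT_Ban_OSG = (Owner != "${vo_user}")
SUBMIT_REQUIREMENT_Ban_OSG_REASON = "OSG pilot job submission temporarily disabled"
SUBMIT_REQUIREMENT_NAMES = \$(SUBMIT_REQUIREMENT_NAMES) Ban_OSG
EOF
}

# Print (rather than run) the commands for the chosen batch system.
# $1 = htcondor|slurm, $2 = local Unix account, $3 = Slurm partition (slurm only)
print_removal_commands() {
    case "$1" in
        htcondor)
            # Assumption: reconfigure so the new submit requirement is picked up.
            echo "condor_reconfig"
            echo "condor_rm $2"
            ;;
        slurm)
            echo "scontrol update PartitionName=$3 State=DOWN"
            echo "scancel -u $2"
            ;;
        *)
            echo "unknown batch system: $1" >&2
            return 1
            ;;
    esac
}
```

For example, `ban_osg_config osgpilot > /etc/condor/config.d/99-ban-osg.conf` would write the stanza to a drop-in file, and `print_removal_commands slurm osgpilot osg` would list the two Slurm commands. The account `osgpilot`, the partition `osg`, and the file name `99-ban-osg.conf` are placeholders chosen for illustration.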

Keep OSG Software Updated

It is important to keep your software and data (e.g., CAs and VO client) up-to-date with the latest OSG release. See the release notes for your installed release series.

To stay abreast of software releases, we recommend subscribing to the osg-sites@opensciencegrid.org mailing list.
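
Between release announcements, one way to see whether a host has pending updates is to interpret the exit status of `yum check-update`. The helper below is a minimal sketch that assumes an EL host managed with `yum` and with the OSG repositories already enabled; the function name is hypothetical.

```shell
# Sketch only: the function name is hypothetical.
# Translate the exit status of `yum check-update` into a short message.
# yum uses 0 for "no updates", 100 for "updates available", and 1 for an error.
describe_update_status() {
    case "$1" in
        0)   echo "up to date" ;;
        100) echo "updates available" ;;
        *)   echo "check failed" ;;
    esac
}
```

A typical use would be to run `yum check-update` and then call `describe_update_status $?` immediately afterwards.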

Notify OSG of Major Changes

To avoid potential issues with OSG job submissions, please notify us of major changes to your site, including:

  • Major OS version changes on the worker nodes (e.g., upgrading from EL 7 to EL 8)
  • Adding or removing container support through Singularity or Apptainer
  • Policy changes regarding OSG resource requests (e.g., number of cores or GPUs, memory usage, or maximum walltime)
  • Scheduled or unscheduled downtimes
  • Site topology changes such as additions, modifications, or retirements of OSG services
  • Changes to site contacts, such as administrative or security staff

Help

If you need help with your site, or need to report a security incident, follow the contact instructions.
