Installing and Maintaining HTCondor-CE¶
The HTCondor-CE software is a job gateway for an OSG Compute Element (CE). As such, HTCondor-CE is the entry point for jobs coming from the OSG: it handles authorization and delegation of jobs to your local batch system. In OSG today, most CEs accept pilot jobs from a factory, and those pilot jobs in turn accept and run end-user jobs. See the HTCondor-CE Overview for a much more detailed introduction.
Use this page to learn how to install, configure, run, test, and troubleshoot HTCondor-CE from the OSG software repositories.
If you are installing an HTCondor-CE for use outside of the OSG, consult this documentation.
Before starting the installation process, consider the following points (consulting the Reference section below as needed):
- User IDs: If they do not exist already, the installation will create the Linux users
condor (UID 4716) and
- SSL certificate: The HTCondor-CE service uses a host certificate at
/etc/grid-security/hostcert.pem and an accompanying key at
- DNS entries: Forward and reverse DNS must resolve for the HTCondor-CE host
- Network ports: The pilot factories must be able to contact your HTCondor-CE service on port 9619 (TCP)
- Submit host: HTCondor-CE should be installed on a host that already has the ability to submit jobs into your local cluster
- File Systems: Non-HTCondor batch systems require a shared file system between the HTCondor-CE host and the batch system worker nodes.
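Some of the points above can be checked from the command line. The following is a minimal sketch and not part of HTCondor-CE: the function names `dns_round_trip` and `port_open` are hypothetical, and it assumes that `getent`, `awk`, `timeout`, and `bash` are available. Note that `getent hosts` may consult `/etc/hosts` as well as DNS, depending on the host's name service configuration.

```shell
#!/bin/sh
# Hypothetical pre-installation helpers; the names are illustrative only.

# Resolve a hostname to an address and back again, printing both results.
# Succeeds only if the forward and the reverse lookup each return something.
dns_round_trip() {
    host=$1
    addr=$(getent hosts "$host" | awk '{ print $1; exit }')
    [ -n "$addr" ] || { echo "no forward entry for $host" >&2; return 1; }
    name=$(getent hosts "$addr" | awk '{ print $2; exit }')
    [ -n "$name" ] || { echo "no reverse entry for $addr" >&2; return 1; }
    echo "$host -> $addr -> $name"
}

# Succeed if a TCP connection to host $1, port $2 opens within 5 seconds.
# Relies on bash's /dev/tcp pseudo-device, so bash must be installed.
port_open() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

For example, `dns_round_trip condorce.example.com` could be run on the CE host, and `port_open condorce.example.com 9619` from a machine outside your site once the service is running. The hostname here is a placeholder.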
As with all OSG software installations, there are some one-time (per host) steps to prepare in advance:
- Ensure the host has a supported operating system
- Obtain root access to the host
- Prepare the required Yum repositories
- Install CA certificates
An HTCondor-CE installation consists of the job gateway (i.e., the HTCondor-CE job router) and other support software (e.g., GridFTP, a Gratia probe, authentication software). To simplify installation, OSG provides convenience RPMs that install all required software.
Clean yum cache:
root@host # yum clean all --enablerepo=*
root@host # yum update
This command will update all packages
(Optional) If your batch system is already installed via non-RPM means and is in the following list, install the appropriate 'empty' RPM. Otherwise, skip to the next step.
|If your batch system is…||Then run the following command…|
|HTCondor||yum install empty-condor --enablerepo=osg-empty|
|SLURM||yum install empty-slurm --enablerepo=osg-empty|
(Optional) If your HTCondor batch system is already installed via non-OSG RPM means, add the line below to
/etc/yum.repos.d/osg.repo. Otherwise, skip to the next step.
Select the appropriate convenience RPM:
If your batch system is... Then use the following package... HTCondor
Install the CE software:
root@host # yum install <PACKAGE>
Where <PACKAGE> is the package you selected in the previous step.
There are a few required configuration steps to connect HTCondor-CE with your batch system and authentication method. For more advanced configuration, see the section on optional configurations.
Configuring the batch system¶
HTCondor-CE must be installed on a host that is configured to submit jobs to your batch system. The details of this configuration are likely site-specific and therefore beyond the scope of this document.
Enable your batch system in the HTCondor-CE configuration by editing the
enabled field in
/etc/osg/config.d/20-<YOUR BATCH SYSTEM>.ini:
enabled = True
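To confirm the edit took effect, the file can be searched for the setting. This is a minimal sketch under assumptions: `batch_system_enabled` is a hypothetical helper name, the match ignores letter case and surrounding whitespace, and it does not check which section of the file the line belongs to.

```shell
# Hypothetical helper: succeed if the .ini file given as $1 contains a line
# of the form "enabled = True" (case-insensitive, flexible whitespace).
batch_system_enabled() {
    grep -Eiq '^[[:space:]]*enabled[[:space:]]*=[[:space:]]*true[[:space:]]*$' "$1"
}
```

It would be called with the path of your batch system's file, for example `batch_system_enabled "/etc/osg/config.d/20-<YOUR BATCH SYSTEM>.ini" && echo enabled`.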
If you are using HTCondor as your local batch system (i.e., in addition to your HTCondor-CE), skip to the configuring authorization section. For other batch systems (e.g., PBS, LSF, SGE, SLURM), keep reading.
Batch systems other than HTCondor¶
Non-HTCondor batch systems require a shared file system configuration to support file transfer from the HTCondor-CE to
your site's worker nodes.
The current recommendation is to run a dedicated NFS server (whose installation is beyond the scope of this document) on
the CE host.
In this setup, HTCondor-CE writes to the local spool directory, the NFS server shares the directory, and each worker
node mounts the directory in the same location as on the CE.
For example, if your spool directory is
/var/lib/condor-ce (the default), you must mount the shared directory to
/var/lib/condor-ce on the worker nodes.
If you choose not to host the NFS server on your CE, you will need to turn off root squash so that the HTCondor-CE daemons can write to the spool directory.
You can control the value of the spool directory by setting
this file if it doesn't exist).
For example, the following sets the
SPOOL directory to
SPOOL = /home/condor
The shared spool directory must be readable and writeable by the
condor user for HTCondor-CE to function correctly.
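One simple way to satisfy this requirement is for the condor user to own the directory with full owner permissions. The sketch below checks for exactly that and nothing more: `spool_dir_ok` is a hypothetical helper name, it assumes GNU `stat` (as found on Enterprise Linux), and sites that grant access through group permissions or ACLs would need a different check.

```shell
# Hypothetical helper: succeed if directory $1 exists, is owned by user $2,
# and gives its owner read, write, and search (rwx) permission.
spool_dir_ok() {
    dir=$1
    user=$2
    [ -d "$dir" ] || { echo "$dir is not a directory" >&2; return 1; }
    owner=$(stat -c '%U' "$dir") || return 1
    mode=$(stat -c '%a' "$dir") || return 1
    [ "$owner" = "$user" ] || { echo "$dir is owned by $owner, not $user" >&2; return 1; }
    # In the octal mode, the owner digit is the third from the right; 7 is rwx.
    case $mode in
        *7??) return 0 ;;
        *) echo "$dir has mode $mode; owner lacks rwx" >&2; return 1 ;;
    esac
}
```

With the default spool location this would be run as `spool_dir_ok /var/lib/condor-ce condor`, both on the CE and on a worker node that mounts the share.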
To configure which virtual organizations and users are authorized to submit jobs to your HTCondor-CE, follow the instructions in the LCMAPS VOMS plugin document.
If your local batch system is HTCondor, it will attempt to utilize the LCMAPS callouts if enabled in the
If this is not the desired behavior, set
GSI_AUTHZ_CONF=/dev/null in the local HTCondor configuration.
Configuring CE collector advertising¶
To split jobs between the various sites of the OSG, information about each site's capabilities is uploaded to a central collector. The job factories then query the central collector for idle resources and submit pilot jobs to the available sites. To advertise your site, you will need to enter some information about the worker nodes of your clusters.
Please see the Subcluster / Resource Entry configuration document about configuring the data that will be uploaded to the central collector.
Applying configuration settings¶
Making changes to the OSG configuration files in the
/etc/osg/config.d directory does not apply those settings to
Settings that are made outside of the OSG directory take effect immediately or at least when the relevant service is
For the OSG settings, use the osg-configure tool to validate (to a limited
extent) and apply the settings to the relevant software components.
The osg-configure software is included automatically in an HTCondor-CE installation.
Make all changes to
.ini files in the
This document describes the critical settings for HTCondor-CE and related software. You may need to configure other software that is installed on your HTCondor-CE host, too.
Validate the configuration settings
root@host # osg-configure -v
Fix any errors (at least) that
- Once the validation command succeeds without errors, apply the configuration settings:
root@host # osg-configure -c
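The validate-then-apply sequence can be wrapped so that settings are only applied after a clean validation. This is a sketch, assuming that `osg-configure -v` exits with a non-zero status when validation fails. The function name `validate_then_apply` and the `OSG_CONFIGURE` override variable are hypothetical and exist only for this illustration.

```shell
# Run the validation step and apply the settings only if it succeeds.
# OSG_CONFIGURE is a hypothetical override so the command can be swapped
# out (for example, for a dry run); it defaults to osg-configure.
validate_then_apply() {
    cmd=${OSG_CONFIGURE:-osg-configure}
    if "$cmd" -v; then
        "$cmd" -c
    else
        echo "Validation failed; fix the reported errors before applying." >&2
        return 1
    fi
}
```

Run as root, `validate_then_apply` would stop after the validation step whenever it reports an error, leaving the running configuration untouched.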
The following configuration steps are optional and will likely not be required for setting up a small site. If you do not need any of the following special configurations, skip to the section on using HTCondor-CE.
- Transforming and filtering jobs
- Configuring for multiple network interfaces
- Limiting or disabling locally running jobs on the CE
- Accounting with multiple CEs or local user jobs
- HTCondor accounting groups
- HTCondor-CE monitoring web interface
- Enable job retries
Transforming and filtering jobs¶
If you need to modify or filter jobs, more information can be found in the Job Router Recipes document.
If you need to assign jobs to HTCondor accounting groups, refer to this section.
Configuring for multiple network interfaces¶
If you have multiple network interfaces with different hostnames, the HTCondor-CE daemons need to know which hostname
and interface to use when communicating with each other.
Set NETWORK_HOSTNAME and NETWORK_INTERFACE to the hostname and IP address of your public interface, respectively, in
/etc/condor-ce/config.d/99-local.conf with the lines:
NETWORK_HOSTNAME = condorce.example.com
NETWORK_INTERFACE = 127.0.0.1
Replace condorce.example.com with your public interface’s hostname and
127.0.0.1 with your public interface’s IP address.
Limiting or disabling locally running jobs on the CE¶
If you want to limit or disable jobs running locally on your CE, you will need to configure HTCondor-CE's local and scheduler universes. Local and scheduler universes allow jobs to be run on the CE itself, mainly for remote troubleshooting. Pilot jobs will not run as local/scheduler universe jobs so leaving them enabled does NOT turn your CE into another worker node.
The two universes are effectively the same (the local universe launches a starter process for each job), so we will be configuring them in unison.
To change the default limit on the number of locally run jobs (the current default is 20), add the following to
START_LOCAL_UNIVERSE = TotalLocalJobsRunning + TotalSchedulerJobsRunning < <JOB-LIMIT>
START_SCHEDULER_UNIVERSE = $(START_LOCAL_UNIVERSE)
Where <JOB-LIMIT> is the maximum number of jobs allowed to run locally.
To only allow a specific user to start locally run jobs, add the following to
START_LOCAL_UNIVERSE = target.Owner =?= "<USERNAME>"
START_SCHEDULER_UNIVERSE = $(START_LOCAL_UNIVERSE)
Replace <USERNAME> with the username allowed to run jobs locally.
- To disable locally run jobs, add the following to
START_LOCAL_UNIVERSE = False
START_SCHEDULER_UNIVERSE = $(START_LOCAL_UNIVERSE)
RSV requires the ability to start local universe jobs, so if you are using RSV, you need to allow local universe jobs.
Accounting with multiple CEs or local user jobs¶
For non-HTCondor batch systems only
If your site has multiple CEs or you have non-grid users submitting to the same local batch system, the OSG accounting software needs to be configured so that it doesn't over-report the number of jobs. Use the following table to determine which file requires editing:
|If your batch system is…||Then edit the following file on each of your CE(s)…|
Then edit the value of
SuppressNoDNRecords on each of your CEs so that it reads:
HTCondor accounting groups¶
For HTCondor batch systems only
If you want to provide fairshare on a group basis, as opposed to a Unix user basis, you can use HTCondor accounting groups. They are independent of the Unix groups the user may already be in and are documented in the HTCondor manual. If you are using HTCondor accounting groups, you can map jobs from the CE into HTCondor accounting groups based on their UID, their DN, or their VOMS attributes.
To map UIDs to an accounting group, add entries to
/etc/osg/uid_table.txt with the following form:
The following is an example:
uscms02 TestGroup
osg other.osgedu
To map DNs or VOMS attributes to an accounting group, add lines to
/etc/osg/extattr_table.txt with the following form:
SubjectOrAttribute can be a Perl regular expression. The following is an example:
cmsprio cms.other.prio
cms\/Role=production cms.prod
\/DC=com\/DC=DigiCert-Grid\/O=Open\ Science\ Grid\/OU=People\/CN=Brian\ Lin\ 1047 osg.test
.* other
Entries in /etc/osg/uid_table.txt are honored over
/etc/osg/extattr_table.txt if a job would match lines in both files.
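To preview which accounting group a given Unix user would be assigned, the UID table can be searched by hand. The sketch below only illustrates the idea and is not how HTCondor-CE performs the mapping: `lookup_uid_group` is a hypothetical helper name, the two-column layout (user, then group) is inferred from the example above, and taking the first matching line is an assumption.

```shell
# Hypothetical helper: print the accounting group that the first matching
# line of the uid_table-style file $1 assigns to user $2; fail if none matches.
lookup_uid_group() {
    awk -v user="$2" '
        $1 == user { print $2; found = 1; exit }
        END { exit !found }
    ' "$1"
}
```

For instance, `lookup_uid_group /etc/osg/uid_table.txt uscms02` prints the group from the first line whose first column is `uscms02`, and returns a non-zero status if there is no such line.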
Install and run the HTCondor-CE View¶
The HTCondor-CE View is an optional web interface to the status of your CE. To run the View:
Begin by installing the package htcondor-ce-view:
root@host # yum install htcondor-ce-view
Next, uncomment the
DAEMON_LIST configuration located at
DAEMON_LIST = $(DAEMON_LIST), CEVIEW, GANGLIAD, SCHEDD
Restart the CE service:
root@host # service condor-ce restart
Verify the service by entering your CE's hostname into your web browser
The website is served on port 80 by default. To change this default, edit the value of
Enable job retries¶
In HTCondor-CE 4+, batch system job retries are disabled by default. This is because most jobs submitted through HTCondor-CEs are actually resource requests (i.e. pilot jobs) instead of jobs containing user payloads. Therefore, it's preferred to prevent these jobs from retrying and instead wait for additional resource requests to be submitted. To re-enable job retries, set the following in your configuration:
ENABLE_JOB_RETRIES = True
As a site administrator, there are a few ways to use the HTCondor-CE:
- Managing the HTCondor-CE and associated services
- Using HTCondor-CE administrative tools to monitor and maintain the job gateway
- Using HTCondor-CE user tools to test gateway operations
Managing HTCondor-CE and associated services¶
In addition to the HTCondor-CE job gateway service itself, there are a number of supporting services in your installation. The specific services are:
||See CA documentation for more info|
|Your batch system||
Start the services in the order listed and stop them in reverse order. As a reminder, here are common service commands (all run as
|To...||On EL6, run the command...||On EL7, run the command...|
|Start a service||
|Stop a service||
|Enable a service to start on boot||
|Disable a service from starting on boot||
Using HTCondor-CE tools¶
Some of the HTCondor-CE administrative and user tools are documented in the HTCondor-CE troubleshooting guide.
To validate an HTCondor-CE, perform the following verification steps:
Verify that local job submissions complete successfully from the CE host. For example, if you have a Slurm cluster, run
sbatch from the CE and verify that it runs and completes with
Verify that all the necessary daemons are running with condor_ce_status -any.
Verify the CE's network configuration using condor_ce_host_network_check.
Verify that jobs can complete successfully using condor_ce_trace.
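The three HTCondor-CE checks above can be run in sequence. This is a minimal sketch with several assumptions: `verify_ce` is a hypothetical wrapper name, the CE hostname is passed to `condor_ce_trace` as its argument, and each tool is trusted to report failure through its exit status. The output of `condor_ce_status -any` should still be read by a person, since a successful exit does not by itself prove that every expected daemon is listed.

```shell
# Run the HTCondor-CE checks in order against the CE host given as $1,
# stopping at the first one that exits with a non-zero status.
verify_ce() {
    ce_host=$1
    condor_ce_status -any || { echo "condor_ce_status failed" >&2; return 1; }
    condor_ce_host_network_check || { echo "network check failed" >&2; return 1; }
    condor_ce_trace "$ce_host" || { echo "condor_ce_trace failed" >&2; return 1; }
    echo "All checks passed for $ce_host"
}
```

It would be invoked as, for example, `verify_ce condorce.example.com`, where the hostname is a placeholder for your own CE. The final step submits a test job and therefore needs whatever credentials your CE requires.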
For information on how to troubleshoot your HTCondor-CE, please refer to the HTCondor-CE troubleshooting guide.
Registering the CE¶
To be part of the OSG Production Grid, your CE must be registered with the OSG. To register your resource:
Identify the facility, site, and resource group where your HTCondor-CE is hosted. For example, the Center for High Throughput Computing at the University of Wisconsin-Madison uses the following information:
Facility: University of Wisconsin Site: CHTC Resource Group: CHTC
To get assistance, please use this page.
Here are some other HTCondor-CE documents that might be helpful:
- HTCondor-CE overview and architecture
- Configuring HTCondor-CE job routes
- The HTCondor-CE troubleshooting guide
- Submitting jobs to HTCondor-CE
The following directories contain the configuration for HTCondor-CE. The directories are parsed in the order presented and thus configuration within the final directory will override configuration specified in the previous directories.
||Configuration defaults (overwritten on package updates)|
||Files in this directory are parsed in alphanumeric order (i.e.,
For a detailed order of the way configuration files are parsed, run the following command:
user@host $ condor_ce_config_val -config
The following users are needed by HTCondor-CE at all sites:
||The HTCondor-CE will be run as root, but perform most of its operations as the
||Runs the Gratia probes to collect accounting data|
|File||User that owns certificate||Path to certificate|
Find instructions to request a host certificate here.
|Service Name||Protocol||Port Number||Inbound||Outbound||Comment|
|HTCondor-CE||tcp||9619||X|| ||HTCondor-CE shared port|
Allow inbound and outbound network connections to all internal site servers, such as the batch system head node; only ephemeral outgoing ports are necessary.