This document outlines the overall installation process for an OSG site and provides links to detailed pages on installation, configuration, troubleshooting, and related topics. If you do not see the software-related technical documentation you need listed here, try the search bar at the top or contact us at [email protected].
Plan the Site
If you have not done so already, plan the overall architecture of your OSG site. It is recommended that your plan be sufficiently detailed to include the OSG hosts that are needed and the main software components for each host. Be sure to consider the operating systems that OSG supports. For example, a basic site might include:
| Host | Main software |
|------|---------------|
| Compute Element (CE) | OSG CE, HTCondor Central Manager, etc. |
| Worker nodes | OSG worker node client |
Prepare the Batch System
For smaller sites (fewer than 50 worker nodes), the most common way to add a site to OSG is to install the OSG Compute Element (CE) on the central host of your batch system. At such a site, especially if you have minimal time to maintain a CE, you may want to contact firstname.lastname@example.org to ask about using an OSG-hosted CE instead of running your own. Before proceeding with an install, be sure that you can submit and successfully run a job from your OSG CE host into your batch system.
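As a sanity check before installing the CE, you can confirm that local submission works. The sketch below assumes an HTCondor batch system and uses illustrative file names; sites running Slurm, PBS, or another scheduler would use their own submit command with an equivalent sleep job.

```shell
# Minimal HTCondor test job, submitted from the future CE host.
# (File names here are illustrative; adapt for your batch system.)
cat > test.sub <<'EOF'
executable = /bin/sleep
arguments  = 60
output     = test.out
error      = test.err
log        = test.log
queue
EOF

condor_submit test.sub   # submit into the local pool
condor_q                 # the job should appear, run, and complete
```

If the job never leaves idle state or fails, fix the batch system before continuing with the CE install.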
Add OSG Software
If necessary, provision any OSG hosts in your site plan that do not yet exist. The general steps for installing an OSG site are:
- Install the OSG Yum repositories and the Compute Element software on your CE host.
- Install the Worker Node client on your worker nodes.
- Install optional software to increase the capabilities of your site.
For sites with more than a handful of worker nodes, we recommend using a configuration management tool to install, configure, and maintain your site. Explaining how to select and use such a tool is beyond the scope of OSG's documentation, but some popular options are Puppet, Chef, Ansible, and CFEngine.
General Installation Instructions
- Security information for OSG signed RPMs
- Using Yum and RPM
- Install the OSG Yum repositories
- OSG Software release series - look here to upgrade to OSG 3.5
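As a sketch, on an EL7 host the repositories listed above are typically enabled with commands like the following; verify the release RPM URL against the current installation docs before copying, since it changes between OSG release series.

```shell
# EPEL and the priorities plugin are prerequisites for the OSG repositories.
yum install -y epel-release yum-plugin-priorities

# Install the OSG release RPM, which sets up the OSG Yum repositories.
# This URL targets the OSG 3.5 series on EL7; confirm it before use.
rpm -Uvh https://repo.opensciencegrid.org/osg/3.5/osg-3.5-el7-release-latest.rpm
```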
Installing and Managing Certificates for Site Security
- Installing the grid certificate authorities (CAs)
- How do I get X.509 host certificates?
- Automatically updating the grid certificate authorities (CAs)
- OSG PKI command line client reference
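Once the OSG repositories are enabled, the CA certificates and the CRL updater can be installed roughly as follows; the package and service names below are as documented for the OSG 3.5 series, so check them against the pages above.

```shell
# IGTF CA certificates as distributed by OSG, plus fetch-crl,
# which keeps certificate revocation lists current.
yum install -y osg-ca-certs fetch-crl

# Run fetch-crl periodically so CRLs do not go stale.
systemctl enable --now fetch-crl-cron
```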
Installing and Configuring the Compute Element
- Install the compute element (HTCondor-CE and other software):
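For an HTCondor batch system, the CE install reduces to a metapackage plus a service, sketched below; sites running Slurm, PBS, LSF, or SGE would install the corresponding `osg-ce-*` metapackage instead, and the configuration step is covered in the linked pages.

```shell
# HTCondor-CE plus supporting software for an HTCondor batch system.
yum install -y osg-ce-condor

# After completing the configuration described in the CE documentation,
# enable and start the CE service.
systemctl enable --now condor-ce
```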
Adding OSG Software to Worker Nodes
- Worker Node (WN) Client Overview
- Install the WN client software on every worker node – pick a method:
- (optional) Install the CernVM-FS client to make it easy for user jobs to use needed software from OSG's OASIS repositories
- (optional) Install singularity on the OSG worker node, to allow pilot jobs to isolate user jobs.
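For the RPM-based install method, the worker node steps above look roughly like this on each node (package names as used in the OSG repositories; the optional pieces need the configuration described in their linked pages before they are useful):

```shell
# Core worker node client.
yum install -y osg-wn-client

# Optional: CVMFS client plus the OSG configuration package,
# so jobs can read software from the OASIS repositories.
yum install -y cvmfs cvmfs-config-osg

# Optional: Singularity, so pilot jobs can isolate user jobs in containers.
yum install -y singularity
```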
Installing and Configuring Other Services
All of these node types and their services are optional, although OSG requires an HTTP caching service if you have installed CVMFS on your worker nodes.
- Install Frontier Squid, an HTTP caching proxy service.
- Storage element:
- Existing POSIX-based systems (such as NFS, Lustre, or GPFS):
- Hadoop Distributed File System (HDFS):
- Hadoop Overview: HDFS information, planning, and guides
- RSV monitoring to monitor and report to OSG on the health of your site
- Install the GlideinWMS VO Frontend if you want your users' jobs to run on the OSG
- Install the RSV GlideinWMS Tester if you want to test your frontend's ability to submit jobs to sites in the OSG
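Taking the HTTP caching proxy above as an example, a Frontier Squid install is typically just a package and a service; the customization file path below is the one used by the frontier-squid package, but confirm it against the linked documentation.

```shell
# HTTP caching proxy, required if worker nodes mount CVMFS.
yum install -y frontier-squid

# Adjust network/cache settings for your site in /etc/squid/customize.sh,
# then enable and start the service.
systemctl enable --now frontier-squid
```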
Verify OSG Software
If you haven't already, register any public-facing resources running OSG software, including HTCondor-CE, Frontier Squid, GridFTP, and/or XRootD.
It is useful to test manual job submission, from both inside and outside your site, through your CE to your batch system. If manual submission does not work, it will probably not work for the GlideinWMS pilot factory either.
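One way to exercise the full submission path is the `condor_ce_trace` utility shipped with HTCondor-CE, run from a host holding a valid credential (replace the hostname below, which is illustrative, with your CE's FQDN):

```shell
# Submits a short diagnostic job through the CE into the batch system
# and reports each stage; requires a valid grid proxy or credential.
condor_ce_trace --debug ce.example.org
```

A successful trace ends with the test job completing in your batch system; failures usually point at authentication or route configuration.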
Get test jobs
To begin running pilots at your site, e-mail email@example.com and ask for test pilots. Please provide them with the following information:
- The fully qualified domain name of the CE
- Resource name
- Supported OS version of your worker nodes (e.g., EL6, EL7, or both)
- Support for multicore jobs
- Maximum job walltime
- Maximum job memory usage
Once the factory team has enough information, they will start submitting pilots from the test factory to your CE. Initially, this will be one pilot at a time, but once the factory verifies that pilot jobs are running successfully, that number will be ramped up to 10, then 100.
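While the factory ramps up, you can watch pilots arrive using the HTCondor-CE query tools on the CE host (output format varies by HTCondor-CE version):

```shell
condor_ce_q        # pilot jobs currently queued or running at the CE
condor_ce_status   # resources known to the CE's collector
```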
Verify reporting and monitoring
To verify that your site is correctly reporting to the OSG, check OSG's Accounting Portal for records of your site reports (select your site from the drop-down box). If you have enabled the OSG VO, you can also check http://flock.opensciencegrid.org/monitoring/condor/sites/all_1day.html.
Scale Up to Full Production
After you have successfully run all the pilot jobs submitted by the test factory and verified your site's reporting, your site will be deemed production-ready. No action is required on your end; factory operations will start submitting pilot jobs from the production factory.
Maintain the Site
To avoid potential issues with OSG job submissions, please notify us of major changes to your site, including:
- Major OS version changes on the worker nodes (e.g., upgrading from EL6 to EL7)
- Adding or removing container support
- Policy changes regarding maximum walltime or memory usage
- Scheduled or unscheduled downtimes
- Site topology changes such as additions, modifications, or retirements
- Changes to site contacts, such as administrative or security staff
It is also important to keep your software and data (e.g., CA certificates and the VO client) up to date with the latest OSG release. To stay abreast of software releases, we recommend subscribing to the firstname.lastname@example.org mailing list.
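Because the OSG repositories were enabled during installation, routine updates flow through Yum; a periodic update on each host might look like the following, after reviewing the release notes for the target OSG version:

```shell
# Pull in updated packages, including those from the OSG repositories.
yum update -y
```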
If you need help with your site, or need to report a security incident, follow the contact instructions.