Troubleshooting Gratia Accounting
This document will help you troubleshoot problems with Gratia accounting, particularly problems in collecting accounting information and reporting it to the central OSG accounting service.
Gratia/GRACC: The Big Picture
Gratia is software used in OSG to gather accounting information. The information is collected from individual resources at a site, such as a Compute Element or a submission host. The program that collects the data is called a "Gratia probe". The information is transferred to a GRACC server. Most sites will choose to send the accounting data to the central OSG Gratia server, but you can also use a Gratia server at your site (which can forward the data to the central OSG Gratia server). Here is a diagram:
Difference between Gratia and GRACC
Gratia is the legacy name of the OSG Accounting system. GRACC is the new name of the server and hosted components of the accounting system. When we refer to Gratia, we mean either the data or the probes on the resources. If we mention GRACC, we are referring to the hosted components that the OSG maintains.
These are the definitions of the major elements in the above figure.
- Gratia probe: A piece of software that collects accounting data from the computer on which it's running, and transmits it to a Gratia server.
- GRACC server: A server that collects Gratia accounting data from one or more sites and can share it with users via a web page. The GRACC server is hosted by the OSG.
- Reporter: A web service running on the GRACC server. Users can connect to the reporter via a web browser to explore the Gratia data.
- Collector: A web service running on the GRACC server that collects data from one or more Gratia probes. Users do not directly interact with the collector.
You can see the OSG's GRACC website at https://gracc.opensciencegrid.org.
You can see a fancier version of the Gratia data at https://display.opensciencegrid.org/. This is not running a Gratia collector, but is a separate service.
Gratia Probes are periodically run as cron jobs, but different probes will run at different intervals. The cron jobs will always run and you should not remove them. You can find them in
However, the cron jobs will only do anything if you have enabled them. You enable them via an init script. For example, to enable them:
root@host # service gratia-probes-cron start
Enabling gratia probes cron: [ OK ]
To disable them:
root@host # service gratia-probes-cron stop
Disabling gratia probes cron: [ OK ]
You also need to enable individual probes, usually via osg-configure. Using osg-configure with Gratia is documented elsewhere.
Running Gratia Probes
When the cron jobs are enabled and run, they go through the following process, with minor changes between different Gratia probes:
- The probe is invoked. It reads its configuration from
- It collects the accounting information from the underlying system. For example, the Condor probe will read it from the
PER_JOB_HISTORY_DIR, which is usually
- It transforms the data into Gratia records and saves them into
- When there are sufficient Gratia records, or when sufficient time has passed, it uploads sets of records in batches to the GRACC server, then removes them from the
- All progress is logged to
- If there are failures in uploading the files to the GRACC server:
  - Files are not removed from gratiafiles until they are successfully uploaded.
  - Errors are logged to log files in
  - The uploads will be tried again later.
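The retry behaviour in the last step can be illustrated with a small sketch. This is not the probe's actual code: `flush_outbox` and the `upload` callable are hypothetical names, and the upload is simulated so the example runs anywhere. It only shows the rule that a record file is deleted after a successful upload and otherwise stays in place for the next run.

```python
import os
import tempfile


def flush_outbox(outbox_dir, upload):
    """Try to upload every record file; delete a file only after success.

    `upload` is a hypothetical callable that takes a file path and returns
    True on success. It stands in for the real transfer to the GRACC server.
    Returns two lists of file names: (sent, kept).
    """
    sent, kept = [], []
    for name in sorted(os.listdir(outbox_dir)):
        path = os.path.join(outbox_dir, name)
        if not os.path.isfile(path):
            continue
        try:
            ok = upload(path)
        except OSError:
            ok = False  # e.g. a network error; keep the record for a retry
        if ok:
            os.remove(path)
            sent.append(name)
        else:
            kept.append(name)
    return sent, kept


def demo():
    """Simulate one failed run followed by one successful run."""
    with tempfile.TemporaryDirectory() as outbox:
        for name in ("r.1.xml", "r.2.xml"):
            with open(os.path.join(outbox, name), "w") as handle:
                handle.write("<record/>")
        # First run: the "server" is unreachable, so nothing is deleted.
        first = flush_outbox(outbox, lambda path: False)
        # Second run: uploads succeed and the outbox is emptied.
        second = flush_outbox(outbox, lambda path: True)
        return first, second, os.listdir(outbox)


print(demo())
```

The practical consequence is that a temporary outage does not lose accounting data, but a long outage lets record files accumulate on disk.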
Gratia Probe Configuration
In normal cases, osg-configure does the editing of the probe configuration files, at least on the CE. The configuration is found in /etc/osg/config.d/30-gratia.ini and documented elsewhere.
If there are problems or special configuration, you might need to edit the Gratia configuration files yourself. Each probe has a separate configuration file found in
The ProbeConfig files contain many options. A few that you might need to edit are shown below. This is not a complete file; it shows only a subset of the options.
<ProbeConfiguration
    CollectorHost="gratia-osg-itb.opensciencegrid.org:80"
    SSLHost="gratia-osg-itb.opensciencegrid.org:80"
    SSLRegistrationHost="gratia-osg-itb.opensciencegrid.org:80"
    ProbeName="condor:fermicloud084.fnal.gov"
    SiteName="WISC_OSG_EDU"
    EnableProbe="1"
/>
The options you see here are:
|CollectorHost||The GRACC server this probe reports to|
|SSLHost||The GRACC server this probe reports to|
|SSLRegistrationHost||The GRACC server this probe reports to|
|ProbeName||The unique name for this probe. Note that it includes the probe type and the host name|
|SiteName||The name of your site, as registered in OIM. Your site must be registered in OIM|
|EnableProbe||The probe will only run if this is "1"|
Again, there are many more options in this file. Most of the time you won't need to touch them.
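If you want to inspect these options without opening the file, a short script can read them. The sketch below uses Python's standard XML parser on an inline copy of the example above; `summarize_probe_config` is a hypothetical helper name, not part of Gratia, and a real ProbeConfig contains many more attributes than the six shown here.

```python
import xml.etree.ElementTree as ET

# Inline copy of the example ProbeConfig subset from this document.
SAMPLE = """<ProbeConfiguration
    CollectorHost="gratia-osg-itb.opensciencegrid.org:80"
    SSLHost="gratia-osg-itb.opensciencegrid.org:80"
    SSLRegistrationHost="gratia-osg-itb.opensciencegrid.org:80"
    ProbeName="condor:fermicloud084.fnal.gov"
    SiteName="WISC_OSG_EDU"
    EnableProbe="1"
/>"""

OPTIONS = (
    "CollectorHost",
    "SSLHost",
    "SSLRegistrationHost",
    "ProbeName",
    "SiteName",
    "EnableProbe",
)


def summarize_probe_config(xml_text):
    """Return the options discussed above, plus a boolean 'enabled' flag."""
    root = ET.fromstring(xml_text)
    summary = {name: root.get(name) for name in OPTIONS}
    # The probe only runs when EnableProbe is exactly "1".
    summary["enabled"] = root.get("EnableProbe") == "1"
    return summary


for key, value in summarize_probe_config(SAMPLE).items():
    print(key, "=", value)
```

To use it on a real file, read the file's text first and pass it to `summarize_probe_config`.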
Are the Gratia cron jobs running?
You should make sure the Gratia cron jobs are running. The simplest way is to check the status of the init script:
root@host # /sbin/service gratia-probes-cron status
gratia probes cron is enabled.
If it is not enabled, enable it as described above.
A future release of Gratia will provide status on each of the individual probes, but right now this only ensures that the basic cron job is running. In the meantime, you can check if the individual Gratia probes are enabled. To do this, look at the EnableProbe option in the ProbeConfig file, as described above. A quick command to do this is shown here. Note that the Condor and GridFTP Transfer probes are enabled while the glexec probe is disabled:
root@host # cd /etc/gratia
root@host # grep -r EnableProbe *
condor/ProbeConfig: EnableProbe="1"
glexec/ProbeConfig: EnableProbe="0"
gridftp-transfer/ProbeConfig: EnableProbe="1"
If you see no log files in /var/log/gratia, you may have an error in the probe configuration file. Manually run the check for your probe; for the Condor probe, that is /usr/share/gratia/common/cron_check /etc/gratia/condor/ProbeConfig. If there is an error, you may get a suggestion on where it is, e.g.:
root@host # /usr/share/gratia/common/cron_check /etc/gratia/condor/ProbeConfig
Parse error in /etc/gratia/condor/ProbeConfig: not well-formed (invalid token): line 21, column 4
Correct the error and restart Gratia.
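A "not well-formed" message means the file is no longer valid XML, typically because of a missing quote or a stray character. As an independent cross-check, the following sketch runs Python's standard XML parser over a text and reports where parsing stops. It is not what cron_check does internally; `find_xml_error` is a hypothetical helper, and the broken sample is invented for the demonstration.

```python
import xml.etree.ElementTree as ET

GOOD = '<ProbeConfiguration EnableProbe="1" SiteName="WISC_OSG_EDU" />'
# Invented example of a damaged file: a stray "<" inside the tag.
BROKEN = '<ProbeConfiguration EnableProbe="1" <oops />'


def find_xml_error(xml_text):
    """Return None if the text is well-formed, else (line, column, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


print(find_xml_error(GOOD))
print(find_xml_error(BROKEN))
```

The reported line and column point at the place where the parser gave up, which is usually at or just after the actual mistake.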
Have you configured the resource names correctly?
Do the names of your resources match the names in OIM?
Gratia retrieves the resource name from the
Site Information section of the
;===================================================================
;                       Site Information
;===================================================================

[Site Information]
; The group option indicates the group that the OSG site should be listed in,
; for production sites this should be OSG, for vtb or itb testing it should be
; OSG-ITB
;
; YOU WILL NEED TO CHANGE THIS
group = OSG

; The host_name setting should give the host name of the CE that is being
; configured, this setting must be a valid dns name that resolves
;
; YOU WILL NEED TO CHANGE THIS
host_name = tusker-gw1.unl.edu

; The resource setting should be set to the same value as used in the OIM
; registration at the goc
;
; YOU WILL NEED TO CHANGE THIS
resource = Tusker-CE1

; The resource_group setting should be set to the same value as used in the OIM
; registration at the goc
;
; YOU WILL NEED TO CHANGE THIS
resource_group = Tusker
Do those names match the names that you registered with OIM? If not, edit the names, and rerun "osg-configure -c".
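The comparison can also be scripted. The sketch below reads the Site Information section with Python's standard configparser and compares it against the values you expect from your OIM registration. The helper names are hypothetical, the sample is a shortened inline copy of the example above, and the "registered" values are ones you would type in yourself; nothing here queries OIM.

```python
import configparser

# Shortened inline copy of the [Site Information] example above.
SAMPLE_INI = """
[Site Information]
; production sites use OSG, ITB testing uses OSG-ITB
group = OSG
host_name = tusker-gw1.unl.edu
resource = Tusker-CE1
resource_group = Tusker
"""

KEYS = ("group", "host_name", "resource", "resource_group")


def read_site_names(ini_text):
    """Return the settings from [Site Information] that Gratia depends on."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    section = parser["Site Information"]
    return {key: section.get(key) for key in KEYS}


def mismatches(configured, registered):
    """Map each differing key to a (configured, registered) pair."""
    return {
        key: (configured.get(key), value)
        for key, value in registered.items()
        if configured.get(key) != value
    }


names = read_site_names(SAMPLE_INI)
print(names)
# Values below are what you believe is registered in OIM (entered by hand).
print(mismatches(names, {"resource": "Tusker-CE1", "resource_group": "Tusker"}))
```

An empty result means the configured names agree with the values you supplied.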
Did the site name change?
Was the site previously reporting data, but the site name (not host name, but site name) changed? When the site name changes, you need to ask the GRACC operations team to update the name of your site at the GRACC collector. To do this:
- Open a support ticket
- Select "Software or Service"
- Select "GRACC Operations"
- Type a friendly email that asks the GRACC team to change your site name at the collector. Make sure to tell them the old name and the new name. Below is an example email:
Hello GRACC Team,

Please change the site name of my site from <Insert Old Name> to <Insert New Name>.

Thanks,
...
Is a site reporting data?
You can see if the OSG GRACC Server is getting data from a site by going to GRACC:
- Specify the site name in Facility
HTCondor's Gratia Configuration
Only applicable to HTCondor batch sites, not SLURM, PBS, SGE or LSF sites
Condor must be configured to put information about each job into a special directory. Gratia will read and remove the files in order to collect the accounting information.
The configuration variable is called PER_JOB_HISTORY_DIR. If you install the OSG RPM for Condor, the Gratia probe will extend its configuration by adding a file to /etc/condor/config.d, and will set this variable to /var/lib/gratia/data. If you are using a different installation method, you may need to set the variable yourself. You can check if it's set by using condor_config_val, like this:
user@host $ condor_config_val -v PER_JOB_HISTORY_DIR
PER_JOB_HISTORY_DIR: /var/lib/gratia/data
  Defined in '/etc/condor/config.d/99_gratia.conf', line 5.
If you set this value, you need to restart Condor:
root@host # condor_restart
Sent "Restart" command to local master
Unlike many Condor settings, a condor_reconfig is not sufficient - you must restart!
If you accidentally did not set PER_JOB_HISTORY_DIR (see above)
The HTCondor Gratia probe will not publish accounting information about jobs without PER_JOB_HISTORY_DIR. You can have Gratia read the Condor history file and publish data that way. If you know the time period of the missing data, you should specify start and end times, which reduces the load on the Gratia collector. To do so:
Preferred method, using start and end times:
root@host # /usr/share/gratia/condor/condor_meter --history --start-time="2014-06-01" --end-time="2014-06-02" --verbose
2014-06-03 10:00:36 CDT Gratia: RUNNING condor_meter MANUALLY using HTCondor history from 2014-06-01 to 2014-06-02
2014-06-03 10:00:36 CDT Gratia: RUNNING: condor_history -l -constraint '((JobCurrentStartDate > 1401598800) && (JobCurrentStartDate < 1401685200))'
2014-06-03 10:00:49 CDT Gratia: condor_meter --history: Usage records submitted: 399
2014-06-03 10:00:49 CDT Gratia: condor_meter --history: Usage records found: 400
2014-06-03 10:00:49 CDT Gratia: RUNNING condor_meter MANUALLY Finished

Or, if you need to go back to the beginning of time:
root@host # /usr/share/gratia/condor/condor_meter --history --verbose
2014-06-03 10:06:19 CDT Gratia: RUNNING condor_meter MANUALLY using all HTCondor history
2014-06-03 10:06:19 CDT Gratia: RUNNING: condor_history -l
2014-06-03 10:11:38 CDT Gratia: condor_meter --history: Usage records submitted: 13026
2014-06-03 10:11:38 CDT Gratia: condor_meter --history: Usage records found: 13027
2014-06-03 10:11:38 CDT Gratia: RUNNING condor_meter MANUALLY Finished
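The numbers in the condor_history constraint are Unix timestamps. The sketch below reproduces them from the two dates, assuming (as the output above suggests, but this document does not state) that each date is taken as midnight in the host's local time zone, which was CDT (UTC-5) in this example. `history_constraint` is a hypothetical helper written for illustration, not part of condor_meter.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the example host ran in Central Daylight Time (UTC-5).
CDT = timezone(timedelta(hours=-5), "CDT")


def history_constraint(start_date, end_date, tz=CDT):
    """Build a JobCurrentStartDate constraint from two YYYY-MM-DD dates."""

    def to_epoch(day):
        # Interpret the date as local midnight, then convert to Unix time.
        local_midnight = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=tz)
        return int(local_midnight.timestamp())

    return "((JobCurrentStartDate > {}) && (JobCurrentStartDate < {}))".format(
        to_epoch(start_date), to_epoch(end_date)
    )


print(history_constraint("2014-06-01", "2014-06-02"))
# prints ((JobCurrentStartDate > 1401598800) && (JobCurrentStartDate < 1401685200))
```

This is useful for checking that the window you requested covers the jobs you expect: a job that started outside the two timestamps is not picked up by that run.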
Not much is printed to the screen, but you can see progress in the Gratia log file:
13:35:28 CDT Gratia: Initializing Gratia with /etc/gratia/condor/ProbeConfig
13:35:28 CDT Gratia: Creating a ProbeDetails record 2012-04-04T18:35:28Z
13:35:28 CDT Gratia: ***********************************************************
13:35:28 CDT Gratia: OK - Handshake added to bundle (1/100)
13:35:28 CDT Gratia: ***********************************************************
13:35:28 CDT Gratia: List of backup directories: [u'/var/lib/gratia/tmp']
13:35:28 CDT Gratia: Reprocessing response: OK - Reprocessing 0 record(s) uploaded, 0 bundled, 0 failed
13:35:28 CDT Gratia: After reprocessing: 0 in outbox 0 in staged outbox 0 tar files
13:35:28 CDT Gratia: Creating a UsageRecord 2012-04-04T18:35:28Z
...
13:35:29 CDT Gratia: Processing bundle file:
13:35:29 CDT Gratia: Processing bundle file: /var/lib/gratia/tmp/gratiafiles/
    subdir.condor_fermicloud084.fnal.gov_gratia-osg-itb.opensciencegrid.org_80/
    outbox/r.18425.condor_fermicloud084.fnal.gov_gratia-osg-itb.opensciencegrid.org_80.gratia.xml__BSuXo18428
...
13:35:29 CDT Gratia: ***********************************************************
13:35:29 CDT Gratia: Removing log files older than 31 days from /var/log/gratia
13:35:29 CDT Gratia: /var/log/gratia uses 0.035% and there is 73% free
13:35:29 CDT Gratia: Removing incomplete data files older than 31 days from /var/lib/gratia/data/
13:35:29 CDT Gratia: /var/lib/gratia/data uses 0% and there is 73% free
13:35:29 CDT Gratia: End of execution summary: new records sent successfully: 37
Condor rotates history files, so you can only report what Condor has kept. Controlling the Condor history is documented in the Condor manual. In particular, see the options for MAX_HISTORY_LOG and MAX_HISTORY_ROTATIONS.
Bad Gratia hostname
This is an example problem where the configuration was bad: there was an incorrect hostname for the Gratia server. The problem is clearly visible in the Gratia log file, which is located in /var/log/gratia/. There is one log file per day, labeled by the date:
root@host # cd /var/log/gratia/
root@host # cat 2012-04-03.log
...
You can see that Gratia is using the correct configuration file:
15:06:55 CDT Gratia: Using config file: /etc/gratia/condor/ProbeConfig
Here Gratia is removing a file from the Condor PER_JOB_HISTORY_DIR and creating a Gratia accounting record for it:
15:06:55 CDT Gratia: Creating a UsageRecord 2012-04-03T20:06:55Z
15:06:55 CDT Gratia: Registering transient input file: /var/lib/gratia/data/history.37.0
15:06:55 CDT Gratia: ***********************************************************
15:06:55 CDT Gratia: Saved record to /var/lib/gratia/tmp/gratiafiles/
    subdir.condor_fermicloud084.fnal.gov_ggratia-osg-itb.opensciencegrid.org_80/
    outbox/r.30604.condor_fermicloud084.fnal.gov_ggratia-osg-itb.opensciencegrid.org_80.gratia.xml__wfIgi30606
15:06:55 CDT Gratia: Deleting transient input file: /var/lib/gratia/data/history.37.0
Later, Gratia failed to connect to the server due to a bad hostname:
15:06:55 CDT Gratia: Failed to send xml to web service due to an error of type "socket.gaierror": (-2, 'Name or service not known')
...
15:06:55 CDT Gratia: Response indicates failure, the following files will not be deleted:
15:06:55 CDT Gratia: /var/lib/gratia/tmp/gratiafiles/
    subdir.condor_fermicloud084.fnal.gov_ggratia-osg-itb.opensciencegrid.org_80/
    outbox/r.30604.condor_fermicloud084.fnal.gov_ggratia-osg-itb.opensciencegrid.org_80.gratia.xml__wfIgi30606
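When the log is long, it can help to pull out only the lines that signal a failed upload. The sketch below searches for the two messages visible in the log above. The sample lines are copied from this document; `find_upload_failures` is a hypothetical helper, and other failure modes may use different wording that this pattern list would miss.

```python
import re

# A few lines taken from the example log in this document.
SAMPLE_LOG = """\
15:06:55 CDT Gratia: Using config file: /etc/gratia/condor/ProbeConfig
15:06:55 CDT Gratia: Creating a UsageRecord 2012-04-03T20:06:55Z
15:06:55 CDT Gratia: Failed to send xml to web service due to an error of type "socket.gaierror": (-2, 'Name or service not known')
15:06:55 CDT Gratia: Response indicates failure, the following files will not be deleted:
"""

# Only the two messages seen in the example; extend as needed.
FAILURE_PATTERNS = (
    re.compile(r"Failed to send xml"),
    re.compile(r"Response indicates failure"),
)


def find_upload_failures(log_text):
    """Return (line_number, line) pairs for lines that report an upload failure."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in FAILURE_PATTERNS):
            hits.append((number, line))
    return hits


for number, line in find_upload_failures(SAMPLE_LOG):
    print(number, line)
```

A "Name or service not known" error, as here, points at the hostname in the probe configuration rather than at the collector itself.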
If you accidentally had a bad Gratia hostname, you probably want to recover your Gratia data.
This can be done, though it's not simple. There are a few things you need to do. But first, you need to understand exactly where Gratia stores files.
When a Gratia probe extracts accounting information, it creates one file per record and stores it in a directory. The directory has a long name that contains the type of the probe (such as condor), the name of the host you're running on, and the name of the GRACC host you're sending the information to. For simplicity, let's call that name probe-records, but you'll see what it really looks like below. Within this directory, you'll see some subdirectories:
|/var/lib/gratia/tmp/gratiafiles/probe-records/outbox||The usual location for the accounting records|
|/var/lib/gratia/tmp/gratiafiles/probe-records/staged/store||An overflow location when there are problems|
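To make the probe-records name concrete, the sketch below builds it from the ProbeName and CollectorHost values used earlier in this document. The naming rule is inferred from the paths that appear in the example logs (a subdir. prefix, with colons replaced by underscores); it is an observation, not a documented guarantee, and `probe_records_dir` is a hypothetical helper.

```python
def probe_records_dir(probe_name, collector_host,
                      base="/var/lib/gratia/tmp/gratiafiles"):
    """Build the per-probe record directory as it appears in the example logs.

    Inferred pattern: subdir.<probe name>_<collector host>, with every ":"
    replaced by "_".
    """
    safe_probe = probe_name.replace(":", "_")
    safe_collector = collector_host.replace(":", "_")
    return "{}/subdir.{}_{}".format(base, safe_probe, safe_collector)


records = probe_records_dir(
    "condor:fermicloud084.fnal.gov",
    "gratia-osg-itb.opensciencegrid.org:80",
)
print(records + "/outbox")
print(records + "/staged/store")
```

Because the collector host is part of the name, a misconfigured hostname produces a separate directory, which is why records written under the wrong name are not picked up after the configuration is corrected.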
When you recover old records, you need to:
- Move files from the outbox of the incorrect probe-records directory into the outbox of the correctly named probe-records directory.
- Move tarred and compressed files from the staged/store of the incorrect probe-records directory into the staged/store of the correctly named probe-records directory. Then you uncompress them and remove the compressed version.
In the examples below, the hostname for gratia was "accidentally" spelled backwards. Instead of
gratia-osg-itb.opensciencegrid.org, it was
First you need to fix the hostname. For a CE, you can edit
osg-configure -c. In other installations, you have to edit the appropriate
Next, submit a job to your batch system, then run the appropriate Gratia probe (or wait for it to run via cron). This will create the properly named directories on your disk. For example:
As a user:
user@host $ globus-job-run fermicloud084.fnal.gov/jobmanager-condor /bin/hostname
As root (adjust for your batch system):
root@host # /usr/share/gratia/condor/condor_meter
Find the Gratia records that can be easily uploaded. They are located in a directory with an unwieldy name that includes your hostname and the incorrect name of the Gratia host. You can see the directory name in the Gratia log. In the example below the misspelled name is aitarg-osg-itb.opensciencegrid.org, but it will be different on your computer.
user@host $ less /var/log/gratia/2012-04-06
...
16:04:29 CDT Gratia: Response indicates failure, the following files will not be deleted:
16:04:29 CDT Gratia: /var/lib/gratia/tmp/gratiafiles/
    subdir.condor_fermicloud084.fnal.gov_aitarg-osg-itb.opensciencegrid.org_80/
    outbox/r.916.condor_fermicloud084.fnal.gov_aitarg-osg-itb.opensciencegrid.org_80.gratia.xml__JDlHbNb918
(The filename was wrapped for legibility.)
You can simply copy these to the correct directory. Wait for the Gratia cron job to run, or force it to run.
If this has been a persistent problem, you might have many records. After a while, they are put into compressed files in another directory. You can move those files, then uncompress them. The directory name is long; note that the path ends in "staged/store" instead of "outbox" as above:
# Find the old files
root@host # cd /var/lib/gratia/tmp/gratiafiles/subdir.condor_fermicloud084.fnal.gov_aitarg-osg-itb.opensciencegrid.org_80/staged/store
# Move them to the correct directory
root@host # mv tz* /var/lib/gratia/tmp/gratiafiles/subdir.condor_fermicloud084.fnal.gov_gratia-osg-itb.opensciencegrid.org_80/outbox/.
root@host # cd !$
# For each tz file:
root@host # tar xf tz.1223.... [name shortened for legibility]
root@host # rm tz.1223....
When you've done this, you can re-run the Gratia probe by hand, or wait for it to run via cron.
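If there are many record files in the wrongly named outbox, relocating them can be scripted. The sketch below is one cautious way to do it, not an official Gratia tool: `move_records` is a hypothetical helper, it moves files rather than copying them, it never overwrites a file that already exists in the target, and it defaults to a dry run. The demonstration uses temporary directories so it is safe to run anywhere.

```python
import os
import shutil
import tempfile


def move_records(wrong_outbox, right_outbox, dry_run=True):
    """Move record files between outboxes without overwriting anything.

    Returns the names that were (or, in a dry run, would be) moved.
    """
    moved = []
    for name in sorted(os.listdir(wrong_outbox)):
        source = os.path.join(wrong_outbox, name)
        target = os.path.join(right_outbox, name)
        # Skip subdirectories and anything already present in the target.
        if not os.path.isfile(source) or os.path.exists(target):
            continue
        if not dry_run:
            shutil.move(source, target)
        moved.append(name)
    return moved


def demo():
    """Run a dry run and then a real move inside a temporary directory."""
    with tempfile.TemporaryDirectory() as base:
        wrong = os.path.join(base, "wrong", "outbox")
        right = os.path.join(base, "right", "outbox")
        os.makedirs(wrong)
        os.makedirs(right)
        for name in ("r.1.gratia.xml", "r.2.gratia.xml"):
            with open(os.path.join(wrong, name), "w") as handle:
                handle.write("<record/>")
        planned = move_records(wrong, right, dry_run=True)
        moved = move_records(wrong, right, dry_run=False)
        return planned, moved, sorted(os.listdir(wrong)), sorted(os.listdir(right))


print(demo())
```

On a real host you would pass the two outbox paths from your own log files and review the dry-run list before running with `dry_run=False`.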
Reference: Important Gratia files
If you need to look for more data, you can look at log files for the various services on your CE.
||Log file that records information about processing and uploading of Gratia accounting data|
||Log file specific to the Gratia GridFTP probe|
||Location for Condor and PBS job data before being processed by Gratia|
||Location for temporary Gratia data as it is being processed, usually empty. If you have files that are more than 30 minutes old in this directory, there may be a problem|
||Configuration for Gratia probes, one per probe type. Normally you don't need to edit this|
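The 30-minute rule of thumb for the temporary data directory is easy to check with a script. The sketch below lists files whose modification time is older than a threshold. The directory is a parameter because you should supply the path from your own installation; `stale_files` is a hypothetical helper, and the demonstration uses a temporary directory with an artificially aged file.

```python
import os
import tempfile
import time


def stale_files(directory, max_age_seconds=30 * 60, now=None):
    """Return paths under `directory` last modified more than the limit ago."""
    now = time.time() if now is None else now
    stale = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if now - os.path.getmtime(path) > max_age_seconds:
                stale.append(path)
    return sorted(stale)


def demo():
    """Create one fresh and one hour-old file, then report the stale one."""
    with tempfile.TemporaryDirectory() as directory:
        old_path = os.path.join(directory, "old.xml")
        new_path = os.path.join(directory, "new.xml")
        for path in (old_path, new_path):
            with open(path, "w") as handle:
                handle.write("<record/>")
        an_hour_ago = time.time() - 3600
        os.utime(old_path, (an_hour_ago, an_hour_ago))
        return [os.path.basename(path) for path in stale_files(directory)]


print(demo())
```

A non-empty result is a hint to look at the Gratia log for upload failures, not proof of a problem by itself.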
Not all RPMs will be on all hosts. Instead, only the gratia-probe-common and the one RPM specific to that host will be installed. The most common RPMs you will see are:
||Code shared between all Gratia probes|
||The probe that tracks Condor usage|
||The probe that tracks SLURM usage|
||The probe that tracks PBS and/or LSF usage|
||The probe that tracks transfers done with GridFTP|