GOC Condor Collector Service Level Agreement
About This Document
This document details a service level agreement which outlines production expectations for the GOC Condor Collector and defines general support infrastructure for the service.
|1.1||9-24-2014||Scott Teige||First Draft|
This SLA is an agreement between OSG and OSG Stakeholders pertaining to the operation of the OSG Condor Collector service.
The Condor Collector SLA is owned by OSG Operations. It will be reviewed by and agreed upon by the OSG Executive Team.
Service Name and Description
GOC Condor Collector
The grid-facing component of the HTCondor compute element system available for use by all OSG VOs.
Service Target Response Priorities and Response Times
This section deals with unplanned outages. Please see Requests for Service Enhancement for information on planned maintenance outages.
|Description||* *||* *||* *|
|This service does not have critical priority||This service does not have high priority||This issue prevents all new sites from advertising in the info service or clients from querying the service. This issue causes intermittent drops of updates or query failures.||The issue prevents effective monitoring or full utilization of the Condor pool|
|Response Time||* *||* *||* *|
|N/A||N/A||Within 2 business days||Within 10 business days|
|Resolution Time||* *||* *||* *|
|N/A||N/A||The maximum acceptable resolution time is five (5) business days||The maximum acceptable resolution time is ten (10) business days|
|Escalates Every||* *||* *||* *|
|N/A||N/A||Five Days||Ten Days|
|Escalation Level||OSG Contact|
|1st||OSG Operations Infrastructure Lead|
|2nd||OSG Operations Coordinator|
|3rd||OSG Production Coordinator|
|4th||OSG Technical Director and Executive Director|
Critical Level Security Issues will be elevated immediately to the OSG Security Team.
Detailed information on contacts are viewable on the following MyOSG URL, and are maintained within the
Service Availability and Outages
The GOC will strive for 90% service availability. If service availability falls below 90% monthly as monitored by the GOC a root cause analysis and service plan will be submitted to the OSG stakeholders for plans to restore an acceptable level of service availability.
Reliability and availability will be determined by 2 critical RSV probes and the OSG availability algorithm. * One probe will verify the Condor collector is queryable whenever the host is pingable (i.e., reasonably not a network failure). * The other probe will verify an automatically generated configuration file has been updated recently
Service Support Hours
This service will be run for 24x7, but support will primarily be within business hours. The exception is for security incidents.
Service Off-Hours Support Procedures
All operational issues should be reported as per Customer Problem Reporting section.
Requests for Service Enhancements
OSG Operations will provide enhancement capabilities only as they are released by the HTCondor project or OSG Technology Group. Requests for customization of the deployed service may be made; OSG Operations has the right of up to one month of testing of any change.
OSG will give 1 week of warning prior to any change in the Collector service version. At any time during this week, Stakeholders are permitted to request a delay for up to 1 week after the originally scheduled upgrade.
OSG Operations will schedule downtimes and configuration changes during normal business hours unless approved by affected stakeholders. This is done so affected stakeholders are always on-hand in case if the downtime and changes cause further issues.
Customer Problem Reporting
Problems should be reported immediately at https://support.opensciencegrid.org
- The Collector will be run by OSG Operations and accessible to HTCondor compute elements.
- Problems not associated with the hardware, operating system or network are the responsibilty of the Technology Group.
Service Measuring and Reporting
The GOC will provide the following reports in the intervals indicated (monthly, quarterly, semi-annually, or annually):
|Report Name||Reporting Interval||Delivery Method||Responsible Party|
|System Availability and Reliability||Monthly||Web Posting||GOC|
SLA Validity Period
This SLA will be in affect for one year.
SLA Review Procedure
This SLA will renew automatically on a yearly basis unless change or update is requested by the OSG Operations Coordinator, the OSG Executive Team or the Stakeholders.
Appendix A - Approval
<-- CONTENT MANAGEMENT PROJECT
DEAR DOCUMENT OWNER
Thank you for claiming ownership for this document Please fill in your FirstLast name here: * Local OWNER = RobQ
Please define the document area, choose one of the defined areas from the next line DOC_AREA = (ComputeElement|General|Trash/Trash/Integration|Monitoring|Operations|Security|Storage|Trash/Tier3|User|VO) * Local DOC_AREA = Operations
define the primary role the document serves, choose one of the defined roles from the next line DOC_ROLE = (Developer|Documenter|Scientist|Student|SysAdmin|VOManager) * Local DOC_ROLE = VOManager
Please define the document type, choose one of the defined types from the next line DOC_TYPE = (HowTo|Installation|Knowledge|Navigation|Planning|Training|Troubleshooting) * Local DOC_TYPE = Knowledge
Please define if this document in general needs to be reviewed before release ( YES | NO ) * Local INCLUDE_REVIEW = YES
Please define if this document in general needs to be tested before release ( YES | NO ) * Local INCLUDE_TEST = NO
change to YES once the document is ready to be reviewed and back to NO if that is not the case * Local REVIEW_READY = YES
change to YES once the document is ready to be tested and back to NO if that is not the case * Local TEST_READY = NO
change to YES only if the document has passed the review and the test (if applicable) and is ready for release * Local RELEASE_READY = NO
DEAR DOCUMENT REVIEWER
Thank for reviewing this document Please fill in your FirstLast name here: * Local REVIEWER = ScottTeige
Please define the review status for this document to be in progress ( IN_PROGRESS ), failed ( NO ) or passed ( YES ) * Local REVIEW_PASSED = YES
DEAR DOCUMENT TESTER
Thank for testing this document Please fill in your FirstLast name here: * Local TESTER = ScottTeige
Please define the test status for this document to be in progress ( IN_PROGRESS ), failed ( NO ) or passed ( YES ) * Local TEST_PASSED = IN_PROGRESS