Contact
Stephen Fralich
Details
Hyak_CE
OSG
GOC Ticket/submit
Stephen Fralich
UW-IT
Problem/Request
Normal
Closed
Waiting for fact ops answer
2015-11-29
Assignees
Kyle Gross / OSG GOC Support Team
OSG Glidein Factory Support / OSG Support Centers
Software Support (Triage) / OSG Software Team
Tim Cartwright / OSG Software Team
Brian Lin / OSG Software Team

Past Updates
by Kyle Gross 
Since it has been 6 days with no response, I will close this ticket.
 
Hi Brian,

After moving the GRAM-CE entry over to the SDSC factory, we also still see no reproduction of the original problem at Hyak. So we're in favor of closing out this ticket and moving the investigation over to OSC. I will go ahead and disable this GRAM-CE test entry at Hyak on the SDSC and GOC-ITB factories.

Marty Kandes
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
by Brian Lin 
Hey all,

Since we're not having any luck reproducing the original issue with the ITB factory and the new GRAM CE, I propose that we close this ticket and let the Hyak admins take down the test CE (provided that they are content with their HTCondor-CE). We can continue to try to reproduce the original issue but I'm not confident that we'll be able to duplicate the state which caused it. It also strikes me as a waste of effort when we could be troubleshooting the glidein problems at OSC. If anyone would like this ticket to remain open, please comment by EOB tomorrow (11/25), otherwise I will close it.

- Brian
 
Hi Kyle,

Sorry, we haven't had a chance to test the GRAM-CE entry on the SDSC factory yet. I'll make sure we get to it next week. No update from other ticket with similar issues as far as I know. However, we did email Tim Cartwright to see if we could get one of the GRAM experts to look at the other site where the problem is being reproduced currently. I don't believe we've heard back from him. Not sure if you can ping him about this too.

Marty Kandes
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
by Kyle Gross 
Marty,

Thanks for the last update.  I'm just doing my weekly pinging to see if you've done anything else since you've found the other ticket with the similar issue.

Thanks,

-Kyle
 
Hi Kyle,

Jeff and I have talked about this offline. As for an update, we're not able to reproduce the result with the GOC-ITB factory tests thus far. This may be due to it having more up-to-date globus packages that were not previously available when the problem here was first observed. We will attempt to re-test using the SDSC factory, which is still running on older packages. However, we have also found another instance of the same problem here, we believe: https://ticket.opensciencegrid.org/27288.

Marty Kandes
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
by Kyle Gross 
Jeff,

Have you seen Marty's question here?

-Kyle
 
Jeff,

Maybe I'm misremembering, but didn't we also update globus and glideinWMS on GOC after Hyak began having issues with GRAM-CE? i.e., around the same time we moved them to HTCondor-CE?

Marty Kandes
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
 
Hi all,

I've submitted a boat load of test jobs from our test frontend. We'll see how they go. I'll try and keep up the pressure on the GRAM-CE. Note, however, these are simple sleep jobs. It'd be preferable to also see glideins take up real user jobs if possible. If anyone would like to participate, please send jobs.

Thanks,

Marty Kandes
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
 
Hi Marty,

I don't think this is the case, because we saw the same behavior at the GOC factory before Hyak made the switch to HTCondor CE.

I agree with Brendan. Mats, can you submit more test jobs at a greater scale and over a longer time period to see if we can reproduce it?

Thanks,
Jeff

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost
 
Hi all,

Could this be because the GOC-ITB factory has more up-to-date globus packages? e.g. We still have much older ones on the SDSC factory, where this problem originally started. Perhaps we should try again there?

Marty Kandes
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
 
Apparently not. I can confirm what Marty said, that all of the glideins submitted three days ago completed successfully, although none of them ever received any user jobs to run. Mats, could you try sending a larger number of jobs to the site so we can see what happens with more load? Maybe a hundred or so (or more)?

Brendan Dennis
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Brendan Dennis 2659
by Brian Lin 
So we aren't able to reproduce the issues that we had on the old GRAM CE?
 
Edgar,

I don't think it's lying. Factory monitoring reports one OSG glidein currently running at Hyak via the GRAM gatekeeper. Overall, the logs for two of these OSG glideins have been returned to the factory so far. Both show no obvious problems, other than they sat idle, never receiving a user job.

Marty Kandes
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
 
Hi Fact Ops,

Can you update this ticket and let us know if gram is "lying" again?

Edgar
OSG Software Support

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Edgar Mauricio Fajardo Hernandez 2020
 
I missed that you wanted me to submit some jobs. I have now done so.

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Mats Rynge 45
 
Hey Edgar,

Yeah, we're waiting on Mats to get some OSG jobs submitted to the gram test entry, OSG_US_Hyak_osg_gramce. The site name on it is Hyak_CE.

Brendan Dennis
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Brendan Dennis 2659
 
Hi Brendan,

Are we waiting here for Mats to submit some jobs, and through which FE? To the new gram CE entry? What is the entry name? Site name?

Thanks,

Edgar
OSG Software Support

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Edgar Mauricio Fajardo Hernandez 2020
 
I've set up an entry on the GOC ITB factory called OSG_US_Hyak_osg_gramce to use for testing purposes. Mats, could you go ahead and get some jobs going its way? It should start failing to report properly relatively soon after that.

Brendan Dennis
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Brendan Dennis 2659
by [email protected] 
Yes, go ahead at your convenience. I just got tied up with some
operational stuff the end of last week. We do keep 4 weeks of logs on
disk and I have another 4 weeks on tape. Let me know if you need me to
do anything.

On Mon, Oct 19, 2015 at 11:58 AM, Open Science Grid FootPrints
<[email protected]> wrote:
> [Duplicate message snipped]
 
Hey Stephen,

Wanted to check in and see: are you ready for us to set up submission of test jobs to debug the GRAM issues?

Brendan Dennis
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Brendan Dennis 2659
by Tim Cartwright 
Thanks, Stephen. Later this week is totally fine. I have added Mat Selmeci to this ticket, as he will be leading the troubleshooting effort. Do you happen to keep Globus logs going back a ways, in case we need to get historical log data?
by [email protected] 
I'd be happy to help, but if you're going to need me to help, can we
hold off until Wednesday or Thursday? Quarterly maintenance is
Tuesday. I re-enabled GRAM Friday at some point and it seems to be
working.

On Fri, Oct 9, 2015 at 11:20 AM, Open Science Grid FootPrints
<[email protected]> wrote:
> [Duplicate message snipped]
 
Hey Stephen,

I think we're largely seeing the normal ebb and flow of submissions, although one other possible explanation is that I'm not sure if OSG's multicore VO is submitting as much and as consistently as the OSG singlecore VO, so that may also be why the amounts are fluctuating so much (Hyak is only configured on the factory side for the multicore VO).

On another note related to the GRAM issues, we've currently discovered two other sites suffering from what appears to be the same issue with GRAM, and we were wondering if you'd be open to helping debug whatever is causing these issues. If so, we'd want you to reenable GRAM, and we'd create a dedicated entry on our GOC ITB factory that'll submit to it so that we can start testing. If that's alright with you, I'll get everyone else added to the ticket so that we can get started.

Brendan Dennis
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Brendan Dennis 2659
by [email protected] 
That's good to hear. We were at capacity earlier today around 10
Pacific; the number of jobs then dropped, but is on the rise again.
Is this just due to the number of Glideins waiting to run across all
sites? We have almost 1000 Idle cores at the moment.

On Wed, Oct 7, 2015 at 1:03 PM, Open Science Grid FootPrints
<[email protected]> wrote:
> [Duplicate message snipped]
 
As an update from the factory perspective, Hyak is serving glideins optimally. Everything is looking good.

Jeff

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost
by [email protected] 
We're running OSG jobs now and those jobs are picking up glidein
tasks. Hopefully they're making it back to you. The local files are
being removed from the shared condor spool directory.

Yeah, the GUMS update prompted the update. I wasn't here, so I had one
of my colleagues perform the updates and told him to just update
everything for simplicity. He did that and re-ran osg-configure -c and
rebooted. The scheduler event generator did not work properly on
reboot. It was running, but was not doing anything. I restarted and it
started producing output again. Prior to the update we were running
OSG and GLOW jobs. After the update OSG and GLOW jobs would run, but
not return. FNAL jobs were running and returning post updates. Jeff
suggested our configuration never really worked properly. That's quite
disappointing because we invested quite a lot of our time and OSG
staff time on several previous occasions. You can review older
tickets opened by me if you'd like to see.

We run Moab 7.2.9 and Torque 4.2.9. They run on a different host. OSG
jobs are preemptable. Local customers run both preemptor and preemptee
jobs. We don't want OSG jobs running for more than 4 hours. OSG
glideins need to request an appropriate number of tasks for the number
of cores in the system on which they land. We worked through all that
in previous requests. Yes, we run GUMS. GUMS runs on the same host as
all the other OSG software. You can download the profile:
http://staff.washington.edu/sjf4/osg-profile.txt (My key is on my
workstation in my office and I am at home this morning, so I can't
attach it). Let me know if you'd like any other information.

On Wed, Oct 7, 2015 at 8:08 AM, Open Science Grid FootPrints
<[email protected]> wrote:
> [Duplicate message snipped]
by Tim Cartwright 
After reviewing this ticket in detail, I agree with Jeff's suggestion to try moving to HTCondor-CE, especially since that effort seems well underway at this time. If that path works out soon, then I am not sure what value there is in going back and doing forensics on the GRAM-CE.

Just for my own curiosity, what exact "security updates" announcement were you reacting to, and what hosts did you update? I am aware of the GUMS security fix, released on 8 September - is that what triggered your updates?

Also, could you explain a little about your overall system design? I see that you are running Moab/Torque; what version? Are you running GUMS? If so, is GUMS running on a host that is separate from the CE host? Also, it might be helpful to run the osg-system-profiler tool on your CE machine and attach the results to this ticket; doing so would provide a great deal of background information about your CE host and OSG software installation.

-- Tim Cartwright, OSG Software manager
 
Hi,

I just pushed the changes to production.

We'll keep an eye on it from the factory side.

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost
 
My latest test batch ran to completion, and the factory had no issues tracking the glideins using the HTCondor CE.

Tomorrow morning I will push the changes to the production factories.

Jeff

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost
by [email protected] 
It's all set for you to start testing again.

On Tue, Oct 6, 2015 at 3:40 PM, Open Science Grid FootPrints
<[email protected]> wrote:
> [Duplicate message snipped]
 
I believe you need to set this up with Job Router rules.

https://twiki.grid.iu.edu/bin/view/Documentation/Release3/JobRouterRecipes#Writing_multiple_routes

For the gram case, we agreed on a queue named "osg".
If you can set up something similar for HTCondor CE, we will restrict the factory side to only use this queue. I'll stop my tests until you can confirm that the job router rules are in place. If you need help, just let us know; I may need to add someone with more HTCondor CE expertise to the ticket.

Thanks,
Jeff

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost
by [email protected] 
They're not running in the right queue. I set up all this stuff for
the GRAM system. Where do I specify a default queue for the jobs and
other options?

On Tue, Oct 6, 2015 at 3:25 PM, Open Science Grid FootPrints
<[email protected]> wrote:
> [Duplicate message snipped]
 
Hi Stephen,

All my test pilots are already running at Hyak. However, they aren't able to connect back to our condor pool. If I recall, you need to explicitly whitelist IPs for outbound connections on your worker nodes? Can you please add the following host to the allowed outbound IPs?

test-frontend-1.t2.ucsd.edu 169.228.130.121

10/06/15 14:42:05 (pid:26877) attempt to connect to <169.228.130.121:9636> failed: Connection timed out (connect errno = 110).  Will keep trying for 300 total seconds (237 to go).

10/06/15 14:46:02 (pid:26877) attempt to connect to <169.228.130.121:9636> failed: Connection timed out (connect errno = 110).
10/06/15 14:46:02 (pid:26877) CCBListener: connection to CCB server test-frontend-1.t2.ucsd.edu:9636 failed; will try to reconnect in 60 seconds.

Thanks,
Jeff

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost
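
[Editor's note] The timeouts quoted above are plain outbound TCP failures from the worker node to the CCB port on the test frontend. One quick way to confirm whether a whitelist change has taken effect is to try the same connection by hand from a worker node. The following is only a minimal sketch: the function name can_connect and the default timeout are illustrative assumptions and are not part of HTCondor or glideinWMS.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    Any socket-level failure (timeout, refusal, unreachable network,
    name resolution error) is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run from a worker node, can_connect("169.228.130.121", 9636) uses the address and port from the log lines above. A False result only shows that the TCP connection could not be opened; it does not say which firewall or router dropped it.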
 
Thanks for the update Stephen,

I have contacted some HTCondor CE experts that hopefully can help you regarding the slow startup time.

In the meantime, I have just submitted some test jobs to Hyak HTCondor-CE, using the GOC-ITB factory.  Once we confirm everything works from GOC-ITB, I can push the change to the production factories.

Thanks,
Jeff

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost
 
http://staff.washington.edu/sjf4/uw_osg_condor.tar.gz

I uploaded the Condor logs from the OSG server to the above link if that helps.

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611
 
I've enabled Condor-CE and disabled GRAM. I'm indifferent to GRAM vs Condor-CE. Whatever is going to be stable and not require a bunch of fiddling.

The below command succeeds, but it takes a long time (1-6 minutes). Condor isn't scheduling a job in our job scheduler in a timely manner. I can't tell from the logs what's with the delay. There's a lot of log entries like "Number of Active Workers 0".

condor_ce_run -r globus1.hyak.washington.edu:9619 /bin/env

We're going to need what options it uses to submit jobs to our scheduler too.

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611
 
For what it's worth, I would advocate having the software team understand if the problem can be fixed without asking Stephen to change the CE. We still have plenty of GRAM sites out there.

Bo

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Bodhitha Jayatilaka 2312
 
I'm adding OSG software support.

Basically (1) would require a gram expert to help us debug on your end. To give you a sense of (2), see this document:

https://twiki.grid.iu.edu/bin/view/Documentation/Release3/InstallHTCondorCE

Thanks,
Jeff

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost
by [email protected] 
I don't know what's involved in either of these choices. We use
Moab/Torque as the scheduler. If I recall correctly, using GRAM is
what the documentation suggested.

On Tue, Oct 6, 2015 at 9:23 AM, Open Science Grid FootPrints
<[email protected]> wrote:
> [Duplicate message snipped]
 
Hi all,

Based on what I'm seeing, something has never correctly worked since we started sending pilots to Hyak. I don't believe any recent changes are responsible. Please see this plot:
http://gfactory-1.t2.ucsd.edu/factory/monitor/factoryStatus.html?entry=OSG_US_Hyak_osg&frontend=total&infoGroup=running&elements=StatusRunning,ClientGlideTotal,ReqIdle,StatusIdle,&rra=4&window_min=1429145004751.6199&window_max=1444107600000&timezone=-7

Note the black line shows glideins successfully registering back to the VO user pools, which means plenty are running. However the green area is almost always 0. The green represents the number of glideins condor-g sees running at Hyak. This means the gatekeeper at Hyak is almost never correctly reporting when glideins are actually running. To make things worse, the factory (condor-g) never even gets the message that these glideins have run to completion. So from the factory perspective, there is always plenty "idle" at the site, and it refuses to submit more (see gaps in September and October).  So we don't ever get new pilots until we manually remove these phantom pilots from the factory queues.

So we have two options, (1) either we try and understand why the gatekeeper state machine is broken in gram, or (2) Hyak drops gram in favor of HTCondor CE.

Stephen, in my experience in factory ops, we have seen many instances of (1) but have yet to understand the cause of it or know how to fix it.  So I am strongly in favor of (2), if you are open to migrating your CE. We had a similar problem at Wisconsin for one of their opportunistic queues, and (2) solved the problems there.

Let us know how you would like to proceed.

Thanks,
Jeff Dost
OSG Glidein Factory Operations

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost
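
[Editor's note] The reasoning in the post above can be reduced to a simple consistency check between two views of the same site: glideins that have registered with the VO user pools (the black line) and glideins that condor-g reports as running (the green area). The sketch below is illustrative only. The function name and its three parameters are hypothetical, and it does not reflect how the factory monitoring actually computes these values.

```python
def looks_like_phantom_pilots(registered_in_pool, running_per_factory, idle_per_factory):
    """Heuristic for the symptom described above.

    registered_in_pool  -- glideins that have registered with the VO pools
    running_per_factory -- glideins condor-g believes are running at the site
    idle_per_factory    -- glideins condor-g still counts as idle/pending

    Returns True when glideins are demonstrably running (they registered),
    yet the factory sees none running and still holds some as idle, which
    is the state that keeps it from submitting new pilots.
    """
    return (
        registered_in_pool > 0
        and running_per_factory == 0
        and idle_per_factory > 0
    )


# Example with made-up numbers: 40 glideins registered, the factory sees
# 0 running and 60 idle, so the entry is flagged.
print(looks_like_phantom_pilots(40, 0, 60))  # True
```

A check like this can only flag the mismatch; as the post notes, clearing the stale entries from the factory queues was still a manual step.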
by Kyle Gross 
Are there any updates on this issue?

-Kyle
by [email protected] 
16506194457164734166.15762837195100016529

Working from home and I don't have my key so hopefully they come
through via e-mail.

On Wed, Sep 30, 2015 at 3:47 PM, Open Science Grid FootPrints
<[email protected]> wrote:
> [Duplicate message snipped]
 
I'm still seeing the same issue on both SDSC and GOC: OSG and GLOW glideins are showing as stuck in a Pending state from the factory side of things, despite the fact that you're seeing those same jobs running on the site end, and the stderr & stdout files point to the job completing without issue.

I cleared out all of the stale glideins that resulted. Can you attach another stderr & stdout from an OSG job as soon as one completes? Unfortunately the logs on our end cycle fairly quickly (within two days), so I'd like to try and get the output from a fresh job to see if I can dig up any more information on it from here. From the output you posted, it looks like one of these jobs should finish within an hour of starting.

Brendan Dennis
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Brendan Dennis 2659
 
We haven't been seeing any Glideins since about 9/24.

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611
by Kyle Gross 
Marty/Stephen,

Are there any updates on this issue?

-Kyle
 
Stephen,

Thanks for the update log. We've been seeing some software conflicts lately on updates. So this might be helpful.

Marty Kandes
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
 
Mats,

My simple sleep test jobs ran successfully overnight at Hyak via the GOC-ITB factory [1]. It looks like we're still waiting for the OSG glideins from the GOC factory to run. For comparison, can you also submit a batch of jobs to Hyak via the GOC-ITB factory?

Marty Kandes
UCSD Glidein Factory Operations

[1]

DESIRED_Sites: OSG_US_Hyak_osg
Submitted to: OSG_US_Hyak_osg
osgmis
Tue Sep 22 19:46:48 PDT 2015
Linux n0336 2.6.32-504.23.4.el6.x86_64 #1 SMP Fri May 29 10:16:43 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
/tmp/glide_m5QCza/execute/dir_27312
300
jobs slept for 300 seconds.

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
 
I attached a list of the installed updates on 9/8. It seems like FNAL Glideins are working fine though. Let me know if there's anything else.

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611
 
Thanks, Mats.

We may have to wait for them to work through the queue. I've also got a batch of test jobs going through the GOC-ITB factory as well.

Marty Kandes
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
 
Requesting via the GOC factory, but no glideins yet:

http://glidein.grid.iu.edu/factory/monitor/factoryEntryStatusNow.html?entry=OSG_US_Hyak_osg

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Mats Rynge 45
 
Mats,

Well, that explains why GOC wasn't registering the OSG frontend. I've eliminated the typo and reconfigured the factory. Can you try again?

Thanks,

Marty Kandes
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
 
Looks like there is a typo in the factory config.

GLIDEIN_Supported_VOs = "OSGVO_MULTIC ORE,MIS,Fermilab,glowVO"

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Mats Rynge 45
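
[Editor's note] The typo reported above is a stray space inside one entry of a comma-separated VO list. A check along the following lines could catch that class of mistake before a factory reconfig. It is a minimal sketch: find_malformed_vos is a hypothetical helper, not a glideinWMS function, and it deliberately flags any whitespace at all, including spaces around commas.

```python
def find_malformed_vos(value):
    """Return entries of a comma-separated list that are empty or contain whitespace."""
    entries = value.split(",")
    return [e for e in entries if e == "" or any(ch.isspace() for ch in e)]


# The value quoted in the post above, including the stray space.
supported = "OSGVO_MULTIC ORE,MIS,Fermilab,glowVO"
print(find_malformed_vos(supported))  # ['OSGVO_MULTIC ORE']
```

An empty result means no entry contains whitespace; it does not verify that the names themselves match what the frontends advertise.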
 
Mats,

Yes, I see the requests at SDSC. Is there any way to force requests to GOC?

Marty Kandes
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
 
Marty,

We are requesting but not getting any glideins:

http://gfactory-1.t2.ucsd.edu/factory/monitor/factoryEntryStatusNow.html?entry=OSG_US_Hyak_osg

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Mats Rynge 45
 
Hi Mats,

Has OSG been requesting glideins at Hyak lately? If not, can you provide some work to Hyak via the GOC factory? I don't trust the globus packages on SDSC anymore.

Marty Kandes
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
 
Hi Stephen,

Can you provide us with a list of the updates that may have affected the glidein performance?

Marty Kandes
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
by [email protected] 
Not that I can recall. There's a glow job that just started. It's running tasks.

I've attached those files from the archive. It wouldn't allow the archive.

On Mon, Sep 21, 2015 at 2:32 PM, Open Science Grid FootPrints
<[email protected]> wrote:
> [Duplicate message snipped]
 
It doesn't look like the attachment made it through; I'm not seeing it here on the ticket. Even if the glideins have run, they may not have been running user jobs correctly, and they're definitely not communicating back with the factory properly, as all of those glideins are still listed as pending on the site from the factory end. I'm going to try clearing out all of the OSG glideins on the GOC factory so that some new ones can try submitting again.

Have we had this problem with glideins submitted to Hyak before? It sounds very familiar but I can't seem to find any previous tickets discussing it.

Brendan Dennis
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Brendan Dennis 2659
by [email protected] 
All those jobs have run. Their results are sitting on disk. I've
attached the contents of one of those directories.

On Mon, Sep 21, 2015 at 10:28 AM, Open Science Grid FootPrints
<[email protected]> wrote:
> [Duplicate message snipped]
 
I decided to give it the weekend, and Fermilab jobs are definitely running to completion without issue again. However, I'm still not seeing any OSG or GLOW glideins running on the site anymore. I've included a few OSG GridJobIDs below [1], can you check to see if there's something going on preventing them from running?

Brendan Dennis
UCSD Glidein Factory Operations

[1]
gt5 osg.hyak.washington.edu:2119/jobmanager-pbs https://globus1.hyak.washington.edu:44387/16506041583842135556/15762837195100016529/
gt5 osg.hyak.washington.edu:2119/jobmanager-pbs https://globus1.hyak.washington.edu:44387/16506041584010030596/15762837195100016529/
gt5 osg.hyak.washington.edu:2119/jobmanager-pbs https://globus1.hyak.washington.edu:44387/16506041584243745136/15762837195100016529/
gt5 osg.hyak.washington.edu:2119/jobmanager-pbs https://globus1.hyak.washington.edu:44387/16506041584300294116/15762837195100016529/
gt5 osg.hyak.washington.edu:2119/jobmanager-pbs https://globus1.hyak.washington.edu:44387/16506041584449718896/15762837195100016529/

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Brendan Dennis 2659
 
Hyak was filled with stale pending or held glideins, so I went ahead and cleared them all out. I'm now seeing Fermilab glideins running on the site again, but so far no GLOW or OSG glideins have progressed from pending to running. I'll check back in a few hours and see if they've made it any further.

Brendan Dennis
UCSD Glidein Factory Operations

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Brendan Dennis 2659
by Kyle Gross 
I am sending this to glidein factory support:

I was away on vacation and another of the staff here had to install the 9/8 security updates. The globus scheduler event generator did not restart properly on reboot and I fixed that on Monday 9/14. We stopped receiving Glideins after 9/8. Did you disable something on your end?

by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611
