By the end of May 2018, the ticketing system at https://ticket.opensciencegrid.org will be retired and support will be provided at https://support.opensciencegrid.org. Throughout this transition the support email ([email protected]) will be available as a point of contact.
Please see the service migration page for details: https://opensciencegrid.github.io/technology/policy/service-migrations-spring-2018/#ticket
by Brian Lin
Alright, I'll close this ticket. Thanks for your patience and let us know if you need any help in the future!
Cheers,
Brian
I'm inclined to declare victory on this.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
by Brian Lin
Troy,
There was a thread on the gt-user mailing list that sounded very similar to your problem here (titled "[gt-user] Problems with PBS SEG on Torque 5.1."). If you're getting Globus directly from their repos, it sounds like they have a fix available in the 'globus-gram-job-manager-pbs' packages in their unstable repo. If you would like to turn the SEG back on, you can try these new packages. Otherwise, you can leave the SEG off, and I think we can close this ticket because it looks like jobs are reporting correctly again.
- Brian
I know our other site that was experiencing issues (hyak) is a PBS site, but I don't know whether they had updated to a later version of TORQUE or not. I agree, things look better on the graph, so let's give it a bit before we consider this 'solved'.
I disabled SEG for our GRAM instance just for grins, and surprisingly things seemed to start working better in my manual testing using globusrun. A bunch of Nova jobs appeared shortly thereafter and are currently waiting in our local queue to run. I'm inclined to see how those jobs do over the long weekend.
This ticket was first opened on Nov 2, about a month and a half after we upgraded Oakley to TORQUE 5.1.1; during that time, no Nova jobs were submitted to Oakley. My current working theory is that the logging code in newer versions of TORQUE changed just enough that the SEG's support for it is finicky at best and broken at worst. Were other OSG sites that had trouble with GRAM also running new-ish versions of TORQUE?
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
by Brian Lin
Ok in that case, how exactly are you providing the PBS logs so that the
SEG can consume them? There should also be a directory that contains
files for each GRAM job and they should contain the status. Are all
those job statuses 'idle' or equivalent?
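For reference, the PBS SEG tails the TORQUE server logs from a directory given in its configuration; on a stock install that lives in /etc/globus/globus-pbs.conf, with something along these lines (the path below is an assumption, please adjust it for your site):

```
# /etc/globus/globus-pbs.conf (hypothetical example)
log_path="/var/spool/torque/server_logs"
```

If that path doesn't match where TORQUE actually writes its server logs, the SEG will never see the job state transitions.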
Thanks,
Brian
I have significantly more experience with GRAM than HTCondor, so I'm going to keep working on the GRAM side for a bit longer.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
by Brian Lin
Doug/Troy,
Sorry for the confusion about your supported VOs; I was under the impression that you also supported OSG. I'm glad we got that sorted out.
I don't think we need to start a new ticket. Although jobs (both user jobs and pilot jobs) are completing successfully, the pilot jobs are still not reporting their state back to the factory properly: http://gfactory-1.t2.ucsd.edu/factory/monitor/factoryStatus.html?entry=Nova_US_OSC_osg&frontend=Fermilab-fifebatch_frontend&infoGroup=running&elements=StatusRunning. We've got two options for fixing this:
1) Replace your GRAM gatekeeper software with HTCondor-CE.
2) Troubleshoot your GRAM gatekeeper.
We recommend option 1 since GRAM support is minimal at this stage and we really don't know how deep the rabbit hole goes for the GRAM gatekeeper issue. Let us know how you would like to proceed.
Thanks,
Brian
by djohnson@....
Hi Marty,
Thanks for removing the microbone VO for the time being. We're having
an internal discussion about supporting this experiment later this
morning, and we will make a decision soon. Please make sure to
contact me before any other changes to VO membership are made, and
let me know if there are problems with our site contact details.
If the NOvA glideins from yesterday have run successfully, I'd like to
recommend we close this ticket. And if there are lingering issues, I
recommend we create new tickets for each distinct item that remains.
Doug
On Thu, 14 Jan 2016 14:30:00 -0500,
Open Science Grid FootPrints wrote:
> [Duplicate message snipped]
Hi Troy,
We are currently observing Nova glideins pending at the gatekeeper, which means they are probably waiting in your local batch queue. We'll let you know if we see them run through OSC successfully.
Marty Kandes
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
I forgot to mention that we discovered yesterday that the globus-scheduler-event-generator service for our OSG GRAM instance broke in late December due to a server move. That has been fixed, AFAICT.
Is OSG still seeing glide-in errors at OSC now that we have established that pilot/osg-flock.grid.iu.edu shouldn't have been trying to send jobs here?
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
> I have no knowledge about OSGVO support at OSC but we just added Microboone yesterday because there is an agreement for a Microboone scientist (Project: PES0709 (Randy Johnson, UCN)) to get cycles.
> Yeah, I was just about to mention that. We recently got the Microboone request: https://ticket.opensciencegrid.org/28198
That request was premature. They only received a startup allocation very recently, and we have not agreed to support OSG jobs for them yet.
In the future, please *DO NOT* add any VOs to the GLIDEIN_Supported_VOs list for OSC without first consulting with either Doug Johnson or myself. As of right now, the GLIDEIN_Supported_VOs list for OSC should consist of Nova only. (Also, for future reference, how would I be able to check the current contents of that list myself?)
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
Hi all,
FYI - We've gone ahead and removed OSGVO from the GLIDEIN_Supported_VOs list.
Marty Kandes
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
Joe,
Yeah, I was just about to mention that. We recently got the Microboone request: https://ticket.opensciencegrid.org/28198
Marty Kandes
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
I have no knowledge about OSGVO support at OSC but we just added Microboone yesterday because there is an agreement for a Microboone scientist (Project: PES0709 (Randy Johnson, UCN)) to get cycles.
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Joe Boyd 579
Hi all,
Perhaps there has been some confusion here. Nova glideins should be submitted and run under the Fermilab VO DN [1], not the OSG VO DN, unless there has been a change I'm not aware of. Note, however, that we have Nova, OSGVO, and Microboone all listed in the GLIDEIN_Supported_VOs list in the factory configuration for OSC. Please confirm whether you would like OSGVO and Microboone, which also runs under the Fermilab VO DN, to be removed from this list. If we do this, then I believe you'll only receive the Nova-specific user jobs from the Fermilab VO frontend at OSC.
Marty Kandes
UCSD Glidein Factory Operations
[1]
subject : /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=frontend/fifebatch.fnal.gov/CN=proxy
issuer : /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=frontend/fifebatch.fnal.gov
identity : /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=frontend/fifebatch.fnal.gov
type : proxy
strength : 1024 bits
path : /var/lib/gwms-factory/client-proxies/user_fefermifife/glidein_gfactory_instance/credential_fifebatchgpvmhead1_OSG_gWMSFrontend.OSG_nova_714925
timeleft : 264:00:51
key usage : Digital Signature, Key Encipherment, Data Encipherment
=== VO fermilab extension information ===
VO : fermilab
subject : /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=frontend/fifebatch.fnal.gov
issuer : /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=voms2.fnal.gov
attribute : /fermilab/nova/Role=pilot/Capability=NULL
attribute : /fermilab/accelerator/Role=NULL/Capability=NULL
attribute : /fermilab/annie/Role=NULL/Capability=NULL
attribute : /fermilab/argoneut/Role=NULL/Capability=NULL
attribute : /fermilab/cdms/Role=NULL/Capability=NULL
attribute : /fermilab/chips/Role=NULL/Capability=NULL
attribute : /fermilab/coupp/Role=NULL/Capability=NULL
attribute : /fermilab/darkside/Role=NULL/Capability=NULL
attribute : /fermilab/des/Role=NULL/Capability=NULL
attribute : /fermilab/dune/Role=NULL/Capability=NULL
attribute : /fermilab/genie/Role=NULL/Capability=NULL
attribute : /fermilab/gm2/Role=NULL/Capability=NULL
attribute : /fermilab/grid/Role=NULL/Capability=NULL
attribute : /fermilab/lar1/Role=NULL/Capability=NULL
Hi Brian,
Thanks for the clarification, but this is a change in behavior for us.
We currently only support the NOvA VO at OSC. All OSG initiated jobs
need to go through that VO for the purposes of accounting against the
NOvA collaboration's allocation at OSC. In our VO configuration we
have all the NOvA OSG initiated jobs going through a single user
account named nova. This has been how glideins have worked until the
last few weeks/month. I can't emphasize strongly enough that we are
not a general OSG resource, and that we have to charge OSG jobs
against a specific allocation at OSC based on the user ID the job runs
as.
If the NOvA jobs are now going to be initiated by
"/DC=com/DC=DigiCert-Grid/O=Open Science
Grid/OU=Services/CN=pilot/osg-flock.grid.iu.edu", shouldn't this DN be
specified by the VO by managing it through VOMS?
We may support additional VOs in the future, but again, we will
support those VOs because members of the VOs are faculty or PIs in
Ohio, and have been allocated computation cycles and storage at OSC
through OSC allocations. How would we preserve local accounting if
all jobs are being started through osg-flock.grid.iu.edu? There can't
be any DNs in common between VOs.
Doug
On Wed, 13 Jan 2016 18:44:00 -0500,
Open Science Grid FootPrints wrote:
> [Duplicate message snipped]
by Brian Lin
I'm not surprised that you see those failures, because those jobs need to be mapped to a user that can submit batch jobs. The main documentation for what the pilots are doing is here: http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.prd/index.html. Pilot jobs are an essential part of the OSG infrastructure: when a pilot lands on a compute node, it starts up the HTCondor worker stack that communicates with the OSG and allows OSG users to run jobs at your site.
If you don't want to allow this, we should talk to factory ops so that you can drop support for the OSG VO.
> I can't speak to exactly why pilot jobs need to create '~/.globus' but the pilots need to do a lot of things to set up their payloads so they can communicate back to the OSG and accept user jobs. I wouldn't expect jobs mapped to 'nobody' to work. Is there another user you could point it to?
For the time being, I've remapped that DN to the user account for rsv, which is local to our OSG VO box and can't submit batch jobs. However, that too is generating errors:
root@osg:/var/log/globus# tail /var/log/globus/gram_rsv.log
ts=2016-01-13T20:32:21.866456Z id=19517 event=gram.job.end level=ERROR gramid=/16506032084249176766/14500289408298649993/ job_status=4 status=-17 reason="the job failed when the job manager attempted to run it"
ts=2016-01-13T20:32:28.148491Z id=19517 event=gram.job.end level=ERROR gramid=/16506032086491899326/14500289408298649993/ job_status=4 status=-17 reason="the job failed when the job manager attempted to run it"
ts=2016-01-13T20:32:28.292851Z id=19517 event=gram.job.end level=ERROR gramid=/16506032084372714686/14500289408298649993/ job_status=4 status=-17 reason="the job failed when the job manager attempted to run it"
ts=2016-01-13T20:32:28.398011Z id=19517 event=gram.job.end level=ERROR gramid=/16506032083232984766/14500289408298649993/ job_status=4 status=-17 reason="the job failed when the job manager attempted to run it"
ts=2016-01-13T20:32:28.429724Z id=19517 event=gram.job.end level=ERROR gramid=/16506032083335676606/14500289408298649993/ job_status=4 status=-17 reason="the job failed when the job manager attempted to run it"
ts=2016-01-13T20:32:28.555567Z id=19517 event=gram.job.end level=ERROR gramid=/16506032084382490046/14500289408298649993/ job_status=4 status=-17 reason="the job failed when the job manager attempted to run it"
ts=2016-01-13T20:32:28.591012Z id=19517 event=gram.job.end level=ERROR gramid=/16506032083541326526/14500289408298649993/ job_status=4 status=-17 reason="the job failed when the job manager attempted to run it"
ts=2016-01-13T20:32:28.595550Z id=19517 event=gram.job.end level=ERROR gramid=/16506032085156604606/14500289408298649993/ job_status=4 status=-17 reason="the job failed when the job manager attempted to run it"
ts=2016-01-13T20:32:28.699050Z id=19517 event=gram.job.end level=ERROR gramid=/16506032083296032446/14500289408298649993/ job_status=4 status=-17 reason="the job failed when the job manager attempted to run it"
ts=2016-01-13T20:32:28.699161Z id=19517 event=gram.job.end level=ERROR gramid=/16506032086786345406/14500289408298649993/ job_status=4 status=-17 reason="the job failed when the job manager attempted to run it"
Is there documentation somewhere that describes what this pilot/osg-flock.grid.iu.edu DN is trying to do? I have googled variations on it several times and not really come up with anything useful.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
by Brian Lin
> I'm confused by this. AFAICT, the OSG software is managing the DNs in /etc/grid-security/grid-mapfile on our OSG VO box, so shouldn't that include "/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=pilot/osg-flock.grid.iu.edu"? If not, to what local user should that DN map? And what happens if we need to add a second OSG project on OSC resources -- can that be accommodated on the same VO box, or will we need a second VO box?
You can choose which user it maps to in /etc/edg-mkgridmap.conf by changing the last field of the following lines:
# USER-VO-MAP osg OSG -- 23 -- Rob Quick (rquick@....)
group vomss://voms.opensciencegrid.org:8443/voms/osg osg
group vomss://voms.grid.iu.edu:8443/voms/osg osg
Then apply the changes by running `edg-mkgridmap`. If you would like to support multiple VOs (Virtual Organizations = OSG projects), you can map other VOs to other users, or to the same user if that suits you. This can all be supported on one CE (Compute Element = VO box).
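For example, a static grid-mapfile entry mapping the pilot DN to a local account looks like the following (the 'osgpilot' account name here is just a placeholder; exactly which file static entries belong in depends on how your edg-mkgridmap.conf is set up, since edg-mkgridmap regenerates the main grid-mapfile):

```
"/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=pilot/osg-flock.grid.iu.edu" osgpilot
```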
I can't speak to exactly why pilot jobs need to create '~/.globus' but the pilots need to do a lot of things to set up their payloads so they can communicate back to the OSG and accept user jobs. I wouldn't expect jobs mapped to 'nobody' to work. Is there another user you could point it to?
> > 1) the osg-flock issue is different, they are never running because yes, OSC needs to whitelist their DN to the gridmap file.
>
> I'm confused by this. AFAICT, the OSG software is managing the DNs in /etc/grid-security/grid-mapfile on our OSG VO box, so shouldn't that include "/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=pilot/osg-flock.grid.iu.edu"? If not, to what local user should that DN map? And what happens if we need to add a second OSG project on OSC resources -- can that be accommodated on the same VO box, or will we need a second VO box?
I did a little more digging into this and found the config files for edg-mkgridmap. I added an entry to map "/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=pilot/osg-flock.grid.iu.edu" to the nobody user. However, now we're seeing errors like the following:
root@osg:/var/log/globus# more /var/log/globus/gram_nobody.log
ts=2016-01-13T18:50:32.772783Z id=7220 event=gram.make_job_dir.end level=ERROR gramid=/16506016693465985926/14500289408298687149/ status=-22 path=//.globus msg="Error creating directory" errno=2 reason="No such file or directory"
ts=2016-01-13T18:50:32.945021Z id=7220 event=gram.make_job_dir.end level=ERROR gramid=/16506016693839809926/14500289408298687149/ status=-22 path=//.globus msg="Error creating directory" errno=2 reason="No such file or directory"
ts=2016-01-13T18:53:41.716176Z id=7619 event=gram.make_job_dir.end level=ERROR gramid=/16506016690739335926/14500289408298658176/ status=-22 path=//.globus msg="Error creating directory" errno=2 reason="No such file or directory"
ts=2016-01-13T18:53:41.882591Z id=7619 event=gram.make_job_dir.end level=ERROR gramid=/16506016692255845366/14500289408298658176/ status=-22 path=//.globus msg="Error creating directory" errno=2 reason="No such file or directory"
ts=2016-01-13T18:56:14.339399Z id=7741 event=gram.make_job_dir.end level=ERROR gramid=/16506017792204720721/14500289408298670491/ status=-22 path=//.globus msg="Error creating directory" errno=2 reason="No such file or directory"
ts=2016-01-13T18:56:14.494684Z id=7741 event=gram.make_job_dir.end level=ERROR gramid=/16506017791931179601/14500289408298670491/ status=-22 path=//.globus msg="Error creating directory" errno=2 reason="No such file or directory"
ts=2016-01-13T18:58:41.759802Z id=7822 event=gram.make_job_dir.end level=ERROR gramid=/16506017790855326121/14500289408298689940/ status=-22 path=//.globus msg="Error creating directory" errno=2 reason="No such file or directory"
ts=2016-01-13T18:58:41.883412Z id=7822 event=gram.make_job_dir.end level=ERROR gramid=/16506017792564247721/14500289408298689940/ status=-22 path=//.globus msg="Error creating directory" errno=2 reason="No such file or directory"
What exactly is the pilot/osg-flock.grid.iu.edu service trying to do here that it needs to create a ~/.globus directory?
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
Attached are the stdout and stderr files from the last nova job submitted to OSC's Oakley cluster yesterday.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
Also, our GRAM instance is starting to see RSL errors again:
ts=2016-01-12T18:22:25.818325Z id=2642 event=gram.validate_rsl.end level=ERROR msg="Unsupported RSL attribute" attribute=invalid status=-48
ts=2016-01-12T18:22:25.818345Z id=2642 event=gram.query.end level=ERROR gramid=/16506029772674911736/14500289408298685574/ uri="/16506029772674911736/14500289408298685574/" msg="Error processing query" status=-48 reason="the provided RSL could not be properly parsed
ts=2016-01-12T18:27:51.062297Z id=2642 event=gram.validate_rsl.end level=ERROR msg="Unsupported RSL attribute" attribute=invalid status=-48
ts=2016-01-12T18:27:51.062340Z id=2642 event=gram.query.end level=ERROR gramid=/16506167220725199961/14500289408298660531/ uri="/16506167220725199961/14500289408298660531/" msg="Error processing query" status=-48 reason="the provided RSL could not be properly parsed
ts=2016-01-12T18:27:51.069415Z id=2642 event=gram.validate_rsl.end level=ERROR msg="Unsupported RSL attribute" attribute=invalid status=-48
ts=2016-01-12T18:27:51.069446Z id=2642 event=gram.query.end level=ERROR gramid=/16506213379776989516/14500289408298685574/ uri="/16506213379776989516/14500289408298685574/" msg="Error processing query" status=-48 reason="the provided RSL could not be properly parsed
ts=2016-01-12T18:27:51.079587Z id=2642 event=gram.validate_rsl.end level=ERROR msg="Unsupported RSL attribute" attribute=invalid status=-48
ts=2016-01-12T18:27:51.079618Z id=2642 event=gram.query.end level=ERROR gramid=/16506029772674911736/14500289408298685574/ uri="/16506029772674911736/14500289408298685574/" msg="Error processing query" status=-48 reason="the provided RSL could not be properly parsed
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
> 4) Troy/Doug, first I'd like to apologize for the state of this ticket. If Marty/Kevin come back with more issues, we may want to start thinking about scrapping the globus gatekeeper entirely and moving to HTCondor-CE. Another site was having issues with the globus gatekeeper and after lengthy troubleshooting we decided that switching was the easier solution. They were able to install the new software and accept new jobs in half a day.
I will have to check internally on possible timelines for that. Unfortunately this is unfunded effort for us, so it's not going to be super high priority.
> 1) the osg-flock issue is different, they are never running because yes, OSC needs to whitelist their DN to the gridmap file.
I'm confused by this. AFAICT, the OSG software is managing the DNs in /etc/grid-security/grid-mapfile on our OSG VO box, so shouldn't that include "/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=pilot/osg-flock.grid.iu.edu"? If not, to what local user should that DN map? And what happens if we need to add a second OSG project on OSC resources -- can that be accommodated on the same VO box, or will we need a second VO box?
> But the real problem we've been trying to understand is the gram failing to report accurate job status back to our factory submit host.
I'll take a further look at this in the next day or two.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
by ahimmel@....
This libfann issue is a red herring. It is an unrelated problem that we already know about and are working on fixing.
-Alex
On Jan 11, 2016, at 4:42 PM, Open Science Grid FootPrints <osg@....<mailto:osg@....>> wrote:
[Duplicate message snipped]
I'll also reply for Marty on behalf of factory ops since he's not back until tomorrow.
There are two independent issues here.
In reply to Brian Lin's questions:
https://ticket.opensciencegrid.org/27288#1452545246
1) the osg-flock issue is different, they are never running because yes, OSC needs to whitelist their DN to the gridmap file.
But the real problem we've been trying to understand is the gram failing to report accurate job status back to our factory submit host. Please see this plot:
http://gfactory-1.t2.ucsd.edu/factory/monitor/factoryStatus.html?entry=Nova_US_OSC_osg&frontend=Fermilab-fifebatch_frontend&infoGroup=running&elements=StatusRunning,ClientGlideRunning,ClientGlideIdle,StatusIdle,&rra=0&window_min=0&window_max=0&timezone=-8
Yes, Kevin is right, Nova is running at the moment, but that will only be temporary. In our factory plot, "claimed" is non-zero, which means pilots were actually running at OSC. However, our "running" green area is flatlined at 0. GRAM is not reporting back that those pilots ever transitioned from idle to running.
If Nova keeps submitting batches, eventually the factory will only see "idle" pilots in the queue and will no longer submit new ones, because these "idle" pilots have in fact already run to completion, but on our factory side Condor-G thinks they haven't even started yet.
This is the worrying issue that, in my opinion, warrants the transition to HTCondor-CE.
Thanks,
Jeff Dost
OSG Glidein Factory Operations
by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost
> Do you have Nova's CVMFS directory (/cvmfs/nova.opensciencegrid.org) set up? I'm wondering if the cert errors are due to either A) the initial configuration failure or B) the script may be looking for certs in a CVMFS directory.
AFAICT, yes:
root@oak-rw:/nfs/13/troy# pdsh -w n0011,n0134,n0148,n0580 ls -al /cvmfs/nova.opensciencegrid.org/novasoft/slf6/novasoft/releases/S16-01-07/lib | dshbak -c
----------------
n[0011,0134,0148,0580]
----------------
total 46
drwxr-xr-x 4 cimsrvr 301 69 Jan 7 17:22 .
drwxr-xr-x 97 cimsrvr 301 4096 Jan 7 19:01 ..
drwxr-xr-x 2 cimsrvr 301 20480 Jan 7 17:42 Linux2.6-GCC-debug
drwxr-xr-x 2 cimsrvr 301 20480 Jan 7 17:50 Linux2.6-GCC-maxopt
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
by Brian Lin
Troy,
Do you have Nova's CVMFS directory (/cvmfs/nova.opensciencegrid.org) set up? I'm wondering if the cert errors are due to either A) the initial configuration failure or B) the script may be looking for certs in a CVMFS directory.
Thanks,
Brian
My response to Enrique:
On 01/11/2016 04:20 PM, Enrique Arrieta Diaz wrote:
> Here are the messages that I see:
>
> *****************************************************************************************************
> terminate called after throwing an instance of 'cet::exception'
> what(): ---- Configuration BEGIN
> Unable to load requested library /cvmfs/nova.opensciencegrid.org/novasoft/slf6
> /novasoft/releases/S16-01-07/lib/Linux2.6-GCC-maxopt/libLEM_dict.so
> libdoublefann.so.2: cannot open shared object file: No such file or directory
> ---- Configuration END
Is this application assuming that libdoublefann.so.2 is in the default library search path? Because AFAICT, that is not the case on Oakley.
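For example, the following generic check (run on one of our compute nodes; the library path is copied from the stderr above) would show whether the dynamic linker can resolve the dependency:

```shell
# Ask the dynamic linker which dependencies of the plugin resolve.
# If libdoublefann shows up as "not found" (or ldd cannot read the
# file at all on this host), it is not on the default search path.
libdir=/cvmfs/nova.opensciencegrid.org/novasoft/slf6/novasoft/releases/S16-01-07/lib/Linux2.6-GCC-maxopt
ldd "$libdir/libLEM_dict.so" 2>/dev/null | grep fann \
  || echo "libdoublefann not resolvable on this host"
```

If it is indeed missing from the default path, having the job wrapper add the directory containing libdoublefann.so.2 to LD_LIBRARY_PATH would be the usual workaround.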
> IFDH_DEBUG=0 => 0
> IFDH_DEBUG=0 => 0
> depth=2 DC = com, DC = DigiCert-Grid, O = DigiCert Grid, CN = DigiCert Grid Root CA
> verify error:num=19:self signed certificate in certificate chain
> verify return:0
> IFDH_DEBUG=0 => 0
> depth=2 DC = com, DC = DigiCert-Grid, O = DigiCert Grid, CN = DigiCert Grid Root CA
> verify error:num=19:self signed certificate in certificate chain
> verify return:0
> IFDH_DEBUG=0 => 0
> depth=2 DC = com, DC = DigiCert-Grid, O = DigiCert Grid, CN = DigiCert Grid Root CA
> verify error:num=19:self signed certificate in certificate chain
> verify return:0
> IFDH_DEBUG=0 => 0
> Mon Jan 11 15:27:39 EST 2016 ./arrieta1-Offsite_test_OSC-20160111_1401.sh COMPLETED with exit status 250
> IFDH_DEBUG=0 => 0
> *****************************************************************************************************
That's odd. That seems like a cert problem, but the cert in question is in our OSG VO box's /etc/grid-security/certificates directory as well as the moral equivalent in the osg-wn-client installation.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
Doug and I got some out-of-band email from Enrique Arrieta Díaz:
>On 01/11/2016 03:13 PM, Troy Baer wrote:
>> On 01/11/2016 03:03 PM, Enrique Arrieta Diaz wrote:
>>> Hello Troy, I just submitted a new set of 20 Productions test jobs to
>>> OSC:
>>>
>>> http://samweb.fnal.gov:8480/station_monitor/nova/stations/nova/projects/arrieta1-Offsite_test_OSC-20160111_1401
>>>
>>>
>>> 6575523.0@....
>>
>> We have 4 Nova jobs currently running on Oakley. I am not sure that
>> you are using the term "job" in quite the same way we are, though --
>> the samweb.fnal.gov URL above appears to refer to only one of the four
>> TORQUE jobs from Nova that I see.
>>
>I see that all of the OSG jobs in the above URL have ended with status
>"bad" and last activity "process ended - bad". Do you see anything on
>your end with more useful error messages?
*****************************************************************************************************
terminate called after throwing an instance of 'cet::exception'
what(): ---- Configuration BEGIN
Unable to load requested library /cvmfs/nova.opensciencegrid.org/novasoft/slf6
/novasoft/releases/S16-01-07/lib/Linux2.6-GCC-maxopt/libLEM_dict.so
libdoublefann.so.2: cannot open shared object file: No such file or directory
---- Configuration END
IFDH_DEBUG=0 => 0
IFDH_DEBUG=0 => 0
depth=2 DC = com, DC = DigiCert-Grid, O = DigiCert Grid, CN = DigiCert Grid Root CA
verify error:num=19:self signed certificate in certificate chain
verify return:0
IFDH_DEBUG=0 => 0
depth=2 DC = com, DC = DigiCert-Grid, O = DigiCert Grid, CN = DigiCert Grid Root CA
verify error:num=19:self signed certificate in certificate chain
verify return:0
IFDH_DEBUG=0 => 0
depth=2 DC = com, DC = DigiCert-Grid, O = DigiCert Grid, CN = DigiCert Grid Root CA
verify error:num=19:self signed certificate in certificate chain
verify return:0
IFDH_DEBUG=0 => 0
Mon Jan 11 15:27:39 EST 2016 ./arrieta1-Offsite_test_OSC-20160111_1401.sh COMPLETED with exit status 250
IFDH_DEBUG=0 => 0
*****************************************************************************************************
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
Nova had about 40 jobs waiting to run at OSC for the past few days; these all ran this morning and appear to have completed successfully.
Kevin
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Kevin Retzke 3130
by Brian Lin
Ok, this ticket is really getting out of hand and is in dire need of a summary.
1) Marty, what exactly are the problems that you're seeing from the factory end? Simple auth errors? Because it sounds like the osg-flock pilot DN has NOT been added to their grid-mapfile when it very much should be.
2) Kevin, are you seeing any issues with NOvA jobs? It's not very clear in this ticket what the original problem actually was, or whether the issues ever actually affected NOvA.
3) Marty/Kevin, did the factories submit lots of pilots to OSC over the weekend that would cause the OOM on their CE?
4) Troy/Doug, first I'd like to apologize for the state of this ticket. If Marty/Kevin come back with more issues, we may want to start thinking about scrapping the globus gatekeeper entirely and moving to HTCondor-CE. Another site was having issues with the globus gatekeeper and after lengthy troubleshooting we decided that switching was the easier solution. They were able to install the new software and accept new jobs in half a day.
- Brian
We've had 4 jobs arrive this afternoon. I did the following to capture their state 10-15 minutes after they started:
pdsh -w n0011,n0134,n0148,n0580 cd /tmp \; find -user nova \| xargs tar czf /tmp/nova-files.tgz
The results of these are fairly large, ranging from ~915MB to 4.5GB:
troy@oakley01:/fs/lustre/troy$ ls -alh
total 12G
drwxr-xr-x 7 troy sysp 28K Jan 11 14:48 .
drwxrwxrwt 202 root root 52K Jan 7 12:47 ..
[...]
-rw-r--r-- 1 troy sysp 3.8G Jan 11 14:45 n0011-nova-files.tgz
-rw-r--r-- 1 troy sysp 4.5G Jan 11 14:45 n0134-nova-files.tgz
-rw-r--r-- 1 troy sysp 1.9G Jan 11 14:46 n0148-nova-files.tgz
-rw-r--r-- 1 troy sysp 915M Jan 11 14:46 n0580-nova-files.tgz
[...]
These jobs have been running for more than ~20 minutes, so AFAICT they are successful from our perspective. I'm not sure how I would verify that they're successful from OSG's PoV, though.
We're also seeing lots of attempts by the DN "/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=pilot/osg-flock.grid.iu.edu" to connect to our OSG VO box, but AFAICT we've never been asked to add them to our grid-mapfile. OTOH, we're not seeing any more of the RSL attribute errors we were seeing last month.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
Our OSG VO box OOMed over the weekend. It looks like something on the OSG side kept starting new connections to it till it fell over:
root@osg:/nfs/13/troy# lastcomm -f /var/account/pacct.1 | grep -c condor_starter
9680
root@osg:/nfs/13/troy# lastcomm -f /var/account/pacct.1 | grep -c globus-gatekeep
25837
root@osg:/nfs/13/troy# grep -c nova /var/spool/batch/torque/client_logs/20160110
649
root@osg:/nfs/13/troy# grep -c tomcat /var/spool/batch/torque/client_logs/20160110
1062
We did have 12 nova jobs appear after our OSG VO box recovered, but only 4 of them ran for more than ~20 minutes and none ran for more than ~40 minutes.
If OSG is going to push some test jobs at us, I wish we would get some prior warning, since we kind of need to be paying attention while they are running.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
by Vince Neal
Good morning,
Checking in on the status of testing. If I may assist, please let me know.
Thank you,
by ahimmel@....
I will ask, but our offsite expert may not be back from the holidays yet.
-Alex
On Jan 5, 2016, at 4:36 PM, Open Science Grid FootPrints <osg@....<mailto:osg@....>> wrote:
[Duplicate message snipped]
It looks like the files in question have been purged in the interim. Can somebody try to send some Nova jobs our way tomorrow morning?
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
by Brian Lin
Troy,
Could you host the file somewhere so that I could grab it?
Thanks,
Brian
by Brian Lin
Could you try e-mailing it or placing it on a web-server so I could grab it?
Thanks,
Brian
by troy@....
I tried that yesterday, and the resulting file was larger than the OSG
ticket system would accept.
On 12/15/2015 12:46 PM, Open Science Grid FootPrints wrote:
> [Duplicate message snipped]
by Brian Lin
Can you grab one of the /tmp/pbstmp.* folders and attach it to this ticket?
AFAICT, we had 97 nova jobs submitted this afternoon, but only 5 of them ran for significantly longer than 20 minutes.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
Something appears to have changed, because now we have a bunch of nova jobs running.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
by troy@....
On 12/14/2015 02:02 PM, Open Science Grid FootPrints wrote:
> [Duplicate message snipped]
by troy@....
We're seeing a lot of errors like the following in our GRAM logs now:
ts=2015-12-14T18:56:42.460154Z id=31271 event=gram.validate_rsl.end
level=ERROR msg="Unsupported RSL attribute" attribute=invalid status=-48
ts=2015-12-14T18:56:42.460194Z id=31271 event=gram.query.end level=ERROR
gramid=/16506225426086058416/14500289408298685574/
uri="/16506225426086058416/14500289408298685574/" msg="Error processing
query" status=-48 reason="the provided RSL could not be properly parsed
ts=2015-12-14T18:56:42.482060Z id=31271 event=gram.validate_rsl.end
level=ERROR msg="Unsupported RSL attribute" attribute=invalid status=-48
ts=2015-12-14T18:56:42.482105Z id=31271 event=gram.query.end level=ERROR
gramid=/16506255115752160996/14500289408298685574/
uri="/16506255115752160996/14500289408298685574/" msg="Error processing
query" status=-48 reason="the provided RSL could not be properly parsed
--Troy
--
Troy Baer
Senior HPC Systems Engineer
Ohio Supercomputer Center
http://www.osc.edu/
Hi,
There are currently Nova jobs trying to run at OSC; looks like our frontend has been asking for glideins at OSC for several days now (since at least 12/8) without any starting.
Kevin
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Kevin Retzke 3130
Hi Troy,
The factory doesn't actually generate/send any jobs directly, so we'll need to get the Fermilab or OSG frontend people to send some jobs your way. I believe Kevin Retzke was taking care of that earlier; Kevin, could you arrange for some more Nova jobs to land on their site?
Brendan Dennis
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Brendan Dennis 2659
To collect output, we need the factory to send us some jobs.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
It looks like the glidein logs have a different naming scheme than on the factory, as those gmon.out files are definitely not the glidein logs. It might be worth making a tarball of one of the /tmp/pbstmp.* folders in its entirety instead, to make sure we get everything associated with the run.
Brendan Dennis
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Brendan Dennis 2659
Since the ticket system won't accept .tgz files (which is a rant for another time), I've uploaded a base64-encoded version of the tarball. To extract, do the following:
base64 -d <osg-logs.tgz.base64encoded.txt | tar xzvf -
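For reference, the encoding side of that workaround can be sketched like this (the input path is an example, not necessarily the directory actually used):

```shell
# Sketch of the encoding side (input path is an example): pack a
# directory into a gzipped tarball and wrap it in base64 so a ticket
# system that rejects .tgz files will accept it as plain text.
tar czf - /tmp/pbstmp.5164908 | base64 > osg-logs.tgz.base64encoded.txt

# The recipient then reverses it:
base64 -d < osg-logs.tgz.base64encoded.txt | tar xzvf -
```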
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
by troy@....
Sorry, this fell off the stack. I need to get my DN updated before I
can upload files to the ticket, and I haven't had a chance to request
that yet.
On 12/09/2015 10:36 AM, Open Science Grid FootPrints wrote:
> [Duplicate message snipped]
by Brian Lin
Doug/Troy: Any word on those files and the size of /tmp?
by Brian Lin
Doug/Troy: I'd also be curious to see other logs, namely the MasterLog, StarterLog, and StartdLog. How big is your /tmp partition? I ask because we've seen some issues with /tmp filling up and causing pilots to go haywire.
Thanks,
Brian
by Vince Neal
Hi Doug,
Unfortunately, files received via email cannot be attached to the ticket. You must use the web UI to attach files.
You should be able to reach this ticket directly using the following URL: https://ticket.grid.iu.edu/27288
Please let me know if I may assist.
Thank you,
Vince
by Brian Lin
Doug,
You'll need to use the web UI to attach files to the ticket. Could you look at another glidein and provide its *.out, *.err, and log dir?
Thanks,
Brian
by troy@....
Attached is a tarball of the following files from one of our compute
nodes that has 9 nova jobs running:
# find /tmp/pbstmp.*/glide* \( -name "*.out" -o -name "*.err" \)
/tmp/pbstmp.5167119/glide_FUO19t/execute/dir_9427/gmon.out
/tmp/pbstmp.5167120/glide_QxP8JX/execute/dir_17683/gmon.out
/tmp/pbstmp.5167121/glide_y1OFrM/execute/dir_7394/gmon.out
/tmp/pbstmp.5167132/glide_9sjUnb/execute/dir_12825/gmon.out
/tmp/pbstmp.5167133/glide_tFhiUP/execute/dir_22538/gmon.out
/tmp/pbstmp.5167141/glide_wA2FHE/execute/dir_5728/gmon.out
/tmp/pbstmp.5167143/glide_HtLbxp/execute/dir_20651/gmon.out
/tmp/pbstmp.5167146/glide_euGW9z/execute/dir_15368/gmon.out
/tmp/pbstmp.5168310/glide_E9Tg25/execute/dir_24003/gmon.out
/tmp/pbstmp.5168312/glide_DRsomU/execute/dir_15369/gmon.out
/tmp/pbstmp.5168314/glide_DNxLfE/execute/dir_19300/gmon.out
Unfortunately, I don't see any error messages (or much of anything
else) in those.
On 11/24/2015 06:15 PM, Open Science Grid FootPrints wrote:
> [Duplicate message snipped]
Troy,
If possible, can you upload some examples of any *.out or *.err logs you find in these temporary work directories as an attachment to the ticket here?
Marty Kandes
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
by troy@....
One wrinkle in this is that $TMPDIR on our systems is ephemeral; it only
lives as long as the TORQUE job, so we pretty much have to catch these
while they're running. I've tried poking around on some of the
currently running jobs, and I'm not really sure what I'm looking at:
[root@n0249 ~]# ls -al /tmp/pbstmp.5164908/glide_bfhhWd/log/
total 844
drwxr-xr-x 2 nova PES0656 4096 Nov 24 17:52 .
drwxr-xr-x 10 nova PES0656 4096 Nov 24 14:54 ..
-rw------- 1 nova PES0656 0 Nov 24 14:54 InstanceLock
-rw-r--r-- 1 nova PES0656 104 Nov 24 14:54 .master_address
-rw-r--r-- 1 nova PES0656 1932 Nov 24 14:54 MasterLog
prw------- 1 nova PES0656 0 Nov 24 17:52 procd_address
prw------- 1 nova PES0656 0 Nov 24 17:52 procd_address.watchdog
-rw-r--r-- 1 nova PES0656 434896 Nov 24 17:52 ProcLog
-rw-r--r-- 1 nova PES0656 104 Nov 24 14:54 .startd_address
-rw------- 1 nova PES0656 146 Nov 24 14:55 .startd_claim_id.slot1
-rw-r--r-- 1 nova PES0656 396618 Nov 24 17:50 StartdLog
-rw-r--r-- 1 nova PES0656 3659 Nov 24 14:55 StarterLog
[root@n0249 ~]# grep -i error /tmp/pbstmp.5164908/glide_bfhhWd/log/*Log
/tmp/pbstmp.5164908/glide_bfhhWd/log/MasterLog:11/24/15 14:54:12
(pid:10576) Daemon Log is logging: D_ALWAYS D_ERROR
/tmp/pbstmp.5164908/glide_bfhhWd/log/StartdLog:11/24/15 14:54:13
(pid:10580) Daemon Log is logging: D_ALWAYS D_ERROR D_JOB
/tmp/pbstmp.5164908/glide_bfhhWd/log/StartdLog:11/24/15 14:54:14
(pid:10580) VM-gahp server reported an internal error
/tmp/pbstmp.5164908/glide_bfhhWd/log/StarterLog:11/24/15 14:55:21
(pid:10607) Daemon Log is logging: D_ALWAYS D_ERROR
/tmp/pbstmp.5164908/glide_bfhhWd/log/StarterLog:11/24/15 14:55:22
(pid:10607) Error file:
/tmp/pbstmp.5164908/glide_bfhhWd/execute/dir_10607/_condor_stderr
[root@n0249 ~]# more
/tmp/pbstmp.5164908/glide_bfhhWd/execute/dir_10607/_condor_stderr
[...nothing...]
What does "VM-gahp server reported an internal error" mean? That's the
only obvious error I can see, and it seems to happen almost
immediately. Also, this job is clearly doing work:
[root@n0249 ~]# ps auxwf | grep nova
root 15584 0.0 0.0 103304 820 pts/8 S+ 18:03
0:00 \_ grep nova
nova 5524 0.0 0.0 110424 1612 ? Ss 14:53 0:00 \_ -sh
/var/spool/batch/torque/mom_priv/jobs/5164908.oak-batch.osc.edu.SC
nova 5597 0.0 0.0 106628 1816 ? S 14:53 0:00 \_
/bin/bash
/nfs/17/nova/.globus/.gass_cache/local/md5/3a/d6c4c10150ec60d3258681bacb155f/md5/97/ef9ce28e4e53f8ceecfce5d0719590/data
-v std -name gfactory_instance -entry Nova_US_OSC_osg -clientname
fifebatchgpvmhead2_OSG_gWMSFrontend.OSG_nova -schedd
schedd_glideins5@.... -proxy None -factory SDSC -web
http://gfactory-1.t2.ucsd.edu/factory/stage -sign
cd26d67f9e8ea668f2d8bfea737a13ef27c5fded -signentry
6496b05b9d283d89a011e6e62a554811c0151e78 -signtype sha1 -descript
description.fbobom.cfg -descriptentry description.fbobom.cfg -dir TMPDIR
-param_GLIDEIN_Client fifebatchgpvmhead2_OSG_gWMSFrontend.OSG_nova
-submitcredid 714925 -slotslayout fixed -clientweb
http://fifebatchgpvmhead2.fnal.gov/vofrontend/stage -clientsign
22678b74f42a46020074e625a4beebdf7d4d1d85 -clientsigntype sha1
-clientdescript description.fb5fm7.cfg -clientgroup OSG_nova
-clientwebgroup
http://fifebatchgpvmhead2.fnal.gov/vofrontend/stage/group_OSG_nova
-clientsigngroup e1941fbc24b3e0a74eb4a4aeea134bdf9e682b38
-clientdescriptgroup description.ebieT2.cfg -param_CONDOR_VERSION
default -param_GLIDEIN_Glexec_Use NEVER -param_GLIDEIN_Job_Max_Time
34800 -param_GLIDECLIENT_ReqNode
gfactory.minus,1.dot,t2.dot,ucsd.dot,edu -param_GLIDECLIENT_Rank 1
-param_GLIDEIN_Report_Failed NEVER -param_MIN_DISK_GBS 1
-param_GLIDEIN_Monitoring_Enabled False -param_HAS_USAGE_MODEL OFFSITE
-param_UPDATE_COLLECTOR_WITH_TCP True -param_CONDOR_ARCH default
-param_USE_MATCH_AUTH True -param_CONDOR_OS default
-param_GLIDEIN_Collector
fifebatchhead3.dot,fnal.dot,gov.colon,9620.minus,9630.semicolon,fifebatchhead4.dot,fnal.dot,gov.colon,9620.minus,9630
-cluster 3281717 -subcluster 2
nova 9850 0.0 0.0 9376 1380 ? S 14:54 0:00
\_ /bin/bash /tmp/pbstmp.5164908/glide_bfhhWd/main/condor_startup.sh
glidein_config
nova 10576 0.0 0.0 95836 8152 ? S 14:54
0:00 \_
/tmp/pbstmp.5164908/glide_bfhhWd/main/condor/sbin/condor_master -f
-pidfile /tmp/pbstmp.5164908/glide_bfhhWd/condor_master2.pid
nova 10579 0.1 0.0 23580 5024 ? S 14:54
0:13 \_ condor_procd -A
/tmp/pbstmp.5164908/glide_bfhhWd/log/procd_address -L
/tmp/pbstmp.5164908/glide_bfhhWd/log/ProcLog -R 1000000 -S 60 -C 16997
nova 10580 0.0 0.0 96668 9068 ? S 14:54
0:04 \_ condor_startd -f
nova 10607 0.0 0.0 96008 8520 ? S 14:55
0:00 \_ condor_starter -f fifebatch1.fnal.gov
nova 10611 0.0 0.0 9376 1272 ? S 14:55
0:00 \_ /bin/sh
/tmp/pbstmp.5164908/glide_bfhhWd/execute/dir_10607/condor_exec.exe
--limit 4 --multifile --export
DEST=/pnfs/nova/scratch/fts/ParticleID_dropbox --config
Production/fcl/prod_reco_pidpart_numi_job.fcl --source
/cvmfs/nova.opensciencegrid.org/novasoft/slf5/novasoft/setup/setup_nova.sh:-r:S15-05-04c:-b:maxopt:-e:/cvmfs/nova.opensciencegrid.org/externals:-5:/cvmfs/nova.opensciencegrid.org/novasoft/slf5/novasoft:-6:/cvmfs/nova.opensciencegrid.org/novasoft/slf6/novasoft
-X runNovaSAM.py --hashDirs --copyOut --outTier out1:reco --outTier
out2:lemsum --outTier out3:pidpart
nova 10673 0.0 0.0 9732 1860 ? S 14:55
0:00 \_ /bin/sh
./arrieta1-draining_prod_artdaq_FA14-10-03x.d_nd_genie_fhc_nonswap_ndnewpos_draining_reco_S15-05-04c_AND_pidpart_S15-05-04c-20151123_1402.sh
--limit 4 --multifile --export
DEST=/pnfs/nova/scratch/fts/ParticleID_dropbox --config
Production/fcl/prod_reco_pidpart_numi_job.fcl --source
/cvmfs/nova.opensciencegrid.org/novasoft/slf5/novasoft/setup/setup_nova.sh:-r:S15-05-04c:-b:maxopt:-e:/cvmfs/nova.opensciencegrid.org/externals:-5:/cvmfs/nova.opensciencegrid.org/novasoft/slf5/novasoft:-6:/cvmfs/nova.opensciencegrid.org/novasoft/slf6/novasoft
-X runNovaSAM.py --hashDirs --copyOut --outTier out1:reco --outTier
out2:lemsum --outTier out3:pidpart
nova 15518 0.3 0.0 220000 41488 ? S 18:01
0:00 \_ python
/cvmfs/nova.opensciencegrid.org/externals/NovaGridUtils/v01.44/NULL/bin/runNovaSAM.py
-c Production/fcl/prod_reco_pidpart_numi_job.fcl --hashDirs --copyOut
--outTier out1:reco --outTier out2:lemsum --outTier out3:pidpart
/tmp/pbstmp.5164908/glide_bfhhWd/execute/dir_10607/no_xfer/ifdh_16997_5524/neardet_genie_fhc_nonswap_ndnewpos-ndfluxv08_2000_r00010812_s13_c001_FA14-10-03x.d_v1_20151119_135356.sim.overlay.daq.root
nova 15527 94.2 1.3 1384208 653076 ? Rl 18:01
2:08 \_ nova -c
prod_reco_pidpart_numi_job_neardet_genie_fhc_nonswap_ndnewpos-ndfluxv08_2000_r00010812_s13_c001_FA14-10-03x.d_v1_20151119_135356.sim.overlay.daq.fcl
--sam-application-family=nova --sam-application-version=S15-05-04c
--sam-file-type=importedSimulated --sam-data-tier=out1:reco
--sam-stream-name=out1:out1 --sam-data-tier=out2:lemsum
--sam-stream-name=out2:out1 --sam-data-tier=out3:pidpart
--sam-stream-name=out3:out1
/tmp/pbstmp.5164908/glide_bfhhWd/execute/dir_10607/no_xfer/ifdh_16997_5524/neardet_genie_fhc_nonswap_ndnewpos-ndfluxv08_2000_r00010812_s13_c001_FA14-10-03x.d_v1_20151119_135356.sim.overlay.daq.root
On 11/24/2015 02:54 PM, Open Science Grid FootPrints wrote:
> [Duplicate message snipped]
--
Troy Baer
Senior HPC Systems Engineer
Ohio Supercomputer Center
http://www.osc.edu/
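Since these work directories vanish as soon as the TORQUE job ends, one way to catch the logs in time is to snapshot them periodically while jobs are running; a rough sketch (the destination path is an assumption):

```shell
# Rough sketch (DEST is an assumed path): copy glidein log dirs out of
# the ephemeral per-job /tmp/pbstmp.* directories into persistent
# storage before the TORQUE job, and with it $TMPDIR, goes away.
DEST=/nfs/13/troy/glidein-log-snapshots
mkdir -p "$DEST"
for d in /tmp/pbstmp.*/glide_*/log; do
    [ -d "$d" ] || continue                                # nothing matched
    job=$(echo "$d" | sed 's|^/tmp/pbstmp\.||; s|/.*||')   # TORQUE job id
    cp -r "$d" "$DEST/logs.$job.$(date +%s)"
done
```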
Hi Doug,
Is there any way you can try to find some of the glidein logs associated with the period when you observed this non-trivial CPU usage? The logs are still not being returned to the glidein factories after running at the site, if they are in fact running successfully. We don't have any confirmation of glideins running properly at OSC. In fact, during the last week, we only see that most glideins were held at the factories and never properly submitted to OSC. I've cleared out these held glideins from the factories. Maybe the new ones will not go held again.
Marty Kandes
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
by troy@....
From our end, it looks like OSG/nova jobs that have come in after
2015-11-20 12:50:00 EST have accrued non-trivial amounts of CPU time.
On 11/17/2015 03:02 PM, Open Science Grid FootPrints wrote:
> [Duplicate message snipped]
Doug,
No problem. Let us know when you're back and have had a chance to look for some glidein logs on your end.
Thanks,
Marty Kandes
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
Unfortunately I can't get the exact directory that the glidein logs will be in, since there are no glidein stdout/err logs being returned that would have the variable information set, but one place to look (besides TMPDIR) would be to check /tmp for glide_* folders, where * is a six-character alphanumeric string. A lot of glidein data and logs should be in those folders.
Also, I'm clearing out all of the stale held glideins again on the site, so you should start getting jobs attempting to run shortly.
Brendan Dennis
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Brendan Dennis 2659
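Brendan's suggestion above amounts to something like the following minimal sketch, checking both /tmp and TMPDIR (if set) on a worker node:

```shell
# Minimal sketch: list candidate glidein work directories on a worker
# node; checks /tmp plus $TMPDIR when that variable is set.
find /tmp ${TMPDIR:+"$TMPDIR"} -maxdepth 1 -type d -name 'glide_*' 2>/dev/null
```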
Hi,
I've been unable to spend time on this ticket before leaving for the SC15 conference. One thing that we still need to do is change the version of the osg-wn-client software on the compute nodes. We neglected to upgrade this software when we upgraded the VO box. Version 3.2.29 is installed. Please send updates on when we can expect running jobs, along with the specific names of log files you would want collected.
Doug
by /DC=org/DC=cilogon/C=US/O=Ohio Technology Consortium (OH-TECH)/CN=Douglas Johnson A20836
Doug,
You should be able to find the glidein logs on the worker nodes in work_dir="TMPDIR", where TMPDIR is a path environment variable you all probably set up on your end originally. I suspect it's wherever you found the original stderr/stdout files you sent us at the beginning of this ticket. Can you try to dig up a few more recent ones for us?
Thanks,
Marty Kandes
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
by Brian Lin
Marty,
In that case, where can Doug find the glidein logs? Presumably on the worker nodes but where?
Thanks,
Brian
Hi Brian,
Yes and no. The problem has 'disappeared' for us in the sense that glidein logs are no longer being returned to the factories. So what we really need is for someone to look at the glidein logs locally at OSC to determine what might be causing the problem, which now seems to have a similar profile to the possible GRAM/globus-related problem previously observed at Hyak (https://ticket.opensciencegrid.org/26794).
If you're interested in other sites exhibiting the 'Not enough arguments' error, I'm compiling a list here --- https://jira.opensciencegrid.org/browse/GFACTOPS-765 --- as I come across them.
Marty Kandes
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
by Brian Lin
Doug: Could you attach the output of osg-system-profiler?
Marty: It's a little hard to follow this ticket; has the 'Not enough arguments' error gone away? Is the problem now that glidein startds are not reporting their status properly back to the SDSC factory?
Thanks,
Brian
Hi all,
As Troy mentioned earlier, the failure of OSG glideins to authenticate properly at OSC is likely due to their DN not being in the grid-mapfile. Has this DN been added?
As for the Nova-specific Fermilab glideins, it looks like we may have encountered a GRAM-related problem also seen at another site previously. Here, the symptoms are that user jobs and glideins appear to run fine when viewed from the site's perspective, but the glideins fail to update the factories with their progress correctly. If I recall correctly, this leads to the factory not requesting the appropriate amount of glideins at a site. I'll alert the OSG software folks about this as we've been attempting to reproduce the problem elsewhere.
Marty Kandes
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
Hi,
We are seeing a mix of results: most jobs appear to be successful, but we saw some failed jobs on the 6th. A new attachment named stdout-20151106.txt contains the error.
Doug
by /DC=org/DC=cilogon/C=US/O=Ohio Technology Consortium (OH-TECH)/CN=Douglas Johnson A20836
The Nova glideins came from the SDSC factory (and the jobs did finish fine):
005 (3320115.6349.000) 11/07 00:17:43 Job terminated.
(1) Normal termination (return value 0)
Usr 0 14:33:35, Sys 0 00:00:26 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 14:33:35, Sys 0 00:00:26 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
5311732 - Run Bytes Sent By Job
24369 - Run Bytes Received By Job
5311732 - Total Bytes Sent By Job
24369 - Total Bytes Received By Job
Partitionable Resources : Usage Request Allocated
Cpus : 1 1
Disk (KB) : 4460720 4000000 819007892
Memory (MB) : 928 2000 2500
...
028 (3320115.6349.000) 11/07 00:17:43 Job ad information event triggered.
SentBytes = 5311732.0
JOB_GLIDEIN_Name = "gfactory_instance"
TerminatedNormally = true
ReturnValue = 0
JOB_GLIDEIN_ClusterId = "3234446"
JOB_GLIDEIN_SiteWMS = "PBS"
EventTypeNumber = 28
Subproc = 0
ReceivedBytes = 24369.0
JOB_GLIDEIN_SiteWMS_Slot = "Unknown"
MyType = "JobTerminatedEvent"
TriggerEventTypeName = "ULOG_JOB_TERMINATED"
TotalRemoteUsage = "Usr 0 14:33:35, Sys 0 00:00:26"
JOB_Site = "OSC"
JOB_GLIDEIN_Site = "OSC"
Proc = 6349
EventTime = "2015-11-07T00:17:43"
Was the problem fixed?
The messages are a bit confusing. On one side, Kevin's jobs seem to run successfully; on the other, the OSG factory glideins seem to fail authorization. Is Kevin using a different factory?
Thanks, Marco
The jobs are still running; I just did a condor_ssh_to_job on one and it's chugging along.
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Kevin Retzke 3130
by troy@....
I'm not sure if this is related to the auth problem Marty mentioned, but
we've seen a lot (>1700) of errors like the following today:
PID: 19695 -- Failure: globus_gss_assist_gridmap() failed
authorization. globus_gss_assist: Gridmap lookup failure: Could not map
/DC=com/DC=DigiCert-Grid/O=Open Science
Grid/OU=Services/CN=pilot/osg-flock.grid.iu.edu
Indeed, that DN does not appear in the grid-mapfile on our OSG host:
root@osg:~# grep "/DC=com/DC=DigiCert-Grid/O=Open Science
Grid/OU=Services/CN=pilot/osg-flock.grid.iu.edu"
/etc/grid-security/grid-mapfile
[...nothing...]
I'm not sure if that's indicative of a problem or not.
--Troy
On 11/06/2015 04:59 PM, Open Science Grid FootPrints wrote:
> [Duplicate message snipped]
--
Troy Baer
Senior HPC Systems Engineer
Ohio Supercomputer Center
http://www.osc.edu/
Kevin,
Did your Nova jobs complete successfully?
Marty Kandes
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
Marco et al.,
OSG glideins are being held at the factory due to an authentication problem [1].
Marty Kandes
UCSD Glidein Factory Operations
[1]
2529089.6 feosgflock 11/6 20:39 Globus error 7: authentication with the remote server failed
2529089.8 feosgflock 11/6 20:39 Globus error 7: authentication with the remote server failed
2529129.1 feosgflock 11/6 21:18 Globus error 7: authentication with the remote server failed
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
006 (3320115.6349.000) 11/06 09:21:41 Image size of job updated: 2216784
006 (3320115.6349.000) 11/06 09:31:42 Image size of job updated: 2237152
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Kevin Retzke 3130
All,
We have jobs running, and they are actually executing the 'nova' executable. This is in contrast to what we were seeing before the upgrade with the glidein_startup.sh error. Please let us know whether these jobs look OK from an external perspective.
Doug
by /DC=org/DC=cilogon/C=US/O=Ohio Technology Consortium (OH-TECH)/CN=Douglas Johnson A20836
Hi,
The upgrade is complete, please send glideins.
Doug
by /DC=org/DC=cilogon/C=US/O=Ohio Technology Consortium (OH-TECH)/CN=Douglas Johnson A20836
by djohnson@....
One data point that might be useful, which we stumbled on while trying
to install the CA changes that were just announced: we're still at
osg-version-3.1.46-1, which is apparently a dead end. Since glideins
are broken, we will upgrade to 3.3; perhaps we should table testing
until that's finished. I'll update the ticket when the upgrade is
done.
Doug
On Thu, 05 Nov 2015 11:46:00 -0500,
Open Science Grid FootPrints wrote:
>
> [1 <text/plain; utf-8 (quoted-printable)>]
> [2 <text/html; utf-8 (quoted-printable)>]
> [Duplicate message snipped]
Hi Marco,
I've enabled OSGVO at OSC.
Marty Kandes
UCSD Glidein Factory Operations
P.S. Yes, the 'Not enough arguments' error has been a bit odd. When I saw the problem before, it seemed to occur at sites with an already existing underlying problem. There was one very clear example of this: a known site with a problem before we upgraded glideinWMS, then after the upgrade the original problem immediately became 'Not enough arguments'. I'll go back and check my notes on the specifics here when I get a chance.
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
by Marco Mambelli
Hi Marty,
could you re-enable OSG and send there some glideins?
It seems strange that "Not enough arguments" is appearing only here and there; since it's in the glidein_startup.sh script, it should affect all glideins.
I also have a patch attached to the redmine ticket (https://cdcvs.fnal.gov/redmine/issues/10762), but I'd first wait to see how the jobs at OSC are running.
Thank you,
Marco
Hi,
No jobs have been run since 2015-10-31. If we can get some jobs submitted during the day, we can try to collect these files during execution. These jobs run for about 20 minutes before failing, so there is a reasonable window to collect the files.
Doug
by /DC=org/DC=cilogon/C=US/O=Ohio Technology Consortium (OH-TECH)/CN=Douglas Johnson A20836
by Marco Mambelli
Thank you Doug,
the file I need is the one invoked by this script:
/nfs/17/nova/.globus/.gass_cache/local/md5/da/ce1b3b754bc2e12f318fe5f5a335a3/md5/96/b8b66f54614b2f5e73994fb323bb96/data
Probably that specific one has been deleted. If you have a job running, you can see the last line in the scheduler_pbs_job_script for the updated location of the startup script.
It would be great if you can attach it and add in the message the command line used (e.g. /nfs/17/nova/.globus/.gass_cache/local/md5/da/ce1b3b754bc2e12f318fe5f5a335a3/md5/96/b8b66f54614b2f5e73994fb323bb96/data "-v" "std" "-name" "gfactory_instance" ... "-cluster" "2517720" "-subcluster" "7" </dev/null) so that I can connect it to the stdout/err in the factory (or if you can attach its stdout/err).
Thank you,
Marco
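Marco's tip about the last line of the generated job script can be checked with something like this (the script path is an example; use wherever the scheduler_pbs_job_script for a running job actually lives):

```shell
# Example only (path is an assumption): print the last line of a
# generated PBS job script, which per Marco names the updated
# location of the glidein startup script.
tail -n 1 scheduler_pbs_job_script
```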
Hi,
I don't see the glidein_startup.sh in the files I have available. I do have the batch script that was executed, named scheduler_pbs_job_script; see the attachment (with a .txt suffix to get around the upload restriction).
Doug
by /DC=org/DC=cilogon/C=US/O=Ohio Technology Consortium (OH-TECH)/CN=Douglas Johnson A20836
by Marco Mambelli
I'm adding here and to the gwms ticket a patch to improve the error message.
Thanks, Marco.
Yeah, I've been seeing the 'Not enough arguments' error [1] pop up here and there again in the last month or so. But since it hasn't been as widespread as it was in September after that glideinWMS upgrade, I haven't spent a lot of time investigating. This is the 3rd entry where I've seen the problem. In some of these other cases, it doesn't appear to affect all factories. Here, in this case, there is also the associated hold reason [2] given for the held glideins at the factory. I'm not sure what the source of the problem is, but I suspect globus/GRAM.
Marty Kandes
UCSD Glidein Factory Operations
[1]
Sat Oct 31 11:35:26 EDT 2015 Not enough arguments in fetch_file main error_gen.sh error_gen.faulAe.sh regular 0 TRUE FALSE
[2]
3219131.9 fefermifife 11/1 03:06 Globus error 31: the job manager failed to cancel the job as requested
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
Could you attach the glidein_startup.sh script that was run? This might be an issue we need to bring up with the GlideinWMS developers.
I'm CCing the factory support, they may have some insight as well.
Thanks,
Kevin
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Kevin Retzke 3130
by OSG-GOC
Hi,
We're experiencing glidein failures at OSC. The output from a failed job is attached as the file named 'stdout'. There are no apparent errors in the stdout of the job; that file is attached as well. Please let us know whether this looks like a problem local to OSC or a remote one. If it's local to OSC, some guidance on where to start looking would be appreciated.
Doug
by /DC=org/DC=cilogon/C=US/O=Ohio Technology Consortium (OH-TECH)/CN=Douglas Johnson A20836