By the end of May 2018, the ticketing system at https://ticket.opensciencegrid.org will be retired and support will be provided at https://support.opensciencegrid.org. Throughout this transition the support email ([email protected]) will be available as a point of contact.
Please see the service migration page for details: https://opensciencegrid.github.io/technology/policy/service-migrations-spring-2018/#ticket
by Brian Lin
Alright, I'll close this ticket. Thanks for your patience and let us know if you need any help in the future!
Cheers,
Brian
I'm inclined to declare victory on this.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
by Brian Lin
Troy,
There was a thread on the gt-user mailing list that sounded very similar to your problem here (titled "[gt-user] Problems with PBS SEG on Torque 5.1."). If you're getting Globus directly from their repos, it sounds like they have a fix available in the 'globus-gram-job-manager-pbs' packages in their unstable repo. If you would like to turn the SEG back on, you can try these new packages. Otherwise, you can leave the SEG off, and I think we can close this ticket because it looks like jobs are reporting correctly again.
- Brian
I know our other site that was experiencing issues (hyak) is a PBS site, but I don't know whether they had updated to a later version of TORQUE or not. I agree, things look better on the graph, so let's give it a bit before we consider this 'solved'.
I disabled SEG for our GRAM instance just for grins, and surprisingly things seemed to start working better in my manual testing using globusrun. A bunch of Nova jobs appeared shortly thereafter and are currently waiting in our local queue to run. I'm inclined to see how those jobs do over the long weekend.
This ticket was first opened on Nov 2, about a month and a half after we upgraded Oakley to TORQUE 5.1.1; during that time, no Nova jobs were submitted to Oakley. My current working theory is that the logging code in newer versions of TORQUE changed just enough that the SEG's support for it is finicky at best and broken at worst. Were other OSG sites that had trouble with GRAM also running new-ish versions of TORQUE?
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
by Brian Lin
Ok in that case, how exactly are you providing the PBS logs so that the
SEG can consume them? There should also be a directory that contains
files for each GRAM job and they should contain the status. Are all
those job statuses 'idle' or equivalent?
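For reference, the PBS SEG tails the TORQUE server logs from a directory given in its configuration; on a stock install that lives in /etc/globus/globus-pbs.conf, with something along these lines (the path below is an assumption, please adjust it for your site):

```
# /etc/globus/globus-pbs.conf (hypothetical example)
log_path="/var/spool/torque/server_logs"
```

If that path doesn't match where TORQUE actually writes its server logs, the SEG will never see the job state transitions.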
Thanks,
Brian
I have significantly more experience with GRAM than HTCondor, so I'm going to keep working on the GRAM side for a bit longer.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
by Brian Lin
Doug/Troy,
Sorry for the confusion about your supported VOs; I was under the impression that you also supported OSG. I'm glad we got that sorted out.
I don't think we need to start a new ticket. Although jobs (both user jobs and pilot jobs) are completing successfully, the pilot jobs are still not reporting their state back to the factory properly: http://gfactory-1.t2.ucsd.edu/factory/monitor/factoryStatus.html?entry=Nova_US_OSC_osg&frontend=Fermilab-fifebatch_frontend&infoGroup=running&elements=StatusRunning. We've got two options for fixing this:
1) Replace your GRAM gatekeeper software with HTCondor-CE.
2) Troubleshoot your GRAM gatekeeper.
We recommend option 1 since GRAM support is minimal at this stage and we really don't know how deep the rabbit hole goes for the GRAM gatekeeper issue. Let us know how you would like to proceed.
Thanks,
Brian
by djohnson@....
Hi Marty,
Thanks for removing the microbone VO for the time being. We're having
an internal discussion about supporting this experiment later this
morning, and we will make a decision soon. Please make sure to
contact me before any other changes to VO membership are made, and
let me know if there are problems with our site contact details.
If the NOvA glideins from yesterday have run successfully, I'd like to
recommend we close this ticket. And if there are lingering issues, I
recommend we create new tickets for each distinct item that remains.
Doug
On Thu, 14 Jan 2016 14:30:00 -0500,
Open Science Grid FootPrints wrote:
> [Duplicate message snipped]
Hi Troy,
We are currently observing Nova glideins pending at the gatekeeper, which means they are probably waiting in your local batch queue. We'll let you know if we see them run through OSC successfully.
Marty Kandes
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
I forgot to mention that we discovered yesterday that the globus-scheduler-event-generator service for our OSG GRAM instance broke in late December due to a server move. That has been fixed, AFAICT.
Is OSG still seeing glide-in errors at OSC now that we have established that pilot/osg-flock.grid.iu.edu shouldn't have been trying to send jobs here?
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
> I have no knowledge about OSGVO support at OSC but we just added Microboone yesterday because there is an agreement for a Microboone scientist (Project: PES0709 (Randy Johnson, UCN)) to get cycles.
> Yeah, I was just about to mention that. We recently got the Microboone request: https://ticket.opensciencegrid.org/28198
That request was premature. They only received a startup allocation very recently, and we have not agreed to support OSG jobs for them yet.
In the future, please *DO NOT* add any VOs to the GLIDEIN_Supported_VOs list for OSC without first consulting with either Doug Johnson or myself. As of right now, the GLIDEIN_Supported_VOs list for OSC should consist of Nova only. (Also, for future reference, how would I be able to check the current contents of that list myself?)
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
Hi all,
FYI - We've gone ahead and removed OSGVO from the GLIDEIN_Supported_VOs list.
Marty Kandes
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
Joe,
Yeah, I was just about to mention that. We recently got the Microboone request: https://ticket.opensciencegrid.org/28198
Marty Kandes
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
I have no knowledge about OSGVO support at OSC but we just added Microboone yesterday because there is an agreement for a Microboone scientist (Project: PES0709 (Randy Johnson, UCN)) to get cycles.
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Joe Boyd 579
Hi all,
Perhaps there has been some confusion here. Nova glideins should be submitted and run under the Fermilab VO DN [1], not the OSG VO DN, unless there has been a change I'm not aware of. Note, however, that we have Nova, OSGVO, and Microboone all listed in the GLIDEIN_Supported_VOs list in the factory configuration for OSC. Please confirm whether you would like OSGVO and Microboone, which also runs under the Fermilab VO DN, to be removed from this list. If we do this, then I believe you'll only receive the Nova-specific user jobs from the Fermilab VO frontend at OSC.
Marty Kandes
UCSD Glidein Factory Operations
[1]
subject : /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=frontend/fifebatch.fnal.gov/CN=proxy
issuer : /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=frontend/fifebatch.fnal.gov
identity : /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=frontend/fifebatch.fnal.gov
type : proxy
strength : 1024 bits
path : /var/lib/gwms-factory/client-proxies/user_fefermifife/glidein_gfactory_instance/credential_fifebatchgpvmhead1_OSG_gWMSFrontend.OSG_nova_714925
timeleft : 264:00:51
key usage : Digital Signature, Key Encipherment, Data Encipherment
=== VO fermilab extension information ===
VO : fermilab
subject : /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=frontend/fifebatch.fnal.gov
issuer : /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=voms2.fnal.gov
attribute : /fermilab/nova/Role=pilot/Capability=NULL
attribute : /fermilab/accelerator/Role=NULL/Capability=NULL
attribute : /fermilab/annie/Role=NULL/Capability=NULL
attribute : /fermilab/argoneut/Role=NULL/Capability=NULL
attribute : /fermilab/cdms/Role=NULL/Capability=NULL
attribute : /fermilab/chips/Role=NULL/Capability=NULL
attribute : /fermilab/coupp/Role=NULL/Capability=NULL
attribute : /fermilab/darkside/Role=NULL/Capability=NULL
attribute : /fermilab/des/Role=NULL/Capability=NULL
attribute : /fermilab/dune/Role=NULL/Capability=NULL
attribute : /fermilab/genie/Role=NULL/Capability=NULL
attribute : /fermilab/gm2/Role=NULL/Capability=NULL
attribute : /fermilab/grid/Role=NULL/Capability=NULL
attribute : /fermilab/lar1/Role=NULL/Capability=NULL
Hi Brian,
Thanks for the clarification, but this is a change in behavior for us.
We currently only support the NOvA VO at OSC. All OSG initiated jobs
need to go through that VO for the purposes of accounting against the
NOvA collaboration's allocation at OSC. In our VO configuration we
have all the NOvA OSG initiated jobs going through a single user
account named nova. This has been how glideins have worked until the
last few weeks/month. I can't emphasize strongly enough that we are
not a general OSG resource, and that we have to charge OSG jobs
against a specific allocation at OSC based on the user ID the job runs
as.
If the NOvA jobs are now going to be initiated by
"/DC=com/DC=DigiCert-Grid/O=Open Science
Grid/OU=Services/CN=pilot/osg-flock.grid.iu.edu", shouldn't this DN be
specified by the VO by managing it through VOMS?
We may support additional VOs in the future, but again, we will
support those VOs because members of the VOs are faculty or PIs in
Ohio, and have been allocated computation cycles and storage at OSC
through OSC allocations. How would we preserve local accounting if
all jobs are being started through osg-flock.grid.iu.edu? There can't
be any DNs in common between VOs.
Doug
On Wed, 13 Jan 2016 18:44:00 -0500,
Open Science Grid FootPrints wrote:
> [Duplicate message snipped]
by Brian Lin
I'm not surprised that you see those failures, because those jobs need to be mapped to a user that can submit batch jobs. The main documentation for what the pilots are doing is here: http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.prd/index.html. Pilot jobs are an essential part of the OSG infrastructure: when a pilot lands on a compute node, it starts up the HTCondor worker stack that communicates with the OSG and allows OSG users to run jobs at your site.
If you don't want to allow this, we should talk to factory ops so that you can drop support for the OSG VO.
> I can't speak to exactly why pilot jobs need to create '~/.globus' but the pilots need to do a lot of things to set up their payloads so they can communicate back to the OSG and accept user jobs. I wouldn't expect jobs mapped to 'nobody' to work. Is there another user you could point it to?
For the time being, I've remapped that DN to the user account for rsv, which is local to our OSG VO box and can't submit batch jobs. However, that too is generating errors:
root@osg:/var/log/globus# tail /var/log/globus/gram_rsv.log
ts=2016-01-13T20:32:21.866456Z id=19517 event=gram.job.end level=ERROR gramid=/16506032084249176766/14500289408298649993/ job_status=4 status=-17 reason="the job failed when the job manager attempted to run it"
ts=2016-01-13T20:32:28.148491Z id=19517 event=gram.job.end level=ERROR gramid=/16506032086491899326/14500289408298649993/ job_status=4 status=-17 reason="the job failed when the job manager attempted to run it"
ts=2016-01-13T20:32:28.292851Z id=19517 event=gram.job.end level=ERROR gramid=/16506032084372714686/14500289408298649993/ job_status=4 status=-17 reason="the job failed when the job manager attempted to run it"
ts=2016-01-13T20:32:28.398011Z id=19517 event=gram.job.end level=ERROR gramid=/16506032083232984766/14500289408298649993/ job_status=4 status=-17 reason="the job failed when the job manager attempted to run it"
ts=2016-01-13T20:32:28.429724Z id=19517 event=gram.job.end level=ERROR gramid=/16506032083335676606/14500289408298649993/ job_status=4 status=-17 reason="the job failed when the job manager attempted to run it"
ts=2016-01-13T20:32:28.555567Z id=19517 event=gram.job.end level=ERROR gramid=/16506032084382490046/14500289408298649993/ job_status=4 status=-17 reason="the job failed when the job manager attempted to run it"
ts=2016-01-13T20:32:28.591012Z id=19517 event=gram.job.end level=ERROR gramid=/16506032083541326526/14500289408298649993/ job_status=4 status=-17 reason="the job failed when the job manager attempted to run it"
ts=2016-01-13T20:32:28.595550Z id=19517 event=gram.job.end level=ERROR gramid=/16506032085156604606/14500289408298649993/ job_status=4 status=-17 reason="the job failed when the job manager attempted to run it"
ts=2016-01-13T20:32:28.699050Z id=19517 event=gram.job.end level=ERROR gramid=/16506032083296032446/14500289408298649993/ job_status=4 status=-17 reason="the job failed when the job manager attempted to run it"
ts=2016-01-13T20:32:28.699161Z id=19517 event=gram.job.end level=ERROR gramid=/16506032086786345406/14500289408298649993/ job_status=4 status=-17 reason="the job failed when the job manager attempted to run it"
Is there documentation somewhere that describes what this pilot/osg-flock.grid.iu.edu DN is trying to do? I have googled variations on it several times and not really come up with anything useful.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
by Brian Lin
> I'm confused by this. AFAICT, the OSG software is managing the DNs in /etc/grid-security/grid-mapfile on our OSG VO box, so shouldn't that include "/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=pilot/osg-flock.grid.iu.edu"? If not, to what local user should that DN map? And what happens if we need to add a second OSG project on OSC resources -- can that be accommodated on the same VO box, or will we need a second VO box?
You can choose which user it maps to in /etc/edg-mkgridmap.conf by changing the last field of the following lines:
# USER-VO-MAP osg OSG -- 23 -- Rob Quick (rquick@....)
group vomss://voms.opensciencegrid.org:8443/voms/osg osg
group vomss://voms.grid.iu.edu:8443/voms/osg osg
Then apply the changes by running `edg-mkgridmap`. If you would like to support multiple VOs (Virtual Organizations = OSG projects), you can map other VOs to other users, or to the same user if that suits you. This can all be supported on one CE (Compute Element = VO box).
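For example, a static grid-mapfile entry mapping the pilot DN to a local account looks like the following (the 'osgpilot' account name here is just a placeholder; exactly which file static entries belong in depends on how your edg-mkgridmap.conf is set up, since edg-mkgridmap regenerates the main grid-mapfile):

```
"/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=pilot/osg-flock.grid.iu.edu" osgpilot
```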
I can't speak to exactly why pilot jobs need to create '~/.globus' but the pilots need to do a lot of things to set up their payloads so they can communicate back to the OSG and accept user jobs. I wouldn't expect jobs mapped to 'nobody' to work. Is there another user you could point it to?
> > 1) the osg-flock issue is different, they are never running because yes, OSC needs to whitelist their DN to the gridmap file.
>
> I'm confused by this. AFAICT, the OSG software is managing the DNs in /etc/grid-security/grid-mapfile on our OSG VO box, so shouldn't that include "/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=pilot/osg-flock.grid.iu.edu"? If not, to what local user should that DN map? And what happens if we need to add a second OSG project on OSC resources -- can that be accommodated on the same VO box, or will we need a second VO box?
I did a little more digging into this and found the config files for edg-mkgridmap. I added an entry to map "/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=pilot/osg-flock.grid.iu.edu" to the nobody user. However, now we're seeing errors like the following:
root@osg:/var/log/globus# more /var/log/globus/gram_nobody.log
ts=2016-01-13T18:50:32.772783Z id=7220 event=gram.make_job_dir.end level=ERROR gramid=/16506016693465985926/14500289408298687149/ status=-22 path=//.globus msg="Error creating directory" errno=2 reason="No such file or directory"
ts=2016-01-13T18:50:32.945021Z id=7220 event=gram.make_job_dir.end level=ERROR gramid=/16506016693839809926/14500289408298687149/ status=-22 path=//.globus msg="Error creating directory" errno=2 reason="No such file or directory"
ts=2016-01-13T18:53:41.716176Z id=7619 event=gram.make_job_dir.end level=ERROR gramid=/16506016690739335926/14500289408298658176/ status=-22 path=//.globus msg="Error creating directory" errno=2 reason="No such file or directory"
ts=2016-01-13T18:53:41.882591Z id=7619 event=gram.make_job_dir.end level=ERROR gramid=/16506016692255845366/14500289408298658176/ status=-22 path=//.globus msg="Error creating directory" errno=2 reason="No such file or directory"
ts=2016-01-13T18:56:14.339399Z id=7741 event=gram.make_job_dir.end level=ERROR gramid=/16506017792204720721/14500289408298670491/ status=-22 path=//.globus msg="Error creating directory" errno=2 reason="No such file or directory"
ts=2016-01-13T18:56:14.494684Z id=7741 event=gram.make_job_dir.end level=ERROR gramid=/16506017791931179601/14500289408298670491/ status=-22 path=//.globus msg="Error creating directory" errno=2 reason="No such file or directory"
ts=2016-01-13T18:58:41.759802Z id=7822 event=gram.make_job_dir.end level=ERROR gramid=/16506017790855326121/14500289408298689940/ status=-22 path=//.globus msg="Error creating directory" errno=2 reason="No such file or directory"
ts=2016-01-13T18:58:41.883412Z id=7822 event=gram.make_job_dir.end level=ERROR gramid=/16506017792564247721/14500289408298689940/ status=-22 path=//.globus msg="Error creating directory" errno=2 reason="No such file or directory"
What exactly is the pilot/osg-flock.grid.iu.edu service trying to do here that it needs to create a ~/.globus directory?
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
Attached are the stdout and stderr files from the last nova job submitted to OSC's Oakley cluster yesterday.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
Also, our GRAM instance is starting to see RSL errors again:
ts=2016-01-12T18:22:25.818325Z id=2642 event=gram.validate_rsl.end level=ERROR msg="Unsupported RSL attribute" attribute=invalid status=-48
ts=2016-01-12T18:22:25.818345Z id=2642 event=gram.query.end level=ERROR gramid=/16506029772674911736/14500289408298685574/ uri="/16506029772674911736/14500289408298685574/" msg="Error processing query" status=-48 reason="the provided RSL could not be properly parsed
ts=2016-01-12T18:27:51.062297Z id=2642 event=gram.validate_rsl.end level=ERROR msg="Unsupported RSL attribute" attribute=invalid status=-48
ts=2016-01-12T18:27:51.062340Z id=2642 event=gram.query.end level=ERROR gramid=/16506167220725199961/14500289408298660531/ uri="/16506167220725199961/14500289408298660531/" msg="Error processing query" status=-48 reason="the provided RSL could not be properly parsed
ts=2016-01-12T18:27:51.069415Z id=2642 event=gram.validate_rsl.end level=ERROR msg="Unsupported RSL attribute" attribute=invalid status=-48
ts=2016-01-12T18:27:51.069446Z id=2642 event=gram.query.end level=ERROR gramid=/16506213379776989516/14500289408298685574/ uri="/16506213379776989516/14500289408298685574/" msg="Error processing query" status=-48 reason="the provided RSL could not be properly parsed
ts=2016-01-12T18:27:51.079587Z id=2642 event=gram.validate_rsl.end level=ERROR msg="Unsupported RSL attribute" attribute=invalid status=-48
ts=2016-01-12T18:27:51.079618Z id=2642 event=gram.query.end level=ERROR gramid=/16506029772674911736/14500289408298685574/ uri="/16506029772674911736/14500289408298685574/" msg="Error processing query" status=-48 reason="the provided RSL could not be properly parsed
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
> 4) Troy/Doug, first I'd like to apologize for the state of this ticket. If Marty/Kevin come back with more issues, we may want to start thinking about scrapping the globus gatekeeper entirely and moving to HTCondor-CE. Another site was having issues with the globus gatekeeper and after lengthy troubleshooting we decided that switching was the easier solution. They were able to install the new software and accept new jobs in half a day.
I will have to check internally on possible timelines for that. Unfortunately this is unfunded effort for us, so it's not going to be super high priority.
> 1) the osg-flock issue is different, they are never running because yes, OSC needs to whitelist their DN to the gridmap file.
I'm confused by this. AFAICT, the OSG software is managing the DNs in /etc/grid-security/grid-mapfile on our OSG VO box, so shouldn't that include "/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=pilot/osg-flock.grid.iu.edu"? If not, to what local user should that DN map? And what happens if we need to add a second OSG project on OSC resources -- can that be accommodated on the same VO box, or will we need a second VO box?
> But the real problem we've been trying to understand is the gram failing to report accurate job status back to our factory submit host.
I'll take a further look at this in the next day or two.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
by ahimmel@....
This libfann issue is a red herring. It is an unrelated problem that we already know about and are working on fixing.
-Alex
On Jan 11, 2016, at 4:42 PM, Open Science Grid FootPrints <osg@....<mailto:osg@....>> wrote:
[Duplicate message snipped]
I'll also reply for Marty on behalf of factory ops since he's not back until tomorrow.
There are two independent issues here.
In reply to Brian Lin's questions:
https://ticket.opensciencegrid.org/27288#1452545246
1) the osg-flock issue is different, they are never running because yes, OSC needs to whitelist their DN to the gridmap file.
But the real problem we've been trying to understand is the gram failing to report accurate job status back to our factory submit host. Please see this plot:
http://gfactory-1.t2.ucsd.edu/factory/monitor/factoryStatus.html?entry=Nova_US_OSC_osg&frontend=Fermilab-fifebatch_frontend&infoGroup=running&elements=StatusRunning,ClientGlideRunning,ClientGlideIdle,StatusIdle,&rra=0&window_min=0&window_max=0&timezone=-8
Yes, Kevin is right, Nova is running at the moment, but that will only be temporary. In our factory plot, "claimed" is non-zero, which means pilots were actually running at OSC. However, our "running" green area is flatlined at 0. GRAM is not reporting back that those pilots ever transitioned from idle to running.
If Nova keeps submitting batches, eventually the factory will only see "idle" pilots in the queue and will no longer submit new ones, because these "idle" pilots have in fact already run to completion, but on our factory side Condor-G thinks they haven't even started yet.
This is the worrying issue that, in my opinion, warrants the transition to HTCondor-CE.
Thanks,
Jeff Dost
OSG Glidein Factory Operations
by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost
> Do you have Nova's CVMFS directory (/cvmfs/nova.opensciencegrid.org) set up? I'm wondering if the cert errors are due to either A) the initial configuration failure or B) the script may be looking for certs in a CVMFS directory.
AFAICT, yes:
root@oak-rw:/nfs/13/troy# pdsh -w n0011,n0134,n0148,n0580 ls -al /cvmfs/nova.opensciencegrid.org/novasoft/slf6/novasoft/releases/S16-01-07/lib | dshbak -c
----------------
n[0011,0134,0148,0580]
----------------
total 46
drwxr-xr-x 4 cimsrvr 301 69 Jan 7 17:22 .
drwxr-xr-x 97 cimsrvr 301 4096 Jan 7 19:01 ..
drwxr-xr-x 2 cimsrvr 301 20480 Jan 7 17:42 Linux2.6-GCC-debug
drwxr-xr-x 2 cimsrvr 301 20480 Jan 7 17:50 Linux2.6-GCC-maxopt
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
by Brian Lin
Troy,
Do you have Nova's CVMFS directory (/cvmfs/nova.opensciencegrid.org) set up? I'm wondering if the cert errors are due to either A) the initial configuration failure or B) the script may be looking for certs in a CVMFS directory.
Thanks,
Brian
My response to Enrique:
On 01/11/2016 04:20 PM, Enrique Arrieta Diaz wrote:
> Here are the messages that I see:
>
> *****************************************************************************************************
> terminate called after throwing an instance of 'cet::exception'
> what(): ---- Configuration BEGIN
> Unable to load requested library /cvmfs/nova.opensciencegrid.org/novasoft/slf6
> /novasoft/releases/S16-01-07/lib/Linux2.6-GCC-maxopt/libLEM_dict.so
> libdoublefann.so.2: cannot open shared object file: No such file or directory
> ---- Configuration END
Is this application assuming that libdoublefann.so.2 is in the default library search path? Because AFAICT, that is not the case on Oakley.
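For example, the following generic check (run on one of our compute nodes; the library path is copied from the stderr above) would show whether the dynamic linker can resolve the dependency:

```shell
# Ask the dynamic linker which dependencies of the plugin resolve.
# If libdoublefann shows up as "not found" (or ldd cannot read the
# file at all on this host), it is not on the default search path.
libdir=/cvmfs/nova.opensciencegrid.org/novasoft/slf6/novasoft/releases/S16-01-07/lib/Linux2.6-GCC-maxopt
ldd "$libdir/libLEM_dict.so" 2>/dev/null | grep fann \
  || echo "libdoublefann not resolvable on this host"
```

If it is indeed missing from the default path, having the job wrapper add the directory containing libdoublefann.so.2 to LD_LIBRARY_PATH would be the usual workaround.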
> IFDH_DEBUG=0 => 0
> IFDH_DEBUG=0 => 0
> depth=2 DC = com, DC = DigiCert-Grid, O = DigiCert Grid, CN = DigiCert Grid Root CA
> verify error:num=19:self signed certificate in certificate chain
> verify return:0
> IFDH_DEBUG=0 => 0
> depth=2 DC = com, DC = DigiCert-Grid, O = DigiCert Grid, CN = DigiCert Grid Root CA
> verify error:num=19:self signed certificate in certificate chain
> verify return:0
> IFDH_DEBUG=0 => 0
> depth=2 DC = com, DC = DigiCert-Grid, O = DigiCert Grid, CN = DigiCert Grid Root CA
> verify error:num=19:self signed certificate in certificate chain
> verify return:0
> IFDH_DEBUG=0 => 0
> Mon Jan 11 15:27:39 EST 2016 ./arrieta1-Offsite_test_OSC-20160111_1401.sh COMPLETED with exit status 250
> IFDH_DEBUG=0 => 0
> *****************************************************************************************************
That's odd. That seems like a cert problem, but the cert in question is in our OSG VO box's /etc/grid-security/certificates directory as well as the moral equivalent in the osg-wn-client installation.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
Doug and I got some out-of-band email from Enrique Arrieta Díaz:
>On 01/11/2016 03:13 PM, Troy Baer wrote:
>> On 01/11/2016 03:03 PM, Enrique Arrieta Diaz wrote:
>>> Hello Troy, I just submitted a new set of 20 Productions test jobs to
>>> OSC:
>>>
>>> http://samweb.fnal.gov:8480/station_monitor/nova/stations/nova/projects/arrieta1-Offsite_test_OSC-20160111_1401
>>>
>>>
>>> 6575523.0@....
>>
>> We have 4 Nova jobs currently running on Oakley. I am not sure that
>> you are using the term "job" in quite the same way we are, though --
>> the samweb.fnal.gov URL above appears to refer to only one of the four
>> TORQUE jobs from Nova that I see.
>>
>I see that all of the OSG jobs in the above URL have ended with status
>"bad" and last activity "process ended - bad". Do you see anything on
>your end with more useful error messages?
*****************************************************************************************************
terminate called after throwing an instance of 'cet::exception'
what(): ---- Configuration BEGIN
Unable to load requested library /cvmfs/nova.opensciencegrid.org/novasoft/slf6
/novasoft/releases/S16-01-07/lib/Linux2.6-GCC-maxopt/libLEM_dict.so
libdoublefann.so.2: cannot open shared object file: No such file or directory
---- Configuration END
IFDH_DEBUG=0 => 0
IFDH_DEBUG=0 => 0
depth=2 DC = com, DC = DigiCert-Grid, O = DigiCert Grid, CN = DigiCert Grid Root CA
verify error:num=19:self signed certificate in certificate chain
verify return:0
IFDH_DEBUG=0 => 0
depth=2 DC = com, DC = DigiCert-Grid, O = DigiCert Grid, CN = DigiCert Grid Root CA
verify error:num=19:self signed certificate in certificate chain
verify return:0
IFDH_DEBUG=0 => 0
depth=2 DC = com, DC = DigiCert-Grid, O = DigiCert Grid, CN = DigiCert Grid Root CA
verify error:num=19:self signed certificate in certificate chain
verify return:0
IFDH_DEBUG=0 => 0
Mon Jan 11 15:27:39 EST 2016 ./arrieta1-Offsite_test_OSC-20160111_1401.sh COMPLETED with exit status 250
IFDH_DEBUG=0 => 0
*****************************************************************************************************
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
Nova had about 40 jobs waiting to run at OSC for the past few days; these all ran this morning and appear to have completed successfully.
Kevin
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Kevin Retzke 3130
by Brian Lin
Ok, this ticket is really getting out of hand and is in dire need of a summary.
1) Marty, what exactly are the problems that you're seeing from the factory end? Simple auth errors? Because it sounds like the osg-flock pilot DN has NOT been added to their grid-mapfile when it very much should be.
2) Kevin, are you seeing any issues with NOvA jobs? It's not very clear in this ticket what the original problem actually was, or whether the issues ever actually affected NOvA.
3) Marty/Kevin, did the factories submit lots of pilots to OSC over the weekend that would cause the OOM on their CE?
4) Troy/Doug, first I'd like to apologize for the state of this ticket. If Marty/Kevin come back with more issues, we may want to start thinking about scrapping the globus gatekeeper entirely and moving to HTCondor-CE. Another site was having issues with the globus gatekeeper and after lengthy troubleshooting we decided that switching was the easier solution. They were able to install the new software and accept new jobs in half a day.
- Brian
We've had 4 jobs arrive this afternoon. I did the following to capture their state 10-15 minutes after they started:
pdsh -w n0011,n0134,n0148,n0580 cd /tmp \; find -user nova \| xargs tar czf /tmp/nova-files.tgz
The results of these are fairly large, ranging from ~915MB to 4.5GB:
troy@oakley01:/fs/lustre/troy$ ls -alh
total 12G
drwxr-xr-x 7 troy sysp 28K Jan 11 14:48 .
drwxrwxrwt 202 root root 52K Jan 7 12:47 ..
[...]
-rw-r--r-- 1 troy sysp 3.8G Jan 11 14:45 n0011-nova-files.tgz
-rw-r--r-- 1 troy sysp 4.5G Jan 11 14:45 n0134-nova-files.tgz
-rw-r--r-- 1 troy sysp 1.9G Jan 11 14:46 n0148-nova-files.tgz
-rw-r--r-- 1 troy sysp 915M Jan 11 14:46 n0580-nova-files.tgz
[...]
These jobs have been running for more than ~20 minutes, so AFAICT they are successful from our perspective. I'm not sure how I would verify that they're successful from OSG's PoV, though.
We're also seeing lots of attempts by the DN "/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=pilot/osg-flock.grid.iu.edu" to connect to our OSG VO box, but AFAICT we've never been asked to add them to our grid-mapfile. OTOH, we're not seeing any more of the RSL attribute errors we were seeing last month.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
Our OSG VO box OOMed over the weekend. It looks like something on the OSG side kept starting new connections to it till it fell over:
root@osg:/nfs/13/troy# lastcomm -f /var/account/pacct.1 | grep -c condor_starter
9680
root@osg:/nfs/13/troy# lastcomm -f /var/account/pacct.1 | grep -c globus-gatekeep
25837
root@osg:/nfs/13/troy# grep -c nova /var/spool/batch/torque/client_logs/20160110
649
root@osg:/nfs/13/troy# grep -c tomcat /var/spool/batch/torque/client_logs/20160110
1062
We did have 12 nova jobs appear after our OSG VO box recovered, but only 4 of them ran for more than ~20 minutes and none ran for more than ~40 minutes.
If OSG is going to push some test jobs at us, I wish we would get some prior warning, since we kind of need to be paying attention while they are running.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
by Vince Neal
Good morning,
Checking in on the status of testing. If I may assist, please let me know.
Thank you,
by ahimmel@....
I will ask, but our offsite expert may not be back from the holidays yet.
-Alex
On Jan 5, 2016, at 4:36 PM, Open Science Grid FootPrints <osg@....<mailto:osg@....>> wrote:
[Duplicate message snipped]
It looks like the files in question have been purged in the interim. Can somebody try to send some Nova jobs our way tomorrow morning?
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
by Brian Lin
Troy,
Could you host the file somewhere so that I could grab it?
Thanks,
Brian
by Brian Lin
Could you try e-mailing it or placing it on a web-server so I could grab it?
Thanks,
Brian
by troy@....
I tried that yesterday, and the resulting file was larger than the OSG
ticket system would accept.
On 12/15/2015 12:46 PM, Open Science Grid FootPrints wrote:
> [Duplicate message snipped]
by Brian Lin
Can you grab one of the /tmp/pbstmp.* folders and attach it to this ticket?
AFAICT, we had 97 nova jobs submitted this afternoon, but only 5 of them ran for significantly longer than 20 minutes.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
Something appears to have changed, because now we have a bunch of nova jobs running.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
by troy@....
On 12/14/2015 02:02 PM, Open Science Grid FootPrints wrote:
> [Duplicate message snipped]
by troy@....
We're seeing a lot of errors like the following in our GRAM logs now:
ts=2015-12-14T18:56:42.460154Z id=31271 event=gram.validate_rsl.end
level=ERROR msg="Unsupported RSL attribute" attribute=invalid status=-48
ts=2015-12-14T18:56:42.460194Z id=31271 event=gram.query.end level=ERROR
gramid=/16506225426086058416/14500289408298685574/
uri="/16506225426086058416/14500289408298685574/" msg="Error processing
query" status=-48 reason="the provided RSL could not be properly parsed
ts=2015-12-14T18:56:42.482060Z id=31271 event=gram.validate_rsl.end
level=ERROR msg="Unsupported RSL attribute" attribute=invalid status=-48
ts=2015-12-14T18:56:42.482105Z id=31271 event=gram.query.end level=ERROR
gramid=/16506255115752160996/14500289408298685574/
uri="/16506255115752160996/14500289408298685574/" msg="Error processing
query" status=-48 reason="the provided RSL could not be properly parsed
--Troy
--
Troy Baer
Senior HPC Systems Engineer
Ohio Supercomputer Center
http://www.osc.edu/
Hi,
There are currently Nova jobs trying to run at OSC; looks like our frontend has been asking for glideins at OSC for several days now (since at least 12/8) without any starting.
Kevin
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Kevin Retzke 3130
Hi Troy,
The factory doesn't actually generate/send any jobs directly, so we'll need to get the Fermilab or OSG frontend people to send some jobs your way. I believe Kevin Retzke was taking care of that earlier; Kevin, could you arrange for some more Nova jobs to land on their site?
Brendan Dennis
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Brendan Dennis 2659
To collect output, we need the factory to send us some jobs.
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
It looks like the glidein logs have a different naming scheme than on the factory, as those gmon.out files are definitely not the glidein logs. It might be worth making a tarball of one of the /tmp/pbstmp.* folders in its entirety instead, to make sure we get everything associated with the run.
Brendan Dennis
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Brendan Dennis 2659
Since the ticket system won't accept .tgz files (which is a rant for another time), I've uploaded a base64-encoded version of the tarball. To extract, do the following:
base64 -d <osg-logs.tgz.base64encoded.txt | tar xzvf -
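For reference, the encoding side of that workaround can be sketched like this (the input path is an example, not necessarily the directory actually used):

```shell
# Sketch of the encoding side (input path is an example): pack a
# directory into a gzipped tarball and wrap it in base64 so a ticket
# system that rejects .tgz files will accept it as plain text.
tar czf - /tmp/pbstmp.5164908 | base64 > osg-logs.tgz.base64encoded.txt

# The recipient then reverses it:
base64 -d < osg-logs.tgz.base64encoded.txt | tar xzvf -
```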
by /DC=org/DC=cilogon/C=US/O=Ohio State University/CN=Troy Baer A31671
by troy@....
Sorry, this fell off the stack. I need to get my DN updated before I
can upload files to the ticket, and I haven't had a chance to request
that yet.
On 12/09/2015 10:36 AM, Open Science Grid FootPrints wrote:
> [Duplicate message snipped]
by Brian Lin
Doug/Troy: Any word on those files and the size of /tmp?
by Brian Lin
Doug/Troy: I'd also be curious to see other logs, namely the MasterLog, StarterLog, and StartdLog. How big is your /tmp partition? I ask because we've seen some issues with /tmp filling up and causing pilots to go haywire.
Thanks,
Brian
by Vince Neal
Hi Doug,
Unfortunately, files received via email cannot be attached to the ticket. You must use the web UI to attach files.
You should be able to reach this ticket directly using the following URL: https://ticket.grid.iu.edu/27288
Please let me know if I may assist.
Thank you,
Vince
by Brian Lin
Doug,
You'll need to use the web UI to attach files to the ticket. Could you look at another glidein and provide its *.out, *.err, and log dir?
Thanks,
Brian
by troy@....
Attached is a tarball of the following files from one of our compute
nodes that has 9 nova jobs running:
# find /tmp/pbstmp.*/glide* \( -name "*.out" -o -name "*.err" \)
/tmp/pbstmp.5167119/glide_FUO19t/execute/dir_9427/gmon.out
/tmp/pbstmp.5167120/glide_QxP8JX/execute/dir_17683/gmon.out
/tmp/pbstmp.5167121/glide_y1OFrM/execute/dir_7394/gmon.out
/tmp/pbstmp.5167132/glide_9sjUnb/execute/dir_12825/gmon.out
/tmp/pbstmp.5167133/glide_tFhiUP/execute/dir_22538/gmon.out
/tmp/pbstmp.5167141/glide_wA2FHE/execute/dir_5728/gmon.out
/tmp/pbstmp.5167143/glide_HtLbxp/execute/dir_20651/gmon.out
/tmp/pbstmp.5167146/glide_euGW9z/execute/dir_15368/gmon.out
/tmp/pbstmp.5168310/glide_E9Tg25/execute/dir_24003/gmon.out
/tmp/pbstmp.5168312/glide_DRsomU/execute/dir_15369/gmon.out
/tmp/pbstmp.5168314/glide_DNxLfE/execute/dir_19300/gmon.out
Unfortunately, I don't see any error messages (or much of anything
else) in those.
On 11/24/2015 06:15 PM, Open Science Grid FootPrints wrote:
> [Duplicate message snipped]
Troy,
If possible, can you upload some examples of any *.out or *.err logs you find in these temporary work directories as an attachment to the ticket here?
Marty Kandes
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
by troy@....
One wrinkle in this is that $TMPDIR on our systems is ephemeral; it only
lives as long as the TORQUE job, so we pretty much have to catch these
while they're running. I've tried poking around on some of the
currently running jobs, and I'm not really sure what I'm looking at:
[root@n0249 ~]# ls -al /tmp/pbstmp.5164908/glide_bfhhWd/log/
total 844
drwxr-xr-x 2 nova PES0656 4096 Nov 24 17:52 .
drwxr-xr-x 10 nova PES0656 4096 Nov 24 14:54 ..
-rw------- 1 nova PES0656 0 Nov 24 14:54 InstanceLock
-rw-r--r-- 1 nova PES0656 104 Nov 24 14:54 .master_address
-rw-r--r-- 1 nova PES0656 1932 Nov 24 14:54 MasterLog
prw------- 1 nova PES0656 0 Nov 24 17:52 procd_address
prw------- 1 nova PES0656 0 Nov 24 17:52 procd_address.watchdog
-rw-r--r-- 1 nova PES0656 434896 Nov 24 17:52 ProcLog
-rw-r--r-- 1 nova PES0656 104 Nov 24 14:54 .startd_address
-rw------- 1 nova PES0656 146 Nov 24 14:55 .startd_claim_id.slot1
-rw-r--r-- 1 nova PES0656 396618 Nov 24 17:50 StartdLog
-rw-r--r-- 1 nova PES0656 3659 Nov 24 14:55 StarterLog
[root@n0249 ~]# grep -i error /tmp/pbstmp.5164908/glide_bfhhWd/log/*Log
/tmp/pbstmp.5164908/glide_bfhhWd/log/MasterLog:11/24/15 14:54:12
(pid:10576) Daemon Log is logging: D_ALWAYS D_ERROR
/tmp/pbstmp.5164908/glide_bfhhWd/log/StartdLog:11/24/15 14:54:13
(pid:10580) Daemon Log is logging: D_ALWAYS D_ERROR D_JOB
/tmp/pbstmp.5164908/glide_bfhhWd/log/StartdLog:11/24/15 14:54:14
(pid:10580) VM-gahp server reported an internal error
/tmp/pbstmp.5164908/glide_bfhhWd/log/StarterLog:11/24/15 14:55:21
(pid:10607) Daemon Log is logging: D_ALWAYS D_ERROR
/tmp/pbstmp.5164908/glide_bfhhWd/log/StarterLog:11/24/15 14:55:22
(pid:10607) Error file:
/tmp/pbstmp.5164908/glide_bfhhWd/execute/dir_10607/_condor_stderr
[root@n0249 ~]# more
/tmp/pbstmp.5164908/glide_bfhhWd/execute/dir_10607/_condor_stderr
[...nothing...]
What does "VM-gahp server reported an internal error" mean? That's the
only obvious error I can see, and it seems to happen almost
immediately. Also, this job is clearly doing work:
[root@n0249 ~]# ps auxwf | grep nova
root 15584 0.0 0.0 103304 820 pts/8 S+ 18:03
0:00 \_ grep nova
nova 5524 0.0 0.0 110424 1612 ? Ss 14:53 0:00 \_ -sh
/var/spool/batch/torque/mom_priv/jobs/5164908.oak-batch.osc.edu.SC
nova 5597 0.0 0.0 106628 1816 ? S 14:53 0:00 \_
/bin/bash
/nfs/17/nova/.globus/.gass_cache/local/md5/3a/d6c4c10150ec60d3258681bacb155f/md5/97/ef9ce28e4e53f8ceecfce5d0719590/data
-v std -name gfactory_instance -entry Nova_US_OSC_osg -clientname
fifebatchgpvmhead2_OSG_gWMSFrontend.OSG_nova -schedd
schedd_glideins5@.... -proxy None -factory SDSC -web
http://gfactory-1.t2.ucsd.edu/factory/stage -sign
cd26d67f9e8ea668f2d8bfea737a13ef27c5fded -signentry
6496b05b9d283d89a011e6e62a554811c0151e78 -signtype sha1 -descript
description.fbobom.cfg -descriptentry description.fbobom.cfg -dir TMPDIR
-param_GLIDEIN_Client fifebatchgpvmhead2_OSG_gWMSFrontend.OSG_nova
-submitcredid 714925 -slotslayout fixed -clientweb
http://fifebatchgpvmhead2.fnal.gov/vofrontend/stage -clientsign
22678b74f42a46020074e625a4beebdf7d4d1d85 -clientsigntype sha1
-clientdescript description.fb5fm7.cfg -clientgroup OSG_nova
-clientwebgroup
http://fifebatchgpvmhead2.fnal.gov/vofrontend/stage/group_OSG_nova
-clientsigngroup e1941fbc24b3e0a74eb4a4aeea134bdf9e682b38
-clientdescriptgroup description.ebieT2.cfg -param_CONDOR_VERSION
default -param_GLIDEIN_Glexec_Use NEVER -param_GLIDEIN_Job_Max_Time
34800 -param_GLIDECLIENT_ReqNode
gfactory.minus,1.dot,t2.dot,ucsd.dot,edu -param_GLIDECLIENT_Rank 1
-param_GLIDEIN_Report_Failed NEVER -param_MIN_DISK_GBS 1
-param_GLIDEIN_Monitoring_Enabled False -param_HAS_USAGE_MODEL OFFSITE
-param_UPDATE_COLLECTOR_WITH_TCP True -param_CONDOR_ARCH default
-param_USE_MATCH_AUTH True -param_CONDOR_OS default
-param_GLIDEIN_Collector
fifebatchhead3.dot,fnal.dot,gov.colon,9620.minus,9630.semicolon,fifebatchhead4.dot,fnal.dot,gov.colon,9620.minus,9630
-cluster 3281717 -subcluster 2
nova 9850 0.0 0.0 9376 1380 ? S 14:54 0:00
\_ /bin/bash /tmp/pbstmp.5164908/glide_bfhhWd/main/condor_startup.sh
glidein_config
nova 10576 0.0 0.0 95836 8152 ? S 14:54
0:00 \_
/tmp/pbstmp.5164908/glide_bfhhWd/main/condor/sbin/condor_master -f
-pidfile /tmp/pbstmp.5164908/glide_bfhhWd/condor_master2.pid
nova 10579 0.1 0.0 23580 5024 ? S 14:54
0:13 \_ condor_procd -A
/tmp/pbstmp.5164908/glide_bfhhWd/log/procd_address -L
/tmp/pbstmp.5164908/glide_bfhhWd/log/ProcLog -R 1000000 -S 60 -C 16997
nova 10580 0.0 0.0 96668 9068 ? S 14:54
0:04 \_ condor_startd -f
nova 10607 0.0 0.0 96008 8520 ? S 14:55
0:00 \_ condor_starter -f fifebatch1.fnal.gov
nova 10611 0.0 0.0 9376 1272 ? S 14:55
0:00 \_ /bin/sh
/tmp/pbstmp.5164908/glide_bfhhWd/execute/dir_10607/condor_exec.exe
--limit 4 --multifile --export
DEST=/pnfs/nova/scratch/fts/ParticleID_dropbox --config
Production/fcl/prod_reco_pidpart_numi_job.fcl --source
/cvmfs/nova.opensciencegrid.org/novasoft/slf5/novasoft/setup/setup_nova.sh:-r:S15-05-04c:-b:maxopt:-e:/cvmfs/nova.opensciencegrid.org/externals:-5:/cvmfs/nova.opensciencegrid.org/novasoft/slf5/novasoft:-6:/cvmfs/nova.opensciencegrid.org/novasoft/slf6/novasoft
-X runNovaSAM.py --hashDirs --copyOut --outTier out1:reco --outTier
out2:lemsum --outTier out3:pidpart
nova 10673 0.0 0.0 9732 1860 ? S 14:55
0:00 \_ /bin/sh
./arrieta1-draining_prod_artdaq_FA14-10-03x.d_nd_genie_fhc_nonswap_ndnewpos_draining_reco_S15-05-04c_AND_pidpart_S15-05-04c-20151123_1402.sh
--limit 4 --multifile --export
DEST=/pnfs/nova/scratch/fts/ParticleID_dropbox --config
Production/fcl/prod_reco_pidpart_numi_job.fcl --source
/cvmfs/nova.opensciencegrid.org/novasoft/slf5/novasoft/setup/setup_nova.sh:-r:S15-05-04c:-b:maxopt:-e:/cvmfs/nova.opensciencegrid.org/externals:-5:/cvmfs/nova.opensciencegrid.org/novasoft/slf5/novasoft:-6:/cvmfs/nova.opensciencegrid.org/novasoft/slf6/novasoft
-X runNovaSAM.py --hashDirs --copyOut --outTier out1:reco --outTier
out2:lemsum --outTier out3:pidpart
nova 15518 0.3 0.0 220000 41488 ? S 18:01
0:00 \_ python
/cvmfs/nova.opensciencegrid.org/externals/NovaGridUtils/v01.44/NULL/bin/runNovaSAM.py
-c Production/fcl/prod_reco_pidpart_numi_job.fcl --hashDirs --copyOut
--outTier out1:reco --outTier out2:lemsum --outTier out3:pidpart
/tmp/pbstmp.5164908/glide_bfhhWd/execute/dir_10607/no_xfer/ifdh_16997_5524/neardet_genie_fhc_nonswap_ndnewpos-ndfluxv08_2000_r00010812_s13_c001_FA14-10-03x.d_v1_20151119_135356.sim.overlay.daq.root
nova 15527 94.2 1.3 1384208 653076 ? Rl 18:01
2:08 \_ nova -c
prod_reco_pidpart_numi_job_neardet_genie_fhc_nonswap_ndnewpos-ndfluxv08_2000_r00010812_s13_c001_FA14-10-03x.d_v1_20151119_135356.sim.overlay.daq.fcl
--sam-application-family=nova --sam-application-version=S15-05-04c
--sam-file-type=importedSimulated --sam-data-tier=out1:reco
--sam-stream-name=out1:out1 --sam-data-tier=out2:lemsum
--sam-stream-name=out2:out1 --sam-data-tier=out3:pidpart
--sam-stream-name=out3:out1
/tmp/pbstmp.5164908/glide_bfhhWd/execute/dir_10607/no_xfer/ifdh_16997_5524/neardet_genie_fhc_nonswap_ndnewpos-ndfluxv08_2000_r00010812_s13_c001_FA14-10-03x.d_v1_20151119_135356.sim.overlay.daq.root
On 11/24/2015 02:54 PM, Open Science Grid FootPrints wrote:
> [Duplicate message snipped]
--
Troy Baer
Senior HPC Systems Engineer
Ohio Supercomputer Center
http://www.osc.edu/
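Since these work directories vanish as soon as the TORQUE job ends, one way to catch the logs in time is to snapshot them periodically while jobs are running; a rough sketch (the destination path is an assumption):

```shell
# Rough sketch (DEST is an assumed path): copy glidein log dirs out of
# the ephemeral per-job /tmp/pbstmp.* directories into persistent
# storage before the TORQUE job, and with it $TMPDIR, goes away.
DEST=/nfs/13/troy/glidein-log-snapshots
mkdir -p "$DEST"
for d in /tmp/pbstmp.*/glide_*/log; do
    [ -d "$d" ] || continue                                # nothing matched
    job=$(echo "$d" | sed 's|^/tmp/pbstmp\.||; s|/.*||')   # TORQUE job id
    cp -r "$d" "$DEST/logs.$job.$(date +%s)"
done
```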
Hi Doug,
Is there any way you can try to find some of the glidein logs associated with the period when you observed this non-trivial CPU usage? The logs are still not being returned to the glidein factories after running at the site, if they are in fact running successfully. We don't have any confirmation of glideins running properly at OSC. In fact, during the last week, we only see that most glideins were held at the factories and never properly submitted to OSC. I've cleared out these held glideins from the factories. Maybe the new ones will not go held again.
Marty Kandes
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
by troy@....
From our end, it looks like OSG/nova jobs that have come in after
2015-11-20 12:50:00 EST have accrued non-trivial amounts of CPU time.
On 11/17/2015 03:02 PM, Open Science Grid FootPrints wrote:
> [Duplicate message snipped]
Doug,
No problem. Let us know when you're back and have had a chance to look for some glidein logs on your end.
Thanks,
Marty Kandes
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
Unfortunately I can't get the exact directory that the glidein logs will be in, since there are no glidein stdout/err logs being returned that would have the variable information set, but one place to look (besides TMPDIR) would be to check /tmp for glide_* folders, where * is a six-character alphanumeric string. A lot of glidein data and logs should be in those folders.
Also, I'm clearing out all of the stale held glideins again on the site, so you should start getting jobs attempting to run shortly.
Brendan Dennis
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Brendan Dennis 2659
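Brendan's suggestion above amounts to something like the following minimal sketch, checking both /tmp and TMPDIR (if set) on a worker node:

```shell
# Minimal sketch: list candidate glidein work directories on a worker
# node; checks /tmp plus $TMPDIR when that variable is set.
find /tmp ${TMPDIR:+"$TMPDIR"} -maxdepth 1 -type d -name 'glide_*' 2>/dev/null
```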
Hi,
I've been unable to spend time on this ticket before leaving for the SC15 conference. One thing that we still need to do is change the version of the osg-wn-client software on the compute nodes. We neglected to upgrade this software when we upgraded the VO box. Version 3.2.29 is installed. Please send updates on when we can expect running jobs, along with the specific names of log files you would want collected.
Doug
by /DC=org/DC=cilogon/C=US/O=Ohio Technology Consortium (OH-TECH)/CN=Douglas Johnson A20836
Doug,
You should be able to find the glidein logs on the worker nodes in work_dir="TMPDIR", where TMPDIR is a path environment variable you all probably set up on your end originally. I suspect it's wherever you found the original stderr/stdout files you sent us at the beginning of this ticket. Can you try to dig up a few more recent ones for us?
Thanks,
Marty Kandes
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
by Brian Lin
Marty,
In that case, where can Doug find the glidein logs? Presumably on the worker nodes but where?
Thanks,
Brian
Hi Brian,
Yes and no. The problem has 'disappeared' for us in the sense that glidein logs are no longer being returned to the factories. So what we really need is for someone to look at the glidein logs locally at OSC to determine what might be causing the problem, which now seems to have a similar profile to the possible GRAM/globus-related problem previously observed at Hyak (https://ticket.opensciencegrid.org/26794).
If you're interested in other sites exhibiting the 'Not enough arguments' error, I'm compiling a list here --- https://jira.opensciencegrid.org/browse/GFACTOPS-765 --- as I come across them.
Marty Kandes
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
by Brian Lin
Doug: Could you attach the output of osg-system-profiler?
Marty: It's a little hard to follow this ticket; has the 'Not enough arguments' error gone away? Is the problem now that glidein startds are not reporting their status properly back to the SDSC factory?
Thanks,
Brian
Hi all,
As Troy mentioned earlier, the failure of OSG glideins to authenticate properly at OSC is likely due to their DN not being in the grid-mapfile. Has this DN been added?
As for the Nova-specific Fermilab glideins, it looks like we may have encountered a GRAM-related problem also seen at another site previously. Here, the symptoms are that user jobs and glideins appear to run fine when viewed from the site's perspective, but the glideins fail to update the factories with their progress correctly. If I recall correctly, this leads to the factory not requesting the appropriate amount of glideins at a site. I'll alert the OSG software folks about this as we've been attempting to reproduce the problem elsewhere.
Marty Kandes
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
Hi,
We are seeing a mix of results: most jobs appear to be successful, but we saw some failed jobs on the 6th. A new attachment named stdout-20151106.txt contains the error.
Doug
by /DC=org/DC=cilogon/C=US/O=Ohio Technology Consortium (OH-TECH)/CN=Douglas Johnson A20836
The Nova glideins came from the SDSC factory (and the jobs did finish fine):
005 (3320115.6349.000) 11/07 00:17:43 Job terminated.
(1) Normal termination (return value 0)
Usr 0 14:33:35, Sys 0 00:00:26 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 14:33:35, Sys 0 00:00:26 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
5311732 - Run Bytes Sent By Job
24369 - Run Bytes Received By Job
5311732 - Total Bytes Sent By Job
24369 - Total Bytes Received By Job
Partitionable Resources : Usage Request Allocated
Cpus : 1 1
Disk (KB) : 4460720 4000000 819007892
Memory (MB) : 928 2000 2500
...
028 (3320115.6349.000) 11/07 00:17:43 Job ad information event triggered.
SentBytes = 5311732.0
JOB_GLIDEIN_Name = "gfactory_instance"
TerminatedNormally = true
ReturnValue = 0
JOB_GLIDEIN_ClusterId = "3234446"
JOB_GLIDEIN_SiteWMS = "PBS"
EventTypeNumber = 28
Subproc = 0
ReceivedBytes = 24369.0
JOB_GLIDEIN_SiteWMS_Slot = "Unknown"
MyType = "JobTerminatedEvent"
TriggerEventTypeName = "ULOG_JOB_TERMINATED"
TotalRemoteUsage = "Usr 0 14:33:35, Sys 0 00:00:26"
JOB_Site = "OSC"
JOB_GLIDEIN_Site = "OSC"
Proc = 6349
EventTime = "2015-11-07T00:17:43"
Was the problem fixed?
The messages are a bit confusing. On one side, Kevin's jobs seem to run successfully; on the other, the OSG factory glideins seem to fail authorization. Is Kevin using a different factory?
Thanks, Marco
The jobs are still running; I just did a condor_ssh_to_job on one and it's chugging along.
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Kevin Retzke 3130
by troy@....
I'm not sure if this is related to the auth problem Marty mentioned, but
we've seen a lot (>1700) of errors like the following today:
PID: 19695 -- Failure: globus_gss_assist_gridmap() failed
authorization. globus_gss_assist: Gridmap lookup failure: Could not map
/DC=com/DC=DigiCert-Grid/O=Open Science
Grid/OU=Services/CN=pilot/osg-flock.grid.iu.edu
Indeed, that DN does not appear in the grid-mapfile on our OSG host:
root@osg:~# grep "/DC=com/DC=DigiCert-Grid/O=Open Science
Grid/OU=Services/CN=pilot/osg-flock.grid.iu.edu"
/etc/grid-security/grid-mapfile
[...nothing...]
I'm not sure if that's indicative of a problem or not.
--Troy
On 11/06/2015 04:59 PM, Open Science Grid FootPrints wrote:
> [Duplicate message snipped]
--
Troy Baer
Senior HPC Systems Engineer
Ohio Supercomputer Center
http://www.osc.edu/
Kevin,
Did your Nova jobs complete successfully?
Marty Kandes
UCSD Glidein Factory Operations
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
Marco et al.,
OSG glideins are being held at the factory due to an authentication problem [1].
Marty Kandes
UCSD Glidein Factory Operations
[1]
2529089.6 feosgflock 11/6 20:39 Globus error 7: authentication with the remote server failed
2529089.8 feosgflock 11/6 20:39 Globus error 7: authentication with the remote server failed
2529129.1 feosgflock 11/6 21:18 Globus error 7: authentication with the remote server failed
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
006 (3320115.6349.000) 11/06 09:21:41 Image size of job updated: 2216784
006 (3320115.6349.000) 11/06 09:31:42 Image size of job updated: 2237152
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Kevin Retzke 3130
All,
We have jobs running, and they are actually executing the 'nova' executable. This is in contrast to what we were seeing before the upgrade with the glidein_startup.sh error. Please let us know whether these jobs look OK from an external perspective.
Doug
by /DC=org/DC=cilogon/C=US/O=Ohio Technology Consortium (OH-TECH)/CN=Douglas Johnson A20836
Hi,
The upgrade is complete, please send glideins.
Doug
by /DC=org/DC=cilogon/C=US/O=Ohio Technology Consortium (OH-TECH)/CN=Douglas Johnson A20836
by djohnson@....
One data point that might be useful, which we stumbled on while trying
to install the CA changes that were just announced: we're still at
osg-version-3.1.46-1, which is apparently a dead end. Since glideins
are broken, we will upgrade to 3.3; perhaps we should table testing
until that's finished. I'll update the ticket when the upgrade is
done.
Doug
On Thu, 05 Nov 2015 11:46:00 -0500,
Open Science Grid FootPrints wrote:
>
> [1 <text/plain; utf-8 (quoted-printable)>]
> [2 <text/html; utf-8 (quoted-printable)>]
> [Duplicate message snipped]
Hi Marco,
I've enabled OSGVO at OSC.
Marty Kandes
UCSD Glidein Factory Operations
P.S. Yes, the 'Not enough arguments' error has been a bit odd. When I saw the problem before, it seemed to occur at sites with an already existing underlying problem. There was one very clear example of this: a known site with a problem before we upgraded glideinWMS, then after the upgrade the original problem immediately became 'Not enough arguments'. I'll go back and check my notes on the specifics here when I get a chance.
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
by Marco Mambelli
Hi Marty,
could you re-enable OSG and send there some glideins?
It seems strange that "Not enough arguments" is appearing only here and there; since it's in the glidein_startup.sh script, it should affect all glideins.
I also have a patch attached to the redmine ticket (https://cdcvs.fnal.gov/redmine/issues/10762), but I'd first wait to see how the jobs at OSC are running.
Thank you,
Marco
Hi,
No jobs have been run since 2015-10-31. If we can get some jobs submitted during the day, we can try to collect these files during execution. These jobs run for about 20 minutes before failing, so there is a reasonable window to collect the files.
Doug
by /DC=org/DC=cilogon/C=US/O=Ohio Technology Consortium (OH-TECH)/CN=Douglas Johnson A20836
by Marco Mambelli
Thank you Doug,
the file I need is the one invoked by this script:
/nfs/17/nova/.globus/.gass_cache/local/md5/da/ce1b3b754bc2e12f318fe5f5a335a3/md5/96/b8b66f54614b2f5e73994fb323bb96/data
Probably that specific one has been deleted. If you have a job running, you can see the last line in the scheduler_pbs_job_script for the updated location of the startup script.
It would be great if you can attach it and add in the message the command line used (e.g. /nfs/17/nova/.globus/.gass_cache/local/md5/da/ce1b3b754bc2e12f318fe5f5a335a3/md5/96/b8b66f54614b2f5e73994fb323bb96/data "-v" "std" "-name" "gfactory_instance" ... "-cluster" "2517720" "-subcluster" "7" </dev/null) so that I can connect it to the stdout/err in the factory (or if you can attach its stdout/err).
Thank you,
Marco
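Marco's tip about the last line of the generated job script can be checked with something like this (the script path is an example; use wherever the scheduler_pbs_job_script for a running job actually lives):

```shell
# Example only (path is an assumption): print the last line of a
# generated PBS job script, which per Marco names the updated
# location of the glidein startup script.
tail -n 1 scheduler_pbs_job_script
```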
Hi,
I don't see the glidein_startup.sh in the files I have available. I do have the batch script that was executed, named scheduler_pbs_job_script; see the attachment (with a .txt suffix to get around the upload restriction).
Doug
by /DC=org/DC=cilogon/C=US/O=Ohio Technology Consortium (OH-TECH)/CN=Douglas Johnson A20836
by Marco Mambelli
I'm adding here and to the gwms ticket a patch to improve the error message.
Thanks, Marco.
Yeah, I've been seeing the 'Not enough arguments' error [1] pop up here and there again in the last month or so. But since it hasn't been as widespread as it was in September after that glideinWMS upgrade, I haven't spent a lot of time investigating. This is the 3rd entry where I've seen the problem. In some of these other cases, it doesn't appear to affect all factories. Here, in this case, there is also the associated hold reason [2] given for the held glideins at the factory. I'm not sure what the source of the problem is, but I suspect globus/GRAM.
Marty Kandes
UCSD Glidein Factory Operations
[1]
Sat Oct 31 11:35:26 EDT 2015 Not enough arguments in fetch_file main error_gen.sh error_gen.faulAe.sh regular 0 TRUE FALSE
[2]
3219131.9 fefermifife 11/1 03:06 Globus error 31: the job manager failed to cancel the job as requested
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049
Could you attach the glidein_startup.sh script that was run? This might be an issue we need to bring up with the GlideinWMS developers.
I'm CCing the factory support, they may have some insight as well.
Thanks,
Kevin
by /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Kevin Retzke 3130
by OSG-GOC
Hi,
We're experiencing glidein failures at OSC. The output from a failed job is attached as the file named 'stdout'. There are no apparent errors in the stdout of the job; that file is attached as well. Please let us know whether this looks like a problem local to OSC or a remote one. If it's local to OSC, some guidance on where to start looking would be appreciated.
Doug
by /DC=org/DC=cilogon/C=US/O=Ohio Technology Consortium (OH-TECH)/CN=Douglas Johnson A20836