Contact
Kenneth Herner
Details
UCSD glidein
MIS
GOC Ticket/submit
Kenneth Herner
osg-glidein-factory
Problem/Request
Normal
Closed
Ken Please Review
2017-05-22
Assignees
OSG Glidein Factory Support / OSG Support Centers

Past Updates
 
Hi All,

I'll go ahead and close this.

Jeff Dost
OSG Glidein Factory Operations

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost
by [email protected] 
Hi,

I think we are good.
As long as GridPP is acknowledged, that's all we need to show the
funding agencies.
I've received the monitoring information.
So from my point of view we can close this ticket.
I can then report back to the other UK DUNE sites, to see if they want
to use any information from this.

Regards,
Daniela

On 19 May 2017 at 16:11, Open Science Grid FootPrints
<[email protected]> wrote:
> [Duplicate message snipped]
 
Hi Vince,

Yes, I followed up separately with Daniela on the monitoring question. I'll also forward the acknowledgement information to the relevant DUNE parties. Indeed the security discussion will continue past the lifetime of this particular ticket. I'm afraid I will have to punt on the funding question though (I don't hold any purse strings.)

Regards,

Ken

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner 1385
by Vince Neal 
Morning Ken,

Daniela has a few outstanding requests regarding this ticket.  Could you please review and advise?

Thank you,
Vince
by [email protected] 
Hi All,

I don't think I'm the right person to close this ticket (I doubt I
can), but in summary:
- production now works, unless someone tells me otherwise
- we don't use glexec and the security issues have been forwarded (at
least in the UK) to the relevant people and are being discussed
elsewhere
- is there a webpage where *I* can monitor the state of my site wrt
DUNE (note that the site is not part of OSG) ?
- if DUNE is using GridPP resources, DUNE needs to acknowledge GridPP
in their relevant publications:
https://www.gridpp.ac.uk/about/acknowledging-gridpp/ (We need
funding  too....)

Regards,
Daniela

On 11 May 2017 at 18:52, Open Science Grid FootPrints
<[email protected]> wrote:
> [Duplicate message snipped]
 
Hello,

The entries DUNE_T2_UK_London_IC_ceprod07/08 have been moved to production. Kindly let us know if anything else is needed; otherwise, please close the ticket.

Thanks
VG
OSG Factory Ops (new)

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=VAIBHAV GUPTA 4273
 
Hi,

Our frontend is also ready to move to production, so go ahead whenever you can.

Daniela, as far as contact goes, I'd say some combination of myself, Steve Timm, and Tom Junk can work on problems. You can also submit a GOC or Fermilab Service Desk ticket.

Cheers,

Ken

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner 1385
by [email protected] 
Hi Jeff,

We are quite happy to move to production. Having said this, we are
also transitioning part of the cluster to EL7, please only put
ceprod07 & 08 into production as 05 and 06 will be moved over to EL7
once we are convinced it's actually working.

On a related note: How should I contact DUNE if I have any problems
with the glideins and/or I need to convey some update to the cluster ?
Usually I would just submit a GGUS ticket.

Regards,

Daniela

On 9 May 2017 at 22:15, Open Science Grid FootPrints
<[email protected]> wrote:
> [Duplicate message snipped]
 
Hi,

It looks like Marian already made the changes. I'd just like to add for completeness that, because DUNE has different requirements than CMS, we created new single-core entries for DUNE only.

These are still on the ITB factory only.  Ken, Daniela, if you are satisfied, let us know and we can push these new entries to production.

Thanks,
Jeff Dost
OSG Glidein Factory Operations

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost
by [email protected] 
Yes, that looks correct.

Daniela

On 5 May 2017 at 23:03, Open Science Grid FootPrints
<[email protected]> wrote:
> [Duplicate message snipped]
 
Hi,

should we do this on all entries? Just making sure we're talking about CEs: ceprod0{5,6,7,8}.grid.hep.ph.ic.ac.uk, where GLIDEIN_MaxMemMBs=2530 and GLIDEIN_CPUS=1, correct?

Thanks,
Marian

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=zvada/CN=684832/CN=Marian Zvada
by [email protected] 
Hi Ken,

yes, just turn glexec off.

Regards,
Daniela

On 3 May 2017 at 21:14, Open Science Grid FootPrints <
[email protected]> wrote:

> [Duplicate message snipped]
 
All right, I'm confident that there is a bug on our end during the proxy renewal on the server. Somehow during the renewal the DN changes from the dunegpvm01 DN to my CILogon Basic DN. I'll work with our developers to figure out what's going on.

In the meantime, would you be OK with moving into production with glexec turned off? I want to be sure that's OK with you before we ask factory ops to change anything.

Thanks,

Ken

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner 1385
 
Hi Daniela,

Don't worry about it. I was able to do some more digging on this side, and I think I understand what is happening now. All signs currently point to the subtle bug on our side that I mentioned earlier.

I've just sent yet another set of 16 probe jobs. If these succeed, I am confident that I understand the issue. Basically it has to do with how the proxy renewal on our servers is working when the jobs sit in the queue for a while.

Cheers,

Ken

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner 1385
by [email protected] 
Hi Ken,

this proves surprisingly difficult. The glidein name (or parts of it) does
not show in the logs, so when the job is gone I can't map it to any
specific job that ran. Maybe your logs store the cream id (something like
CREAM+a bunch of numbers) or the lrms id (currently hovering around 3494000
in our system, possibly with an 'sge' attached to it ?). In the future that
would be helpful.

Having said this, I had a quick look on wj13 and in the glexec logs on the
node I can see that the frontend cert (frontend_pp) seems to attempt and
succeed in a glexec switch and is mapped to a glexec account, but a second
attempt to change user within the same job fails (this is now from
/var/log/messages, our glexec log doesn't seem to be too keen to store
failures)

May  3 16:04:43 wj13 python: glexec[162468]: lt2-duneplt409 ['/vols/grid/wn/emi3/v3.15.3-1/emi-wn-3.15.3-1_sl6v1/usr/sbin/glexec', '/bin/bash', '-c', 'install -m 755 -d "/srv/localstage/scratch/3492355.1.grid.q/CREAM189703536/glide_M293Lr/execute/dir_162293" && tar -C "/srv/localstage/scratch/3492355.1.grid.q/CREAM189703536/glide_M293Lr/execute/dir_162293" -x']
May  3 16:04:43 wj13 python: glexec[162468]: lt2-duneplt409 Target proxy: /srv/localstage/scratch/3492355.1.grid.q/CREAM189703536/glide_M293Lr/execute/dir_162293.condor/x509cc_dunepro_Production

I don't see any reference to the glexec accounts  dunegpvm01 gets mapped to
as noted by argus, so all I can conclude is that it didn't use that cert,
but because I don't have the proxy anymore, I can't tell which user it was
aiming for (i.e. if it's the correct cert and some problem with our glexec
I can't tell).
I need to go home now, maybe we can do some coordinated tests so I can look
at the running jobs and/or maybe Simon has a better idea.

Cheers,
Daniela
 
If you have just a few minutes, can you see what DNs these glideins were seeing (this was within the last hour):

[email protected]
[email protected]
[email protected]

They *should* have all used the dunegpvm01 DN. If it was something else, I'd really like to know. I'm suddenly worried about the possibility of a very subtle bug on our end...

Thanks,

Ken

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner 1385
by [email protected] 
I can see the dunegpvm certificate being mapped correctly:
/var/log/argus/pepd/process.log:2017-05-03 02:01:01.151Z - INFO [DFPMObligationHandler] - ACCOUNTMAPPER_OH: DN: CN=production/dunegpvm01.fnal.gov,OU=Services,O=Open Science Grid,DC=opensciencegrid,DC=org pFQAN: /dune/Role=Production FQANs: [/dune/Role=Production, /dune] mapped to POSIX account: PosixAccount{user=glx-dune131 group=lt2-dune groups=lt2-dune}

If you look at the time stamps though it looks like the wrong CA certs
(OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org
with your DN) actually show up in the logs later than the correct ones.

Cron job gone rogue ?

On 3 May 2017 at 15:56, Open Science Grid FootPrints <
[email protected]> wrote:

> [Duplicate message snipped]
 
Well this is really odd... I do indeed see some of the jobs going held with the usual glexec problem, but I also see jobs from the *same cluster* successfully completing. All jobs in that cluster should have been using the same DN, so this is not making much sense to me.

Did you see any successful mappings using "/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=production/dunegpvm01.fnal.gov"?

I sent 16 new jobs just now that I'm 100% sure are using the dunegpvm01 cert. If you see failures there, then I'm *really* confused.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner 1385
by Steven Timm 
I will be away from the lab on business travel May 3-4.
I will have limited access to E-mail, and limited time to respond.
If there are any issues with FermiCloud or any Grid and Cloud Operations services please open a Service Desk ticket.

Steven Timm
by [email protected] 
I still see:
pepd/process.log:2017-05-03 05:12:36.340Z - ERROR [TrustStoreValidationErrorLogger] - Validation error: error at position 0 in chain, problematic certificate subject: CN=1717757645,CN=2160414386,CN=1488398773,CN=1683709237,CN=UID:kherner,CN=Kenneth Herner,OU=People,O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org (category: X509_CHAIN): Trusted issuer of this certificate was not established

so it's still using the 'wrong' certificate.

Just turn glexec off.....

On 2 May 2017 at 20:30, Open Science Grid FootPrints <
[email protected]> wrote:

> [Duplicate message snipped]
 
We have fixed the frontend issue, and we saw some jobs start this morning, but they all went held with a glexec setup error. I was a little surprised to see that since the DN involved in those jobs was supposed to be from an OSG CA-signed cert, not CILogon Basic. I've sent a fresh set recently to see if the problem happens again.

Cheers,

Ken

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner 1385
 
D'oh! Who knew it was case-sensitive... I'll pass that along to the guy in charge of Puppet on that node.

Ken

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner 1385
by [email protected] 
Hi Ken,

The error we see in the glidein client logs within the jobs is:

Mon May 1 22:43:02 BST 2017 Error running
'/srv/localstage/scratch/3469111.1.grid.q/CREAM769959666/glide_JCbcFf/main/glexec_setup.sh'
USE_GLEXEC in VO Frontend configured to be optional.
Accepted values are 'NEVER' or 'OPTIONAL' or 'REQUIRED'.
Mon May 1 22:43:02 BST 2017 Sleeping 256
Mon May 1 22:47:18 BST 2017 Sleeping 268
Mon May 1 22:51:46 BST 2017 Sleeping 274
Mon May 1 22:56:20 BST 2017 Sleeping 327
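
To make the failure above concrete, here is a minimal Python sketch of a case-sensitive check on the USE_GLEXEC value. It is only an illustration: the accepted values are taken from the error message, while the function name `check_use_glexec` is hypothetical and this is not the code of glexec_setup.sh.

```python
# Accepted values, copied from the error message in the glidein log.
ACCEPTED_VALUES = ("NEVER", "OPTIONAL", "REQUIRED")


def check_use_glexec(value: str) -> bool:
    """Return True only for an exact, case-sensitive match (hypothetical helper)."""
    return value in ACCEPTED_VALUES


if __name__ == "__main__":
    for candidate in ("optional", "OPTIONAL", "never", "NEVER"):
        status = "accepted" if check_use_glexec(candidate) else "rejected"
        print(f"USE_GLEXEC={candidate!r}: {status}")
```

With a check of this kind, the lowercase value "optional" seen in the log is rejected even though "OPTIONAL" is valid.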

Regards,
Simon

> [Duplicate message snipped]
 
Yes, it looks like the glideins are starting, but they aren't reporting in to our collectors for some reason, so the jobs never see the slots to match. I'll investigate on our side.

Ken

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner 1385
by [email protected] 
Hi,

I currently see 21 dune jobs running and 21 queueing.

Daniela

On 1 May 2017 at 21:59, Open Science Grid FootPrints <
[email protected]> wrote:

> [Duplicate message snipped]
 
They didn't, actually. It looks like they are all still queued.

Ken

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner 1385
by Vince Neal 
Morning all,

How did the test jobs go?

Vince
 
32 jobs are in the queue now.

Cheers,

Ken

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner 1385
by [email protected] 
Yes, just send a bunch. Or we can try putting it on the production server.

Regards,
Daniela

On 26 April 2017 at 15:15, Open Science Grid FootPrints <
[email protected]> wrote:

> [Duplicate message snipped]
 
Hi Daniela,

Yes, the glideins without glexec enabled seem to be working well. I also sent some test jobs with glexec enabled using a service cert signed by the OSG CA, and I believe those were also fine. I can send some more if you'd like to confirm that.

Cheers,

Ken

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner 1385
by [email protected] 
Hi,

To get back to the initial topic:
I see a couple of DUNE glideins running at Imperial (without glexec). Are
these tests successful ?

Regards,
Daniela

On 24 April 2017 at 11:44, Open Science Grid FootPrints <
[email protected]> wrote:

> [Duplicate message snipped]
 
Hi Daniela,
there are quite a few different aspects to this ticket, and I'm involved in the background in more than one way, which makes it a bit difficult to reply briefly and to the point. If I say something you don't understand, or you would like to know more about, please just ask.
Some remarks:
- CILogon basic is one of a number of so-called IOTA CAs. These CAs have a lower "level of assurance" than the thus far accepted CAs which fall in the respective categories classic, slcs and mics. IOTA is a 4th category. The difference is that for the other 3, the CA is doing extensive identity vetting (typically via photo ID), or at least it can vouch for someone having done that. For the IOTA CAs that's not (per se) the case, the CA provides a persistent and unique identifier, but does not per se know the user. That means that that vetting has to be done elsewhere, typically by the VOs. For the 4 CERN experiment VOs that's indeed the case: only vetted members of the experiments can enroll in the VO, which means that those VOs are ok to be used with IOTA CAs. However, in Europe not all (many?) VOs do such vetting, and hence IOTA CAs should only be accepted for certain VOs and not for others. Which VOs are OK and which are not, is something that is typically decided by the eInfrastructures, such as OSG or EGI.
In short, enabling an IGTF IOTA CA without any further restrictions, is something that should not normally be done and should preferably also be discussed first with the e-infrastructure, in your case I presume EGI, unless this concerns dedicated OSG hardware?

- In order to make these combined CA / VO decisions, new software needs to be installed:
* For LCMAPS I have produced a new plugin, which has been released by the EGI UMD, both 3 and 4, a few months ago already, in the form of lcmaps-plugins-vo-ca-ap. More information can be found in https://wiki.nikhef.nl/grid/Lcmaps-plugins-vo-ca-ap
* For Argus, I have also produced code to make this work, but there (due to different reasons) the team has unfortunately taken much more time to integrate this. It will hopefully be released in the UMD update, in May I believe. My earlier version of the software is available, which, combined with some adaptations in the policy, can also make a workable and secure setup for Argus.

For a gLExec-on-workernode scenario, even one that does a callout to Argus, you could still run the new lcmaps plugin before doing the Argus callout, which would still protect your resources.
So there are a few options...
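
The combined CA / VO decision described in the remarks above can be sketched roughly as follows. This is a conceptual Python illustration only: it is not the configuration format or logic of lcmaps-plugins-vo-ca-ap or Argus, the function `accept_ca_vo` is hypothetical, and the VO names in the usage lines are placeholders.

```python
# CA categories in which the CA itself performs (or vouches for) identity vetting,
# as listed in the remarks above.
VETTING_CA_CATEGORIES = frozenset({"classic", "slcs", "mics"})


def accept_ca_vo(ca_category: str, vo: str, vetting_vos: frozenset) -> bool:
    """Decide whether a (CA category, VO) pair is acceptable (hypothetical helper).

    vetting_vos is the list of VOs known to vet their own members; who maintains
    that list (for example the e-infrastructure) is outside this sketch.
    """
    if ca_category in VETTING_CA_CATEGORIES:
        # Identity vetting was already done on the CA side.
        return True
    if ca_category == "iota":
        # IOTA CAs only provide a unique identifier, so the VO must do the vetting.
        return vo in vetting_vos
    # Unknown categories are rejected by default.
    return False


if __name__ == "__main__":
    vetting_vos = frozenset({"vetted-vo"})  # placeholder VO name
    print(accept_ca_vo("classic", "open-vo", vetting_vos))
    print(accept_ca_vo("iota", "vetted-vo", vetting_vos))
    print(accept_ca_vo("iota", "open-vo", vetting_vos))
```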

Best wishes,
Mischa Salle

by /DC=org/DC=terena/DC=tcs/C=NL/O=Nikhef/CN=Mischa Salle [email protected]
by Dave Dykstra 
I am also adding [email protected] and the OSG security officer on the Cc of this ticket.

Daniela,

Thank you for pushing back on the current situation, I agree that it is not good.

First, let me say how important it is to us at Fermilab to be using a CA based on federated identity: it enables us to transparently and automatically create X.509 certificates for users without having to run our own CA as we used to.  The only other significantly sized grid VO that is doing this so far that I am aware of is LIGO, but I expect more projects and/or institutions to be doing something similar in the future.  So it isn't an option to simply switch to a traditional CA.

In the OSG, all VOs are required to verify that their users are actual members of their VO, so the cilogon-basic CA is accepted by default in the OSG CA cert package.  EGI cannot do that, because they have VOs that allow people to anonymously join.  Since people can also anonymously get cilogon-basic certificates from some supported IdPs, the combination could allow people to anonymously use grid resources, which is clearly not a good idea.  The long term solution planned for EGI for this problem of accepting federated identity CAs is the software that I mentioned from NIKHEF.   Perhaps since you are motivated for there to be a good solution, you could help by being an early adopter.

In addition, glexec is getting replaced by singularity which does not need to verify users' certificates at all. There still may be other use cases for those user certificates however, for example to verify access to storage in Europe or even to start jobs using certificates from a federated CA based in Europe, so NIKHEF has put in the effort to make a solution.

Finally, there is one more solution that could be implemented by OSG that I have been proposing for about a year but which so far has not gotten any traction.   CILogon has a federated identity CA that is already accepted in Europe, the cilogon-silver CA.  My proposal has been to have an agreement between OSG & CILogon to have OSG security confirm which Identity Providers (IdPs) verify all their users, and have CILogon switch those IdPs to cilogon-silver instead of cilogon-basic.  This is something that would be quite doable technically and mainly requires updating the formal document describing cilogon-silver to describe what's happening.

Dave
by [email protected] 
Hi Dave,

as far as I can tell, I can only answer this ticket via email, hence I
cc'ed them. Their addresses are:
[email protected], [email protected]

Regards,
Daniela

On 21 April 2017 at 14:58, Open Science Grid FootPrints <
[email protected]> wrote:

> [Duplicate message snipped]
by [email protected] 
I am away from the office, returning Monday 24th April, 2017.

For anything urgent related to GridPP/EGI security please
make sure your email is copied to [email protected]

Thank You.
Ian Neilson
 
Let me send some test jobs with the DUNE Production account; it uses an OSG CA service cert. It sounds like Dave will write a reply soon that has some more details of why things currently work the way they do.

Cheers,
Ken

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner 1385
by Dave Dykstra 
Before I respond more to your comment, I am attempting to add the people on Cc you said you added but I don't see there: Ian (I am guessing Ian Neilson) and Jeremy Coles.  Let's see if this works.
by [email protected] 
Hi,

I have to admit I am not keen on any of these shenanigans. We have a
perfectly valid way of authorizing users that is accepted all over the grid
and no, I prefer not to download CA certificates off the internet(TM). (I
assume if I did, I would get regular CRL updates for this ? What about CA
updates ? Anything ? )
Why does DUNE not give users certificates from a recognized authority ? I
realize OSG suffers from a not-invented-here syndrome almost as big as
CERN, but from our point of view it's a small VO and unlike the LHC VOs who
can throw their weight (and attached funding) around, you'll run into
problems over and over again. Yes, you can turn off glexec, and that will
technically solve your problem for the moment, but that doesn't mean that
in the future we won't find some other way of banning users from
non-recognized CAs.

I might sound grouchy here (OK, I am grouchy, and old, but I have also been
around a long time), but I'm actually trying to make this work in the long
term, which I think would be better for all concerned.

I've cc'ed the GridPP security person (Ian) on this ticket, maybe he can
shed some light on my concern. I've also added Jeremy Coles who is the
GridPP technical something important(TM). If possible could they please be
kept on the ticket ?

I still think that for running the test Ken should use his other DN, that
way I can test if glexec actually works in principle, which is something
I'd like to know.

Regards,
Daniela

On 20 April 2017 at 20:33, Open Science Grid FootPrints <
[email protected]> wrote:

by Dave Dykstra 
NIKHEF has some new software for accepting Certificate Authorities like cilogon-basic in combination with a whitelist of accepted VOs.  I believe the software for lcmaps and argus is available now, but it's expected to take about a year to roll out.  I think one of the trickiest parts is a method to keep the list of acceptable VOs updated.  If you're interested in using the new software, contact [email protected]

Otherwise, as Ken said, you can install the cilogon-basic CA files.  The policy file he was talking about modifying to accept only Fermilab is cilogon-basic.signing_policy.
 
Right, we have seen this issue at some other EGI sites, because the CILogon Basic CA isn't supported by default. The CILogon OSG CA is, but most DUNE users aren't going to have a cert signed by that CA, and it's extremely impractical to change that.

At a couple of other EGI sites we solved the problem by having the sites add the CILogon Basic CA files from

http://ca.cilogon.org/downloads

and install them in the usual way. The default policy is to accept all CILogon Basic DNs, but for extra security you can change it to accept only DNs from Fermilab: instead of allowing "/DC=org/DC=cilogon/*" you can allow "/DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/*", since Fermilab has additional user validation beyond the minimal requirements of the Basic CA.
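
To illustrate the difference between the two namespace patterns, here is a minimal Python sketch. It only mimics the wildcard match with the standard `fnmatch` module and is not how the grid middleware evaluates a signing policy file. The helper `dn_allowed` is hypothetical, and the second sample DN is made up for illustration.

```python
from fnmatch import fnmatchcase

# The two patterns quoted above.
DEFAULT_PATTERN = "/DC=org/DC=cilogon/*"
FERMILAB_PATTERN = "/DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/*"

# A Fermilab CILogon Basic DN in slash notation, and a made-up non-Fermilab DN.
FERMILAB_DN = (
    "/DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory"
    "/OU=People/CN=Kenneth Herner/CN=UID:kherner"
)
OTHER_DN = "/DC=org/DC=cilogon/C=US/O=Example University/CN=Some User A12345"  # hypothetical


def dn_allowed(dn: str, pattern: str) -> bool:
    """Case-sensitive wildcard match of a DN against a namespace pattern."""
    return fnmatchcase(dn, pattern)


if __name__ == "__main__":
    for dn in (FERMILAB_DN, OTHER_DN):
        print(dn)
        print("  default policy: ", dn_allowed(dn, DEFAULT_PATTERN))
        print("  Fermilab policy:", dn_allowed(dn, FERMILAB_PATTERN))
```

Under these assumptions, the default pattern admits both DNs while the restricted pattern admits only the Fermilab one.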

Another option would be to not run glexec; since only DUNE is being supported from these factory entries, I suppose it is fine. However, if you prefer that we run glexec, we will certainly do so. What do you think?

I've added Dave Dykstra to the ticket. At some point CILogon Basic is supposed to be supported within EGI software, but maybe that doesn't extend to WLCG rpms. Dave can comment on the roadmap more accurately than I can.

Ken

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner 1385
by [email protected] 
Hi Ken,

Looking in the argus log, it maps the pilot properly, but you are using a
user certificate for which there exists no approved CA within the LCG rpms
(formatting courtesy of argus)
CN=UID:kherner,CN=Kenneth Herner,OU=People,O=Fermi National Accelerator
Laboratory,C=US,DC=cilogon,DC=org

The certificate you use to e.g. update this ticket (
/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner
1385) should work (or any CERN one).
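
The two DN notations used in this message differ only in separator and ordering, which the following minimal Python sketch illustrates. It assumes the DN contains no escaped commas, and the function name `comma_to_slash_dn` is hypothetical; it is not part of Argus or any grid library.

```python
def comma_to_slash_dn(dn: str) -> str:
    """Convert a comma-separated DN (as printed by argus) to slash notation.

    Naive sketch: splits on every comma, so it assumes no attribute value
    contains an escaped comma.
    """
    parts = [part.strip() for part in dn.split(",")]
    # The comma-separated form lists the most specific attribute first,
    # while the slash form starts from the DC components.
    return "/" + "/".join(reversed(parts))


if __name__ == "__main__":
    argus_dn = (
        "CN=UID:kherner,CN=Kenneth Herner,OU=People,"
        "O=Fermi National Accelerator Laboratory,C=US,DC=cilogon,DC=org"
    )
    print(comma_to_slash_dn(argus_dn))
```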

Would you mind sending the job again with a different DN ?

Regards,
Daniela

On 20 April 2017 at 14:05, Open Science Grid FootPrints <
[email protected]> wrote:

> [Duplicate message snipped]
 
Hi everyone,

The good news: the single-core tests did start correctly.

The bad news: they all immediately went held with a glexec setup error (not surprising.) Can you remind me how you want us to handle glexec? At the moment it is set to "optional" in this particular frontend group, but if you'd prefer us to not run it at all, I will set it to never. I don't think it matters much to us since I would imagine everything would map to a single account with or without glexec.

Cheers,

Ken

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner 1385
by [email protected] 
Hi,

I've killed the multicore jobs. Could you please submit a single core test
job ?
We're happy to enable DUNE multicore jobs, once DUNE runs actual multicore
jobs, just let us know.

Regards,
Daniela

On 19 April 2017 at 21:02, Open Science Grid FootPrints <
[email protected]> wrote:

> [Duplicate message snipped]
 
New values applied to CPU and mem, ITB reconfig'd and restarted.

-Marian

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=zvada/CN=684832/CN=Marian Zvada
 
Alright, thanks for elaborating on this, I'll update factory entry accordingly and let you know when done.

-Marian
(gWMS Factory Ops)

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=zvada/CN=684832/CN=Marian Zvada
by [email protected] 
Hi Ken,

Yes, 2048 MB is fine with us.

Regards,
Simon

> [Duplicate message snipped]
 
Hi Daniela,

At some point it will be the former, but right now >99% of it is the latter (CMS-like.) The wasted CPU in DUNE's case comes from jobs requesting > 2GB of memory, and then the "other" CPUs indeed may not get utilized.

As for the memory limit, most people request 2000MB (our default) but some request 2048, getting confused about whether 1 GB = 1000 MB or 1024 MB. Would 2048 MB be all right with you?
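
The 2000 vs. 2048 confusion can be spelled out with a short Python sketch. It is illustrative only: the helper `fits_in_slot` is hypothetical and does not reflect how the batch system or the glidein actually enforces memory limits.

```python
# "2 GB" read with decimal prefixes (1 GB = 1000 MB) vs. binary ones (1 GB = 1024 MB).
DECIMAL_2GB_IN_MB = 2 * 1000
BINARY_2GB_IN_MB = 2 * 1024


def fits_in_slot(request_mb: int, slot_limit_mb: int) -> bool:
    """Return True if a job's memory request fits within the slot limit (hypothetical helper)."""
    return request_mb <= slot_limit_mb


if __name__ == "__main__":
    for limit in (DECIMAL_2GB_IN_MB, BINARY_2GB_IN_MB):
        for request in (2000, 2048):
            verdict = "fits" if fits_in_slot(request, limit) else "does not fit"
            print(f"request {request} MB, limit {limit} MB: {verdict}")
```

A limit of 2048 MB accommodates both kinds of request, whereas a limit of 2000 MB would exclude the jobs that ask for 2048.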

Thanks,

Ken

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner 1385
by [email protected] 
Hi,

It would probably be best to aim for the usual 2GB/core limit wrt memory.
As for multicore jobs: Are DUNE jobs true multicore jobs or are those
single core jobs forced into multicore slots, a la CMS ? Because the latter
causes an unacceptable amount of wasted CPU and I'd be reluctant to add
another VO to this club.

Regards,
Daniela

On 19 April 2017 at 11:51, Open Science Grid FootPrints <
[email protected]> wrote:

> [Duplicate message snipped]
 
Hi,

Ideally, yes, we'd prefer multicore pilots at some point, but we can certainly stick with single-core pilots for now. Ops, please go ahead and change the entry to single-core until we hear otherwise. Are there any other limitations such as memory that we should set?

Thanks,

Ken

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner 1385
by [email protected] 
Hi,

Not sure I can update the ticket directly.
This looks like a multi-core job to me. You never mentioned multicore jobs
anywhere. Generally we do not support multicore jobs for small VOs  -- we
are obviously technically able to run multicore jobs, as we run them for
the LHC experiments, but this is a much bigger deal than just allowing dune
jobs on the existing infrastructure.
Would you be able to send a single  core test job ?

Regards,
Daniela

On 17 April 2017 at 22:40, Open Science Grid FootPrints <
[email protected]> wrote:

> [Duplicate message snipped]
 
Thanks, Jeff. That's all I really needed for now. Let's see what happens once they get a chance to run.

Ken

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner 1385
 
Hi Ken,

I see nothing out of the ordinary from the factory side, it appears pilots have made it to the site batch queue but have been pending for about 3 hours.

It could be there are simply no resources available at the site at the moment, let's give them a day or two to see what happens (unless the admins CC'd have anything to add).

Thanks,
Jeff Dost
OSG Glidein Factory Operations

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost
 
A couple of days ago I sent a few test jobs just to see if we could get lucky. Our frontend shows that we're requesting glideins and we have some in the idle state and some in the "pending" state. Could someone in ops take a quick peek and make sure we're not hitting any kind of brick wall in the logs?

Thanks,

Ken

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner 1385
 
Hi Ken,

I just added DUNE to the requested CEs, in ITB only.

Jeff

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost
by OSG-GOC 
Hi,

We'd like to try submitting DUNE gWMS pilots to the Imperial CEs. They're already in the factories (the ceprod0[5-8].grid.hep.ph.ic.ac.uk entries) so I think we just need to add DUNE to the supported VOs list along with ATLAS and CMS. Let's start with the ITB factory only for now.

Thanks,

Ken

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Kenneth Herner 1385
