Tuesday, February 20, 2007

RB Wrestling the comeback


This morning, looking at the monitoring, our RB does not look happy. You can judge for yourself from the plot below. It seems clear that when the submission rate is too high, the workload manager simply cannot consume jobs fast enough to reduce the queue length. I have asked Maarten for help; we'll see what he comes up with. I think I will have a look in the rb code to find out what is going on...

Monday, February 19, 2007

Certificates and Mars

I was worried by the low number of jobs at LeSC. There were not many jobs there.
It is very difficult to get hold of the output files of failing jobs. Thanks to SGE we can find out where they are located:
  • qstat -j jobid prints the details of the given job, including the paths of its std.err and std.out files
The problem was:

globus_i_gsi_gss_utils.c:2155: globus_i_gsi_gssapi_init_ssl_context: Error with openssl: Couldn't open bio for reading on file: /homes/lt2-lcg/grid-security/certificates/47d3d1a0.0

and that is because, when the files were untarred into the certificate directory, one of the certificates
was not readable by the lt2-[users]. This is now fixed and I will chase up lhcb to check
whether they can run there without problems.
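
A check like the following could catch this kind of permission problem after untarring. It is only a minimal sketch: the function name is made up for illustration, and it assumes the pool accounts are neither the owner nor in the group of the files, so that the world-read bit is what matters.

```python
import os
import stat


def unreadable_by_others(cert_dir):
    """Return the names of regular files in cert_dir that lack the world-read bit.

    Assumption: the grid pool accounts reach these files through the
    "other" permission bits, not as owner or group member.
    """
    bad = []
    for name in sorted(os.listdir(cert_dir)):
        path = os.path.join(cert_dir, name)
        mode = os.stat(path).st_mode
        # Only regular files are checked; directories are skipped.
        if stat.S_ISREG(mode) and not mode & stat.S_IROTH:
            bad.append(name)
    return bad
```

Calling unreadable_by_others("/homes/lt2-lcg/grid-security/certificates") would list the offending files, which can then be fixed with chmod.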

Wrestling with our RB

We are still observing very long times (several minutes) for a job to go from the waiting state to the scheduled state. This means that the network server of the rb is accepting the job but the workload manager is running out of steam to process it and do the matchmaking.
  • I monitored the rb by looking at the number of entries in the input queue (/var/log/edgwl/workload_manager/input.fl), counting the entries that match the regular expression ("g$").
  • I then plotted the number of entries waiting to be accepted by the workload manager as a function of time. The result is here.

The left scale (blue dots) is the number of jobs waiting to be matched. The right scale (red dots) is the number of jobs submitted per 10-minute interval.
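
The queue check described above can be sketched roughly as follows. This is not the script that was actually used: the function names are hypothetical, and the only thing taken from the check above is that an entry counts as waiting when its line matches "g$".

```python
import re
import time

# Assumption: a waiting entry is a line ending in "g", as in the check above.
PENDING = re.compile(r"g$")


def count_pending(lines):
    """Count the lines that match the pending pattern."""
    return sum(1 for line in lines if PENDING.search(line.rstrip("\n")))


def sample(path="/var/log/edgwl/workload_manager/input.fl"):
    """Return (unix timestamp, number of pending entries) for one reading."""
    # errors="replace" keeps the count going if the file holds non-text bytes.
    with open(path, errors="replace") as fh:
        return int(time.time()), count_pending(fh)
```

Running sample() from cron every few minutes and appending the pair to a file gives the points needed for a plot of queue length against time.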

  • You can see a clear drop at the end of the x range. I think this is because I have reduced the number of threads for the network server and increased that number for the workload manager. The file to look at is /opt/edg/etc/edg_wl.conf .
    • For the NetWorkServer:
      • MasterThreads = 4;
      • DispatcherThreads = 6;
    • For the WorkLoadManager
      • NumberOfWorkerThreads = 10;
I will continue to monitor it during the night because the drop is not fully understood. Maybe it is the cms production that has slowed down and is giving some air to the rb.

Monday, February 12, 2007

Resurrection of the Blog

ICT the Grid

Today we are back in business with ICT to get their cluster on the Grid.
  • They will provide one machine and install RHEL3 i386 so that we don't have the RHEL4 problem.
  • We have to find out how to modify the information system since they are running pbspro which does not have exactly the same commands as pbs.
  • They will create the pool accounts and we have yet to make sure that we can run prolog scripts to get the lcg environment correct
QMUL
  • Atlas cannot install the tags. They tried to install the new software but it is not published correctly.
  • Maybe this is because we are publishing another subcluster for the 64 bit queues. I'll make a wiki entry explaining how this was done. The dynamic information does not seem to be correct though.
Imperial Hep
  • Mona has enabled camont and total on our rb. We now need a site to test it and we also need to enable it on the lesc and ic-hep ce.