Friday, June 15, 2007

SAM Failures in London

Summary of SAM failures and solutions
  • mars-ce2: CA certificates were updated, but the permissions were wrong for the lt2-lcg group, so the certs were not readable. Fixed now.
  • hep-ce:
    • After an update of the images, missing ssl and uuid libraries caused the lcg-cp tools to fail. Matt solved this.
    • The CA was updated, but unfortunately the crl cronjob did not run, since it is being run by mona. Now fixed.
  • gw-2 (UCL-CENTRAL): Investigated intermittent failures and discovered that the SAM jobs are sometimes killed by sge, which has a vmem limit of 2GB. The problem is that Python, when creating a new thread, tries to use the max stack size of the parent process. Since sge sets this to a very high value, any new thread will try to create a big stack and the vmem limit will be reached. The solution is to change the max stack size in the sge configuration. We tried a ulimit -s 10 in the jobmanager, but since then gw-2 has been failing the ops test consistently. William has been contacted to revert this change and make the modification in the sge queue configuration instead.
    • Note: the same problem was seen on the ic-hep cluster (ce00) and was fixed there by limiting the stack size.
  • ce1.pp (RHUL): gatekeeper problem. It seems I cannot log in with the ssh keys I am using at home, so I have to check from IC.
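
To make the gw-2 stack size issue easier to check on a worker node, here is a minimal Python sketch. It is not the SAM test code; the helper names (format_limit, stack_limits, run_in_thread) and the 512 kB value are only illustrative assumptions. It prints the stack limit the job inherited and shows how a script can request a small per-thread stack itself, instead of relying on the inherited limit. It assumes a Unix system, since the resource module is not available elsewhere. Note also that ulimit -s takes its value in kB, so a value of 10 asks for a very small stack, which may be related to the ops failures seen afterwards.

```python
import resource
import threading


def format_limit(value):
    """Render an rlimit value (in bytes) in a readable form."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} kB"


def stack_limits():
    """Return the (soft, hard) stack limits inherited by this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return format_limit(soft), format_limit(hard)


def run_in_thread(func, stack_bytes=512 * 1024):
    """Run func in a new thread with an explicit stack size.

    stack_bytes is an illustrative value; threading.stack_size()
    requires 0 or at least 32 kB.
    """
    results = []
    previous = threading.stack_size()      # remember the current setting
    threading.stack_size(stack_bytes)      # applies to threads created next
    try:
        worker = threading.Thread(target=lambda: results.append(func()))
        worker.start()
        worker.join()
    finally:
        threading.stack_size(previous)     # restore the earlier setting
    return results[0]


if __name__ == "__main__":
    soft, hard = stack_limits()
    print(f"stack limit: soft={soft}, hard={hard}")
    print("worker result:", run_in_thread(lambda: sum(range(10))))
```

Running this inside a batch job shows which stack limit sge actually hands to the job. A very large or unlimited soft limit would be consistent with the behaviour described above.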
It's a black week for the availability in London...
