Wednesday, October 08, 2014

XrootD and ARGUS authentication

A couple of months ago, I set up a test machine running XrootD version 4 at QMUL. This was to test three things:
  1. IPv6 (see blog post),
  2. central authorisation via ARGUS (the subject of this blog post), and
  3. XrootD 4 itself.
We run StoRM/Lustre on our grid storage, and have run an XrootD server for some time as part of the ATLAS federated storage system, FAX. This allows local (and non-local) ATLAS users interactive access, via the xrootd protocol, to files on our grid storage.

For the new machine, I started by following ATLAS's FAX for Posix storage sites instructions. These instructions document how to use VOMS authentication, but not central banning via ARGUS. CMS do, however, have some instructions on using xrootd-lcmaps to do the authorisation - though with RPMs from different (and therefore potentially incompatible) repositories. It is nonetheless possible to get them to work.

The following packages are needed (or at least these are the ones I have installed):

  yum install xrootd4-server-atlas-n2n-plugin
  yum install argus-pep-api-c
  yum install lcmaps-plugins-c-pep
  yum install lcmaps-plugins-verify-proxy
  yum install lcmaps-plugins-tracking-groupid
  yum install xerces-c
  yum install lcmaps-plugins-basic

Now that the packages are installed, xrootd needs to be configured to use them. The appropriate lines in /etc/xrootd/xrootd-clustered.cfg are:


 xrootd.seclib /usr/lib64/libXrdSec.so
 xrootd.fslib /usr/lib64/libXrdOfs.so
 sec.protocol /usr/lib64 gsi -certdir:/etc/grid-security/certificates -cert:/etc/grid-security/xrd/xrdcert.pem -key:/etc/grid-security/xrd/xrdkey.pem -crl:3 -authzfun:libXrdLcmaps.so -authzfunparms:--osg,--lcmapscfg,/etc/xrootd/lcmaps.cfg,--loglevel,5|useglobals -gmapopt:10 -gmapto:0
 #                                                                              
 acc.authdb /etc/xrootd/auth_file
 acc.authrefresh 60
 ofs.authorize 1
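
The acc.authdb file referenced above holds the path-level authorisation rules applied after mapping. As a rough sketch (the /atlas prefix and read-only privileges below are assumptions, not our actual rules), a minimal auth_file could be:

  # /etc/xrootd/auth_file - hypothetical minimal example:
  # give all authenticated users lookup and read access under /atlas
  u * /atlas lr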

And in /etc/xrootd/lcmaps.cfg it is necessary to change the module path and the ARGUS server (my ARGUS server is obscured in the example below). My config file looks like:

################################

# where to look for modules
#path = /usr/lib64/modules
path = /usr/lib64/lcmaps

good = "lcmaps_dummy_good.mod"
bad  = "lcmaps_dummy_bad.mod"
# Note: put your own ARGUS host in place of argushost.mydomain
pepc        = "lcmaps_c_pep.mod"
             "--pep-daemon-endpoint-url https://argushost.mydomain:8154/authz"
             " --resourceid http://esc.qmul.ac.uk/xrootd"
             " --actionid http://glite.org/xacml/action/execute"
             " --capath /etc/grid-security/certificates/"
             " --no-check-certificates"
             " --certificate /etc/grid-security/xrd/xrdcert.pem"
             " --key /etc/grid-security/xrd/xrdkey.pem"

xrootd_policy:
pepc -> good | bad
################################################


Then after restarting xrootd, you just need to test that it works.
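
The simplest check is a copy with a valid grid proxy; something along these lines (the hostname and file path are placeholders for your own site):

  # get an ATLAS VOMS proxy, then try to read a file through the new server
  voms-proxy-init -voms atlas
  xrdcp root://xrootd.example.ac.uk//atlas/some/test/file /tmp/testfile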

It seems to work: I was successfully able to ban myself. Unbanning didn't work instantly, and I resorted to restarting xrootd - though perhaps if I'd had more patience, it would have worked eventually.
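
For reference, the banning itself is done against the ARGUS PAP service rather than on the xrootd box; assuming the standard pap-admin client, the commands are along these lines (with a real DN in place of the placeholder below):

  # on the ARGUS server: ban a user's DN, then lift the ban again
  pap-admin ban dn "/DC=org/DC=example/CN=Some User"
  pap-admin un-ban dn "/DC=org/DC=example/CN=Some User"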

Overall, whilst it wasn't trivial to do, it's not actually that hard, and is one more step along the road to having central banning working on all our grid services.



Tuesday, June 04, 2013

Serial Consoles over ipmi

To get serial consoles over IPMI working properly with Scientific Linux 6.4 (aka RHEL 6.4 / CentOS 6.4) I had to modify several settings, both in the BIOS and in the OS.

Hardware Configuration

For the Dell C6100 I set these settings in the BIOS:

Remote Access = Enabled
Serial Port Number = COM2
Serial Port Mode = 115200 8,n,1
Flow Control = None
Redirection After BIOS POST = Always
Terminal Type = VT100
VT-UTF8 Combo Key Support = Enabled


Note: "Redirection After Boot = Disabled" is required otherwise I get a 5 minute timeout before booting the kernel. Unfortunately with this set up you get a gap in output while the server attempts to pxeboot. However, you can interact with the BIOS and once Grub starts you will see and be able to interact with the grub and Linux boot processes.

For the Dell R510/R710 I set these settings in the BIOS:

Serial Communication = On with Console Redirection via COM2
Serial Port Address = Serial Device1=COM1,Serial Device2=COM2
External Serial Connector = Serial Device1
Failsafe Baud Rate = 115200
Remote Terminal Type = VT100/VT220
Redirection After Boot = Disabled


Note: With these settings you will be unable to see the progress of the kickstart install on the non-default console.

Grub configuration

In grub.conf you should have these two lines (they were there by default in my installs).

serial --unit=1 --speed=115200
terminal --timeout=5 serial console


This allows you to access GRUB via the consoles. The "serial" (IPMI) terminal will be the default unless you press a key when asked during the boot process. This applies only to GRUB, not to the rest of the Linux boot process.

SL6 Configuration

The last console specified in the Linux kernel boot options is taken to be the default console. However, if the same console is specified twice this can cause issues (e.g. when entering a password the characters are shown on the screen!).

For the initial kickstart PXE boot I append "console=tty1 console=ttyS1,115200" to the Linux kernel arguments. Here the serial console over IPMI will be the default during the install process, while the other console should echo the output of the IPMI console.
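
For what it's worth, the corresponding PXE entry then looks roughly like the sketch below (the kickstart URL and file names are placeholders):

  # hypothetical pxelinux.cfg entry for the SL6 kickstart install
  label sl6-install
    kernel vmlinuz
    append initrd=initrd.img ks=http://install.example.ac.uk/ks/sl6.cfg console=tty1 console=ttyS1,115200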

After the install, the argument "console=ttyS1,115200" had already been added to the kernel boot arguments. I have additionally added "console=tty1" before it; this may be required to enable interaction with the server via a directly connected terminal if needed.
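
The resulting grub.conf stanza then looks roughly like this (the kernel version and root device are placeholders), with ttyS1 last and therefore the default console:

  title Scientific Linux 6
      root (hd0,0)
      kernel /vmlinuz-2.6.32-358.el6.x86_64 ro root=/dev/sda2 console=tty1 console=ttyS1,115200
      initrd /initramfs-2.6.32-358.el6.x86_64.img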

With the IPMI port set as the default (the last console specified in the kernel arguments) SL6 will automatically start a getty on ttyS1. If it were not the default console we would have to add an upstart config file in /etc/init/. Note that SL6 uses upstart; the SL5-style console configuration in /etc/inittab is ignored!

e.g. /etc/init/ttyS1.conf

start on stopped rc RUNLEVEL=[345]
stop on runlevel [S016]

respawn
exec /sbin/agetty /dev/ttyS1 115200 vt100
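
Once the getty is running you can attach to the console from another machine using ipmitool's serial-over-LAN support; something like the following (the BMC hostname and credentials are placeholders):

  # open the serial-over-LAN console; detach again with the escape sequence "~."
  ipmitool -I lanplus -H node01-ipmi.example.ac.uk -U root -P secret sol activate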



Sunday, April 21, 2013

The art of cabling


The challenge of organising your cables behind your TV is nothing compared to that of a large computing cluster.

One of our standard racks contains 12 Dell R510 servers (for storage) and 6 Dell C6100 chassis (providing 24 compute nodes). All 36 nodes are connected with a 10 Gb (SFP+), a 1 Gb (backup) and a 100 Mb (for IPMI) network cable, connecting to three different network switches at the top of the rack. In addition, the 18 "boxes" need a total of 36 power connections: a total of 144 cables per rack!


How to cope? Separate the network cables from the power cables, a possible source of noise. Use different colour cables for the different kinds of traffic and give each cable a unique ID number. Use loose, removable cable ties. When a cable breaks, don't remove it; just add a new cable.

The 10 Gb switches, in our case Dell S4810s, connect to two Dell Z9000 core switches using four 40 Gb QSFP+ cables. Having two core switches allows us to take one unit out of service without downtime (we use the VLT protocol and it works!). However, this does add cable complexity. The backup 1 Gb switches connect to each other in a daisy chain using 10 Gb CX4 cables left over from before our 10 Gb upgrade. Finally, the IPMI switches connect to a front-end switch using 1000BASE-T cables.




The picture shows the inter-switch links. Visible are the orange 40 Gb connections and blue 10 Gb CX4 cables. In addition, each 40 Gb cable has an ID indicating which rack it came from and which core switch it's going to.



We have one rack full of critical, world-facing servers. These servers need to be available at all times, making it very difficult to reorganise the cabling. As a result, over time, as we add and remove servers, the cabling becomes a mess. This is starting to become a risk! We are just going to have to accept some downtime in the near future to sort it out.



Monday, April 15, 2013

virtualization performance hit


Like the rest of the world, there is a lot of discussion going on in GridPP about the use of clouds and virtualization.

http://www.gridpp.ac.uk/gridpp30/mcnab-lhcb-vmclouds-march-2013.pdf

http://www.admin-magazine.com/HPC/articles/the_cloud_s_role_in_hpc

Using virtualization will have a performance impact, so using it for our type of computing (HPC/HTC) may not be the best solution. However, just what impact does it have? A quick search of the web suggests anywhere from 3% to 30%. Most of the overhead appears to be in the kernel and in I/O.

http://serverfault.com/questions/261974/how-much-overhead-does-x86-x64-virtualization-have

http://www.altechnative.net/2012/08/04/virtual-performance-part-1-vmware/

http://www.anandtech.com/show/3827/virtualization-ask-the-experts-1

I decided that I wanted to do some of my own tests, with the focus on the type of work we do in GridPP.

Testbed: a 24-thread Westmere system running at 2.66 GHz with 48 GB of memory, running Scientific Linux 6.3 (basically RHEL 6). I'm using the default install of KVM, with the virtual image as a local file, set up to use all 24 threads.

Benchmarks:
  1. Unpack and build the ROOT analysis package using 24 threads.
  2. As 1, but using only one thread.
  3. Generate 500,000 Monte Carlo events using the Herwig++ generator.
  4. As 3, but also including the time taken to unpack and install Herwig++.
  5. Run the HEP-SPEC06 benchmark.
For tests 1 to 4 I use the time command to obtain the real (wall-clock) time taken (smaller is better); for 5 I report the HEP-SPEC06 score (larger is better). I run the benchmarks on the bare-metal install and on the VM on the same hardware and compare the results.
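
To give a flavour of how the timings were taken, benchmark 1 boils down to something like the sketch below (the tarball name is a placeholder; the "real" line of the time output is what I record):

  # benchmark 1 (sketch): unpack and build ROOT with 24 threads, timing the whole sequence
  /usr/bin/time -p sh -c 'tar xzf root_v5.34.source.tar.gz && cd root && ./configure && make -j24'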

Results:




Out-of-the-box performance of KVM results in a ~3% (CPU-intensive) to 20% (syscall-intensive) reduction in performance. There is some indication of a correlation with the ratio of sys time to user time (a particular effect with make/tar/gzip?), though this is not seen in the HEP-SPEC06 result. Sys time is the CPU time spent within the kernel, and from previous studies we expect this to incur a high performance hit under virtualization.

If I get the time, I intend to repeat the analysis using optimisations (e.g. guest image on LVM), repeat it on Fedora 18 (roughly RHEL 7), repeat it on a Sandy Bridge CPU, and look at network performance (e.g. iozone with Lustre).

Wednesday, April 10, 2013

The Queen Mary Grid Cluster


The QMUL grid/HTC cluster is a high-throughput (HTC) research computing cluster based at Queen Mary, University of London. We primarily serve the scientific grid community and are funded by the GridPP collaboration (i.e. the UK STFC research council). By high throughput we mean the ability to run lots of individual, separate jobs. Our main workload is data analysis for the ATLAS experiment at CERN. We are the top site in the UK for this type of work, and one of the leading sites in the world for the ATLAS LHC experiment. We are part of LondonGrid (hence the post to this blog!).

Our cluster comprises:

For running the actual jobs:
  • 30 Dell C6100s using X5650 processors, contributing a total of 2880 job slots, and
  • 60 older Streamline nodes using E5420 processors, contributing a total of 480 job slots.

For storage we run the Lustre parallel file system using:
  • 72 Dell R510s with 1800 TB of disk, and
  • 12 older Dell 1950s with MD1000 disk arrays, with 360 TB of disk.
Our actual provision is about 1600 TB due to the use of RAID 6 and "real" disk sizes.

We have a lot of development work to do over the next year, which I hope to describe in this blog over the coming months, including:

  • A new monitoring system, probably based on OpenNMS.
  • A new deployment system to replace our hand-made Perl/Mason/Kickstart system, probably using Razor and Puppet.
  • A cloud stack: we've been doing scientific computing using the grid software, but this model of computing is likely to be replaced with a cloud-type model, so we will need to look at the various options (OpenStack, CloudStack or OpenNebula).


The 11 racks of the QMUL cluster


Friday, March 11, 2011

RHUL cluster expands


Yesterday, RHUL took delivery of new storage and compute nodes to beef up its Tier2 cluster.
The GridPP- and CIF-funded kit was supplied by Dell and is being installed and configured by Alces.
The extra 6.3 kHS06 and 420 TB will more than double the capacity of the cluster.
Once the installation is complete and accepted, work to integrate it with the existing cluster and bring up the gLite services will begin.

Friday, February 19, 2010

RHUL 'Newton' cluster comes home

After two years hosted by Imperial College, our 'Newton' Grid computing cluster has finally been relocated to Royal Holloway's new state-of-the-art computer centre. The move was carried out by Clustervision and everything went smoothly. Before the cluster goes back into production, analysing LHC data, a software upgrade to SL5 is planned.

A small part of Newton remains at IC: the racks were donated to become part of the particle physics cluster.