Guiding requirements, ideas and design
Requirements:
- An HPC cluster for general use
- Focus first on High Energy Physics work for ATLAS
- Support Open Science Grid
- Also support EGEE/gLite
Guidelines
- A head node that concentrates all services and connects to the external world
- computing nodes (Worker Nodes or WNs in Grid speak) on a dedicated network
- OSG and gLite CEs cannot be on the same host, so we use Virtual Machines
- also the gLite CE and UI cannot be on the same host
- for symmetry, we also separate the OSG CE and UI
- the services that are not tied to a specific Grid (PBS, NFS, etc.) run on the bare-metal host instead.
- we also keep a couple of VMs for WNs, to use all the CPU power of the head node.
Operating System
- gLite 3.1 requires Scientific Linux 4; 32-bit is preferred on the CE and UI
- OSG can run on SL4; 64-bit is fine
- the bare metal needs a 64-bit OS to use the 16GB of RAM
so we run SL4.7 x86_64 on the bare metal (head and WNs) and SLC4.6 i386 on the CE and UI VMs.
Virtualisation
- either Xen (used by the UJ IT department) or VZ (used on the Wits head node) would do
- VZ may be trickier because of 64/32-bit issues (gLite is less tested on 64-bit)
- but we have ready-made VMware images of SLC4 from Gilda, prepared for installing the gLite CE and UI, so we start with VMware Server.
- the VMware web-based GUI does not work on OS X, or on SL4
- it's not very well documented or publicised, but one can do almost everything without the VMware GUI
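For example (a sketch: the VM paths are made up, and the exact options may differ between VMware Server versions), the vmware-cmd tool that ships with VMware Server covers the day-to-day operations from a shell:

    vmware-cmd -l                                        # list registered VMs
    vmware-cmd -s register /vm/glite-ce/glite-ce.vmx     # register a new VM
    vmware-cmd /vm/glite-ce/glite-ce.vmx start           # power it on
    vmware-cmd /vm/glite-ce/glite-ce.vmx getstate        # check whether it is running
    vmware-cmd /vm/glite-ce/glite-ce.vmx stop            # shut it down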
Security
- users should have interactive login access only to the UIs, not to the CEs, the WNs or the head node
- ssh root login only on the internal WNs
- use iptables to restrict access
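A minimal sketch of the head node firewall (the exact ports the Grid services need still have to be added; this only captures the interface roles):

    # eth1 is the trusted cluster network, eth0 faces the UJ network
    iptables -P INPUT DROP
    iptables -A INPUT -i lo -j ACCEPT
    iptables -A INPUT -i eth1 -j ACCEPT
    iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
    iptables -A INPUT -i eth0 -p tcp --dport 22 -j ACCEPT   # ssh for admins

For the root-login policy, sshd on the public-facing hosts gets PermitRootLogin no; only the WNs, reachable just from the internal network, keep it enabled.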
Configuration
- try to centralise configuration as much as possible
- but we are in a hurry, and I'm doing this setup remotely, so skip netbooting
- skip Oscar too, no time to learn that one as well
- NIS helps for users and a few other things
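A rough sketch of the NIS part on SL4, assuming a made-up NIS domain name "ujgrid":

    # on the head node (NIS master)
    echo "NISDOMAIN=ujgrid" >> /etc/sysconfig/network
    domainname ujgrid
    service ypserv start
    /usr/lib64/yp/ypinit -m        # build the initial maps (/usr/lib/yp on i386)

    # on the WNs and VMs (NIS clients)
    echo "NISDOMAIN=ujgrid" >> /etc/sysconfig/network
    domainname ujgrid
    service ypbind start
    # and add "nis" to the passwd/shadow/group lines in /etc/nsswitch.conf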
Mail
- we need to get all the mail sent from all the systems, as it contains logs, error reports, etc.
- do we need any outgoing mail?
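One simple way to do it (a sketch; the host name and mail address are placeholders): every WN and VM relays through the head node with Postfix, and root's mail on the head is redirected to the admin.

    # /etc/postfix/main.cf on the WNs and VMs
    relayhost = [gridvm]            # the head node on the internal network
    inet_interfaces = localhost     # only accept mail from local daemons

    # /etc/aliases on the head node (run newaliases after editing)
    root: grid-admin@example.org    # placeholder address

With this, only the head node ever needs to send mail to the outside world.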
Hardware and services
- gridvm head node
- 8 CPUs: 2 × Xeon E5405 2GHz (quad-core)
- RAM: 16GB
- Disk: 900GB, from 6 (or 7?) SAS HDs in RAID5
- Net:
- eth0 on the UJ network and the outside world (152.106.18.0/24)
- eth1 on cluster switch (10.0.0.0/24)
- OS: SL4.7 x86_64
- services: NFSv4, NIS, PBS/Torque, Postfix SMTP (see the sketch after the hardware list)
- VMware virtual hosts:
- osg-ce
- osg-ui
- glite-ce
- glite-ui
- wn001-wn007 worker nodes
- 8 CPUs: 2 × AMD Opteron 2350 2GHz (quad-core Barcelona)
- RAM: 16GB
- Disk: 2x150GB SATA
- OS: SL4.7 x86_64
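A small sketch of how the head node services above tie the WNs in (the /home export and file paths are assumptions, not the final layout):

    # /etc/exports on gridvm: export home directories to the cluster network
    # (NFSv4 may additionally want an fsid=0 pseudo-root; details omitted)
    /home  10.0.0.0/24(rw,sync,no_root_squash)

    # /var/spool/torque/server_priv/nodes: one line per WN, 8 job slots each
    wn001 np=8
    wn002 np=8
    wn003 np=8
    wn004 np=8
    wn005 np=8
    wn006 np=8
    wn007 np=8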