Some notes about the LHC Tier levels and the sizing of the cluster - or, how big does a cluster have to be to be useful for LHC work?
LCG and other documents
Requirements for a CERN LHC Tier2 cluster,
from www.gridpp.ac.uk/tier2/Experiment_Tier-2s_v1.0.doc:
| Experiment | Number of T1s | Number of T2s | Total T2 CPU (kSI2K) | Total T2 Disk (TB) | Average T2 CPU (kSI2K) | Average T2 Disk (TB) | Network In (Gb/s) | Network Out (Gb/s) |
| ALICE | 6 | 21 | 13700 | 2600 | 652 | 124 | 0.010 | 0.600 |
| ATLAS | 10 | 30 | 16200 | 6900 | 540 | 230 | 0.140 | 0.034 |
| CMS | 6 to 10 | 25 | 20725 | 5450 | 829 | 218 | 1.000 | 0.100 |
| LHCb | 6 | 14 | 7600 | 23 | 543 | 2 | 0.008 | 0.008 |
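The "Average" columns are just the totals divided by the number of T2 sites; a quick Python check of the table (figures copied verbatim from the rows above):

```python
# Per-experiment T2 requirements copied from the table above:
# (number of T2s, total T2 CPU in kSI2K, total T2 disk in TB)
requirements = {
    "ALICE": (21, 13700, 2600),
    "ATLAS": (30, 16200, 6900),
    "CMS":   (25, 20725, 5450),
    "LHCb":  (14,  7600,   23),
}

for exp, (n_t2, cpu, disk) in requirements.items():
    # the "Average T2" columns are simply total / number of T2s
    print(f"{exp}: {cpu / n_t2:.0f} kSI2K, {disk / n_t2:.0f} TB per average T2")
```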
Sizing
NB: These notes are from July 2006. The landscape has changed in the meantime.
A full-scale Tier2 node would be quite expensive and would require substantial manpower and expertise that are not locally available.
It would also require a 140 Mbps connection (assuming ATLAS) to an overseas Tier1, which I suspect is either not available in SA or incredibly expensive.
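For scale, a rough back-of-the-envelope conversion of that figure (the 0.140 Gb/s "Network In" value in the ATLAS row above) into daily volume:

```python
# What a sustained 140 Mbps (0.140 Gb/s) inbound link means per day
# (illustrative arithmetic only, using the ATLAS "Network In" figure from the table)
rate_gbps = 0.140                                     # Gb/s
seconds_per_day = 24 * 3600
tb_per_day = rate_gbps * seconds_per_day / 8 / 1000   # Gb -> GB -> TB
print(f"{tb_per_day:.1f} TB/day")                     # roughly 1.5 TB per day
```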
The sizing of an average Tier2 node is probably excessive for the limited size of the HEP/LHC community of SA, and might only be justified on a regional scale (southern or whole Africa? northern African countries probably have better network connections to Europe or Israel than to SA).
I would suggest going for a 1/10-scale node (<= 50 CPU equivalents), which would be sufficient for all non-LHC computing requirements and a great testbed for building up local expertise with a view to increasing involvement in LHC.
From an LHC perspective, this cluster could play the role of either a very small Tier2 or a pretty good Tier3.
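To put the 1/10 scale into numbers, a minimal Python sketch using the ATLAS averages from the table above (targets only; no assumption is made about the per-core capacity of the proposed hardware):

```python
# Rough targets for the proposed 1/10-scale node, taking the ATLAS average T2
# from the table above as the reference point (illustrative only)
avg_t2_cpu_ksi2k = 540    # average ATLAS T2 CPU (kSI2K)
avg_t2_disk_tb = 230      # average ATLAS T2 disk (TB)
scale = 1 / 10

print(f"CPU target:  {avg_t2_cpu_ksi2k * scale:.0f} kSI2K")   # ~54 kSI2K
print(f"Disk target: {avg_t2_disk_tb * scale:.0f} TB")        # ~23 TB

# The proposed 10 to 25 dual-core nodes give 20 to 50 cores,
# consistent with the "<= 50 CPU equivalents" suggestion above
for nodes in (10, 25):
    print(f"{nodes} nodes x 2 cores = {nodes * 2} cores")
```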
- 10 to 25 computing nodes
- prefer an established supplier (like SUN, IBM, HP, Dell...)
- single CPU, dual core AMD Opteron 175
- 2GB ECC RAM
- ~100GB SATA HD
- Gigabit Ethernet
- case: 1 rack-unit or blade (choose mostly on price)
- 1 service node (front-end, management and file server)
- single CPU, dual core Opteron or Xeon
- 2GB ECC RAM
- at least 500GB of mirrored or RAID-5 HD, but with space to grow
- Gigabit Ethernet switch with more than 10 ports, high performance
Motivations:
- 10 to 20 CPUs
- AMD Opteron dual core, single chip
- AMD CPUs have better FP performance than Intel x86-class CPUs.
- Intel Itanium CPUs have very good FP, but are not really i386-compatible.
- Apple/IBM PowerPC CPUs have very good SIMD FP, but almost all non-theoretical HEP or Nuclear Physics computations are not vectorisable.
- a dual-core CPU also benchmarks slightly better than two single-core CPUs
- we get 2 cores while using single-socket (1-CPU) motherboards, which are certainly cheaper
- Gigabit Ethernet
- established supplier (like SUN, IBM, HP, Dell)
- Do-It-Yourself from "white boxes" is expensive in terms of qualified manpower and support
- prefer a large (US?) supplier with a strong presence in SA
- stress the fact that this will also improve the skills of the supplier's SA workforce
- SGI have fancy but expensive stuff on offer
- LinuxNetworx and other specialized Linux cluster suppliers are small and in the US
- physical form factor
- rack-mount is absolutely necessary to allow for growth
- for fewer than 20 nodes a "blade" system is probably not interesting, since saving rack space is not so important; but Blade Centers might have integrated management, lower power consumption and better cooling
- normal 1U PCs might also be re-deployed to other tasks at End-Of-Life
- Linux Distribution
- Scientific Linux 4 ??
- most sites use RHEL 3 or SL 3, with kernel 2.4
- Data Storage and File Systems?