Dell was responsible for the overall design and construction of CITerra. The cluster’s hardware and software consist of the following components.

Hardware

CITerra resides in 31 cabinets or “racks”, each about two feet wide, three feet deep, and six feet high. Each rack holds up to 42U of equipment. The heaviest racks weigh approximately 1750 pounds apiece.

The racks are located in a specially built facility in the Division of Geological and Planetary Sciences at Caltech. They stand side by side in a long row, with one two-foot-wide walkway about halfway down the row. Altogether, then, the cluster occupies a floor area 64 feet long and 3 feet wide, and weighs about 50,000 pounds.
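The footprint figures above follow from simple arithmetic, sketched below. The variable names are illustrative; the numbers come from the text.

```python
# Figures from the text; names are illustrative only.
RACK_COUNT = 31
RACK_WIDTH_FT = 2
RACK_DEPTH_FT = 3
WALKWAY_WIDTH_FT = 2
TOTAL_WEIGHT_LB = 50_000      # approximate total quoted in the text
HEAVIEST_RACK_LB = 1750       # approximate heaviest rack quoted in the text

# 31 racks side by side plus one walkway gap in the row.
row_length_ft = RACK_COUNT * RACK_WIDTH_FT + WALKWAY_WIDTH_FT
floor_area_sqft = row_length_ft * RACK_DEPTH_FT

# The average rack weight should come in under the heaviest rack.
average_rack_lb = TOTAL_WEIGHT_LB / RACK_COUNT

print(f"Row length: {row_length_ft} ft")          # 64 ft
print(f"Floor area: {floor_area_sqft} sq ft")     # 192 sq ft
print(f"Average rack weight: {average_rack_lb:.0f} lb")
```

The average of roughly 1600 pounds per rack is consistent with the heaviest racks weighing about 1750 pounds.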

The 31 racks are conceptually divided into five groups. Four of the groups have seven racks apiece, each group holding 256 compute nodes and their associated network equipment. Currently two of these groups are in use for the cluster. The remaining group of three racks, which sits in the center of the row, holds shared infrastructure equipment. Two of the three infrastructure racks hold most of the equipment for the shared filesystem. The third infrastructure rack holds the master and login nodes, along with a few other infrastructure elements.
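As a quick consistency check on the rack grouping, the sketch below recomputes the rack total and the node counts from the figures in the paragraph above. The names are illustrative.

```python
# Figures from the text; names are illustrative only.
COMPUTE_GROUPS = 4
RACKS_PER_COMPUTE_GROUP = 7
INFRASTRUCTURE_RACKS = 3
NODES_PER_COMPUTE_GROUP = 256
GROUPS_IN_USE = 2

total_racks = COMPUTE_GROUPS * RACKS_PER_COMPUTE_GROUP + INFRASTRUCTURE_RACKS
node_capacity = COMPUTE_GROUPS * NODES_PER_COMPUTE_GROUP
nodes_in_use = GROUPS_IN_USE * NODES_PER_COMPUTE_GROUP

print(f"Total racks: {total_racks}")                  # 31
print(f"Node capacity (all groups): {node_capacity}")  # 1024
print(f"Nodes in use: {nodes_in_use}")                 # 512
```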

Computational Elements

Compute Nodes

There are 512 compute nodes in active service, each with two processors (CPUs) of four cores apiece. The compute nodes act together as a unit, so a single computational job can use as many as 4096 cores, depending on the application software (see note 1).
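The 4096-core figure follows directly from the node, processor, and core counts. The sketch below shows the arithmetic and adds a small, hypothetical helper (`nodes_needed` is not part of any cluster tool) that estimates how many whole nodes a job of a given core count would occupy, assuming cores are allocated node by node.

```python
import math

# Figures from the text; names are illustrative only.
ACTIVE_NODES = 512
CPUS_PER_NODE = 2
CORES_PER_CPU = 4

cores_per_node = CPUS_PER_NODE * CORES_PER_CPU
total_cores = ACTIVE_NODES * cores_per_node


def nodes_needed(job_cores: int) -> int:
    """Hypothetical helper: whole nodes occupied by a job of job_cores cores."""
    if not 1 <= job_cores <= total_cores:
        raise ValueError(f"job size must be between 1 and {total_cores} cores")
    return math.ceil(job_cores / cores_per_node)


print(f"Cores per node: {cores_per_node}")   # 8
print(f"Total cores: {total_cores}")         # 4096
print(f"Nodes for a 1000-core job: {nodes_needed(1000)}")  # 125
```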

The compute nodes are Dell PowerEdge 1950s. All 512 compute nodes are identical. Each compute node has:

  • two Intel Quad Core Clovertown processors
    • 2.33GHz
    • 8MB of level 2 cache
    • 64-bit architecture (Nocona, EM64T)
  • 12GB of RAM
  • one hard disk
    • Serial attached SCSI
    • 154GB
    • 15kRPM
  • one Myrinet card
    • model M3F-PCIXD-2
    • in a 133MHz PCI-X slot
    • 2Gb/sec bandwidth each direction
    • fiber interconnect
  • one Gigabit Ethernet port
    • Intel controller
    • second port available
    • shared with BMC (see next item)
  • Baseboard Management Controller for remote management
  • one power supply
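The per-node figures above can be aggregated across the 512 active nodes, as in the rough sketch below. The memory and disk totals use only numbers from the list. The theoretical peak is an estimate that assumes 4 double-precision floating-point operations per core per clock cycle, a figure commonly quoted for this processor generation but not stated in the text.

```python
# Figures from the text, except where marked as an assumption.
ACTIVE_NODES = 512
CORES_PER_NODE = 8            # two quad-core processors
RAM_PER_NODE_GB = 12
DISK_PER_NODE_GB = 154
CLOCK_GHZ = 2.33
FLOPS_PER_CYCLE = 4           # ASSUMPTION: not stated in the text

total_cores = ACTIVE_NODES * CORES_PER_NODE
total_ram_gb = ACTIVE_NODES * RAM_PER_NODE_GB
ram_per_core_gb = RAM_PER_NODE_GB / CORES_PER_NODE
total_local_disk_gb = ACTIVE_NODES * DISK_PER_NODE_GB

# Theoretical peak only; real applications achieve a fraction of this.
peak_tflops = total_cores * CLOCK_GHZ * FLOPS_PER_CYCLE / 1000

print(f"Aggregate RAM: {total_ram_gb} GB")            # 6144 GB
print(f"RAM per core: {ram_per_core_gb} GB")          # 1.5 GB
print(f"Aggregate local disk: {total_local_disk_gb} GB")
print(f"Estimated theoretical peak: {peak_tflops:.1f} TFLOPS")
```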

In addition to the 512 compute nodes, 4 compute nodes are held as spares, for a quick swap whenever an active compute node has a hardware problem.

Login Nodes

Two PowerEdge 1950s serve as user login servers. These are where users perform their interactive work on the cluster. Typical user activities include editing program files, compiling applications, and launching and monitoring computational runs.

The login nodes, like the other infrastructure servers described in the following sections, are slightly more capable than the compute nodes. Each server has 12GB of RAM, redundant power supplies, and two hard drives connected to a hardware RAID 1 (mirroring) controller.

Master Node

A single PowerEdge 1950 serves as a central “master node” for the cluster. Several elements of cluster infrastructure run on or originate from this server. These include:

  • centralized OS & software reinstalls for the compute nodes and login nodes
  • DNS
  • NTP
  • default LSF master

The cluster services are designed so that if the master node fails, the cluster will continue operating. Other nodes serve as backup DNS, NTP, and LSF servers. Compute and login node reinstalls do depend on the master node, but reinstalls are not essential to the daily operation of the cluster.

Shared Filesystem Elements

Ibrix servers

  • 16 Dell PE 1850s as segment servers
  • 1 Dell PE 1850 as the Fusion Manager
    • 3.6 GHz
    • 64-bit architecture
  • 6GB of RAM
  • one hard disk
    • SCSI
    • 120GB
    • 10kRPM
  • Two QLogic Fibre Channel adapter cards
    • model QLA2312
    • Set for failover
  • Two Gigabit Ethernet ports
    • Intel controller
    • Ports are bonded
    • shared with BMC (see next item)
  • Baseboard Management Controller for remote management

DataDirect Networks disk array

  • Two DDN SAN Controllers
    • model S2A8500
    • in dual mode for failover
  • 10 Disk Chassis
    • Model SFB2016
    • Each has 16 300GB 10k FC Drives
  • 2 Fibre Channel switches
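The raw capacity of the disk array follows from the chassis and drive counts above. The sketch below computes it in decimal units. It is a raw figure only: the text does not describe the RAID layout, parity, or hot spares, so usable filesystem capacity will be lower by an amount not estimated here.

```python
# Figures from the text; names are illustrative only.
DISK_CHASSIS = 10
DRIVES_PER_CHASSIS = 16
DRIVE_CAPACITY_GB = 300

total_drives = DISK_CHASSIS * DRIVES_PER_CHASSIS
raw_capacity_gb = total_drives * DRIVE_CAPACITY_GB
raw_capacity_tb = raw_capacity_gb / 1000   # decimal terabytes

print(f"Total drives: {total_drives}")               # 160
print(f"Raw capacity: {raw_capacity_gb} GB")         # 48000 GB
print(f"Raw capacity: {raw_capacity_tb:.0f} TB")     # 48 TB
```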

HPC Network

The HPC network is used specifically for message passing among the nodes. We use Myrinet 2G hardware running the GM protocol. There are four Clos switches and one spine switch, with enough ports to connect up to 1024 nodes.

  • 4 Clos256+256
    • 256 host ports, 256 interswitch ports on 64 quad-fiber ports
  • 1 Spine1024
    • 8 quad (32) 32-port switches presented on 256 quad-fiber ports (1024 ports total)
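The port counts above can be cross-checked against the 1024-node capacity. The sketch below assumes that each quad-fiber port carries four links, which is how the text's "256 interswitch ports on 64 quad-fiber ports" reads. The names are illustrative.

```python
# Figures from the text; names are illustrative only.
CLOS_SWITCHES = 4
HOST_PORTS_PER_CLOS = 256
QUAD_FIBER_PORTS_PER_CLOS = 64
LINKS_PER_QUAD_FIBER = 4          # "quad" read as four links per port
SPINE_QUAD_FIBER_PORTS = 256
ACTIVE_NODES = 512

uplinks_per_clos = QUAD_FIBER_PORTS_PER_CLOS * LINKS_PER_QUAD_FIBER
total_host_ports = CLOS_SWITCHES * HOST_PORTS_PER_CLOS
total_uplinks = CLOS_SWITCHES * uplinks_per_clos
spine_ports = SPINE_QUAD_FIBER_PORTS * LINKS_PER_QUAD_FIBER

print(f"Interswitch links per Clos switch: {uplinks_per_clos}")  # 256
print(f"Host ports across all Clos switches: {total_host_ports}")  # 1024
print(f"Uplinks into the spine: {total_uplinks}")                # 1024
print(f"Spine ports available: {spine_ports}")                   # 1024
print(f"Host ports unused by active nodes: {total_host_ports - ACTIVE_NODES}")  # 512
```

Each Clos switch has as many interswitch links as host ports, and the four switches together exactly fill the spine, so the fabric has room for twice the nodes currently in service.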

Commodity Network

Public Network

We currently use Gigabit Ethernet out to the campus network.

Private Network

Each node connects via Gigabit Ethernet to a Nortel switch stack. There are 7 switches per switch stack, and each switch stack serves a subcluster of 256 nodes. Each switch stack trunks 4 GigE links to a central switch stack, to which the head nodes and Ibrix servers connect.

  • Compute switch stacks
    • Nortel 5510-48T
    • 7 switches per subcluster
    • 4 trunks to the central switch stack
  • Central switch stack
    • Nortel 5510-48T
    • 3 switches
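The private network figures imply a port budget and an uplink oversubscription ratio for each subcluster, sketched below. The 48 ports per switch is an assumption read from the "48T" in the model name, and each node is assumed to use one Gigabit port; neither is stated outright in the text.

```python
# Figures from the text, except where marked as an assumption.
SWITCHES_PER_STACK = 7
PORTS_PER_SWITCH = 48         # ASSUMPTION: inferred from the model name
NODES_PER_SUBCLUSTER = 256
TRUNK_LINKS = 4
LINK_SPEED_GBPS = 1           # Gigabit Ethernet for nodes and trunks

stack_ports = SWITCHES_PER_STACK * PORTS_PER_SWITCH
ports_left = stack_ports - NODES_PER_SUBCLUSTER - TRUNK_LINKS

# Ratio of total node-facing bandwidth to trunk bandwidth.
node_bandwidth_gbps = NODES_PER_SUBCLUSTER * LINK_SPEED_GBPS
trunk_bandwidth_gbps = TRUNK_LINKS * LINK_SPEED_GBPS
oversubscription = node_bandwidth_gbps / trunk_bandwidth_gbps

print(f"Ports per stack: {stack_ports}")                     # 336
print(f"Ports left after nodes and trunks: {ports_left}")    # 76
print(f"Uplink oversubscription: {oversubscription:.0f}:1")  # 64:1
```

A ratio this high suits a network that carries management and file traffic rather than message passing, which runs over the separate Myrinet fabric.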

The Facility

Software

Operating System Environment

CITerra runs Linux, specifically Red Hat Enterprise Linux (RHEL). All nodes in the cluster run the same release, currently RHEL 4.

The cluster software environment that we use, Platform Rocks, is founded on RHEL (see note 2). Platform Rocks, a supported product of Platform Computing Inc., is a slightly modified version of software created by a community-based, open-source project called simply Rocks. The Rocks project receives funding from the National Science Foundation, like CITerra itself.

Rocks (and Platform Rocks) bundles the operating system and a number of separate cluster tools, within a framework of cluster management software and databases designed specifically for Rocks.

Filesystem

The shared filesystem is IBRIX Fusion, a file-serving suite comprising a scalable, POSIX-compliant parallel file system, a scalable volume manager, high-availability features, and a management interface with both graphical and command-line front ends.

Resource Management

We use LSF (Load Sharing Facility) 6.2 from Platform Computing.

Scientific Applications

 

1 The cluster configuration and software can readily handle jobs up to the full 4096 cores without modification.

2 Depending on local requirements, Rocks can instead be based on a Linux distribution, such as CentOS, that recompiles and repackages the Red-Hat-provided RHEL source code. Sites that choose this option can avoid paying license or support fees to Red Hat, while benefitting from most of the software quality inherent in RHEL.