Prior to Rocks 4.3, the BIOS-defined boot order of a compute node required that a network boot (known as PXE) come after the local hard disk. In particular, the boot order in the BIOS would be set as:
1. CDROM
2. Hard Disk
3. On-board Network Device (PXE)
A user would have to intercept the boot sequence (often by pressing the F12 key on an attached keyboard) to force a network boot. Rocks also provided a small utility on each node (/boot/kickstart/cluster-kickstart-pxe) that manipulated two bytes on the local hard disk so that the BIOS would bypass the local disk and try the next device on the boot list. With the boot order set as above, the node would then PXE boot and therefore re-install.
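The documentation does not spell out which two bytes are changed. A common way to make a BIOS skip a disk is to invalidate the MBR boot signature (the bytes 0x55 0xAA at offsets 510-511 of the first sector); the sketch below illustrates that technique. It is a hypothetical illustration, not the actual cluster-kickstart-pxe implementation, and the device path /dev/sda is an assumption.

    #!/usr/bin/env python
    # Hypothetical sketch: toggle the MBR boot signature so the BIOS
    # treats the disk as non-bootable and falls through to the next
    # boot device (PXE). This is NOT the cluster-kickstart-pxe code;
    # the assumption that the "two bytes" are the 0x55 0xAA signature
    # at offsets 510-511 is ours, and /dev/sda is a placeholder.
    import sys

    MBR_SIG_OFFSET = 510          # last two bytes of the 512-byte MBR
    BOOT_SIG = b"\x55\xaa"        # value the BIOS checks for
    DISABLED_SIG = b"\x00\x00"    # any other value reads as "unbootable"

    def set_signature(device, sig):
        with open(device, "r+b") as disk:
            disk.seek(MBR_SIG_OFFSET)
            disk.write(sig)

    if __name__ == "__main__":
        action = sys.argv[1] if len(sys.argv) > 1 else "disable"
        if action == "disable":
            set_signature("/dev/sda", DISABLED_SIG)  # next boot falls to PXE
        else:
            set_signature("/dev/sda", BOOT_SIG)      # restore local booting

Once the signature is cleared and the node reboots, the BIOS finds no bootable disk and proceeds to the on-board network device; a subsequent installation that writes a fresh boot sector restores a valid signature as a side effect.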
The logic for this structure was that a frontend did not need to know the state of a node (whether it had failed and should be reinstalled, or was in some other intermediate state). It also meant that a frontend did not need to be up for a node to reboot itself.

Another practical issue arises when PXE booting large clusters. Because the PXE client resides in NIC firmware, no assumptions can be made about timeouts, retries, or other elements that contribute to robustness. Large cluster reinstalls (or reboots) of a kernel delivered over PXE would often leave nodes hung because of the limited robustness of TFTP, the underlying protocol used to transfer the initial kernel and ramdisk image to nodes booting over the network. For wholesale re-installation of large clusters, PXE does not scale well. To address this, Rocks also provides the installation kernel and initial ramdisk image on the local hard drive. The command /boot/kickstart/cluster-kickstart, run on a node, causes that node to re-install itself using the local (hard disk) copy of the installation kernel and initial ramdisk.
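As an illustration of how the local-disk path sidesteps the TFTP bottleneck, the following hypothetical driver walks a list of compute nodes and triggers /boot/kickstart/cluster-kickstart on each over ssh in small batches, so the frontend never has to serve kernels over TFTP to many firmware clients at once. The node-naming scheme, batch size, and delay are illustrative assumptions; this is a sketch, not a Rocks-supplied tool.

    #!/usr/bin/env python
    # Hypothetical driver for a wholesale re-install that avoids PXE/TFTP:
    # each node boots its installer kernel and ramdisk from its own disk
    # via cluster-kickstart, so no TFTP transfer happens at boot time.
    # Node names, batch size, and delay are illustrative assumptions.
    import subprocess
    import time

    NODES = ["compute-0-%d" % i for i in range(64)]  # assumed naming scheme
    BATCH = 8     # nodes kicked off per batch
    DELAY = 30    # seconds between batches

    if __name__ == "__main__":
        for start in range(0, len(NODES), BATCH):
            for node in NODES[start:start + BATCH]:
                # The ssh session ends when the node reboots into the
                # installer, so we do not wait on these processes.
                subprocess.Popen(
                    ["ssh", node, "/boot/kickstart/cluster-kickstart"])
            time.sleep(DELAY)  # stagger batches to smooth frontend load

Staggering the batches matters because every re-installing node still contacts the frontend for its kickstart configuration; spreading the reboots out keeps that load manageable even though the kernel and ramdisk come from each node's own disk.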
The above boot order and behaviour continues to be supported in Rocks 4.3. That is, existing Rocks clusters can be upgraded without requiring the cluster owner to change any BIOS settings.