During the early years of building very large High Performance Computing
clusters, there was a need for comprehensive yet lightweight system monitoring
across the cluster. Existing tools weren't getting it done: they had too much
overhead, and their differing output formats made it difficult, if not
impossible, to correlate their data.
Collectl was developed to provide an integrated view of virtually all data one
might wish to look at, including high performance file systems and interconnects
such as InfiniBand. Because all data collection is synchronized to the
millisecond across the entire cluster, it becomes fairly easy to see what all
systems are doing at the exact same point in time.
Collectl has been deployed on some of the largest HPC systems in the world, and
since I began working on cloud-based systems it has been extended via its
plugin API to deal with virtual environments and even show metrics by VM,
including networks and disks.
Mark Seger