Archive for February, 2002
- Added a new CLUSTER element to the XML with two attributes: NAME and LOCALTIME. Necessary for monitoring clusters in many different timezones.
- Fixed the getopt_long() call in gmond.c thanks to feedback from Meik Hellmund. The getopt_long() parameters didn’t match the switch() statement, which broke the “trusted_host” option.
- Fixed a bug in gmond where connections from untrusted hosts caused segfaults. The error was caused by passing datum_free() a NULL pointer in server_thread() of ./gmond/server.c.
- Changed the way transient nameservice errors are handled by pre_process_node() in ./gmond/listen.c. Previously, transient errors were retried; now they are treated as errors, although gmond will still try to resolve the host again whenever it receives a new multicast packet from it.
- Updated the ganglia.spec file to merge gmond and gmetric into a single RPM, fixed some small bugs, and updated the RPM information.
- Preston Smith updated the FreeBSD monitoring code to include all metrics which are monitored under Linux except number of running processes, absolute CPU idle time, and shared memory. SMP users may find that FreeBSD’s cp_time sysctl is not completely accurate under FreeBSD-STABLE, meaning CPU percentages might be inaccurate. However, it works under FreeBSD-CURRENT.
- Changed the gmetric options to also support long options and updated the help output (from -h, --help) to be much more descriptive.
- Added the getopt source to the ganglia library for systems (Solaris, FreeBSD) which don’t have it installed by default. Tested on Solaris 8, FreeBSD 4.5 and Alpha/Linux.
- Completely rewrote the underlying hash library because the original hash functions were over-engineered and had a memory bug on certain platforms. The new hash functions are superlight and fast. Built a test program and profiled/traced all memory functions using mpatrol. No leaks. Special thanks to Mike Howard for letting me test gmond on his cluster, which displayed the memory bug. Also thanks to Alan Hagg and Rod Hernandez for patiently answering my questions about the memory bug on their clusters. Your help was appreciated!
- Updated code to catch transient nameservice errors when they occur and retry. Hosts that don’t resolve are now handled correctly instead of being treated as an error.
- Added a patch submitted by Joshua J England for gmond to correctly report the number of CPUs and their speed on Alpha architectures.
- Added a patch submitted by Eirikur Hallgrimsson and written by Yaroslav Klyukin for gmetric which allows users to choose the network interface on which gmetric multicasts metric data.
- Changed the “safe_host” option to “trusted_host” to make it clearer. Also added the “num_nodes” and “num_custom_metrics” options for more efficient in-memory cluster image creation
- Reduced the number of total threads by one by removing the for(;;)pause() spin and having the main thread do server work
- created the my_inet_ntop() function in libganglia to deal with the limitations of inet_ntoa in a multi-threaded environment
- changed the self-organizing behavior of gmond to recognize when a transient error occurred on a remote gmond process
- added verbose error checking of gethostbyaddr() in listen.c
- Fixed a bug in ganglia-rdd.pl where stale hosts were not being removed from the in-memory hash (to match the XML output). No changes to the underlying databases were necessary; only the data being put into them changed.
- Changed the Y-Axis on the hosts in the cluster overview to have the same range (min/max) in order to make better host comparisons. (Thanks to Tim Cera for making the suggestion)
- Changed the list of hosts that are down to a drop-down box. Previously, when a large number of machines went down, the top right corner table cell would swell. The host list is also sorted from the most recent crash to the oldest.
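The changelog above mentions my_inet_ntop(), added because inet_ntoa() is awkward in a multi-threaded program: it returns a pointer to a static buffer, so concurrent callers can overwrite each other's result. The sketch below illustrates the general remedy, formatting into a caller-supplied buffer with the standard inet_ntop(). The helper name format_ipv4 and its signature are hypothetical and are not taken from libganglia.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (NOT the actual my_inet_ntop() from libganglia).
 * Formats an IPv4 address into a buffer owned by the caller, so each
 * thread can use its own storage instead of inet_ntoa()'s shared
 * static buffer.
 *
 * Returns buf on success, or NULL if the buffer is too small or the
 * conversion fails.
 */
const char *format_ipv4(struct in_addr addr, char *buf, size_t len)
{
    assert(buf != NULL);

    /* INET_ADDRSTRLEN covers "255.255.255.255" plus the terminator. */
    if (len < INET_ADDRSTRLEN)
        return NULL;

    return inet_ntop(AF_INET, &addr, buf, (socklen_t)len);
}
```

A caller would typically declare `char buf[INET_ADDRSTRLEN];` on its own stack and pass it in, which keeps the result private to that thread.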
See a demo at http://ganglia.sourceforge.net/demo/!
Download it now, from the ganglia download site
I’ve had requests from users to integrate ganglia into their MPI installations. As a first step I’m releasing a Perl script which creates a dynamic load-balanced MPI machinefile.
Any nodes in your cluster that are down are not included in the list and the number of CPUs for each node is listed as well.
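The released script is written in Perl and is not reproduced here. As a rough illustration of the idea only, the following C sketch builds a machinefile from an in-memory node list: it orders hosts by load, skips hosts that are down, and lists each host's CPU count. The struct fields, the function name write_machinefile, the use of one-minute load as the sort key, and the `host:cpus` line format are all assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory view of one cluster node. */
struct node {
    const char *name;     /* hostname */
    int         up;       /* non-zero if the node is reachable */
    int         cpus;     /* number of CPUs reported */
    double      load_one; /* one-minute load average (assumed metric) */
};

/* qsort comparator: least-loaded nodes first. */
static int by_load(const void *a, const void *b)
{
    const struct node *x = a;
    const struct node *y = b;
    return (x->load_one > y->load_one) - (x->load_one < y->load_one);
}

/*
 * Sorts nodes by load (in place) and writes one "host:cpus" line per
 * live node into out. Returns the number of hosts written, or -1 if
 * the output buffer is too small.
 */
int write_machinefile(struct node *nodes, size_t n, char *out, size_t outlen)
{
    size_t used = 0;
    int written = 0;

    assert(nodes != NULL && out != NULL && outlen > 0);
    out[0] = '\0';

    qsort(nodes, n, sizeof nodes[0], by_load);

    for (size_t i = 0; i < n; i++) {
        if (!nodes[i].up)
            continue; /* down nodes are left out of the list */

        int r = snprintf(out + used, outlen - used, "%s:%d\n",
                         nodes[i].name, nodes[i].cpus);
        if (r < 0 || (size_t)r >= outlen - used)
            return -1;

        used += (size_t)r;
        written++;
    }
    return written;
}
```

Putting the least-loaded hosts first means a launcher that fills the machinefile from the top would place processes on the idlest machines.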
To get this great script, go to the download page
Version 2.0.4 of the ganglia monitoring core is available for download from the ganglia download page. This release is a very minor bug fix that applies only to clusters with machines running AMD chips: some AMD machines were reporting double the number of processors actually present.
- increased the speed of the host security check for the XML port
- added commandline options for almost all compile-time opts for gmond
- added a --safe_host option to allow a host outside of the multicast channel to connect
- now gmond strips all quotes (") from gmetric data to keep XML well-formed
- improved the self-organizing behavior of the gmonds
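The quote-stripping change listed above matters because a stray double quote inside an attribute value would end the attribute early and break the XML. A minimal sketch of such a filter follows. The function name strip_quotes is hypothetical, and this is not the code gmond actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not gmond's actual implementation).
 * Removes every double-quote character from s in place and returns
 * the number of characters removed.
 */
size_t strip_quotes(char *s)
{
    char *dst = s;
    size_t removed = 0;

    assert(s != NULL);

    for (const char *src = s; *src != '\0'; src++) {
        if (*src == '"')
            removed++;     /* drop the quote */
        else
            *dst++ = *src; /* keep everything else */
    }
    *dst = '\0';
    return removed;
}
```

Stripping is simpler than escaping to `&quot;`, at the cost of altering the reported value. Note that this sketch handles only the double quote; other XML-significant characters such as `<` and `&` are untouched.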