Ganglia on IBM POWER systems "Best Practices"
On this page I describe how, in my view, Ganglia is best used on IBM POWER5/6 systems.
Some things to consider before you start:
- Hostnames
  - To Ganglia, a new hostname is a new machine
  - Ganglia has to resolve IP addresses, so use DNS
- Stable IP addresses
  - Make sure you are not going to change IP addresses
- Time and date
  - Make sure the time zone, time, and date are consistent on all machines in a cluster
  - Use of NTP is highly recommended
- So
  - These are normal requirements on production machines
  - For prototype and test systems, get this right before starting Ganglia
- Read the simple Ganglia How-To, available for people setting up their first Ganglia system, at:
Best Practices
Best Practices – Preferred Setup
- Define each System p machine with all its LPARs as a separate cluster
- Use unicast for network communication
- Define at least two LPARs per System p machine as gmond hosts for gmetad
  - One would be sufficient; however, two are better for high-availability reasons
  - Define those two LPARs in /etc/gmetad.conf as the information brokers for that machine
- From gmetad, do not poll the gmond hosts more frequently than every 15 seconds
- Know up front what time intervals to use for sampling ("RRAs" stanza in /etc/gmetad.conf), see below
- Use my extensions for
  - Ethernet adapters
  - Fibre Channel adapters
  - Web interface
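The gmetad side of this setup comes down to one data_source line per System p machine. The following is a minimal Python sketch that builds such a line, assuming the usual gmetad.conf form of a quoted cluster name, a polling interval in seconds, and a list of host:port pairs. The function name, the machine name, and the LPAR hostnames are hypothetical and only serve as illustration.

```python
def data_source_line(cluster, hosts, poll_seconds=15, port=8649):
    """Build one data_source line for /etc/gmetad.conf.

    cluster      -- name of the cluster (one System p machine with its LPARs)
    hosts        -- LPARs running gmond that gmetad should poll
    poll_seconds -- polling interval; the page advises not going below 15
    port         -- TCP port of gmond's tcp_accept_channel (default 8649)
    """
    if not hosts:
        raise ValueError("at least one gmond host is required")
    if poll_seconds < 15:
        raise ValueError("do not poll the gmond hosts more often than every 15 seconds")
    endpoints = " ".join(f"{host}:{port}" for host in hosts)
    return f'data_source "{cluster}" {poll_seconds} {endpoints}'


# Hypothetical machine and LPAR names; two hosts are listed for high availability.
print(data_source_line("p570-01", ["lpar1.example.com", "lpar2.example.com"]))
```

gmetad tries the listed hosts in order, so the second LPAR only matters when the first one does not answer.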
Best Practices – Ganglia Sampling Intervals
Important to know:
- The sampling interval is defined in /etc/gmetad.conf.
- The "RRAs" stanza is used to define individual settings.
- The sampling settings are global.
- If no "RRAs" stanza is defined, a default configuration is used.
- For historic reasons, all values are specified in intervals of 15 seconds.
Example: Default settings in Ganglia
RRAs "RRA:AVERAGE:0.5:1:240" \
     "RRA:AVERAGE:0.5:24:240" \
     "RRA:AVERAGE:0.5:168:240" \
     "RRA:AVERAGE:0.5:672:240" \
     "RRA:AVERAGE:0.5:5760:370"
Translation:
- Take 240 samples at 1 × 15 seconds intervals (used for display of hour)
- Take 240 samples at 24 × 15 seconds (= 6 minutes) intervals (used for display of day)
- Take 240 samples at 168 × 15 seconds (= 42 minutes) intervals (used for display of week)
- Take 240 samples at 672 × 15 seconds (= 168 minutes) intervals (used for display of month)
- Take 370 samples at 5760 × 15 seconds (= 24 hours) intervals (used for display of year)
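The translation above is plain arithmetic on the last two fields of each RRA string. As a minimal sketch, the Python helper below (the name describe_rra is made up for this illustration) turns an RRA definition into the sample interval, the number of samples, and the total time span covered, assuming the 15-second base interval described on this page.

```python
STEP_SECONDS = 15  # base interval that every RRA step count is multiplied by


def describe_rra(rra):
    """Return (seconds per sample, number of samples, total seconds covered)."""
    _tag, _function, _xff, steps, rows = rra.split(":")
    interval = int(steps) * STEP_SECONDS
    return interval, int(rows), interval * int(rows)


DEFAULT_RRAS = [
    "RRA:AVERAGE:0.5:1:240",
    "RRA:AVERAGE:0.5:24:240",
    "RRA:AVERAGE:0.5:168:240",
    "RRA:AVERAGE:0.5:672:240",
    "RRA:AVERAGE:0.5:5760:370",
]

for rra in DEFAULT_RRAS:
    interval, rows, span = describe_rra(rra)
    print(f"{rra}: one sample every {interval} s, {rows} samples, {span / 3600:g} hours in total")
```

For the default settings the total spans come out as 1 hour, 1 day, 7 days, 28 days, and 370 days, which matches the hour, day, week, month, and year displays.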
Example: 1-minute sampling for one year
RRAs "RRA:AVERAGE:0.5:4:525600"
Translation:
- Take 525600 samples at 4 × 15 seconds (= 1 minute) intervals
  - 525600 = 60 (samples/hour) × 24 (hours) × 365 (days) × 1 (year)
Example: 1-minute sampling for 6 months, 5-minute sampling for 2 years
RRAs "RRA:AVERAGE:0.5:4:259200" \
     "RRA:AVERAGE:0.5:20:210240"
Translation:
- Take 259200 samples at every 4 × 15 seconds (= 1 minute) intervals
  - 259200 = 60 (samples/hour) × 24 (hours) × 30 (days) × 6 (months)
- Take 210240 samples at every 20 × 15 seconds (= 5 minutes) intervals
  - 210240 = 12 (samples/hour) × 24 (hours) × 365 (days) × 2 (years)
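Going the other way, from a wanted resolution and retention period to an RRA string, is the same arithmetic in reverse. The Python sketch below shows one way to do it; the function name make_rra and its parameters are assumptions for illustration, and the consolidation function and the 0.5 value are simply copied from the examples on this page.

```python
STEP_SECONDS = 15  # base interval used by Ganglia's RRA definitions


def make_rra(interval_seconds, retention_days, function="AVERAGE", xff=0.5):
    """Build an RRA string that keeps one sample per interval for the given number of days."""
    if interval_seconds <= 0 or interval_seconds % STEP_SECONDS:
        raise ValueError("the interval must be a positive multiple of 15 seconds")
    steps = interval_seconds // STEP_SECONDS               # e.g. 60 s -> 4 steps
    rows = retention_days * 24 * 3600 // interval_seconds  # samples to keep
    return f"RRA:{function}:{xff}:{steps}:{rows}"


print(make_rra(60, 180))   # 1-minute sampling for 6 months of 30 days
print(make_rra(300, 730))  # 5-minute sampling for 2 years of 365 days
```

With these inputs the two printed definitions are the same as in the example above (4:259200 and 20:210240).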
Example: 15-second sampling for 1 day, 1-minute sampling for 2 months, 10-minute sampling for 1 year
RRAs "RRA:AVERAGE:0.5:1:5760" \
     "RRA:AVERAGE:0.5:4:86400" \
     "RRA:AVERAGE:0.5:40:52560"
Translation:
- Take 5760 samples at every 1 × 15 seconds intervals
  - 5760 = 4 (samples/minute) × 60 (minutes/hour) × 24 (hours)
- Take 86400 samples at every 4 × 15 seconds (= 1 minute) intervals
  - 86400 = 60 (samples/hour) × 24 (hours) × 30 (days) × 2 (months)
- Take 52560 samples at every 40 × 15 seconds (= 10 minutes) intervals
  - 52560 = 6 (samples/hour) × 24 (hours) × 365 (days) × 1 (year)
Example: 1-minute sampling for 2 months, 5-minute sampling for 6 months, 15-minute sampling for 3 years
RRAs "RRA:AVERAGE:0.5:4:86400" \
     "RRA:AVERAGE:0.5:20:51840" \
     "RRA:AVERAGE:0.5:60:105120"
Translation:
- Take 86400 samples at every 4 × 15 seconds (= 1 minute) intervals
  - 86400 = 60 (samples/hour) × 24 (hours) × 30 (days) × 2 (months)
- Take 51840 samples at every 20 × 15 seconds (= 5 minutes) intervals
  - 51840 = 12 (samples/hour) × 24 (hours) × 30 (days) × 6 (months)
- Take 105120 samples at every 60 × 15 seconds (= 15 minutes) intervals
  - 105120 = 4 (samples/hour) × 24 (hours) × 365 (days) × 3 (years)
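Because the sampling settings are global, the number of samples chosen here applies to every metric of every host, so it helps to estimate the disk space before committing to a layout. The Python sketch below is a rough estimate only. It assumes that each stored sample takes 8 bytes (a 64-bit floating-point value), that every metric has its own RRD file with a single data source, and it ignores the file headers. The metric and host counts are hypothetical.

```python
BYTES_PER_SAMPLE = 8  # assumption: each stored value is a 64-bit double

RRAS = [
    "RRA:AVERAGE:0.5:4:86400",
    "RRA:AVERAGE:0.5:20:51840",
    "RRA:AVERAGE:0.5:60:105120",
]


def total_samples(rras):
    """Sum the sample counts (last field) of all RRA definitions."""
    return sum(int(rra.split(":")[4]) for rra in rras)


def estimated_bytes(rras, metrics_per_host, hosts):
    """Rough size of all RRD files, headers not included."""
    return total_samples(rras) * BYTES_PER_SAMPLE * metrics_per_host * hosts


# Hypothetical environment: 10 LPARs reporting 30 metrics each.
size = estimated_bytes(RRAS, metrics_per_host=30, hosts=10)
print(f"{total_samples(RRAS)} samples per metric, about {size / 2**20:.0f} MiB in total")
```

Round-robin files are allocated at their full size when they are created, so this space is used from the first day and not only once the retention period has filled up.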
Best Practices – Default Ports
Ganglia by default uses the following ports:
- 8649
  - Sending to other gmonds via UDP (udp_send_channel in /etc/gmond.conf)
  - Receiving from other gmonds via UDP (udp_recv_channel in /etc/gmond.conf)
  - Sending an XML description of the state of the cluster via TCP (tcp_accept_channel in /etc/gmond.conf)
- 8651
  - The port on which gmetad answers requests for XML (xml_port in /etc/gmetad.conf).
- 8652
  - The port on which gmetad answers interactive queries for XML (interactive_port in /etc/gmetad.conf).
  - This facility allows simple subtree and summation views of the XML tree.
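A quick way to check that these ports are reachable is to connect to one and read the XML it sends. The Python sketch below reads everything from a TCP port until the peer closes the connection. The function name fetch_xml is made up for this illustration, and the optional query line for the interactive port is an assumption about how that port is used, not something stated on this page.

```python
import socket


def fetch_xml(host, port=8649, query=None, timeout=10.0):
    """Return the raw XML bytes that a gmond or gmetad TCP port sends.

    host, port -- where to connect (8649 for gmond, 8651 or 8652 for gmetad)
    query      -- optional request line, assumed to be needed only on the
                  interactive port (8652)
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        if query is not None:
            sock.sendall(query.encode("ascii") + b"\n")
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:  # the peer closed the connection: the dump is complete
                break
            chunks.append(data)
    return b"".join(chunks)


# Example call (needs a running gmond; "localhost" is a placeholder):
#   xml = fetch_xml("localhost", 8649)
#   print(xml[:200])
```

If the call returns an empty result or times out, the first things to check are the tcp_accept_channel definition and any firewall between the two hosts.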
Best Practices – Shared Ethernet Adapter Statistics
Question: How to monitor SEA statistics on the VIO server?
- The AIX libperfstat library does not seem to report any statistics about Ethernet adapters if there are no interfaces defined on that adapter.
- Interfaces are only rarely defined on SEAs.
- The AIX command 'entstat', however, provides these statistics.
Solution: Extension through gmetric via a shell script
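The idea behind such an extension can be sketched in a few lines. The example below is written in Python rather than as a shell script and is not the script referred to above. It assumes that entstat prints the transmit and receive counters side by side on lines starting with "Packets:" and "Bytes:", and the sample text, the adapter name ent5, and the metric names are invented for illustration.

```python
import re
import subprocess

# One line holds the transmit counter on the left and the receive counter on the right.
_COUNTER = re.compile(r"^(Packets|Bytes):\s+(\d+)\s+\1:\s+(\d+)", re.MULTILINE)


def parse_entstat(text):
    """Extract the first transmit/receive packet and byte counters from entstat output."""
    stats = {}
    for name, sent, received in _COUNTER.findall(text):
        key = name.lower()
        # Keep only the first occurrence in case the output has several sections.
        stats.setdefault(f"tx_{key}", int(sent))
        stats.setdefault(f"rx_{key}", int(received))
    return stats


def gmetric_command(adapter, metric, value, units):
    """Build the gmetric call that publishes one counter."""
    return ["gmetric", "--name", f"{adapter}_{metric}", "--value", str(value),
            "--type", "double", "--units", units]


def publish(adapter):
    """Run entstat for one adapter and hand every counter to gmetric."""
    text = subprocess.run(["entstat", adapter], capture_output=True,
                          text=True, check=True).stdout
    for metric, value in parse_entstat(text).items():
        units = "bytes" if metric.endswith("bytes") else "packets"
        subprocess.run(gmetric_command(adapter, metric, value, units), check=True)


# Invented sample in the assumed layout, so the parser can be tried without AIX.
SAMPLE = """Transmit Statistics:            Receive Statistics:
Packets: 1200                   Packets: 3400
Bytes: 560000                   Bytes: 780000
"""
print(parse_entstat(SAMPLE))
print(gmetric_command("ent5", "tx_bytes", 560000, "bytes"))
```

The counters are cumulative, so a real script would publish the difference between two runs, for example when started every minute from cron.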
Best Practices – Fibre Channel Statistics
Question: How to monitor Fibre Channel statistics on the VIO server?
- The AIX libperfstat library does not seem to report any statistics about Fibre Channel adapters if there are no disks attached to the adapter.
  - Tapes, for instance, would be left out.
- The AIX command 'fcstat', however, provides these statistics.
Solution: Extension through gmetric via a shell script
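Since fcstat reports cumulative counters, the value worth graphing is the rate between two readings. The Python sketch below illustrates that step and is not the shell script referred to above. It assumes that fcstat lists "Input Bytes:" and "Output Bytes:" under a heading "FC SCSI Traffic Statistics"; the sample text and all function names are invented for illustration.

```python
import re

_FIELD = re.compile(r"^\s*(Input|Output) Bytes:\s+(\d+)", re.MULTILINE)


def parse_fcstat_bytes(text):
    """Return the input/output byte counters from the (assumed) SCSI traffic section."""
    # If the heading is missing, the whole text is searched instead.
    section = text.split("FC SCSI Traffic Statistics", 1)[-1]
    counters = {}
    for direction, value in _FIELD.findall(section):
        counters.setdefault(f"{direction.lower()}_bytes", int(value))
    return counters


def bytes_per_second(previous, current, elapsed_seconds):
    """Turn two counter readings into rates; a counter reset yields 0."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    rates = {}
    for key, now in current.items():
        delta = now - previous.get(key, now)
        rates[key] = max(delta, 0) / elapsed_seconds
    return rates


# Invented samples in the assumed layout, taken 60 seconds apart.
FIRST = "FC SCSI Traffic Statistics\n  Input Bytes:  1000\n  Output Bytes: 2000\n"
SECOND = "FC SCSI Traffic Statistics\n  Input Bytes:  4000\n  Output Bytes: 2600\n"

print(bytes_per_second(parse_fcstat_bytes(FIRST), parse_fcstat_bytes(SECOND), 60))
```

Each resulting rate can then be passed to gmetric as its own metric, one per adapter and direction.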
Best Practices – Enhanced Web Interface
Please take a look at http://www.perzl.org/ganglia/webinterface.html.