SiteScope User's Guide

Monitoring SiteScope Server Health




The reliability of operations monitoring depends in part on the reliability of the monitoring application itself. SiteScope can monitor several key aspects of its own environment to help uncover monitor configuration problems as well as SiteScope server load. A Health button is part of the common navigation bar at the top of each SiteScope screen. Included in the button graphic is a status icon that indicates whether SiteScope Health monitoring has detected a problem that could be impacting monitoring performance. Click the Health button to go to the SiteScope Health page.

The SiteScope Health Page

The SiteScope Health page includes two tables that display information from SiteScope's monitoring of its own health. These tables are:

  1. SiteScope Log Event Table
  2. SiteScope Server Load Table

Each table displays a set of information, including a status icon, indicating the state of a number of SiteScope performance parameters. The information in these tables is discussed below.

The SiteScope Health Page also includes other links for working with the SiteScope Health feature. These include links to configure the Health page error thresholds and a link to disable or enable SiteScope Health monitoring.

SiteScope Log Event

SiteScope Log Event shows incidents of skipped monitors. A monitor is reported as "skipped" if it fails to complete its actions before it is scheduled to run again. This can occur with monitors that have complex actions to perform, such as querying databases, stepping through multi-page URL sequences, waiting for scripts to run, or waiting for an application that has hung.

For example, assume you have a URL Sequence Monitor that is configured to transit a series of eight web pages. This sequence includes performing a search which may have a slow response time. The monitor is set to run once every 60 seconds. When the system is responding well, the monitor can run to completion in 45 seconds. However, at times, the search request takes longer and then it takes up to 90 seconds to complete the transaction. In this case, the monitor will not have completed before SiteScope is scheduled to run the monitor again. SiteScope will detect this and make a log event in the SiteScope error log. The SiteScope Health monitors will detect this and make an entry in the SiteScope Log Event table.
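The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the reasoning above, not SiteScope's scheduler: the function name skipped_runs is hypothetical, and treating a run that finishes exactly at the next scheduled start as "not skipped" is an assumption.

```python
import math


def skipped_runs(duration_seconds, interval_seconds):
    """Estimate how many scheduled starts pass while one run is still in progress.

    Assumption: scheduled starts fall at interval, 2*interval, ... after the
    run begins, and a start is skipped only if the run has not yet finished.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return max(0, math.ceil(duration_seconds / interval_seconds) - 1)


# Values from the example: a 60-second interval, with runs of 45 and 90 seconds.
print(skipped_runs(45, 60))  # run finishes before the next scheduled start
print(skipped_runs(90, 60))  # the next scheduled start arrives mid-run
```

Under these assumptions, the 45-second run causes no skip and the 90-second run causes one.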

Skipped monitors cause a number of problems. One is the loss of data when a monitor run is suspended because a previous run has not completed or has been hung by an unresponsive application. Skipped monitors will also cause SiteScope to automatically restart itself. This is done in an effort to clear problems and reset monitors. However, this can also lead to gaps in monitoring coverage and data. Adjusting the frequency at which a monitor is set to run (Update every) or specifying an applicable timeout value can often correct the problem of skipping monitors. Investigation of unresponsive systems that are being monitored may also be necessary.

Since it is often not obvious that a monitor is skipping, the SiteScope Health feature is designed to monitor the SiteScope logs and report on skipped monitor events. The results are shown in the SiteScope Log Event table as follows:

Name
This is the name for the log event. The default, Skipped <number> Time(s), is used as a label to indicate the number of monitors that are reported as skipping the number of times indicated by <number>. For example, the line named Skipped One Time will indicate data for monitors that have skipped once. This name, or label, is customizable in the SiteScope/groups/health.config file (see example below).

Status
The status column reports the status of the monitor as good if no log events are reported. If there was a log event, data about the most recent log event is displayed in the Status column. The text in the Status column is also a link to the SiteScope error log which contains detailed log events with information about what monitors may be skipping.

Per Hour
The Per Hour column shows a cumulative total of the log events meeting the criteria of the log event. In the case of Skipped One Time, the Per Hour column shows the total number of times that any monitor skipped one time in the last hour. In the case of Skipped Two Times, the Per Hour column shows the total number of times that any monitor skipped two times in the last hour. Any monitor that skipped two times will also have skipped one time with the first skip being added to the Skipped One Time total.

Since Restart
As with the Per Hour column, the Since Restart column shows a cumulative total of the log events meeting the criteria of the log event. In the case of Skipped One Time, the Since Restart column shows the total number of times that any monitor skipped one time since the last time SiteScope restarted. SiteScope is programmed to restart itself once per day or whenever any monitor skips six times or more.

The SiteScope Health monitors update the SiteScope Log Event Table whenever a new entry is added to the SiteScope error log file.
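The way the Per Hour and Since Restart totals accumulate can be sketched as follows. This is a minimal illustration, not SiteScope's implementation: the function name log_event_totals and the (timestamp, skip_number) event shape are hypothetical, and the sketch assumes each skip is logged as its own entry (so a monitor that skips twice produces one "skipped once" entry and one "skipped twice" entry).

```python
from collections import Counter
from datetime import datetime, timedelta


def log_event_totals(events, now, last_restart):
    """Tally skip log entries per skip number for both table columns.

    events: iterable of (timestamp, skip_number) pairs (hypothetical shape).
    Returns (per_hour, since_restart) as Counters keyed by skip number.
    """
    hour_ago = now - timedelta(hours=1)
    per_hour = Counter()
    since_restart = Counter()
    for timestamp, skip_number in events:
        if timestamp > now:
            continue  # ignore entries dated in the future
        if timestamp >= last_restart:
            since_restart[skip_number] += 1
        if timestamp > hour_ago:
            per_hour[skip_number] += 1
    return per_hour, since_restart


day = datetime(2003, 1, 1)
events = [
    (day.replace(hour=9, minute=30), 1),   # since restart only
    (day.replace(hour=11, minute=10), 1),  # first skip of a monitor
    (day.replace(hour=11, minute=11), 2),  # same monitor skips again
    (day.replace(hour=11, minute=30), 1),  # a different monitor skips once
]
per_hour, since_restart = log_event_totals(
    events, now=day.replace(hour=12), last_restart=day.replace(hour=9)
)
print(dict(per_hour), dict(since_restart))
```

With this sample data, the Skipped One Time row would show 2 per hour and 3 since restart, and the Skipped Two Times row would show 1 in both columns.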

SiteScope Server Load

The SiteScope Server Load table is the equivalent of a SiteScope monitor group that monitors server resources on the server where SiteScope is running. This includes monitors for CPU, disk space, memory, etc. along with a check of how many monitors are waiting to be run (see the Progress Report page). A problem with resource usage on the SiteScope server may be caused by monitors with configuration problems or may simply indicate that a particular SiteScope is reaching its performance capacity. For example, high CPU usage by SiteScope may indicate that the total number of monitors being run is reaching a limit. High disk space usage may indicate that the SiteScope monitor data logs are about to exceed the capacity of the local disk drives (see Log Preferences for SiteScope data log options).

The SiteScope Server Load monitors report their data to the SiteScope Server Load table as follows:

Name
This is the name of the resource or parameter that is being monitored. This is usually the same as the type of monitor being used. These names are customizable in the SiteScope/groups/health.config file (see example below).

Per Hour
The Per Hour column shows the average of the measured parameter for the last hour.

Since Restart
The Since Restart column shows the average of the measured parameter since the last time SiteScope restarted.
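As a rough illustration of how the two averages differ, the sketch below computes both from a list of timestamped samples. The function name load_averages and the (timestamp, value) sample shape are hypothetical, and the sketch assumes that Since Restart averages every sample taken since SiteScope last restarted while Per Hour averages only the samples from the last hour.

```python
from datetime import datetime, timedelta


def load_averages(samples, now, last_restart):
    """Return (per_hour_average, since_restart_average) for one resource.

    samples: iterable of (timestamp, value) pairs (hypothetical shape).
    An average is None when no samples fall inside its window.
    """
    def mean(values):
        return sum(values) / len(values) if values else None

    samples = list(samples)
    hour_ago = now - timedelta(hours=1)
    per_hour = mean([v for t, v in samples if hour_ago < t <= now])
    since_restart = mean([v for t, v in samples if last_restart <= t <= now])
    return per_hour, since_restart


day = datetime(2003, 1, 1)
cpu_samples = [
    (day.replace(hour=10, minute=0), 20),
    (day.replace(hour=11, minute=30), 40),
    (day.replace(hour=11, minute=45), 36),
]
hour_avg, session_avg = load_averages(
    cpu_samples, now=day.replace(hour=12), last_restart=day.replace(hour=9)
)
print(hour_avg, session_avg)
```

Here only the last two samples fall inside the Per Hour window, while all three count toward Since Restart.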

Configuring SiteScope Health Indicators

The error, warning, and good status thresholds for the SiteScope Health Log Event and Server Load tables are set in the Configure SiteScope Health Indicators page. The following describes the configuration settings available.

Log Monitors Table

Name
Indicates the name of the event being monitored. For example, Skipped One Time is the name for monitoring the SiteScope error log for monitors that have skipped once.

Found
This is a heading for the Per Hour and Since Restart columns displayed in the SiteScope Log Event Table.

Health: Warn if Greater
This is the warning threshold for the indicated line item. When the criteria for warning is met, the Health status icon will change to the warning icon. For example, if more than 5 "skipped four times" log entries are recorded in one hour, the status icon for the Skipped Four Times item will change to the warning symbol in the SiteScope Health Page and on the Health button in the navigation bar.

Health: Error if Greater
This is the error threshold for the indicated line item. When the criteria for error is met, the Health status icon will change to the error icon. For example, if more than 5 "skipped four times" log entries are recorded in one hour, the status icon for the Skipped Four Times item will change to the error symbol in the SiteScope Health Page and on the Health button in the navigation bar.

Alert if Greater
The Alert if Greater setting is used to set an e-mail alert threshold for the corresponding SiteScope Health measurement. Use the first text entry to set the number of log events that must be met to trigger an alert. For example, a value of 50 in the Per Hour line of the Skipped One Time row would trigger an alert if 50 or more skipped one time entries are recorded in the error log over the last hour. Use the selection box on the right to select where the e-mail alert should be sent. The default is None, which means that no alert will be sent.

SiteScope Server Load Table

Name
Indicates the name of the server resource or parameter being monitored, for example CPU.

Time Interval
This is a heading for the Per Hour and Since Restart columns displayed in the SiteScope Server Load Table.

Health: Warn if Greater
This is the warning threshold for the indicated line item. When the criteria for warning is met, the Health status icon will change to the warning icon. For example, if the average CPU utilization on the SiteScope server exceeds the configured warning value during one hour, the status icon for the CPU item will change to the warning symbol in the SiteScope Health Page and on the Health button in the navigation bar.

Health: Error if Greater
This is the error threshold for the indicated line item. When the criteria for error is met, the Health status icon will change to the error icon. For example, if the average disk space usage exceeds 70 percent during one hour, the status icon for the Disk(s) item will change to the error symbol in the SiteScope Health Page and on the Health button in the navigation bar.

Alert if Greater
The Alert if Greater setting is used to set an e-mail alert threshold for the corresponding SiteScope Health measurement. Use the first text entry to set the monitor status value that must be met to trigger an alert. For example, a value of 80 in the Per Hour line of the CPU monitor row would trigger an alert if the average of SiteScope CPU utilization exceeded 80 percent during the last hour. Use the selection box on the right to select where the e-mail alert should be sent. The default is None, which means that no alert will be sent.
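The threshold settings described above can be summarized in a short sketch. It is illustrative only: the function names health_status and should_alert are hypothetical, and the strict greater-than comparison is an assumption taken from the setting names (Warn if Greater, Error if Greater, Alert if Greater). The sample values are the CPU Per Hour entries from the default health.config listing in this guide.

```python
def health_status(value, warn_if_greater, error_if_greater):
    """Classify a measurement as 'good', 'warning', or 'error'.

    Assumption: a threshold is crossed only when the value is strictly
    greater than it, and the error check takes precedence over warning.
    """
    if value > error_if_greater:
        return "error"
    if value > warn_if_greater:
        return "warning"
    return "good"


def should_alert(value, alert_if_greater):
    """Return True when an e-mail alert would be triggered (same assumption)."""
    return value > alert_if_greater


# CPU Per Hour values from the default listing:
# _hourWarn=50, _hourError=70, _hourAlert=80, and a current _hour value of 38.
print(health_status(38, warn_if_greater=50, error_if_greater=70))
print(health_status(55, warn_if_greater=50, error_if_greater=70))
print(should_alert(85, alert_if_greater=80))
```

Whether an alert is actually sent also depends on the recipient selection, which defaults to None.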

Working with the health.config File

The entries in the health.config file are configurable via the Configure SiteScope Health Indicators page, and you should normally use this interface to make changes to the file. If you need to change the description that appears in the Name columns of the Health tables, or the data labels (for example, Avg % used or Max % Full on) in the SiteScope Server Load table, you can edit the health.config file directly. To change the Name descriptions, edit the _name=namedescription entry. To change the data labels, edit the _valueLabel=descriptor entry.

Note: Do not add any extra spaces or lines to the file.
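If you script such an edit instead of making it by hand, one cautious approach is to replace only the value of the target line so that the line count and spacing stay unchanged. The following is a minimal sketch under several assumptions: the helper name rename_entry is hypothetical, sections are assumed to be separated by lines containing only #, each section's _class line is assumed to precede its _name line, and lines are assumed to end with a plain newline. Back up the file before changing it.

```python
def rename_entry(text, class_name, new_name):
    """Return text with the _name value changed for the given _class section.

    Only the matching _name line is rewritten; no lines or spaces are added.
    """
    lines = text.split("\n")
    in_section = False
    for i, line in enumerate(lines):
        if line == "#":
            in_section = False          # a separator ends the current section
        elif line == "_class=" + class_name:
            in_section = True           # entered the section to edit
        elif in_section and line.startswith("_name="):
            lines[i] = "_name=" + new_name
    return "\n".join(lines)


sample = (
    "_health=good\n_name=health config\n#\n"
    "_class=CPU\n_hourWarn=50\n_name=CPU\n"
)
updated = rename_entry(sample, "CPU", "Processor")
print(updated)
```

The header's own _name entry is left alone because it does not follow a _class line.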

The following is a partial listing of the default health.config file showing the syntax of the entries in the file.

_health=good
_name=health config
_version=1.0
#
_class=logEvent
_hourWarn=30
_sessionWarn=720
_search=skipped #1
_sessionAlert=1200
_type=log
_hourAlert=50
_sessionError=1200
_emailSessionWho=_id:n
_hourError=50
_emailHourWho=_id:n
_name=Skipped One Time
#
...
#
_class=CPU
_hourWarn=50
_sessionWarn=40
_generalLabel=Avg % used
_sessionAlert=70
_type=schedule
_session=31
_valueLabel=Avg % used
_hourAlert=80
_sessionError=60
_emailSessionWho=_id:n
_hourError=70
_emailHourWho=_id:n
_numValues=59
_name=CPU
_hour=38
...
#
_class=MonitorsWaiting
_hourWarn=10
_sessionWarn=7
_generalLabel=Avg # monitors waiting
_sessionAlert=20
_type=schedule
_session=0
_valueLabel=Avg # monitors waiting
_hourAlert=20
_sessionError=15
_emailSessionWho=_id:n
_hourError=20
_emailHourWho=_id:n
_numValues=58
_name=monitorsWaiting
_hour=0
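For reading the file programmatically, the syntax shown in the listing can be parsed with a few lines of Python. This is a sketch based only on the partial listing above, not a documented file format: it assumes that a line containing only # separates sections, that every other meaningful line has the form key=value, and that values may themselves contain # or = characters. The function name parse_health_config is hypothetical.

```python
def parse_health_config(text):
    """Split health.config-style text into a list of {key: value} sections."""
    sections = [{}]
    for line in text.splitlines():
        if line.strip() == "#":
            sections.append({})         # separator: start a new section
            continue
        key, sep, value = line.partition("=")
        if sep:                         # skip lines without key=value syntax
            sections[-1][key] = value
    return [section for section in sections if section]


sample = (
    "_health=good\n_name=health config\n_version=1.0\n#\n"
    "_class=logEvent\n_hourWarn=30\n_search=skipped #1\n_name=Skipped One Time\n#\n"
    "_class=CPU\n_hourWarn=50\n_hourError=70\n_name=CPU\n"
)
for section in parse_health_config(sample):
    print(section.get("_class", "(header)"), section.get("_name"))
```

All values are returned as strings, so thresholds such as _hourWarn need converting to numbers before comparison.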





Copyright © 2003 Mercury Interactive Corporation.
All rights reserved.