PROBE: PROblem data augmented By Experience
Joseph L. Hellerstein
IBM Thomas J. Watson Research Center, Hawthorne, New York
jlh@watson.ibm.com
Chuanyi Ji
Rensselaer Polytechnic Institute, Troy, New York
chuanyi@ecse.rpi.edu
December 2, 1996
Summary
The PROBE Project is intended to provide researchers with
access to data needed to develop a new generation of technologies for managing
the availability and performance of information systems. These technologies
include: proactive detection of service-level degradations, models of MIB
variables to facilitate intelligent monitoring, measurement data warehouses
that enable advanced decision support for availability and performance
management, and more effective algorithms for diagnosing performance problems.
Also of interest are data that relate to the management of emerging computing
and communication technologies, such as LANs that support mobile users.
A repository of problem data is being established at Rensselaer Polytechnic
Institute. Guidelines for submitting data to the repository are discussed.
Motivation
Client/server computing. Intranets. Downsizing. These
trends have increased the number of systems to manage, added to the diversity
of data sources, and reduced the number of skilled people. As a result,
it has become increasingly difficult to manage the availability and performance
of workstations, LANs, etc. With the rapid evolution of computing and communications
technology, the management problem is getting worse. Addressing this situation
requires innovative, ground-breaking management technologies. For example,
at Rensselaer Polytechnic Institute, work in proactive detection holds
the promise of detecting problems before serious service degradations
result. At Columbia University, work in forecasting models for MIB variables
may simplify monitoring (e.g., making timestamp reconciliation easier).
At Queen's University, work on measurement data warehouses may provide an
efficient infrastructure for a new generation of decision support tools
for availability and performance management. To succeed, these efforts
require data. These data are different from those that are routinely collected
for capacity planning. For example, proactive detection requires collecting
time-series data (e.g., sampled at one-minute intervals) during periods
when problems are not present and during periods when problems are
present. The PROBE Project seeks to foster innovative, ground-breaking
technologies for availability and performance management by providing the
data necessary for the development of these technologies. This project
will establish a repository of problem data along with related information
that is necessary to develop, tune, and evaluate advanced management technologies.
Also of interest are data that foster research in systems and network management
of emerging technologies for computing and communications (e.g., mobile
networks). The intent is that customers, vendors, and others will contribute
appropriately sanitized data to the repository in accordance with guidelines
that we are establishing. The data will then be freely available to researchers
in universities, government, and private industry. It is expected that
the guidelines for submitting data as well as the procedures for accessing
data will evolve over time.
Data Submission Guidelines
The guidelines listed below are intended to help potential
data providers understand what information researchers using PROBE need.
Supplying all of this information could be burdensome. As such,
a provider may choose to supply a subset and then allow researchers to
contact the provider directly for further details.
The following guidelines are provided based on our understanding
of the requirements of existing research projects. We expect these guidelines
to evolve as new requirements arise. A data submission may consist of many
data sets from multiple measurement tools running on one or more managed
nodes. For example, a data submission may consist of Unix(TM) vmstat
data from several hosts on two connected LANs along with SNMP data from
the routers that connect the LANs. Another example would be RMF Monitor
III data collected from one or more partitions in an MVS Sysplex in combination
with data from an SNA network connected to the Sysplex. Sampled and event
data are preferred, since we anticipate much work in the area of real-time
detection and diagnosis. Details of the data requirements are listed below:
- Background information
- Workload, including a general indication of how the system
is used and by whom. Also of interest are normal workload variations.
- Service-level expectations. Which user-groups expect
what service levels? How is this quantified (e.g., metrics used, time-of-day,
systems measured)?
- Configuration. This should include connectivity information
(e.g., the flow of transactions between CICS regions, links between routers),
descriptions of system components (e.g., line speeds, memory sizes),
and mapping information (e.g., which hosts are running which applications).
- Problem description
- A concise statement of the operational problems present
in the data, the components affected (e.g., which node, subsystem, application,
disk), and the manner in which the problems were resolved (including actions
taken by end-users).
- A specification of the time periods during which the
problem was known to be present. If this cannot be stated precisely, then
time ranges should be used.
- Relevant log information is also of value (e.g., syslog
files or web server error logs collected while problems are present).
- Variables measured
- Service level indicators. The data submission should
include at least one service level indicator, such as response time or
throughput. Service level variables should relate to the problems detected.
(Ideally, this should be a variable used by the site in a service level
agreement.)
- Other metrics. Multiple related measures, such as utilizations,
completion counts, service rates.
- Documentation of the variables. If a standard monitor
is used, specification of the monitor is sufficient. Otherwise, we need
the units in which the variable is measured (e.g., packets per second)
and the interpretation of the measurement (e.g., packets discarded due
to fragmentation errors). (Descriptions of how measured values are computed
are better still.)
- Number of observations for each variable. For sampled
data, the sampling rate should be sufficient to isolate the problem in
time (e.g., one-minute samples are fine for a problem that lasted ten minutes).
Ideally, there should be a substantial number of observations (tens or
hundreds) from both problem and non-problem periods.
- Data quality. Describe any issues that arose during data collection
that may affect the quality of the data (e.g., failure of a monitor). Please
be specific about the scope of the impact (when it occurred, what data
are affected).
- Data Formats
- Initially, our preferred format is an ASCII file that
consists of blank-separated columns. The first non-blank row should give
the variable name for each column. The file may be compressed or archived
using widely available utilities such as PKZIP or tar.
- Each submission should include a description of the files
submitted, including their contents and format.
- Contact information. Contact information should be provided
so that the PROBE administrator can resolve questions about the data. If
it is acceptable to the data provider, contact information will be given
to researchers so that they can ask their questions directly. To minimize
the burden on data providers, researchers and the PROBE administrator will
record the information collected from data providers so that questions
need be answered only once.
- Analysis Process. For researchers developing decision
support tools, it is necessary to understand better how management data
are navigated and interpreted.
- Reporting and Analysis. What reports are produced (e.g.,
service-level, resource)? For what managed elements and which workloads
and administrative domains is this reporting done? Who uses the reports
(e.g., a manager with budget responsibility)? How are they used (e.g.,
to substantiate complaints, to negotiate service levels)? What variables
are examined in what sequence?
- Data management. What is the policy for aging data, such
as aggregating minute-granularity measures of server CPU utilization into
one-hour averages and/or peak values?
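As an illustration of the data format, problem-period, and data-aging items
above, the following is a minimal Python sketch, not part of the PROBE
guidelines themselves. It reads a blank-separated ASCII file whose first
non-blank row names the variables, marks each sample as falling inside or
outside a stated problem period, and ages minute samples into one-hour
averages and peaks. The column names (timestamp, cpu_util, resp_time), the
timestamp layout, the sample values, and the function names are all
assumptions made for the example.

```python
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical submission file: blank-separated columns, with the variable
# names in the first non-blank row. Names and values are invented.
SAMPLE = """
timestamp cpu_util resp_time
1996-12-02T09:58 41.0 0.8
1996-12-02T09:59 47.0 0.9
1996-12-02T10:00 93.0 4.2
1996-12-02T10:01 97.0 5.1
"""

# Hypothetical problem period taken from a problem description (inclusive).
PROBLEM_START = datetime(1996, 12, 2, 10, 0)
PROBLEM_END = datetime(1996, 12, 2, 10, 1)


def read_probe_file(stream):
    """Parse blank-separated columns; the first non-blank row is the header."""
    rows = [line.split() for line in stream if line.strip()]
    header, records = rows[0], rows[1:]
    return [dict(zip(header, fields)) for fields in records]


def label_samples(records, start, end):
    """Convert fields to typed values and flag samples in the problem period."""
    labeled = []
    for rec in records:
        ts = datetime.strptime(rec["timestamp"], "%Y-%m-%dT%H:%M")
        labeled.append({
            "timestamp": ts,
            "cpu_util": float(rec["cpu_util"]),
            "resp_time": float(rec["resp_time"]),
            "problem": start <= ts <= end,
        })
    return labeled


def age_to_hourly(samples, variable):
    """Aggregate minute samples into per-hour (average, peak) pairs."""
    buckets = defaultdict(list)
    for s in samples:
        hour = s["timestamp"].replace(minute=0)
        buckets[hour].append(s[variable])
    return {hour: (sum(v) / len(v), max(v)) for hour, v in sorted(buckets.items())}


records = read_probe_file(io.StringIO(SAMPLE))
samples = label_samples(records, PROBLEM_START, PROBLEM_END)
hourly = age_to_hourly(samples, "cpu_util")
```

Keeping the problem flag alongside each sample means that problem and
non-problem observations can be counted or separated later without going
back to the problem description.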
Repository
The repository is administered and maintained by Professor
Chuanyi Ji at Rensselaer Polytechnic Institute. Current access is by ftp
and is limited to a few researchers while the repository is undergoing
its initial construction. We are investigating the use of resources available
to the Computer Measurement Group (e.g., a web site) as an alternative
to using machines at RPI.
Responsibilities of Researchers
For the research community to get maximum benefit from the PROBE repository,
certain guidelines should be followed by researchers using PROBE data.
- Tools developed to access, reduce, and analyze PROBE data should be
submitted to the PROBE administrator for general use by the research community.
Where appropriate, the tools should be developed according to standards
specified by the PROBE administrator.
- Researchers who contact data providers must respect the providers' limited
time. To this end, the PROBE administrator should be notified of any contact
made. Also, researchers are responsible for recording the information obtained
(e.g., definitions of measurement variables, refinements of configuration
information) and providing this to the PROBE administrator.
Current Research Activities
Listed below are efforts that we anticipate will benefit
from the PROBE Project.
- Proactive detection of availability and performance
problems. The objective of this work is to detect impending network
faults before they cause serious problems. To this end, we want to characterize
performance and availability problems in production systems. Also, we want
to develop algorithms that are effective at early detection of these problems
and validate these algorithms in production environments. The PROBE data
are needed to identify appropriate models and to assess their generality.
Contact: Chuanyi Ji, Rensselaer Polytechnic Institute, chuanyi@ecse.rpi.edu.
- Forecasting models for MIB variables. The goal
of this work is to develop time series models that forecast values of MIB
variables. Such a capability will have many benefits. For example, the
models will provide a way to characterize performance problems through
their dynamic behavior. Also, the forecasting models will enable the development
of management systems that can more intelligently reconcile timestamps
from multiple measurement sources. Contact: Nikolaus Haus, Columbia University,
nrh2@columbia.edu.
- Measurement Warehouse Design. This effort seeks
to apply technologies used in the data warehousing industry to the problem
of navigating performance and availability data. The data warehousing industry
commonly uses multidimensional databases to enable flexible drill-downs,
drill-ups, and computation of aggregates. These functions are central to
many performance management applications, such as diagnosing performance
problems and obtaining an integrated view of measured entities (e.g., an
application-view of data that spans multiple nodes). We focus on the efficient
computation of aggregates using techniques such as view materialization.
It is hoped that the PROBE data will characterize the kinds of aggregates
that should be computed and how navigation occurs between them. Contact:
Cadambi Sriram and Pat Martin at Queen's University, Kingston, Ontario,
{sriram, martin}@qucis.queensu.ca.
- Techniques for Diagnosing and Resolving Performance
Problems. Over the last ten years, a large number of systems have been
developed to aid in the diagnosis and resolution of performance problems.
Our previous work has evaluated commonly used algorithms for diagnosis.
In doing so, a number of deficiencies have been identified. Future efforts
will make use of the PROBE data to understand better when one technique
is preferred to another as well as to develop a new generation of techniques
that are more effective in practice. Contact: Joe Hellerstein, IBM Thomas
J. Watson Research Center, jlh@watson.ibm.com.
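To make the forecasting and proactive detection items above more concrete,
here is a minimal Python sketch of the general idea: forecast the next value
of a measured variable from its history, then flag samples that deviate
sharply from the forecast. Simple exponential smoothing is used purely as a
stand-in; it is not claimed to be the model used by any of the projects
listed. The series values, the smoothing constant, the threshold, and the
function names are all assumptions made for the example.

```python
def ewma_forecasts(values, alpha=0.5):
    """One-step-ahead forecasts via simple exponential smoothing.

    forecasts[i] is the prediction for values[i] made from earlier samples;
    the first forecast is seeded with the first observation.
    """
    forecasts = []
    level = values[0]
    for v in values:
        forecasts.append(level)
        # Update the smoothed level only after the forecast is recorded.
        level = alpha * v + (1 - alpha) * level
    return forecasts


def flag_deviations(values, forecasts, threshold):
    """Return indices where the observation differs from its forecast by more than threshold."""
    return [
        i for i, (v, f) in enumerate(zip(values, forecasts))
        if abs(v - f) > threshold
    ]


# Hypothetical per-minute values of a counter-derived rate (invented data).
series = [100, 102, 98, 101, 160, 170]
forecasts = ewma_forecasts(series, alpha=0.5)
flags = flag_deviations(series, forecasts, threshold=20)
```

With problem periods labeled in the repository data, the flagged indices
could be compared against the known problem times to judge how early, and
how reliably, a given model detects a problem.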