RRFW Working Draft: Distributed collector architecture

Status: pending implementation. Date: May 26, 2004. Last revised: June 14, 2004

Introduction

In large installations, a single server often does not have enough capacity to collect data from all the data sources. In other cases, network bandwidth or security restrictions make it preferable to collect (SNMP) data locally at the site and to transfer the updates to the central location less frequently.

Terminology

Hub servers are those that run the user web interfaces and, optionally, threshold monitors. They are normally placed in the central location or NOC datacenter.

Spoke servers are those running SNMP or other data collectors. They periodically transfer the data to Hub servers. One Spoke server may send copies of data to several Hub servers, and one Hub server may receive data from many Spoke servers.

In general, the property of being a Hub or a Spoke is local to a pair of servers and their datasource trees, and it only describes the functions of data collection and transfer. In complex installations, the same instance of RRFW may function as a Hub for some remote Spokes and, simultaneously, as a Spoke for some other Hubs.

An Association is a set of attributes that describes a single connection between a Hub server and a Spoke server. These attributes are:

Association ID

Hub server ID, Spoke server ID

Transport type

Transport mode

Transport parameters

Compression type and level

Tree name on Hub server

Subtree path on Hub server

Tree name on Spoke server

Path translation rules
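The attribute list above could be held in a single record per connection. The following Python sketch is illustrative only: the field names, the default values, and the representation of path translation rules as prefix pairs are assumptions, not part of this draft. The transport modes are the ones described in the Transport section.

```python
from dataclasses import dataclass, field
from enum import Enum


class TransportMode(Enum):
    PUSH = "push"   # the Spoke server initiates the transfer
    PULL = "pull"   # the Hub server initiates the transfer


@dataclass
class Association:
    """One Hub/Spoke connection. All field names are hypothetical."""
    association_id: str
    hub_server_id: str
    spoke_server_id: str
    transport_type: str                 # for example "ssh"
    transport_mode: TransportMode
    transport_params: dict = field(default_factory=dict)
    compression_type: str = "none"
    compression_level: int = 0
    hub_tree: str = ""
    hub_subtree_path: str = "/"
    spoke_tree: str = ""
    # Assumed representation: (spoke_prefix, hub_prefix) pairs.
    path_translation: list = field(default_factory=list)


# Example values are invented for illustration.
assoc = Association(
    association_id="A1",
    hub_server_id="hub1",
    spoke_server_id="spoke1",
    transport_type="ssh",
    transport_mode=TransportMode.PUSH,
)
```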

Transport

The modular architecture should allow different types of data transfer. The default transport is Secure Shell version 2 (SSH). Other possible transports include RSH, HTTP/HTTPS, and rsync.

Two transport modes should be implemented: PUSH and PULL. In PUSH mode, Spoke servers initiate the data transfer and push the data to Hub servers. In PULL mode, Hub servers initiate the data transfer and ask Spokes for data updates. It should be possible to mix the transport modes for different Associations on the same server, but within each Association the mode should be strictly determined. The choice of transport mode should be based on local security policies, and server and network performance.

Optionally, the compression method and level can be configured. Although the SSH protocol supports its own compression, more aggressive compression methods may be used to make better use of the available bandwidth.

Transport agents should notify the operator in cases of delivery failures.

Operation

For Spoke servers, distributed data transfer will be implemented as an additional storage type. For Hub servers, it will be a new collector type.

Each data transfer is a concatenation of messages. A message is one of two types: CONFIG or DATA. The Spoke server generates the messages and stores them for transfer. Messages are delivered to Hub servers with some delay, but they are guaranteed to arrive in sequential order. For each pair of servers, messages are numbered consecutively. These numbers are used for failure detection.
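Because messages are numbered consecutively for each pair of servers, a receiver can detect a loss by comparing each message number with the previous one. The sketch below is one possible way to do this in Python; the class name, the method name, and the use of a peer ID as the dictionary key are assumptions made for illustration.

```python
class SequenceTracker:
    """Remembers the last message number seen from each peer server."""

    def __init__(self):
        self._last_seen = {}   # peer ID -> last message number received

    def missing_before(self, peer_id, msg_id):
        """Return how many messages were skipped before msg_id (0 if none)."""
        last = self._last_seen.get(peer_id)
        self._last_seen[peer_id] = msg_id
        if last is None:
            return 0           # first message from this peer: nothing to compare
        return max(0, msg_id - last - 1)


tracker = SequenceTracker()
first = tracker.missing_before("spoke1", 100001)   # no history yet
second = tracker.missing_before("spoke1", 100002)  # consecutive
third = tracker.missing_before("spoke1", 100005)   # 100003 and 100004 are missing
```

A non-zero result would be the point at which a transport agent notifies the operator of a delivery failure.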

A Spoke server keeps track of its configuration and sends a CONFIG message after each configuration change. This message contains the mapping between Spoke server tokens and datasource paths, and a limited set of parameters for displaying and monitoring the data.

After each collector cycle, the Spoke server sends DATA messages. These messages contain the following information: the timestamp of the update, the token, and the value. The message format should be designed to consume minimal bandwidth.

The Hub server picks up the messages delivered by the transport agents. Upon receiving a CONFIG message, it waits for a preconfigured delay in order to collect as many CONFIG messages as possible. Then the data transfer agent generates a new XML configuration based on the messages and starts the compilation of the configuration. The DATA messages are queued for the collector to pick up and store the values. It must be ensured that all DATA messages queued for the old configuration are processed before the compilation starts.
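The ordering constraint in the previous paragraph can be sketched as follows. This is only one possible reading, under several assumptions: a single Spoke per receiver, a delay timer that starts at the first pending CONFIG message and is not extended by later ones, and DATA messages that arrive after a pending CONFIG message being held for the new configuration. The class and callback names are hypothetical.

```python
from collections import deque


class HubReceiver:
    """Hub-side message handling for one Association (illustrative sketch)."""

    def __init__(self, config_delay, store, compile_config):
        self.config_delay = config_delay      # seconds to wait for more CONFIG messages
        self.store = store                    # callback: store one DATA message
        self.compile_config = compile_config  # callback: generate and compile the configuration
        self.pending_configs = []
        self.compile_at = None                # time at which compilation becomes due
        self.old_data = deque()               # DATA for the configuration currently in force
        self.new_data = deque()               # DATA received after a pending CONFIG

    def receive(self, msg_type, payload, now):
        if msg_type == "CONFIG":
            self.pending_configs.append(payload)
            if self.compile_at is None:
                self.compile_at = now + self.config_delay
        elif self.pending_configs:
            self.new_data.append(payload)
        else:
            self.old_data.append(payload)

    def tick(self, now):
        """Called periodically, for example once per collector cycle."""
        # First store every DATA message that belongs to the old configuration.
        while self.old_data:
            self.store(self.old_data.popleft())
        if self.compile_at is not None and now >= self.compile_at:
            # The old queue is empty here, so compilation may start.
            self.compile_config(self.pending_configs)
            self.pending_configs = []
            self.compile_at = None
            # DATA held back for the new configuration is stored on the next tick.
            self.old_data, self.new_data = self.new_data, deque()
```

Passing the current time in as an argument, rather than reading the clock inside the class, keeps the sketch easy to exercise without waiting for the delay to expire.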

In case of a fatal failure and loss of data, the Hub server ignores all DATA messages until it receives a new CONFIG message. A periodic configuration update schedule should be defined. If no configuration changes occur within a certain period of time, the Spoke server periodically resends the CONFIG messages with the same timestamp.
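The recovery rule above amounts to a small gate in front of the Hub's message queue. The following sketch assumes that some other component reports the failure; the class and method names are invented for illustration.

```python
class RecoveryGate:
    """Drops DATA messages after a failure until a CONFIG message arrives."""

    def __init__(self):
        self.awaiting_config = False

    def report_failure(self):
        # Called when a fatal failure or data loss has been detected.
        self.awaiting_config = True

    def accept(self, msg_type):
        """Return True if a message of this type should be processed."""
        if msg_type == "CONFIG":
            self.awaiting_config = False   # a fresh CONFIG ends the recovery state
            return True
        return not self.awaiting_config


gate = RecoveryGate()
gate.report_failure()
dropped = gate.accept("DATA")      # ignored: no CONFIG since the failure
resynced = gate.accept("CONFIG")   # accepted, and clears the recovery state
resumed = gate.accept("DATA")      # DATA is processed again
```

The periodic CONFIG resend bounds how long the Hub stays in the recovery state when the Spoke configuration itself has not changed.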

Message format

A message is text in an email-like format: it starts with a header, followed by an empty line and the body. A line containing a single dot (.) marks the end of the message. Blocks within a CONFIG message are separated by a line containing a semicolon (;), each block representing a single datasource leaf.
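The framing rules above could be read by a small line-oriented parser. The Python sketch below is a non-authoritative illustration: the function name is hypothetical, and it assumes that every header and body line has the form name:value, which this draft shows only for CONFIG messages.

```python
def parse_messages(lines):
    """Yield (header, blocks) for each complete message in an iterable of lines.

    header is a dict of header fields; blocks is a list of dicts, one per
    semicolon-separated block of the body.
    """
    header, blocks, block = {}, [], {}
    in_body = False
    for raw in lines:
        line = raw.strip()
        if not in_body:
            if line == "":
                if header:
                    in_body = True        # an empty line ends the header
                continue
            key, _, value = line.partition(":")
            header[key] = value
        elif line in (".", ";"):
            if block:
                blocks.append(block)      # close the current block
                block = {}
            if line == ".":
                yield header, blocks      # a single dot ends the message
                header, blocks, in_body = {}, [], False
        elif line:
            key, _, value = line.partition(":")
            block[key] = value
```

Splitting on the first colon only keeps values that themselves contain colons intact.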

Example:

 MsgID:100001
 Type:CONFIG
 Timestamp:1085528682

 level2-token:T0005
 level2-path:/Routers/RTR1/Interface_Counters/Ethernet0/InOctets
 vertical-label:bps
 ....
 ;
 level2-token:T0006
 level2-path:/Routers/RTR1/Interface_Counters/Ethernet0/OutOctets
 vertical-label:bps
 .
 MsgID:100002
 Type:DATA
 Timestamp:1085528690