Add a job/debug monitor stream(s) to the server
Created by: robertmaynard
Issue
By leveraging the ZeroMQ pub-sub model we can allow the server to broadcast a stream of status and monitoring information.
This solves two large, outstanding issues with Remus. The first is that the server is a black box with no way of informing clients or third parties about what is happening, whether internal errors are occurring, etc. The second is that the client is limited to a busy wait to check for status events. If we allow the client to also use a pub/sub style connection to the server, we can build a more efficient status monitoring client.
Technical Issues
The primary issue with the pub/sub model is the classic slow joiner problem: you can't determine when a subscriber starts to receive messages. Even if the subscriber is started before the publisher, it will always miss the first few messages that the publisher sends. Connecting takes a small but non-zero amount of time, and during that time the publisher may already have sent messages that the subscriber never sees.
If the monitor emitted only general status messages and the goal was to show the overall health of the server, the slow joiner issue would not be a problem. But since it can be used to monitor specific jobs, we need some way to minimize the severity, or even the occurrence, of the slow joiner problem. A couple of decent solutions are proposed in the Node Coordination ( http://zguide.zeromq.org/page:all#Node-Coordination ) section of the ZMQ guide. Personally I think the best way for Remus is:
- Server opens a PUB socket and starts sending non-job-related messages and regular heartbeat messages.
- Client / ServerMonitor connect a SUB socket and wait for a message to arrive from the PUB socket. Once one arrives, they send a message over the classic Req/Rep client socket to the server stating what channels should be created ( e.g. start sending info for all jobs ).
- Now that the publisher has all the necessary information, it starts to send real data.
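The three steps above can be sketched without any sockets at all. The following is only an illustration of the ordering, not Remus code: `MonitorPublisher`, `MonitorSubscriber`, and all of their methods are hypothetical names, and a vector stands in for the PUB socket.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the server side of the handshake. No ZeroMQ
// sockets are used; sent_ records what would go out on the PUB socket.
class MonitorPublisher
{
public:
  // Step 1: heartbeats are always sent, whether or not anyone is listening.
  void heartbeat() { sent_.push_back("heartbeat"); }

  // Step 2 (server side): a request arriving on the Req/Rep socket names
  // a channel the client wants, e.g. "Job".
  void enableChannel(const std::string& channel) { channels_.insert(channel); }

  // Step 3: real data is only published for channels that were requested.
  // Returns true if the message was actually published.
  bool publish(const std::string& channel, const std::string& payload)
  {
    if (channels_.count(channel) == 0) { return false; }
    sent_.push_back(channel + ":" + payload);
    return true;
  }

  const std::vector<std::string>& sent() const { return sent_; }

private:
  std::set<std::string> channels_;
  std::vector<std::string> sent_;
};

// Hypothetical client side: it only asks for channels once it has seen
// proof (any message at all) that its SUB connection is live.
class MonitorSubscriber
{
public:
  // Called for every message received on the SUB socket. Returns true
  // exactly once: on the first message, which is the moment the channel
  // request should be sent over Req/Rep.
  bool onMessage(const std::string& /*msg*/)
  {
    if (connected_) { return false; }
    connected_ = true;
    return true;
  }

  bool connected() const { return connected_; }

private:
  bool connected_ = false;
};
```

The point of the ordering is that job data requested by a given subscriber is never published before that subscriber has proven it can receive, so the only messages it can miss are heartbeats and general status.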
Extending the Server
Here are the very high-level requirements for publication on the server:
- Pub socket will always exist
- Pub connection details will be controlled by remus::server::ServerPorts
- A new request type will be added to the classic client interface. The response to this request will be the endpoint of the publication socket. This solves the entire discovery problem of figuring out which port the pub socket has bound to. It also means that anything that wants to act like a server monitor will have to use both a req/rep and a pub/sub socket.
- Server method variables will control the type of information broadcast on the Pub socket. The classic verbosity level controls of DEBUG, WARN, ERROR are a parallel issue to the pub socket: a client might only care about Job status publications and not about the general health of the connected workers, and in that use case the classic logging levels are not useful. Instead I propose we use:
  - Jobs: Job status information, formatted in a way that a subscriber can filter based on job uuid.
  - Worker: Information about what workers are connecting, taking jobs, asking for jobs and heartbeating, and when.
  - Errors: System wide errors only. This will include server exceptions, workers being marked as dead, jobs failing, etc.
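One way the three categories could be toggled independently of the logging level is a small bitmask. This is a sketch only: `PublishCategory` and `PublishSettings` are hypothetical names, and the assumed default of publishing no category is not stated in this proposal.

```cpp
#include <cstdint>

// Hypothetical names: the proposal fixes the three categories, not the API.
enum class PublishCategory : std::uint8_t
{
  Jobs    = 1u << 0, // job status, filterable by job uuid
  Workers = 1u << 1, // worker connects, job requests, heartbeats
  Errors  = 1u << 2  // server exceptions, dead workers, failed jobs
};

class PublishSettings
{
public:
  void enable(PublishCategory c) { mask_ |= bit(c); }
  void disable(PublishCategory c) { mask_ &= static_cast<std::uint8_t>(~bit(c)); }
  bool isEnabled(PublishCategory c) const { return (mask_ & bit(c)) != 0; }

private:
  static std::uint8_t bit(PublishCategory c) { return static_cast<std::uint8_t>(c); }

  std::uint8_t mask_ = 0; // assumed default: no category is published
};
```

The server would consult `isEnabled` before building a message for a category, so disabled categories cost nothing beyond the check.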
Client monitoring of the Server
To monitor the activity of the server a new remus::client class called Monitor (or ServerMonitor?) will be created. This class must be extensible so that the user can plug it into their own code easily. A quick draft of what the Monitor class would look like is:
class Client
{
public:
  ...
  remus::client::Monitor monitorServer();
};
**Edited: With new Monitor design**
class Monitor
{
public:
  typedef remus::function<void(const std::string& domain,
                               remus::thirdparty::cJSON* msg,
                               remus::Client* source)> MonitorFunction;

  //the set of domains that currently have a subscription
  std::set<std::string> domains() const;

  //function is expected to have the following type signature
  // operator()(const std::string& domain, cJSON* msg, remus::Client* source)
  //
  //the same domain string is later passed to unsubscribe
  void subscribe(const std::string& domain, MonitorFunction function);
  void unsubscribe(const std::string& domain);
};
It then becomes fairly easy to construct a JobMonitor:
class JobMonitor
{
public:
  JobMonitor( remus::proto::Job job,
              remus::client::Monitor monitor);

  remus::proto::JobStatus latestStatus();

  //maybe even allow buffering of status
  std::vector< remus::proto::JobStatus > BufferedStatus;
};
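To show how the two drafts could fit together, here is a self-contained sketch under several simplifying assumptions: `std::function` replaces `remus::function`, a plain `std::string` replaces both the `cJSON*` message and the job/status types, the `Client*` source argument is dropped, and `dispatch` is a hypothetical hook standing in for the code that reads from the SUB socket. The `"Job:" + id` domain format is also an assumption.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for the Monitor draft above.
class Monitor
{
public:
  using MonitorFunction =
    std::function<void(const std::string& domain, const std::string& msg)>;

  void subscribe(const std::string& domain, MonitorFunction function)
  {
    callbacks_[domain] = std::move(function);
  }

  void unsubscribe(const std::string& domain) { callbacks_.erase(domain); }

  // Hypothetical hook: would be called for each message read from the SUB
  // socket. Returns true if a callback was registered for the domain.
  bool dispatch(const std::string& domain, const std::string& msg) const
  {
    auto it = callbacks_.find(domain);
    if (it == callbacks_.end()) { return false; }
    it->second(domain, msg);
    return true;
  }

private:
  std::map<std::string, MonitorFunction> callbacks_;
};

// Subscribes to a single job's domain and buffers every status it sees.
class JobMonitor
{
public:
  JobMonitor(const std::string& jobId, Monitor& monitor)
    : monitor_(monitor), domain_("Job:" + jobId)
  {
    monitor_.subscribe(domain_,
      [this](const std::string&, const std::string& status) {
        buffered_.push_back(status);
      });
  }

  ~JobMonitor() { monitor_.unsubscribe(domain_); }

  // The callback captures `this`, so copying would leave a dangling pointer.
  JobMonitor(const JobMonitor&) = delete;
  JobMonitor& operator=(const JobMonitor&) = delete;

  // Empty string until the first status arrives.
  std::string latestStatus() const
  {
    return buffered_.empty() ? std::string() : buffered_.back();
  }

  const std::vector<std::string>& bufferedStatus() const { return buffered_; }

private:
  Monitor& monitor_;
  std::string domain_;
  std::vector<std::string> buffered_;
};
```

In this sketch the JobMonitor unsubscribes in its destructor, so a Monitor can outlive any number of short-lived job monitors without accumulating stale callbacks.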
Pub/Sub Message Layout
The message layout will be required to be a multipart message, as ZMQ only supports prefix filtering. That means the first message part will be the key we filter on.
The easiest method will be to make the first message part itself a key-value pair where the key component is one of the following:
- Job
- Worker
- Error
And the value component depends on the key:
- Job value would be the UUID of the job
- Worker value would be the socket id of the worker in md5 form
- Error value would be the component that caused the error; initially the placeholder 'server' can be used
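A small sketch of how such a key could be built and how prefix filtering would treat it. `makeTopic` and `matchesSubscription` are hypothetical helpers, and the `:` separator is an assumption; the proposal only says "key value pair". The matching function imitates the prefix rule a SUB socket applies to the first message part and does not call ZeroMQ.

```cpp
#include <string>

// Builds the first message part, e.g. "Job:<uuid>".
// The ':' separator is an assumption, not part of the proposal.
inline std::string makeTopic(const std::string& key, const std::string& value)
{
  return key + ":" + value;
}

// Imitates SUB-side prefix filtering: a message is delivered when its first
// part starts with the subscription bytes. An empty subscription matches all.
inline bool matchesSubscription(const std::string& firstPart,
                                const std::string& subscription)
{
  return firstPart.compare(0, subscription.size(), subscription) == 0;
}
```

With this layout, subscribing to `Job:` would receive every job, while subscribing to `Job:` followed by a full UUID would receive a single job. Including the separator in the subscription keeps a key from accidentally matching a longer key that shares its prefix.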