=================================
API for receiving time-based data
=================================

Background
==========

Today, Gondul has three different "collectors": the ping collector, the
SNMP collector and the DHCP log tailer.

They all write data directly to the postgres backend.

Over the years we've tried different methods of storing time series data
for actual graphs. To support this, we've stored some data in two places;
most recently in both postgres and influxdb.

In addition to storing this data in different locations, we sometimes need
to "massage" the data, or change the database schema. A prime example for
SNMP is establishing a tree structure for port data by picking up ifTable
and ifXTable and building a "ports" tree keyed on ifIndex. Another example
is normalization of MAC addresses.

We also need to do something with Virtual Chassis MIBs.

While we've been able to do all this, the fact that these collectors all
write directly to the postgres database creates a strong cross-dependency
between the collectors, the database schema and the API. It has also
created a strong dependency on time series database tools.

This has made it difficult to safely experiment with enriching input data
without introducing critical bugs in collectors or breaking the north-bound
API.

This document outlines a way to reduce this problem.

Concept
=======

The concept is to create a generic "time-based" API for all time-oriented
data. The API will cover high-frequency producers like the ping collector,
but also low-frequency producers like the operations log or the DHCP
tailer.

While the API will be generic, it will provide just enough data to allow
the receiver to identify the type of data, apply enrichments to it and thus
treat it differently. By default, the data posted will just be written to
postgres, but through enrichment add-ons we can also choose to split a
single SNMP poll into multiple entries (e.g. individual entries for virtual
chassis nodes), or re-arrange the data to produce an interface mapping.

The enrichment will also be able to do basically anything with the data,
including sending it to multiple other APIs - e.g. influx.

While the first version does not deal with authentication, future versions
should.

Core API
========

The core API accepts one or more metrics in a single post.

The core of the API will accept three fields:

- "src" - a text string identifying the source of the data, e.g. "dhcp",
  "ping" or "snmp". This should be sent either as a JSON text field, or as
  part of the URL, e.g. by allowing posts to
  https://user:pass@gondul/api/write/gtimes/dhcp . The benefit of linking
  this with the URL is that it will simplify authentication in the future,
  allowing "write-only" accounts.
- "metadata" - a generic JSON object containing fields that will be indexed
  upon or used by enrichment. Example: { "server": "dhcpserver1", "time":
  "2019-01-05T15:00:10Z" }.
- "data" - an array of JSON objects. Each object in the array must either
  have a "time" field, or the "metadata" field must have one.
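As a rough illustration of how a producer might use this, the sketch below
posts a single DHCP entry with Python and the requests library. This is a
minimal sketch only: the endpoint path and the basic-auth credentials are
the illustrative ones from the URL example above, not a defined route or
authentication scheme::

    import requests

    # Hypothetical endpoint: here "dhcp" identifies the source via the
    # URL, so no "src" field is needed in the body.
    URL = "https://gondul/api/write/gtimes/dhcp"

    payload = {
        "metadata": {"server": "dhcpserver1"},
        "data": [
            {
                "type": "assignment",
                "time": "2019-01-05T15:00:10Z",
                "ip": "2001:db8::1",
                "circuit": "vlan123:e3-1:ge-0/0/1",
                "msg": "blatti foo",
            }
        ],
    }

    # Authentication is deliberately undefined in the first version;
    # basic auth is used here only as a placeholder.
    response = requests.post(URL, json=payload, auth=("user", "pass"))
    response.raise_for_status()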
Examples
========

Example 1, dhcp::

    {
      "src": "dhcp",
      "metadata": {
        "server": "dhcpserver1"
      },
      "data": [
        {
          "type": "assignment",
          "time": "2001-01-01T15:12:01Z",
          "ip": "2001:db8::1",
          "circuit": "vlan123:e3-1:ge-0/0/1",
          "msg": "blatti foo"
        },
        {
          "type": "renew",
          "time": "2001-01-01T15:32:01Z",
          "ip": "2001:db8::1",
          "circuit": "vlan123:e3-1:ge-0/0/1",
          "msg": "blatti foo something"
        }
      ]
    }

Example 2, ping::

    {
      "src": "ping",
      "metadata": {
        "time": "2019-05-01T15:01:12Z"
      },
      "data": [
        { "s": "e1-3", "l": 0.91211 },
        { "s": "e1-2", "l": 0.12211 },
        { "s": "e1-1", "l": 0.12311 },
        { "s": "e3-1", "l": 1.12111 },
        { "s": "e3-2", "l": null },
        { "s": "e3-3", "l": 0.91211 },
        { "s": "e3-4", "l": 0.91211 }
      ]
    }

Example 3, oplog::

    {
      "src": "oplog",
      "data": [
        {
          "system": "floor",
          "user": "kristian",
          "message": "lol",
          "time": "2019-04-19T15:00:10Z"
        }
      ]
    }

Note that "metadata" is optional.

Implementation plan
===================

The plan is to start small. The first candidate is the DHCP log tailer,
which needs to support IPv6 and thus needs changes anyway.

The first implementation would be a "hard-coded" Perl API, since that is
what we already have. There is no plan to migrate other producers to the
new API at this time.

Since the ping collector in particular can produce quite a lot of data,
some care might be needed to support it. This will most likely require a
different approach than the old CGI-based Perl way of doing things.

To allow a flexible enrichment scheme, it might be necessary to implement a
separate service in a more modern language. There are currently three
worthy alternatives.

Node.js has the benefit of using JavaScript, which is already heavily used
in Gondul, and is fairly fault-tolerant. There are also already plans to
utilize Node.js to do server-side parsing of health data. However, I'm
unsure if it offers the speed or integration we need.

Python is another alternative which is also already in use. It is slightly
more mature than Node.js, but doesn't really offer much else.

The third alternative is Go, which will certainly provide the speed we
need, but might not allow the development pace we require during an event.

No conclusion is offered and, at any rate, no plans to actually implement
such a service exist, nor should one be made until we have more experience
from the DHCP collector implementation.

Storage
=======

Storage is deliberately left OUT of the API definition, but for
implementation purposes we should assume postgres as the primary target,
with influx as a secondary target. Details of how this is done are
intentionally left out of this document, as they should not be relevant to
any user of the API.
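Although storage details are out of scope, a minimal sketch may help make
the enrichment idea from the Concept section concrete: per-source hooks
that massage incoming entries and fan them out to one or more backends.
Every name below is hypothetical; as noted in the implementation plan, no
such service exists yet, and the language is undecided::

    def store_postgres(src, metadata, data):
        # Primary target per the Storage section; a real implementation
        # would write the entries to the postgres backend here.
        ...

    def store_influx(src, metadata, data):
        # Secondary target; forwarding to influx is one example of
        # "sending it to multiple other APIs" from the Concept section.
        ...

    def normalize_mac(mac):
        # MAC normalization, one of the enrichment examples from the
        # Background section: "AA-BB-CC-DD-EE-FF" -> "aa:bb:cc:dd:ee:ff".
        return mac.lower().replace("-", ":")

    def enrich_dhcp(src, metadata, data):
        # Hypothetical per-source hook: massage entries, then store.
        for entry in data:
            if "mac" in entry:
                entry["mac"] = normalize_mac(entry["mac"])
        store_postgres(src, metadata, data)
        store_influx(src, metadata, data)

    ENRICHERS = {"dhcp": enrich_dhcp}

    def receive(src, metadata, data):
        # Sources without a registered enricher fall through to plain
        # postgres storage.
        ENRICHERS.get(src, store_postgres)(src, metadata, data)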