=================================
API for receiving time-based data
=================================

Background
==========

Today, Gondul has three different "collectors": the ping collector, the
snmp collector and the dhcp log tailer. They all write data directly to
the postgres backend.

Over the years we've tried different methods of storing time series data
for actual graphs. To support this, we've stored some data in two places.
Most recently, we've stored data in both postgres and influxdb.

In addition to storing this data in different locations, we sometimes
need to "massage" data, or change the database schema. A prime example
for SNMP is establishing a tree structure for port data by picking up
ifTable and ifXTable and building a "ports" tree using ifIndex. Another
example is normalization of MAC addresses. We also need to do something
with Virtual Chassis MIBs.

While we've been able to do all this, the fact that these collectors all
write directly to the postgres database creates a strong cross-dependency
between the collectors, the database schema and the API. It has also
created a strong dependency on time series database tools.

This has made it difficult to safely experiment with enriching input data
without introducing critical bugs in the collectors or breaking the
north-bound API.

This document outlines a way to reduce this problem.

Concept
=======

The concept is to create a generic "time-based" API for all time-oriented
data. The API will cover high-frequency producers like the ping
collector, but also low-frequency producers like the operations log or
the DHCP tailer.

While the API will be generic, it will provide just enough data to allow
the receiver to identify the type of data, apply enrichments to it and
thus treat it differently. By default, the data posted will simply be
written to postgres, but through enrichment add-ons we can also choose to
split a single SNMP poll into multiple entries (e.g. individual entries
for virtual chassis nodes), or re-arrange the data to produce an
interface mapping.

The enrichment step will also be able to do basically anything with the
data, including sending it to multiple other APIs, e.g. influx.

While the first version does not deal with authentication, future
versions should.
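
To make the enrichment idea concrete, here is a minimal sketch of what
such a pipeline could look like, written in Python (one of the candidate
languages discussed below). Nothing here is part of the API definition:
the function names, the dispatch-on-source scheme and the "member" field
are assumptions made purely for illustration::

    # Hypothetical enrichment pipeline; all names here are illustrative
    # assumptions, not part of the API definition.
    from typing import Callable, Dict, List

    # An enrichment takes one posted payload and returns one or more
    # payloads, possibly rewritten, split, or destined for other backends.
    Enrichment = Callable[[dict], List[dict]]

    ENRICHMENTS: Dict[str, List[Enrichment]] = {}

    def enrich(source: str):
        """Register an enrichment for a given source (e.g. "snmp")."""
        def wrap(fn: Enrichment) -> Enrichment:
            ENRICHMENTS.setdefault(source, []).append(fn)
            return fn
        return wrap

    @enrich("snmp")
    def split_virtual_chassis(payload: dict) -> List[dict]:
        # Illustration only: split one SNMP poll into one payload per
        # virtual chassis member, keyed on a hypothetical "member" field.
        members: Dict[str, list] = {}
        for entry in payload.get("data", []):
            members.setdefault(str(entry.get("member", 0)), []).append(entry)
        return [
            {**payload,
             "metadata": {**payload.get("metadata", {}), "member": member},
             "data": entries}
            for member, entries in members.items()
        ]

    def process(payload: dict) -> List[dict]:
        """Run all enrichments for this source; default is pass-through."""
        results = [payload]
        for fn in ENRICHMENTS.get(payload.get("src", ""), []):
            results = [out for p in results for out in fn(p)]
        return results  # each result is then written to postgres/influx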

Core API
========

The core API accepts one or more metrics in a single POST. The core of
the API will accept three fields:

- `src` - a text string identifying the source of the data, e.g. "dhcp",
  "ping" or "snmp". This should be sent either as a JSON text field, or
  as part of the URL, e.g. by allowing posting to
  ``https://user:pass@gondul/api/write/gtimes/dhcp``. The benefit of
  encoding the source in the URL is that it will simplify authentication
  in the future, allowing "write-only" accounts.
- `metadata` - a generic JSON object containing a number of fields that
  will be indexed or used by enrichment. Example: ``{ "server":
  "dhcpserver1", "time": "2019-01-05T15:00:10Z" }``.
- `data` - an array of JSON objects. Each object in the array must either
  have a "time" field, or the "metadata" field must have one.

Examples
========

Example 1, dhcp::

    {
        "src": "dhcp",
        "metadata": {
            "server": "dhcpserver1"
        },
        "data": [
            {
                "type": "assignment",
                "time": "2001-01-01T15:12:01Z",
                "ip": "2001:db8::1",
                "circuit": "vlan123:e3-1:ge-0/0/1",
                "msg": "blatti foo"
            },
            {
                "type": "renew",
                "time": "2001-01-01T15:32:01Z",
                "ip": "2001:db8::1",
                "circuit": "vlan123:e3-1:ge-0/0/1",
                "msg": "blatti foo something"
            }
        ]
    }

Example 2, ping::

    {
        "src": "ping",
        "metadata": {
            "time": "2019-05-01T15:01:12Z"
        },
        "data": [
            { "s": "e1-3", "l": 0.91211 },
            { "s": "e1-2", "l": 0.12211 },
            { "s": "e1-1", "l": 0.12311 },
            { "s": "e3-1", "l": 1.12111 },
            { "s": "e3-2", "l": null },
            { "s": "e3-3", "l": 0.91211 },
            { "s": "e3-4", "l": 0.91211 }
        ]
    }

Example 3, oplog::

    {
        "src": "oplog",
        "data": [
            {
                "system": "floor",
                "user": "kristian",
                "message": "lol",
                "time": "2019-04-19T15:00:10Z"
            }
        ]
    }

Note that "metadata" is optional.
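
To pin down the rule about where the time field may live, here is a
minimal validation sketch in Python. It is illustrative only: the first
real implementation will be Perl (see the plan below), and the function
name and error-handling style are assumptions::

    # Sketch of validating a posted payload against the rules above.
    # Field names follow this document; everything else is an assumption.
    def validate(payload: dict) -> None:
        """Raise ValueError if the payload violates the core API contract."""
        if not isinstance(payload.get("src"), str):
            raise ValueError('"src" must be a text string (or be in the URL)')

        metadata = payload.get("metadata", {})  # "metadata" is optional
        if not isinstance(metadata, dict):
            raise ValueError('"metadata" must be a JSON object')

        data = payload.get("data")
        if not isinstance(data, list) or not data:
            raise ValueError('"data" must be a non-empty array')

        has_global_time = "time" in metadata
        for i, entry in enumerate(data):
            if not isinstance(entry, dict):
                raise ValueError('"data" entries must be JSON objects')
            # Each entry needs its own "time" unless "metadata" has one.
            if not has_global_time and "time" not in entry:
                raise ValueError(f'entry {i} has no "time" and "metadata" has none')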

Implementation plan
===================

The plan is to start small. The first candidate is the DHCP log tailer,
which needs to support IPv6 and thus needs a change anyway.

The first implementation would be a "hard-coded" Perl API, since that is
what we already have. There is no plan to migrate other producers to the
new API at this time.

The first implementation would not offer much in the way of generic
storage for users other than the dhcp collector.

Since the ping collector in particular can produce quite a lot of data,
some care might be needed to support it. This will most likely require a
different approach than the old CGI-based Perl way of doing things.

To allow a flexible enrichment scheme, it might be necessary to implement
a separate service in a more modern language. There are currently three
worthy alternatives:

Node.js has the benefit of using JavaScript, which is already heavily
used in Gondul, and is fairly fault-tolerant. There are also already
plans to use node.js for server-side parsing of health data. However, I'm
unsure if it offers the speed or integration we need.

Python is another alternative, which is also already in use. It is
slightly more mature than Node.js, but doesn't really offer much else.

The third alternative is Go, which will certainly provide the speed we
need, but might not allow the development pace we require during an
event.

No conclusion is offered. At any rate, no plans to actually implement
such a service exist, nor should one be made until we have more
experience from the DHCP collector implementation.

Storage
=======

Storage is deliberately left OUT of the API definition, but for
implementation purposes we should assume postgres as the primary target,
with influx as a secondary target. Details of how this is done are
intentionally left out of this document, as they should not be relevant
to any user of the API.