During the RIPE 70 Meeting in Amsterdam this week (on 13 May around 10:00 UTC), we experienced a network outage at AMS-IX. Let's see how this was monitored by various tools.
Note: this is a repost of an article that was originally published on RIPE Labs.
The outage was clearly visible in the AMS-IX traffic statistics (see Figure 1 below).
This was a good opportunity to see how the event showed up in the various tools the RIPE NCC makes available.
The RIPE Atlas seismograph tool visualises multiple ping measurements. In the three figures below you can see ping measurements towards three different RIPE Atlas anchors, all located in Amsterdam: one in the RIPE NCC network, one at Surfnet and one at Afilias.
You can clearly see the period during which very few ping measurements from the RIPE Atlas probes were successful. Some of the probes could still reach the anchors because they retained some form of network connectivity. Their paths probably did not go through AMS-IX, but this is mostly speculation. The tool clearly shows that there was a problem and that the problem was closer to the RIPE Atlas anchors than to the probes.
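To make the idea concrete, here is a minimal sketch of how such an outage window could be spotted in raw ping results. It is not how the seismograph tool works internally; the input format (tuples of timestamp, probe ID and success flag), the function names, the five-minute bin size and the 50% threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict


def success_ratio_per_bin(results, bin_seconds=300):
    """Group ping results into fixed time bins and return the success ratio per bin.

    `results` is an iterable of (unix_timestamp, probe_id, succeeded) tuples.
    This layout is hypothetical; real RIPE Atlas results are richer.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for timestamp, _probe_id, succeeded in results:
        # Align each result to the start of its bin.
        bin_start = timestamp - (timestamp % bin_seconds)
        totals[bin_start] += 1
        if succeeded:
            successes[bin_start] += 1
    return {b: successes[b] / totals[b] for b in sorted(totals)}


def degraded_bins(ratios, threshold=0.5):
    """Return the bin start times whose success ratio is below `threshold`."""
    return [b for b, ratio in ratios.items() if ratio < threshold]


# Made-up sample: three probes pinging one anchor.
sample = [
    (0, 1, True), (10, 2, True),
    (300, 1, False), (310, 2, False), (320, 3, True),
    (600, 1, True),
]
ratios = success_ratio_per_bin(sample)
outage = degraded_bins(ratios)
```

If many probes in different networks fail towards the same anchor in the same bin, while a few still succeed, that pattern suggests the problem sits near the anchor rather than near any individual probe.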
Figure 6 below shows the same event in DNSMON, measuring towards K-root.
On the Y axis you can see a list of RIPE Atlas anchors from all over the world. During the interruption the anchors were still collecting data, but they couldn't connect to our infrastructure and so couldn't send their results to the RIPE Atlas storage. That causes the white "no data" gaps in the image. When the connection was restored, they started to send all the delayed reports, but not necessarily in order, so the graph contains spots. Once the data transfer is finished, the gaps are expected to disappear. This means that by now these gaps are gone.
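The effect of delayed, out-of-order reports can be illustrated with a small sketch. It assumes each report carries the time at which the measurement was taken and that measurements run at a known interval. The function name, the 240-second interval and the tolerance factor are hypothetical and not taken from RIPE Atlas itself.

```python
def find_gaps(timestamps, expected_interval, tolerance=1.5):
    """Return (start, end) pairs where consecutive measurements are too far apart.

    Timestamps may arrive in any order; they are sorted by measurement time
    first, so arrival order does not matter.
    """
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > expected_interval * tolerance:
            gaps.append((previous, current))
    return gaps


# While delayed reports are still missing, a gap is visible.
partial = [0, 240, 1200, 480, 1440]
gaps_during_backfill = find_gaps(partial, expected_interval=240)

# Once the late reports (720 and 960) have arrived, out of order, the gap closes.
complete = partial + [960, 720]
gaps_after_backfill = find_gaps(complete, expected_interval=240)
```

Sorting by measurement time rather than arrival time is what lets the gaps close once the backlog has been delivered.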
Of course, it would be very nice if we could use RIPE Atlas data to understand such events while they are happening. We are researching algorithms for real-time analysis, based on the RIPE Atlas streaming interface, which would allow this in the future.