CTWUG Status Page

Spent some time and created a CTWUG status page.

It’s available here: https://status.ctwug.za.net/
It’s also available on https://status-inet.ctwug.za.net/ (this one always points to the internet version of the page, in case your wug is down).

All checks on that page are done every 5 minutes. The page is also set to refresh every 5 minutes, so you don’t need to reload it yourself. You will get the odd thing going off once in a while, but if it stays off for an extended period of time it’s worth investigating.

It contains the following sections:

###Summary
Contains a summary of NMS-monitored devices, ports (interfaces) and services. Busy expanding this to include all OSPF routers.
###CTWUG Sites
List of CTWUG sites and their status.
###DNS Server (172.18.1.1)
DNS servers, their status and the result of an SOA lookup for each zone and view. These are servers responding on 172.18.1.1. This also lets one visually compare the SOA serials for a zone across the different servers to spot problems.
###Tunnels
List of all tunnels. Note each tunnel appears twice because we check both ends.
###IRC Servers
Status of IRC servers.
###NTP Servers (172.18.1.1)
Status of time servers responding on 172.18.1.1.
###SSL Certificates
Status of SSL certificates for CTWUG sites. We are using Let’s Encrypt, which is free but has 90 days validity. These renewals are mostly automated, but this bit should pick up if something goes wrong.
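
A check like the SSL certificate one could be sketched roughly as below. This is not the code behind the status page: `days_remaining`, `cert_status` and the 30 and 7 day thresholds are assumptions made up for illustration. The input is the `notAfter` string in the format Python's `ssl` module reports for a peer certificate.

```python
import ssl
from datetime import datetime, timezone

WARN_DAYS = 30  # assumed threshold: automated renewal should have run by now
CRIT_DAYS = 7   # assumed threshold: someone needs to look at it


def days_remaining(not_after: str, now: datetime) -> float:
    """Days until a certificate expires.

    not_after is the 'notAfter' string as found in the dict returned by
    SSLSocket.getpeercert(), e.g. 'Jan  5 09:34:43 2018 GMT'.
    now must be a timezone-aware datetime.
    """
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch
    return (expires - now.timestamp()) / 86400


def cert_status(not_after: str, now: datetime) -> str:
    """Map the remaining validity to a simple status string."""
    left = days_remaining(not_after, now)
    if left <= 0:
        return "expired"
    if left <= CRIT_DAYS:
        return "critical"
    if left <= WARN_DAYS:
        return "warning"
    return "ok"
```

With 90 day certificates, a "warning" here would mean the automated renewal has probably been failing for a while.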


Feel free to post suggestions etc. here. I will upload the code to GitLab at some point so people can submit PRs.

Ah nice. Thanks @spin for all the time and effort. You are doing a great job and are by far the best technical officer I have seen :smiley:


Some to dos:

  • Need to add mail server checks (POP/SMTP etc.)
  • Want to see if I can check the DNS SOA Serials so we can monitor DNS change propagation.
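
For the SOA serial to-do, the comparison step could look something like this. It is only a sketch: it assumes the serials have already been looked up per server (the lookup itself is not shown), that serials only ever increase (no wraparound handling), and `lagging_servers` is a made-up name.

```python
from typing import Dict, Optional


def lagging_servers(serials: Dict[str, Optional[int]]) -> Dict[str, Optional[int]]:
    """Return the servers that do not yet have the newest SOA serial.

    serials maps a server name to the serial it returned for one zone,
    or None if the lookup failed. An empty result means the change has
    propagated everywhere.
    """
    answered = [s for s in serials.values() if s is not None]
    if not answered:
        return dict(serials)  # nobody answered: every server is a problem
    newest = max(answered)  # assumption: the highest serial is the latest change
    return {name: s for name, s in serials.items() if s != newest}
```

Run once per zone and view, anything still in the result after a few polls would be a server that is not picking up changes.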

Feedback ::

Howsit, I had a look myself. Was wondering if there is any way to make this viewable on the forum page itself, on the side.
Maybe only list:

hostname, site, status and time … If I’m correct it could squeeze in on the main forum page on the left. Keep in mind I’m not a web guru, it’s just a thought :slight_smile:

Just a thought I had.


rmx

Thanks. Don’t really want to clutter the forum up. Plus there is way too much information for a simple summary.

To do a forum update I might make a notification bot that posts to a Status Alerts thread automatically when things are down. Think that would be more useful and easier to do.
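
The core of such a bot would be spotting what changed between two polls, so the thread gets one post per outage and not one every 5 minutes. A rough sketch, assuming each poll yields a dict of check name to "up"/"down" (the function name and alert wording are made up, and the actual posting to the forum is left out):

```python
from typing import Dict, List


def status_changes(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Build alert lines for checks whose status changed between two polls.

    Both dicts map a check name to "up" or "down". Checks that were
    already down on the previous poll stay quiet.
    """
    alerts = []
    for name in sorted(current):
        before = previous.get(name, "up")  # unseen checks are assumed to have been up
        now = current[name]
        if before != now:
            word = "DOWN" if now == "down" else "RECOVERED"
            alerts.append(f"{word}: {name}")
    return alerts
```

The previous poll's dict would need to be stored somewhere between runs (a file or a table) for this to work from a cron job.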

I will also add a status link right at the top of the page, on the far right next to Finances.


I can’t wait for our new landing page with a summarized version on the right hand side. :wink:


ok kwlio, that would work also.

great work spin :sunglasses:

Haven’t seen the code, but looking at the page right now, most stuff is green, except for 1 node where 2 sensors are showing red.

Maybe a good idea is to summarize the good stuff and make it expandable in case you want to look at it, and make the red stuff stand out at the top below the summary, so that you don’t need to scroll down.

That way you don’t have to scroll or search if something is down; it can be the first thing you see when you visit the page. Once you then want to add it to the forums or main website in a small widget, it’s shorter to say “all monitored stuff = OK” or show what stuff is not OK.
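
To make the idea concrete, the ordering and the widget text could be as simple as this. A sketch only, assuming each check is a (name, status) pair with status "up" or "down"; the function names are made up:

```python
from typing import List, Tuple

Check = Tuple[str, str]  # (name, status), status is "up" or "down"


def problems_first(checks: List[Check]) -> List[Check]:
    """Order checks so anything down comes first, then alphabetically."""
    return sorted(checks, key=lambda c: (c[1] != "down", c[0]))


def one_line_summary(checks: List[Check]) -> str:
    """Short text suitable for a small widget."""
    down = [name for name, status in checks if status == "down"]
    if not down:
        return f"All {len(checks)} monitored checks OK"
    return f"{len(down)} of {len(checks)} checks down: " + ", ".join(sorted(down))
```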

ps. great stuff so far, like it a lot.


Yeah most of it is green. It’s a simple PHP page with SQL queries to pull the data.

All of it is from NMS so I don’t want to go overboard and redesign an interface. Any summary that I do do will essentially be the summary table at the top. I reproduce it below.

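|          | Total | Up  | Down | Ignored | Disabled |
|----------|-------|-----|------|---------|----------|
| Devices  | 203   | 196 | 7    | 0       | 0        |
| Ports    | 855   | 656 | 106  | 8       | 75       |
| Services | 70    | 70  | 0    | 0       | 0        |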

The Services are all services (DNS, HTTP, SSL, NTP etc.). The tunnels are essentially in the ports count. Might want to show that in a bit more detail in a summary, though the summary would break if no tunnels are available in any case. Devices are mainly OSPF routers (but also the wugpi/servers themselves).

I will probably drop the ignored and disabled columns. That would be a short and sweet network summary. It tells me all services are 100%. Most devices are up. And a bunch of ports are down, though maybe not more than usual.
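
As a rough illustration of that short summary (Total, Up and Down only, plus a percentage), here is a sketch using the numbers from the table in this thread. `SUMMARY` and `short_summary` are made-up names, and the real page gets its data from SQL queries against NMS, not a hard-coded dict.

```python
from typing import Dict, Tuple

# (total, up, down) per row, copied from the summary table in this thread
SUMMARY: Dict[str, Tuple[int, int, int]] = {
    "Devices": (203, 196, 7),
    "Ports": (855, 656, 106),
    "Services": (70, 70, 0),
}


def short_summary(rows: Dict[str, Tuple[int, int, int]]) -> str:
    """Render up/total, the percentage that is up, and the down count."""
    lines = []
    for name, (total, up, down) in rows.items():
        pct = 100.0 * up / total if total else 0.0  # avoid dividing by zero
        lines.append(f"{name}: {up}/{total} up ({pct:.1f}%), {down} down")
    return "\n".join(lines)


print(short_summary(SUMMARY))
```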

I’d also like people to start disabling unused ports, as ports that are not disabled are counted as down if nothing is connected to them. It also lightens the load of the polling, as disabled ports don’t need to be polled.

Maybe just a small addition: add the time when the last poll of the device was done. Or just an overall time of the last completed poll.

But looks good, nice and clear.


Code up at:


great stuff spin thanks for all the effort

nice @spin ,
Job well done

This site will not be working right for a bit. Couple of hours, or maybe till tomorrow.

Updated the site a little:

  • added area summaries
  • added an inet dns entry for it
  • It was also moved to a server at jypels along with nms.

@MDE that’s a good idea I never replied to. Everything on that page is checked every 5 min though. The only checks that are down sometimes are the tunnels, and there I show the poll time (mainly because I’m still tuning things).

The other checks are always active and run on a cron. The only normal case where they would be down for longer than a couple of minutes is when the server running them is down, but this would include the status page itself.
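
If a last-poll time does get shown, a staleness test against the 5 minute interval could be as small as this. It is a sketch under assumptions: `is_stale` and the tolerance of two missed polls are made up, and it presumes the last poll timestamp is available to the page.

```python
from datetime import datetime, timedelta

POLL_INTERVAL = timedelta(minutes=5)  # from the thread: checks run every 5 minutes


def is_stale(last_poll: datetime, now: datetime, missed_polls: int = 2) -> bool:
    """True if more than `missed_polls` intervals have passed since last_poll."""
    return now - last_poll > POLL_INTERVAL * missed_polls
```

This would only help in the case where the cron stops but the web server keeps serving the page.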

Updated the status page adding back services that went missing. I.e. fixing the storm damage essentially.


CTWUG status page has been resurrected. Some things still not 100% but should be a good indication.
