Housekeeping

The CTWUG Committee has noted several problems creeping into our wug, and many of these can be avoided with a little housekeeping. We are talking about things like:
[ul]
[li]OSPF routers not having hostnames[/li]
[li]OSPF routers not having relevant logins[/li]
[li]OSPF routers not having the right WMS scripts etc.[/li]
[/ul]
These issues make it a hassle to manage the CTWUG and mean we can do less of the fun stuff.

We are doing a little housekeeping with Project Spring Clean. Rather than following up on this with silly manual processes, I’ve created a script to check many of these issues and email the WIND node admins to notify them. The first emails should be sent later today, and they will continue to be sent after that. The process is not perfect, as many IPs are fairly difficult to link to a WIND node, but hopefully it will result in that being cleaned up as well, which is another benefit.

I envisage this as a continual maintenance process, but getting it rolling will take a bit of a push. It will automate the nagging and make things much more convenient for everyone.

The script checks OSPF routers for things like:
[ul]
[li]Does it have an IP address/hostname on WIND? [Fix this by adding an IP address on the WIND node.][/li]
[li]Does it have more than one? :slight_smile: [Fix this by deleting duplicate IP address entries on the WIND node. If you want to keep the duplicate name, add it as a CNAME.][/li]
[li]Is the ctwug/ctwug read-only login enabled?[/li]
[li]Is the router being managed properly by WMS?[/li]
[li]Are the scripts loaded?[/li]
[li]What RouterOS version is running? [Try and keep it to 6.0 or later.][/li]
[li]What version of the WMS scripts is running?[/li]
[/ul]
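To make the checklist concrete, here is a minimal Python sketch of how per-router checks like these could be expressed. It is not the actual script: the `router` dictionary and its keys (`hostnames`, `readonly_login_ok`, `seen_by_wms`, `routeros_version`) are assumptions made up for illustration, as are the function names and the exact issue wording.

```python
import re
from typing import Dict, List, Tuple

MIN_ROUTEROS = (6, 0)  # "6.0 or later", as suggested in the list above


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract (major, minor) from a version string such as '6.19' or '6.0rc1'."""
    return tuple(int(n) for n in re.findall(r"\d+", text)[:2])


def check_router(router: Dict) -> List[str]:
    """Return a list of issue descriptions for one router record (hypothetical layout)."""
    issues = []
    hostnames = router.get("hostnames", [])
    if not hostnames:
        issues.append("No hostname on WIND for ip.")
    elif len(hostnames) > 1:
        issues.append("More than one hostname on WIND for ip.")
    if not router.get("readonly_login_ok", False):
        issues.append("ctwug/ctwug login not available or enabled.")
    if not router.get("seen_by_wms", False):
        issues.append("Router never connected to WMS. No scripts?")
    version = router.get("routeros_version")
    if version and parse_version(version) < MIN_ROUTEROS:
        issues.append("RouterOS older than 6.0.")
    return issues
```

A router with no WIND hostname and RouterOS 5.26 would come back with two issues; one that passes every check returns an empty list.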

You may notice logins by the CTWUG read-only user. The script is not changing anything, although with the node’s permission we may enable that later (with a proper login).

Most of these issues can be fixed by installing the WMS scripts, or alternatively by removing OSPF from routers that don’t really need it. If you do remove the RB from OSPF, my script will delete it after two weeks.
Some useful resources on this and WIND issues:

[ul]
[li]http://wiki.ctwug.za.net/WMS[/li]
[li]http://wiki.ctwug.za.net/Node_Update[/li]
[/ul]
The deadline for most of these fixes is 1 Nov 2014. We hope to get everything shipshape by then. If new issues emerge later, you will usually get about 30 days to sort them out before we review them and see what further action is needed.

The plan, once the routers are all shipshape, is to load them into a network monitoring tool and properly monitor traffic etc. on them. Once we are there we can start monitoring links, alerting on crap links etc. as well.
I may also expand the tool to ensure more of our WIND data is up to date, for example by checking active IPs for reverse DNS entries and checking subnet ranges on nodes for consistency. Heck, if this is managed well we could have a very clear idea of all the IP ranges on the wug without looking at our wiki pages.

I hope everyone will see this as a positive development to make sure our 630 active backbone OSPF routers are all in tip-top shape. If those are running smoothly, the CTWUG will surely be running smoothly.

An example of a positive impact this has had already is that I have noticed an issue in the WMS scripts and fixed it. You would have seen v24 rolling out last night.

Also see the homeless routers below. Closed the other thread.


Well done :slight_smile:

Small issue, people. I’ve discovered a lot of zombie IP address entries in WIND. They are from nodes that have been deleted. It seems WIND leaves the ip_address entry in the table when a node gets deleted. Some of the multiple-hostname issues may be a result of this.
The positive side is that we would not have discovered it if we had not done this.

I’ve resolved this so if you don’t receive an email about this again you’re ok.
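For anyone curious what hunting for such zombie entries could look like, here is a rough Python sketch using an in-memory SQLite database. The table and column names (`nodes`, `ip_address`, `node_id`) and the sample rows are assumptions for illustration only; the real WIND schema may well differ.

```python
import sqlite3


def find_zombie_ips(conn):
    """Return addresses whose node_id does not match any existing node."""
    query = """
        SELECT ip.address
        FROM ip_address AS ip
        LEFT JOIN nodes AS n ON n.id = ip.node_id
        WHERE n.id IS NULL
        ORDER BY ip.address
    """
    return [row[0] for row in conn.execute(query)]


# Hypothetical sample data: node 2 has been deleted, but its IP entry remains.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE ip_address (address TEXT, hostname TEXT, node_id INTEGER);
    INSERT INTO nodes VALUES (1, 'alpha');
    INSERT INTO ip_address VALUES ('10.0.0.1', 'rb1', 1);
    INSERT INTO ip_address VALUES ('10.0.0.2', 'rb2', 2);
""")
print(find_zombie_ips(conn))
```

The `LEFT JOIN ... WHERE n.id IS NULL` pattern keeps only the IP rows that have no matching node, which is exactly the leftover case described above.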


Awesome stuff, Spin. We are on the right track from the get-go, and I take my hat off to you for the great work you have done thus far. Thanks :smiley:

Status update:
637 OSPF routers
191 of those routers have at least one flagged issue
There are 399 issues flagged in total.
Top 3 issues are:

[ul]
[li]Router never connected to wms.ctwug.za.net. No scripts?[/li]
[li]ctwug/ctwug login not available or enabled.[/li]
[li]No hostname on WIND for ip.[/li]
[/ul]
Let’s try and improve on these numbers.

The script runs all of these checks in about 5 minutes :slight_smile:
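As an illustration of how status numbers like these could be derived, here is a small Python sketch. It assumes the check results are held in a dictionary mapping each router IP to a list of issue strings; that structure and the function name `summarise` are hypothetical, not taken from the actual script.

```python
from collections import Counter


def summarise(issues_by_router):
    """Summarise check results: router count, flagged routers, total issues, top 3."""
    flagged = {ip: issues for ip, issues in issues_by_router.items() if issues}
    counts = Counter(issue for issues in flagged.values() for issue in issues)
    return {
        "routers": len(issues_by_router),
        "routers_with_issues": len(flagged),
        "total_issues": sum(counts.values()),
        "top_issues": [name for name, _ in counts.most_common(3)],
    }
```

Feeding it the results for every OSPF router would give the four figures reported in a status update in one pass.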

If the script is automated and produces stats on routers with errors, maybe we can add an extra section to the network monitor on the main website showing these numbers?


Status update:
637 OSPF routers
174 of those routers have at least one flagged issue. Good improvement.
There are 345 issues flagged in total. Also an improvement.

Top 3 remains the same.

Hi, can we maybe get an update on which IPs, routers or WIND entries need to be updated? Or will we know via the automated emails? As I understand it, if we don’t get mails notifying us of these issues we are in the clear. Is that correct?

Yes, no news is good news.
The one exception is the IPs that I can’t find a node for: the “homeless”. So if none of your IPs are in that list, you are in the clear. We actually need to allocate those IPs on WIND, in which case those guys will also get emails.

Ok and thanks for clearing things up :smiley:

To be even clearer: the script will only email the admins of a particular node about once a week. If you haven’t received an email for more than a week, you are probably fine.

However, if you have no IPs or IP ranges on your node, or your WIND user details (email) are out of date, then you are definitely not fine :smiley:
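The once-a-week rule can be sketched roughly as follows in Python. This is only an illustration under assumptions: the seven-day interval stands in for "about once a week", and the function name and arguments are hypothetical rather than taken from the real script.

```python
from datetime import datetime, timedelta

EMAIL_INTERVAL = timedelta(days=7)  # assumed value for "about once a week"


def should_email(last_sent, now, has_issues):
    """Decide whether a node's admins should be emailed on this run."""
    if not has_issues:
        return False  # no news is good news
    if last_sent is None:
        return True  # never emailed before
    return now - last_sent >= EMAIL_INTERVAL
```

With this logic, a node that was emailed three days ago is skipped even if it still has issues, and is picked up again once a full week has passed.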

OSPF Routers: 641
Routers with issues: 169
Total issues: 378
Routers not linked to a node: 39

If one of the IPs below is yours, please add an IP address entry to your node on WIND. Please do so ASAP.

I’ve removed a few myself now and added unknown entries into the likely nodes.

I’ve added the likely nodes, but I’m not very certain, especially about MLB. It’s a mess.


lockjaw:
172.18.22.123
172.18.22.125

mlb:
172.18.87.217
172.18.106.105
172.18.106.106
172.18.106.225
172.18.106.227
172.18.111.105
172.18.111.106
172.18.111.107

hyperlink:
172.18.123.198

something near callisto:
172.18.193.126

?
172.18.235.253
172.18.235.254
172.18.240.250
172.18.248.178

more mlb:
172.18.249.161
172.18.249.162
172.18.249.164
172.18.249.167
172.18.249.168
172.18.249.169
172.18.251.142

?
172.18.252.38
172.18.252.218

acidice or close to it:
172.26.16.253
172.26.16.254
172.26.24.254

I think we must really start sticking to the rules that have been set out for the wug and its users…

If wind details are not entered within 48 hours of being connected then your link gets disconnected.

There are so many other problems as well that are quite unnecessary and just give the comms extra work.

OK, after updating the RBs above I ran my script again:
Script took 754 seconds to run. (That’s logging into 641 RBs or so.)
OSPF Routers: 641
Routers with issues: 154
Total issues: 352
Routers not linked to a node: 28

Thanks for your thoughts. You hit the nail on the head. Let’s see what happens 1 Nov :rolleyes:

OSPF Routers: 645
Routers with issues: 146
Total issues: 329
Routers not linked to a node: 28

Numbers not improving fast enough…

Script took 132 seconds to run.
Sent 2 emails in this run.
OSPF Routers: 644
Routers with issues: 120
Total issues: 283
Routers not linked to a node: 27

Progress!

Hi Spin,

I found an issue on an OSPF router between Goose and Europa, specifically on Goose’s end.

When the scheduler checks every 20 minutes whether game time is supposed to run, it tanks the CPU, and this sometimes causes the link to drop and the wireless never to recover. (A few nights ago it caused heavy packet loss.)

Log:

oct/02 22:20:24 wireless,info D4:CA:6D:28:E8:B1@Wacko-Europa: lost connection, medium-access timeout
oct/02 22:20:31 script,info ctwug_gametime: setting gametime ON
oct/02 22:20:31 wireless,info D4:CA:6D:28:E8:B1@Wacko-Europa established connection on 5020000, SSID http://ctwug.za.net/europa1
oct/02 22:20:31 route,ospf,info OSPFv2 neighbor 172.18.44.252: state change from Full to Down

This is an RB711, so it could be that the CPU is just too weak to handle it, but for now I have disabled gaming time and the scheduler on that routerboard so that I can check stability and run some more tests.

Will post updates here.


Let me know the cause if you figure it out.
