Blog-2010-03-22
On my MythTV Backend, I find there are a number of error conditions that I want to monitor and be alerted about if they should happen. For example, as of late, I’ve been having issues with one of the drives in my RAID configuration (under load I’m getting errors that I think are a result of an old SATA controller), which causes the RAID to drop into degraded mode and error messages to be logged by the kernel. In a situation like this, I wanted a tool that could monitor my log files and email me if “interesting” things happen.
Now, the first thing I did was search the web for something that would do the job. swatch popped up immediately as one alternative. It’s a nice, simple Perl script which takes a configuration file that defines a log file to monitor, and a series of rules which define what to look for. Unfortunately, it can only monitor one log file at a time (you need to run multiple instances and have multiple configuration files if you want to monitor multiple files), and it has to run continuously in the background. And, quite frankly, the configuration file is a tad byzantine for my taste.
Another common option is logwatch. This application is definitely a lot more flexible, but the configuration is, again, rather complicated. And, at least as far as I can tell, it’s really meant to be run once a day for a given date range, as opposed to operating as a regular, polling application.
And thus ended my search, with the conclusion that it’d really be a lot simpler just to write my own tool. And so pwatch was born. pwatch is a simple Perl script that takes an Apache-style configuration file and processes your log files. Each matching event triggers an action, and then the event is recorded in an SQLite database. Run pwatch again and it’ll skip any events it’s seen before and only report new ones. The result is that you can just fire off pwatch in a cronjob on a regular basis (I run it every five minutes), and it can alert you if something interesting has happened.
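To give a sense of the approach, here’s a rough shell sketch of the same idea — this is not the actual pwatch script, and the log path, pattern, and address are all made up: grep a log for interesting lines, remember each match in an SQLite table, and mail only the matches that haven’t been seen before.

```sh
#!/bin/sh
# Rough sketch of the pwatch idea, not the real script: scan a log for a
# pattern, record each match in SQLite, and mail only previously unseen hits.
LOG=/var/log/kern.log                 # hypothetical log to watch
PATTERN='md.*(error|degraded)'        # hypothetical "interesting" pattern
DB="$HOME/.logwatch-seen.db"
MAILTO=me@example.com                 # hypothetical address

sqlite3 "$DB" 'CREATE TABLE IF NOT EXISTS seen (id TEXT PRIMARY KEY);'

grep -E "$PATTERN" "$LOG" | while read -r line; do
    id=$(printf '%s' "$line" | sha1sum | cut -d' ' -f1)
    # The INSERT fails on a duplicate key, so only new events reach the mail step.
    if sqlite3 "$DB" "INSERT INTO seen VALUES ('$id');" 2>/dev/null; then
        printf '%s\n' "$line" | mail -s 'log watcher: new event' "$MAILTO"
    fi
done
```

Drop something like that into a crontab entry every five minutes and you have the basic workflow; pwatch is essentially that loop wrapped up with a proper configuration file and event table.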
Now, pwatch is pretty basic at this point, and I probably won’t add much more to it unless people ask for it (or unless I need it). For example, at this point, the only action it knows how to take on an event is to send out an email. But adding new features should be trivial enough, so if anyone has any ideas, let me know. And if you find pwatch useful, send me an email!
Using IPv6 to mitigate SSH attacks
So, one of the ongoing issues that anyone with a public-facing server has to deal with is a barrage of SSH login attempts. Now, normally this isn’t a problem, as a decent sysadmin will use fairly strong passwords (or disable password-based logins entirely), disable root logins, and so forth. But it’s certainly an irritant, and so it’s worth implementing something to mitigate the issue.
Now, traditionally, there are a few general approaches people take:
- Use iptables or something similar to throttle inbound SSH connection attempts (sketched in the example after this list).
- Coupled with the previous, implement tarpitting (deliberately slowing down SSH responses so that the attacker wastes time and resources on each attempt).
- Implement something like fail2ban to automatically detect attacks and dynamically add the offending addresses to a set of block rules (managed with something like iptables).
- Move SSH to a non-standard port.
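To make the first option concrete, here’s the classic iptables “recent” recipe — an illustrative sketch only, so tune the port, window, and hit count to taste: a source that opens too many new SSH connections in a short window simply gets dropped.

```sh
# Track new connections to port 22 per source address; a source that has
# opened 4 or more in the last 60 seconds gets dropped.
iptables -A INPUT -p tcp --dport 22 -m state --state NEW \
    -m recent --name SSH --update --seconds 60 --hitcount 4 -j DROP
iptables -A INPUT -p tcp --dport 22 -m state --state NEW \
    -m recent --name SSH --set
```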
All of these work reasonably well, and particularly for the lazy, something like fail2ban on Ubuntu is dead easy to deploy and works quite nicely. Of course, there’s always the chance that you lock yourself out if you fail a few login attempts yourself, so it’s not without its risks.
But I recently discovered a fifth option which, at least at this stage of IPv6 growth, works incredibly well: disable inbound SSH over IPv4. See, most attackers aren’t v6 connected. Meanwhile, acquiring v6 connectivity remotely is usually just a matter of running a Teredo tunneling client. The result is perfectly workable remote accessibility, while the number of SSH attacks is cut down to essentially zero.
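The mechanics are about as simple as it gets; with OpenSSH, one way to do it is to tell sshd to listen on IPv6 only (blocking port 22 on the v4 side with iptables would work just as well):

```
# /etc/ssh/sshd_config — answer SSH over IPv6 only
AddressFamily inet6
```

Restart sshd and, as far as the IPv4 internet is concerned, port 22 simply stops answering.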
Of course, this won’t last forever. In the future, v6 is likely to get deployed more widely, and I suspect I’ll start seeing v6-based SSH attacks. But until then, this solution is dead simple to deploy and works great!
Update:
And naturally, just a day after I finish writing this, I decided to fiddle around with NX for remotely accessing this server, and lo and behold, NX doesn’t support IPv6. :) So, I’m back to using fail2ban, until NX can get their act together (though, to be fair, latency over my v6 tunnel has an unfortunate negative impact on NX performance, and so I’m not sure I’d use v6 even if I could).
Mysterious Uptime
Or: Why RAID isn’t foolproof.
First, a little bit of background. At home, I have a MythTV installation. And as part of that installation, I have a MythTV Backend, which is basically a glorified fileserver that sports a couple of video capture cards, the MythTV scheduling and recording software, a MySQL database, and a few other odds and ends (not the least of which is this web server). Now, being a fileserver, one of the jobs that machine fulfills is to provide large amounts of storage, primarily for MythTV recordings. And since I don’t want to lose those recordings, I have my storage set up in a RAID-1 mirror, which basically takes two drives and makes them look like a single drive, while underneath, anything written to the logical drive is actually written out to both physical disks. That way, if something bad happens, I have what amounts to a live backup that I can quickly switch to (in addition to my regular, nightly incremental and weekly checkpoint backups).
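(For the Linux-minded: this is just an md RAID-1 array. Something along these lines sets one up — the device names and mount point here are purely illustrative, not my actual setup.)

```sh
# Illustrative only: build a two-disk RAID-1 mirror and put a filesystem on it.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mkfs.ext3 /dev/md0
mount /dev/md0 /srv/recordings
```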
So I came home on Wednesday night to discover something rather annoying: Some sort of write error had occurred on one of those physical disks, and so the mirror was degraded and deactivated. Now, this has happened in the past (I think it’s related to a buggy DMA implementation on my SATA controller), but usually recovery is pretty easy: remove the bad disk from the mirror, then re-add it, which causes Linux to synchronize the two disks, using the good disk as the primary. But for some reason, this time, it wasn’t so easy.
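For reference, that “remove and re-add” dance normally amounts to a couple of mdadm invocations (device names hypothetical):

```sh
# Kick the failed member out of the array, add it back, and let md
# resync it from the surviving disk.
mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md0 --add /dev/sdb1
cat /proc/mdstat    # watch the resync progress
```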
See, when I ran a command to view the status of the mirror, I found both drives marked as “removed” (ie, taken out of the mirror), and one marked as a “spare”. That itself is kinda weird, as usually it’s one active, and one failed. “Whatever”, I told myself, “I’ll just take the spare out of the mirror, re-add it, and then add the other drive, and voila, that should be it”. But when I attempted to re-add the spare, I got the weirdest error message:
cannot find valid superblock in this array - HELP
I can tell you right now, when your computer is imploring you for help, it’s probably a bad thing. Now, for those not in the know, a superblock is kinda like a special marker on the disk, and in this case, it tells Linux which mirror the disk belongs to, along with a bunch of other metadata. This error indicates that this decidedly important piece of bookkeeping information was, supposedly, absent. That’s bad. Unfortunately, googling around led me nowhere. Even more confusing, when I attempted to mount (ie, attach, connect, etc) one of the halves of the mirror, the OS detected the filesystem, and the contents of the mirror looked to be intact. And running a tool to examine the RAID mirror components returned what looked like perfectly normal data.
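For the curious, the sort of status and examination commands involved here are the standard md ones (device names are placeholders, and I’m describing the general idea rather than my exact invocations):

```sh
mdadm --detail /dev/md0      # the array's view of its members
mdadm --examine /dev/sdb1    # read the md superblock off an individual disk
```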
In the end, I gave up for the day, figuring I would come up with some strategy for moving forward the next day. Eventually, I settled on breaking the mirror up, mounting both drives separately, and then using a tool like rsync to manually back up the primary disk to the secondary… not an ideal solution, as a disk failure means you lose everything since the last snapshot, but it’d do the job, and I wouldn’t have to deal with RAID headaches anymore.
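Concretely, that fallback plan would have been a nightly cron job along these lines (paths are hypothetical):

```sh
# Mirror the primary data disk onto the secondary; --delete keeps the copy
# an exact snapshot rather than an ever-growing pile of stale files.
rsync -a --delete /mnt/primary/ /mnt/secondary/
```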
So this evening, I fire up zaphod (that’s the fileserver name) into single user mode, and as I watch the kernel messages scroll by, I see the RAID mirror… start up perfectly normally. Examining the mirror showed one active disk, and one re-syncing, suggesting that the kernel was rebuilding the RAID successfully. What. The. Heck. And as of this writing, I still have absolutely no idea what on earth went wrong, or how it magically got fixed.
Lucky.
Again with the NetBSD
Well, it’s been a couple days now, and I continue to fiddle around with NetBSD… it’s not going to be displacing Ubuntu any time soon, but it’s definitely an amusing project to play around with.
Most recently, as I was testing out Evolution (my email client) compiled from pkgsrc, I discovered that it started up incredibly slowly. Like, 5 minutes from invocation to a window popping up on my desktop. So, a little Google-fu, and I found myself here. It turns out that one of the things Evolution does a lot is attempt to open shared libraries that don’t exist. Unfortunately, those failures are very expensive, and as of 5.0.2, NBSD’s linker doesn’t cache the failures.
And this is where that blog post comes in. The author of that post wrote up a negative lookup cache and incorporated it into the NBSD dynamic linker. By itself, that’d be interesting, but what’s deeply cool about this is that I was able to grab a patch representing his change, tweak it, apply it to my local copy of the NBSD source, and then build and install a new version of the dynamic linker. Result: startup times went from minutes to seconds. I’d call that a huge win.
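Roughly, the mechanics look like this — the patch filename is made up, this assumes a source tree checked out under /usr/src, and you may need to adjust the -p level and build invocation to suit:

```sh
# Apply the change to the in-tree dynamic linker and rebuild just that piece.
cd /usr/src
patch -p0 < /tmp/rtld-negative-cache.diff   # hypothetical patch file
cd libexec/ld.elf_so
make && make install
```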
What this fundamentally speaks to is just how open and easy it is to fiddle around with the internals of NetBSD. The entire system is designed to make it trivial to alter the base system and rebuild it from scratch, which makes it possible to do the kinds of things I just did. Very cool!
Next up: Attempt to hack nouveau DRI support into the kernel so I can get reasonable video performance.