Mark Loveless, aka Simple Nomad, is a researcher and hacker. He frequently speaks at security conferences around the globe, gets quoted in the press, and has a somewhat odd perspective on security in general.

Server Error

This is what the server should look like, a typical mail server being poked and prodded endlessly.

It doesn’t matter what they are, if you have certain errors on a server console, you reboot. This is about one of those errors.

At this point I don’t even remember the error. I had been periodically checking the console of the web server as the previous night I’d upgraded the RAM. As the main servers themselves are old HPs, I was slightly worried, so I’d hit the KVM switch and check all of them - including the web server. But it was Talon, the mail server, and its console that caught my eye. It had an error on it and I decided to reboot it. I found the console was frozen, so I ssh’d in and did a shutdown. It was a full halt, because I had the old RAM from the web server upgrade and decided to go ahead and put that RAM in Talon - a quick upgrade.

The RAM went in, the server came up and recognized the RAM just fine, but it failed to boot up. It hung right after this message:

/dev/sda5: clean, xxxxxxx/xxxxxxxx files, xxxxxxx/xxxxxxxx blocks

I thought either a graphics display or even the next partition - the swap - had an issue. Not the first time I’d seen this type of error. I’d boot off of a USB stick and keep investigating.

Things Get Weird

The USB stick was DOA. That was odd, I mean I couldn’t even use it. I keep a stick with the current Ubuntu on it handy so I can boot up a server or laptop and work on it. Not this time, it was dead. It should have been a sign since this was the second strange failure.

Bear in mind that this HP ProLiant ML10 v2 is slow to boot up. It’s a server with a sophisticated BIOS that is practically its own operating system and has all kinds of self checks, so a “quick” reboot is anything but. From error-free restart to login prompt is easily a full three and a half minutes, so a few reboots and googling error messages in between and in the blink of an eye 40 minutes have gone by.

In retrospect I really hadn’t done much, but about half the time I could get the grub screen to allow me into advanced options, and I could boot a kernel into recovery mode and tweak something (run update-grub, configure to boot without swap, etc). But finally, after two hours I’d hit a pattern - either I’d get dropped into busybox or it would not even hit grub and freeze. The errors seemed to point to the volume group, with inconsistent errors when I got them at all, so I began to think that the problem was a block on the hard drive, the one holding the volume group name, that was only intermittently working.

During the rathole that was the googling of errors looking for clues and endless reboots, it occurred to me I might need to simply wipe, reload, and restore Talon from backup. I went to check the backup hardware (2TB drive attached to another server via USB) and the thing was dead. It wasn’t the external drive enclosure; I quickly tested the drive in another enclosure, and the drive itself was dead. They come in threes, and this was the third.

Shit, this just got pretty dark.

Resolution

I found a working USB stick and got a copy of Ubuntu onto it, and booted up Talon. I tried renaming the volume group on one of the boots that worked, changing all of the references in fstab and grub as best I could. This didn’t seem to really help, as the problem was still there with a new volume group name. Finally on one of the USB boots I ran fsck, which fixed some things. Moment of truth, I rebooted. I got a new error:

Volume group "talon-vg" not found
Cannot process volume group talon-vg

While this might seem bleak, I was elated. This was the old volume group name now reappearing. I guess fsck fixed something and it had to do with this old volume group name and the block on the hard drive it was written to. I rebooted onto the USB, renamed the volume group back to its original name, updated fstab, and could now easily boot into recovery mode. A quick update-grub and a final reboot, and the problem was fixed. Finally.
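
I didn’t write down the exact fstab edits I made after each rename, so what follows is only a rough Python sketch of what that substitution involves, not the commands actually run on Talon. It assumes fstab refers to logical volumes by path, either as /dev/<vg>/<lv> or as /dev/mapper/<vg>-<lv>, where device-mapper doubles any hyphen inside a name (so talon-vg shows up as talon--vg). The function names, the sample fstab lines, and the temporary name mail-vg are all hypothetical.

```python
import re


def mapper_name(vg: str, lv: str) -> str:
    """Build a device-mapper style name: hyphens inside the VG and LV
    names are doubled, then the two parts are joined by a single hyphen."""
    return f"{vg.replace('-', '--')}-{lv.replace('-', '--')}"


def rename_vg_in_fstab(text: str, old_vg: str, new_vg: str) -> str:
    """Return fstab text with path references to old_vg pointed at new_vg.

    Works on a string only; it never reads or writes /etc/fstab.
    Comment lines are left untouched.
    """
    old_esc = old_vg.replace("-", "--")
    new_esc = new_vg.replace("-", "--")

    # /dev/mapper/<vg>-<lv>: the separator is a single hyphen, so refuse to
    # match when another hyphen follows (that would be a longer VG name).
    mapper_pat = re.compile(r"/dev/mapper/" + re.escape(old_esc) + r"-(?!-)")
    # /dev/<vg>/<lv>: the plain symlink form.
    plain_pat = re.compile(r"/dev/" + re.escape(old_vg) + r"/")

    out = []
    for line in text.splitlines():
        if not line.lstrip().startswith("#"):
            line = mapper_pat.sub(lambda m: f"/dev/mapper/{new_esc}-", line)
            line = plain_pat.sub(lambda m: f"/dev/{new_vg}/", line)
        out.append(line)

    trailing = "\n" if text.endswith("\n") else ""
    return "\n".join(out) + trailing


# Hypothetical fstab contents, for illustration only.
sample = (
    "/dev/mapper/talon--vg-root / ext4 errors=remount-ro 0 1\n"
    "/dev/talon-vg/swap_1 none swap sw 0 0\n"
    "# /dev/talon-vg/old\n"
)
result = rename_vg_in_fstab(sample, "talon-vg", "mail-vg")
print(result)
```

Entries that mount by UUID or label would not need touching at all, which is one reason a rename is less painful on systems set up that way. The grub configuration is a separate matter, and regenerating it with update-grub, as I did above, is safer than editing it by hand.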

During all of these reboots, I had replaced the bad backup drive with another drive I had - it’s only a terabyte but it will do for now. First task after Talon came up? Started a backup.

Conclusion

From the first error at about 3:30pm until I started the backup, about 5 hours passed. I cannot express how happy I am that it worked out. The mail server is the main reason I have servers and static IP addresses.

On the list of things to do is a wipe and reload of Talon, and not using ext4 just because I’ve been using it for ages. Maybe I’ll do the reload on completely new hardware. We’ll see.

I’ll cover Talon’s history and configuration in another blog post, but let’s just say I could lose possibly any other computer in the house and it wouldn’t be as bad as losing this one. So glad to have this precious baby back up and running.
