nick.org - down!
After almost four years of no unscheduled downtime, nick.org came to a screeching halt Sunday. I noticed an email from my brother, Kevin, indicating that I had farked something up. “That’s weird,” I thought, “it was working fine last night.” I went to the website and found it very unavailable, showing nothing but a strange Apache directory listing. I tried to ssh in and found that unavailable too. Uh oh. I happened to be talking to my parents via Skype at the time (it was Mother’s Day), juggling Jerry on my knee.
"I’ve been hacked” - was my first instinct. I started to look at other sites that I’m hosting and found other strange Apache error messages. Oh no. I continued to talk to my parents, figuring that I couldn’t do much about it now
- but I let them know what was going on. My dad is very committed to his online journal and I didn’t want to disappoint his adoring fans for long.
I continued to poke around and found some log entries (sent via remote syslog) that indicated a hardware failure of some sort. A double device failure? Oh no: data loss! I grew more uncomfortable and decided I needed to tend to this now. Eriko wanted to go downtown for some shopping, and it took us a few minutes to get ourselves and Jerry ready.
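(Remote syslog, incidentally, is what let me see any of this: the server streams its log messages to a second machine as they happen, so the evidence survives even when the sender keels over. For anyone curious, classic syslogd forwarding is a one-line config entry; the hostname below is just a placeholder.)

```
# /etc/syslog.conf on the web server (classic syslogd syntax):
# send a copy of every facility/priority to a remote log host
# (UDP port 514 by default). "loghost" here is a stand-in name.
*.*    @loghost
```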
At the data center, I was confronted with a RAID controller that had gone out to lunch. I power-cycled the machine and it believed all the disks were still part of the RAID5 group, but once it started quotascan, the controller locked up again. My experience at Isilon has given me quite a bit of insight into how drives and controllers fail, so I decided to see if I could identify which drive (if it was a drive at all) was causing the problem. I rebooted again and sat and watched the drive lights. Blink, blink, blink, stop. Drive 5. Stuck.
I pulled out drive 5 and rebooted. Success! I immediately rsynced a backup copy of the data. The sites I’m hosting are mostly read-only, aside from uploaded pictures, blog entries, and the web servers’ own log files, but I hadn’t yet set up an automated backup mechanism on this new server.
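(The automated mechanism I should have had is nothing fancy: a nightly cron job wrapping the same rsync I ran by hand. Something along these lines would do it; the paths and hostname are hypothetical, not my actual setup.)

```
# crontab entry: mirror the web data to a second machine nightly at 3:30 AM.
# --archive preserves permissions and timestamps, --delete keeps the mirror
# exact, and -e ssh runs the transfer over ssh. Paths are placeholders.
30 3 * * * rsync --archive --delete -e ssh /var/www/ backuphost:/backups/www/
```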
I left the data center with the server in a WOR (window of risk) and headed home, glad that I had at least restored service for the time being.
On Monday, I returned to the data center with another drive. To my dismay, the RAID controller believed the new drive was 250.9 GB instead of 251 GB. Argh! Reallocations! My only option was to move the data to the RAID1 set (which is also the boot disk) so that it isn’t at risk. I’ll rsync to the RAID5 set as well as to a remote location, which will provide some buffer until I can obtain another disk.
Here’s a shameless plug for Isilon, but boy would I have loved to have an Isilon cluster. I could have lost a head without making the data unavailable, and I would have been able to reprotect the data (at the expense of free space, of which I have plenty) without having to immediately replace a drive. This suits me, the busy (part-time) storage administrator, to a T. The experience left me even more convinced that our product is light-years ahead of anything else.