I work in a large IT shop organized into teams, so we have no solo system administrators. The company is rather large, and it has some very sensitive systems that no one ever wants to see down. It's our job to prevent downtime, and we do our best to be prepared to recover from it when it happens. Even with the best planning, things happen and something may fail. I remember a massive failure that occurred one day after hours. My boss (and eventually his boss) came on site to do what they could to help out. My boss worked on keeping people out of the area so we could work directly with vendor support and not be interrupted by people asking for the status. Even though the stakes were high, we were told to take our time, double check our work, and not rush. We eventually recovered, and it wasn't a traumatic experience.
I also suffered a catastrophic failure at home. The Internet died. There have been kitchen fires that were less dramatic. “What happened? What's going on? When is it going to be back?” I had decided to save electricity at home by turning off two 1U rack-mount systems and going virtual. ESXi is a free hypervisor from VMware. That is, it's free once you enter the free license key into the system. Until you enter that key, the system is in a trial phase. In ESXi, trial means that you can use the system until the timer runs out (60 or 90 days; I'm not sure which). After that, you can't do much of anything. Virtual servers that are already running will continue to run, but you can't start any new ones. That means that after a power outage, no servers will start. It turns out that I had created my pfSense firewall as a VM on my ESXi server, suffered some power problem, and the VM would not start. This was easily fixed by actually entering my key, but it showed me some things that I had neglected as a system administrator.
Later, I decided to add some storage to my ESXi server, so I had to announce to my family that the Internet would have to go down for a while. “How long?” was the response from my wife. “Uhhhhh... like 30 minutes?” was my estimate. My younger kids asked if they could complete the level they were on, and my teenager wanted to know if this meant Netflix would be down. I talked my teenager into switching over to regular cable and waited for the kids to finish so I could add the disk drive. I shut down the server, added the drive, and started it up. I fixed the VM boot order, because the firewall wasn't set to start on a reboot for some reason. I tested the Internet connection, then walked around the house letting everyone know we were back.
Why is it so different at work? Why was taking care of the home so much more stressful than at work? It's because at work, I am a system administrator, but at home, I was just being “a computer dude”. Looking back, these are the things I had neglected at home:
Backups – I had backed up my firewall configuration, but it was a long time ago and the backup didn't capture my latest changes. I also forgot that I had backed it up, and only found the backup file by accident after the incident. Regular backups are critical, even at home!
Disaster Recovery – I never thought to just put my wireless router in place to bring up the Internet while I worked on the ESXi server. Having some kind of alternate plan could have saved a lot of aggravation.
Communication – The outage at work went so well, and the one at home was so crazy, because my boss had come in to communicate the status to everyone. He provided a buffer so that we could work on fixing the problem. Some kind of status update is needed, even at home.
Regular Checkups – Beyond relying on monitoring tools, I need to log into each of the systems and do a basic check: verify that everything is working right and that the logs look good.
Planned Downtime – Patches, hardware upgrades, configuration tweaks... all of these things should have been planned out. I'm not sure how I will announce to the family that stuff is going to happen, but some method needs to be worked out so alternate plans can be made while I take the system offline.
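One way to keep the backup lesson from slipping again is a small script that complains when the newest backup is too old. What follows is only a rough sketch in Python. The backup directory, the `*.xml` file pattern, the seven-day threshold, and the function names are all assumptions made up for illustration, not details of the setup described above.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def newest_backup_age_days(backup_dir, pattern="*.xml", now=None):
    """Return the age in days of the newest matching file, or None if none exist.

    backup_dir and pattern are hypothetical: point them at wherever the
    firewall configuration exports are actually kept.
    """
    now = time.time() if now is None else now
    mtimes = [
        p.stat().st_mtime
        for p in Path(backup_dir).glob(pattern)
        if p.is_file()
    ]
    if not mtimes:
        return None  # no backups at all
    return (now - max(mtimes)) / SECONDS_PER_DAY


def backup_is_stale(backup_dir, max_age_days=7, pattern="*.xml", now=None):
    """True when there is no backup, or the newest one is older than max_age_days."""
    age = newest_backup_age_days(backup_dir, pattern, now)
    return age is None or age > max_age_days
```

Something like `backup_is_stale("/some/backup/dir")` could be run from a scheduled job that sends a reminder when it returns `True`. It only checks file age; it does not confirm that a backup can actually be restored, so an occasional test restore is still worth doing.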
So, my family may be my toughest customer, but they are also my best teacher. I just hope my next budget proposal gets approved.