Tuesday, July 8, 2014

My family is my most difficult client

I work in a large IT shop with teams, so we have no solo system administrators. The company is rather large, and it has some very sensitive systems that no one ever wants to see down. It's our job to prevent downtime, and we do our best both to prevent it and to be prepared to recover from it. Even with the best planning, things happen and something may fail. I remember a massive failure that occurred one day, after hours, and my boss (and eventually his boss) came on site to do what they could to help out. My boss worked on keeping people out of the area so we could work directly with vendor support and not be interrupted by people asking for status. Even though the stakes were high, we were told to take our time, double-check our work, and not rush results. We eventually recovered, and it wasn't a traumatic experience.

I also suffered a catastrophic failure at home. The Internet died. There have been kitchen fires that were less dramatic. “What happened? What's going on? When is it going to be back?” I had decided to save electricity at home by turning off two 1U rack-mount systems and going virtual. ESXi is a free hypervisor from VMware. That is, it's free once you enter the free license code into the system. Until you enter that code, the system is in a trial phase. In ESXi, trial means that you can use the system until the timer runs out (60 or 90 days, I'm not sure). After that, you can't do much of anything. Your virtual servers will continue to run, but you can't start any. That means if there is a power outage, no servers will start. It turns out that I had created my pfSense firewall as a VM on my ESXi server, suffered some power problem, and the VM would not start. This was easily fixed by actually entering my key, but it showed me some things that I had neglected as a system administrator.

Later, I decided to add some storage to my ESXi server, so I had to announce to my family that the Internet would have to go down for a while. “How long?” was the response from my wife. “Uhhhhh... like 30 minutes?” was my estimate. My younger kids asked if they could complete the level they were on, and my teenager wanted to know if this meant Netflix would be down. I talked the teens into switching over to regular cable and waited for the kids to finish so I could add the disk drive. I shut down the server, added the drive, and started it back up. I fixed the VM boot order because the firewall wasn't set to start on a reboot for some reason. I tested the Internet connection, then walked around the house letting everyone know we were back.

Why is it so different at work? Why was taking care of things at home so much more stressful than at work? It's because at work, I am a system administrator, but at home, I was just being “a computer dude”.

Backups – I had backed up my firewall configuration, but it was a long time ago and it didn't capture my latest changes. I had also forgotten that I backed it up at all, and only found the backup file by accident after the incident. Regular backups are critical, even at home!
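Even a small script on a cron schedule would have caught this. Here's a minimal sketch of the idea; the paths and the 30-copy retention are made up for illustration, and it assumes the firewall config has already been copied down to the local machine (pulling config.xml off the firewall itself is left out here):

```shell
#!/bin/sh
# Sketch: keep dated copies of a config file and prune old ones.
# Paths and the 30-copy retention are illustrative, not from the post.
backup_config() {
    src="$1"; dest="$2"
    mkdir -p "$dest"
    cp "$src" "$dest/$(basename "$src").$(date +%Y%m%d-%H%M%S)"
    # keep only the 30 newest copies
    ls -1t "$dest/$(basename "$src")".* | tail -n +31 | xargs -r rm -f
}

# demo run against a throwaway file
echo '<pfsense-config/>' > /tmp/config.xml
backup_config /tmp/config.xml /tmp/config-backups
ls /tmp/config-backups
```

Dropped into /etc/cron.daily (or a crontab entry), something like this keeps a month of dated copies you can actually find after an incident.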

Disaster Recovery – I never thought to just put my wireless router in place to bring up the Internet while I worked on the ESXi server. Having some kind of alternate plan could have saved a lot of aggravation.

Communication – The reason the outage at work went so well and the one at home was crazy is that at work, my boss came in to communicate the status to everyone. He provided a buffer so that we could work on fixing the problem. Some kind of status update is needed, even at home.

Regular Checkups – Beyond just monitoring tools, I need to log into all of the systems and do a basic check: verify everything is working right and the logs look good.

Planned Downtime – Patches, hardware upgrades, configuration tweaks... all of these things should have been planned out. I'm not sure how I will announce to the family that stuff is going to happen, but some method needs to be worked out so alternate plans can be made while I take the system offline.

So, my family may be my toughest customer, but they are also my best teacher. I just hope my next budget proposal gets approved.

Wednesday, July 27, 2011

Yes, there IS a difference between shutdown -r and reboot

A fellow admin and I were testing a server, trying to figure out why the file systems were not unmounting during a reboot.  We looked and looked, and everything appeared to be configured correctly, except certain init scripts were not getting called when rebooting.  After some time working on this, we decided that maybe "reboot" doesn't actually do what we think it does.  After some searching, we discovered that there IS a difference between the reboot command and shutdown -r.

The post that explained it best, in my opinion, was from vreference.com, where Forbes Guthrie explains it.  To sum it up, reboot and halt do not execute the kill scripts.  They just unmount the file systems and either reboot or stop the system.  You are best off using the shutdown command to reboot or shut down the system.

Forbes says this about what to use:

"So, should I use reboot or init 6?  – neither!  My advice is to use the shutdown command.  shutdown will do a similar job to init 6, but it has more options and a better default action.  Along with kicking off an init 6, the shutdown command will also notify all logged in users (logged in at a tty), notify all processes the system is going down and by default will pause  for a set time before rebooting (giving you the chance to cancel the reboot if you realize that you made a mistake)."

So using shutdown is a good habit I'm going to have to get into.
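For the record, the habit looks something like this; the delay and message are just examples:

```shell
# reboot in 5 minutes, warning all logged-in users first
shutdown -r +5 "Rebooting for maintenance - save your work"

# realized you made a mistake? cancel the pending shutdown
shutdown -c

# reboot right now, but still the polite way (kill scripts run)
shutdown -r now
```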

Monday, December 20, 2010

LXC on Ubuntu 10.10 for Minecraft Server

My family has all taken to Minecraft.  We got accounts for all six of us, two “grown ups”, two teens, and two little guys.  We quickly decided that we would run three different servers.  The grown up server is where we build things.  The teen server is where they grant themselves TNT and blow up everything.  And the little kid server is a kind of a combination of both.

Originally, I put them all on the same physical server on different port numbers.  That worked, but wasn’t ideal.  We wanted our own IP addresses.  Less confusion, especially for the little guys.  I have enough hardware that we ran one instance on each server.  That was fine until I lost the grown-up server to a hardware failure on the root partition.  I still can’t believe I didn’t use RAID.

Because of the desire for separate IP addresses for each instance, and the desire to avoid total loss again, I decided to use Linux containers (LXC).  Here are my notes for getting Minecraft servers up and running in a Linux container.

I put the Minecraft data into its own LVM partition because I’m going to use DRBD once I rebuild my other server.

(These are notes so they have very little explanation)
Get what we need installed
apt-get install lxc vlan bridge-utils python-software-properties screen libvirt-bin debootstrap
Create the partitions
lvcreate -L 2g -n minecraft1 datavg
mkfs.ext4 /dev/datavg/minecraft1

mkdir -p /data/minecraft1
vi /etc/fstab
/dev/datavg/minecraft1 /data/minecraft1 ext4 defaults 0 2
We need a root file system.  My main OS is 64-bit.  Since the kernel in the LXC instances is the same as the host’s, I didn’t see a reason to get the 32-bit root.

mkdir /data/images
cd /data/images
wget http://download.openvz.org/template/precreated/ubuntu-10.10-x86_64.tar.gz

mount /data/minecraft1
cd /data/minecraft1
mkdir rootfs
cd rootfs
tar zxvf /data/images/ubuntu-10.10-x86_64.tar.gz
cp /etc/resolv.conf ./etc
echo "route add default gw <your-gateway-IP>" >> ./etc/rc.local

vi ./etc/rc.local (move the route to be above the 'exit' statement)
Update sources (http://repogen.simplylinux.ch/)
vi ./etc/apt/sources.list
deb http://us.archive.ubuntu.com/ubuntu/ maverick main restricted universe multiverse
deb http://us.archive.ubuntu.com/ubuntu/ maverick-security main restricted universe multiverse
deb http://us.archive.ubuntu.com/ubuntu/ maverick-updates main restricted universe multiverse
deb http://archive.canonical.com/ubuntu maverick partner

Chroot so we can update
chroot /data/minecraft1/rootfs /bin/bash

Get SSH up
update-rc.d ssh defaults
In the change root, you want to add your accounts and change the passwords you need to before the server comes up.
Install Sun JRE (to run minecraft server)
apt-get update
apt-get upgrade
apt-get install sun-java6-jre

LXC config files
cd /data/minecraft1
vi fstab
none /data/minecraft1/rootfs/dev/pts devpts defaults 0 0
none /data/minecraft1/rootfs/proc    proc   defaults 0 0
none /data/minecraft1/rootfs/sys     sysfs  defaults 0 0
none /data/minecraft1/rootfs/dev/shm tmpfs  defaults 0 0

vi minecraft1.conf
# Container with network virtualized using a pre-configured bridge named br0 and
# veth pair virtual network devices
lxc.utsname = minecraft1
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = br0
lxc.network.name = eth0
lxc.network.hwaddr = 4a:49:43:49:79:bf
lxc.network.ipv4 =
lxc.tty = 6
lxc.mount = /data/minecraft1/fstab
lxc.rootfs = /data/minecraft1/rootfs

Configure networking.  I have two interfaces.  I left eth0 alone and added eth1 to be the bridged interface.

vi /etc/network/interfaces
auto br0
iface br0 inet static
bridge_ports eth1
bridge_stp off
bridge_fd 0
bridge_maxwait 0
post-up /usr/sbin/brctl setfd br0 0

/etc/init.d/networking restart
Start the container
lxc-create -n minecraft1 -f /data/minecraft1/minecraft1.conf
lxc-start -n minecraft1 init

The container has started.  SSH to it.
Minecraft server setup
This is how I get a vanilla multiplayer server running

mkdir -p /minecraft
cd /minecraft

vi update_minecraft_server.sh
cd /minecraft
rm -f minecraft_server.jar
wget http://www.minecraft.net/download/minecraft_server.jar

chmod 755 update_minecraft_server.sh

vi start_minecraft_server.sh
java -Xmx2048M -Xms2048M -jar minecraft_server.jar nogui

chmod 755 start_minecraft_server.sh
To start the server:

cd /minecraft
screen ./start_minecraft_server.sh

Saturday, December 11, 2010

Cacti fixed. Ooops. :)

I finally tracked down what happened.  This is the problem with changing too many things at the same time.  I looked over the permissions for the cacti user in the MySQL database and decided that it didn't need the full set, so I removed the ability to create temporary tables and some other things.  Normally Cacti doesn't use temporary tables, but I added the Boost plugin and it DOES need that.
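If you've made the same mistake, putting the privilege back is a one-liner. This is a sketch assuming the typical database name and user (cacti, 'cacti'@'localhost'); yours may differ:

```shell
# re-grant the privilege the Boost plugin needs (names are typical
# Cacti defaults, not necessarily yours)
mysql -u root -p -e "GRANT CREATE TEMPORARY TABLES ON cacti.* TO 'cacti'@'localhost'; FLUSH PRIVILEGES;"
```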

The SNMP results get put in the database and written to disk later.  To do this, Boost uses temporary tables.  Since it was unable to flush the table, the database kept filling up.  It got up to 8 million rows before I figured out how to fix it.

I also learned that you can't "backfill" data into an RRD file, at least not using the poller.php.  When I was finally able to flush the data out of the boost table, I lost a lot of the data because of the feature that writes data to the RRD file as someone requests the graph.  Once I started checking all the graphs, I pretty much locked them out of receiving old updates that were stored in the table.

I really need to make some kind of alert for the poller tables in Cacti.

Friday, December 10, 2010

MySQL on my Cacti server freaked out

I'm pretty sure this was because I was using a schema from a new version of Cacti while using the old PHP files to access it.  It's complicated why, but it comes down to the reason I hate RedHat/CentOS: there's no real easy way to upgrade major versions.  I really wish there was something as simple as "apt-get dist-upgrade" for those systems.

Monday, November 22, 2010

OpenDNS for internet filtering

I learned that OpenDNS has an option to perform filtering, and that it's actually free to use.  I decided to log into my old OpenDNS account and try out the filtering.  Filtering is configurable, and you can make it as loose or as tight as you would like.  There are general categories you can add or remove from your preferences, like porn or file-sharing sites, as well as time wasters, religion, and politics.

To get started, OpenDNS first has to know which IP address you are coming from.  I figured the best way for my site (my home) was to set up my home server to run ddclient.  On my CentOS server, I simply had to do 'yum install ddclient'.  OpenDNS provides a configuration sample for it (http://www.opendns.com/support/article/192).
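For reference, the ddclient.conf OpenDNS suggests looks roughly like this; the network label "home" and the credentials are placeholders, so check the article above for the current sample:

```
# /etc/ddclient.conf (sketch based on OpenDNS's sample; label and
# credentials are placeholders)
protocol=dyndns2
use=web, web=myip.dnsomatic.com
server=updates.opendns.com
login=your_opendns_username
password='your_opendns_password'
home
```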

Now that OpenDNS knows where I'm coming from, I have to tell my systems to use the right DNS servers. There are great guides on the site for configuring your home router.  Mine was a bit different because of Verizon, but the general idea was the same.  Once I figured that out, I was on my way.

Next I decided to sign up for the filtering.  I started with the 'low' option and then added some things to customize.  Even though you can test with http://www.theinternetbadguys.com, I checked with the obvious http://playboy.com.  Sure enough, the site was blocked.  Awesome!  I did find out that I had to unblock "Adult Theme" if I wanted to access Reddit.

I like this approach much more than having to set up a squid proxy like I tried in the past.  The biggest reason is that since it's done in DNS, I set this up right on my router and ALL devices on my network are filtered.  That means the iPad, iPhones, and the Wii are protected, as well as the desktops and laptops.  I also like that I don't need a cron job that downloads a block list every day and restarts the service.  It's a very elegant solution to a long-standing problem, and it works great.  I think I'm going to configure this for a couple of friends' family networks.

Overall, I really like this method of filtering.

Friday, November 19, 2010

Day 2 at LISA10

Linux Performance tuning

Slides: http://sites.google.com/site/linuxperftuning/
Speaker: http://en.wikipedia.org/wiki/Theodore_Ts'o

I was very excited to take the Linux Performance Tuning class.  I wasn't sure who the teacher was going to be, but it turned out to be an engineer from Google.  A class in performance tuning by Google; I figured this would have to be good.  So did everyone else.  It was one of the few classes at LISA that was not only sold out, but where they had to add extra seats to the room.

The class began by establishing what it is we are trying to accomplish.  The goals of performance tuning are to speed up a single task, or to get graceful degradation of an application as load increases.  You don't want your web server to come crashing to a halt when load increases; you want it to degrade gracefully as it takes on more and more load.

In order to start tuning something, you have to know where it stands currently so you can measure whether or not you actually did something.  You must first establish a baseline.  Using the baseline as the starting point, you make a single change to the system, test, then measure the results against the baseline.  Then you just keep repeating the process, making only one change at a time.

Some basic tools to start with are 'free', 'top', and 'iostat'.  Using the free command, see if the swap space is being used.  If you are using swap, you should increase the memory of the system.   Run the top command, and check what's running.  Check the I/O with iostat and use the -k option to get kilobytes instead of blocks.  These are just some of the simple tips we started with.
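A first pass with those three might look like this (iostat comes from the sysstat package on most distros):

```shell
free -m                  # any swap in use? consider more RAM
top -b -n 1 | head -20   # batch-mode snapshot of what's running
iostat -k 5 3            # I/O in kB/s: three samples, five seconds apart
```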

We got into deep details of file system optimizations from the guy who maintains the code for ext4.  There was a lot in the slides that I still have to decipher, which is why this took so long to post.

One interesting thing I learned was something called short stroking.  Short stroking is where you partition the disk to use only the outer rim (the beginning of the disk), where you get 100% performance versus the interior of the disk.  Short stroking uses about 10%-30% of the disk, so you are going to lose some space, but gain a lot of speed.  Combine this with multiple disks and you can get near-SSD speeds.

Other bits that jump out at me:

Mounting with noatime can help I/O because you no longer have to make a write for each file access.

Increasing the size of the journal can help performance a little bit.  Check the current journal settings with 'sudo dumpe2fs -h /dev/sda5'.

ionice - “This  program  sets  or gets the io scheduling class and priority for a program.  If no arguments or just -p is given, ionice  will  query  the current io scheduling class and priority for that process.”

nttcp - (new test TCP program) - The nttcp program measures the transfer rate (and other numbers) on a TCP, UDP, or UDP multicast connection.

When tuning the network, you can tune for latency or for throughput.

NFS - no_subtree_check - If a subdirectory of a filesystem is exported, but the whole filesystem isn't then whenever a NFS request arrives, the server must check not only that the accessed file is in the appropriate filesystem (which is easy) but also that it is in the exported tree (which is harder). This check is called the subtree_check. This option disables subtree_check.

You can bump the number of NFS threads up from the default.  This is in /etc/sysconfig/nfs or /etc/default/nfs-kernel-server.
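On the systems I touch, that's a one-line change; the 16 here is just a starting point to measure against your baseline:

```
# /etc/sysconfig/nfs (RHEL/CentOS) or /etc/default/nfs-kernel-server (Debian/Ubuntu)
RPCNFSDCOUNT=16
```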

“Use the hard option with any resource you mount read-write. Then, if a user is writing to a file when the server goes down, the write will continue when the server comes up again, and nothing will be lost.” (http://uw714doc.sco.com/en/NET_nfs/nfsT.mount_cmd.html)

Remove outdated fstab options.  Just use “rw,intr”

More NFS performance info: http://www.linuxselfhelp.com/HOWTO/NFS-HOWTO/performance.html

strace – system call tracer
ltrace – library tracing

Optimize the stuff that is used often over the stuff that runs once.

perf – kicks ass.  Learn it. (https://perf.wiki.kernel.org/index.php/Main_Page)

After the class, I went to a few talks.  There was Splunk, PostgreSQL, and some cloud computing thing.  Then I attended the Minecraft get-together.  I had heard of Minecraft, but now I saw a demo on the large projection screen.  Now (a week later) I have four accounts for the family and we are playing on our own server.  We're all hooked.  I can't recommend this thing enough.  (http://minecraft.net/)