Aeleen's (http://www.oreillynet.com/pub/au/678 and http://www.aeleen.com/home.htm) class on “Administering Linux in a Production Environment” was pretty good. It's more of a primer on some new things that have come up and tools to think about getting into. For me, it seemed more like a showcase of technology than any sort of training. If you weren't already familiar with the tools, you might be a bit lost. If you were already familiar with them, then it was old hat. I did get a couple of juicy bites from the talk though.
A good place to see what's up and coming in the Linux kernel is:
http://www.kernelnewbies.org
I would still like to take another look at LVM snapshots. I understand that they were hard to work with, at least as far as restoring from a snapshot, but maybe there is a scripting solution to making this easy? If this can be done, it would make the whole patching thing a lot easier.
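As a starting point for that scripting idea, here is a minimal sketch of a wrapper around the LVM command line. It assumes an LVM2 release that supports `lvconvert --merge` for rolling an origin volume back to a snapshot; the volume group, volume, and snapshot names are made up for illustration, and `dry_run` defaults to printing the commands rather than running them.

```python
import subprocess


def snapshot_cmd(vg, lv, snap_name, size="1G"):
    """Command that takes a snapshot of vg/lv before patching."""
    return ["lvcreate", "--snapshot", "--size", size,
            "--name", snap_name, f"/dev/{vg}/{lv}"]


def rollback_cmd(vg, snap_name):
    """Command that merges the snapshot back into its origin (undo the patch)."""
    return ["lvconvert", "--merge", f"/dev/{vg}/{snap_name}"]


def drop_cmd(vg, snap_name):
    """Command that discards the snapshot once the patch is known to be good."""
    return ["lvremove", "--force", f"/dev/{vg}/{snap_name}"]


def run(cmd, dry_run=True):
    """Print the command in dry-run mode, otherwise execute it."""
    if dry_run:
        print(" ".join(cmd))
        return 0
    return subprocess.run(cmd, check=True).returncode


# Hypothetical names: volume group "vg0", logical volume "root".
run(snapshot_cmd("vg0", "root", "prepatch"))
run(rollback_cmd("vg0", "prepatch"))
```

One caveat to check before relying on this: as far as I know, a merge on an origin that is in use (the root filesystem, for example) only starts the next time the volume is activated, so a rollback of the root volume would involve a reboot.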
I saw a tweet go by about OpenTSDB (http://opentsdb.net/) saying that it might overtake RRDTool. It does look interesting. I got a chance to talk to Tobias a little bit about his thoughts. He had heard of it, but hasn't really looked into the product. The issue for OpenTSDB is that it's built on top of HBase (http://hbase.apache.org/). Generally this is for large installations.
On a side note, I brought up the caching problem we had with Cacti, where the I/O was destroying our DRBD setup until we used the big installation add-on for Cacti, which caches the results and then writes them to disk in batches. I wanted to use RRDTool for the Nagios monitoring results, but was afraid of running into the same I/O issue. Tobias mentioned that RRDTool 1.4 has a caching daemon that will help with this. I'll need to take this back as some homework.
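For the homework, this is a rough sketch of how the caching daemon (`rrdcached`, shipped with RRDTool 1.4) might be wired in. The socket path, journal directory, base directory, and RRD file name are all assumptions for illustration, and the timing values are placeholders rather than recommendations. The code only builds the command lines so they can be reviewed before anything is run.

```python
def rrdcached_cmd(socket_path, journal_dir, base_dir,
                  write_timeout=1800, delay=900):
    """Build an rrdcached command line.

    -l  address to listen on (a UNIX socket here)
    -j  journal directory, so queued updates survive a crash
    -b  base directory that holds the RRD files
    -w  seconds an update may sit in memory before it is written
    -z  random delay (must not exceed -w) that spreads the writes out
    """
    return ["rrdcached",
            "-l", f"unix:{socket_path}",
            "-j", journal_dir,
            "-b", base_dir,
            "-w", str(write_timeout),
            "-z", str(delay)]


def update_cmd(socket_path, rrd_file, value, timestamp="N"):
    """Build an `rrdtool update` that goes through the daemon, not straight to disk."""
    return ["rrdtool", "update", "--daemon", f"unix:{socket_path}",
            rrd_file, f"{timestamp}:{value}"]


# Hypothetical paths and file names.
sock = "/var/run/rrdcached.sock"
print(" ".join(rrdcached_cmd(sock, "/var/lib/rrdcached/journal", "/var/lib/rrdcached/db")))
print(" ".join(update_cmd(sock, "nagios_load.rrd", 0.42)))
```

The point of the design is the same as the Cacti add-on: updates pile up in memory and hit the disk in batches, which should be much kinder to DRBD than one small write per check result.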
Cloud computing came up in the training class, which made me think about some of the things overheard in the hallways: rumblings about the possibilities of doing something “in the cloud”. We should probably get that EC2 cluster up and running to give it a test go, as well as look into the free Amazon Cloud.
http://aws.amazon.com/free/
Aeleen asked about the tools we were using and BigFix came up. She mentioned that it was a great tool and that she loves it, followed by pretty much the same response from a number of people in the class. I took a quick glance at it, and it looks like it does answer a lot of the issues we have been trying to work through, like patch management and vulnerability assessment. I need to see if we can get a demo or something of this product.
Power and wireless issues caused major distractions in the session. The power problem wasn't that big of a deal since we were all using laptops; we just switched to battery power. I didn't know the wireless APs were affected too. I spent a good amount of time trying to reconnect to the wireless, and had a hard time keeping up with the material because of that. It did get fixed by the break.
I'd like to look more into condensing our sudoers file into a sort of one-size-fits-all solution by getting more mileage out of groups, and by using the host group declarations. I knew you could restrict by user and user groups, but I didn't know that you could restrict by hosts and host groups as well. A one-file solution makes a Cfengine-integrated sudoers file easy to implement.
Right as the SELinux section started, Rik Farrow made an appearance. Neat! It seems that SELinux is getting pushed pretty hard this year.
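To make the one-file idea concrete, here is a small sketch that renders a single sudoers fragment from host groups and user groups, using the standard `Host_Alias` and `User_Alias` declarations. All host names, user names, and the command path are invented for the example; this is not our real layout.

```python
def render_sudoers(host_groups, user_groups, rules):
    """Render a sudoers fragment.

    host_groups / user_groups map an alias name (upper case, as sudoers
    requires) to its members. Each rule is (who, where, [commands]).
    """
    lines = []
    for name, hosts in sorted(host_groups.items()):
        lines.append(f"Host_Alias {name} = {', '.join(hosts)}")
    for name, users in sorted(user_groups.items()):
        lines.append(f"User_Alias {name} = {', '.join(users)}")
    for who, where, commands in rules:
        lines.append(f"{who} {where} = {', '.join(commands)}")
    return "\n".join(lines) + "\n"


# Hypothetical groups and rule.
fragment = render_sudoers(
    host_groups={"WEBSERVERS": ["web01", "web02"], "DBSERVERS": ["db01"]},
    user_groups={"WEBADMINS": ["alice", "bob"]},
    rules=[("WEBADMINS", "WEBSERVERS", ["/usr/sbin/apachectl"])],
)
print(fragment)
```

Because the same file can go to every machine and each rule only applies on the hosts it names, Cfengine would just have to copy one file everywhere. Running the result through `visudo -c -f <file>` before distributing it would be a sensible safety check.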
One big thing that was pointed out is that if your system is dipping into swap, as evidenced by the 'free' command, you should probably add more memory to your system if possible, especially if this is a server and you are experiencing performance issues. If that's the case, memory is the first thing to look at to make the problems go away.
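A check like this is easy to automate. The sketch below pulls the same swap numbers that 'free' reports out of `/proc/meminfo`; the sample text and its values are invented so the example runs anywhere, and `read_meminfo` is only meant to be called on a Linux host.

```python
def swap_used_kib(meminfo_text):
    """Return used swap in KiB from the contents of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # first token is the number
    return fields["SwapTotal"] - fields["SwapFree"]


def read_meminfo(path="/proc/meminfo"):
    """Read the live values on a Linux host."""
    with open(path) as handle:
        return handle.read()


# Made-up sample in the /proc/meminfo format.
sample = """MemTotal:        8174480 kB
MemFree:          213004 kB
SwapTotal:       2097148 kB
SwapFree:        1572860 kB
"""
used = swap_used_kib(sample)
print(f"swap in use: {used} KiB")
```

Used swap on its own is only a hint. Watching whether pages are actively moving in and out (the si and so columns of vmstat) is probably a better signal that the box really needs more memory.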
Up until now, I thought that the difference between a 32bit and a 64bit OS was that on the 64bit OS you can compute some crazy big numbers, and that other than that there was little value in going to 64bit, especially considering support and the problems with 64bit vs. 32bit libraries. I found out that this isn't really true. The biggest difference is that you run into a lot of memory problems with 32bit systems because of how much memory the kernel can address. With 64bit, a lot of the tricks involved in dealing with a lot of memory go away. But how much memory is a lot? Even as little as 3 GB gets tricky. Since most of our servers are shipping with 8 GB, this alone should drive us into building 64bit systems by default. There are also network considerations when dealing with 32bit and 64bit systems. The message from this session is loud and clear: “Why would you run 32bit if 64bit is an option?”
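The 3 GB figure lines up with simple address arithmetic. A 32bit pointer can name 2^32 bytes, which is 4 GiB in total. The sketch below also assumes the common 32bit x86 Linux layout where 1 GiB of that space is reserved for the kernel; that split is my assumption, not something stated in the class.

```python
GIB = 2 ** 30


def addressable_gib(pointer_bits):
    """Total bytes a pointer of this width can address, in GiB."""
    return 2 ** pointer_bits / GIB


total_32 = addressable_gib(32)          # whole 32bit address space
kernel_reserved = 1                     # assumed 3G/1G user/kernel split
per_process = total_32 - kernel_reserved
server_ram = 8                          # what most of our servers ship with

print(f"32bit address space: {total_32} GiB")
print(f"left for one process: {per_process} GiB")
print(f"server RAM beyond the 32bit space: {server_ram - total_32} GiB")
```

So an 8 GB server already has twice as much RAM as a 32bit address space can cover directly, which is where the memory tricks come from.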
Facebook Vendor Talk
I got to sit in on an open discussion with a panel of Facebook system administrators. It was a little slow to start, but got a bit interesting.
First, they mentioned that they are still using Cfengine2 and plan to continue using it into the foreseeable future. I found this interesting because it hinted at something undesirable in regards to Cfengine3.
They also talked about how the administrators create tools, and that when building tools, open sourcing them is usually a consideration from the start. The ones that they do release end up here:
http://developers.facebook.com/opensource/
I managed to get into something that I've always been curious about in really large server environments like Google, Yahoo!, Facebook, and so on. I know what a sysadmin does normally, but when dealing with really large environments, there is a much greater division of labor. What does a sysadmin end up doing in this case? For the guys on the panel, they end up mostly developing automation tools. They don't touch the hardware at all now, and seldom even get into configuration. You aren't tweaking a web server anymore, you are tweaking thousands of web servers. I understand why it's like that, but I don't think it's anything I'm interested in doing. I guess it's the difference between working on a large garden and a huge factory farm. There is just something more intimate about the garden.