Linux admin tips

This is aimed at people who administer Linux systems but who are not (yet) Linux geeks and now have some Linux-based or Linux-related systems to take care of, usually coming from another platform such as Windows or NetWare. The term '*nix' is used to cover the whole family of Unix-related operating systems, which includes BSD, Linux, and now even the modern Mac OS (hidden under the covers). Certainly getting some formal Linux training will help, but I have found that much of that training very quickly assumes that all computer techs are programmers (an invalid assumption that runs throughout the Open Source community).

This document is written from the SUSE distro perspective, usually with Micro Focus's Open Enterprise Server (OES) and/or GroupWise running on it, but it can certainly help with many other applications. It is assumed that you are in a terminal window that you can expand past 24 lines of 80 characters, such as a PuTTY session. Commands are in italics and should be typed in the same case as given in the examples, as Linux is case sensitive (lower case vs UPPER CASE). Sample output is in a fixed font.

The first three commands are the ones I do most times I check on a server, basically asking these questions:
- How long has the server been up?
- Is any process using way more CPU or memory than normal? (requires noting the patterns of the server to know what 'normal' is for a particular 'box')
- Are we getting tight on memory?
- Are we getting tight on drive space?
Certainly there are system management products available that watch for these, but they aren't always an option because of their cost or an insufficient budget, nor are they immune from failure. Checking these periodically and knowing what is normal for your systems is a very good habit.

CPU: top

All basic Linux training should have shown you the top tool, which lists the running processes, though I have found that courses tend to either gloss over it or hit you with ALL the bells and whistles. The basics to look for, starting from the top of top, are:
- 1st line (also available with uptime): How long has the server been up? Are any of the load average numbers getting close to the number of CPU cores?
- 2nd line: Make sure there are no zombie or stopped processes.
- 3rd line: How much of the CPU is being used right now? More details at Scout
    - %us User level processes (what you usually want the system for)
    - %sy System level apps (kernel level keeping the system working)
    - %ni Niced User level processes, i.e. that will only use spare cycles.
    - %id Idle, just waiting for work. If this is low, the CPU is working hard and doesn't have much excess capacity.
    - %wa Wait: If this is high, the CPU is ready to run, but is waiting on I/O access to complete (such as waiting for disk or network).
    - %hi and %si Hardware/Software Interrupt servicing. High numbers point to a problem with the hardware or software in question.
    - %st Steal: If this is high, your virtualization environment has issues, either too many guests on one host or just not enough CPUs to go around.
- 4th and 5th lines are similar to the free command explained below.
- The remainder shows the top processes, sorted by default by CPU usage with the highest at the top (as in, show the busiest processes). You can change the sort order by pressing:
    - M (Shift-m) to sort by memory used
    - T (Shift-t) to sort by the cumulative CPU time the process has used (the TIME+ column)
    - P (Shift-p) to get back to the default sort by amount of processor used.
    - c (lower case c) to toggle seeing the full command for a given process.
The top line shows you how long this server has been up. Occasionally a process can gradually take more and more memory or CPU cycles (a memory leak or a stuck thread), and top is where you would likely see it first if you are checking regularly.
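
The first two checks in the list above (load average versus core count, and zombie processes) can be sketched as a couple of small shell functions. This is only a rough sketch: the function names are made up for illustration, and the 0.8 warning ratio is my own assumption, not a rule, so adjust it to what is normal for your box.

```shell
# Hypothetical helper: compare a load average against the number of cores.
# Prints WARN when load/cores reaches the ratio (default 0.8, an assumption),
# otherwise prints OK.
load_status() {
    load=$1            # a load average, e.g. the 5-minute figure
    cores=$2           # number of CPU cores
    ratio=${3:-0.8}    # warning ratio; illustrative default only
    awk -v l="$load" -v c="$cores" -v r="$ratio" 'BEGIN {
        if (l / c >= r) print "WARN"; else print "OK"
    }'
}

# Hypothetical helper: list processes whose state (STAT) starts with Z.
# Prints nothing when there are no zombies.
list_zombies() {
    ps -eo stat,pid,ppid,comm | awk 'NR > 1 && $1 ~ /^Z/'
}

# On a live Linux system (field 2 of /proc/loadavg is the 5-minute average):
#   load_status "$(cut -d' ' -f2 /proc/loadavg)" "$(getconf _NPROCESSORS_ONLN)"
#   list_zombies
```

A WARN here is only a prompt to go and look at top properly; a server that normally runs busy will need a higher ratio.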

Memory: free -m

Shows how much memory the server sees and how much is actually used. This confirms that the amount you expect is really there, which catches memory failures and theft. It also shows how much of your swap space is used. If you are running out of swap space, or it is heavily used, increasing the assigned/installed RAM may be in order. You don't want to endlessly grow your swap space unless poor performance and wasted electricity are acceptable (spinning hard drives take more energy and far longer to do anything than RAM). See also: understanding free's output.
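
To put a number on "heavily used" swap, something like the following could work. It is a minimal sketch that assumes the Swap: line of free -m lists total, used, and free in that order; the function name is hypothetical.

```shell
# Hypothetical helper: read `free -m` output on stdin and print the
# percentage of swap in use as a whole number, or "none" when the
# system has no swap configured.
swap_used_pct() {
    awk '$1 == "Swap:" {
        if ($2 == 0) print "none"; else printf "%d\n", $3 * 100 / $2
    }'
}

# Usage on a live system:
#   free -m | swap_used_pct
```

What percentage counts as too much depends on the server, so note the usual value and watch for it climbing.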

Drive Space: df -h

Shows you what drives your system has and the percentage used on each. Check frequently to make sure you are not running tight anywhere, as running out of space in root ( / ) is just as bad as running out on C: on a Windows box.

Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              18G  8.2G  8.7G  49% /
devtmpfs              2.9G  120K  2.9G   1% /dev
tmpfs                 2.9G  144K  2.9G   1% /dev/shm
admin                 4.0M     0  4.0M   0% /_admin
/dev/pool/DATA        900G  710G  191G  79% /opt/novell/nss/mnt/.pools/DATA
DATA                  900G  706G  191G  79% /media/nss/DATA

Root ( / ) looks good, but we will have to periodically watch the DATA volume. The two instances of DATA are one and the same, basically showing that all parts of the NSS mount process have completed (pool vs volume in the pool).
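
Picking out the tight volumes by eye works for a handful of mounts, but a small filter can do it for you. This is a sketch only: the function name and the default threshold of 80 are my assumptions, and it will misread mount points that contain spaces.

```shell
# Hypothetical helper: read df output on stdin and print every mount point
# whose Use% is at or above the given threshold (default 80, an assumption).
full_mounts() {
    limit=${1:-80}
    awk -v limit="$limit" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)                    # "79%" becomes "79"
        if (pct + 0 >= limit + 0) print $6 " " $5
    }'
}

# Usage on a live system (-P keeps each filesystem on a single line):
#   df -hP | full_mounts 75
```

Run against the sample output above with a threshold of 75, this would list the two DATA mounts at 79% and leave root out.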

BTRFS messes up this nice simple process; it is clearly built by developers for developers, leaving a mess for the rest of us to figure out. You can't put databases on it (corruption issues), which is why OES does not use it. Coming up with some good Ops-level checks for it is in the queue.

- To figure out what is using space on a drive, use du -hx --max-depth=1. This gives you a summary of the space used in each directory/folder below the level you run it at, but only for the mount point you are on. For directories that look too big, cd into them and repeat until you have found what is taking up the room. This command can take a while to run, so be patient with it. It is a good exercise for getting to know your system. On Windows systems, I use a similar function that is part of Total Commander (Alt-Shift-Enter).
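
When a directory has many subdirectories, sorting the du output saves some squinting. Here is a rough sketch with a made-up function name; it reports sizes in KB (-k instead of -h) so that a plain numeric sort puts the biggest entries last.

```shell
# Hypothetical helper: summarize the space used by each directory one level
# below the given directory (default: the current one), smallest first,
# staying on one filesystem (-x). Sizes are in KB so `sort -n` can order them.
biggest_dirs() {
    du -kx --max-depth=1 "${1:-.}" 2>/dev/null | sort -n
}

# Usage:
#   biggest_dirs /var
```

The last line of the output is the total for the directory itself, and the lines just above it are the subdirectories worth a cd and a repeat.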

One problem I've seen is log files that grow way too fast for some reason; the above process with du will find them quickly enough. If the logs in question are already being bzipped, then either remove old ones or expand the space to allow for a year's worth. Otherwise, when you have a log file that is huge (greater than 20MB and growing), investigate logrotate. While many logs, such as Apache's, are set up with logrotate by default, there are some that were assumed never to grow much and so have no logrotate process set up for them. I usually just look at the existing rotation files in /etc/logrotate.d, make a copy of the one that is closest to my needs, and edit it accordingly. You don't want to be in the situation this guy ends up in.
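
If none of the existing files in /etc/logrotate.d is a close match, a basic stanza can be generated and then edited. The sketch below prints one that keeps roughly a year of weekly, compressed logs. The function name, the application name "myapp", and its log path are all hypothetical, and the directive values are only a starting point.

```shell
# Hypothetical helper: print a logrotate stanza for the given log path.
# weekly + rotate 52 keeps roughly a year of compressed logs.
logrotate_stanza() {
    cat <<EOF
$1 {
    weekly
    rotate 52
    compress
    missingok
    notifempty
}
EOF
}

# Usage ("myapp" is a made-up name); review the file before relying on it:
#   logrotate_stanza '/var/log/myapp/*.log' > /etc/logrotate.d/myapp
#   logrotate -d /etc/logrotate.d/myapp    # -d is debug mode, a dry run
```

Some applications keep their log file open and need extra handling after rotation, which is another reason to compare against an existing file for a similar service.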

As a general rule, operating systems do not automatically clean up old files in their TEMP folders, so quite a lot of files can build up on some systems. To have old files cleared from /tmp and any other directories that need it, edit /etc/sysconfig/cron on SLES/SLED.
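
As a rough guide, these are the settings in /etc/sysconfig/cron that control the cleanup on the SLES 10/11 generation. The values shown are illustrative assumptions, not recommendations, and newer releases may handle temporary files through a different mechanism, so check what your version actually ships.

```shell
# Fragment of /etc/sysconfig/cron (values are examples only)
MAX_DAYS_IN_TMP="14"          # delete files not accessed for this many days; 0 disables
TMP_DIRS_TO_CLEAR="/tmp"      # space-separated list of directories to clean
OWNER_TO_KEEP_IN_TMP="root"   # files owned by these users are never deleted
```

Start with a generous number of days and shorten it once you are sure nothing on the server depends on old files in those directories.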

- The 'Gnome System Monitor' can show you some of the above in a more graphical manner including some trending. It is included on SLES 10 and up.

- For a deeper dive, Cyberciti's Top Linux Monitoring Tools will certainly take you there, with some of it from the classical Linux point of view.

- Nagios is The Standard in system monitoring, is free, and like all such systems takes some effort to configure. A preconfigured instance is also included with OES as part of NoRM, starting with version 11.2.