Random overload!

What is happening to my server?!

I run a bunch of servers, most of them tiny and used for redundancy, but one that is central to my business. Lately, I’ve been noticing that the main server is consistently being bombarded by high CPU usage and multiple crashes, which, when it’s the primary email handler for my clients to access their incoming email, is a big problem.

Full disclosure with this: I’m not going to give away much in the way of details (log messages and such) as I don’t want to risk any privacy breaches with my clients data.

Where for art thou logs?

Seeing as how I’ve never run into that problem before on CentOS 7, or CentOS 6 in the years I’ve been using them both, I started investigating. Problem is, I couldn’t find any reason for WHY the system kept crashing. In my ignorance, I turned to the guys over at the Sysadministivia to get some advice. (Those guys are great by the way).

Brent got back to me really quickly and told me that CentOS 7 doesn’t store the journal persistently by default! How crazy is that?! After turning on persistent journaling (they even told me how to do that) and remote logging (rsyslog is still amazing), so I could at least go through the logs next time the problem happened.

An excerpt from Brent’s response to me:

Before we go into anything else, I should note that the default CentOS 7 behaviour for journald is "auto" storage, meaning: log to volatile memory (RAM) if the directory /var/log/journal does NOT exist (and it
doesn't, in default cases). If you want persistent logging (and it sounds like you do), you can either:

- uncomment "#Storage=auto" in /etc/systemd/journald.conf and change to
"Storage=persistent" (in which case it will force-create the directory if it doesn't exist), OR

- simply just mkdir -p /var/log/journal

The Problem is found!

When it did happen again, about a week later, I was able to examine all of my logs, and lo and behold, found so many references to PHP FCGI processes crashing from a lack of resources (DDoS), always from the same IP range (who knew Russia was so interested in my Small Business?) that I was able to just mass drop of all packets and requests coming from them. If I didn’t keep my systems patched religiously, I would be in much bigger trouble right now!

Lessons I learned on this issue:

  • Ask experienced people for advice when you need it. I don’t have any kind of formal training in Systemd and Journald, so was very confused why I couldn’t examine my logs using the provided journalctl tools.
  • Having the redundant servers backing my primary server is great, as it kept all of my services running without issues.
  • Keeping all of my servers updated and patched daily, in sequence so I never have an outage, is a great way to run small business servers.

Next is playing with a better, more automated way of updating and rebooting my servers in sequence.

Thanks again to the guys at Sysadministrivia for guiding me in being able to actually get the information I needed to fix the issue. If you want to hear their comments about it, check it out at their website, or follow the link to S2E3: Ass-Backwards Passwords, and check their show notes for their response to my email.