If you were here yesterday, you may have noticed that for a brief period (a very long hour for me) the site was down. What happened? A confluence of things:
- The main hard drive filled up with an ever-expanding collection of source code
- I split said source code onto several machines, requiring a re-installation of the SCC software
- I ended up with repositories incompatible with the old version of the SCCS
- So I had to update the operating system
- Which updated everything, making my configuration files pretty much unusable
Here is the skinny.
1. Filled Up Hard Drive
The first problem I encountered was a volley of yelps from the server itself: it sent me messages gasping for air as its internals started dying because there was no room left. Temporary files would not write; log rotation would fail; even cron jobs would crash.
When I went online, it didn’t take long to figure out what the problem was. The trick, though, was figuring out who the main culprit was and where I could trim “fat” most easily. Since I couldn’t even write temporary files, I started in the easiest place: my login user’s home directory, looking for anything that could be removed.
Once I had a few megabytes available, trimmed by removing a few ZIP files of pictures I had uploaded, I could du the drive. The command I generally use is “du | sort -n”, because that puts the worst offenders at the bottom, where I see them first. I could easily see that my source repository was taking up 75% of the drive. Nothing else was even remotely relevant at that point.
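That pipeline, roughly, as a runnable sketch (the starting directory and the -d depth are assumptions; adjust to wherever you suspect the bloat):

```shell
# Per-directory disk usage in kilobytes, sorted so the worst offenders
# land at the bottom of the listing, right above the prompt.
usage=$(du -k -d 1 . | sort -n)
echo "$usage"
worst=$(echo "$usage" | tail -n 1)   # the biggest entry (usually "." itself)
echo "worst offender: $worst"
```

On a full disk this is handy precisely because it needs no temporary files of its own.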
2. Splitting the Repository
My first inclination was to save the revision history. I am really good at keeping everything under source control, and that includes projects that are not software-related. My writing projects, including my novels, are all versioned automatically, in 5-minute increments. You could follow along as I write, down to the point where you can tell when I went to get a coffee.
So I had software projects, non-software projects (writing and graphics), and web sites under source control. Sadly, the repository is opaque: you can’t tell which projects take up the space. I also made the (stupid) mistake of placing everything under one giant canopy, when it would have been just as easy to keep separate repositories.
If I had done that, I would have known immediately that the main culprit was the sites, which took up over 70% of the storage thanks to all the photos in albums. It really takes a lot of writing to even remotely compare with a sunset; the old saying that a picture is worth a thousand words is definitely an understatement: the final cover design file is about twice as big as the entire text of the novel!
To split the repository, I would have to dump it. I was using SVN (I have since switched to Git), which meant I had to create a file containing the entire repository, and then split that file into three, one for each of the subrepositories.
The command to do that, by the way, is “svnadmin dump /path/to/repo > file”. Here /path/to/repo is the location of the repository on disk (svnadmin works on local paths, not URLs) and file is the file name you want to save to. Once you have the file, you can manipulate it with the companion command svndumpfilter.
It didn’t work that way for me. Thanks to autocommits, I had accumulated 11,000+ revisions, and the dump crashed miserably at around revision 5,000. So I started doing partial dumps – revisions 1 to 999, 1000 to 1999, and so on. That worked, but I would have had to piece the chunks back together later. And, really, when did I ever need that history?
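The chunked dump can be sketched as a small loop. The repository path and head revision here are assumptions, and the svnadmin call is skipped when the tool or the repository isn’t present, so the sketch dry-runs safely:

```shell
REPO=/var/svn/repo   # hypothetical local repository path
HEAD_REV=11000       # roughly where the autocommits had gotten to
start=0              # SVN revision numbering begins at r0
chunks=0
while [ "$start" -le "$HEAD_REV" ]; do
    end=$((start + 999))
    [ "$end" -gt "$HEAD_REV" ] && end=$HEAD_REV
    echo "dumping r$start:$end"
    # --incremental emits only the deltas, so the pieces load back in order
    if command -v svnadmin >/dev/null 2>&1 && [ -d "$REPO" ]; then
        svnadmin dump "$REPO" -r "$start:$end" --incremental > "dump-$start-$end.svndump"
    fi
    chunks=$((chunks + 1))
    start=$((end + 1))
done
echo "$chunks chunk(s) dumped"
```

The pieces would later be restored in order with “svnadmin load”, which is exactly the reassembly work I decided to skip.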
So I went for a different approach: I simply exported the repository into a new directory structure and backed it up to a RAID drive. Then I started three brand new repositories on the server.
3. Upgrading the Operating System
That’s when I realized there was a version mismatch between the server on which I had created the repositories and the server that ran the SCC software. When I tried to connect, the server said it didn’t recognize the repository, because it was in an unknown file format.
I had multiple options: recreate the repositories using the old software; pick different software altogether; or upgrade the server to a new release.
I went all out: upgrade to the new release. It was by far the most extensive change, and crazy-unnecessary for the purpose of this exercise. But I knew the release I was using was so old it was going to be sunset soon. Might as well bite the bullet and live dangerously.
Since I use Ubuntu, upgrading was simple: change to root, type do-release-upgrade, and answer questions for hours. That’s when the server started falling apart: the change from 12.04 to 14.04 brought a new Apache version, and since Apache is one of the worst pieces of software to configure, all hell broke loose. The server would not start up. Which is what you may have seen.
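For reference, the whole procedure boils down to a couple of commands as root (a sketch; the prompts you get depend on which packages have locally modified config files):

```shell
apt-get update && apt-get dist-upgrade   # bring the current release fully up to date first
do-release-upgrade                       # walks you from 12.04 to 14.04, one question at a time
```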
The upgrade as a whole, though, went off without a glitch. It pays to wait a year before upgrading.
4. Upgrading Apache Configuration
As mentioned, the only pain point (as usual) was Apache. I have a hate-hate relationship with Apache and continue using it only out of inertia, since there are plenty of other options available.
It’s not that Apache is bad – it really isn’t. Apache’s problem is that it’s still stuck in the web the way it was done in 1994: everything is complicated, and nobody knows why.
Apache configuration works that way, too. As an example, take this tiny piece of configuration:
ProxyPass /waveplot/latest.png https://surf.mrgazz.com/waveplot
ProxyPassReverse /waveplot/latest.png https://surf.mrgazz.com/waveplot
What that does is map a request to this site, mrgazz.com, to a different site. So when you ask for http://mrgazz.com/waveplot/latest.png, this site actually goes out to another server, surf.mrgazz.com, and fetches the image from there. Why not use the URL directly? Because it’s an HTTPS URL and the site doesn’t have a proper security certificate.
The question is, why do I need a ProxyPass AND a ProxyPassReverse line? The reasoning of the Apache people is that the two commands do two different things: the first one goes and fetches the image; the second one changes the response so that all references to the original site are replaced with references to the new site (that is, surf.mrgazz.com is replaced with www.mrgazz.com).
And that’s the crux of the problem: as a system administrator, I shouldn’t have to care HOW Apache does things. I should simply care about what I want Apache to do. But that’s not the way Apache works to this day.
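As an aside, none of those directives do anything unless the proxy modules are loaded; on Ubuntu that is one command (module names assumed for a plain HTTP reverse proxy like the one above):

```shell
sudo a2enmod proxy proxy_http   # enable mod_proxy and its HTTP backend
sudo service apache2 restart    # pick up the newly enabled modules
```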
So, simply switching from Apache 2.2 to 2.4 (the small version increment is misleading) left me with an unusable server. The problem was that even some of the most fundamental directives had been changed for no apparent reason, and there was no fallback. In particular, the authorization (authz) directives had changed. You used to say
Order allow,deny
Allow from all
Now you say
Require all granted
Somehow, nobody added this very simple rewrite to the Apache upgrade script. That is really stupid: doing it might have helped a ton of people transition painlessly, and it could only have failed sporadically, while not doing it guaranteed that sites would break.
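Until someone does, a quick audit is easy enough to script yourself. A hedged sketch (on a real server you would point it at /etc/apache2; here a throwaway directory stands in):

```shell
# Locate leftover Apache 2.2-style auth directives that 2.4 no longer honors.
CONF_DIR=$(mktemp -d)                 # stand-in for /etc/apache2
cat > "$CONF_DIR/old-site.conf" <<'EOF'
<Directory /var/www>
    Order Allow,Deny
    Allow from all
</Directory>
EOF
# Any hit is a block that needs rewriting as "Require all granted" / "Require all denied"
matches=$(grep -rniE 'order +(allow,deny|deny,allow)|^[[:space:]]*(allow|deny) +from' "$CONF_DIR" | wc -l)
echo "found $matches stale directive(s)"
rm -rf "$CONF_DIR"
```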
5. Upgrading Other Software
Everything else, of course, was painless. Screw Apache.
Rebooting the system was painless. It took a lot longer than usual (90 seconds instead of 25) for reconfiguration, but luckily Ubuntu Server is smart enough to know it cannot ask questions while it’s booting, so there was no hanging console.
Everything booted fine, and aside from wasting an hour (actually, a few hours) of my time and a gallon of adrenaline, there were no negative side effects.
6. Next Changes
The next project for this site is going to be an upgrade from the Joomla 2.5 series to the new 3.3. I didn’t think it would make a lot of difference, but the new version is a huge step forward. In particular, the templates that ship with 3.3 are mobile-friendly, which is definitely something I’d like to see for this site.
Also, I really, really have to clean up the menu structure and content. Luckily for me, all of it is version-controlled, so I can plan gradually and push everything out once I am satisfied it will satisfy you!