Republished from 04/29 as it was lost due to a Docker Container crash… Irony!
I have an article in the recently released “DZone Guide to Building and Deploying Applications on the Cloud” entitled “Fullstack Engineering in the Age of Hybrid Cloud”. In this article I discuss the need for a Fullstack Engineer, and the skills one requires, when troubleshooting and repairing complex, distributed hybrid cloud applications. My recent experiences troubleshooting my Docker WordPress container only reinforce what I wrote in that piece. Without a comprehensive understanding of both the infrastructure and application layers, I don’t believe I could have reached a resolution (if I have, but more on that later).
My Docker WordPress container has always suffered from the “Error Connecting to Database” issue, but initially it would happen once a month and I would just restart the container. I had read that the issue was fixed in WordPress 4.5, so I upgraded, which came with its own challenges given that these containers are supposed to be immutable.
Unfortunately, I designed my container when Docker architecture was in its infancy. Separating and linking a MySQL container and a WordPress container, and storing data on a separate volume, are features that emerged, or became easier to use, in later versions. Eventually, I will need to redesign around 1.11 features, but for now I’m just trying to keep what I currently have running. I did try moving the database files onto permanent storage mapped into the container as a volume, but all I did was fight with file permissions for a day, and MySQL never ended up starting.
Recently, it became more and more difficult to keep the container up, so I upgraded to the latest Ubuntu 14.04 kernel, and when that didn’t seem to help I upgraded Docker from 1.4 to 1.11. Neither upgrade corrected the issue. However, with Docker 1.11’s new architecture and its use of cgroups, the cgroup out-of-memory killer started posting messages to my console.
Now I could see that mysqld was being terminated at some point due to insufficient memory. To solve the memory issue, I tried optimizing the WordPress LAMP stack for low memory and even migrated from a 1 GB virtual machine to a 2 GB instance. It seemed that no matter how much memory I threw at the problem, the longest the WordPress site would stay up before the database connection issue appeared was an hour.
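For anyone wanting to confirm the same thing from the kernel log rather than from console messages, here is a minimal sketch in Python. It assumes the kernel reports each victim with a line containing “Killed process <pid> (<name>)”, which is the shape kernels of that era typically used; the function name and the sample lines are invented for illustration and are not taken from my server.

```python
import re

# Assumption: the kernel logs each OOM victim as "Killed process <pid> (<name>) ...".
# Adjust the pattern if your kernel words the message differently.
KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(lines):
    """Return a list of (pid, process_name) for every OOM kill found in lines."""
    kills = []
    for line in lines:
        match = KILLED_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


# Made-up sample input, only to show the shape of the result.
SAMPLE = [
    "[1001.000001] Memory cgroup out of memory: Kill process 4242 (mysqld) score 500 or sacrifice child",
    "[1001.000002] Killed process 4242 (mysqld) total-vm:100000kB, anon-rss:50000kB, file-rss:0kB",
    "[1002.000000] some unrelated kernel message",
]

for pid, name in find_oom_kills(SAMPLE):
    print(f"OOM killer terminated {name} (pid {pid})")
```

In practice the lines would come from the output of dmesg or from the kernel log file rather than from a hard-coded list.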
Totally baffled at this point, I started chasing down a lead regarding WordPress issues occurring on my cloud service provider. It seemed the issue I was seeing was happening to many others on DigitalOcean, so perhaps this was a VPS issue (DO’s Droplet architecture is VPS-based) and not a Docker issue. DO responded to the various postings on its forum, stating that running out of memory is a common result of the known XML-RPC denial-of-service attack. XML-RPC is the remote API for WordPress.
Wait! What am I doing? No one’s going to bother attacking my little old blog, it can’t be that. Back to optimizing memory use. Oh crud, this is still not getting me anywhere after two weeks.
Unfortunately, my immutable container architecture again limited my ability to see logs, and SSH connections were often terminated due to low memory as well. Whenever I terminated the container without committing it, the logs were lost. So I had to modify the current container to use an external volume for all the log files, which are now written out to permanent storage.
Whoa! What do I find in the apache2 access.log the next time the issue occurs? Well, when I did a tail of the last 200 entries I found my site was being attacked by a Googlebot, and there were a lot more entries in addition to those. In the end, I was a victim of a denial-of-service attack.
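Eyeballing a tail works, but a quick tally makes the pattern harder to miss. The following is a minimal sketch in Python that counts requests per client IP and per user agent and counts POSTs to /xmlrpc.php. It assumes Apache’s combined log format; the function names, the regular expression, and the sample lines (which use documentation-only IP addresses) are my own illustration, not output from my server.

```python
import re
from collections import Counter, deque

# Assumption: Apache "combined" log format:
#   host ident user [time] "METHOD path protocol" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def tail_lines(path, count=200):
    """Return the last `count` lines of a file, similar to `tail -n 200`."""
    with open(path, errors="replace") as handle:
        return list(deque(handle, maxlen=count))


def summarize(lines):
    """Tally requests per IP and per user agent, plus POSTs to /xmlrpc.php."""
    ips, agents = Counter(), Counter()
    xmlrpc_posts = 0
    for line in lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        ips[match["ip"]] += 1
        agents[match["agent"] or "-"] += 1
        if match["method"] == "POST" and match["path"].startswith("/xmlrpc.php"):
            xmlrpc_posts += 1
    return ips, agents, xmlrpc_posts


# Made-up sample entries, only to show the shape of the result.
SAMPLE = [
    '203.0.113.7 - - [29/Apr/2016:10:00:01 +0000] "POST /xmlrpc.php HTTP/1.0" 200 370 "-" "ExampleAgent/1.0"',
    '203.0.113.7 - - [29/Apr/2016:10:00:02 +0000] "POST /xmlrpc.php HTTP/1.0" 200 370 "-" "ExampleAgent/1.0"',
    '198.51.100.9 - - [29/Apr/2016:10:00:03 +0000] "GET / HTTP/1.1" 200 5120 "-" "ExampleBrowser/2.0"',
]

ips, agents, xmlrpc_posts = summarize(SAMPLE)
print("Top clients:", ips.most_common(5))
print("Top agents:", agents.most_common(5))
print("POSTs to /xmlrpc.php:", xmlrpc_posts)
```

To run it against a real log, pass the result of tail_lines() to summarize(), pointing tail_lines() at wherever your container writes access.log.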
I believe it’s important to look at what data I had available and the characteristics identified by the logs and error messages. Nothing screamed “DoS attack consuming a massive number of threads on the Apache server and driving free memory to 0, so that the memory manager was sacrificing threads to keep the OS alive” (does that make anyone else think of Kirk screaming to Scotty, “all power to life support”?). When the attack stopped, mysqld_safe restarted mysqld, but it seems the socket or some other interprocess mechanism didn’t allow WordPress to communicate with MySQL.
Piecing this together after the fact required a mix of skills. It might have been easier if I had been doing live monitoring and tracking inbound requests while also constantly checking that WordPress could communicate with MySQL, but realistically, that is a drastic step to take only when all else has failed.
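The “constantly checking” part does not have to be elaborate. Below is a minimal sketch in Python of a probe that fetches a page and reports whether it looks like the database error page. The marker text is an assumption (WordPress typically shows “Error establishing a database connection”), and the function names are hypothetical; nothing here is what I actually ran.

```python
import urllib.error
import urllib.request

# Assumption: the text WordPress typically shows when it cannot reach MySQL.
DB_ERROR_MARKER = "Error establishing a database connection"


def looks_like_db_error(status, body, marker=DB_ERROR_MARKER):
    """True if the response is a server error or contains the error marker."""
    return status >= 500 or marker.lower() in body.lower()


def check_site(url, timeout=10):
    """Fetch url and return 'ok', 'db-error', or 'unreachable'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
            body = response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx; the error object still carries the body.
        status = err.code
        body = err.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        return "unreachable"
    return "db-error" if looks_like_db_error(status, body) else "ok"
```

Calling check_site() with the blog’s URL from a cron job every minute, and logging the result with a timestamp, would give a timeline to line up against the access log and the out-of-memory messages.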
Through this I learned a lot about container architecture, but the underlying issue is probably still lingering. For now, I’m just denying all requests to XML-RPC from outside IP addresses, and the WordPress site has been up for over 24 hours. More importantly, this really reinforces what I wrote about in the article: I don’t believe I could have reached this point without a good understanding of the infrastructure, operating system, networking, Docker and the LAMP stack.