On Sunday last, a Linux server in my infrastructure that was running a fairly conservative number of Docker containers in production was brought to its knees. The monitoring data (from Prometheus) showed that CPU usage jumped from an average of less than 2% to a steady 75% or so and stayed there until the server was rebooted. Notably, disk usage and throughput went down during the event, and memory usage did not change much, nor was it particularly high.
On review of the messages log, one of the last entries before the event recorded an Apache OOM (out of memory) event. On this server, Apache only runs inside containers, which are generally limited to 500MB (by Docker). So presumably, a Docker container running Apache ran out of memory and tried to recover some memory, and that was what triggered the event.
Reviewing the log of requests before the emergency, it's not clear which container, URL or URLs might have been generating so much memory use. It is a fast server, but the Apache containers are all running mod_php, so it's entirely possible that some sequence of requests could spawn a lot of parallel Apache workers, all bloated with PHP memory.
This particular infrastructure design has been in production for a year and a half, and this server is the fastest and most lightly loaded, so it makes me think that it's not an obvious general design problem, but perhaps an edge case specific to one of my recently adopted sites. When reviewing the URLs that were accessed in the minutes before the OOM event, three possibilities stand out: an old Drupal 6 site with a calendar view that is being spidered, the same site's small custom code app, and a new WordPress site showing lots of admin-ajax.php calls.
For all of these cases, reducing the PHP memory limit, reducing the maximum number of Apache workers and/or increasing the memory limit of the Docker containers would all be reasonable strategies.
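To get a feel for how those three knobs trade off, here is a rough back-of-envelope sketch in Python. Only the 500MB container limit comes from my setup as described above; the worker count, PHP memory limit and per-worker overhead are illustrative assumptions, not values read from my actual configuration.

```python
# Back-of-envelope check: can the Apache workers outgrow the container?
# Only the 500MB container limit comes from the setup described above;
# the other numbers are illustrative assumptions.

def worst_case_mb(max_workers, php_memory_limit_mb, base_worker_mb):
    """Memory needed if every worker hits the PHP memory limit at once."""
    return max_workers * (base_worker_mb + php_memory_limit_mb)


def max_safe_workers(container_limit_mb, php_memory_limit_mb, base_worker_mb):
    """Largest worker count whose worst case still fits in the container."""
    return container_limit_mb // (base_worker_mb + php_memory_limit_mb)


CONTAINER_LIMIT_MB = 500   # Docker memory limit mentioned in the post
MAX_WORKERS = 150          # assumed MaxRequestWorkers value
PHP_MEMORY_LIMIT_MB = 128  # assumed PHP memory_limit
BASE_WORKER_MB = 10        # assumed per-worker overhead before PHP allocates

print(worst_case_mb(MAX_WORKERS, PHP_MEMORY_LIMIT_MB, BASE_WORKER_MB))          # 20700
print(max_safe_workers(CONTAINER_LIMIT_MB, PHP_MEMORY_LIMIT_MB, BASE_WORKER_MB))  # 3
```

The worst case will almost never happen in practice, since most requests use a fraction of the PHP limit, but the gap between the two numbers shows why a burst of heavy requests could plausibly push a container past its limit.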
BUT - what surprises me the most is that this kind of event isn't better handled already. Specifically, a much better strategy for an out of memory event would be to shut down the container and restart it. Obviously, that wouldn't be the responsibility of the Apache process itself, but it does surprise me that the standard Apache image doesn't have some magic in it to help facilitate this kind of option.
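Lacking that magic, one option would be a small watchdog on the host. The sketch below is a minimal illustration, not something I'm running: it reads the `State.OOMKilled` flag from `docker inspect` output and restarts any container that has it set. The function names are hypothetical, and whether Docker sets that flag when only a child Apache worker (rather than the container's main process) is killed is an assumption that would need to be verified on the Docker version in use.

```python
import json
import subprocess
from typing import List


def oom_killed(inspect_json: str) -> List[str]:
    """Return names of containers whose State.OOMKilled flag is set.

    Expects the JSON array printed by `docker inspect <name> ...`.
    """
    containers = json.loads(inspect_json)
    return [
        c["Name"].lstrip("/")  # docker inspect reports names with a leading slash
        for c in containers
        if c.get("State", {}).get("OOMKilled")
    ]


def restart_oom_killed(container_names: List[str]) -> List[str]:
    """Restart every listed container that Docker reports as OOM-killed."""
    result = subprocess.run(
        ["docker", "inspect", *container_names],
        capture_output=True, text=True, check=True,
    )
    restarted = oom_killed(result.stdout)
    for name in restarted:
        subprocess.run(["docker", "restart", name], check=True)
    return restarted
```

Something like `restart_oom_killed(["site-a", "site-b"])` (hypothetical container names) could then be run from cron every minute or so.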
Of course, this example is also a reasonable argument for the use of php-fpm, but up until now I'd been crossing my fingers that Varnish in front of my containers might go some way to handling the common mod_php complaints about memory usage.
I'll also confess that I have not been monitoring container memory usage (because cAdvisor is such a resource hog), but doing so would be smart.
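A lighter-weight alternative might be to poll `docker stats` now and then. The sketch below is a minimal illustration in Python, assuming the per-line JSON that `docker stats --no-stream --format "{{json .}}"` prints, with its `Name` and `MemPerc` fields; the 80% threshold and the function names are my own placeholders.

```python
import json
import subprocess
from typing import Dict, List


def parse_stats(output: str) -> Dict[str, float]:
    """Map container name to memory use as a percentage of its limit."""
    usage = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)  # one JSON object per container
        usage[row["Name"]] = float(row["MemPerc"].rstrip("%"))
    return usage


def over_threshold(usage: Dict[str, float], threshold: float = 80.0) -> List[str]:
    """Names of containers at or above the (assumed) warning threshold."""
    return sorted(name for name, pct in usage.items() if pct >= threshold)


def current_usage() -> Dict[str, float]:
    """Take a single snapshot of all running containers."""
    result = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{json .}}"],
        capture_output=True, text=True, check=True,
    )
    return parse_stats(result.stdout)
```

Calling `over_threshold(current_usage())` from a cron job would give a list of containers worth a closer look, without keeping a monitoring daemon running.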
The Tyee is a site I've been involved with since 2006, when I wrote the first (4.7) version of a Drupal module to integrate Drupal content into a static site that was being generated from Bricolage. About a year ago, I met with Dawn Buie and Phillip Smith and we mapped out a number of ways to improve the Drupal integration on the site, including upgrading Drupal from version 4.7 to version 5. Various parts of that grand plan have been slowly incorporated into the site, but as of next week, there'll be a big leap forward that coincides with a new design [implemented in Bricolage by David Wheeler, who wrote and maintains Bricolage] as well as a new Drupal release of the Bricolage integration module.
Plans
Application integration is tricky, and my first time round had quite a few issues. Here's a list of the improvements in the latest version:
File space separation. Before, Drupal was installed in the Apache document root, which is where Bricolage was publishing its co...