On Sunday last, a (Linux) server in my infrastructure that was running a fairly conservative number of Docker containers in production was brought to its knees. The monitoring data (from Prometheus) showed that CPU was all gobbled up (from an average of less than 2% to a steady 75%-ish) and remained gobbled up until the server was rebooted. Notably, disk usage and throughput went down during the event, and memory usage did not change appreciably, nor was it particularly high.
On review of the messages log, one of the last entries before the event recorded an Apache OOM (out of memory) event. On this server, Apache is only running inside containers, which are generally limited to 500 MB (by Docker). So presumably, a Docker container running Apache ran out of memory and tried to recover some, and that was what triggered the event.
Reviewing the log of requests before the emergency, it's not clear which container, URL or URLs might have been generating so much memory use. It is a fast server, but the Apache containers are all running mod_php, so it's entirely possible that some sequence of requests could spawn a lot of parallel Apache workers, all bloated with PHP memory.
This particular infrastructure design has been in production for a year and a half, and this server is the fastest and most lightly loaded, so it makes me think that it's not an obvious general design problem, but perhaps an edge case specific to one of my recently adopted sites. When reviewing the URLs that were accessed in the minutes before the OOM event, three possibilities stand out: an old Drupal 6 site with a calendar view that is being spidered, a small custom code app inside that same site, and a new WordPress site showing lots of admin-ajax.php calls.
For all of these cases, reducing the PHP memory limit, reducing the maximum number of Apache workers and/or increasing the memory of the Docker containers would all be reasonable strategies.
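To get a feel for how those three knobs trade off against each other, here is a rough back-of-the-envelope sketch in Python. The 500 MB limit is the one my containers use; the per-worker figures and the 50 MB reserve are assumptions for illustration, not measurements from my sites.

```python
def max_safe_workers(container_limit_mb, per_worker_mb, reserve_mb=50):
    """Largest number of mod_php workers that fits inside the container.

    per_worker_mb is the worst case for a single worker (roughly the PHP
    memory limit plus Apache overhead). reserve_mb is headroom for the
    parent process and anything else in the container. Both are assumed
    values and should be replaced with real measurements.
    """
    usable_mb = container_limit_mb - reserve_mb
    if usable_mb <= 0 or per_worker_mb <= 0:
        return 0
    return usable_mb // per_worker_mb


if __name__ == "__main__":
    # Compare a few hypothetical per-worker footprints against a 500 MB limit.
    for per_worker in (64, 128, 256):
        workers = max_safe_workers(500, per_worker)
        print(f"{per_worker} MB per worker -> at most {workers} workers in 500 MB")
```

With numbers like these, a result of only a handful of workers suggests that the worker cap (MaxRequestWorkers in Apache 2.4) probably needs to be much lower than I'd assumed, or the container limit higher.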
BUT - what surprises me the most is that this kind of event isn't better handled already. Specifically - a much better strategy for an out of memory event would be to shut down the container and restart it. Obviously, that wouldn't be the responsibility of the Apache process itself, but it does surprise me that the standard Apache image doesn't have some magic in it to help facilitate this kind of option.
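In the absence of that magic, something like the following could run from cron on the host. It is only a sketch of the idea, not something I have in production: the 90% threshold and the container name are made up, and it restarts a container that is close to its limit rather than reacting to the OOM event itself. It leans on two Docker CLI commands, `docker stats --no-stream --format "{{.MemPerc}}"` and `docker restart`.

```python
import subprocess


def parse_mem_perc(text):
    """Turn docker's '93.50%' style output into a float (93.5)."""
    return float(text.strip().rstrip("%"))


def run_cli(args):
    """Run a command and return its stdout; raises if the command fails."""
    result = subprocess.run(args, check=True, capture_output=True, text=True)
    return result.stdout


def check_and_restart(container, threshold=90.0, run=run_cli):
    """Restart `container` if its memory use is at or above `threshold` percent.

    Returns True if a restart was issued, False otherwise. `run` is
    injectable so the decision logic can be tried without a Docker daemon.
    """
    out = run(["docker", "stats", "--no-stream",
               "--format", "{{.MemPerc}}", container])
    if parse_mem_perc(out) < threshold:
        return False
    run(["docker", "restart", container])
    return True


# Example (hypothetical container name), e.g. called once a minute:
#   check_and_restart("drupal6-site")
```

As far as I understand it, Docker's own restart policies only kick in when the container's main process exits, which is not what happens when a single Apache child is killed, hence the external check.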
Of course, this example is also a reasonable argument for the use of php-fpm, but up until now I'd been crossing my fingers that Varnish in front of my containers might go some way to handling the common mod_php complaints about memory usage.
I'll also confess that I have not been monitoring container memory usage (because cAdvisor is such a resource hog), but that would be smart.
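A much lighter alternative might be to read the kernel's own accounting for each container straight from the cgroup filesystem. The sketch below assumes cgroup v2, where a cgroup directory contains `memory.current` and `memory.max` (under cgroup v1 the equivalents are `memory.usage_in_bytes` and `memory.limit_in_bytes`). Where a given container's directory lives depends on the host's cgroup setup, so the path is passed in rather than guessed.

```python
from pathlib import Path


def read_memory(cgroup_dir):
    """Return (usage_bytes, limit_bytes) for one cgroup v2 directory.

    limit_bytes is None when the kernel reports "max", i.e. no limit.
    """
    base = Path(cgroup_dir)
    usage = int((base / "memory.current").read_text().strip())
    raw_limit = (base / "memory.max").read_text().strip()
    limit = None if raw_limit == "max" else int(raw_limit)
    return usage, limit


def describe(name, usage, limit):
    """Format one human-readable line for a container's memory use."""
    used_mib = usage / (1024 * 1024)
    if limit is None:
        return f"{name}: {used_mib:.0f} MiB used, no limit"
    return f"{name}: {used_mib:.0f} MiB used, {100 * usage / limit:.0f}% of limit"
```

Run from cron and written somewhere Prometheus can pick it up, a few lines like this would probably have told me which container was climbing before the event.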
I lived, worked and studied in Costa Rica from 1984 to 1989. Ostensibly, I was there to study Mathematics at the University, and indeed I graduated with an MSc. in Mathematics supervised by Ricardo Estrada (check that page, he even advertises me as one of his past students). And yes, I do have a nine page thesis that I wrote and defended in Spanish somewhere in my files, on a proof and extension of one of Ramanujan's theories. But mathematics is a pretty lonely endeavour, and what drew me back to Central America (after the first visit, which was more of an accident), was the life and politics. The time I lived there was extremely interesting (for me as an outsider, though also painful and tragic for its inhabitants) because of the various wars that were largely fuelled by US regional hegemonic interests (of the usual corporate suspects and individuals) and neglect (of the politicians and public) - the Contra war in Nicaragua, the full-scale guerrilla wars in El Salvador and...