I found this little problem the other day: there's a server that runs for a while and then falls over. It's then restarted by its startup script and the whole process repeats itself. That doesn't sound too bad, as the server isn't business critical, but there is a significant loss of data, so I decided to take a closer look and find out exactly what's going wrong. The first thing to note is that the server passes all its unit tests and a whole bunch of integration tests. It runs well in all test environments using test data, so what's going wrong in production? It's easy to guess that in production it's probably under a heavier load than in test, or than was allowed for in its design, and is therefore running out of resources. But which resources, and where? That's the tricky question.