Today, JupyterHub became unusable. I said that it might have something to do with the fact that so many of you are using the Remote SSH extension of VSCode. I looked into it and am now almost certain that this was the culprit. I implemented a workaround, so it should not happen again, but the workaround may cause another problem; please keep reading if you work with the VSCode Remote SSH extension.
The issue is that, if you connect to a server using the VSCode Remote SSH extension and then disconnect from it, some processes are left running, uselessly consuming memory. As far as I can tell, they don’t know when to quit and stay forever. This happens regardless of whether you gracefully "Close Remote Connection" or simply close the VSCode window. As time goes by, such processes pile up and eventually exhaust the entire memory of the server.
Below I demonstrate that such processes pile up when you repeatedly open and close connections. The right window repeatedly runs "pstree u24099" to show the tree of processes run by user u24099.
So I wrote a monitoring script that keeps watching processes and kills those that "should not be running." The question is how to know whether a process "should not be running." Using heuristics based on observation, the script kills processes that meet the following conditions.
The first one is obvious.
The second condition is the most important. The observed problem was that, even though the sshd process (the process that runs on the server while you are connected via SSH) exits after you close the remote SSH connection, its descendant processes (bash and dozens of VSCode-related processes) keep running. In Linux (Unix), when a process loses its parent and thus becomes an "orphan" process, the process with ID 1 (called init) becomes its nominal parent. I use this as a sign that the process should be killed (note that, if I kill an orphan process, its children in turn become orphans, which will then be killed in the next shot).
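The orphan test described above can be sketched in Python as follows. This is only an illustration, not the actual monitoring script: it assumes a Linux-style /proc filesystem, the helper names parse_stat and find_orphans are made up for this example, and restricting the search to a single user ID is a choice made here, not something taken from the script.

```python
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, ppid).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so we split around the LAST ')'.
    """
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    comm = text[left + 1:right]
    fields = text[right + 1:].split()  # fields[0] = state, fields[1] = ppid
    return pid, comm, int(fields[1])


def find_orphans(uid, proc_root="/proc"):
    """Return (pid, comm) for processes owned by uid whose parent is PID 1.

    Hypothetical helper: proc_root is a parameter only so that the
    function can be tried against a fake directory tree.
    """
    orphans = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # not a process directory
        path = os.path.join(proc_root, name)
        try:
            if os.stat(path).st_uid != uid:
                continue
            with open(os.path.join(path, "stat")) as f:
                pid, comm, ppid = parse_stat(f.read())
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # the process exited (or is unreadable) mid-scan
        if ppid == 1:
            orphans.append((pid, comm))
    return orphans
```

Calling find_orphans(os.getuid()) on a Linux machine lists your own processes whose nominal parent is process 1; it only reports them and kills nothing.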
Here is a demonstration showing that those leftover processes are killed by the monitoring process. The left window runs a process that kills all such processes every five seconds.
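To see why a repeated sweep eventually removes a whole leftover tree even though each pass only kills direct children of process 1, here is a toy simulation. It touches no real processes: the process table is modelled as a plain dictionary from PID to parent PID, and the names sweep and sweep_until_clean are hypothetical.

```python
def sweep(table):
    """One pass over a simulated process table (dict: pid -> ppid).

    Kills every process whose parent is PID 1, then re-parents the
    children of the killed processes to PID 1, mimicking what happens
    to real orphans. Modifies `table` in place; returns the killed PIDs.
    """
    killed = [pid for pid, ppid in table.items() if ppid == 1]
    for pid in killed:
        del table[pid]
    for pid in list(table):
        if table[pid] in killed:
            table[pid] = 1  # newly orphaned, caught by the next pass
    return sorted(killed)


def sweep_until_clean(table):
    """Repeat sweep() until a pass kills nothing; return the kills per pass."""
    rounds = []
    while True:
        killed = sweep(table)
        if not killed:
            return rounds
        rounds.append(killed)


# A made-up leftover tree: 100 is an orphaned shell, 101 and 103 are its
# children, and 102 is a child of 101.
leftover = {100: 1, 101: 100, 102: 101, 103: 100}
print(sweep_until_clean(leftover))  # [[100], [101, 103], [102]]
```

The tree is gone after as many passes as it is deep. In a real monitor, the passes would be separated by a pause, such as the five seconds used in the demonstration.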
Not all orphan processes should be killed. There may be a process that is an orphan but is still doing something useful (those produced by the VSCode Remote SSH extension clearly are not). As a matter of fact, /usr/lib/systemd seems to start running and immediately become an orphan when you connect via SSH, but it exits in an orderly fashion after you close the remote connection. The third condition avoids shooting this very process.
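How the second and third conditions combine can be sketched like this. It is a guess at the shape of the check, not the real script: the Proc record, the EXEMPT_PREFIXES tuple, and the idea of matching on the start of the command line are all assumptions made for illustration, and the first condition is left out.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    """Hypothetical snapshot of one process."""
    pid: int
    ppid: int
    cmdline: str


# Assumption: exempt processes are recognised by how their command line starts.
EXEMPT_PREFIXES = ("/usr/lib/systemd",)


def should_kill(proc, exempt=EXEMPT_PREFIXES):
    """Second condition: the parent is PID 1. Third: not on the exempt list."""
    is_orphan = proc.ppid == 1
    is_exempt = proc.cmdline.startswith(tuple(exempt))
    return is_orphan and not is_exempt


# Made-up examples: an orphaned leftover, the exempt systemd process,
# and a process that still has a living parent.
print(should_kill(Proc(11, 1, "node server.js")))                   # True
print(should_kill(Proc(10, 1, "/usr/lib/systemd/systemd --user")))  # False
print(should_kill(Proc(12, 11, "node server.js")))                  # False
```

Keeping the exemptions in one tuple means that other software that legitimately runs as an orphan could be spared by adding its prefix there.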
You don’t have to know the exact details of what the monitoring process does, or run such a monitoring process yourself. I am taking care of all of it. What you should know is that, as should be clear from the description above, it may backfire; it may shoot necessary processes. You may be using other software that correctly manages orphan processes that you don’t want shot.
If you observe your (useful) processes repeatedly dying while you are actively using them, please let me know, with a description of the problem.
There is nothing special about this (taulec.zapto.org) environment. That means the same thing should be happening whenever the Remote SSH extension of VSCode is used, anywhere and by anybody ... After writing this, I thought it would be very surprising if that were indeed the case and the bug had gone unnoticed or unfixed. A quick Google search led me to this thread, which says that this is the expected behavior and that the processes will shut down after eight hours: they are (intentionally) kept running for eight hours in case the user reconnects shortly. Well, we could wait eight hours and see what happens, but the demonstration showed that a fresh process tree gets created even when I reconnect immediately, and this is probably why the server quickly ran out of memory during the lecture. If waiting eight hours is the expected behavior, spawning a fresh process tree on every reconnect is definitely not. Let me know if you find better information about the issue. Even better, after confirming that this is indeed acknowledged as an unsolved issue, don’t you want to become the one who fixes it for the world?