Re: Autobuild hung today
Date: Sun, 1 Apr 2007 20:24:49 +0200
> On Sat, Mar 31, 2007 at 07:37:10PM +0200, Guenter Knauf wrote:
>> Yesterday I had this total hang, and it made the machine inaccessible
>> remotely.
> The hanging problem is known, but I don't see how it could make the
> machine inaccessible. In any case, this is clearly a serious problem. I've just
> disabled the SSH tests for now until this can be sorted out. The sshd
> started for the curl tests shouldn't have any effect on remote access--it
> runs on a different port and should be completely independent of the sshd
> running on port 22 (I'm assuming that's where you're having the remote
> accessibility problem).
No. The machine became totally inaccessible because of 100% CPU usage, and the hard disk was constantly busy; it wasn't even possible to reboot the machine properly from the physical console.
I have no idea yet what really happened. All I can say is that the last sign of life from the box was the autobuild; after that all services stopped responding, and the HD stayed extremely busy.
>> In addition now after the recent changes I see that none of the test
>> servers start anymore;
>> since I changed the user running autobuilds from root to a normal user I
>> thought first that I missed something, but I can see a couple of other
>> autobuilds now suffering from the same issue....
>> any ideas, someone?
> I posted the patch to fix the hang here and on the libssh2 mailing list
> yesterday; I hope it's applied soon. I suspect the hang caused all the
> test servers to fail to be shut down properly when the process was finally
> killed manually after hanging, causing problems on future test runs.
I was forced to hard-reset the box after the hang, so that was not the problem in the first place; however, you were right that for whatever reason all test servers hung on the next run.
My guess is that CVS didn't do its job right and improperly updated some (test?) files...
I've seen that a couple of times in the past and don't know why it happens; normally it should rename the old file and put the new one from CVS in place in case it can't patch, but for reasons unknown to me this sometimes doesn't work.
I've just checked a bit with older snapshot builds, which worked fine; then I removed the old curl dir and fetched a fresh copy from CVS, and it seems this has fixed it (hopefully; it's too early to say for sure).
Perhaps it's possible to make an initial check for test servers still hanging around, and kill them before we try to start another test round?
Perhaps in testcurl.pl, when we check for old build-* dirs, we should first look for the *.pid files and use them to kill the hanging test servers before we remove the dirs?
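To make the idea concrete, here is a rough sketch of the pid-file cleanup. It is written in Python only for illustration and is not taken from testcurl.pl. It assumes each pid file ends in ".pid" and holds a single numeric PID; the helper names (find_pid_files, read_pid, kill_stale_servers) are hypothetical.

```python
import os
import signal


def find_pid_files(build_root):
    """Yield every *.pid file found below a build-* directory in build_root."""
    for entry in sorted(os.listdir(build_root)):
        top = os.path.join(build_root, entry)
        if not (entry.startswith("build-") and os.path.isdir(top)):
            continue
        # os.walk also reports dot-files, which shell-style globbing would skip.
        for dirpath, _dirs, files in os.walk(top):
            for name in sorted(files):
                if name.endswith(".pid"):
                    yield os.path.join(dirpath, name)


def read_pid(pid_file):
    """Return the PID stored in pid_file, or None if unreadable or malformed."""
    try:
        with open(pid_file) as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError):
        return None
    # Reject 0 and negative values: kill() treats those as process groups.
    return pid if pid > 0 else None


def kill_stale_servers(build_root, kill=os.kill):
    """Send SIGTERM to each process named in a pid file under build-* dirs.

    Returns the list of PIDs that were signalled successfully. The kill
    argument is injectable (assumption: for dry runs and testing).
    """
    signalled = []
    for pid_file in find_pid_files(build_root):
        pid = read_pid(pid_file)
        if pid is None:
            continue
        try:
            kill(pid, signal.SIGTERM)
        except (ProcessLookupError, PermissionError):
            # Process already gone, or owned by another user: nothing to do.
            continue
        signalled.append(pid)
    return signalled
```

The old build-* dirs would only be removed after kill_stale_servers() has run, since the pid files live inside them. One caveat: a stale pid file could name a PID that has since been reused by an unrelated process, so a real implementation would probably want to check the process name before signalling.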
Received on 2007-04-01