Debian bug tracking URL: #550116

To: submit@bugs.debian.org
From: Ivan Zahariev <famzah@icdsoft.com>
Subject: False "Can't send mail: sendmail process failed" errors

Package: bsd-mailx
Version: 8.1.2-0.20071201cvs-3

You can find this bug report in HTML format at http://famzah.net/bsd-mailx-waitchild-bug/

Sending an email message sporadically produces a false error message. Here is an example:
~$ echo This delivery will actually succeed | mail root@example.com
Can't send mail: sendmail process failed
~$

The email message is actually sent and "sendmail" did not fail.

Here is a snippet of the source code of the affected functions and files: http://famzah.net/bsd-mailx-waitchild-bug/affected-source.c.html
This bug is caused by the patch in "send.c" for the bug report #145379.
A race condition occurs in the following sequence:
1. The parent fork()s a child process in "send.c", and the child exec()s "sendmail".
2. The child starts, finishes quickly and exits before the parent has called wait_child(pid) in "send.c".
3. The parent immediately gets SIGCHLD because the child has already exited. The sigchild() handler in "popen.c" reaps the child via waitpid() and returns immediately because findchild(pid, 1) returned NULL. It returned NULL because the PID of the child process was never added to the "child" structure list.
4. Execution of the parent resumes in "send.c", which now calls wait_child(pid). wait_child(pid) returns -1 because, in "popen.c", it calls waitpid(pid, ...) again for the same child PID, which the sigchild() handler has already reaped. The second call to findchild(pid, 1), made by wait_child(pid), also returns NULL because, as stated above, the PID of the child process was never added to the "child" structure list. As a result, the false error message "Can't send mail: sendmail process failed" is printed.
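
The sequence above can be sketched as a standalone C function. This is only a minimal illustration and not code from bsd-mailx: sigchld_reaper(), reaped_in_handler and late_wait_reports_false_failure() are hypothetical names, the handler merely imitates what sigchild() is described as doing, and the child calls _exit(0) in place of a fast, successful "sendmail".

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set by the handler once it has reaped a child. */
static volatile sig_atomic_t reaped_in_handler = 0;

/* Hypothetical stand-in for sigchild() in "popen.c": reaps whichever
 * child has exited, whether or not somebody will wait for it later. */
static void sigchld_reaper(int signo)
{
    int saved_errno = errno;

    (void)signo;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        reaped_in_handler = 1;
    errno = saved_errno;
}

/* Returns 1 if the late waitpid() fails with ECHILD although the child
 * exited successfully (the false failure), 0 if it does not fail that
 * way, and -1 on a setup error. */
int late_wait_reports_false_failure(void)
{
    struct sigaction sa, old_sa;
    pid_t pid;
    int status, result, i;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sigchld_reaper;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGCHLD, &sa, &old_sa) == -1)
        return -1;

    reaped_in_handler = 0;
    pid = fork();                       /* step 1 */
    if (pid == -1) {
        sigaction(SIGCHLD, &old_sa, NULL);
        return -1;
    }
    if (pid == 0)
        _exit(0);                       /* step 2: child finishes at once */

    /* Step 3: the parent "lags" until the handler has reaped the child
     * (bounded to a few seconds so the sketch cannot hang). */
    for (i = 0; i < 5 && !reaped_in_handler; i++)
        sleep(1);

    /* Step 4: the child is already gone, so this wait fails. */
    if (waitpid(pid, &status, 0) == -1 && errno == ECHILD)
        result = 1;
    else
        result = 0;

    sigaction(SIGCHLD, &old_sa, NULL);  /* restore the previous handler */
    return result;
}
```

Calling late_wait_reports_false_failure() from a small main() should return 1: the child exited with status 0, yet the parent's own waitpid() reports an error, which is the same shape as the false "sendmail process failed" message.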

This bug happens only rarely, usually when the system is under load and the parent process lags slightly behind the child. But it does happen: we send about 15 messages every hour on each of 36 servers, and we get about 10 false error messages per 24 hours on average (10 out of 15 * 36 * 24 = 12960 messages, a false error rate of about 0.08%).

To reproduce the problem reliably, add a sleep(5) in the parent process before the call to wait_child(pid) in "send.c". This simulates the task scheduler postponing the parent process until after the child process has exited. Note that Linux does not guarantee whether the child or the parent process will run or finish first, so the effect of this sleep() can occur on real systems, as it does on many of ours. Here is the modified "send.c", which you can use to always reproduce the bug: http://famzah.net/bsd-mailx-waitchild-bug/send-reproduce-bug.c.html

I developed a small patch to fix the problem: http://famzah.net/bsd-mailx-waitchild-bug/waitpid-sigchld.patch.html
A version suitable for downloading: http://famzah.net/bsd-mailx-waitchild-bug/waitpid-sigchld.patch
No error handling is done for the sig*() functions, but that is how the authors of bsd-mailx use them.
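
For readers who cannot open the links, here is a minimal sketch of one common way to close this kind of race: keep SIGCHLD blocked from before fork() until the parent has collected the child itself. This is an assumption about the general technique and is not a copy of the patch linked above; sigchld_reaper() and run_child_with_sigchld_blocked() are hypothetical names, and the child again calls _exit(0) in place of "sendmail".

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for sigchild() in "popen.c". */
static void sigchld_reaper(int signo)
{
    int saved_errno = errno;

    (void)signo;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        ;
    errno = saved_errno;
}

/* Returns the child's exit code (0 in this sketch), or -1 on failure. */
int run_child_with_sigchld_blocked(void)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask;
    pid_t pid;
    int status, result = -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sigchld_reaper;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGCHLD, &sa, &old_sa) == -1)
        return -1;

    /* Block SIGCHLD so the handler cannot run and reap the child
     * before the parent has waited for it. */
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old_mask);

    pid = fork();
    if (pid == 0) {
        /* Child: restore the mask so it is not inherited across exec(). */
        sigprocmask(SIG_SETMASK, &old_mask, NULL);
        _exit(0);
    }
    if (pid > 0) {
        sleep(1);   /* same artificial lag as in the reproduction */
        if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            result = WEXITSTATUS(status);
    }

    /* Unblock: a pending SIGCHLD may now run the handler, which finds
     * nothing left to reap and returns harmlessly. */
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGCHLD, &old_sa, NULL);
    return result;
}
```

With the signal blocked, the parent's waitpid() is the only call that can collect the child, so the artificial delay no longer turns a successful exit into an error. Like the patch, the sketch does not check the return values of the sigprocmask() calls.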

This affects all current versions of "bsd-mailx" in Debian >=5.0 and the old "mailx" in Debian 4.0. The problem was first encountered on Debian 4.0 with "mailx" version "8.1.2-0.20050715cvs-1" and was later confirmed and debugged on Ubuntu 9.04 (running Debian 5.0) with "bsd-mailx" version "8.1.2-0.20081101cvs". Ubuntu inherits these packages from Debian.