Problem
A multi-process program using Open MPI 1.2.6 with OFED 1.2.5 fails on some nodes with the following messages:
[0,1,0][/home/henrik/src/openmpi-1.2.6/ompi/mca/btl/openib/btl_openib_component.c:1334:btl_openib_component_progress] from hidden.hidden.dk to: hidden.hidden.dk error polling HP CQ with status LOCAL PROTOCOL ERROR status number 4 for wr_id 7510264 opcode 42
[hidden.hidden:29673] [0,1,7]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
The error was caused by a call to system(3) in the short interval between MPI_Send and the corresponding MPI_Recv. The call corrupted the send buffer, which in turn made the MPI_Recv call fail.
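A minimal sketch of how the system(3) call might be isolated so that it can be moved out of that interval. The helper name run_shell_command is hypothetical and does not come from the failing program; the sketch assumes a POSIX system and only wraps system(3), it does not make the call safe to use between MPI_Send and MPI_Recv.

```c
#include <stdlib.h>     /* system */
#include <sys/wait.h>   /* WIFEXITED, WEXITSTATUS */

/*
 * Hypothetical helper (not part of the original program): run a shell
 * command and return its exit status, or -1 if the command could not be
 * started or did not exit normally.
 *
 * Assumption: callers invoke this either before the MPI_Send is issued
 * or after the matching MPI_Recv has completed, never in between.
 */
int run_shell_command(const char *cmd)
{
    int status = system(cmd);

    if (status == -1 || !WIFEXITED(status))
        return -1;

    return WEXITSTATUS(status);
}
```

With the external command behind one function, it is easier to check where it sits relative to the communication calls. Running it before the send or after the receive should keep it out of the interval described above; whether that is enough for a given program is an assumption to verify on the affected nodes.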