sg14: Re: [SG14] [EXTERNAL] Re: Unusual environment error handling

From: Hoemmen, Mark <mhoemme_at_[hidden]>
Date: Tue, 3 Dec 2019 20:55:37 +0000

Hi Ben! From an “HPC cluster running MPI” perspective, uncaught C++ exceptions are supposed to trigger MPI_Abort (like std::terminate), but in practice may cause the application to deadlock.

Synchronization to check errors is expensive and can be hard to implement in some contexts (e.g., when using subcommunicators – subsets of the entire parallel machine).

Best practice is for users of MPI to maintain “local” and “global” error states and carefully piggyback error information on communication so that it can propagate, but I don’t see a lot of applications doing that in a consistent way.
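The local/global error-state idea can be sketched without MPI at all. The following is a minimal, hypothetical illustration and not code from any real application: the names `ErrorCode`, `Message`, `RankState`, and `ring_exchange` are assumptions made for this example, and the "ranks" are plain objects in one process standing in for MPI processes. Each message carries the sender's worst known error, so the error spreads over communication that happens anyway, with no extra synchronization.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical error codes; 0 means "no error", larger means more severe.
enum class ErrorCode : std::int32_t { ok = 0, bad_input = 1, solver_failed = 2 };

// An application payload with the sender's error state piggybacked on it.
struct Message {
    double value;
    ErrorCode sender_error;
};

// Per-rank state: "local" is what this rank detected itself,
// "global" is the worst error it has learned about so far.
struct RankState {
    ErrorCode local = ErrorCode::ok;
    ErrorCode global = ErrorCode::ok;

    void raise(ErrorCode e) {
        local = std::max(local, e);
        global = std::max(global, e);
    }
    Message make_message(double value) const { return {value, global}; }
    void receive(const Message& m) { global = std::max(global, m.sender_error); }
};

// Stand-in for communication the application performs anyway:
// every rank sends to its right neighbour, (size - 1) times around a ring.
inline void ring_exchange(std::vector<RankState>& ranks) {
    const std::size_t n = ranks.size();
    for (std::size_t step = 0; step + 1 < n; ++step) {
        std::vector<Message> in_flight;
        in_flight.reserve(n);
        for (const RankState& r : ranks) in_flight.push_back(r.make_message(0.0));
        for (std::size_t i = 0; i < n; ++i) ranks[(i + 1) % n].receive(in_flight[i]);
    }
}
```

In this sketch, an error raised on one rank reaches every other rank's `global` state after the ring exchange, while each rank's `local` state still records only what it detected itself. A real MPI code would carry the error field in its own message buffers, and would have to decide how to handle subcommunicators.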

mfh

On 12/3/19, 1:06 PM, "SG14 on behalf of Tjernstrom, Staffan via SG14" <sg14-bounces_at_[hidden] on behalf of sg14_at_[hidden]> wrote:

From an FPGA perspective, in our case we sit at the end of a DMA-driven message queue. Inasmuch as we ever pass up errors (as opposed to a plain hardware crash - undefined behaviour in hardware is not fun), it's as a specific message on that queue. That probably mirrors something outcome-like in the CPU realm. A hardware crash causes a watchdog callback that triggers extraordinary shutdown mechanisms in the CPU code - so akin to exception handling, but using our own triggers / handlers.
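As a rough CPU-side analogy for "an error is just a specific message on the queue", here is a minimal sketch in C++17. It is not the actual FPGA interface: `DataMessage`, `ErrorMessage`, `QueueEntry`, `Summary`, and `drain` are hypothetical names, and a `std::deque` stands in for the DMA-driven queue. The consumer inspects each entry and either uses the payload or stops at the first error, much as one would with an outcome-like return type.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <optional>
#include <variant>

// Hypothetical message kinds placed on the queue by the device.
struct DataMessage { std::uint32_t payload; };
struct ErrorMessage { std::uint16_t code; };

// Every queue entry is either data or an error report.
using QueueEntry = std::variant<DataMessage, ErrorMessage>;

// Result of draining the queue: the payload sum, plus the first error seen (if any).
struct Summary {
    std::uint64_t sum = 0;
    std::optional<std::uint16_t> first_error;
};

// Consume entries in order and stop at the first error message,
// leaving any later entries on the queue.
inline Summary drain(std::deque<QueueEntry>& queue) {
    Summary s;
    while (!queue.empty()) {
        QueueEntry entry = queue.front();
        queue.pop_front();
        if (const auto* data = std::get_if<DataMessage>(&entry)) {
            s.sum += data->payload;
        } else {
            s.first_error = std::get<ErrorMessage>(entry).code;
            break;
        }
    }
    return s;
}
```

The watchdog path described above is deliberately left out of this sketch; it covers only the in-band error message.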

From: SG14 [mailto:sg14-bounces_at_[hidden]] On Behalf Of Ben Craig via SG14
Sent: Tuesday, 03 December, 2019 14:47
To: sg14_at_[hidden]
Cc: Ben Craig <ben.craig_at_[hidden]>; Olivier Giroux <OGiroux_at_[hidden]>
Subject: [SG14] Unusual environment error handling

Olivier Giroux had a comment on one of the other reflectors a few months ago, asking whether we are only focusing on yesterday's error handling problems, and not looking enough at tomorrow's error handling problems.

With that in mind, what is the status quo for error handling in various unusual environments? I think I've got a good handle on the status quo for kernel and embedded systems, but I don't know what people do in other environments (e.g. GPUs, FPGAs, HPC clusters, etc.).

Do these systems all use some variation of return codes? Exceptions? Perhaps the only error conditions are ones that can be expressed and propagated with floating point NaNs? Maybe applications in these environments almost never run into errors that need to be propagated up the stack?
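For readers unfamiliar with the NaN option mentioned above, a small sketch of what it looks like in practice: a computation signals a domain error by producing a quiet NaN, the NaN flows through later arithmetic, and the caller checks once at the end. The functions `checked_sqrt` and `sum_of_roots` are hypothetical names for this example. The sketch assumes IEEE 754 semantics, so it would not be reliable under compiler options such as fast-math modes that let the optimizer assume NaNs never occur.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical kernel: reports a domain error by returning a quiet NaN
// instead of using a return code or an exception.
inline double checked_sqrt(double x) {
    return x < 0.0 ? std::numeric_limits<double>::quiet_NaN() : std::sqrt(x);
}

// The NaN propagates through the additions, so no per-element check is needed.
inline double sum_of_roots(const std::vector<double>& xs) {
    double total = 0.0;
    for (double x : xs) total += checked_sqrt(x);
    return total;
}
```

A caller would test the final result with `std::isnan`. The cost of this approach is that the NaN says only that something failed, not what or where.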

Please, share your experiences.


Received on 2019-12-03 14:58:01