Follow-up simplifications after removing the zerocopy machinery, plus a real
backpressure bug the cleanup surfaced:
- SendImage(ZeroCopyReturnValue&) imposed a hard 2s deadline on enqueueing and
marked the connection broken when it expired. At high frame rates the 128-deep
queue fills in tens of ms, so any filesystem stall longer than ~2s dropped the run
even though the writer was alive and heartbeating -- defeating the whole
BUSY-heartbeat backpressure design. Block instead while the peer is alive
(!broken && active); the real liveness decision already lives in SendAll's
peer-liveness timeout, which the writer's BUSY heartbeats keep fresh. This
makes the queue path consistent with the send path: both wait out arbitrarily
long stalls and only give up when the peer goes genuinely silent.
- Drop the dead per-connection data_sent counter (written, never read) and the
redundant assignment of ImagePusherQueueElement.image_data on the TCP path
(only the HDF5 pusher reads that field).
- Add SetPeerLivenessTimeout() so the liveness window is tunable (and testable).
- Add TCPImageCommTest_StalledWriter_SurvivesViaHeartbeat: a controllable raw
writer test double connects, ACKs START, then stops draining for 4s while still
sending BUSY heartbeats (peer-liveness window set to 2s). The run must ride out
the stall on the zero-copy queue path and deliver all 1000 images. Verified to
fail (115/1000 delivered, connection dropped) against the old 2s-deadline
behavior.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>