v1.0.0-rc.154 #64

Merged
leonarski_f merged 7 commits from 2606-tcp into main 2026-06-25 18:12:00 +02:00

7 Commits

Author SHA1 Message Date
leonarski_f cdbd0b7bed TCPStreamPusher: flush queued images before END so writer can't drop tail
Build Packages (push and pull_request): build:rpm for all 10 targets, Generate python client, Build documentation, XDS tests (neggia, durin, JFJoch plugins) and DIALS test Successful; Create release Skipped
Build Packages / Unit tests (push) Successful in 1h17m11s
Build Packages / Unit tests (pull_request) Successful in 1h8m58s
Under load on the CI machine, the JFJochIntegrationTest_TCP_* tests
intermittently failed with the writer reporting fewer images than sent
(e.g. 4 of 5) and the DATA-ACK count short by one.

EndDataCollection sent the END frame as soon as it could acquire
send_mutex, without waiting for the per-connection WriterThread to drain
its queue. SendImage only enqueues; the WriterThread actually transmits
DATA frames. On a fast machine the queue is already empty when END is
sent, but on a loaded machine the WriterThread falls behind, END
overtakes the still-queued trailing image(s), the remote writer sees END
and finalizes, and the late DATA frame is silently dropped by
ProcessDataImage's Finalized case (no ACK emitted).

Fix: flush before END. WriterThread now drains the queue until the end
sentinel instead of bailing the moment c->active is cleared, and
EndDataCollection stops (and joins) the WriterThread before sending END.
Joining the writer is a race-free barrier: every image is on the wire
before END goes out.
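
The ordering argument above can be illustrated with a small stand-alone model. This is a sketch under assumptions, not the TCPStreamPusher code: the `Connection` class, its `Wire()` accessor and the use of an empty `std::optional` as the end sentinel are hypothetical, and the "wire" is just a vector of strings.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Hypothetical model of one connection; names are illustrative only.
class Connection {
public:
    Connection() : writer_([this] { WriterLoop(); }) {}

    ~Connection() {
        if (writer_.joinable()) {  // EndDataCollection was never called
            Push(std::nullopt);
            writer_.join();
        }
    }

    // Producer side: only enqueues, never touches the wire.
    void SendImage(int image_number) { Push(image_number); }

    // Flush before END: queue the sentinel, join the writer, then emit END.
    void EndDataCollection() {
        if (!writer_.joinable()) return;  // already finished
        Push(std::nullopt);               // end sentinel
        writer_.join();                   // barrier: every queued DATA frame is out
        wire_.push_back("END");
    }

    const std::vector<std::string>& Wire() const { return wire_; }

private:
    void Push(std::optional<int> item) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(item);
        }
        cv_.notify_one();
    }

    // Drains until the sentinel, rather than stopping on an "inactive" flag.
    void WriterLoop() {
        for (;;) {
            std::unique_lock<std::mutex> lock(m_);
            cv_.wait(lock, [this] { return !q_.empty(); });
            std::optional<int> item = q_.front();
            q_.pop();
            lock.unlock();
            if (!item) return;  // sentinel reached, queue fully drained
            wire_.push_back("DATA " + std::to_string(*item));
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::optional<int>> q_;
    std::vector<std::string> wire_;  // written by the writer, read after join
    std::thread writer_;             // declared last so it starts after the rest
};
```

Because END is appended only after `join()` returns, no interleaving of producer and writer can place END ahead of a queued DATA frame in this model.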

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 16:49:25 +02:00
leonarski_f 26554c86e3 VERSION: 1.0.0-rc.154
Build Packages (push and pull_request): build:rpm for all 10 targets, Generate python client, Build documentation, XDS tests (neggia, durin, JFJoch plugins) and DIALS test Successful; Create release Skipped
Build Packages / Unit tests (push) Failing after 1h41m22s
Build Packages / Unit tests (pull_request) Successful in 1h29m54s
2026-06-25 15:58:28 +02:00
leonarski_f 51bfb57782 Regenerate OpenAPI 2026-06-25 15:56:44 +02:00
leonarski_f a02fd19af4 broker: optional TCP liveness/backpressure timeouts in config
Expose the two new TCPStreamPusher knobs through the broker config so they can be
tuned per deployment, like send_buffer_size already is:

- peer_liveness_timeout_ms  -> SetPeerLivenessTimeout()
- max_backpressure_timeout_ms -> SetMaxBackpressureTimeout()

Both are optional fields on tcp_settings; when unset the parser leaves the
pusher's built-in defaults (15s / 60s) in place, so existing config files keep
working unchanged. The parser also ignores non-positive values.
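
A minimal sketch of the parsing rule described above, assuming hypothetical `TcpSettings` and `PusherTimeouts` structs (the real parser calls SetPeerLivenessTimeout() and SetMaxBackpressureTimeout() on the pusher; that wiring is not reproduced here). Only the field names and the 15s / 60s defaults come from this commit message.

```cpp
#include <chrono>
#include <cstdint>
#include <optional>

// Hypothetical mirror of the two optional tcp_settings fields.
struct TcpSettings {
    std::optional<std::int64_t> peer_liveness_timeout_ms;
    std::optional<std::int64_t> max_backpressure_timeout_ms;
};

// Hypothetical holder for the effective values, starting at the built-in defaults.
struct PusherTimeouts {
    std::chrono::milliseconds peer_liveness{15000};
    std::chrono::milliseconds max_backpressure{60000};
};

// Unset or non-positive values leave the default untouched.
PusherTimeouts ApplyTcpSettings(const TcpSettings& s) {
    PusherTimeouts t;
    if (s.peer_liveness_timeout_ms && *s.peer_liveness_timeout_ms > 0)
        t.peer_liveness = std::chrono::milliseconds(*s.peer_liveness_timeout_ms);
    if (s.max_backpressure_timeout_ms && *s.max_backpressure_timeout_ms > 0)
        t.max_backpressure = std::chrono::milliseconds(*s.max_backpressure_timeout_ms);
    return t;
}
```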

The generated broker/gen/model/Tcp_settings.{h,cpp} are hand-patched to mirror
exactly what openapi-generator emits for these fields (verified against the
send_buffer_size pattern and by a from_json/to_json round-trip), so re-running
update_version.sh reproduces them with no diff.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 15:55:49 +02:00
leonarski_f 3cd96b8607 TCPStreamPusher: hard backpressure cap so a wedged writer can't hang the run
The peer-liveness timeout only catches a *silent* writer. A misbehaving writer
that keeps sending BUSY heartbeats while never draining (e.g. a permanently
wedged filesystem) would otherwise block SendAll -- and, through it, the queued
SendImage path and the end-of-run frame_transformation_futures.get() -- forever.

Add a progress-based cap in SendAll: if no bytes leave the socket for
max_backpressure_timeout (default 60s, tunable via SetMaxBackpressureTimeout)
the connection is declared dead regardless of heartbeats. It is one global cap,
enforced everywhere SendAll runs, so it bounds both mid-run stalls and
finalization. Generous relative to the 15s liveness window, since a heartbeating
peer is given more grace than a silent one -- but finite.

Add TCPImageCommTest_WedgedWriter_DroppedByBackpressureCap: a writer that ACKs
START then stalls forever while heartbeating (cap 1.5s, liveness 5s) must have
its connection dropped, and neither the producers nor EndDataCollection may
hang. Verified to hang (timeout) with the cap disabled.
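
The progress-based cap can be sketched in isolation as follows. This is an illustration under assumptions, not the actual SendAll: the function name `SendAllWithCap`, the injected `try_send` and `now` callables, and the byte-count interface are all hypothetical, and the peer-liveness check that the real code also performs is left out.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

using Clock = std::chrono::steady_clock;

// try_send(n) reports how many of the n remaining bytes the socket accepted
// (0 models a peer that applies backpressure). Returns true once everything
// is sent, false if no byte left the socket for longer than `cap`.
bool SendAllWithCap(std::size_t total,
                    const std::function<std::size_t(std::size_t)>& try_send,
                    const std::function<Clock::time_point()>& now,
                    Clock::duration cap) {
    std::size_t sent = 0;
    Clock::time_point last_progress = now();
    while (sent < total) {
        const std::size_t n = try_send(total - sent);
        if (n > 0) {
            sent += n;
            last_progress = now();  // any progress restarts the cap
        } else if (now() - last_progress > cap) {
            return false;           // declared dead regardless of heartbeats
        }
    }
    return true;
}
```

The cap measures time since the last byte was accepted, not total send time, so a slow but draining peer is never cut off; only a peer that accepts nothing for the whole window is.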

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 15:55:49 +02:00
leonarski_f 2a9fd084ab TCPStreamPusher: post-zerocopy cleanup + fix queue-path backpressure drop
Follow-up simplifications after removing the zerocopy machinery, plus a real
backpressure bug the cleanup surfaced:

- SendImage(ZeroCopyReturnValue&) imposed a hard 2s deadline on enqueueing and
  then marked the connection broken. At high frame rate the 128-deep queue
  fills in tens of ms, so any filesystem stall longer than ~2s dropped the run
  even though the writer was alive and heartbeating -- defeating the whole
  BUSY-heartbeat backpressure design. Block instead while the peer is alive
  (!broken && active); the real liveness decision already lives in SendAll's
  peer-liveness timeout, which the writer's BUSY heartbeats keep fresh. This
  makes the queue path consistent with the send path: both wait out arbitrarily
  long stalls and only give up when the peer goes genuinely silent.
- Drop the dead per-connection data_sent counter (written, never read) and the
  redundant ImagePusherQueueElement.image_data set on the TCP path (only the
  HDF5 pusher reads that field).
- Add SetPeerLivenessTimeout() so the liveness window is tunable (and testable).

Add TCPImageCommTest_StalledWriter_SurvivesViaHeartbeat: a controllable raw
writer double connects, ACKs START, then stops draining for 4s while still
sending BUSY heartbeats (peer-liveness window set to 2s). The run must ride out
the stall on the zero-copy queue path and deliver all 1000 images. Verified to
fail (115/1000 delivered, connection dropped) against the old 2s-deadline
behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 15:55:49 +02:00
leonarski_f f859f8108f TCPStreamPusher: remove MSG_ZEROCOPY machinery, use plain blocking send
The MSG_ZEROCOPY path was the common factor behind the occasional mid-run
writer disconnects and added substantial failure surface for no benefit at
this throughput (tens of MB/s): the socket error queue raises POLLERR as a
normal event (entangled with liveness detection), and the per-connection
completion-id counter was reset every run while the kernel's sk_zckey is
monotonic for the life of the socket, so on a persistent connection the
bookkeeping diverged from run 2 onward.

Replace it with a straightforward synchronous send():
- SendAll/SendFrame lose all zerocopy params; DATA payloads are sent with a
  plain ::send(MSG_NOSIGNAL), so the payload has been copied into the kernel
  socket buffer once send() returns and WriterThread releases the image-buffer
  slot immediately.
- Drop ZeroCopyCompletionThread and the zc_pending/zc_mutex/zc_cv/zc_*_id
  state, SO_ZEROCOPY setup, and the errqueue include.
- StopDataCollectionThreads now drains and releases any queued-but-unsent
  slots on the stalled-writer path (active==false makes WriterThread exit
  without draining) instead of Clear()-ing them, avoiding a slot leak.

The BUSY-heartbeat peer-liveness timeout (backpressure tolerance) is kept;
it is independent of zerocopy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 15:55:49 +02:00