v1.0.0-rc.154 #64
Merged
leonarski_f merged 7 commits from 2606-tcp into main 2026-06-25 18:12:00 +02:00
7 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
cdbd0b7bed |
TCPStreamPusher: flush queued images before END so writer can't drop tail
Build Packages / build:rpm (rocky8_nocuda) (push) Successful in 10m40s
Build Packages / build:rpm (rocky9_nocuda) (push) Successful in 11m0s
Build Packages / build:rpm (ubuntu2204_nocuda) (push) Successful in 9m14s
Build Packages / build:rpm (ubuntu2404_nocuda) (push) Successful in 9m39s
Build Packages / build:rpm (rocky8_sls9) (push) Successful in 10m24s
Build Packages / build:rpm (rocky8) (push) Successful in 10m30s
Build Packages / build:rpm (rocky9_sls9) (push) Successful in 11m23s
Build Packages / build:rpm (rocky9) (push) Successful in 11m9s
Build Packages / build:rpm (ubuntu2204) (push) Successful in 10m7s
Build Packages / build:rpm (ubuntu2404) (push) Successful in 10m9s
Build Packages / Generate python client (push) Successful in 18s
Build Packages / Build documentation (push) Successful in 51s
Build Packages / Create release (push) Skipped
Build Packages / XDS test (neggia plugin) (push) Successful in 6m25s
Build Packages / XDS test (durin plugin) (push) Successful in 7m32s
Build Packages / XDS test (JFJoch plugin) (push) Successful in 7m37s
Build Packages / DIALS test (push) Successful in 11m29s
Build Packages / build:rpm (rocky8_nocuda) (pull_request) Successful in 9m2s
Build Packages / build:rpm (ubuntu2204_nocuda) (pull_request) Successful in 8m46s
Build Packages / build:rpm (rocky9_nocuda) (pull_request) Successful in 9m40s
Build Packages / build:rpm (ubuntu2404_nocuda) (pull_request) Successful in 8m23s
Build Packages / build:rpm (rocky8) (pull_request) Successful in 8m59s
Build Packages / build:rpm (rocky8_sls9) (pull_request) Successful in 9m50s
Build Packages / build:rpm (rocky9_sls9) (pull_request) Successful in 10m36s
Build Packages / build:rpm (rocky9) (pull_request) Successful in 10m21s
Build Packages / build:rpm (ubuntu2404) (pull_request) Successful in 9m53s
Build Packages / build:rpm (ubuntu2204) (pull_request) Successful in 10m24s
Build Packages / XDS test (durin plugin) (pull_request) Successful in 8m19s
Build Packages / Generate python client (pull_request) Successful in 13s
Build Packages / Create release (pull_request) Skipped
Build Packages / Build documentation (pull_request) Successful in 38s
Build Packages / XDS test (JFJoch plugin) (pull_request) Successful in 8m22s
Build Packages / DIALS test (pull_request) Successful in 11m31s
Build Packages / XDS test (neggia plugin) (pull_request) Successful in 5m41s
Build Packages / Unit tests (push) Successful in 1h17m11s
Build Packages / Unit tests (pull_request) Successful in 1h8m58s
Under load on the CI machine, the JFJochIntegrationTest_TCP_* tests intermittently failed with the writer reporting fewer images than sent (e.g. 4 of 5) and the DATA-ACK count short by one.
EndDataCollection sent the END frame as soon as it could acquire send_mutex, without waiting for the per-connection WriterThread to drain its queue. SendImage only enqueues; the WriterThread actually transmits DATA frames. On a fast machine the queue is already empty when END is sent, but on a loaded machine the WriterThread falls behind, END overtakes the still-queued trailing image(s), the remote writer sees END and finalizes, and the late DATA frame is silently dropped by ProcessDataImage's Finalized case (no ACK emitted).
Fix: flush before END. WriterThread now drains the queue until the end sentinel instead of bailing the moment c->active is cleared, and EndDataCollection stops (and joins) the WriterThread before sending END. Joining the writer is a race-free barrier: every image is on the wire before END goes out.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
||
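The "drain to the sentinel, then join, then send END" ordering from this commit can be sketched in isolation. This is a minimal illustration only: `Connection`, the integer image numbers, and the in-memory `wire_` log are hypothetical stand-ins, not the actual TCPStreamPusher types, and a real implementation would write frames to a socket.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <optional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for one pusher connection (not the real class).
class Connection {
public:
    Connection() : writer_([this] { WriterLoop(); }) {}

    ~Connection() {
        // If the run was never ended, still stop the writer cleanly.
        if (writer_.joinable()) {
            Push(std::nullopt);
            writer_.join();
        }
    }

    // Producer side: only enqueues; the writer thread does the transmit.
    void SendImage(int image_number) { Push(image_number); }

    // Enqueue the end sentinel, wait for the writer to drain everything
    // queued before it, and only then put END on the wire.
    void EndDataCollection() {
        Push(std::nullopt);
        writer_.join();          // barrier: all DATA frames are out
        wire_.push_back("END");  // safe: writer thread has exited
    }

    const std::vector<std::string>& Wire() const { return wire_; }

private:
    void Push(std::optional<int> item) {
        {
            std::lock_guard<std::mutex> lock(m_);
            queue_.push_back(item);
        }
        cv_.notify_one();
    }

    // Drains until the sentinel; it does not exit early on a flag.
    void WriterLoop() {
        for (;;) {
            std::unique_lock<std::mutex> lock(m_);
            cv_.wait(lock, [this] { return !queue_.empty(); });
            std::optional<int> item = queue_.front();
            queue_.pop_front();
            lock.unlock();
            if (!item) return;  // end sentinel
            wire_.push_back("DATA " + std::to_string(*item));
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::optional<int>> queue_;
    std::vector<std::string> wire_;  // stand-in for the socket
    std::thread writer_;             // declared last: started after the rest
};
```

Because END is appended only after `join()` returns, it cannot overtake a queued image no matter how far the writer thread has fallen behind.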
|
|
26554c86e3 |
VERSION: 1.0.0-rc.154
Build Packages / build:rpm (ubuntu2404_nocuda) (push) Successful in 14m3s
Build Packages / build:rpm (ubuntu2204_nocuda) (push) Successful in 15m4s
Build Packages / build:rpm (rocky8_nocuda) (push) Successful in 15m17s
Build Packages / build:rpm (rocky8) (push) Successful in 15m11s
Build Packages / build:rpm (rocky8_sls9) (push) Successful in 15m28s
Build Packages / build:rpm (rocky9_nocuda) (push) Successful in 15m41s
Build Packages / build:rpm (rocky9_sls9) (push) Successful in 15m41s
Build Packages / XDS test (neggia plugin) (push) Successful in 7m15s
Build Packages / Generate python client (push) Successful in 27s
Build Packages / XDS test (durin plugin) (push) Successful in 8m10s
Build Packages / Create release (push) Skipped
Build Packages / XDS test (JFJoch plugin) (push) Successful in 8m19s
Build Packages / Build documentation (push) Successful in 1m8s
Build Packages / build:rpm (rocky9) (push) Successful in 12m52s
Build Packages / build:rpm (ubuntu2404) (push) Successful in 11m44s
Build Packages / build:rpm (ubuntu2204) (push) Successful in 12m57s
Build Packages / DIALS test (push) Successful in 13m10s
Build Packages / build:rpm (rocky8_nocuda) (pull_request) Successful in 10m37s
Build Packages / build:rpm (rocky9_nocuda) (pull_request) Successful in 12m4s
Build Packages / build:rpm (ubuntu2404_nocuda) (pull_request) Successful in 10m1s
Build Packages / build:rpm (ubuntu2204_nocuda) (pull_request) Successful in 11m13s
Build Packages / build:rpm (rocky8_sls9) (pull_request) Successful in 11m57s
Build Packages / build:rpm (rocky9_sls9) (pull_request) Successful in 11m50s
Build Packages / build:rpm (rocky8) (pull_request) Successful in 10m47s
Build Packages / build:rpm (ubuntu2404) (pull_request) Successful in 10m16s
Build Packages / build:rpm (ubuntu2204) (pull_request) Successful in 11m47s
Build Packages / Generate python client (pull_request) Successful in 13s
Build Packages / XDS test (durin plugin) (pull_request) Successful in 9m10s
Build Packages / Create release (pull_request) Skipped
Build Packages / build:rpm (rocky9) (pull_request) Successful in 12m53s
Build Packages / Build documentation (pull_request) Successful in 42s
Build Packages / XDS test (JFJoch plugin) (pull_request) Successful in 7m17s
Build Packages / DIALS test (pull_request) Successful in 14m6s
Build Packages / XDS test (neggia plugin) (pull_request) Successful in 6m16s
Build Packages / Unit tests (push) Failing after 1h41m22s
Build Packages / Unit tests (pull_request) Successful in 1h29m54s
|
||
|
|
51bfb57782 | Regenerate OpenAPI | ||
|
|
a02fd19af4 |
broker: optional TCP liveness/backpressure timeouts in config
Expose the two new TCPStreamPusher knobs through the broker config so they can be
tuned per deployment, like send_buffer_size already is:
- peer_liveness_timeout_ms -> SetPeerLivenessTimeout()
- max_backpressure_timeout_ms -> SetMaxBackpressureTimeout()
Both are optional fields on tcp_settings; when unset the parser leaves the
pusher's built-in defaults (15s / 60s) in place, so existing config files keep
working unchanged. The parser also ignores non-positive values.
The generated broker/gen/model/Tcp_settings.{h,cpp} are hand-patched to mirror
exactly what openapi-generator emits for these fields (verified against the
send_buffer_size pattern and by a from_json/to_json round-trip), so re-running
update_version.sh reproduces them with no diff.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
||
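The "optional field, keep the default when unset or non-positive" rule described in this commit might look roughly like the sketch below. The struct and function names (`TcpSettings`, `PusherTimeouts`, `ApplyTcpSettings`) are assumptions made for illustration; only the field names and the 15s / 60s defaults come from the commit message.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <optional>

// Hypothetical mirror of the optional tcp_settings fields.
struct TcpSettings {
    std::optional<int64_t> peer_liveness_timeout_ms;
    std::optional<int64_t> max_backpressure_timeout_ms;
};

// Hypothetical holder for the pusher's built-in defaults (15s / 60s).
struct PusherTimeouts {
    std::chrono::milliseconds peer_liveness{15000};
    std::chrono::milliseconds max_backpressure{60000};
};

// Override a default only when the field is present and positive.
inline void ApplyTcpSettings(const TcpSettings& s, PusherTimeouts& t) {
    if (s.peer_liveness_timeout_ms && *s.peer_liveness_timeout_ms > 0)
        t.peer_liveness = std::chrono::milliseconds(*s.peer_liveness_timeout_ms);
    if (s.max_backpressure_timeout_ms && *s.max_backpressure_timeout_ms > 0)
        t.max_backpressure = std::chrono::milliseconds(*s.max_backpressure_timeout_ms);
}
```

With this shape, an existing config file that omits both fields leaves the defaults untouched, and a zero or negative value is ignored in the same way.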
|
|
3cd96b8607 |
TCPStreamPusher: hard backpressure cap so a wedged writer can't hang the run
The peer-liveness timeout only catches a *silent* writer. A misbehaving writer that keeps sending BUSY heartbeats while never draining (e.g. a permanently wedged filesystem) would otherwise block SendAll -- and, through it, the queued SendImage path and the end-of-run frame_transformation_futures.get() -- forever.
Add a progress-based cap in SendAll: if no bytes leave the socket for max_backpressure_timeout (default 60s, tunable via SetMaxBackpressureTimeout) the connection is declared dead regardless of heartbeats. It is one global cap, enforced everywhere SendAll runs, so it bounds both mid-run stalls and finalization. Generous relative to the 15s liveness window, since a heartbeating peer is given more grace than a silent one -- but finite.
Add TCPImageCommTest_WedgedWriter_DroppedByBackpressureCap: a writer that ACKs START then stalls forever while heartbeating (cap 1.5s, liveness 5s) must have its connection dropped, and neither the producers nor EndDataCollection may hang. Verified to hang (timeout) with the cap disabled.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
||
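A progress-based cap of the kind described here can be sketched as a send loop that resets a timer whenever bytes are accepted and gives up once the timer exceeds the cap. This is an illustration under assumptions: `SendAllWithCap` and the `try_send` callback (returning the number of bytes accepted, 0 meaning "would block") are hypothetical, and the real SendAll works on a socket and also handles heartbeats.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>

using Clock = std::chrono::steady_clock;

// Returns true if all bytes were accepted, false if no byte made
// progress for at least max_backpressure (connection declared dead).
inline bool SendAllWithCap(
        const char* data, size_t len,
        const std::function<size_t(const char*, size_t)>& try_send,
        std::chrono::milliseconds max_backpressure) {
    Clock::time_point last_progress = Clock::now();
    size_t sent = 0;
    while (sent < len) {
        size_t n = try_send(data + sent, len - sent);
        if (n > 0) {
            sent += n;
            last_progress = Clock::now();  // any progress resets the cap
            continue;
        }
        if (Clock::now() - last_progress >= max_backpressure)
            return false;  // wedged: give up regardless of heartbeats
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return true;
}
```

The timer tracks bytes leaving, not messages arriving, which is why a peer that heartbeats but never drains still trips it, while a slow peer that keeps accepting data does not.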
|
|
2a9fd084ab |
TCPStreamPusher: post-zerocopy cleanup + fix queue-path backpressure drop
Follow-up simplifications after removing the zerocopy machinery, plus a real backpressure bug the cleanup surfaced:
- SendImage(ZeroCopyReturnValue&) imposed a hard 2s deadline on enqueueing and then marked the connection broken. At high frame rate the 128-deep queue fills in tens of ms, so any filesystem stall longer than ~2s dropped the run even though the writer was alive and heartbeating -- defeating the whole BUSY-heartbeat backpressure design. Block instead while the peer is alive (!broken && active); the real liveness decision already lives in SendAll's peer-liveness timeout, which the writer's BUSY heartbeats keep fresh. This makes the queue path consistent with the send path: both wait out arbitrarily long stalls and only give up when the peer goes genuinely silent.
- Drop the dead per-connection data_sent counter (written, never read) and the redundant ImagePusherQueueElement.image_data set on the TCP path (only the HDF5 pusher reads that field).
- Add SetPeerLivenessTimeout() so the liveness window is tunable (and testable).
Add TCPImageCommTest_StalledWriter_SurvivesViaHeartbeat: a controllable raw writer double connects, ACKs START, then stops draining for 4s while still sending BUSY heartbeats (peer-liveness window set to 2s). The run must ride out the stall on the zero-copy queue path and deliver all 1000 images. Verified to fail (115/1000 delivered, connection dropped) against the old 2s-deadline behavior.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
||
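The change from "enqueue with a fixed deadline" to "block while the peer is alive" can be illustrated with a small bounded queue. This is a sketch under assumptions: `BoundedQueue`, `MarkBroken`, and `RunStallDemo` are hypothetical names, and in the real code the broken/active state is driven by SendAll's liveness timeout rather than by an explicit call.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <thread>

// Hypothetical bounded queue; the producer waits without a deadline.
class BoundedQueue {
public:
    explicit BoundedQueue(size_t capacity) : capacity_(capacity) {}

    // Blocks while the queue is full and the peer is considered alive.
    // Returns false only if the connection was marked broken.
    bool PutBlocking(int item) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [this] { return q_.size() < capacity_ || broken_; });
        if (broken_) return false;
        q_.push_back(item);
        return true;
    }

    // Non-blocking pop used by the consumer side.
    std::optional<int> TryGet() {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return std::nullopt;
        int v = q_.front();
        q_.pop_front();
        not_full_.notify_one();
        return v;
    }

    // Called once the peer is judged dead; wakes any blocked producer.
    void MarkBroken() {
        {
            std::lock_guard<std::mutex> lock(m_);
            broken_ = true;
        }
        not_full_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable not_full_;
    std::deque<int> q_;
    size_t capacity_;
    bool broken_ = false;
};

// The consumer stalls for `stall` before draining; the producer simply
// waits it out. Returns the number of items delivered.
inline int RunStallDemo(int count, size_t depth, std::chrono::milliseconds stall) {
    BoundedQueue q(depth);
    int delivered = 0;
    std::thread consumer([&] {
        std::this_thread::sleep_for(stall);
        while (delivered < count) {
            if (q.TryGet()) ++delivered;
            else std::this_thread::yield();
        }
    });
    for (int i = 0; i < count; ++i) q.PutBlocking(i);
    consumer.join();
    return delivered;
}
```

The producer has no timeout of its own; the decision to give up is made in one place and signalled through `MarkBroken`, which matches the commit's point that the queue path and the send path should share one liveness decision.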
|
|
f859f8108f |
TCPStreamPusher: remove MSG_ZEROCOPY machinery, use plain blocking send
The MSG_ZEROCOPY path was the common factor behind the occasional mid-run writer disconnects and added substantial failure surface for no benefit at this throughput (tens of MB/s): the socket error queue raises POLLERR as a normal event (entangled with liveness detection), and the per-connection completion-id counter was reset every run while the kernel's sk_zckey is monotonic for the life of the socket, so on a persistent connection the bookkeeping diverged from run 2 onward.
Replace it with a straightforward synchronous send():
- SendAll/SendFrame lose all zerocopy params; DATA payloads are sent with a plain ::send(MSG_NOSIGNAL), so the image-buffer slot is owned by the kernel once send() returns and WriterThread releases it immediately.
- Drop ZeroCopyCompletionThread and the zc_pending/zc_mutex/zc_cv/zc_*_id state, SO_ZEROCOPY setup, and the errqueue include.
- StopDataCollectionThreads now drains and releases any queued-but-unsent slots on the stalled-writer path (active==false makes WriterThread exit without draining) instead of Clear()-ing them, avoiding a slot leak.
The BUSY-heartbeat peer-liveness timeout (backpressure tolerance) is kept; it is independent of zerocopy.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
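For reference, a plain synchronous send loop of the kind this commit moves to might look like the following. This is a minimal sketch assuming a Linux blocking stream socket; `SendAllBlocking` is a hypothetical name, and the real SendAll additionally carries the liveness and backpressure logic from the later commits.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Sends exactly len bytes on a blocking socket. MSG_NOSIGNAL turns a
// closed peer into an EPIPE error instead of a SIGPIPE signal.
inline bool SendAllBlocking(int fd, const void* data, size_t len) {
    const char* p = static_cast<const char*>(data);
    while (len > 0) {
        ssize_t n = ::send(fd, p, len, MSG_NOSIGNAL);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted: retry
            return false;                  // EPIPE, ECONNRESET, ...
        }
        p += n;
        len -= static_cast<size_t>(n);
    }
    // The bytes have been copied into the kernel socket buffer, so the
    // caller may reuse or release its own buffer from here on.
    return true;
}
```

Without MSG_ZEROCOPY there is no completion notification to track: once the call returns, the caller's buffer is free again, which is what lets the writer thread release the slot immediately.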