26554c86e3a955ea996ee2586afccca00a139edd
6 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
3cd96b8607 |
TCPStreamPusher: hard backpressure cap so a wedged writer can't hang the run
The peer-liveness timeout only catches a *silent* writer. A misbehaving writer that keeps sending BUSY heartbeats while never draining (e.g. a permanently wedged filesystem) would otherwise block SendAll -- and, through it, the queued SendImage path and the end-of-run frame_transformation_futures.get() -- forever.

Add a progress-based cap in SendAll: if no bytes leave the socket for max_backpressure_timeout (default 60s, tunable via SetMaxBackpressureTimeout) the connection is declared dead regardless of heartbeats. It is one global cap, enforced everywhere SendAll runs, so it bounds both mid-run stalls and finalization. Generous relative to the 15s liveness window, since a heartbeating peer is given more grace than a silent one -- but finite.

Add TCPImageCommTest_WedgedWriter_DroppedByBackpressureCap: a writer that ACKs START then stalls forever while heartbeating (cap 1.5s, liveness 5s) must have its connection dropped, and neither the producers nor EndDataCollection may hang. Verified to hang (timeout) with the cap disabled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
2a9fd084ab |
TCPStreamPusher: post-zerocopy cleanup + fix queue-path backpressure drop
Follow-up simplifications after removing the zerocopy machinery, plus a real backpressure bug the cleanup surfaced:

- SendImage(ZeroCopyReturnValue&) imposed a hard 2s deadline on enqueueing and then marked the connection broken. At high frame rate the 128-deep queue fills in tens of ms, so any filesystem stall longer than ~2s dropped the run even though the writer was alive and heartbeating -- defeating the whole BUSY-heartbeat backpressure design. Block instead while the peer is alive (!broken && active); the real liveness decision already lives in SendAll's peer-liveness timeout, which the writer's BUSY heartbeats keep fresh. This makes the queue path consistent with the send path: both wait out arbitrarily long stalls and only give up when the peer goes genuinely silent.
- Drop the dead per-connection data_sent counter (written, never read) and the redundant ImagePusherQueueElement.image_data set on the TCP path (only the HDF5 pusher reads that field).
- Add SetPeerLivenessTimeout() so the liveness window is tunable (and testable).

Add TCPImageCommTest_StalledWriter_SurvivesViaHeartbeat: a controllable raw writer double connects, ACKs START, then stops draining for 4s while still sending BUSY heartbeats (peer-liveness window set to 2s). The run must ride out the stall on the zero-copy queue path and deliver all 1000 images. Verified to fail (115/1000 delivered, connection dropped) against the old 2s-deadline behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
fc68a9baed |
v1.0.0-rc.146 (#56)
This is an UNSTABLE release. It makes significant modifications to data processing; if you run into trouble, go back to 1.0.0-rc.144.

* jfjoch_process: Generate a dedicated file (_process.h5), which can be used as a replacement for the _master.h5 file for a reanalyzed dataset.
* jfjoch_process: Improve the performance of scaling and merging, implement on-the-fly scaling.
* jfjoch_writer: All final data analysis results are repopulated in the _master.h5 file.
* jfjoch_scale: Dedicated tool for rescaling/merging existing data.
* jfjoch_viewer: Fix bugs where pixel labels were displayed on the wrong pixel.

WARNING! Scaling and merging are experimental at the moment, and may not provide reasonable results for the time being.

Reviewed-on: #56 |
||
|
|
c1c170112c |
v1.0.0-rc.136 (#45)
This is an UNSTABLE release. The release has significant modifications and bug fixes; if things go wrong, it is better to revert to 1.0.0-rc.132.

* jfjoch_broker: Improve logic regarding indexing architecture and thread pools (work in progress).

Reviewed-on: #45 |
||
|
|
64002f1e29 |
v1.0.0-rc.129 (#36)
This is an UNSTABLE release. The release has significant modifications and bug fixes; if things go wrong, it is better to revert to 1.0.0-rc.124.

* jfjoch_broker: Significant improvements to the TCP image socket, making it a viable alternative to ZeroMQ sockets (only a single port on the broker side, the number of writers can change dynamically, acknowledgments for written files)
* jfjoch_broker: Delta phi is also calculated for still data in Bragg prediction
* jfjoch_broker: Image pusher statistics are accessible via the REST interface
* jfjoch_writer: Supports the TCP image socket, including an auto-forking option for it

Reviewed-on: #36
Co-authored-by: Filip Leonarski <filip.leonarski@psi.ch>
Co-committed-by: Filip Leonarski <filip.leonarski@psi.ch> |
||
|
|
f3e0a15d26 |
v1.0.0-rc.127 (#34)
This is an UNSTABLE release. The release has significant modifications and bug fixes; if things go wrong, it is better to revert to 1.0.0-rc.124.

* jfjoch_broker: Default EIGER readout time is 20 microseconds
* jfjoch_broker: Multiple performance improvements
* jfjoch_broker: Image buffer can track frames that are being prepared and sent
* jfjoch_broker: Dedicated thread for ZeroMQ transmission to better utilize the image buffer
* jfjoch_broker: Experimental implementation of transmission over raw TCP/IP sockets
* jfjoch_writer: Fixes for properly closing files in long data collections
* jfjoch_process: Scale & merge has been significantly improved, but it is not yet integrated into the mainstream code

Reviewed-on: #34 |
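
The two timeouts described in commits 3cd96b8607 and 2a9fd084ab (a peer-liveness window refreshed by BUSY heartbeats, and a progress-based backpressure cap that heartbeats cannot extend) can be illustrated with a small sketch. This is not the actual TCPStreamPusher code: `SendAllSketch`, `SendResult`, and the three callables standing in for the socket, the heartbeat channel, and the clock are hypothetical names introduced here so the logic can be exercised without a real connection. Treating sent bytes as a sign of liveness is also an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>

using Clock = std::chrono::steady_clock;

// Hypothetical result type for this sketch (not the real TCPStreamPusher API).
enum class SendResult { Done, PeerSilent, BackpressureCap };

// try_send(ptr, len): returns bytes written, 0 if the socket would block.
// heartbeat_seen():   returns true if a BUSY heartbeat arrived since the last call.
// now():              returns the current time (injectable so tests can fake it).
SendResult SendAllSketch(
        const char *data, std::size_t len,
        const std::function<std::size_t(const char *, std::size_t)> &try_send,
        const std::function<bool()> &heartbeat_seen,
        const std::function<Clock::time_point()> &now,
        Clock::duration liveness_timeout,
        Clock::duration max_backpressure_timeout) {
    Clock::time_point last_heartbeat = now();
    Clock::time_point last_progress = last_heartbeat;
    std::size_t sent = 0;

    // A real implementation would wait in poll() between attempts
    // instead of spinning; the waiting is omitted here.
    while (sent < len) {
        const std::size_t n = try_send(data + sent, len - sent);
        assert(n <= len - sent);
        const Clock::time_point t = now();

        if (n > 0) {
            sent += n;
            last_progress = t;   // bytes left the socket: reset the cap
            last_heartbeat = t;  // assumption: progress also proves liveness
            continue;
        }
        if (heartbeat_seen())
            last_heartbeat = t;  // heartbeats refresh liveness only

        if (t - last_heartbeat > liveness_timeout)
            return SendResult::PeerSilent;       // silent writer
        if (t - last_progress > max_backpressure_timeout)
            return SendResult::BackpressureCap;  // heartbeating but wedged
    }
    return SendResult::Done;
}
```

With a writer that heartbeats but never drains, the sketch returns `BackpressureCap` once the cap elapses, even though the liveness window never expires. With a writer that goes quiet, it returns `PeerSilent` after the shorter liveness window.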