Commit Graph

6 Commits

Author SHA1 Message Date
leonarski_f 3cd96b8607 TCPStreamPusher: hard backpressure cap so a wedged writer can't hang the run
The peer-liveness timeout only catches a *silent* writer. A misbehaving writer
that keeps sending BUSY heartbeats while never draining (e.g. a permanently
wedged filesystem) would otherwise block SendAll -- and, through it, the queued
SendImage path and the end-of-run frame_transformation_futures.get() -- forever.

Add a progress-based cap in SendAll: if no bytes leave the socket for
max_backpressure_timeout (default 60s, tunable via SetMaxBackpressureTimeout)
the connection is declared dead regardless of heartbeats. It is one global cap,
enforced everywhere SendAll runs, so it bounds both mid-run stalls and
finalization. Generous relative to the 15s liveness window, since a heartbeating
peer is given more grace than a silent one -- but finite.

Add TCPImageCommTest_WedgedWriter_DroppedByBackpressureCap: a writer that ACKs
START then stalls forever while heartbeating (cap 1.5s, liveness 5s) must have
its connection dropped, and neither the producers nor EndDataCollection may
hang. Verified to hang (timeout) with the cap disabled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 15:55:49 +02:00
leonarski_f 2a9fd084ab TCPStreamPusher: post-zerocopy cleanup + fix queue-path backpressure drop
Follow-up simplifications after removing the zerocopy machinery, plus a real
backpressure bug the cleanup surfaced:

- SendImage(ZeroCopyReturnValue&) imposed a hard 2s deadline on enqueueing and
  then marked the connection broken. At high frame rate the 128-deep queue
  fills in tens of ms, so any filesystem stall longer than ~2s dropped the run
  even though the writer was alive and heartbeating -- defeating the whole
  BUSY-heartbeat backpressure design. Block instead while the peer is alive
  (!broken && active); the real liveness decision already lives in SendAll's
  peer-liveness timeout, which the writer's BUSY heartbeats keep fresh. This
  makes the queue path consistent with the send path: both wait out arbitrarily
  long stalls and only give up when the peer goes genuinely silent.
- Drop the dead per-connection data_sent counter (written, never read) and the
  redundant ImagePusherQueueElement.image_data set on the TCP path (only the
  HDF5 pusher reads that field).
- Add SetPeerLivenessTimeout() so the liveness window is tunable (and testable).

Add TCPImageCommTest_StalledWriter_SurvivesViaHeartbeat: a controllable raw
writer double connects, ACKs START, then stops draining for 4s while still
sending BUSY heartbeats (peer-liveness window set to 2s). The run must ride out
the stall on the zero-copy queue path and deliver all 1000 images. Verified to
fail (115/1000 delivered, connection dropped) against the old 2s-deadline
behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 15:55:49 +02:00
leonarski_f fc68a9baed v1.0.0-rc.146 (#56)
Build Packages / Unit tests (push) Skipped
Build Packages / build:rpm (ubuntu2404_nocuda) (push) Successful in 8m34s
Build Packages / build:rpm (ubuntu2204_nocuda) (push) Successful in 10m0s
Build Packages / build:rpm (rocky8_nocuda) (push) Successful in 10m23s
Build Packages / build:rpm (rocky8_sls9) (push) Successful in 10m23s
Build Packages / build:rpm (rocky9_nocuda) (push) Successful in 11m16s
Build Packages / build:rpm (rocky9_sls9) (push) Successful in 11m49s
Build Packages / build:rpm (ubuntu2404) (push) Successful in 8m32s
Build Packages / build:rpm (ubuntu2204) (push) Successful in 9m15s
Build Packages / XDS test (durin plugin) (push) Successful in 7m16s
Build Packages / Generate python client (push) Successful in 16s
Build Packages / build:rpm (rocky9) (push) Successful in 10m12s
Build Packages / Create release (push) Skipped
Build Packages / Build documentation (push) Successful in 47s
Build Packages / DIALS test (push) Successful in 10m18s
Build Packages / XDS test (JFJoch plugin) (push) Successful in 5m46s
Build Packages / build:rpm (rocky8) (push) Successful in 1h41m2s
Build Packages / XDS test (neggia plugin) (push) Successful in 1h59m18s
This is an UNSTABLE release. The release has significant modifications for data processing - in case of troubles go back to 1.0.0-rc.144.

jfjoch_process: Generate a dedicated file (_process.h5), which can be used as a replacement for the _master.h5 file for a reanalyzed dataset.
jfjoch_process: Improve the performance of scaling and merging, implement on the fly scaling.
jfjoch_writer: All final data analysis results are repopulated in the _master.h5 file.
jfjoch_scale: Dedicated tool for rescaling/merging existing data.
jfjoch_viewer: Fix bugs where pixel labels where displayed on a wrong pixel.

WARNING! Scaling and merging are experimental at the moment, and may not provide reasonable results for the time being.

Reviewed-on: #56
2026-05-28 18:48:35 +02:00
leonarski_f c1c170112c v1.0.0-rc.136 (#45)
Build Packages / build:rpm (ubuntu2404_nocuda) (push) Successful in 11m17s
Build Packages / build:rpm (ubuntu2204_nocuda) (push) Successful in 13m48s
Build Packages / build:rpm (rocky8_nocuda) (push) Successful in 13m57s
Build Packages / build:rpm (rocky8_sls9) (push) Successful in 15m15s
Build Packages / build:rpm (rocky9_nocuda) (push) Successful in 15m35s
Build Packages / build:rpm (rocky9_sls9) (push) Successful in 15m29s
Build Packages / build:rpm (ubuntu2204) (push) Successful in 10m55s
Build Packages / XDS test (durin plugin) (push) Successful in 8m17s
Build Packages / build:rpm (rocky9) (push) Successful in 12m17s
Build Packages / build:rpm (ubuntu2404) (push) Successful in 11m2s
Build Packages / Generate python client (push) Successful in 30s
Build Packages / Create release (push) Has been skipped
Build Packages / Build documentation (push) Successful in 51s
Build Packages / XDS test (JFJoch plugin) (push) Successful in 7m13s
Build Packages / DIALS test (push) Successful in 13m19s
Build Packages / XDS test (neggia plugin) (push) Successful in 5m52s
Build Packages / Unit tests (push) Successful in 1h18m25s
Build Packages / build:rpm (rocky8) (push) Successful in 7m2s
This is an UNSTABLE release. The release has significant modifications and bug fixes, if things go wrong, it is better to revert to 1.0.0-rc.132.

* jfjoch_broker: Improve logic regarding indexing architecture and thread pools (work in progress).

Reviewed-on: #45
2026-04-20 11:54:33 +02:00
leonarski_f 64002f1e29 v1.0.0-rc.129 (#36)
Build Packages / build:rpm (rocky8_nocuda) (push) Successful in 11m14s
Build Packages / build:rpm (ubuntu2204_nocuda) (push) Successful in 10m43s
Build Packages / build:rpm (rocky9_nocuda) (push) Successful in 11m35s
Build Packages / build:rpm (ubuntu2404_nocuda) (push) Successful in 9m20s
Build Packages / build:rpm (rocky8_sls9) (push) Successful in 10m23s
Build Packages / Generate python client (push) Successful in 39s
Build Packages / build:rpm (rocky9_sls9) (push) Successful in 11m24s
Build Packages / Create release (push) Has been skipped
Build Packages / Build documentation (push) Successful in 1m0s
Build Packages / build:rpm (rocky8) (push) Successful in 10m35s
Build Packages / build:rpm (ubuntu2204) (push) Successful in 10m35s
Build Packages / build:rpm (rocky9) (push) Successful in 11m17s
Build Packages / build:rpm (ubuntu2404) (push) Successful in 9m9s
Build Packages / Unit tests (push) Failing after 1h18m57s
This is an UNSTABLE release. The release has significant modifications and bug fixes, if things go wrong, it is better to revert to 1.0.0-rc.124.

* jfjoch_broker: Significant improvements in TCP image socket, as a viable alternative for ZeroMQ sockets (only a single port on broker side, dynamically change number of writers, acknowledgments for written files)
* jfjoch_broker: Delta phi is calculated also for still data in Bragg prediction
* jfjoch_broker: Image pusher statistics are accessible via the REST interface
* jfjoch_writer: Supports TCP image socket and for these auto-forking option

Reviewed-on: #36
Co-authored-by: Filip Leonarski <filip.leonarski@psi.ch>
Co-committed-by: Filip Leonarski <filip.leonarski@psi.ch>
2026-03-05 22:13:12 +01:00
leonarski_f f3e0a15d26 v1.0.0-rc.127 (#34)
Build Packages / build:rpm (rocky8_nocuda) (push) Successful in 10m51s
Build Packages / build:rpm (ubuntu2404_nocuda) (push) Successful in 8m0s
Build Packages / build:rpm (ubuntu2204_nocuda) (push) Successful in 9m6s
Build Packages / build:rpm (rocky9_nocuda) (push) Successful in 10m7s
Build Packages / build:rpm (rocky8_sls9) (push) Successful in 9m47s
Build Packages / Generate python client (push) Successful in 29s
Build Packages / Build documentation (push) Successful in 43s
Build Packages / Create release (push) Has been skipped
Build Packages / build:rpm (rocky9_sls9) (push) Successful in 10m46s
Build Packages / build:rpm (rocky8) (push) Successful in 9m33s
Build Packages / Unit tests (push) Has been skipped
Build Packages / build:rpm (ubuntu2204) (push) Successful in 8m47s
Build Packages / build:rpm (rocky9) (push) Successful in 9m55s
Build Packages / build:rpm (ubuntu2404) (push) Successful in 9m4s
This is an UNSTABLE release. The release has significant modifications and bug fixes, if things go wrong, it is better to revert to 1.0.0-rc.124.

* jfjoch_broker: Default EIGER readout time is 20 microseconds
* jfjoch_broker: Multiple improvements regarding performance
* jfjoch_broker: Image buffer allows to track frames in preparation and sending
* jfjoch_broker: Dedicated thread for ZeroMQ transmission to better utilize the image buffer
* jfjoch_broker: Experimental implementation of transmission with raw TCP/IP sockets
* jfjoch_writer: Fixes regarding properly closing files in long data collections
* jfjoch_process: Scale & merge has been significantly improved, but it is not yet integrated into mainstream code

Reviewed-on: #34
2026-03-02 15:57:12 +01:00