Commit Graph

8 Commits

Author SHA1 Message Date
leonarski_f 3cd96b8607 TCPStreamPusher: hard backpressure cap so a wedged writer can't hang the run
The peer-liveness timeout only catches a *silent* writer. A misbehaving writer
that keeps sending BUSY heartbeats while never draining (e.g. a permanently
wedged filesystem) would otherwise block SendAll -- and, through it, the queued
SendImage path and the end-of-run frame_transformation_futures.get() -- forever.

Add a progress-based cap in SendAll: if no bytes leave the socket for
max_backpressure_timeout (default 60s, tunable via SetMaxBackpressureTimeout)
the connection is declared dead regardless of heartbeats. It is one global cap,
enforced everywhere SendAll runs, so it bounds both mid-run stalls and
finalization. Generous relative to the 15s liveness window, since a heartbeating
peer is given more grace than a silent one -- but finite.

Add TCPImageCommTest_WedgedWriter_DroppedByBackpressureCap: a writer that ACKs
START then stalls forever while heartbeating (cap 1.5s, liveness 5s) must have
its connection dropped, and neither the producers nor EndDataCollection may
hang. Verified to hang (timeout) with the cap disabled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 15:55:49 +02:00
leonarski_f 2a9fd084ab TCPStreamPusher: post-zerocopy cleanup + fix queue-path backpressure drop
Follow-up simplifications after removing the zerocopy machinery, plus a real
backpressure bug the cleanup surfaced:

- SendImage(ZeroCopyReturnValue&) imposed a hard 2s deadline on enqueueing and
  then marked the connection broken. At high frame rate the 128-deep queue
  fills in tens of ms, so any filesystem stall longer than ~2s dropped the run
  even though the writer was alive and heartbeating -- defeating the whole
  BUSY-heartbeat backpressure design. Block instead while the peer is alive
  (!broken && active); the real liveness decision already lives in SendAll's
  peer-liveness timeout, which the writer's BUSY heartbeats keep fresh. This
  makes the queue path consistent with the send path: both wait out arbitrarily
  long stalls and only give up when the peer goes genuinely silent.
- Drop the dead per-connection data_sent counter (written, never read) and the
  redundant ImagePusherQueueElement.image_data set on the TCP path (only the
  HDF5 pusher reads that field).
- Add SetPeerLivenessTimeout() so the liveness window is tunable (and testable).

Add TCPImageCommTest_StalledWriter_SurvivesViaHeartbeat: a controllable raw
writer double connects, ACKs START, then stops draining for 4s while still
sending BUSY heartbeats (peer-liveness window set to 2s). The run must ride out
the stall on the zero-copy queue path and deliver all 1000 images. Verified to
fail (115/1000 delivered, connection dropped) against the old 2s-deadline
behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 15:55:49 +02:00
leonarski_f f859f8108f TCPStreamPusher: remove MSG_ZEROCOPY machinery, use plain blocking send
The MSG_ZEROCOPY path was the common factor behind the occasional mid-run
writer disconnects and added substantial failure surface for no benefit at
this throughput (tens of MB/s): the socket error queue raises POLLERR as a
normal event (entangled with liveness detection), and the per-connection
completion-id counter was reset every run while the kernel's sk_zckey is
monotonic for the life of the socket, so on a persistent connection the
bookkeeping diverged from run 2 onward.

Replace it with a straightforward synchronous send():
- SendAll/SendFrame lose all zerocopy params; DATA payloads are sent with a
  plain ::send(MSG_NOSIGNAL), so the image-buffer slot is owned by the kernel
  once send() returns and WriterThread releases it immediately.
- Drop ZeroCopyCompletionThread and the zc_pending/zc_mutex/zc_cv/zc_*_id
  state, SO_ZEROCOPY setup, and the errqueue include.
- StopDataCollectionThreads now drains and releases any queued-but-unsent
  slots on the stalled-writer path (active==false makes WriterThread exit
  without draining) instead of Clear()-ing them, avoiding a slot leak.

The BUSY-heartbeat peer-liveness timeout (backpressure tolerance) is kept;
it is independent of zerocopy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-25 15:55:49 +02:00
leonarski_f 75e401f0e5 v1.0.0-rc.153 (#63)
Build Packages / Unit tests (push) Successful in 1h31m59s
Build Packages / build:rpm (rocky8_nocuda) (push) Successful in 8m43s
Build Packages / build:rpm (rocky9_nocuda) (push) Successful in 10m5s
Build Packages / build:rpm (ubuntu2204_nocuda) (push) Successful in 9m27s
Build Packages / build:rpm (ubuntu2404_nocuda) (push) Successful in 8m56s
Build Packages / build:rpm (rocky8_sls9) (push) Successful in 9m24s
Build Packages / build:rpm (rocky9_sls9) (push) Successful in 10m27s
Build Packages / build:rpm (rocky8) (push) Successful in 9m20s
Build Packages / build:rpm (rocky9) (push) Successful in 10m50s
Build Packages / build:rpm (ubuntu2204) (push) Successful in 9m54s
Build Packages / build:rpm (ubuntu2404) (push) Successful in 8m38s
Build Packages / DIALS test (push) Successful in 12m13s
Build Packages / XDS test (durin plugin) (push) Successful in 7m8s
Build Packages / XDS test (JFJoch plugin) (push) Successful in 7m8s
Build Packages / XDS test (neggia plugin) (push) Successful in 7m50s
Build Packages / Generate python client (push) Successful in 16s
Build Packages / Build documentation (push) Successful in 50s
Build Packages / Create release (push) Skipped
This is an UNSTABLE release. It includes many experimental features, as well as many AI generated fixes. We recommend using rc.152 for production use.

* jfjoch_broker: Add EXPERIMENTAL pixelrefine mode for image processing
* jfjoch_broker: Allow to load user mask from 8-bit and 16-bit TIFF files
* jfjoch_broker: Add ROI calculation in non-FPGA workflow
* jfjoch_broker: Fixes to TCP image pusher
* jfjoch_broker: Remove NUMA bindings
* jfjoch_broker: Improvements to indexing
* jfjoch_broker: For PSI EIGER, trimming energies are taken from the detector configuration (now compulsory) instead of hardcoded values
* jfjoch_writer: Save ROI definitions and the per-pixel ROI bitmap in the master file; azimuthal ROIs support phi (angular) sectors
* jfjoch_viewer: Major redesign with dockable panels and saved layouts, plus on-canvas creation/move/resize of box, circle and azimuthal ROIs
* jfjoch_viewer: Run jfjoch_process reprocessing jobs from inside the GUI and overlay per-run results

Reviewed-on: #63
2026-06-23 20:29:49 +02:00
leonarski_f fc68a9baed v1.0.0-rc.146 (#56)
Build Packages / Unit tests (push) Skipped
Build Packages / build:rpm (ubuntu2404_nocuda) (push) Successful in 8m34s
Build Packages / build:rpm (ubuntu2204_nocuda) (push) Successful in 10m0s
Build Packages / build:rpm (rocky8_nocuda) (push) Successful in 10m23s
Build Packages / build:rpm (rocky8_sls9) (push) Successful in 10m23s
Build Packages / build:rpm (rocky9_nocuda) (push) Successful in 11m16s
Build Packages / build:rpm (rocky9_sls9) (push) Successful in 11m49s
Build Packages / build:rpm (ubuntu2404) (push) Successful in 8m32s
Build Packages / build:rpm (ubuntu2204) (push) Successful in 9m15s
Build Packages / XDS test (durin plugin) (push) Successful in 7m16s
Build Packages / Generate python client (push) Successful in 16s
Build Packages / build:rpm (rocky9) (push) Successful in 10m12s
Build Packages / Create release (push) Skipped
Build Packages / Build documentation (push) Successful in 47s
Build Packages / DIALS test (push) Successful in 10m18s
Build Packages / XDS test (JFJoch plugin) (push) Successful in 5m46s
Build Packages / build:rpm (rocky8) (push) Successful in 1h41m2s
Build Packages / XDS test (neggia plugin) (push) Successful in 1h59m18s
This is an UNSTABLE release. The release has significant modifications for data processing - in case of troubles go back to 1.0.0-rc.144.

jfjoch_process: Generate a dedicated file (_process.h5), which can be used as a replacement for the _master.h5 file for a reanalyzed dataset.
jfjoch_process: Improve the performance of scaling and merging, implement on the fly scaling.
jfjoch_writer: All final data analysis results are repopulated in the _master.h5 file.
jfjoch_scale: Dedicated tool for rescaling/merging existing data.
jfjoch_viewer: Fix bugs where pixel labels where displayed on a wrong pixel.

WARNING! Scaling and merging are experimental at the moment, and may not provide reasonable results for the time being.

Reviewed-on: #56
2026-05-28 18:48:35 +02:00
leonarski_f bb9f5c715f v1.0.0-rc.135 (#44)
Build Packages / build:rpm (ubuntu2204_nocuda) (push) Successful in 9m55s
Build Packages / build:rpm (rocky8_nocuda) (push) Successful in 10m28s
Build Packages / build:rpm (ubuntu2404_nocuda) (push) Successful in 8m56s
Build Packages / build:rpm (rocky9_nocuda) (push) Successful in 11m47s
Build Packages / build:rpm (rocky8_sls9) (push) Successful in 13m7s
Build Packages / build:rpm (ubuntu2204) (push) Successful in 12m31s
Build Packages / build:rpm (rocky8) (push) Successful in 12m59s
Build Packages / build:rpm (rocky9) (push) Successful in 14m5s
Build Packages / build:rpm (rocky9_sls9) (push) Successful in 15m30s
Build Packages / Generate python client (push) Successful in 1m18s
Build Packages / Build documentation (push) Successful in 1m3s
Build Packages / Create release (push) Has been skipped
Build Packages / build:rpm (ubuntu2404) (push) Successful in 10m8s
Build Packages / XDS test (durin plugin) (push) Successful in 9m16s
Build Packages / XDS test (neggia plugin) (push) Successful in 7m59s
Build Packages / XDS test (JFJoch plugin) (push) Successful in 9m12s
Build Packages / DIALS test (push) Successful in 11m44s
Build Packages / Unit tests (push) Successful in 1h23m8s
This is an UNSTABLE release. The release has significant modifications and bug fixes, if things go wrong, it is better to revert to 1.0.0-rc.132.

* Multiple small bug fixes scattered across the whole code base. (detected with GPT-5.4)
* jfjoch_viewer: Improve image render performance

Reviewed-on: #44
Co-authored-by: Filip Leonarski <filip.leonarski@psi.ch>
Co-committed-by: Filip Leonarski <filip.leonarski@psi.ch>
2026-04-16 11:59:59 +02:00
leonarski_f 64002f1e29 v1.0.0-rc.129 (#36)
Build Packages / build:rpm (rocky8_nocuda) (push) Successful in 11m14s
Build Packages / build:rpm (ubuntu2204_nocuda) (push) Successful in 10m43s
Build Packages / build:rpm (rocky9_nocuda) (push) Successful in 11m35s
Build Packages / build:rpm (ubuntu2404_nocuda) (push) Successful in 9m20s
Build Packages / build:rpm (rocky8_sls9) (push) Successful in 10m23s
Build Packages / Generate python client (push) Successful in 39s
Build Packages / build:rpm (rocky9_sls9) (push) Successful in 11m24s
Build Packages / Create release (push) Has been skipped
Build Packages / Build documentation (push) Successful in 1m0s
Build Packages / build:rpm (rocky8) (push) Successful in 10m35s
Build Packages / build:rpm (ubuntu2204) (push) Successful in 10m35s
Build Packages / build:rpm (rocky9) (push) Successful in 11m17s
Build Packages / build:rpm (ubuntu2404) (push) Successful in 9m9s
Build Packages / Unit tests (push) Failing after 1h18m57s
This is an UNSTABLE release. The release has significant modifications and bug fixes, if things go wrong, it is better to revert to 1.0.0-rc.124.

* jfjoch_broker: Significant improvements in TCP image socket, as a viable alternative for ZeroMQ sockets (only a single port on broker side, dynamically change number of writers, acknowledgments for written files)
* jfjoch_broker: Delta phi is calculated also for still data in Bragg prediction
* jfjoch_broker: Image pusher statistics are accessible via the REST interface
* jfjoch_writer: Supports TCP image socket and for these auto-forking option

Reviewed-on: #36
Co-authored-by: Filip Leonarski <filip.leonarski@psi.ch>
Co-committed-by: Filip Leonarski <filip.leonarski@psi.ch>
2026-03-05 22:13:12 +01:00
leonarski_f f3e0a15d26 v1.0.0-rc.127 (#34)
Build Packages / build:rpm (rocky8_nocuda) (push) Successful in 10m51s
Build Packages / build:rpm (ubuntu2404_nocuda) (push) Successful in 8m0s
Build Packages / build:rpm (ubuntu2204_nocuda) (push) Successful in 9m6s
Build Packages / build:rpm (rocky9_nocuda) (push) Successful in 10m7s
Build Packages / build:rpm (rocky8_sls9) (push) Successful in 9m47s
Build Packages / Generate python client (push) Successful in 29s
Build Packages / Build documentation (push) Successful in 43s
Build Packages / Create release (push) Has been skipped
Build Packages / build:rpm (rocky9_sls9) (push) Successful in 10m46s
Build Packages / build:rpm (rocky8) (push) Successful in 9m33s
Build Packages / Unit tests (push) Has been skipped
Build Packages / build:rpm (ubuntu2204) (push) Successful in 8m47s
Build Packages / build:rpm (rocky9) (push) Successful in 9m55s
Build Packages / build:rpm (ubuntu2404) (push) Successful in 9m4s
This is an UNSTABLE release. The release has significant modifications and bug fixes, if things go wrong, it is better to revert to 1.0.0-rc.124.

* jfjoch_broker: Default EIGER readout time is 20 microseconds
* jfjoch_broker: Multiple improvements regarding performance
* jfjoch_broker: Image buffer allows to track frames in preparation and sending
* jfjoch_broker: Dedicated thread for ZeroMQ transmission to better utilize the image buffer
* jfjoch_broker: Experimental implementation of transmission with raw TCP/IP sockets
* jfjoch_writer: Fixes regarding properly closing files in long data collections
* jfjoch_process: Scale & merge has been significantly improved, but it is not yet integrated into mainstream code

Reviewed-on: #34
2026-03-02 15:57:12 +01:00