Commit Graph

3 Commits

Author SHA1 Message Date
leonarski_f 97c3008c39 TCP stream: tolerate writer backpressure via BUSY heartbeats
Build Packages / Unit tests (push) Successful in 56m25s
Build Packages / DIALS test (push) Successful in 13m9s
Build Packages / XDS test (durin plugin) (push) Successful in 9m34s
Build Packages / XDS test (JFJoch plugin) (push) Successful in 9m37s
Build Packages / XDS test (neggia plugin) (push) Successful in 5m57s
Build Packages / Generate python client (push) Successful in 14s
Build Packages / Build documentation (push) Successful in 40s
Build Packages / Create release (push) Skipped
Build Packages / build:rpm (rocky8_nocuda) (push) Successful in 9m57s
Build Packages / build:rpm (ubuntu2204_nocuda) (push) Successful in 9m47s
Build Packages / build:rpm (rocky9_nocuda) (push) Successful in 10m11s
Build Packages / build:rpm (ubuntu2404_nocuda) (push) Successful in 9m57s
Build Packages / build:rpm (rocky8_sls9) (push) Successful in 10m20s
Build Packages / build:rpm (rocky9_sls9) (push) Successful in 10m41s
Build Packages / build:rpm (ubuntu2204) (push) Successful in 10m19s
Build Packages / build:rpm (ubuntu2404) (push) Successful in 9m42s
Build Packages / build:rpm (rocky8) (push) Successful in 11m4s
Build Packages / build:rpm (rocky9) (push) Successful in 11m43s
A slow filesystem could stall the writer's consume pipeline, propagating
TCP backpressure to the pusher. The pusher then treated that backpressure
as a dead peer and force-closed the connection mid-run; the writer
reconnected as a brand-new connection outside the active session, so the
rest of the run was silently dropped and the half-written HDF5 file was
later finalized with holes.

Replace the throughput/progress-based send timeout with a peer-liveness
timeout:

- Add TCPFrameType::BUSY (wire version 2 -> 3).
- TCPImagePuller runs a dedicated heartbeat thread that sends BUSY every
  250ms, independent of the (possibly stalled) write path, so liveness
  keeps flowing during deep stalls. A send_mutex serializes
  ACK/pong/heartbeat writes.
- TCPStreamPusher refreshes last_peer_activity_ns on every inbound frame
  and only declares a connection dead after peer_liveness_timeout (15s) of
  complete silence, tolerating arbitrarily long backpressure while still
  catching a genuinely dead peer (and immediate EPIPE/ECONNRESET).
- Re-key both backpressure waits (the SendAll data path and the post-END
  WaitForEndAck) onto this liveness signal instead of byte-progress /
  DATA-ACK-progress, so a slow final flush at END is tolerated too.
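
A minimal sketch of the peer-liveness rule described above, assuming the caller supplies timestamps from a monotonic nanosecond clock. The class `PeerLiveness` and its methods `OnInboundFrame` and `IsDead` are hypothetical names chosen for illustration, as are the DATA and ACK enum members; only BUSY, last_peer_activity_ns, and the 15s timeout come from this commit message. It is not the actual TCPStreamPusher code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical frame types; only BUSY is named in the commit message.
enum class FrameType : uint8_t { DATA, ACK, BUSY };

// Tracks when the peer was last heard from and decides whether it is dead.
// Byte progress on the send side is deliberately not part of the decision.
class PeerLiveness {
public:
    PeerLiveness(int64_t now_ns, int64_t timeout_ns)
        : last_peer_activity_ns_(now_ns), timeout_ns_(timeout_ns) {
        assert(timeout_ns > 0);
    }

    // Any inbound frame, including a BUSY heartbeat, proves the peer is alive.
    void OnInboundFrame(FrameType /*type*/, int64_t now_ns) {
        last_peer_activity_ns_ = now_ns;
    }

    // Dead only after more than timeout_ns of complete silence.
    bool IsDead(int64_t now_ns) const {
        return now_ns - last_peer_activity_ns_ > timeout_ns_;
    }

private:
    int64_t last_peer_activity_ns_;
    int64_t timeout_ns_;
};
```

With a 15s timeout, a writer stalled on a slow filesystem but still sending BUSY every 250ms never trips `IsDead`, while a peer that goes fully silent is declared dead once the timeout elapses. In this sketch, both backpressure waits would poll the same `IsDead` check instead of measuring bytes sent.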

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 16:55:59 +02:00
leonarski_f 64002f1e29 v1.0.0-rc.129 (#36)
Build Packages / build:rpm (rocky8_nocuda) (push) Successful in 11m14s
Build Packages / build:rpm (ubuntu2204_nocuda) (push) Successful in 10m43s
Build Packages / build:rpm (rocky9_nocuda) (push) Successful in 11m35s
Build Packages / build:rpm (ubuntu2404_nocuda) (push) Successful in 9m20s
Build Packages / build:rpm (rocky8_sls9) (push) Successful in 10m23s
Build Packages / Generate python client (push) Successful in 39s
Build Packages / build:rpm (rocky9_sls9) (push) Successful in 11m24s
Build Packages / Create release (push) Has been skipped
Build Packages / Build documentation (push) Successful in 1m0s
Build Packages / build:rpm (rocky8) (push) Successful in 10m35s
Build Packages / build:rpm (ubuntu2204) (push) Successful in 10m35s
Build Packages / build:rpm (rocky9) (push) Successful in 11m17s
Build Packages / build:rpm (ubuntu2404) (push) Successful in 9m9s
Build Packages / Unit tests (push) Failing after 1h18m57s
This is an UNSTABLE release. It contains significant modifications and bug fixes; if things go wrong, it is better to revert to 1.0.0-rc.124.

* jfjoch_broker: Significant improvements to the TCP image socket, making it a viable alternative to ZeroMQ sockets (only a single port on the broker side, the number of writers can change dynamically, acknowledgments for written files)
* jfjoch_broker: Delta phi is now also calculated for still data in Bragg prediction
* jfjoch_broker: Image pusher statistics are accessible via the REST interface
* jfjoch_writer: Supports the TCP image socket, with an auto-forking option for it

Reviewed-on: #36
Co-authored-by: Filip Leonarski <filip.leonarski@psi.ch>
Co-committed-by: Filip Leonarski <filip.leonarski@psi.ch>
2026-03-05 22:13:12 +01:00
leonarski_f f3e0a15d26 v1.0.0-rc.127 (#34)
Build Packages / build:rpm (rocky8_nocuda) (push) Successful in 10m51s
Build Packages / build:rpm (ubuntu2404_nocuda) (push) Successful in 8m0s
Build Packages / build:rpm (ubuntu2204_nocuda) (push) Successful in 9m6s
Build Packages / build:rpm (rocky9_nocuda) (push) Successful in 10m7s
Build Packages / build:rpm (rocky8_sls9) (push) Successful in 9m47s
Build Packages / Generate python client (push) Successful in 29s
Build Packages / Build documentation (push) Successful in 43s
Build Packages / Create release (push) Has been skipped
Build Packages / build:rpm (rocky9_sls9) (push) Successful in 10m46s
Build Packages / build:rpm (rocky8) (push) Successful in 9m33s
Build Packages / Unit tests (push) Has been skipped
Build Packages / build:rpm (ubuntu2204) (push) Successful in 8m47s
Build Packages / build:rpm (rocky9) (push) Successful in 9m55s
Build Packages / build:rpm (ubuntu2404) (push) Successful in 9m4s
This is an UNSTABLE release. It contains significant modifications and bug fixes; if things go wrong, it is better to revert to 1.0.0-rc.124.

* jfjoch_broker: Default EIGER readout time is 20 microseconds
* jfjoch_broker: Multiple performance improvements
* jfjoch_broker: Image buffer can now track frames that are being prepared and sent
* jfjoch_broker: Dedicated thread for ZeroMQ transmission to better utilize the image buffer
* jfjoch_broker: Experimental implementation of transmission with raw TCP/IP sockets
* jfjoch_writer: Fixes for properly closing files in long data collections
* jfjoch_process: Scale & merge has been significantly improved, but it is not yet integrated into mainstream code

Reviewed-on: #34
2026-03-02 15:57:12 +01:00