Parallel clustering library

Fast connected component analysis for hybrid pixel detectors

About

This project implements parallel clustering libraries for CPU and GPU, intended for hybrid pixel detectors. By clustering, we mean connected-component analysis with respect to spatial and temporal pixel coordinates. Namely, it implements the algorithms described here, published in the Journal of Instrumentation.
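To make the connectivity criterion concrete, the sketch below shows one possible form of the neighbor predicate: two hits are connected if they are adjacent in the pixel matrix and close enough in time, and clusters are the connected components of the resulting graph. This is only an illustration with hypothetical names (the struct hit and the function are_neighbors are not part of the library's API), and the exact spatial neighborhood used by the library may differ from the 8-connectivity shown here.

#include <cstdint>
#include <cstdlib>

// Illustrative hit record; the field names are hypothetical.
struct hit
{
    uint16_t x, y;   // spatial pixel coordinates
    int64_t toa_ns;  // time of arrival in nanoseconds
};

// Two hits are neighbors if they are spatially adjacent (8-neighborhood here)
// and their arrival times differ by at most the configured join time
// (cf. the max_cluster_join_time_ns parameter in the example configuration below).
bool are_neighbors(const hit& a, const hit& b, int64_t max_cluster_join_time_ns)
{
    return std::abs(int(a.x) - int(b.x)) <= 1
        && std::abs(int(a.y) - int(b.y)) <= 1
        && std::abs(a.toa_ns - b.toa_ns) <= max_cluster_join_time_ns;
}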

This project provides free access to the source code for non-commercial purposes on request. For access, do not hesitate to contact us. We kindly ask users to cite the corresponding article, should any of the described parallel clustering methods be applied in your work.

For potential applications, improvements, ideas, bugs or questions, please contact me by email either at celko(at)ksvi.mff.cuni.cz or at tomas.celko(at)cvut.cz; I will be happy to help.

Supported hardware

Currently we support clustering of Timepix3 and Timepix4 hits in data-driven mode. Support for frame-based mode and multiple-detector configurations is planned. Support for other similar detectors or modes is a matter of demand; feel free to let us know about possible applications.

News

15.12.2025 - Parallel clustering 1.0 (Unstable) - Release of the new tile-based clustering algorithm. The release also includes GPU-side attribute computation: cluster energy, energy histograms, aggregation into frames, and energy-based filtering.

29.7.2025 - The latest Windows build was deployed to the website and the example CMake script was updated.

29.7.2025 - Benchmarks confirm that the tile-based algorithm outperforms the current one by better utilizing the GPU compute power. Small tiles are stored in low-latency memory, which enabled further optimizations. As for the actual performance on an RTX 4070 Ti Super, the throughput increase ranged from around 30% for the smallest clusters to more than 100% for large ion clusters. The release is planned by the end of the year.

4.4.2025 - A new CUDA-based parallel clustering algorithm with possibly significantly lower memory usage was designed; implementation is expected during May 2025. This parallel algorithm could enable a much higher degree of parallelization, but it is best to wait for the implementation and the subsequent benchmarks.

28.1.2025 - Tested multiple combinations of parameters and fixed the bugs found. Refactored the code to remove the "CUDA separable compilation" (= device code linking) option, which increased the clustering throughput (by ~10-15%) back to the values listed in the published paper.

2.1.2025 - Added a Windows installer. Our GitLab Linux distribution CI/CD pipelines currently seem to be broken and are therefore not updated; the available versions of the .deb files should still be functional.

28.12.2024 - Extended the installation documentation.

26.12.2024 - Baseline testing of all prebuilt libraries has passed; we are continuing with parameter testing.

19.12.2024 - We are currently working on finalizing the first (alpha) release of the GPU package.

CPU parallel clustering

We implemented CPU-based parallel clustering directly as part of the Tracklab software, where it is available for use.

Development of the standalone package is in progress.

GPU parallel clustering

I. Requirements

The project targets Linux-based and Windows platforms. Since the implementation is written in CUDA, a CUDA-capable device is required to run it.

Prerequisites to link against the prebuilt library:

  • Linux or Windows x86/64 platform (for other platforms, contact us about building from source)
  • CUDA-capable device, compute capability >= 6.7
  • NVIDIA GPU Computing Toolkit >= 12.4 (and a compatible NVIDIA driver; check with the nvidia-smi command)
  • CMake >= 3.19
  • C++ compiler compatible with the C++17 standard
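
To verify the device requirement programmatically, a minimal standalone check using the CUDA runtime API (independent of this library) might look as follows:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // query the number of CUDA-capable devices visible to the runtime
    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess || device_count == 0)
    {
        std::printf("No CUDA-capable device found.\n");
        return 1;
    }
    // print the name and compute capability of the first device
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    std::printf("Device 0: %s, compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}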

II. Installation

We provide multiple options for installing our library.

Option 1: From an installer file (.exe/.deb)

  • download a suitable installer file
  • (optional) for a .deb file, check the Lintian output and make sure there are no errors
  • (important) make sure all prerequisites are met - this is not checked by the package
  • run the installer

Option 2: From a prebuilt .zip package

  • download the suitable version for your platform from the table below
  • extract the zip file to the desired location. On Linux, you may want to copy the contents of the include folder to /usr/include/ and the contents of the lib folder to /usr/lib/. On Windows, you may want to copy the contents of the extracted clusterer_cuda folder to C:/Program Files. The next step might not be required if the files were placed in standard locations like /usr/lib
  • let CMake know the path to clusterer_cuda-config.cmake - either set CMAKE_PREFIX_PATH to the directory where clusterer_cuda-config.cmake is located (e.g. cmake -DCMAKE_PREFIX_PATH="C:/Program Files/clusterer_cuda" ..) or add that directory to the PATH environment variable.

Option 3: From source

  • Request access to the repository by email - free for non-commercial applications, assuming proper citation of the relevant article
  • Create a build folder and generate the build files with cmake .. or similar
  • Build with cmake --build . --config Release

If you struggle with installation, feel free to reach out by email.

III. Linking

Option 1: Use find_package from your CMakeLists.txt

We consider this to be the most convenient option. Based on the value of the CLUSTERER_CUDA_USE_STATIC variable, clusterer_cuda-config.cmake sets the variables CLUSTERER_CUDA_INCLUDE_DIR and CLUSTERER_CUDA_LIBRARY. An example part of a CMake script can be found below:


cmake_minimum_required(VERSION 3.18)
project(clusterer_test LANGUAGES CUDA CXX)

enable_language(CUDA)
find_package(CUDAToolkit REQUIRED)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED True)

# We can set this variable if we link dynamically and want the .dll/.so file copied to the binary folder.
# Note: if it is not set, we need to make sure the .dll/.so file is findable by the system.
set(CMAKE_BINARY_DIR_COPY_DLL "${CMAKE_CURRENT_LIST_DIR}/build/Release")

set(CLUSTERER_CUDA_USE_STATIC ON)  # static vs. dynamic linking; if building in debug mode, switching this to OFF can help
find_package(clusterer_cuda REQUIRED)
message(THESE_VARIABLES_SHOULD_BE_INITIALIZED: ${CLUSTERER_CUDA_FOUND} ${CLUSTERER_CUDA_INCLUDE_DIR} ${CLUSTERER_CUDA_LIBRARY})
add_executable(clusterer_test "src/main.cpp")
target_include_directories(clusterer_test PRIVATE ${CLUSTERER_CUDA_INCLUDE_DIR})
target_link_libraries(clusterer_test PRIVATE 
    ${CLUSTERER_CUDA_LIBRARY} 
    CUDA::cudart_static # choose either cudart_static or cudart
    CUDA::nvrtc      
)

Option 2: Set include and library paths manually

Another option is to bypass CMake's find_package completely and set CLUSTERER_CUDA_INCLUDE_DIR and CLUSTERER_CUDA_LIBRARY manually. The same applies to non-CMake projects, such as plain Visual Studio solutions: in Visual Studio, go to Configuration Properties > C/C++ > General and set Additional Include Directories, and similarly, under Configuration Properties > Linker > General, set Additional Library Directories.

IV. Example use (up-to-date with version 1.0):

#include "data_flow/external_dataflow_controller.cuh"
#include "data_structs/clustered_data.h"
using namespace clustering;
void test_clustering_tpx3()
{
  // define function callback to receive clustered data
  auto output_callback = [](clustered_data<tpx3_hit> data)
  {
    std::cout << "Returned with " << data.size << " hits" << std::endl;
    std::cout << "First hit: " << data.x[0] << data.y[0] << data.toa[0] << data.tot[0] << data.label[0] << std::endl;
    // Since 1.0 the actual toa is 4B, which means i-th pixel's time is offset encoded as data.toa_offset + data.toa[i] 
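    // For example, the absolute toa of the first pixel can be reconstructed
    // from the offset encoding described in the comment above:
    const auto absolute_toa_first = data.toa_offset + data.toa[0];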
    // Since version 1.0 we can also query the attributes
    std::cout << "First cluster size:" << data.attributes.cluster_sizes[0] << std::endl;
    std::cout << "First cluster energy:" << data.attributes.cluster_energy[0] << std::endl;
    const uint8_t output_index = 0; // select output stream of frames
    std::cout << "First pixel total energy in the frame:" << data.attributes.cluster_energy_2d_map[output_index][0] <<std::endl;
    std::cout << "Index of the first pixel in the frame:" << data.attributes.cluster_energy_2d_map_indices[output_index][0] <<std::endl; // If pixel index has value j then j = frame_idx * sensor_matrix_size + pixel_matrix_coordinate (zero suppressed sparse encoding) -> frame_idx and pixel matrix idx can be computed as follows: 
    // frame_idx  = j div sensor_matrix_size
    // pixel_matrix_idx = j mod sensor_matrix_size
    // Note: the number of elements of data.attributes.cluster_energy_2d_map is data.attributes.nonzero_pixel_matrix_count_clustered
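    // Worked example of the decoding above; sensor_matrix_size is a hypothetical constant here
    // (e.g. 256 * 256 for a single 256 x 256 pixel matrix):
    const uint64_t sensor_matrix_size = 256ull * 256ull;
    const auto j = data.attributes.cluster_energy_2d_map_indices[output_index][0];
    const auto frame_idx = j / sensor_matrix_size;
    const auto pixel_matrix_idx = j % sensor_matrix_size;
    std::cout << "First pixel frame index: " << frame_idx << ", pixel matrix index: " << pixel_matrix_idx << std::endl;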

    /* Process the data here in the callback and make a copy, as the data pointers
       (data.x, data.y, data.toa, data.tot) might not be valid after the callback returns.
       Note: data.toa is uint64_t (19 decimal digits of precision) - toa in nanoseconds or even
       smaller units, controlled by the toa_ns_decimal_digits parameter. */
  };

  //initialize controller with arguments
  external_dataflow_controller<tpx3_hit> controller_tpx3(node_args::load_tpx3_args(/*pass algorithm parameters here in a config file*/ "config.txt"), output_callback);
  //run the controller; after that we are ready to receive data
  controller_tpx3.run();
 
  //process hits in a loop, example
  controller_tpx3.process_hit(0 /*pixel_idx*/, 2 /*coarse toa*/, 3 /*fine toa*/, 4/*tot*/);
  //Note: in versions < 1.0 please call controller_tpx3.input()->process_hit() instead

  //after a sufficient amount of hits has been processed, the callback is called
  ...
  //stop the dataflow, flush the not-yet-processed hits, and call the callback one last time
  controller_tpx3.close();
}

To use the clustering library with Timepix4 data, follow a similar approach:

#include "data_flow/external_dataflow_controller.cuh"
#include "data_structs/clustered_data.h"
using namespace clustering;
void test_clustering_tpx4()
{
  // define function callback to receive clustered data
  auto output_callback = [](clustered_data<tpx4_hit> data){...};
  
  //initialize controller with arguments; for the definition of output_callback, see the Timepix3 example above
  external_dataflow_controller<tpx4_hit> controller_tpx4(node_args::load_tpx4_args(/*pass algorithm parameters here in a config file*/ "config.txt"), output_callback);
  //run the controller; after that we are ready to receive data
  controller_tpx4.run();

  //process hits in a loop; notice the addition of the ultra-fine toa
  controller_tpx4.process_hit(0 /*pixel_idx*/, 2 /*coarse toa*/, 3 /*fine toa*/, 4 /*ultra fine toa*/, 5 /*tot*/);
  //Note: in versions < 1.0 please call controller_tpx4.input()->process_hit() instead
  //after a sufficient amount of hits has been processed, the callback is called
  ...
  //stop the dataflow, flush the not-yet-processed hits, and call the callback one last time
  controller_tpx4.close();
}

And finally, an example of what the config.txt file might look like:

//(i) Configuration for clustering - mandatory

max_hitrate_mhz = 300 // maximum possible hitrate that can occur during the max_unsortedness period of time
max_unsortedness_mus = 500 // maximum unsortedness (out-of-order window) of the hits on the input, in microseconds
max_cluster_join_time_ns = 300 // maximum time difference of hits to be considered neighboring
buffer_size = 7000000 // host and device buffer size
host_buffer_count = 8 // number of pinned-memory buffers allocated on the host
cuda_streams_per_worker = 4 // number of cuda streams in each clustering worker
cuda_thread_count = 7680 // degree of parallelization, use carefully based on the buffer_size and available hardware
cuda_min_launch_stream_count = 4 // minimum number of buffers required to run the clustering
cuda_clustering_threads_per_block = 128 // number of threads in a single cuda thread block
print_info_dt_ms = 500 // frequency of printing information about the dataflow
toa_ns_decimal_digits = 1 // decimal digits of toa: 0 = 1ns, 1 = 0.1ns, 2 = 0.01ns ...
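// e.g. with toa_ns_decimal_digits = 1, toa values are stored in units of 0.1 ns, so a stored value of 12345 corresponds to 1234.5 ns (illustrative)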
cuda_init_threads_per_block = 256 // threads per block used for initialization of auxiliary data structures
_label_cache_size = 32 // size of the label cache in shared memory
_hit_data_cache_size = 12 // size of the hit data cache, beware of the available shared memory on a GPU

// (ii) configuration for attributes - optional (applicable since 1.0)

discard_non_attribute_data = false // remove x, y, toa, tot and label data and only preserve attributes - reduces copying load
attribute_energy = true // convert tot clock ticks to keV
attribute_cluster_size = true // compute the pixel count for each cluster explicitly (otherwise implicit from the labels field)
attribute_cluster_energy = true // compute sum of energy of each pixel in each cluster
attribute_cluster_energy_2d_map = true // compute energy-weighted centroid of clusters and aggregate them into the 2D frame further defined by section (iii)

attribute_cluster_energy_spectrum = true // compute the cluster energy histogram; if true, also set the lower and upper bounds as well as the bin count
attribute_cluster_energy_spectrum_min = 0 // lower bound of the histogram (first bin start)
attribute_cluster_energy_spectrum_max = 100 // upper bound of the histogram (last bin)
attribute_cluster_energy_spectrum_bin_count = 50 // number of histogram bins

// (iii) configuration for attributes - optional, applies to cluster_energy_2d_map only (applicable since 1.0)

attribute_cluster_energy_filter_1 = 0:5 // for each energy filter interval we get an output sequence of frames; arbitrarily many such intervals can be defined, but each adds a non-trivial performance overhead (0:5 means include clusters in the range from 0 keV to 5 keV)
attribute_cluster_energy_filter_2 = 5:10 // same as filter_1; don't forget to number the filters in a sequence 1, 2, 3... without gaps
attribute_cluster_energy_filter_3 = 10:20 // same as filter_1
//...
attribute_frame_length_s = 1 // frame length in seconds; can be a floating-point number. If attribute_cluster_energy_2d_map = true, the frame length must be specified

// (iv) extra metadata required for attribute computation - required if energy-based attribute is to be computed (applicable since 1.0)

calibration_folder = D:/path/to/calibration/coefficients/H07-W0052[CdTe]/calib_pars // path to the calibration folder containing the a.txt, b.txt, c.txt and t.txt calibration files

V. Prebuilt libraries available for download:

Platform                           | GPU library, prebuilt (Stable, 0.1)    | GPU library, prebuilt (Latest, 1.0)
Windows (installer)                | Download clusterer_cuda_installer.exe  | Download clusterer_cuda_installer.exe
Windows (archive)                  | Download clusterer_cuda_win.zip        | Download clusterer_cuda_win.zip
Other Unix-based systems (archive) | Download clusterer_cuda.zip            | Download clusterer_cuda.zip
Ubuntu 22.04                       | Download clusterer_cuda_ubuntu2204.deb | Download clusterer_cuda_ubuntu2204.deb
Ubuntu 24.04                       | Download clusterer_cuda_ubuntu2404.deb | Download clusterer_cuda_ubuntu2404.deb
Debian 12                          | Download clusterer_cuda_debian12.deb   | Download clusterer_cuda_debian12.deb

References