stdgpu: Efficient STL-like Data Structures on the GPU

Features | Examples | Documentation | Building | Integration | Contributing | License | Contact

Features

stdgpu is an open-source library providing several generic GPU data structures for fast and reliable data management. Multiple platforms such as CUDA, OpenMP, and HIP are supported, allowing you to rapidly write highly complex agnostic and native algorithms that look like sequential CPU code but are executed in parallel on the GPU.

  • Productivity. Previous libraries such as thrust, VexCL, ArrayFire or Boost.Compute focus on the fast and efficient implementation of various algorithms for contiguously stored data to enhance productivity. stdgpu follows an orthogonal approach and focuses on fast and reliable data management to enable the rapid development of more general and flexible GPU algorithms just like their CPU counterparts.

  • Interoperability. Instead of providing yet another ecosystem, stdgpu is designed to be a lightweight container library. Therefore, a core feature of stdgpu is its interoperability with previously established frameworks, i.e. the thrust library, to enable a seamless integration into new as well as existing projects.

  • Maintainability. Following the trend in recent C++ standards of providing functionality for safer and more reliable programming, the philosophy of stdgpu is to provide clean and familiar functions with strong guarantees that encourage users to write more robust code while giving them full control to achieve a high performance.

At its heart, stdgpu offers the following GPU data structures and containers:

  • atomic & atomic_ref: Atomic primitive types and references
  • bitset: Space-efficient bit array
  • deque: Dynamically sized double-ended queue
  • queue & stack: Container adapters
  • unordered_map & unordered_set: Hashed collections of unique keys and key-value pairs
  • vector: Dynamically sized contiguous array

In addition, stdgpu also provides commonly required functionality in algorithm, bit, cmath, contract, cstddef, functional, iterator, limits, memory, mutex, ranges, utility to complement the GPU data structures and to increase their usability and interoperability.
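
As a minimal sketch of the container lifecycle (assuming the CUDA backend; like all stdgpu containers, vector is created and destroyed explicitly on the host via the createDeviceObject/destroyDeviceObject factory functions):

#include <stdgpu/vector.cuh>    // stdgpu::vector

int main()
{
    // Explicitly create a container with a fixed capacity of 1000 elements
    stdgpu::vector<int> v = stdgpu::vector<int>::createDeviceObject(1000);

    // ... use v in kernels or thrust algorithms ...

    // Explicitly destroy the container again
    stdgpu::vector<int>::destroyDeviceObject(v);

    return 0;
}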

Examples

In order to reliably perform complex tasks on the GPU, stdgpu offers flexible interfaces that can be used both in agnostic code, e.g. via the algorithms provided by thrust, and in native code, e.g. in custom CUDA kernels.

For instance, stdgpu is extensively used in SLAMCast, a scalable live telepresence system, to implement real-time, large-scale 3D scene reconstruction as well as real-time 3D data streaming between a server and an arbitrary number of remote clients.

Agnostic code. In the context of SLAMCast, a simple task is the integration of a range of updated blocks into the duplicate-free set of queued blocks for data streaming which can be expressed very conveniently:

#include <stdgpu/cstddef.h>             // stdgpu::index_t
#include <stdgpu/iterator.h>            // stdgpu::make_device
#include <stdgpu/unordered_set.cuh>     // stdgpu::unordered_set

class stream_set
{
public:
    void
    add_blocks(const short3* blocks,
               const stdgpu::index_t n)
    {
        set.insert(stdgpu::make_device(blocks),
                   stdgpu::make_device(blocks + n));
    }

    // Further functions

private:
    stdgpu::unordered_set<short3> set;
    // Further members
};

Native code. More complex operations such as the creation of the duplicate-free set of updated blocks or other algorithms can be implemented natively, e.g. in custom CUDA kernels with stdgpu's CUDA backend enabled:

#include <stdgpu/cstddef.h>             // stdgpu::index_t
#include <stdgpu/unordered_map.cuh>     // stdgpu::unordered_map
#include <stdgpu/unordered_set.cuh>     // stdgpu::unordered_set

__global__ void
compute_update_set(const short3* blocks,
                   const stdgpu::index_t n,
                   const stdgpu::unordered_map<short3, voxel*> tsdf_block_map,
                   stdgpu::unordered_set<short3> mc_update_set)
{
    // Global thread index
    stdgpu::index_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    short3 b_i = blocks[i];

    // Neighboring candidate blocks for the update
    short3 mc_blocks[8]
    = {
        short3(b_i.x - 0, b_i.y - 0, b_i.z - 0),
        short3(b_i.x - 1, b_i.y - 0, b_i.z - 0),
        short3(b_i.x - 0, b_i.y - 1, b_i.z - 0),
        short3(b_i.x - 0, b_i.y - 0, b_i.z - 1),
        short3(b_i.x - 1, b_i.y - 1, b_i.z - 0),
        short3(b_i.x - 1, b_i.y - 0, b_i.z - 1),
        short3(b_i.x - 0, b_i.y - 1, b_i.z - 1),
        short3(b_i.x - 1, b_i.y - 1, b_i.z - 1),
    };

    for (stdgpu::index_t j = 0; j < 8; ++j)
    {
        // Only consider existing neighbors
        if (tsdf_block_map.contains(mc_blocks[j]))
        {
            mc_update_set.insert(mc_blocks[j]);
        }
    }
}
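
A corresponding host-side launch of this kernel could look as follows (a sketch; the block size of 128 threads is an arbitrary choice, and updated_blocks denotes the device array of the n updated blocks):

// Hypothetical launch configuration for compute_update_set
const stdgpu::index_t threads_per_block = 128;
const stdgpu::index_t num_blocks = (n + threads_per_block - 1) / threads_per_block;

compute_update_set<<<num_blocks, threads_per_block>>>(updated_blocks, n, tsdf_block_map, mc_update_set);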

More examples can be found in the examples directory.

Documentation

A comprehensive introduction to the design and API of stdgpu can be found in the library's documentation.

Since a core feature and design goal of stdgpu is its interoperability with thrust, it offers full support for all thrust algorithms instead of reinventing the wheel. More information about the design can be found in the related research paper.
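
For instance, a thrust algorithm can directly operate on the device range of a container (a small sketch, assuming the device_begin()/device_end() member functions which expose the container range as iterators):

#include <thrust/reduce.h>

#include <stdgpu/vector.cuh>    // stdgpu::vector

// Sum up all elements of a stdgpu::vector using a thrust algorithm
int
sum_elements(stdgpu::vector<int>& v)
{
    return thrust::reduce(v.device_begin(), v.device_end(), 0);
}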

Building

Before building the library, please make sure that all required tools and dependencies are installed on your system. Newer versions than the minimum ones listed are supported as well.

Required

Required for CUDA backend

Required for OpenMP backend

  • OpenMP 2.0
    • GCC 7
      • (Ubuntu 18.04/20.04) Already installed
    • Clang 6
      • (Ubuntu 18.04/20.04) sudo apt install libomp-dev
    • MSVC 19.20
      • (Windows) Already installed

Required for HIP backend (experimental)

The library can be built like any other project that uses the CMake build system.

In addition, we also provide cross-platform scripts to make the build process more convenient. Since these scripts depend on the selected build type, there are scripts for both debug and release builds.

  • sh scripts/setup_<build_type>.sh: Performs a full clean build of the project. Removes the old build, configures the project (build path: ./build), builds the project, and runs the unit tests.
  • sh scripts/build_<build_type>.sh: (Re-)builds the project. Requires that the project is set up.
  • sh scripts/run_tests_<build_type>.sh: Runs the unit tests. Requires that the project is built.
  • sh scripts/install_<build_type>.sh: Installs the project at the configured install path (default: ./bin).
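
For example, a full clean release build including a run of the unit tests can be triggered with sh scripts/setup_release.sh.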

Integration

In the following, we show some examples of how the library can be integrated into and used in a project.

CMake Integration. To use the library in your project, you can either install it externally first and then include it using find_package:

find_package(stdgpu 1.0.0 REQUIRED)

add_library(foo ...)

target_link_libraries(foo PUBLIC stdgpu::stdgpu)

Or you can embed it into your project and build it from a subdirectory:

# Exclude the examples from the build
set(STDGPU_BUILD_EXAMPLES OFF CACHE INTERNAL "")

# Exclude the tests from the build
set(STDGPU_BUILD_TESTS OFF CACHE INTERNAL "")

add_subdirectory(stdgpu)

add_library(foo ...)

target_link_libraries(foo PUBLIC stdgpu::stdgpu)

CMake Options. To configure the library, two sets of options are provided. The following build options control the build process:

  • STDGPU_BACKEND: Device system backend (default: STDGPU_BACKEND_CUDA)
  • STDGPU_BUILD_SHARED_LIBS: Builds the project as a shared library, if set to ON, or as a static library, if set to OFF (default: BUILD_SHARED_LIBS)
  • STDGPU_SETUP_COMPILER_FLAGS: Constructs the compiler flags (default: ON if standalone, OFF if included via add_subdirectory)
  • STDGPU_TREAT_WARNINGS_AS_ERRORS: Treats compiler warnings as errors (default: OFF)
  • STDGPU_BUILD_EXAMPLES: Build the examples (default: ON)
  • STDGPU_BUILD_TESTS: Build the unit tests (default: ON)
  • STDGPU_BUILD_TEST_COVERAGE: Build a test coverage report (default: OFF)
  • STDGPU_ANALYZE_WITH_CLANG_TIDY: Analyzes the code with clang-tidy (default: OFF)
  • STDGPU_ANALYZE_WITH_CPPCHECK: Analyzes the code with cppcheck (default: OFF)

In addition, the implementation of some functionality can be controlled via configuration options:

  • STDGPU_ENABLE_CONTRACT_CHECKS: Enable contract checks (default: OFF if CMAKE_BUILD_TYPE equals Release or MinSizeRel, ON otherwise)
  • STDGPU_USE_32_BIT_INDEX: Use 32-bit instead of 64-bit signed integer for index_t (default: ON)
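
For example, configuring the library with the OpenMP backend and contract checks enabled boils down to a single CMake call: cmake -S . -B build -DSTDGPU_BACKEND=STDGPU_BACKEND_OPENMP -DSTDGPU_ENABLE_CONTRACT_CHECKS=ON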

Contributing

For detailed information on how to contribute, see CONTRIBUTING.

License

Distributed under the Apache 2.0 License. See LICENSE for more information.

If you use stdgpu in one of your projects, please cite the following publications:

stdgpu: Efficient STL-like Data Structures on the GPU

@UNPUBLISHED{stotko2019stdgpu,
    author = {Stotko, P.},
     title = {{stdgpu: Efficient STL-like Data Structures on the GPU}},
      year = {2019},
     month = aug,
      note = {arXiv:1908.05936},
       url = {https://arxiv.org/abs/1908.05936}
}

SLAMCast: Large-Scale, Real-Time 3D Reconstruction and Streaming for Immersive Multi-Client Live Telepresence

@article{stotko2019slamcast,
    author = {Stotko, P. and Krumpen, S. and Hullin, M. B. and Weinmann, M. and Klein, R.},
     title = {{SLAMCast: Large-Scale, Real-Time 3D Reconstruction and Streaming for Immersive Multi-Client Live Telepresence}},
   journal = {IEEE Transactions on Visualization and Computer Graphics},
    volume = {25},
    number = {5},
     pages = {2102--2112},
      year = {2019},
     month = may
}

Contact

Patrick Stotko - [email protected]

Comments
  • Cannot include stdgpu in an existing OpenMP/HIP project

    I'm working on an OpenMP/HIP code and trying to include stdgpu as a subproject. What I need is to

    • compile stdgpu with hipcc/hcc,
    • compile my HIP sources with hipcc/hcc,
    • compile the rest with gcc/clang, and
    • leave all the linking things to gcc/clang

    It seems like stdgpu needs -DCMAKE_CXX_COMPILER=hcc passed to cmake to build the HIP backend, but hcc doesn't support -fopenmp yet. So the OpenMP libraries are missing when find_package(OpenMP) is encountered. Is there any way to work around this?

    I've tried to set the compiler to clang. The compiler complains that -hc is an unknown argument. It looks like this is an hcc argument required by the rocthrust::rocthrust target.

    clang++  -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -Dstdgpu_EXPORTS
    ...
    -isystem /opt/rocm-3.3.0/hip/include -isystem /opt/rocm/include  -stdlib=libc++ -O3 -DNDEBUG -fPIC   -hc -fPIC -std=c++14 -o CMakeFiles/stdgpu.dir/impl/iterator.cpp.o -c /external/stdgpu/src/stdgpu/impl/iterator.cpp
    clang-9: error: unknown argument: '-hc'
    
  • error: redefinition of ‘constexpr const bool stdgpu::numeric_limits<T>::is_specialized’

    Ubuntu 20.04, CUDA 11.4

    The library compiled successfully with the CUDA backend and the examples work. But when adding stdgpu to a small project, the following error occurs during compilation:

    /usr/local/include/stdgpu/impl/limits_detail.h:73:42: error: redefinition of ‘constexpr const bool stdgpu::numeric_limits::is_specialized’

  • Failed to build project with stdgpu. Error: expected unqualified-id before ‘sizeof’

    Hello, I'm trying to embed stdgpu into my project. I wrote the CMake file following the tutorial and the CMake configuration succeeded.

    However, when I tried to build my own project, it failed and raised many errors.

    [ 70%] Built target stdgpu
    [ 80%] Built target foo
    [ 90%] Building CXX object CMakeFiles/parallel_cache.dir/main.cpp.o
    In file included from /home/yanglinzhuo/parallel_cache/stdgpu/src/stdgpu/../stdgpu/platform.h:34:0,
                     from /home/yanglinzhuo/parallel_cache/stdgpu/src/stdgpu/../stdgpu/iterator.h:33,
                     from /home/yanglinzhuo/parallel_cache/stdgpu_test.cuh:5,
                     from /home/yanglinzhuo/parallel_cache/main.cpp:3:
    /home/yanglinzhuo/parallel_cache/stdgpu/src/stdgpu/../stdgpu/cuda/platform.h:48:37: error: expected unqualified-id before ‘sizeof’
         #define STDGPU_CUDA_DEVICE_ONLY sizeof("STDGPU ERROR: Wrong compiler detected! Device-only functions must be compiled with the device compiler!")
                                         ^
    /home/yanglinzhuo/parallel_cache/stdgpu/src/stdgpu/../stdgpu/cuda/platform.h:48:37: note: in definition of macro ‘STDGPU_CUDA_DEVICE_ONLY’
         #define STDGPU_CUDA_DEVICE_ONLY sizeof("STDGPU ERROR: Wrong compiler detected! Device-only functions must be compiled with the device compiler!")
                                         ^~~~~~
    /home/yanglinzhuo/parallel_cache/stdgpu/src/stdgpu/../stdgpu/platform.h:80:34: note: in expansion of macro ‘STDGPU_DETAIL_CAT2_DIRECT’
     #define STDGPU_DETAIL_CAT2(A, B) STDGPU_DETAIL_CAT2_DIRECT(A, B)
                                      ^~~~~~~~~~~~~~~~~~~~~~~~~
    ...
    

    I omit many error lines because they are similar. The main error here is error: expected unqualified-id before ‘sizeof’.

    I'm confused by these errors and have no idea how to fix them.

    Since I'm new to building with CMake, there may be some mistakes in my CMake file.

    The following is my project's structure:

    • CMakeLists.txt
    • main.cpp
    • stdgpu_test.cuh
    • stdgpu/

    And the following is my cmake file:

    cmake_minimum_required(VERSION 3.18)
    
    project(parallel_cache)
    
    set(CMAKE_CXX_STANDARD 14)
    set(CMAKE_CXX_STANDARD_REQUIRED ON)
    
    set(Torch_DIR /usr/local/libtorch/share/cmake/Torch)  # My libtorch path
    find_package(Torch REQUIRED)
    
    # Exclude the examples from the build
    set(STDGPU_BUILD_EXAMPLES OFF CACHE INTERNAL "")
    # Exclude the tests from the build
    set(STDGPU_BUILD_TESTS OFF CACHE INTERNAL "")
    add_subdirectory(stdgpu)
    set_property(TARGET stdgpu PROPERTY CUDA_ARCHITECTURES 60)
    add_library(foo stdgpu)
    set_target_properties(foo PROPERTIES LINKER_LANGUAGE CXX)
    target_link_libraries(foo PUBLIC stdgpu::stdgpu)
    
    add_executable(${PROJECT_NAME} "main.cpp" "stdgpu_test.cuh")
    
    target_link_libraries(parallel_cache PUBLIC "${TORCH_LIBRARIES}")
    target_link_libraries(parallel_cache PUBLIC foo)
    set_property(TARGET parallel_cache PROPERTY CXX_STANDARD 14)
    

    My system configuration is:

    • OS: Ubuntu 18.04
    • Compiler: GCC 7.5.0
    • Backend: CUDA
    • Library version: 1.3.0

    Any help is appreciated. Thanks.

  • Implement at() in terms of operator[] to avoid bounds checks in deque and vector

    The containers in the STL that support random access implement two different functions to access specific elements in the container:

    • operator[]
    • at

    operator[] just accesses the element without bounds checks, while at performs the same operation but with bounds checks, throwing an out_of_range exception when the index is out of bounds.

    I propose that the containers (basically deque and vector) implement at() in terms of operator[], and perform bounds checks only when calling at(). That way they're more compliant with the STL ones. This is my proposed solution for vector (for deque it's the same idea):

    template <typename T>
    inline STDGPU_DEVICE_ONLY typename vector<T>::reference
    vector<T>::at(const vector<T>::index_type n)
    {
        return const_cast<vector<T>::reference>(static_cast<const vector<T>*>(this)->at(n));
    }
    
    
    template <typename T>
    inline STDGPU_DEVICE_ONLY typename vector<T>::const_reference
    vector<T>::at(const vector<T>::index_type n) const
    {
        STDGPU_EXPECTS(0 <= n);
        STDGPU_EXPECTS(n < size());
        STDGPU_EXPECTS(occupied(n));
    
        return this->operator[](n);
    }
    
    
    template <typename T>
    inline STDGPU_DEVICE_ONLY typename vector<T>::reference
    vector<T>::operator[](const vector<T>::index_type n)
    {
        return _data[n];
    }
    
    
    template <typename T>
    inline STDGPU_DEVICE_ONLY typename vector<T>::const_reference
    vector<T>::operator[](const vector<T>::index_type n) const
    {
        return _data[n];
    }
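
    For completeness, the user-facing behavior stays the same under this proposal; device code keeps choosing between the two accessors (an illustrative fragment):

    #include <stdgpu/cstddef.h>     // stdgpu::index_t
    #include <stdgpu/vector.cuh>    // stdgpu::vector

    // Unchecked vs. checked element access in device code
    __device__ int
    read_element(const stdgpu::vector<int>& v, const stdgpu::index_t i)
    {
        int unchecked = v[i];      // operator[]: no bounds checks
        int checked = v.at(i);     // at(): bounds checks via STDGPU_EXPECTS
        return unchecked + checked;
    }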
    
  • Compatibility with Thrust?

    Describe the bug: Current stdgpu seems to be NOT compatible with NVIDIA Thrust?

    Steps to reproduce

    git clone https://github.com/stotko/stdgpu.git
    mkdir build
    cd build
    cmake ../
    

    Expected behavior: Successful build and run.

    Actual behavior: Both GCC and Clang fail.

    • GCC 11.2
    [  1%] Building CXX object src/stdgpu/CMakeFiles/stdgpu.dir/impl/iterator.cpp.o
    cd ....../stdgpu/build/src/stdgpu && /usr/bin/c++ -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -Dstdgpu_EXPORTS -I....../stdgpu/src/stdgpu/.. -I....../stdgpu/build/src/stdgpu/include -isystem /usr/local/cuda/include -fPIC -Wall -pedantic -Wextra -Wshadow -Wsign-compare -Wconversion -Wfloat-equal -Wundef -Wdouble-promotion -MD -MT src/stdgpu/CMakeFiles/stdgpu.dir/impl/iterator.cpp.o -MF CMakeFiles/stdgpu.dir/impl/iterator.cpp.o.d -o CMakeFiles/stdgpu.dir/impl/iterator.cpp.o -c ....../stdgpu/src/stdgpu/impl/iterator.cpp
    In file included from /usr/local/cuda/include/nv/detail/__target_macros:13,
                     from /usr/local/cuda/include/nv/target:195,
                     from /usr/local/cuda/include/cub/detail/device_synchronize.cuh:23,
                     from /usr/local/cuda/include/thrust/system/cuda/detail/util.h:36,
                     from /usr/local/cuda/include/thrust/system/cuda/detail/malloc_and_free.h:26,
                     from /usr/local/cuda/include/thrust/system/detail/adl/malloc_and_free.h:42,
                     from /usr/local/cuda/include/thrust/system/detail/generic/memory.inl:22,
                     from /usr/local/cuda/include/thrust/system/detail/generic/memory.h:69,
                     from /usr/local/cuda/include/thrust/detail/reference.h:28,
                     from ....../stdgpu/src/stdgpu/../stdgpu/iterator.h:30,
                     from ....../stdgpu/src/stdgpu/impl/iterator.cpp:16:
    /usr/local/cuda/include/cub/util_device.cuh: In function ‘cudaError_t cub::PtxVersionUncached(int&)’:
    /usr/local/cuda/include/cub/util_device.cuh:368:15: error: invalid conversion from ‘EmptyKernelPtr’ {aka ‘void (*)()’} to ‘const void*’ [-fpermissive]
      368 |           if (CubDebug(result = cudaFuncGetAttributes(&empty_kernel_attrs,
          |               ^~~~~~~~
          |               |
          |               EmptyKernelPtr {aka void (*)()}
    In file included from /usr/local/cuda/include/thrust/system/cuda/detail/guarded_cuda_runtime_api.h:38,
                     from /usr/local/cuda/include/thrust/system/cuda/detail/malloc_and_free.h:19,
                     from /usr/local/cuda/include/thrust/system/detail/adl/malloc_and_free.h:42,
                     from /usr/local/cuda/include/thrust/system/detail/generic/memory.inl:22,
                     from /usr/local/cuda/include/thrust/system/detail/generic/memory.h:69,
                     from /usr/local/cuda/include/thrust/detail/reference.h:28,
                     from ....../stdgpu/src/stdgpu/../stdgpu/iterator.h:30,
                     from ....../stdgpu/src/stdgpu/impl/iterator.cpp:16:
    /usr/local/cuda/include/cuda_runtime_api.h:4337:125: note:   initializing argument 2 of ‘cudaError_t cudaFuncGetAttributes(cudaFuncAttributes*, const void*)’
     4337 | extern __host__ __cudart_builtin__ cudaError_t CUDARTAPI cudaFuncGetAttributes(struct cudaFuncAttributes *attr, const void *func);
          |                                                                                                                 ~~~~~~~~~~~~^~~~
    make[2]: *** [src/stdgpu/CMakeFiles/stdgpu.dir/build.make:93: src/stdgpu/CMakeFiles/stdgpu.dir/impl/iterator.cpp.o] Error 1
    make[2]: Leaving directory '....../stdgpu/build'
    make[1]: *** [CMakeFiles/Makefile2:318: src/stdgpu/CMakeFiles/stdgpu.dir/all] Error 2
    make[1]: Leaving directory '....../stdgpu/build'
    make: *** [Makefile:149: all] Error 2
    
    • clang 14.0
    [  1%] Building CXX object src/stdgpu/CMakeFiles/stdgpu.dir/impl/iterator.cpp.o
    cd ....../build/src/stdgpu && /usr/bin/clang++ -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -I....../src/stdgpu/.. -I....../build/src/stdgpu/include -isystem /usr/local/cuda/include -Wall -pedantic -Wextra -Wshadow -Wsign-compare -Wconversion -Wfloat-equal -Wundef -Wdouble-promotion -MD -MT src/stdgpu/CMakeFiles/stdgpu.dir/impl/iterator.cpp.o -MF CMakeFiles/stdgpu.dir/impl/iterator.cpp.o.d -o CMakeFiles/stdgpu.dir/impl/iterator.cpp.o -c ....../src/stdgpu/impl/iterator.cpp
    In file included from ....../src/stdgpu/impl/iterator.cpp:16:
    In file included from ....../src/stdgpu/../stdgpu/iterator.h:30:
    In file included from /usr/local/cuda/include/thrust/detail/reference.h:28:
    In file included from /usr/local/cuda/include/thrust/system/detail/generic/memory.h:69:
    In file included from /usr/local/cuda/include/thrust/system/detail/generic/memory.inl:22:
    In file included from /usr/local/cuda/include/thrust/system/detail/adl/malloc_and_free.h:42:
    In file included from /usr/local/cuda/include/thrust/system/cuda/detail/malloc_and_free.h:26:
    In file included from /usr/local/cuda/include/thrust/system/cuda/detail/util.h:38:
    /usr/local/cuda/include/cub/util_device.cuh:368:33: error: no matching function for call to 'cudaFuncGetAttributes'
              if (CubDebug(result = cudaFuncGetAttributes(&empty_kernel_attrs,
                                    ^~~~~~~~~~~~~~~~~~~~~
    /usr/local/cuda/include/cub/util_debug.cuh:115:64: note: expanded from macro 'CubDebug'
        #define CubDebug(e) CUB_NS_QUALIFIER::Debug((cudaError_t) (e), __FILE__, __LINE__)
                                                                   ^
    /usr/local/cuda/include/nv/detail/__target_macros:455:78: note: expanded from macro 'NV_IF_TARGET'
    #  define NV_IF_TARGET(cond, t, ...)    _NV_BLOCK_EXPAND(_NV_TARGET_IF(cond, t, __VA_ARGS__))
                                                                                 ^
    /usr/local/cuda/include/nv/detail/__target_macros:419:74: note: expanded from macro '_NV_TARGET_IF'
    #    define _NV_TARGET_IF(cond, t, ...) _NV_IF( _NV_ARCH_COND_CAT(cond), t, __VA_ARGS__)
                                                                             ^
    note: (skipping 24 expansions in backtrace; use -fmacro-backtrace-limit=0 to see all)
    /usr/local/cuda/include/nv/detail/__preprocessor:83:47: note: expanded from macro '_NV_STRIP_PAREN'
    #define _NV_STRIP_PAREN(...) _NV_STRIP_PAREN1(__VA_ARGS__)
                                                  ^~~~~~~~~~~
    /usr/local/cuda/include/nv/detail/__preprocessor:82:48: note: expanded from macro '_NV_STRIP_PAREN1'
    #define _NV_STRIP_PAREN1(...) _NV_STRIP_PAREN2 __VA_ARGS__
                                                   ^~~~~~~~~~~
    /usr/local/cuda/include/nv/detail/__preprocessor:81:31: note: expanded from macro '_NV_STRIP_PAREN2'
    #define _NV_STRIP_PAREN2(...) __VA_ARGS__
                                  ^~~~~~~~~~~
    /usr/local/cuda/include/cuda_runtime_api.h:4337:58: note: candidate function not viable: no known conversion from 'EmptyKernelPtr' (aka 'void (*)()') to 'const void *' for 2nd argument; take the address of the argument with &
    extern __host__ __cudart_builtin__ cudaError_t CUDARTAPI cudaFuncGetAttributes(struct cudaFuncAttributes *attr, const void *func);
                                                             ^
    1 error generated.
    make[2]: *** [src/stdgpu/CMakeFiles/stdgpu.dir/build.make:93: src/stdgpu/CMakeFiles/stdgpu.dir/impl/iterator.cpp.o] Error 1
    make[2]: Leaving directory '....../build'
    make[1]: *** [CMakeFiles/Makefile2:403: src/stdgpu/CMakeFiles/stdgpu.dir/all] Error 2
    make[1]: Leaving directory '....../build'
    make: *** [Makefile:149: all] Error 2
    

    System (please complete the following information):

    • OS: Ubuntu 22.04
    • Compiler: gcc 11.2 and clang 14.0, both tried
    • Backend: CUDA 11.6
    • Library version: stdgpu 1.3.0 ??
  • build(deps): bump numpy from 1.19.4 to 1.22.0 in /benchmark/benchmark-1.6.1

    Bumps numpy from 1.19.4 to 1.22.0.

    Release notes

    Sourced from numpy's releases.

    v1.22.0

    NumPy 1.22.0 Release Notes

    NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

    • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
    • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across applications such as CuPy and JAX.
    • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
    • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
    • A new configurable allocator for use by downstream projects.

    These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

    The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

    Expired deprecations

    Deprecated numeric style dtype strings have been removed

    Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

    (gh-19539)

    Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

    numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

    (gh-19615)

    ... (truncated)


  • build(deps): bump numpy from 1.19.4 to 1.21.0 in /benchmark/benchmark-1.6.1

    Bumps numpy from 1.19.4 to 1.21.0.

    Release notes

    Sourced from numpy's releases.

    v1.21.0

    NumPy 1.21.0 Release Notes

    The NumPy 1.21.0 release highlights are

    • continued SIMD work covering more functions and platforms,
    • initial work on the new dtype infrastructure and casting,
    • universal2 wheels for Python 3.8 and Python 3.9 on Mac,
    • improved documentation,
    • improved annotations,
    • new PCG64DXSM bitgenerator for random numbers.

    In addition there are the usual large number of bug fixes and other improvements.

    The Python versions supported for this release are 3.7-3.9. Official support for Python 3.10 will be added when it is released.

    Warning: there are unresolved problems compiling NumPy 1.21.0 with gcc-11.1.

    • Optimization level -O3 results in many wrong warnings when running the tests.
    • On some hardware NumPy will hang in an infinite loop.

    New functions

    Add PCG64DXSM BitGenerator

    Uses of the PCG64 BitGenerator in a massively-parallel context have been shown to have statistical weaknesses that were not apparent at the first release in numpy 1.17. Most users will never observe this weakness and are safe to continue to use PCG64. We have introduced a new PCG64DXSM BitGenerator that will eventually become the new default BitGenerator implementation used by default_rng in future releases. PCG64DXSM solves the statistical weakness while preserving the performance and the features of PCG64.

    See upgrading-pcg64 for more details.

    (gh-18906)

    Expired deprecations

    • The shape argument of numpy.unravel_index cannot be passed as dims keyword argument anymore. (Was deprecated in NumPy 1.16.)

    ... (truncated)

    Commits
    • b235f9e Merge pull request #19283 from charris/prepare-1.21.0-release
    • 34aebc2 MAINT: Update 1.21.0-notes.rst
    • 493b64b MAINT: Update 1.21.0-changelog.rst
    • 07d7e72 MAINT: Remove accidentally created directory.
    • 032fca5 Merge pull request #19280 from charris/backport-19277
    • 7d25b81 BUG: Fix refcount leak in ResultType
    • fa5754e BUG: Add missing DECREF in new path
    • 61127bb Merge pull request #19268 from charris/backport-19264
    • 143d45f Merge pull request #19269 from charris/backport-19228
    • d80e473 BUG: Removed typing for == and != in dtypes
    • Additional commits viewable in compare view


  • Build succeeds but cannot include package and build

    Hi there,

    I have been able to build stdgpu under Windows with MSVC 2019 and CUDA 10.2 with no problems. The example projects work fine. However, taking the installed library and using it in a test setup produces some errors (the same is true for the add_subdirectory route). My CMakeLists.txt is:

    cmake_minimum_required(VERSION 3.1)
    set (CMAKE_CXX_STANDARD 14)
    
    project(VoxelGrid LANGUAGES CXX CUDA)
    
    file(GLOB srcfiles 
    ${PROJECT_SOURCE_DIR}/src/*.h   
    ${PROJECT_SOURCE_DIR}/src/*.cpp
    )
    include_directories(${PROJECT_SOURCE_DIR}/src)
    
    set(stdgpu_DIR ${PROJECT_SOURCE_DIR}/3rdParty/stdgpu/lib/cmake/stdgpu)
    find_package(stdgpu 1.0.0 REQUIRED)
    
    add_executable(VoxelGridTest exe/main.cpp ${srcfiles})
    target_link_libraries(VoxelGridTest PUBLIC stdgpu::stdgpu)
    

    In main.cpp I copied the unordered_map example. The build fails with:

     FAILED: CMakeFiles/VoxelGridTest.dir/exe/main.cpp.obj 
      C:\PROGRA~2\MICROS~2\2019\COMMUN~1\VC\Tools\MSVC\1427~1.291\bin\Hostx64\x64\cl.exe  /nologo /TP -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -I..\..\src -I..\..\3rdParty\stdgpu\include -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\include" /DWIN32 /D_WINDOWS /W3 /GR /EHsc /MD /Zi /O2 /Ob1 /DNDEBUG   -std:c++14 /showIncludes /FoCMakeFiles\VoxelGridTest.dir\exe\main.cpp.obj /FdCMakeFiles\VoxelGridTest.dir\ /FS -c ..\..\exe\main.cpp
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(140): error C2059: syntax error: 'sizeof'
      C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu/atomic.cuh(332): note: see reference to class template instantiation 'stdgpu::atomic<T>' being compiled
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(141): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(150): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(152): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(160): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(162): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(171): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(171): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(172): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(180): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(180): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(181): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(189): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(189): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(190): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(198): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(198): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(199): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(207): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(207): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(208): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(217): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(217): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(218): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(226): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(226): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(227): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(235): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(235): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(236): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(244): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(244): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(245): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(253): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(253): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(254): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(261): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(261): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(262): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(269): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(269): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(270): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(277): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(277): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(278): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(287): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(287): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(288): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(296): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(296): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(297): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(305): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(305): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(306): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(314): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(314): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(315): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(323): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(323): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(324): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(422): error C2059: syntax error: 'sizeof'
      C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu/atomic.cuh(615): note: see reference to class template instantiation 'stdgpu::atomic_ref<T>' being compiled
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(423): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(432): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(434): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(442): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(444): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(453): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(453): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(454): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(462): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(462): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(463): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(471): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(471): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(472): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(480): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(480): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(481): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(489): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(489): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(490): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(499): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(499): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(500): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(508): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(508): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(509): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(517): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(517): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(518): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(526): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(526): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(527): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(535): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(535): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(536): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(543): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(543): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(544): error C2238: unexpected token(s) preceding ';'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(551): error C2988: unrecognizable template declaration/definition
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(551): error C2059: syntax error: 'sizeof'
    C:\dev\VoxelGrid\3rdParty\stdgpu\include\stdgpu\atomic.cuh(551): fatal error C1003: error count exceeds 100; stopping compilation
    

    We had a similar issue under Linux. Thanks in advance for the support!

  • bitset: Improve performance of count function

    The count() function of the bitset container used two distinct reductions to compute the number of set bits. The first one considers all bits in the first n - 1 blocks, whereas the second one works on each of the remaining bits in the n-th block, i.e. the last block. Since the reduction is memory-bound anyway, the second reduction introduces a large overhead while the first one still has room for additional compute operations. Merge both reductions into a single one that computes a bit mask for each block on the fly before counting the bits per block. This leads to a decent performance improvement for count() and all related functions.
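
    The on-the-fly masking idea can be sketched as follows (illustrative only, not the actual stdgpu implementation; a 64-bit block type is assumed):

    #include <stdgpu/cstddef.h>    // stdgpu::index_t

    // Every block i contributes popcount(bits[i] & block_mask(i, ...)), so a
    // single reduction over all blocks suffices. The mask keeps all bits,
    // except for the last block where only the valid bits are retained.
    unsigned long long
    block_mask(const stdgpu::index_t i,
               const stdgpu::index_t number_blocks,
               const stdgpu::index_t bits_in_last_block)
    {
        const stdgpu::index_t bits_per_block = 64;
        return (i == number_blocks - 1 && bits_in_last_block < bits_per_block)
               ? (1ULL << bits_in_last_block) - 1ULL
               : ~0ULL;
    }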

    Before:

    -------------------------------------------------------------------------
    Benchmark                               Time             CPU   Iterations
    -------------------------------------------------------------------------
    stdgpu_bitset_count/1000            0.050 ms        0.051 ms        13822
    stdgpu_bitset_count/1000000         0.056 ms        0.056 ms        12476
    stdgpu_bitset_count/1000000000      0.733 ms        0.733 ms          956
    stdgpu_bitset_all/1000              0.050 ms        0.050 ms        13879
    stdgpu_bitset_all/1000000           0.056 ms        0.056 ms        12538
    stdgpu_bitset_all/1000000000        0.736 ms        0.734 ms          954
    stdgpu_bitset_any/1000              0.050 ms        0.050 ms        13841
    stdgpu_bitset_any/1000000           0.056 ms        0.056 ms        12165
    stdgpu_bitset_any/1000000000        0.734 ms        0.734 ms          954
    stdgpu_bitset_none/1000             0.050 ms        0.051 ms        13811
    stdgpu_bitset_none/1000000          0.056 ms        0.056 ms        12428
    stdgpu_bitset_none/1000000000       0.735 ms        0.734 ms          954
    

    After:

    -------------------------------------------------------------------------
    Benchmark                               Time             CPU   Iterations
    -------------------------------------------------------------------------
    stdgpu_bitset_count/1000            0.026 ms        0.026 ms        26891
    stdgpu_bitset_count/1000000         0.031 ms        0.031 ms        22745
    stdgpu_bitset_count/1000000000      0.580 ms        0.580 ms         1208
    stdgpu_bitset_all/1000              0.026 ms        0.026 ms        26856
    stdgpu_bitset_all/1000000           0.031 ms        0.031 ms        22795
    stdgpu_bitset_all/1000000000        0.580 ms        0.580 ms         1207
    stdgpu_bitset_any/1000              0.026 ms        0.026 ms        26940
    stdgpu_bitset_any/1000000           0.031 ms        0.031 ms        22814
    stdgpu_bitset_any/1000000000        0.580 ms        0.580 ms         1207
    stdgpu_bitset_none/1000             0.026 ms        0.026 ms        26945
    stdgpu_bitset_none/1000000          0.032 ms        0.032 ms        22725
    stdgpu_bitset_none/1000000000       0.581 ms        0.580 ms         1185
    
  • mutex: Improve performance of valid function

    The valid() function of mutex_array tests each bit individually and performs a reduction to check if all mutexes are unlocked. Use the count() function of bitset instead to simplify the code. This also significantly improves the performance for large container sizes due to the more efficient reduction of the bits in bitset.
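
    Conceptually, the simplified check boils down to the following (a sketch; the free function is illustrative and not part of the library):

    #include <stdgpu/bitset.cuh>    // stdgpu::bitset

    // All mutexes are unlocked exactly when no lock bit is set
    inline bool
    all_unlocked(const stdgpu::bitset& lock_bits)
    {
        return lock_bits.count() == 0;
    }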

    Before:

    ----------------------------------------------------------------------
    Benchmark                            Time             CPU   Iterations
    ----------------------------------------------------------------------
    stdgpu_mutex_valid/1000          0.027 ms        0.027 ms        25450
    stdgpu_mutex_valid/100000        0.051 ms        0.051 ms        12299
    stdgpu_mutex_valid/10000000      0.695 ms        0.694 ms          991
    

    After:

    ----------------------------------------------------------------------
    Benchmark                            Time             CPU   Iterations
    ----------------------------------------------------------------------
    stdgpu_mutex_valid/1000          0.049 ms        0.049 ms        13793
    stdgpu_mutex_valid/100000        0.050 ms        0.050 ms        13565
    stdgpu_mutex_valid/10000000      0.302 ms        0.301 ms         2316
    
  • General: Add benchmarks

    Performance optimizations are crucial to maintain the usability and enable a broader adoption of the library. So far, there has been no systematic mechanism to assess the benefits of potential performance improvements, so most previous optimizations targeted easy-to-spot bottlenecks with obvious effects. Add a dependency on the benchmark library and introduce a set of performance tests for our containers.

  • Unordered_maps with complex containers

    Don't know if this feature already exists, but do unordered_maps support types other than <int, int>, like <int, stdgpu::unordered_set>? Similarly, can unordered_set support pair<int, int> using a hashing function from boost::hash<pair<int, int>>, or vectors using your own hashing function for vectors?

    This library is excellent btw! Solves so many issues with support for STL like containers for GPU :)
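
    As a sketch of the custom-hash direction (assuming the containers accept a user-provided hash functor as a template parameter, analogous to the STL; the pair_hash functor below is purely illustrative):

    #include <cstddef>

    #include <thrust/pair.h>

    #include <stdgpu/platform.h>           // STDGPU_HOST_DEVICE
    #include <stdgpu/unordered_set.cuh>    // stdgpu::unordered_set

    // Illustrative hash functor for pairs of ints
    struct pair_hash
    {
        inline STDGPU_HOST_DEVICE std::size_t
        operator()(const thrust::pair<int, int>& p) const
        {
            // Simple hash combination, for demonstration only
            return static_cast<std::size_t>(p.first) * 31 + static_cast<std::size_t>(p.second);
        }
    };

    using pair_set = stdgpu::unordered_set<thrust::pair<int, int>, pair_hash>;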

  • Any chance to support spirv as a backend?

    Is your feature request related to a problem? Please describe. I'd like to use this in a hardware-vendor-agnostic way, and more specifically I'd like to use it with WebGPU. Any plans to support this in the future? An OpenCL subset can compile to SPIR-V, so that could be another option?

  • cmake: Backend-specific targets

    The backend system is currently restricted to build and install the library only for a single backend.

    Current behavior:

    • Set STDGPU_BACKEND to either STDGPU_BACKEND_CUDA (default) or STDGPU_BACKEND_OPENMP to control which backend will be used.
    • Build target stdgpu::stdgpu for the particular choice of STDGPU_BACKEND. Other backends will not be considered at all.

    Proposed behavior:

    • Set STDGPU_ENABLE_<BACKEND> where <BACKEND> is one of CUDA, OPENMP.
    • Build backend targets stdgpu::<BACKEND> for each enabled backend using the backend-specific settings and dependency checks.
    • Define stdgpu::stdgpu as an alias target to stdgpu::<BACKEND> serving as a default which can be controlled via STDGPU_BACKEND to match current behavior.

    This will make the system more flexible and allow users to choose freely between all enabled backends in their projects rather than being globally restricted to a single choice. Note that linking to more than one backend at the same time will be considered undefined behavior/an ODR violation.

    Furthermore, if only a single backend should be used at all times, this intent can also be expressed more clearly by linking to stdgpu::<BACKEND> rather than the configuration-dependent stdgpu::stdgpu target.

  • Header-only and GPU architecture independence

    In contrast to boost, thrust and others, stdgpu is not a header-only library and, hence, requires shipping a compiled library. The following modules currently require source file compilation:

    • bitset (resolved): Contains host-only functions which also contain code executed on the device.
    • iterator: Only contains a wrapper function to hide the dependency on memory from the header.
    • limits: Contains the definition of static member variables.
    • memory: Both the general as well as the backend-specific parts handle the allocation and memcpy parts in the sources. (The global variables that needed to be converted to proper singletons have been resolved.)
    • mutex (resolved): Contains host-only functions which also contain code executed on the device.

    Inlining bitset and mutex will make the library independent of the required GPU architecture, e.g. the compute capability set for CUDA. Even if we decide not to go for header-only, achieving architecture independence might be a good compromise.

  • Container: resize and copy support

    Up to now, the container classes have a fixed capacity and are created using the non-standard createDeviceObject factory function. Furthermore, since ease of use in GPU kernels is considered a key feature, the copy constructors are currently restricted to perform only shallow copies rather than deep copies. This behavior makes the container still feel non-standard and unintuitive to some degree, especially for new users.

    In order to fix both issues, the design of the copy operations needs to be revised to match the STL more closely. At first glance, this seems to be an easy task:

    1. Define the copy constructors and copy assignment operators to perform deep copies.
    2. Provide a reference_wrapper<T> class which can be used on the GPU.

    However, objects (or at least their states) need to be copied from CPU to GPU memory in order to allow for the proper execution of an operation. Since we want to make the containers work for as many backends and use cases as possible, we cannot make any assumptions how this transfer will be performed or whether this really requires calling the copy constructor or not. reference_wrapper<T> does not solve this problem since it points to the original object which lives in CPU memory.

    Therefore, the current proposal would be:

    1. Provide a shallow_copy_wrapper<T> class (suggestions for a better name are welcome) which wraps the object state. This class is copyable such that the object state can easily be passed to the GPU, similar to reference_wrapper<T>. However, if the state of the original object changes, e.g. due to a resize operation, this change will not be visible or propagated to the wrapper, invalidating it. Thus, we trade object consistency for GPU support.
    2. Define the copy constructors and copy assignment operators to perform deep copies, but restrict them to be callable from the host only.
    3. Clearly document that shallow_copy_wrapper<T> is only intended to allow crossing memory boundaries and to enable container usage on the GPU. For CPU usage, std::reference_wrapper<T> should be used instead if required.
    4. Deprecate/remove the createDeviceObject and destroyDeviceObject factory functions.

    This change will break existing usage within kernels and thrust algorithms (functors). A reasonable transition strategy would be to introduce shallow_copy_wrapper<T> in the last minor release of version 1 (which might be 1.3.0) and to provide an option to disable the copy constructors and copy assignment operators. This way, users can start porting to the new copy model and will only need to move away from the factory functions in version 2.0.0.
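    A minimal sketch of how shallow_copy_wrapper<T> could look (illustrative only; the final design may differ):

        #include <cstring>  // std::memcpy

        template <typename T>
        class shallow_copy_wrapper
        {
        public:
            // Capture the raw object state (device pointers, capacity, ...)
            // by a bitwise copy, deliberately bypassing T's deep copy
            // constructor, so the wrapper can be passed to kernels by value.
            explicit shallow_copy_wrapper(const T& object)
            {
                std::memcpy(&storage_, &object, sizeof(T));
            }

            // Valid only as long as the original object is neither resized
            // nor destroyed -- the consistency trade-off described above.
            __host__ __device__ T&
            get()
            {
                return *reinterpret_cast<T*>(&storage_);
            }

        private:
            alignas(T) unsigned char storage_[sizeof(T)];
        };

    A kernel would then take, e.g., a shallow_copy_wrapper<stdgpu::vector<int>> by value instead of the container itself.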

  • bitset: Class name and API

    Our bitset class is a GPU version of std::bitset which, however, is designed to cover more use cases. In particular, its interface and implementation (run-time fixed-size) sit somewhere between std::bitset (compile-time fixed-size) and boost::dynamic_bitset (run-time dynamic-size). This may lead to confusion if users expect exactly the same API as std::bitset.

    There are several ways to address this issue:

    • Rename it to dynamic_bitset and extend its API to match boost as closely as possible.
    • Rename it to vector<bool> and change/extend its API to match vector as closely as possible.
    • Keep the name and extend its API towards boost's version, i.e. considering only the non-dynamic-sized functions.

    At the moment, the last option seems to be a good compromise. However, it does not fully resolve the potential for user confusion. Since any of these options will break the API, this change is considered for stdgpu 2.0.0.
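    For comparison, the three size models look as follows (a sketch; the factory functions shown are part of the current, non-standard API discussed above, and exact signatures may differ):

        #include <bitset>
        #include <boost/dynamic_bitset.hpp>
        #include <stdgpu/bitset.cuh>

        void
        size_models()
        {
            std::bitset<128> a;                // compile-time fixed size
            boost::dynamic_bitset<> b(128);    // run-time dynamic size
            b.resize(256);                     // ...can grow and shrink

            // run-time fixed size: chosen at creation, immutable afterwards
            stdgpu::bitset c = stdgpu::bitset::createDeviceObject(128);
            stdgpu::bitset::destroyDeviceObject(c);
        }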
