Cooperative primitives for CUDA C++.


About CUB

CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model:

(Figure: Orientation of collective primitives within the CUDA software stack)

CUB is included in the NVIDIA HPC SDK and the CUDA Toolkit.

We recommend the CUB Project Website for further information and examples.



A Simple Example

#include <cub/cub.cuh>

// Block-sorting CUDA kernel
__global__ void BlockSortKernel(int *d_in, int *d_out)
{
     using namespace cub;

     // Specialize BlockRadixSort, BlockLoad, and BlockStore for 128 threads
     // owning 16 integer items each
     typedef BlockRadixSort<int, 128, 16>                     BlockRadixSort;
     typedef BlockLoad<int, 128, 16, BLOCK_LOAD_TRANSPOSE>   BlockLoad;
     typedef BlockStore<int, 128, 16, BLOCK_STORE_TRANSPOSE> BlockStore;

     // Allocate shared memory
     __shared__ union {
         typename BlockRadixSort::TempStorage  sort;
         typename BlockLoad::TempStorage       load;
         typename BlockStore::TempStorage      store;
     } temp_storage;

     int block_offset = blockIdx.x * (128 * 16);    // Offset for this block's segment

     // Obtain a segment of 2048 consecutive keys that are blocked across threads
     int thread_keys[16];
     BlockLoad(temp_storage.load).Load(d_in + block_offset, thread_keys);
     __syncthreads();

     // Collectively sort the keys
     BlockRadixSort(temp_storage.sort).Sort(thread_keys);
     __syncthreads();

     // Store the sorted segment
     BlockStore(temp_storage.store).Store(d_out + block_offset, thread_keys);
}

Each thread block uses cub::BlockRadixSort to collectively sort its own input segment. The class is specialized by the data type being sorted, by the number of threads per block, by the number of keys per thread, and implicitly by the targeted compilation architecture.

The cub::BlockLoad and cub::BlockStore classes are similarly specialized. Furthermore, to provide coalesced accesses to device memory, these primitives are configured to access memory using a striped access pattern (where consecutive threads simultaneously access consecutive items) and then transpose the keys into a blocked arrangement of elements across threads.

Once specialized, these classes expose opaque TempStorage member types. The thread block uses these storage types to statically allocate the union of shared memory needed by the thread block. (Alternatively these storage types could be aliased to global memory allocations).
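
For reference, a host-side launch of the example kernel might look like the following. This is a minimal sketch, not part of the original example: the grid size, allocation sizes, and input initialization are illustrative assumptions.

// Hypothetical host-side driver for the BlockSortKernel defined above
// (assumes the kernel definition is in the same translation unit).
// Each of the 64 blocks sorts its own tile of 128 * 16 = 2048 keys.
int main()
{
    const int tile_size  = 128 * 16;   // keys per thread block
    const int num_blocks = 64;         // arbitrary grid size for illustration
    const int num_items  = num_blocks * tile_size;

    int *d_in  = nullptr;
    int *d_out = nullptr;
    cudaMalloc(&d_in,  num_items * sizeof(int));
    cudaMalloc(&d_out, num_items * sizeof(int));

    // ... fill d_in with unsorted keys ...

    // Launch one 128-thread block per 2048-key tile
    BlockSortKernel<<<num_blocks, 128>>>(d_in, d_out);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}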



Supported Compilers

CUB is regularly tested using the specified versions of the following compilers. Unsupported versions may emit deprecation warnings, which can be silenced by defining CUB_IGNORE_DEPRECATED_COMPILER during compilation.

  • NVCC 11.0+
  • NVC++ 20.9+
  • GCC 5+
  • Clang 7+
  • MSVC 2019+ (19.20/16.0/14.20)
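
For example, a build that intentionally uses an unsupported compiler might silence the diagnostic like this (a sketch; defining the macro on the compiler command line with -DCUB_IGNORE_DEPRECATED_COMPILER is equivalent):

// Suppress CUB's unsupported-compiler deprecation warning.
// Must be defined before any CUB header is included.
#define CUB_IGNORE_DEPRECATED_COMPILER

#include <cub/cub.cuh>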



Releases

CUB is distributed with the NVIDIA HPC SDK and the CUDA Toolkit, and is also available on GitHub.

See the changelog for details about specific releases.

CUB Release                 Included In
1.14.0                      NVIDIA HPC SDK 21.9
1.13.1                      CUDA Toolkit 11.5
1.13.0                      NVIDIA HPC SDK 21.7
1.12.1                      CUDA Toolkit 11.4
1.12.0                      NVIDIA HPC SDK 21.3
1.11.0                      CUDA Toolkit 11.3
1.10.0                      NVIDIA HPC SDK 20.9 & CUDA Toolkit 11.2
1.9.10-1                    NVIDIA HPC SDK 20.7 & CUDA Toolkit 11.1
1.9.10                      NVIDIA HPC SDK 20.5
1.9.9                       CUDA Toolkit 11.0
1.9.8-1                     NVIDIA HPC SDK 20.3
1.9.8                       CUDA Toolkit 11.0 Early Access
1.8.0
1.7.5                       Thrust 1.9.2
1.7.4                       Thrust 1.9.1-2
1.7.3
1.7.2
1.7.1
1.7.0                       Thrust 1.9.0-5
1.6.4
1.6.3
1.6.2 (previously 1.5.5)
1.6.1 (previously 1.5.4)
1.6.0 (previously 1.5.3)
1.5.2
1.5.1
1.5.0
1.4.1
1.4.0
1.3.2
1.3.1
1.3.0
1.2.3
1.2.2
1.2.0
1.1.1
1.0.2
1.0.1
0.9.4
0.9.2
0.9.1
0.9.0



Development Process

CUB and Thrust depend on each other. It is recommended to clone Thrust and build CUB as a component of Thrust.

CUB uses the CMake build system to build unit tests, examples, and header tests. To build CUB as a developer, follow this recipe:

# Clone Thrust and CUB from Github. CUB is located in Thrust's
# `dependencies/cub` submodule.
git clone --recursive https://github.com/NVIDIA/thrust.git
cd thrust

# Create build directory:
mkdir build
cd build

# Configure -- use one of the following:
cmake -DTHRUST_INCLUDE_CUB_CMAKE=ON ..   # Command line interface.
ccmake -DTHRUST_INCLUDE_CUB_CMAKE=ON ..  # ncurses GUI (Linux only)
cmake-gui  # Graphical UI, set source/build directories and options in the app

# Build:
cmake --build . -j    # invokes make (or ninja, etc.)

# Run tests and examples:
ctest

By default, the C++14 standard is targeted, but this can be changed in CMake. More information on configuring your CUB build and creating a pull request can be found in CONTRIBUTING.md.



Open Source License

CUB is available under the "New BSD" open-source license:

Copyright (c) 2010-2011, Duane Merrill.  All rights reserved.
Copyright (c) 2011-2018, NVIDIA CORPORATION.  All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
   *  Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
   *  Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
   *  Neither the name of the NVIDIA CORPORATION nor the
      names of its contributors may be used to endorse or promote products
      derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


Owner

NVIDIA Corporation



Comments
  • Add BLOCK_LOAD_STRIPED and BLOCK_STORE_STRIPED

    This PR adds the following to BlockLoadAlgorithm:

    1. BLOCK_LOAD_STRIPED, which is basically BLOCK_LOAD_TRANSPOSE without the BlockExchange step

    and the following to BlockStoreAlgorithm:

    1. BLOCK_STORE_STRIPED, which is basically BLOCK_STORE_TRANSPOSE without the BlockExchange step
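
    As a rough illustration (a sketch assuming the enumerators added by this PR; the kernel name and tile shape are made up), a striped load/store keeps items in a striped arrangement across threads and skips the shared-memory transpose entirely:

    #include <cub/cub.cuh>

    // Copies tiles of 128 * 4 items. Data is loaded and stored in a striped
    // arrangement, so no BlockExchange / shared-memory transpose is involved.
    __global__ void StripedCopyKernel(const int *d_in, int *d_out)
    {
        using BlockLoadT  = cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_STRIPED>;
        using BlockStoreT = cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_STRIPED>;

        __shared__ union {
            typename BlockLoadT::TempStorage  load;
            typename BlockStoreT::TempStorage store;
        } temp_storage;

        int tile_offset = blockIdx.x * (128 * 4);
        int items[4];

        BlockLoadT(temp_storage.load).Load(d_in + tile_offset, items);
        __syncthreads();
        BlockStoreT(temp_storage.store).Store(d_out + tile_offset, items);
    }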
  • Support future value for initial value for device scan

    This prototypes support for a device pointer as the initial value of a CUB device scan.

    Example Usage:

    cub::DeviceScan::ExclusiveScan(d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, cub::FutureValue<float>(ptr), num_items);
    

    Tests pass for this PR, but there is no documentation yet. I will add doc if @allisonvacanti thinks this is a good idea.

    Please review https://github.com/NVIDIA/thrust/pull/1519 for thrust change.
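    A fuller sketch of how this might be used, assuming the FutureValue interface proposed in this PR (the function name and types are illustrative):

    #include <cub/cub.cuh>

    // Exclusive min-scan whose initial value lives in device memory (d_init)
    // and is read when the scan runs, not when it is enqueued.
    void ExclusiveMinScan(const float *d_in, float *d_out, const float *d_init,
                          int num_items, cudaStream_t stream = 0)
    {
        void  *d_temp_storage     = nullptr;
        size_t temp_storage_bytes = 0;

        // First call: query temporary storage requirements.
        cub::DeviceScan::ExclusiveScan(d_temp_storage, temp_storage_bytes,
                                       d_in, d_out, cub::Min(),
                                       cub::FutureValue<float>(d_init),
                                       num_items, stream);
        cudaMalloc(&d_temp_storage, temp_storage_bytes);

        // Second call: run the scan.
        cub::DeviceScan::ExclusiveScan(d_temp_storage, temp_storage_bytes,
                                       d_in, d_out, cub::Min(),
                                       cub::FutureValue<float>(d_init),
                                       num_items, stream);
        cudaFree(d_temp_storage);
    }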

  • Static links to tag releases changed

    Hi @brycelelbach, something changed when you revised the release logs yesterday. We used to point to https://github.com/NVlabs/cub/archive/v1.8.0.tar.gz, but now the link is https://github.com/NVlabs/cub/archive/1.8.0.tar.gz (the "v" is missing). I think this is a breaking change. Any chance it could be reverted?

  • Faster Least Significant Digit Radix Sort Implementation

    • radix sort with decoupled look-back, 8 bits per pass and other optimizations
    • pull request to the previous CUB repository: https://github.com/brycelelbach/cub_historical_2019_2020/pull/26
  • Document that cub::DeviceRadixSort and cub::BlockRadixSort are stable

    There is nothing in the documentation at https://nvlabs.github.io/cub/structcub_1_1_device_radix_sort.html saying whether it is a stable sort or not. It seems to be stable. If so, it would be great if this fact were mentioned in the documentation.
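
    For context, "stable" means that pairs with equal keys keep their input order. A small sketch of what that guarantee would mean in practice (the wrapper function below is hypothetical):

    #include <cub/cub.cuh>

    // Sorts (key, value) pairs. With a stable sort, the two pairs that share
    // key 5 in the example below must keep their relative order.
    void SortPairsStable(const int *d_keys_in, int *d_keys_out,
                         const int *d_values_in, int *d_values_out, int num_items)
    {
        void  *d_temp_storage     = nullptr;
        size_t temp_storage_bytes = 0;

        cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                        d_keys_in, d_keys_out,
                                        d_values_in, d_values_out, num_items);
        cudaMalloc(&d_temp_storage, temp_storage_bytes);
        cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                        d_keys_in, d_keys_out,
                                        d_values_in, d_values_out, num_items);
        cudaFree(d_temp_storage);
    }

    // Example: keys = {5, 3, 5}, values = {0, 7, 1}
    // Stable output: keys = {3, 5, 5}, values = {7, 0, 1}  (0 stays before 1)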

  • cub::DeviceReduce::ReduceByKey() results are non-deterministic for floats

    @allisonvacanti @senior-zero

    cub::DeviceReduce::ReduceByKey web page describes "run-to-run" determinism for addition of floating point types, but the result looks wrong to me. Is it an expected behavior or a bug?

    In my limited testing, I got run-to-run differences from the code below with the CUDA 11.6 SDK. Note that you need to run the program multiple times; the error only shows up occasionally. This does not match the documentation, which promises "run-to-run" determinism.

    Note: this issue is mainly for cub, there is a similar issue for thrust (https://github.com/NVIDIA/thrust/issues/1621)

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    
    #include <thrust/copy.h>
    #include <thrust/fill.h>
    #include <thrust/sequence.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>
    #include <iostream>
    
    #include <cub/cub.cuh>   // or equivalently <cub/device/device_reduce.cuh>
    
    
    int main() {
       auto const numElements = 250000;
       thrust::device_vector<double> data(numElements, 0.1);
       thrust::device_vector<double> keys(numElements, 1);
    
       thrust::device_vector<double> keys_out1(numElements);
       thrust::device_vector<double> keys_out2(numElements);
    
       thrust::device_vector<double> out1(numElements);
       thrust::device_vector<double> out2(numElements);
    
       thrust::host_vector<double> hostOut1(numElements);
       thrust::host_vector<double> hostOut2(numElements);
    
       thrust::device_vector<int> num_runs_out(1);
    
       // first run
       {
          size_t temp_storage_bytes = 0;
          cub::DeviceReduce::ReduceByKey(
             nullptr, temp_storage_bytes,
             keys.begin(), keys_out1.begin(),
             data.begin(), out1.begin(),
             num_runs_out.begin(),
             thrust::plus<double>(), numElements);
          thrust::device_vector<char> d_temp_storage(temp_storage_bytes);
          cub::DeviceReduce::ReduceByKey(
             d_temp_storage.data().get(), temp_storage_bytes,
             keys.begin(), keys_out1.begin(),
             data.begin(), out1.begin(),
             num_runs_out.begin(),
             thrust::plus<double>(), numElements);
          // copy out1 to the host
          thrust::copy(out1.begin(), out1.begin() + num_runs_out[0], hostOut1.begin());
       }
    
       // second run
       {
          size_t temp_storage_bytes = 0;
          cub::DeviceReduce::ReduceByKey(
             nullptr, temp_storage_bytes,
             keys.begin(), keys_out2.begin(),
             data.begin(), out2.begin(),
             num_runs_out.begin(),
             thrust::plus<double>(), numElements);
          thrust::device_vector<char> d_temp_storage(temp_storage_bytes);
          cub::DeviceReduce::ReduceByKey(
             d_temp_storage.data().get(), temp_storage_bytes,
             keys.begin(), keys_out2.begin(),
             data.begin(), out2.begin(),
             num_runs_out.begin(),
             thrust::plus<double>(), numElements);
          // copy out2 to the host
          thrust::copy(out2.begin(), out2.begin() + num_runs_out[0], hostOut2.begin());
       }
    
       // Check the outputs are exactly the same
       for (int i = 0; i < num_runs_out[0]; i++) {
          if (hostOut1[i] != hostOut2[i]) {
             std::cout << "Element " << i << " is not equal" << std::endl;
          }
       }
    
       return 0;
    }
    
  • New segmented sort algorithm

    This PR includes a new segmented sort facility. A few approaches to this problem exist.

    Embed segment number into keys

    This approach provides an elegant solution to the load-balancing issue but can lead to slowdowns. It is also not applicable when the number of bits needed to represent the segment number exceeds the number of bits available in the keys.

    Modified merge sort approach

    This idea is implemented in ModernGPU. I've used this approach as a reference for comparison with the new segmented sort algorithm. As shown below, it can be outperformed in most cases.

    Kernel specialisation

    The idea behind this approach is to partition the input segments into size groups. Specialised kernels can then process each size group. The LRB approach initially discussed in the issue falls into this category. This is also the approach that the new segmented sort algorithm relies on.

    I'm going to briefly describe the genesis of the new segmented sort algorithm to justify some design decisions.

    To minimise the number of kernel specialisations, I benchmarked different approaches to sorting small segments (under a few hundred items): single-thread odd-even sorting, warp-scope bitonic sorting, and the newly added warp merge sort. The warp-scope merge sort demonstrated some advantages: it can sort bigger segments and outperforms the other methods in the majority of cases.

    Warp-scope merge sort is included in this PR as a separate facility. It's possible to partition an architectural warp into multiple virtual warps to sort multiple segments simultaneously. The warp-scope merge sort duplicated a significant part of the previously introduced block-scope merge sort, so I extracted the merge-sort strategy into a separate facility; both the warp and block sorts share this strategy.

    Here's the speedup of warp-scope merge sort over warp-bitonic sort (figure omitted).

    And the speedup of warp-scope merge sort over single-thread odd-even sort (figure omitted).

    In the figures above, I vary the segment sizes and the number of segments.

    To further increase the performance of warp-scope merge sort, I needed to load and store data efficiently, which required warp-scope load, store, and exchange facilities. These facilities are also provided in this PR.

    Using a proper sorting algorithm was not enough. Initially, I assigned a CUDA thread block to each segment. Although this approach demonstrated a speedup over the existing one, it led to inefficient resource utilisation because most threads were idle. Nonetheless, a kernel like this is used as a fallback solution when there are not enough segments: if idle threads don't block other CTAs from execution, there's no reason to spend time on segment partitioning. The fallback kernel helped eliminate cases where the partitioning stage led to an overall slowdown of the new algorithm.


    Initially, I implemented a single kernel for all size groups. Depending on the CTA count, I allocated a different number of threads per segment. That is, if the segment size exceeded a few thousand items, I used the slow block-scope radix sort. If the data was about a thousand items and fit into shared memory, I used in-shared-memory block-scope radix sort. In both of these cases, the whole CTA was assigned to a single segment. If the CTA count exceeded the number of large segments, I partitioned the CTA into multiple warp groups, each processing a separate segment. It turned out that the large-segment branch limited the occupancy of the small-segment one, so I separated this kernel into two. One kernel processes large segments and contains two branches - in-shared-memory sort and the slow block-scope sort. The other kernel processes medium and small segments.

    To overlap small- and large-segment processing, I used concurrent kernels. This PR contains a single-stream implementation, though; a multi-stream API is out of the scope of this PR and might be introduced later.

    Segments partitioning

    The LRB approach discussed in the initial issue balances segments in a non-deterministic way. In some cases, it led to slowdowns because consecutive sub-warps might not end up processing consecutive segments. Here's the speedup of the LRB approach compared to the partitioning-based approach (figure omitted).

    I've tried applying LRB only to the large-segment group. This also leads to mixed performance results. In rare cases, when an extremely large segment is located at the tail of the large-segment group, LRB leads to performance improvements. Otherwise, there are enough segments to overlap their processing, and the LRB stage leads to slowdowns. Therefore, we decided to make LRB opt-in. An API with pre-partitioned segments is going to be implemented later. Here's the speedup of the version where I apply LRB to the large-segment group (figure omitted).

    Instead of LRB, I've implemented a three-way partition facility, which is included in this PR. It leads to deterministic partitioning of segments and might be used outside of the segmented sort. The three-way partitioning stage is faster than the LRB stage.

    Temporary storage layout

    The temporary storage layout of the new segmented sort is quite complex. To simplify the code and make it safer, I've introduced temporary storage layout wrappers, which can be found in this PR.

    Performance

    To benchmark the new algorithm on real-world data, I converted a sparse-matrix collection into segment sizes: the number of segments equals the number of rows in a sparse matrix, while each segment's size equals the number of non-zero values in that row. Here's the speedup of the new algorithm compared to cub::DeviceSegmentedRadixSort on an A100 GPU while sorting pairs of std::uint32_t and std::uint64_t (figure omitted).

    | speedup | new algorithm | ModernGPU |
    | ------- | ------------- | --------- |
    | min     | 1.11          | 0.16      |
    | median  | 61.24         | 7.93      |
    | max     | 5314.44       | 2219.73   |

    The speedup depends on the number of segments (figure omitted).
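
    Assuming the device-level entry point follows the existing cub::DeviceSegmentedRadixSort convention of begin/end offset arrays, usage of the new facility might look roughly like this sketch:

    #include <cub/cub.cuh>

    // Sorts each segment of d_keys independently. Segment i spans
    // [d_begin_offsets[i], d_end_offsets[i]).
    void SegmentedSortKeys(const unsigned *d_keys_in, unsigned *d_keys_out,
                           int num_items, int num_segments,
                           const int *d_begin_offsets, const int *d_end_offsets)
    {
        void  *d_temp_storage     = nullptr;
        size_t temp_storage_bytes = 0;

        cub::DeviceSegmentedSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                           d_keys_in, d_keys_out,
                                           num_items, num_segments,
                                           d_begin_offsets, d_end_offsets);
        cudaMalloc(&d_temp_storage, temp_storage_bytes);
        cub::DeviceSegmentedSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                           d_keys_in, d_keys_out,
                                           num_items, num_segments,
                                           d_begin_offsets, d_end_offsets);
        cudaFree(d_temp_storage);
    }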

  • Bug in ExclusiveSum

    The following code prints

    0 100 -128 0
    

    NOTE: It overflows!!!!

    nvcc --version
    

    prints

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2021 NVIDIA Corporation
    Built on Sun_Mar_21_19:15:46_PDT_2021
    Cuda compilation tools, release 11.3, V11.3.58
    Build cuda_11.3.r11.3/compiler.29745058_0
    
    #include <cstdint>
    #include <iostream>
    #include <vector>
    
    #include "cub/cub.cuh"
    
    int main() {
      std::vector<int8_t> v = {100, 28, 1};
    
      int8_t *src;
      // ignore error check
      cudaMalloc((void **)&src, 3);
      cudaMemcpy(src, v.data(), 3, cudaMemcpyHostToDevice);
    
      int32_t *dst;
      cudaMalloc((void **)&dst, 4 * sizeof(int32_t));
    
      void *d_temp_storage = nullptr;
      size_t temp_storage_bytes = 0;
      cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, src, dst,
                                    3);
      cudaMalloc(&d_temp_storage, temp_storage_bytes);
      cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, src, dst,
                                    3);
    
      int32_t h[4];
      cudaMemcpy(h, dst, 4 * sizeof(int32_t), cudaMemcpyDeviceToHost);
    
      for (int32_t i = 0; i != 4; ++i) {
        std::cout << h[i] << " ";
      }
    
      std::cout << "\n";
      // ignore memory leak
      return 0;
    }
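
    One possible workaround (a sketch, not the resolution of this issue): widen the int8_t inputs to int32_t before they reach the scan, e.g. with cub::TransformInputIterator, so the running sum is accumulated in 32 bits:

    #include <cstdint>
    #include <cub/cub.cuh>

    // Converts each int8_t element to int32_t on the fly.
    struct WidenToInt32
    {
        __host__ __device__ int32_t operator()(int8_t x) const { return x; }
    };

    void ExclusiveSumWidened(const int8_t *d_src, int32_t *d_dst, int num_items)
    {
        cub::TransformInputIterator<int32_t, WidenToInt32, const int8_t *>
            widened(d_src, WidenToInt32{});

        void  *d_temp_storage     = nullptr;
        size_t temp_storage_bytes = 0;
        cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes,
                                      widened, d_dst, num_items);
        cudaMalloc(&d_temp_storage, temp_storage_bytes);
        cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes,
                                      widened, d_dst, num_items);
        cudaFree(d_temp_storage);
    }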
    
  • 64-bit Offsets in DeviceRadixSort

    • 64-bit OffsetT is supported for onesweep sorting
    • for decoupled look-back, the partition kernel is broken into smaller parts (as before), and a separate 32-bit type is used there
    • for histograms, 32-bit counters are used in shared memory and OffsetT-sized counters in global memory
  • Faster segmented sorting (and segmented problems in general)

    Segmented problems can suffer from workload imbalance if the distribution of segment sizes varies. The workload imbalance becomes even more challenging when the number of segments is very large.

    The attached zip file (lrb_sort.cuh.zip) has an efficient segmented sort that is based on NVIDIA's CUB library. Most of the sorting is done using CUB's sorting functionality. The key thing this segmented sorting algorithm offers is simple and efficient load balancing.

    The solution used for the segmented sort is applicable to other segmented problems.

    Details of the algorithm (and performance) can be found in the following paper.

  • Issue using cub reduce on more elements than fit into a 4-byte integer

    Reduce (void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, int num_items, ReductionOpT reduction_op, T init, cudaStream_t stream=0, bool debug_synchronous=false)

    My issue seems to be that num_items is of type int, so when I try to reduce more elements than fit into a 4-byte integer, it overflows and the code obviously doesn't work properly. Given that GPUs are growing in RAM size and that we can now oversubscribe memory by using cudaSharedMalloc, are there any plans to change that parameter to accept size_t?
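
    Until the offset type changes, one common workaround (a sketch; plain summation is assumed, and the chunk size is arbitrary) is to reduce fixed-size chunks that fit in an int and then reduce the per-chunk partial results:

    #include <algorithm>
    #include <cstddef>
    #include <cub/cub.cuh>

    // Sums more than INT_MAX elements by reducing chunks on the device and
    // then reducing the (small) array of per-chunk partial sums.
    double ChunkedSum(const double *d_in, std::size_t num_items)
    {
        const std::size_t chunk      = std::size_t(1) << 30;  // < INT_MAX
        const std::size_t num_chunks = (num_items + chunk - 1) / chunk;

        double *d_partials = nullptr;
        double *d_result   = nullptr;
        cudaMalloc(&d_partials, num_chunks * sizeof(double));
        cudaMalloc(&d_result, sizeof(double));

        // Size the temporary storage for the largest reduction we will run.
        void  *d_temp_storage     = nullptr;
        size_t temp_storage_bytes = 0;
        cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                               d_in, d_partials, static_cast<int>(chunk));
        cudaMalloc(&d_temp_storage, temp_storage_bytes);

        for (std::size_t c = 0; c < num_chunks; ++c)
        {
            std::size_t offset = c * chunk;
            int count = static_cast<int>(std::min(chunk, num_items - offset));
            cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                                   d_in + offset, d_partials + c, count);
        }

        // Final pass over the per-chunk partial sums.
        cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                               d_partials, d_result,
                               static_cast<int>(num_chunks));

        double h_result = 0.0;
        cudaMemcpy(&h_result, d_result, sizeof(double), cudaMemcpyDeviceToHost);

        cudaFree(d_temp_storage);
        cudaFree(d_result);
        cudaFree(d_partials);
        return h_result;
    }

    Note that chunked accumulation can change floating-point rounding relative to a single full-length reduction.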

  • Better error message for no GPU or incompatible GPU

    This gives the user some clue about what's happening if the program is run on a node with no GPU, or if it was compiled for a different compute capability than the one it's running on. In both scenarios, no good error message was produced before. The proposed changes improve the user experience and make it easier for users to troubleshoot problems.

    This fix addresses issue #1785 reported on Thrust: https://github.com/NVIDIA/thrust/issues/1785

  • Merge sort key type selection

    Before porting to CUB, the Thrust implementation of merge sort didn't have a *Copy version. When introducing the Copy overload, I followed the generic CUB scheme of selecting the output iterator's value type. We have since changed the type selection for most algorithms so that it no longer follows this scheme; the merge sort implementation, however, wasn't changed. This leads to compilation failures in some cases. @harrism provided the following reproducer:

    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    
    #include <thrust/iterator/transform_output_iterator.h>
    
    #include <cub/device/device_merge_sort.cuh>
    
    template <typename Tuple>
    struct trajectory_comparator {
      __device__ bool operator()(Tuple const& lhs, Tuple const& rhs)
      {
        auto lhs_id = thrust::get<0>(lhs);
        auto rhs_id = thrust::get<0>(rhs);
        auto lhs_ts = thrust::get<1>(lhs);
        auto rhs_ts = thrust::get<1>(rhs);
        return (lhs_id < rhs_id) || ((lhs_id == rhs_id) && (lhs_ts < rhs_ts));
      };
    };
    
    
    struct foo {
        float x;
        float y;
    };
    
    struct bar {
        float x;
        float y;
    };
    
    struct foo_to_bar {
        bar operator()(foo const& f) { return bar{f.x, f.y}; }
    };
    
    struct double_foo {
        foo operator()(foo const& f) { return foo{f.x * 2.0f, f.y * 2.0f}; }
    };
    
    
    int main(void)
    {
      thrust::device_vector<int> keys(1000);
      thrust::sequence(keys.begin(), keys.end());
    
      thrust::device_vector<foo> values(1000);
    
      thrust::device_vector<int> keys_out(1000, 0);
      
      #if 0 // transform iterator input and output types the same
      thrust::device_vector<foo> foo_values_out(1000);
      auto values_transformed_out = thrust::make_transform_output_iterator(foo_values_out.begin(), double_foo{});
      #else // transform iterator input and output types different
      thrust::device_vector<bar> bar_values_out(1000);
      auto values_transformed_out = thrust::make_transform_output_iterator(bar_values_out.begin(), foo_to_bar{});
      #endif
    
      std::size_t temp_storage_bytes = 0;
      cub::DeviceMergeSort::SortPairsCopy(nullptr,
                                          temp_storage_bytes,
                                          keys.begin(),
                                          values.begin(),
                                          keys_out.begin(),
                                          values_transformed_out,
                                          1000,
                                          thrust::less<int>{});
    
      void* temp_storage = nullptr;
      cudaMalloc(&temp_storage, temp_storage_bytes);
    
      cub::DeviceMergeSort::SortPairsCopy(temp_storage,
                                          temp_storage_bytes,
                                          keys.begin(),
                                          values.begin(),
                                          keys_out.begin(),
                                          values_transformed_out,
                                          1000,
                                          thrust::less<int>{});
    
      return 0;
    }
    

    The issue is related to the fact that we use the output iterator's value type to instantiate the block load facility and then provide input iterator values that are not accepted by the block load member functions. We have to investigate whether changing the key/value type selection to be based on the input iterators is possible.

  • Wrap launch bounds

    This PR addresses the following issue by replacing __launch_bounds__ usages with CUB_DETAIL_LAUNCH_BOUNDS. CUB_DETAIL_LAUNCH_BOUNDS leads to __launch_bounds__ usage only when RDC is not specified. Builds without RDC are not affected by this PR. For builds with RDC, the max performance differences are:

    | facility | type | diff |
    | -------- | ---- | ---- |
    | cub::DeviceSelect::If (complex predicate) | All | 0% |
    | cub::DeviceSelect::If | U32 | -9% |
    | cub::DeviceSelect::If | U64 | 0% |
    | cub::DeviceSegmentedReduce::Sum | U8 | 1% |
    | cub::DeviceSegmentedReduce::Sum | U64 | -10% |
    | cub::DeviceSegmentedRadixSort::SortPairs | U{8,16} | 4% |
    | cub::DeviceSegmentedRadixSort::SortPairs | U{32,64} | -32% |
    | cub::DeviceSegmentedRadixSort::SortKeys | U{8,16} | 8% |
    | cub::DeviceSegmentedRadixSort::SortKeys | U{32,64} | -25% |
    | cub::DeviceScan::InclusiveSum | All | 0% |
    | cub::DeviceScan::InclusiveSum - complex op | F32 | -7% |
    | cub::DeviceScan::ExclusiveSum | All | 0% |
    | cub::DeviceReduce::Reduce - custom op | All | -8% |
    | cub::DeviceReduce::Reduce | U8 | 20% |
    | cub::DeviceReduce::Reduce | U32 | 8% |
    | cub::DeviceReduce::Reduce | F64 | -4% |
    | cub::DevicePartition::If | All | -4% |
    | cub::DeviceHistogram::HistogramRange - A lot of bins | U64 | 30% |
    | cub::DeviceHistogram::HistogramRange | All | 0% |
    | cub::DeviceHistogram::HistogramEven - A lot of bins | U32 | -20% |
    | cub::DeviceHistogram::HistogramEven | All | 0% |
    | cub::DeviceAdjacentDifference | All | 0% |

    Negative diff means a speedup for the version without __launch_bounds__. Since the results are quite mixed, I wouldn't like to advertise the macro as part of our API. If absolutely needed, one might define:

    #define CUB_DETAIL_LAUNCH_BOUNDS(...) \
      __launch_bounds__(__VA_ARGS__)
    
    #include <cub/cub.cuh>
    

    But for now it's an implementation detail that fixes compilation with RDC in some corner cases. Going forward, we might consider a tuning API that would control the __launch_bounds__ specification as well as pragma unroll usage. The default tuning would be a function of the input types.

  • New indirection level for launch bounds

    Specifying __launch_bounds__ in the presence of RDC has proven to be troublesome and unreliable. We have to abstract it out so that launch bounds are not specified when RDC is enabled.

  • Fix `BlockRadixRankMatchEarlyCounts` or constrain it

    Currently, BlockRadixRankMatchEarlyCounts doesn't work in some specific cases, namely when (1 << RADIX_BITS) % WARP_THREADS != 0. This case should either be addressed, or the structure should be complemented with a static assert that validates its template arguments (see the sketch below). To make the structure conforming, we might as well provide BlockDimY and BlockDimZ template parameters.
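
    The suggested guard could be as small as the following sketch (placed inside BlockRadixRankMatchEarlyCounts; RADIX_BITS and WARP_THREADS are the parameters named above):

    // Reject configurations the current implementation cannot handle,
    // instead of silently producing wrong results.
    static_assert((1 << RADIX_BITS) % WARP_THREADS == 0,
                  "BlockRadixRankMatchEarlyCounts requires the number of radix "
                  "bins (1 << RADIX_BITS) to be a multiple of the warp size.");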

  • Fix or hide `BlockRadixRank` and friends

    Currently, we have a set of block radix rank facilities:

    • BlockRadixRank
    • BlockRadixRankMatch
    • BlockRadixRankMatchEarlyCounts

    There's also an enum, BlockScanAlgorithm, that describes the differences between these algorithms. Unlike the rest of the CUB facilities, the BlockRadixRank* classes don't follow the common pattern of accepting BlockScanAlgorithm as a template parameter. Ideally, there would be only BlockRadixRank, which would take BlockScanAlgorithm as a template parameter to specialise for a particular case. Unfortunately, the implementations have different sets of template parameters, and some work has to be done before this is possible. We have two options:

    1. Break API and make BlockRadixRank a high-level entry point that accepts BlockScanAlgorithm.
    2. Hide all BlockRadixRank* algorithms as implementation details for BlockRadixSort.

    Since there are people using BlockRadixRank separately, the former is preferable.
