libcu++: The C++ Standard Library for Your Entire System

Examples | Godbolt | Documentation

libcu++, the NVIDIA C++ Standard Library, is the C++ Standard Library for your entire system. It provides a heterogeneous implementation of the C++ Standard Library that can be used in and between CPU and GPU code.

If you know how to use your C++ Standard Library, then you know how to use libcu++. All you have to do is add cuda/std/ to the start of your Standard Library includes and cuda:: before any uses of std::. For example:

#include <cuda/std/atomic>
cuda::std::atomic<int> x;

The NVIDIA C++ Standard Library is an open source project; it is available on GitHub and included in the NVIDIA HPC SDK and CUDA Toolkit. If you have one of those SDKs installed, no additional installation or compiler flags are needed to use libcu++.

cuda:: and cuda::std::

When used with NVCC, NVIDIA C++ Standard Library facilities live in their own header hierarchy and namespace with the same structure as, but distinct from, the host compiler's Standard Library:

  • std:: / <*>: When using NVCC, this is your host compiler's Standard Library, which works in __host__ code only, although you can use the --expt-relaxed-constexpr flag to use any constexpr functions in __device__ code. With NVCC, libcu++ does not replace or interfere with the host compiler's Standard Library.
  • cuda::std:: / <cuda/std/*>: Strictly conforming implementations of facilities from the Standard Library that work in __host__ __device__ code.
  • cuda:: / <cuda/*>: Conforming extensions to the Standard Library that work in __host__ __device__ code.
  • cuda::device:: / <cuda/device/*>: Conforming extensions to the Standard Library that work only in __device__ code.

// Standard C++, __host__ only.
#include <atomic>
std::atomic<int> x;

// CUDA C++, __host__ __device__.
// Strictly conforming to the C++ Standard.
#include <cuda/std/atomic>
cuda::std::atomic<int> x;

// CUDA C++, __host__ __device__.
// Conforming extensions to the C++ Standard.
#include <cuda/atomic>
cuda::atomic<int, cuda::thread_scope_block> x;

libcu++ is Heterogeneous

The NVIDIA C++ Standard Library works across your entire codebase, both in and across host and device code. libcu++ is a C++ Standard Library for your entire system, not just your CPU or your GPU. Everything in cuda:: is __host__ __device__.

libcu++ facilities are designed to be passed between host and device code. Unless otherwise noted, any libcu++ object which is copyable or movable can be copied or moved between host and device code.
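
For example, here is a minimal sketch (illustrative, not taken from the documentation; it assumes a libcu++ version that ships cuda::std::pair in <cuda/std/utility>) of constructing an object in host code and passing it by value to a kernel:

// Illustrative only: a cuda::std::pair built in __host__ code is copied by
// value into __device__ code and used there unchanged.
#include <cuda/std/utility>
#include <cstdio>

__global__ void use_pair(cuda::std::pair<int, float> p) {
  printf("pair: %d %f\n", p.first, p.second);
}

int main() {
  cuda::std::pair<int, float> p{42, 3.14f}; // constructed on the host
  use_pair<<<1, 1>>>(p);                    // copied to the device
  cudaDeviceSynchronize();
  return 0;
}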

Synchronization objects work across host and device code, and can be used to synchronize between host and device threads. However, there are some restrictions to be aware of; please see the synchronization library section for more details.
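
For instance, a minimal sketch (illustrative only; it assumes a system that supports concurrent managed access, such as Linux on a Pascal or newer GPU) of a host thread waiting on a system-scoped cuda::atomic that a device thread updates:

// Illustrative only: a cuda::atomic with thread_scope_system placed in managed
// memory is stored to by a device thread and polled by the host thread.
#include <cuda/atomic>
#include <cstdio>
#include <new>

using flag_t = cuda::atomic<int, cuda::thread_scope_system>;

__global__ void notify_host(flag_t* flag) {
  flag->store(1, cuda::std::memory_order_release); // publish to the host thread
}

int main() {
  flag_t* flag;
  cudaMallocManaged(&flag, sizeof(flag_t));
  new (flag) flag_t(0);

  notify_host<<<1, 1>>>(flag);

  while (flag->load(cuda::std::memory_order_acquire) != 1) { } // wait for the device
  std::printf("device thread signalled\n");

  cudaFree(flag);
  return 0;
}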

cuda::device::

A small number of libcu++ facilities only work in device code, usually because there is no sensible implementation in host code.

Such facilities live in cuda::device::.

libcu++ is Incremental

Today, the NVIDIA C++ Standard Library delivers a high-priority subset of the C++ Standard Library, and each release increases the feature set. But it is a subset; not everything is available today. The Standard API section lists the facilities available and the releases in which they were first introduced.

Licensing

The NVIDIA C++ Standard Library is an open source project developed on GitHub. It is NVIDIA's variant of LLVM's libc++. libcu++ is distributed under the Apache License v2.0 with LLVM Exceptions.

Conformance

The NVIDIA C++ Standard Library aims to be a conforming implementation of the C++ Standard, ISO/IEC IS 14882, Clauses 16 through 32.

ABI Evolution

The NVIDIA C++ Standard Library does not maintain long-term ABI stability. Promising long-term ABI stability would prevent us from fixing mistakes and providing best-in-class performance, so we make no such promises.

The ABI will be broken in every major CUDA Toolkit release. The life cycle of an ABI version is approximately one year, and long-term support for an ABI version ends after approximately two years. Please see the versioning section for more details.

We recommend that you always recompile your code and dependencies with the latest NVIDIA SDKs and use the latest NVIDIA C++ Standard Library ABI. Live at head.

Comments
  • Add atomics for floating point types.

    This PR is a draft to add support for float/double atomics.

    Please review and let me know what is missing. Unfortunately, the diff between the old and new codegen output is a mess due to the reordering of operations.

    Also rolls back #282 and fixes #279

  • Alignment requirements of cuda::std::complex

    The following two static_asserts compile without issues:

    #include <cuda/std/complex>
    static_assert(alignof(cuda::std::complex<double>) == 8);
    static_assert(alignof(cuda::std::complex<float>) == 4);
    

    I'd expected them to be 16 and 8 to match the double2 and float2 types.

  • Redistributable

    I'm interested in supporting this library on my own compiler. Do you provide a binary redistributable that covers the non-header portions of the library? Would that be hiding in the hpc toolkit somewhere? Basically I'm looking for guidance on deployment.

  • Backport C++17 type_traits and cuda::std::byte to C++14

    Broken off from https://github.com/NVIDIA/libcudacxx/pull/10

    • [x] Backports <type_traits> features from C++17 to make them available in C++14

    • [x] Backports tests of type traits to make them supported in C++14

    • [x] Backports cuda::std::byte to be available in C++14 (this was necessary to make some tests pass as a result of the <type_traits> changes)

  • `cuda::atomic_ref<float>::fetch_min` and `fetch_max` incorrect results on values of different signs

    It looks like cuda::atomic_ref<T>::fetch_min for floating point types of T is not generating correct code. For these types a CAS loop implementation should be used, but as this Godbolt PTX shows, an unsigned integer atom.max is being generated. https://godbolt.org/z/sWeezx1a1

    The problem is in the conversion to unsigned: negative and positive floats reinterpreted as unsigned integers will not compare in the same order.
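
    A CAS-loop implementation that compares in floating-point order would look roughly like the sketch below (illustrative only, not libcu++'s actual code; fetch_min_float is a hypothetical helper):

    // Sketch: compare as float, and only reinterpret the bits for the CAS itself.
    __device__ float fetch_min_float(float* address, float val) {
      unsigned int* as_uint = reinterpret_cast<unsigned int*>(address);
      unsigned int old = *as_uint;
      unsigned int assumed;
      do {
        assumed = old;
        float current = __uint_as_float(assumed);
        if (!(val < current)) break; // current is already <= val (or a NaN is involved)
        old = atomicCAS(as_uint, assumed, __float_as_uint(val));
      } while (old != assumed);
      return __uint_as_float(old);
    }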

    The code in the Godbolt does a max of positive values with the initial max set to a negative value. It also does a min of negative values with the initial min set to a positive value. Both result in incorrect results.

    The correct output of the program should be

    0: Min: 0 Max: 511
    1: Min: 512 Max: 1023
    2: Min: 1024 Max: 1535
    3: Min: 1536 Max: 2047
    4: Min: 2048 Max: 2559
    5: Min: 2560 Max: 3071
    6: Min: 3072 Max: 3583
    7: Min: 3584 Max: 4095
    

    Instead, the program prints

    0: Min: 3.40282e+38 Max: -1
    1: Min: 3.40282e+38 Max: -1
    2: Min: 3.40282e+38 Max: -1
    3: Min: 3.40282e+38 Max: -1
    4: Min: 3.40282e+38 Max: -1
    5: Min: 3.40282e+38 Max: -1
    6: Min: 3.40282e+38 Max: -1
    7: Min: 3.40282e+38 Max: -1
    

    Tested on CUDA 11.5 on a Tesla V100 (DGX system).

    Discovered when trying to convert cuSpatial to use libcudacxx atomic_ref rather than a custom implementation which uses atomicCAS. RAPIDS would like to eliminate its custom atomic operation implementations, but this bug needs to be fixed first.

    CC @jrhemstad

  • Fix buggy numerics of tanh(complex) at inf

    Because:

        lim[x -> +inf] tanh(x + iy) = 1
        lim[x -> -inf] tanh(x + iy) = -1

    Test:

    #include <complex>
    #include <cuda/std/complex>
    #include <iostream>
    #include <limits>
    
    constexpr float inf = std::numeric_limits<float>::infinity();
    
    int main() {
        float values[] = {inf, 1, 0, -1, -inf};
        for (float r : values) {
            for (float i : values) {
                std::complex<float> s = {r, i};
                cuda::std::complex<float> c = {r, i};
    
                auto ts = std::tanh(s);
                auto tc = cuda::std::tanh(c);
    
                std::cout << "input: (" << r << ", " << i << ")" << std::endl;
                std::cout << "std: (" << ts.real() << ", " << ts.imag() << ")" << std::endl;
                std::cout << "cuda::std: (" << tc.real() << ", " << tc.imag() << ")" << std::endl;
                std::cout << std::endl;
            }
        }
    }
    

    Before:

    input: (inf, inf)
    std: (1, 0)
    cuda::std: (1, 0)
    
    input: (inf, 1)
    std: (1, 0)
    cuda::std: (1, 0)
    
    input: (inf, 0)
    std: (1, 0)
    cuda::std: (1, 0)
    
    input: (inf, -1)
    std: (1, -0)
    cuda::std: (1, -0)
    
    input: (inf, -inf)
    std: (1, -0)
    cuda::std: (1, 0)
    
    input: (1, inf)
    std: (nan, nan)
    cuda::std: (-nan, -nan)
    
    input: (1, 1)
    std: (1.08392, 0.271753)
    cuda::std: (1.08392, 0.271753)
    
    input: (1, 0)
    std: (0.761594, 0)
    cuda::std: (0.761594, 0)
    
    input: (1, -1)
    std: (1.08392, -0.271753)
    cuda::std: (1.08392, -0.271753)
    
    input: (1, -inf)
    std: (nan, nan)
    cuda::std: (-nan, -nan)
    
    input: (0, inf)
    std: (0, nan)
    cuda::std: (-nan, -nan)
    
    input: (0, 1)
    std: (0, 1.55741)
    cuda::std: (0, 1.55741)
    
    input: (0, 0)
    std: (0, 0)
    cuda::std: (0, 0)
    
    input: (0, -1)
    std: (0, -1.55741)
    cuda::std: (0, -1.55741)
    
    input: (0, -inf)
    std: (0, nan)
    cuda::std: (-nan, -nan)
    
    input: (-1, inf)
    std: (nan, nan)
    cuda::std: (-nan, -nan)
    
    input: (-1, 1)
    std: (-1.08392, 0.271753)
    cuda::std: (-1.08392, 0.271753)
    
    input: (-1, 0)
    std: (-0.761594, 0)
    cuda::std: (-0.761594, 0)
    
    input: (-1, -1)
    std: (-1.08392, -0.271753)
    cuda::std: (-1.08392, -0.271753)
    
    input: (-1, -inf)
    std: (nan, nan)
    cuda::std: (-nan, -nan)
    
    input: (-inf, inf)
    std: (-1, 0)
    cuda::std: (1, 0)
    
    input: (-inf, 1)
    std: (-1, 0)
    cuda::std: (1, 0)
    
    input: (-inf, 0)
    std: (-1, 0)
    cuda::std: (1, 0)
    
    input: (-inf, -1)
    std: (-1, -0)
    cuda::std: (1, -0)
    
    input: (-inf, -inf)
    std: (-1, -0)
    cuda::std: (1, 0)
    

    After:

    input: (inf, inf)
    std: (1, 0)
    cuda::std: (1, 0)
    
    input: (inf, 1)
    std: (1, 0)
    cuda::std: (1, 0)
    
    input: (inf, 0)
    std: (1, 0)
    cuda::std: (1, 0)
    
    input: (inf, -1)
    std: (1, -0)
    cuda::std: (1, -0)
    
    input: (inf, -inf)
    std: (1, -0)
    cuda::std: (1, 0)
    
    input: (1, inf)
    std: (nan, nan)
    cuda::std: (-nan, -nan)
    
    input: (1, 1)
    std: (1.08392, 0.271753)
    cuda::std: (1.08392, 0.271753)
    
    input: (1, 0)
    std: (0.761594, 0)
    cuda::std: (0.761594, 0)
    
    input: (1, -1)
    std: (1.08392, -0.271753)
    cuda::std: (1.08392, -0.271753)
    
    input: (1, -inf)
    std: (nan, nan)
    cuda::std: (-nan, -nan)
    
    input: (0, inf)
    std: (0, nan)
    cuda::std: (-nan, -nan)
    
    input: (0, 1)
    std: (0, 1.55741)
    cuda::std: (0, 1.55741)
    
    input: (0, 0)
    std: (0, 0)
    cuda::std: (0, 0)
    
    input: (0, -1)
    std: (0, -1.55741)
    cuda::std: (0, -1.55741)
    
    input: (0, -inf)
    std: (0, nan)
    cuda::std: (-nan, -nan)
    
    input: (-1, inf)
    std: (nan, nan)
    cuda::std: (-nan, -nan)
    
    input: (-1, 1)
    std: (-1.08392, 0.271753)
    cuda::std: (-1.08392, 0.271753)
    
    input: (-1, 0)
    std: (-0.761594, 0)
    cuda::std: (-0.761594, 0)
    
    input: (-1, -1)
    std: (-1.08392, -0.271753)
    cuda::std: (-1.08392, -0.271753)
    
    input: (-1, -inf)
    std: (nan, nan)
    cuda::std: (-nan, -nan)
    
    input: (-inf, inf)
    std: (-1, 0)
    cuda::std: (-1, 0)
    
    input: (-inf, 1)
    std: (-1, 0)
    cuda::std: (-1, 0)
    
    input: (-inf, 0)
    std: (-1, 0)
    cuda::std: (-1, 0)
    
    input: (-inf, -1)
    std: (-1, -0)
    cuda::std: (-1, -0)
    
    input: (-inf, -inf)
    std: (-1, -0)
    cuda::std: (-1, 0)
    

    Thanks a lot. cc @mruberry @ngimel for the discussion.

  • Revamp samples and benchmarks

    • [x] Created benchmarks/ to hold libcu++ benchmarks
    • [x] Moved benchmark.cu/.cpp from samples/ to benchmarks (renamed to concurrency_host/device, but this can be changed)
    • [x] Overhauled the old CMakeLists.txt for building the benchmarks to use more modern CMake
    • [x] Added a CMakeLists.txt to samples/ to build the existing samples
  • Add clang-11 docker configurations

    Based on nvbug https://nvbugs/200700358

    Added the required Docker configuration for the build with Clang-11.

    The image is based on Ubuntu 20.04. In addition, we define DEBIAN_FRONTEND and TZ in the Docker layer.

  • Fix <tuple> on MSVC

    decltype is a culprit for a slew of MSVC bugs. About 50 failures have been fixed by hacking the __tuple_sfinae_base trait.

    Tests that still need to be addressed:

    Failing Tests (8):
        libcu++ :: std/utilities/tuple/tuple.tuple/tuple.apply/apply_extended_types.pass.cpp
        libcu++ :: std/utilities/tuple/tuple.tuple/tuple.cnstr/PR27684_contains_ref_to_incomplete_type.pass.cpp
        libcu++ :: std/utilities/tuple/tuple.tuple/tuple.cnstr/UTypes.pass.cpp
        libcu++ :: std/utilities/tuple/tuple.tuple/tuple.cnstr/alloc.pass.cpp
        libcu++ :: std/utilities/tuple/tuple.tuple/tuple.cnstr/deduct.pass.cpp
        libcu++ :: std/utilities/tuple/tuple.tuple/tuple.cnstr/nothrow_cnstr.pass.cpp
        libcu++ :: std/utilities/tuple/tuple.tuple/tuple.cnstr/test_lazy_sfinae.pass.cpp
        libcu++ :: std/utilities/tuple/tuple.tuple/tuple.helper/tuple_size_structured_bindings.pass.cpp
    
      Expected Passes    : 70
      Unexpected Failures: 8
    
  • Add extended API documentation

    • [x] Do the required Jekyll changes
    • [x] Fix relative links
    • [x] Document Group concept & aligned_size_t
    • [x] Add missing examples
    • [x] Verify that all examples are compiling
  • Backports C++17/20 <chrono> features to C++14

    Depends on https://github.com/NVIDIA/libcudacxx/pull/44

    • [x] Moves <chrono> features previously guarded behind C++17/20 down to C++14

    • [x] Adds missing _LIBCUDACXX_INLINE_VISIBILITY functions in <chrono> to ensure they are annotated with __host__ __device__

    • [x] Backports libc++ std/utilities/time tests that were previously only tested in C++17/20 to be tested with C++14

    • [x] Added explicit casts to operator- for weekday to workaround an NVCC bug

    • [x] ~Removed the _LIBCUDACXX_HAS_NO_CXX20_CHRONO_LITERALS definition since the literals were backported~ This turned out to not be possible because compilers warn on the "user" defined literal.

    Porting the tests required a few changes/workarounds to upstream tests:

    • [x] Updated tests to build with --expt-relaxed-constexpr to allow accessing constexpr globals from <chrono> in device code

    • [x] nvcc does not support ODR-use of constexpr global variables (e.g., cuda::std::chrono::March). To work around this, in several tests I created a constexpr local copy of the global, e.g.,

    __device__ void odr_use_month(cuda::std::chrono::month const& m);
    
    __device__ void foo(){
       // odr_use_month(cuda::std::chrono::March); // This won't work
       auto constexpr March = cuda::std::chrono::March;
       odr_use_month(March); // this works
    }
    
    • [x] Rename operator[].pass.cpp to operator_index.pass.cpp because nvcc doesn't like square brackets in a file name

    • [x] Marked the following tests as XFAIL with gcc below 7.0 due to a gcc bug:

      • time/time.cal/time.cal.ymdlast/time.cal.ymdlast.members/plus_minus_equal_month.pass.cpp
      • time/time.cal/time.cal.ymdlast/time.cal.ymdlast.members/plus_minus_equal_year.pass.cpp
      • time/time.cal/time.cal.ymwd/time.cal.ymwd.members/plus_minus_equal_month.pass.cpp
      • time/time.cal/time.cal.ymwd/time.cal.ymwd.members/plus_minus_equal_year.pass.cpp
  • Fix structured binding support

    Currently, structured bindings for cuda::std::tuple and cuda::std::array are broken.

    The reason is that the Standard requires the specializations of tuple_size and tuple_element to reside in namespace std, whereas our specializations resided in namespace cuda::std.

    Work around that by pulling those specializations into namespace std too.
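
    The mechanism can be illustrated with a small standalone sketch (hypothetical type, not libcu++ code): structured bindings look up std::tuple_size and std::tuple_element, so the specializations have to live in namespace std rather than in the type's own namespace.

    #include <cstddef>
    #include <tuple>
    #include <type_traits>

    namespace mylib {
      struct point { int x; int y; };
      // ADL-found get<I>, used by the structured-binding protocol.
      template <std::size_t I>
      int get(point const& p) { return I == 0 ? p.x : p.y; }
    }

    // These specializations must live in namespace std for `auto [x, y] = p;` to work.
    namespace std {
      template <>
      struct tuple_size<mylib::point> : integral_constant<size_t, 2> {};
      template <size_t I>
      struct tuple_element<I, mylib::point> { using type = int; };
    }

    int main() {
      mylib::point p{1, 2};
      auto [x, y] = p; // finds std::tuple_size<mylib::point>, then get<I>
      return x + y - 3;
    }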

    Fixes CUDA Tuple Structured Binding Declaration Broken #316

  • CUDA Tuple Structured Binding Declaration Broken

    I'm running into issues where cuda::std::tuple does not seem to support structured binding declarations. Is this a feature that should work but is broken? Is this unsupported for cuda::std::tuple specifically?

    Note that the normal std::tuple supports this C++17 feature, even in device code. If I comment out the cuda_tuple_kernel and its calls in the main() function, the code compiles without issues.

    Code:

    #include <cuda/std/tuple>
    #include <tuple>
    
    #define CHECK_CUDA(cmd)                                  \
      do {                                                   \
        cudaError_t res = (cmd);                             \
        if (res != cudaSuccess) {                            \
          fprintf(stderr, "CUDA: %s = %d (%s)\n", #cmd, res, \
                  cudaGetErrorString(res));                  \
          abort();                                           \
        }                                                    \
      } while (0)
    
    // This works
    __global__ void std_tuple_kernel()
    {
      std::tuple<bool, float> my_tup = std::make_tuple(true, 1.0f);
    
      printf("Direct access: %d, %f\n", std::get<0>(my_tup), std::get<1>(my_tup));
    
      auto [first, second] = my_tup;
    
      printf("Structured binding: %d, %f\n", first, second);
    }
    
    // This fails to compile
    __global__ void cuda_tuple_kernel()
    {
    
      cuda::std::tuple<bool, float> my_tup = cuda::std::make_tuple(true, 1.0f);
    
      printf("Direct access: %d, %f\n", cuda::std::get<0>(my_tup), cuda::std::get<1>(my_tup));
    
      auto [first, second] = my_tup;
    
      printf("Structured binding: %d, %f\n", first, second);
    }
    
    int main()
    {
    
      std_tuple_kernel<<<1, 1>>>();
      CHECK_CUDA(cudaDeviceSynchronize());
    
      cuda_tuple_kernel<<<1, 1>>>();
      CHECK_CUDA(cudaDeviceSynchronize());
    }
    

    Compile command:

    nvcc tuple_issues.cu --expt-relaxed-constexpr -std=c++17 -gencode=arch=compute_80,code=compute_80
    

    Compile error:

    tuple_issues.cu(22): error: cannot bind to non-public member "cuda::std::__4::tuple<_Tp...>::__base_ [with _Tp=<__nv_bool, float>]"
    

    System info: GPU: A100, nvcc: 11.7.64, g++: 9.4.0, OS: Ubuntu 20 LTS.

  • cuda::std::complex division is slower than expected

    @jrhemstad edit: This was originally from the Thrust repo about thrust::complex. I pointed @NickKarpowicz at cuda::std::complex as it will eventually replace thrust::complex. @NickKarpowicz reported that cuda::std::complex was even slower than thrust::complex :upside_down_face:

    Hi, I noticed that the division of a (real) double precision number by a thrust::complex isn't as fast as it could be. Maybe this is a case of the compiler not optimizing something it should, but there is an easy workaround. I posted about this on the NVIDIA forum, and they suggested I create an issue here.

    It seems to do the operation "literally": first turn the double into a complex number, then divide complex/complex. This can be done more efficiently, in a way that saves a division. I pasted a simple program below that I made to isolate and test this.

    Long story short: If I write a function by hand to do the operation without the additional divide, kernel calls just doing this division ~64 million times average 6.22 ms according to the profiler. Doing it with Thrust’s division operator, they take 10.97 ms on average, on a 2080 Super. On a 3060, the numbers are similar: 13.23 ms vs. 23.46 ms. This is compiling on Windows in Visual Studio 2022, CUDA 11.7.

    So one can simply overload the / operator for a bit of a speedup, as:

    __device__ thrust::complex<double> operator/(double a, thrust::complex<double> b) {
    		double divByDenominator = a / (b.real() * b.real() + b.imag() * b.imag());
    		return thrust::complex<double>(b.real() * divByDenominator, -b.imag() * divByDenominator);
    }
    

    I tried this with floats instead of doubles, and it doesn’t seem to matter there. If it turns out to be the case in general and not just a me thing, maybe it’s worth putting something like that explicitly in the library, or maybe the compiler is just currently missing something it shouldn’t? I'm not sure if this is a compiler issue or thrust issue, but one can most easily work around it when interacting with thrust so I put it here...

    The code I used for testing is here:

    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"
    #include <stdio.h>
    #include <thrust/complex.h>
    
    #define TESTSIZE 64*1048576
    #define THREADS_PER_BLOCK 128
    #define NLAUNCHES 5
    
    //divide the arrays using thrust standard operator
    __global__ void divideWithThrust(double* x, thrust::complex<double>* y, thrust::complex<double>* z) {
    	unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    	z[i] = x[i] / y[i];
    }
    
    //divide the arrays by hand
    __global__ void divideDZ(double* x, thrust::complex<double>* y, thrust::complex<double>* z) {
    	unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    	double divByDenominator = x[i] / (y[i].real() * y[i].real() + y[i].imag() * y[i].imag());
    	z[i] = thrust::complex<double>(y[i].real() * divByDenominator, -y[i].imag() * divByDenominator);
    }
    
    //divide the arrays by explicitly turning the double into a complex double
    __global__ void divideDZupcast(double* x, thrust::complex<double>* y, thrust::complex<double>* z) {
    	unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    	z[i] = thrust::complex<double>(x[i], 0) / y[i];
    }
    
    //float math for comparison
    __global__ void divideWithThrustFloat(float* x, thrust::complex<float>* y, thrust::complex<float>* z) {
    	unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    	z[i] = x[i] / y[i];
    }
    
    //float by hand for comparison
    __global__ void divideFC(float* x, thrust::complex<float>* y, thrust::complex<float>* z) {
    	unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    	float divByDenominator = x[i] / (y[i].real() * y[i].real() + y[i].imag() * y[i].imag());
    	z[i] = thrust::complex<float>(y[i].real() * divByDenominator, -y[i].imag() * divByDenominator);
    }
    
    //fill arrays
    __global__ void initArrays(double* x, thrust::complex<double>* y) {
    	unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    	x[i] = sin(0.1 * i);
    	y[i] = thrust::complex<double>(cos(0.2 * i), sin(0.5 * i));
    }
    __global__ void initArraysFloat(float* x, thrust::complex<float>* y) {
    	unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    	x[i] = sin(0.1 * i);
    	y[i] = thrust::complex<float>(cos(0.2 * i), sin(0.5 * i));
    }
    
    
    int main()
    {
    	//first check with doubles
    	double *x;
    	thrust::complex<double> *y, *z;
    	cudaMalloc(&x, TESTSIZE * sizeof(double));
    	cudaMalloc(&y, TESTSIZE * sizeof(thrust::complex<double>));
    	cudaMalloc(&z, TESTSIZE * sizeof(thrust::complex<double>));
    
    	//divide by hand
    	initArrays<<<TESTSIZE / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(x, y);
    	for (int i = 0; i < NLAUNCHES; i++) {
    		divideDZ<<<TESTSIZE / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(x, y, z);
    	}
    
    	//divide with thrust
    	initArrays<<<TESTSIZE / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(x, y);
    	for (int i = 0; i < NLAUNCHES; i++) {
    		divideWithThrust<<<TESTSIZE / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(x, y, z);
    	}
    
    	//divide by turning double into complex explicitly
    	initArrays<<<TESTSIZE / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(x, y);
    	for (int i = 0; i < NLAUNCHES; i++) {
    		divideDZupcast<<<TESTSIZE / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(x, y, z);
    	}
    
    	cudaFree(x);
    	cudaFree(y);
    	cudaFree(z);
    
    
    	//compare float division
    	float *xf;
    	thrust::complex<float> *yf, * zf;
    	cudaMalloc(&xf, TESTSIZE * sizeof(float));
    	cudaMalloc(&yf, TESTSIZE * sizeof(thrust::complex<float>));
    	cudaMalloc(&zf, TESTSIZE * sizeof(thrust::complex<float>));
    
    	initArraysFloat<<<TESTSIZE / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(xf, yf);
    	for (int i = 0; i < NLAUNCHES; i++) {
    		divideFC<<<TESTSIZE / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(xf, yf, zf);
    	}
    
    	initArraysFloat<<<TESTSIZE / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(xf, yf);
    	for (int i = 0; i < NLAUNCHES; i++) {
    		divideWithThrustFloat<<<TESTSIZE / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(xf, yf, zf);
    	}
    
    	cudaFree(xf);
    	cudaFree(yf);
    	cudaFree(zf);
    
    	return 0;
    }
    
  • Port `std::span` and enable it for C++11 onwards to support mdspan

    In addition to porting the span implementation from libc++, the PR does the following:

    • Remove outdated support for tuple interface
    • Remove outdated support for const_iterator
    • Change index_type to size_type

    I intentionally did not adopt the ranges support, as that is out of scope.

  • Implement concept emulation for C++14 / C++17

    This also ports all the standard concepts from libc++, preserving the new modular organization.

    This will be a base for further development, such as memory_resource or, in the far future, ranges.

  • Implement `{async_}resource_ref`

    This implements the NVIDIA internal proposal for an any_resource_ref abstraction that facilitates the handling of memory allocations on the host and device side.

    To facilitate adoption, the proposal is backported to C++14 / C++17 through a concept emulation designed by @ericniebler. Unfortunately, C++11 is too far a stretch due to missing support for constexpr variables.

    Currently, only the plain any_resource_ref is implemented, as async_resource_ref is more or less a straightforward extension.

    Things still to do:

    • [x] Implement async_resource_ref
    • [x] Implement equality_comparable concept
    • [ ] Write some documentation