Multi-backend implementation of SYCL for CPUs and GPUs


hipSYCL - a SYCL implementation for CPUs and GPUs

hipSYCL is a modern SYCL implementation targeting CPUs and GPUs, with a focus on leveraging existing toolchains such as CUDA or HIP. hipSYCL currently targets the following devices:

  • Any CPU via OpenMP
  • NVIDIA GPUs via CUDA
  • AMD GPUs via HIP/ROCm
  • Intel GPUs via oneAPI Level Zero and SPIR-V (highly experimental and WIP!)

hipSYCL supports compiling source files into a single binary that can run on all these backends when building against appropriate clang distributions. More information about the compilation flow can be found here.

The runtime architecture of hipSYCL consists of the main library hipSYCL-rt, as well as independent, modular plugin libraries for the individual backends.

hipSYCL's compilation and runtime design allows hipSYCL to effectively aggregate multiple toolchains that are otherwise incompatible, making them accessible with a single SYCL interface.

The philosophy behind hipSYCL is to leverage such existing toolchains as much as possible. This brings not only maintenance and stability advantages, but by design enables performance on par with those established toolchains, and also allows for maximum interoperability with existing compute platforms. For example, the hipSYCL CUDA and ROCm backends rely on the clang CUDA/HIP frontends, which hipSYCL augments to also understand SYCL code. This means that the hipSYCL compiler can compile not only SYCL code but also CUDA/HIP code, even when both are mixed in the same source file, making all CUDA/HIP features - such as the latest device intrinsics - available from SYCL code as well (details). Additionally, vendor-optimized template libraries such as rocPRIM or CUB can also be used with hipSYCL. Consequently, hipSYCL allows for highly optimized code paths in SYCL code for specific devices.
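To make this concrete, the following is a minimal sketch of such a device-specific code path inside an otherwise portable SYCL kernel. It assumes the __hipsycl_if_target_* backend guards (one of which, __hipsycl_if_target_hip, appears verbatim in an issue quoted further below) and the CUDA __mul24 intrinsic; treat it as illustrative rather than as canonical hipSYCL usage.

    #include <CL/sycl.hpp>

    int main()
    {
        namespace s = cl::sycl;
        s::queue q;
        int* data = s::malloc_shared<int>(256, q);

        q.parallel_for(s::range<1>{256}, [=](s::id<1> i)
        {
            int value = static_cast<int>(i[0]);
            // Portable SYCL code path, used on every backend.
            data[i[0]] = 2 * value;
            // Backend-specific fast path, only emitted when compiling for CUDA devices
            // (assumed hipSYCL backend guard macro).
            __hipsycl_if_target_cuda(data[i[0]] = __mul24(value, 2));
        }).wait();

        s::free(data, q);
    }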

Because a SYCL program compiled with hipSYCL looks just like any other CUDA or HIP program to vendor-provided software, vendor tools such as profilers or debuggers also work well with hipSYCL.

An accompanying image illustrates how hipSYCL fits into the wider SYCL implementation ecosystem.

About the project

While hipSYCL started its life as a hobby project, development is now led and funded by Heidelberg University. hipSYCL not only serves as a research platform, but is also a solution used in production on machines of all scales, including some of the most powerful supercomputers.

Contributing to hipSYCL

We encourage contributions and are looking forward to your pull request! Please have a look at CONTRIBUTING.md. If you need any guidance, please just open an issue and we will get back to you shortly.

If you are a student at Heidelberg University and wish to work on hipSYCL, please get in touch with us. There are various options possible and we are happy to include you in the project :-)

Citing hipSYCL

hipSYCL is a research project. As such, if you use hipSYCL in your research, we kindly request that you cite:

Aksel Alpay and Vincent Heuveline. 2020. SYCL beyond OpenCL: The architecture, current state and future direction of hipSYCL. In Proceedings of the International Workshop on OpenCL (IWOCL ’20). Association for Computing Machinery, New York, NY, USA, Article 8, 1. DOI:https://doi.org/10.1145/3388333.3388658

(This is a talk and is available online; note that some of its content is outdated by now.)

Acknowledgements

We gratefully acknowledge contributions from the community.

Performance

hipSYCL has been repeatedly shown to deliver very competitive performance compared to other SYCL implementations or proprietary solutions like CUDA. See for example:

  • Sohan Lal, Aksel Alpay, Philip Salzmann, Biagio Cosenza, Nicolai Stawinoga, Peter Thoman, Thomas Fahringer, and Vincent Heuveline. 2020. SYCL-Bench: A Versatile Single-Source Benchmark Suite for Heterogeneous Computing. In Proceedings of the International Workshop on OpenCL (IWOCL ’20). Association for Computing Machinery, New York, NY, USA, Article 10, 1. DOI:https://doi.org/10.1145/3388333.3388669
  • Brian Homerding and John Tramm. 2020. Evaluating the Performance of the hipSYCL Toolchain for HPC Kernels on NVIDIA V100 GPUs. In Proceedings of the International Workshop on OpenCL (IWOCL ’20). Association for Computing Machinery, New York, NY, USA, Article 16, 1–7. DOI:https://doi.org/10.1145/3388333.3388660
  • Tom Deakin and Simon McIntosh-Smith. 2020. Evaluating the performance of HPC-style SYCL applications. In Proceedings of the International Workshop on OpenCL (IWOCL ’20). Association for Computing Machinery, New York, NY, USA, Article 12, 1–11. DOI:https://doi.org/10.1145/3388333.3388643

Benchmarking hipSYCL

When targeting the CUDA or HIP backends, hipSYCL just massages the AST slightly to get clang -x cuda and clang -x hip to accept SYCL code. hipSYCL is not involved in the actual code generation. Therefore any significant deviation in kernel performance compared to clang-compiled CUDA or clang-compiled HIP is unexpected.

Consequently, if you compare hipSYCL to other LLVM-based compilers, make sure to build hipSYCL against the same LLVM version; otherwise you are effectively just comparing the performance of two different LLVM versions. This applies in particular to comparisons with clang CUDA or clang HIP.

Current state

hipSYCL is not yet a fully conformant SYCL implementation, although many SYCL programs already work with hipSYCL.

Hardware and operating system support

Supported hardware:

  • Any CPU for which a C++17 OpenMP compiler exists
  • NVIDIA CUDA GPUs. Note that clang, which hipSYCL relies on, may not always support the very latest CUDA version, which can in turn affect support for very new hardware. See the clang documentation for more details.
  • AMD GPUs that are supported by ROCm

Operating system support currently strongly focuses on Linux. On Mac, only the CPU backend is expected to work. Windows support with CPU and CUDA backends is experimental, see Using hipSYCL on Windows.

Installing and using hipSYCL

In order to compile software with hipSYCL, use syclcc, which automatically adds all required compiler arguments to the CUDA/HIP compiler. syclcc can be used like a regular compiler, i.e. you can use syclcc -o test test.cpp to compile your SYCL application test.cpp into an executable named test.

syclcc accepts both command line arguments and environment variables to configure its behavior (e.g., to select the target platform CUDA/ROCm/CPU to compile for). See syclcc --help for a comprehensive list of options.
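For illustration, a trivial application of the kind syclcc is meant to compile could look as follows (a minimal sketch in the style of the SYCL code quoted in the issues further below, not an example from the hipSYCL documentation):

    // test.cpp -- compile e.g. with: syclcc -o test test.cpp
    #include <CL/sycl.hpp>
    #include <iostream>
    #include <vector>

    namespace s = cl::sycl;

    int main()
    {
        constexpr size_t n = 1024;
        std::vector<int> a(n, 1), b(n, 2), c(n, 0);

        s::queue q;
        {
            s::buffer<int> buf_a(a.data(), s::range<1>(n));
            s::buffer<int> buf_b(b.data(), s::range<1>(n));
            s::buffer<int> buf_c(c.data(), s::range<1>(n));

            q.submit([&](s::handler& cgh)
            {
                auto A = buf_a.get_access<s::access::mode::read>(cgh);
                auto B = buf_b.get_access<s::access::mode::read>(cgh);
                auto C = buf_c.get_access<s::access::mode::write>(cgh);

                cgh.parallel_for<class vector_add>(s::range<1>(n), [=](s::id<1> i)
                {
                    C[i] = A[i] + B[i];
                });
            });
        } // buffer destruction writes the results back to the host vectors

        std::cout << "c[0] = " << c[0] << std::endl; // expected output: c[0] = 3
    }

When targeting a GPU backend, the same source is compiled by additionally passing one of the target architectures listed below (for example via --hipsycl-gpu-arch, as used in the issues quoted later on this page).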

When targeting a GPU, you will need to provide a target GPU architecture. The expected formats are defined by clang CUDA/HIP. Examples:

  • sm_52: NVIDIA Maxwell GPUs
  • sm_60: NVIDIA Pascal GPUs
  • sm_70: NVIDIA Volta GPUs
  • gfx900: AMD Vega 10 GPUs
  • gfx906: AMD Vega 20 GPUs

The full documentation of syclcc and hints for the CMake integration can be found in using hipSYCL.

Documentation

Comments
  • ROCm backend - build instructions

    ROCm backend - build instructions

    Hi,

    To build hipSYCL with the ROCm backend, the instructions say that the "amd-common" branch of llvm/clang/lld from AMD should be used (Link). But currently this is an llvm/clang/lld version 10 (Link).

    And according to the "CMakeLists.txt" of hipSYCL it supports only llvm/clang/lld up to version 9.

    Shouldn't it be the "roc-ocl-2.7x" branch instead of "amd-common"?

  • Fix cmake config file generation to use correct function

    Fix cmake config file generation to use correct function

    Prior to this change, the codebase used a plain configure_file call to generate the package config file, which renders the installed files non-relocatable. This commit switches from configure_file to the configure_package_config_file command.

    Apart from the major change mentioned above, this commit also makes a couple of additional changes:

    • Adds the OpenMP flags to the interface link options of the hipSYCL-rt target so that they are passed to any application consuming the CMake target. This was required on Arch Linux, where the examples otherwise fail to build with undefined reference errors.
    • Fixes the include paths passed to the hipSYCL-rt target.
  • MacOS CMake issues

    MacOS CMake issues

    Prompted by https://github.com/illuhad/hipSYCL/issues/222#issuecomment-601104011, I tried to build on Mac but it seems to ignore my explicit specification of Clang location and instead finds /usr/bin/clang++.

    It is possible that the issue here is that the Homebrew installs of LLVM don't have the required components, but CMake is not doing a proper job detecting that.

    jrhammon-mac02:build jrhammon$ cmake .. -DCMAKE_INSTALL_PREFIX=/opt/hipsycl -DCMAKE_CXX_COMPILER=/usr/local/Cellar/llvm/9.0.1/bin/clang++ -DCMAKE_C_COMPILER=/usr/local/Cellar/llvm/9.0.1/bin/clang -DDISABLE_LLVM_VERSION_CHECK=ON
    -- The C compiler identification is Clang 9.0.1
    -- The CXX compiler identification is Clang 9.0.1
    -- Check for working C compiler: /usr/local/Cellar/llvm/9.0.1/bin/clang
    -- Check for working C compiler: /usr/local/Cellar/llvm/9.0.1/bin/clang -- works
    -- Detecting C compiler ABI info
    -- Detecting C compiler ABI info - done
    -- Detecting C compile features
    -- Detecting C compile features - done
    -- Check for working CXX compiler: /usr/local/Cellar/llvm/9.0.1/bin/clang++
    -- Check for working CXX compiler: /usr/local/Cellar/llvm/9.0.1/bin/clang++ -- works
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    CMake Warning (dev) at CMakeLists.txt:17 (set):
      implicitly converting 'INTEGER' to 'STRING' type.
    This warning is for project developers.  Use -Wno-dev to suppress it.
    
    -- Could NOT find LLVM (missing: LLVM_DIR)
    -- Building hipSYCL against LLVM configured from LLVM_DIR-NOTFOUND
    -- Selecting clang: /usr/bin/clang++
    CMake Error at CMakeLists.txt:77 (message):
      clang include path CLANG_INCLUDE_PATH-NOTFOUND does not exist.  Please
      provide clang's internal include path manually.
    
    
    -- Using clang include directory: CLANG_INCLUDE_PATH-NOTFOUND
    -- Configuring incomplete, errors occurred!
    See also "/Users/jrhammon/Work/SYCL/hipSYCL/build/CMakeFiles/CMakeOutput.log".
    
  • atomics on CPU

    atomics on CPU

    I have the following simple atomic counter example that compiles and runs fine for CUDA, but doesn't compile on the CPU.

    #include <iostream>
    #include <CL/sycl.hpp>
    
    namespace s = cl::sycl;
    
    int main()
    {
        s::queue q;
    
        int counter = 0;
        {
            s::buffer<int> counter_buf(&counter, 1);
    
            q.submit([&](cl::sycl::handler& cgh)
            {
                auto access_counter = counter_buf.get_access<cl::sycl::access::mode::read_write>(cgh);
    
                cgh.parallel_for<class atomic_increment>(s::range<1>(1 << 30), [=] (cl::sycl::id<1> tid)
                {
                    s::atomic<int> atomic_counter { s::global_ptr<int> {&access_counter[0]} };
                    atomic_counter.fetch_add(1);
                });
            });
        }
    
        std::cout << "Counter: " << counter << std::endl;
    }
    
    /opt/hipSYCL/CUDA/bin/syclcc-clang --hipsycl-gpu-arch=sm_52 -O3 atomic-counter.cpp -o atomic-counter-cuda
    

    compiles and runs fine.

    /opt/hipSYCL/CUDA/bin/syclcc-clang --hipsycl-platform=cpu -g atomic-counter.cpp -o atomic-counter-cpu
    In file included from atomic-counter.cpp:3:
    In file included from /opt/hipSYCL/CUDA/bin/../include/CL/sycl.hpp:58:
    /opt/hipSYCL/CUDA/bin/../include/CL/sycl/atomic.hpp:103:12: error: use of undeclared identifier 'atomicAdd'
        return atomicAdd(_ptr, operand);
               ^
    atomic-counter.cpp:23:32: note: in instantiation of function template specialization 'cl::sycl::atomic<int, cl::sycl::access::address_space::global_space>::fetch_add<int, nullptr>' requested here
                    atomic_counter.fetch_add(1);
    

    I know atomics are not fully supported (they are listed as a limitation in README), but other issues suggest that only minor features of atomics are missing. Is the above error by design, or is it something that can be easily fixed?

  • windows compilation report

    windows compilation report "lld-link: error: undefined symbol: cuModuleGetFunction"

    After the failure, repeating 'ninja' gives:

    E:\hipSYCL-sycl-2020\build>ninja
    [1/1] Linking CXX shared library src\runtime\rt-backend-cuda.dll
    FAILED: src/runtime/rt-backend-cuda.dll src/runtime/rt-backend-cuda.lib
    cmd.exe /C "cd . && E:\hipSYCL-sycl-2020\LLVM\bin\clang++.exe -fuse-ld=lld-link -nostartfiles -nostdlib -O2 -g -DNDEBUG -Xclang -gcodeview -D_DLL -D_MT -Xclang --dependent-lib=msvcrt -shared -o src\runtime\rt-backend-cuda.dll -Xlinker /implib:src\runtime\rt-backend-cuda.lib -Xlinker /pdb:src\runtime\rt-backend-cuda.pdb -Xlinker /version:0.0 src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_event.cpp.obj src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_queue.cpp.obj src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_allocator.cpp.obj src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_device_manager.cpp.obj src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_hardware_manager.cpp.obj src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_backend.cpp.obj src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_module.cpp.obj src/runtime/hipSYCL-rt.lib E:/CUDA11/lib/x64/cudart_static.lib -lkernel32 -luser32 -lgdi32 -lwinspool -lshell32 -lole32 -loleaut32 -luuid -lcomdlg32 -ladvapi32 -loldnames && cd ."
    lld-link: error: undefined symbol: cuModuleGetFunction

    referenced by E:\hipSYCL-sycl-2020\src\runtime\cuda\cuda_queue.cpp:329 src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_queue.cpp.obj:(public: class hipsycl::rt::result __cdecl hipsycl::rt::cuda_queue::submit_kernel_from_module(class hipsycl::rt::cuda_module_manager &, class hipsycl::rt::cuda_module const &, class std::basic_string<char, struct std::char_traits, class std::allocator> const &, class hipsycl::rt::static_array<3> const &, class hipsycl::rt::static_array<3> const &, unsigned int, void **))

    lld-link: error: undefined symbol: cuLaunchKernel

    referenced by E:\hipSYCL-sycl-2020\src\runtime\cuda\cuda_queue.cpp:337 src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_queue.cpp.obj:(public: class hipsycl::rt::result __cdecl hipsycl::rt::cuda_queue::submit_kernel_from_module(class hipsycl::rt::cuda_module_manager &, class hipsycl::rt::cuda_module const &, class std::basic_string<char, struct std::char_traits, class std::allocator> const &, class hipsycl::rt::static_array<3> const &, class hipsycl::rt::static_array<3> const &, unsigned int, void **))

    lld-link: error: undefined symbol: cuModuleUnload

    referenced by E:\hipSYCL-sycl-2020\src\runtime\cuda\cuda_module.cpp:131 src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_module.cpp.obj:(public: __cdecl hipsycl::rt::cuda_module_manager::~cuda_module_manager(void)) referenced by E:\hipSYCL-sycl-2020\src\runtime\cuda\cuda_module.cpp:182 src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_module.cpp.obj:(public: class hipsycl::rt::result __cdecl hipsycl::rt::cuda_module_manager::load(class hipsycl::rt::device_id, class hipsycl::rt::cuda_module const &, struct CUmod_st *&))

    lld-link: error: undefined symbol: cuModuleLoadDataEx

    referenced by E:\hipSYCL-sycl-2020\src\runtime\cuda\cuda_module.cpp:192 src/runtime/CMakeFiles/rt-backend-cuda.dir/cuda/cuda_module.cpp.obj:(public: class hipsycl::rt::result __cdecl hipsycl::rt::cuda_module_manager::load(class hipsycl::rt::device_id, class hipsycl::rt::cuda_module const &, struct CUmod_st *&)) clang++: error: linker command failed with exit code 1 (use -v to see invocation) ninja: build stopped: subcommand failed.


    I followed the steps from https://github.com/illuhad/hipSYCL/wiki/Using-hipSYCL-on-Windows, using the prebuilt LLVM 11.1.0 and Boost 1.75 binaries and the hipSYCL "sycl/2020" branch. hipSYCL-rt.dll and rt-backend-omp.dll are generated successfully, but building the CUDA backend part fails. I tried CUDA 10.2 and 11, and both report the same errors. I cannot understand why lld-link reports "undefined symbol: cuModuleGetFunction" when linking against cudart_static.lib. I use the official CUDA Windows 10 x86_64 release downloaded from the NVIDIA website.

  • Problems with compilation on ubuntu 18.04 with rocm 2.3

    Problems with compilation on ubuntu 18.04 with rocm 2.3

    I tried to run cmake with default parameters; on my system it then compiles with gcc 7.3.0, but I get errors like error: no member named 'make_unique' in namespace 'std'. So I tried to switch to clang 6.0.0, which I have on my system (CC=clang CXX=clang++ cmake), but with the same result; I thought clang 6 defaults to C++14. So I tried CXXFLAGS+=-std=c++17 cmake, but then I get the error The platform rocm was explicitly chosen, but it is not available.. This was again with gcc 7.3.0, so I suppose gcc is not supported. I would expect your cmake config to try to choose clang as the C++ compiler, since it is listed as a dependency, and to add at least -std=c++14.

    Finally I tried CC=clang CXX=clang++ CXXFLAGS+=-std=c++17 cmake and it was successful (with warnings in many places: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]). But when I tried to compile a SYCL source with syclcc test.cpp --hipsycl-platform=amd --std=c++17 I got this error:

    warning: argument unused during compilation: '-L/usr/local/bin/../lib' [-Wunused-command-line-argument]
    ld: /opt/rocm/bin/../lib/libhip_hcc.so: undefined reference to symbol '[email protected]@ROCR_1'
    //opt/rocm/hsa/lib/libhsa-runtime64.so.1: error adding symbols: DSO missing from command line
    clang-9: error: linker command failed with exit code 1 (use -v to see invocation)
    

    This clang-9 comes from the ROCm hcc. So I am going to try setting the default C and C++ compilers on my system to the newest clang.

    So I tried it with clang-7. I again had to specify CXXFLAGS=-std=c++14 before cmake; this time I tried C++14 and hipSYCL compiles without warnings. But syclcc test.cpp --hipsycl-platform=rocm gives this strange error:

    warning: argument unused during compilation: '-L/usr/local/bin/../lib' [-Wunused-command-line-argument]
    syclcc fatal error: [Errno 2] No such file or directory: 'hipsycl_211bd330025b7952.cpp'
    

    I also tried the Singularity container. I see that it uses clang 6 and adds -std=c++14:

    + export CXX=clang++-6.0
    + cmake -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_CXX_FLAGS=-std=c++14 -DWITH_CPU_BACKEND=ON -DWITH_ROCM_BACKEND=ON ..
    -- The C compiler identification is GNU 5.4.0
    -- The CXX compiler identification is Clang 6.0.0
    

    But when compiling a SYCL program I get a similar error as before (with an additional Perl locale warning):

    [email protected]:~$ syclcc test.cpp --hipsycl-platform=rocm -std=c++14
    perl: warning: Setting locale failed.
    perl: warning: Please check that your locale settings:
    	LANGUAGE = "en_US:en",
    	LC_ALL = (unset),
    	LANG = "en_US.UTF-8"
        are supported and installed on your system.
    perl: warning: Falling back to the standard locale ("C").
    warning: argument unused during compilation: '-L/usr/bin/../lib' [-Wunused-command-line-argument]
    ld: /opt/rocm/bin/../lib/libhip_hcc.so: undefined reference to symbol '[email protected]@ROCR_1'
    //opt/rocm/hsa/lib/libhsa-runtime64.so.1: error adding symbols: DSO missing from command line
    clang-9: error: linker command failed with exit code 1 (use -v to see invocation)
    

    I have the latest ROCm 2.3. I have now tested a sample HIP program and an hcc program that calls the function hsa_system_major_extension_supported and links with -lhsa-runtime64, and everything seems to be working fine.

    But it seems that when I omit -lhsa-runtime64, I get a similar error as with syclcc:

    ld: /tmp/tmp.EcD56X0on0/main.host.o: undefined reference to symbol '[email protected]@ROCR_1'
    //opt/rocm/hsa/lib/libhsa-runtime64.so.1: error adding symbols: DSO missing from command line
    clang-9: error: linker command failed with exit code 1 (use -v to see invocation)
    

    And this finally worked from singularity container: syclcc test.cpp --hipsycl-platform=rocm -std=c++14 -lhsa-runtime64

  • Using multiple queues within OpenMP threads slows down SYCL

    Using multiple queues within OpenMP threads slows down SYCL

    Hi,

    I'm trying to optimize a piece of code by using multiple GPUs. To achieve this I first enumerate the devices on the host, create one queue for each device and then within a for loop access the "correct" queue by using omp_get_thread_num(). The problem is that using multiple queues slows down the entire computation by almost half. Using the same code with only one GPU (and thus one queue) gives the expected performance.

    Since the code is not mine to share I have tried to condense the relevant bits:

    const std::vector<sycl::device> devices = enumerate_devices();
    std::vector<sycl::queue> queues;
    for (auto device : devices) {
      queues.push_back(sycl::queue{device});
    }
    omp_set_num_threads(queues.size());
    #pragma omp parallel for
    for (...) {
      auto Q = queues[omp_get_thread_num()];
      // Each iteration will allocate quite a bit of memory
      // then run quite a few kernels
      // Lastly, free device memory again
    }
    

    I get correct results in both cases, but using multiple devices takes the same total amount of time as using a single device: kernel compute and memory handling take roughly twice as long as with a single queue.

    I'm wondering if this is a bad approach and whether there are any suggestions for using multiple GPUs with hipSYCL (I have looked at the multi-device queue, but memory handling seems more difficult there than with the approach above).

  • [SYCL2020] Group functions

    [SYCL2020] Group functions

    This PR adds a naive implementation for group_functions to hipSYCL. This includes implementations of

    • group_broadcast
    • group_barrier
    • group_{any,all,none}_of
    • group_reduce
    • group_{ex,in}clusive_scan

    on CPU and NVIDIA/AMD GPUs, as well as tests for these functions. I will provide optimized versions in later PRs. It also includes (group_)functions using two pointers (beginning/end), but these are not in the specification and are not meant to be used yet (as such they reside in the detail namespace).

    At the moment all tests pass (except for some problems with the pointer-based functions on CPU, which sometimes fail; I am still investigating). (private results)

    I would love to get some feedback. If you find some template parameters or formatting you don't like, there is a chance I missed them in one of my cleanup/refactoring attempts; just tell me so I can fix it. Some small changes, like splitting the tests into multiple files for faster compilation, might still be added here.
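    As a rough illustration of the intended interface (a sketch using SYCL 2020-style names; the exact signatures implemented by this PR may differ), group functions are called from within nd_range kernels on a group object obtained from the nd_item:

    #include <CL/sycl.hpp>
    #include <vector>

    namespace s = cl::sycl;

    int main()
    {
        constexpr size_t n = 256, wg = 64;
        std::vector<int> out(n, 0);
        s::queue q;
        {
            s::buffer<int> buf(out.data(), s::range<1>(n));
            q.submit([&](s::handler& cgh)
            {
                auto acc = buf.get_access<s::access::mode::write>(cgh);
                cgh.parallel_for<class broadcast_demo>(
                    s::nd_range<1>(s::range<1>(n), s::range<1>(wg)), [=](s::nd_item<1> it)
                {
                    auto g = it.get_group();
                    // Broadcast the first work-item's global id to all work-items of the group.
                    int leader = s::group_broadcast(g, static_cast<int>(it.get_global_linear_id()));
                    s::group_barrier(g);
                    acc[it.get_global_id()] = leader;
                });
            });
        }
    }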

  • Permission issue in manual build

    Permission issue in manual build

    Hello, I'm trying to build hipSYCL manually, following the steps in the README file.

    git clone --recurse-submodules https://github.com/illuhad/hipSYCL
    cd hipSYCL
    mkdir build
    cd build
    cmake -DCMAKE_INSTALL_PREFIX=. ..
    

    and this is the output

    -- The C compiler identification is GNU 4.8.5
    -- The CXX compiler identification is GNU 4.8.5
    -- Check for working C compiler: /usr/bin/cc
    -- Check for working C compiler: /usr/bin/cc -- works
    -- Detecting C compiler ABI info
    -- Detecting C compiler ABI info - done
    -- Detecting C compile features
    -- Detecting C compile features - done
    -- Check for working CXX compiler: /usr/bin/c++
    -- Check for working CXX compiler: /usr/bin/c++ -- works
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    CMake Warning (dev) at CMakeLists.txt:17 (set):
      implicitly converting 'INTEGER' to 'STRING' type.
    This warning is for project developers.  Use -Wno-dev to suppress it.
    
    -- Looking for pthread.h
    -- Looking for pthread.h - found
    -- Looking for pthread_create
    -- Looking for pthread_create - not found
    -- Looking for pthread_create in pthreads
    -- Looking for pthread_create in pthreads - not found
    -- Looking for pthread_create in pthread
    -- Looking for pthread_create in pthread - found
    -- Found Threads: TRUE
    -- Found CUDA: /software/nvidia/cuda/10.0 (found version "10.0")
    -- Boost version: 1.57.0
    -- Found the following Boost libraries:
    --   filesystem
    --   system
    -- Boost version: 1.57.0
    -- Configuring done
    -- Generating done
    -- Build files have been written to: /path/hipSYCL/build
    

    Then I tried to build it with make install, which causes a fatal error:

    make install
    Scanning dependencies of target hipSYCL_cuda
    [  2%] Building CXX object src/libhipSYCL/CMakeFiles/hipSYCL_cuda.dir/application.cpp.o
    syclcc fatal error: [Errno 13] Permission denied: '/path/hipSYCL/bin/'
    make[2]: *** [src/libhipSYCL/CMakeFiles/hipSYCL_cuda.dir/application.cpp.o] Error 255
    make[1]: *** [src/libhipSYCL/CMakeFiles/hipSYCL_cuda.dir/all] Error 2
    make: *** [all] Error 2
    

    However, the permissions of the source and build directories have not changed since their creation, and /path/hipSYCL/bin/ exists and contains syclcc and syclcc-clang.

  • Add dedicated backend queues for inorder queues and priority queue support

    Add dedicated backend queues for inorder queues and priority queue support

    • Introduces rt::inorder_executor for simple, straight-forward in-order execution
    • Make rt::multi_queue_executor rely on multiple inorder_executor
    • Add mechanism to runtime to create dedicated inorder_executor objects (which backends may or may not support)
    • Change default behavior of in-order queue to request dedicated in-order executor, if supported by backend (otherwise, uses backend-provided executor). All fancy hipSYCL extensions like hipSYCL_retarget should still work, even if a dedicated inorder-executor has been requested, so no code should break. This allows expert users to have more control over scheduling decisions
    • Add hipSYCL_priority{int} queue property. When used with dedicated in-order executor (i.e. in-order queue), passes this priority to HIP/CUDA stream creation methods. On HIP, uses hipStreamCreateWithPriority. I'm not sure if this actually does something or if we need to create the stream with CU mask to achieve the desired effect.

    @al42and @pszi1ard this one is for you, apologies for the delay. Things were buggy for a long time, I hope it works now, although I would still consider it early access ;)
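    For reference, a minimal sketch of how this might look from user code (only the property name hipSYCL_priority is taken from the description above; its namespace and exact spelling, as well as the combination with the standard in_order property, are assumptions here):

    #include <CL/sycl.hpp>

    namespace s = cl::sycl;

    int main()
    {
        // An in-order queue requests a dedicated backend executor (stream) if the backend supports it.
        s::queue high_prio{s::property_list{
            s::property::queue::in_order{},
            s::property::queue::hipSYCL_priority{1} // assumed spelling/namespace of the new property
        }};

        high_prio.single_task([](){ /* ... */ }).wait();
    }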

  • Strategy: The future of the source-to-source transformation

    Strategy: The future of the source-to-source transformation

    Since we'll have a clang plugin hopefully soonish that will directly allow the clang CUDA/HIP frontend to ingest SYCL code (see issue #34), we could in principle drop the source-to-source transformation entirely. I'd like to start a discussion here with hipSYCL users and developers to get some feedback on possible futures of the hipSYCL compilation toolchain. Is the source-to-source transformation important to you and we should support both source-to-source and the new clang plugin? Do you need nvcc support? Or is clang support (with the plugin) sufficient for you?

    Here are some pros of the source-to-source transformation that come to my mind:

    • Allows compilation of SYCL code with nvcc. This can be interesting from a marketing position ("you can do anything that nvcc can and can use the newest CUDA features right away")
    • Possible to specify areas in the code with preprocessor definitions that hipSYCL shouldn't modify. This could be beneficial if you're interested in mix-and-match with SYCL and CUDA/HIP code.

    The new clang plugin on the other hand gives us:

    • Much more robustness and reliability (there are edge cases in the source-to-source transformation...)
    • Faster compilation speed
    • Solid SYCL support
    • Paves the way for runtime selection whether a kernel should be executed on host or device (could in principle however also be implemented with source-to-source and clang's CUDA implementation [but likely not with nvcc])
    • Potentially even some parts of the C++ standard library could be used in kernels (although not yet with the initial version of the plugin)
    • Implementation of specific optimizations in the future since we have access to the IR

    While not impossible, it may require some additional effort to support both the current source-to-source transformation and the new clang plugin approach because the clang plugin treats any function without attributes implicitly as __host__ __device__. This means that all functions for SYCL kernels (e.g. math functions) must also support compilation for both host and device. At the moment, we assume in the runtime that everything used in kernels is __device__ only. This is also assumed by the current source-to-source transformation. Also, if we still have to support source-to-source, it may limit our ability to implement things with IR transformations.

  • [WIP] Add runtime components for Metal backend

    [WIP] Add runtime components for Metal backend

    This is a work-in-progress draft of the Metal backend. It currently adds the metal_hardware_context class and half-implements the Metal allocator. Many components, such as the allocator and blit encoder, have working prototypes written in Swift. They're located at metal-usm. I just need to translate them to C++.

  • hipSYCL generates extra empty kernels

    hipSYCL generates extra empty kernels

    With this code, 5 kernels are generated for a single GPU target. It seems that each kernel lambda generates one kernel with the actual content plus one extra empty kernel (and one additional empty kernel is generated regardless of how many kernels there are). There are even more when I use "named" kernels.

    	sycl::queue q;
    	q.single_task([]()
    	{
    		__hipsycl_if_target_hip(asm("s_nop 1"));
    	}).wait();
    	q.single_task([]()
    	{
    		__hipsycl_if_target_hip(asm("s_nop 2"));
    	}).wait();
    

    Only the two kernels that start with _Z16 contain actual code.

    _Z16__hipsycl_kernelIZ4mainEUlvE_Evv.kd
    _Z30__hipsycl_kernel_name_templateIZ4mainEUlvE0_Evv.kd
    _Z30__hipsycl_kernel_name_templateIZ4mainEUlvE_Evv.kd
    _Z16__hipsycl_kernelIZ4mainEUlvE0_Evv.kd
    _Z30__hipsycl_kernel_name_templateI24__hipsycl_unnamed_kernelEvv.kd
    
  • A few questions about hipSYCL's capabilities

    A few questions about hipSYCL's capabilities

    Hi, apologies if this is mentioned anywhere in the docs but I couldn't find the answers to some questions I had about hipSYCL:

    • AMD has notoriously poor support for ROCm on their consumer GPUs. What happens if a program built with hipSYCL detects an AMD GPU that isn't supported by ROCm? Does it fall back to basic OpenCL or the CPU, etc.?
    • Is AArch64 a valid target, or will it be? I found mentions of a Metal backend, which would have to work on AArch64 for any Apple Silicon device; would this carry over to being able to run on other AArch64-based platforms? My employer is looking to do some demonstrations on machines that have an Arm-based CPU and an NVIDIA or AMD GPU.
    • Also, based on the Metal ticket, that would mean macOS support as well. Is there any plan for non-experimental Windows support? I saw you have a page about working on Windows, but it's not a stable process. This isn't critical, but it would be preferable if we could target all three major platforms.

    Thanks

  • WIP: Generic half (fp16) support

    WIP: Generic half (fp16) support

    This WIP PR adds a generic sycl::half class that is supported on all backends/compilation flows; however, when native half support is unavailable, arithmetic operations may be carried out in fp32.

    This PR only provides the class itself and basic arithmetic functionality, no math builtins. Also no tests yet :P

    Depends on #862, because implementing this requires taking into account the SSCP future.
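    A minimal sketch of the intended usage (illustrative only; the exact conversions and the set of supported operators are defined by the PR itself):

    #include <CL/sycl.hpp>

    namespace s = cl::sycl;

    int main()
    {
        s::queue q;
        s::half* data = s::malloc_shared<s::half>(16, q);

        q.parallel_for(s::range<1>(16), [=](s::id<1> i)
        {
            s::half a{1.5f};
            s::half b{0.25f};
            // On backends without native fp16 support, this arithmetic may be
            // carried out in fp32 and rounded back to half.
            data[i[0]] = a + b * s::half{static_cast<float>(i[0])};
        }).wait();

        s::free(data, q);
    }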

  • Set `local_size` in reduction to default value

    Set `local_size` in reduction to default value

    Passing a local_size of 1 in a parallel_for reduction previously caused an infinite loop. This PR sets the local_size to a default value of 128 if the user passed 1.

    Fixes #857
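    For context, a sketch of the kind of reduction this affects (SYCL 2020 reduction interface with a USM result pointer; an illustrative reconstruction, not the reproducer from #857):

    #include <CL/sycl.hpp>

    namespace s = cl::sycl;

    int main()
    {
        s::queue q;
        int* result = s::malloc_shared<int>(1, q);
        *result = 0;

        q.submit([&](s::handler& cgh)
        {
            // A local size of 1 previously made the reduction loop forever;
            // with this PR it is replaced internally by a default of 128.
            cgh.parallel_for(s::nd_range<1>(s::range<1>(1024), s::range<1>(1)),
                             s::reduction(result, s::plus<int>{}),
                             [=](s::nd_item<1> it, auto& sum) { sum += 1; });
        }).wait();
        // *result is now 1024

        s::free(result, q);
    }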
