Bolt is a C++ template library optimized for GPUs. Bolt provides high-performance library implementations for common algorithms such as scan, reduce, transform, and sort.

Bolt is a C++ template library optimized for heterogeneous computing. Bolt is designed to provide high-performance library implementations for common algorithms such as scan, reduce, transform, and sort. The Bolt interface was modeled on the C++ Standard Template Library (STL). Developers familiar with the STL will recognize many of the Bolt APIs and customization techniques.

The primary goal of Bolt is to make it easier for developers to utilize the inherent performance and power-efficiency benefits of heterogeneous computing. It provides easy-to-use interfaces and comprehensive documentation for the library routines, memory management, control interfaces, and host/device code sharing.

Compared to writing the equivalent functionality in OpenCL™, you'll find that Bolt requires significantly fewer lines of code and less developer effort. Bolt is designed to provide a standard way to develop an application that can execute either on a regular CPU or on any available OpenCL™-capable accelerated compute unit, with a single code path.
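
A minimal sketch of that single code path, assuming the bolt::cl::control object and its setForceRunMode option described in the Bolt documentation: the same sort call can be directed at the OpenCL device, the multicore CPU (TBB) path, or a serial CPU fallback simply by changing the forced run mode.

#include <bolt/cl/control.h>
#include <bolt/cl/sort.h>
#include <algorithm>
#include <cstdlib>
#include <vector>

int main()
{
    std::vector<int> a( 8192 );
    std::generate( a.begin(), a.end(), rand );

    // Copy the default control and force a particular execution path.
    // Other assumed run modes: SerialCpu, OpenCL, Automatic.
    bolt::cl::control ctrl = bolt::cl::control::getDefault();
    ctrl.setForceRunMode( bolt::cl::control::MultiCoreCpu );

    // The control object is passed as the first argument; the algorithm
    // call itself is unchanged.
    bolt::cl::sort( ctrl, a.begin(), a.end() );
    return 0;
}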

Here's a link to our Bolt wiki page.

Prerequisites

Windows

  1. Visual Studio 2010 or later (VS2012 for C++ AMP)
  2. Tested with 32/64-bit Windows® 7/8 and Windows® Blue
  3. CMake 2.8.10
  4. TBB 4.1 Update 1 or above (for the multicore CPU path only). See Building Bolt with TBB.
  5. AMD APP SDK 2.8 or later.

Note: If both Visual Studio 2012 and Visual Studio 2010 are installed, Visual Studio 2010 should be updated to SP1.

Linux

  1. GCC 4.6.3 or later
  2. Tested with openSUSE 12.3, RHEL 6.4 64-bit, RHEL 6.3 32-bit, and Ubuntu 13.04
  3. CMake 2.8.10
  4. TBB 4.1 Update 1 or above (for the multicore CPU path only). See Building Bolt with TBB.
  5. AMD APP SDK 2.8 or later.

Note: The pre-built Bolt binaries for Linux are built with GCC 4.7.3, and applications should be built with the same version; otherwise, build Bolt from source with GCC 4.6.3 or higher.

Catalyst™ package

The latest Catalyst driver contains the most recent OpenCL runtime. The recommended Catalyst package is the latest 13.11 beta driver.

Catalyst 13.4 and higher is supported.

Note: Catalyst 13.9 is not supported.

Supported Devices

AMD APU Family with AMD Radeon™ HD Graphics

  • A-Series
  • C-Series
  • E-Series
  • E2-Series
  • G-Series
  • R-Series

AMD Radeon™ HD Graphics

  • 7900 Series (7990, 7970, 7950)
  • 7800 Series (7870, 7850)
  • 7700 Series (7770, 7750)

AMD Radeon™ HD Graphics

  • 6900 Series (6990, 6970, 6950)
  • 6800 Series (6870, 6850)
  • 6700 Series (6790, 6770, 6750)
  • 6600 Series (6670)
  • 6500 Series (6570)
  • 6400 Series (6450)
  • 6xxxM Series

AMD Radeon™ Rx 2xx Graphics

  • R9 2xx Series
  • R8 2xx Series
  • R7 2xx Series

AMD FirePro™ Professional Graphics

  • W9100

Compiled binary Windows packages (zip packages) for Bolt may be downloaded from the Bolt landing page hosted on AMD's Developer Central website.

Examples

The simple example below shows how to use Bolt to sort a random array of 8192 integers.

#include <bolt/cl/sort.h>
#include <vector>
#include <algorithm>   // std::generate
#include <cstdlib>     // rand

int main ()
{
    // generate random data (on host)
    size_t length = 8192;
    std::vector<int> a( length );
    std::generate( a.begin(), a.end(), rand );

    // sort, run on best device in the platform
    bolt::cl::sort(a.begin(), a.end());
    return 0;
}

The code will be familiar to programmers who have used the C++ Standard Template Library; the difference is the include file (bolt/cl/sort.h) and the bolt::cl namespace before the sort call. Bolt developers do not need to learn a new device-specific programming model to leverage the power and performance advantages of heterogeneous computing.
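
As a sketch of that STL-style customization (assuming the bolt::cl::greater functor from bolt/cl/functional.h, which mirrors std::greater), sorting in descending order only changes the comparator passed to the call:

#include <bolt/cl/sort.h>
#include <bolt/cl/functional.h>   // bolt::cl::greater (assumed to mirror std::greater)
#include <algorithm>
#include <cstdlib>
#include <vector>

int main()
{
    std::vector<int> a( 8192 );
    std::generate( a.begin(), a.end(), rand );

    // Descending sort: same call shape as std::sort with a comparator.
    bolt::cl::sort( a.begin(), a.end(), bolt::cl::greater<int>() );
    return 0;
}

The next example applies bolt::cl::inclusive_scan to both a bolt::cl::device_vector and a std::vector; it is explained below the code.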

#include <bolt/cl/device_vector.h>
#include <bolt/cl/scan.h>
#include <vector>
#include <numeric>

int main()
{
  size_t length = 1024;
  // Create a device_vector and initialize it to 1
  bolt::cl::device_vector<int> boltInput( length, 1 );

  // Calculate the inclusive_scan of the device_vector
  bolt::cl::inclusive_scan( boltInput.begin(), boltInput.end(), boltInput.begin() );

  // Create a std::vector and initialize it to 1
  std::vector<int> stdInput( length, 1 );

  // Calculate the inclusive_scan of the std::vector
  bolt::cl::inclusive_scan( stdInput.begin(), stdInput.end(), stdInput.begin() );
  return 0;
}

This example shows how Bolt simplifies management of heterogeneous memory. The creation and destruction of device-resident memory is abstracted inside the bolt::cl::device_vector<> class, which provides an interface familiar to nearly all C++ programmers. All of Bolt's provided algorithms can take either a normal std::vector or the bolt::cl::device_vector<> class, which lets the user control when and where memory is transferred between host and device to optimize performance.
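
A short sketch of that explicit control over transfers (assuming the device_vector range constructor and the two-argument bolt::cl::reduce overload behave as in the Bolt documentation): the host data below is copied to the device once at construction time, and the reduction then runs entirely on device-resident memory.

#include <bolt/cl/device_vector.h>
#include <bolt/cl/reduce.h>
#include <numeric>
#include <vector>

int main()
{
    std::vector<int> host( 1024 );
    std::iota( host.begin(), host.end(), 1 );   // 1, 2, ..., 1024 on the host

    // One explicit host-to-device transfer at construction time.
    bolt::cl::device_vector<int> device( host.begin(), host.end() );

    // The reduction operates on device-resident data; no implicit copies back and forth.
    int sum = bolt::cl::reduce( device.begin(), device.end() );

    return ( sum == 1024 * 1025 / 2 ) ? 0 : 1;   // expect 524800
}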

Copyright and Licensing information

© 2012,2014 Advanced Micro Devices, Inc. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Comments
  • scatter_if in bolt::amp not possible?

    Hi, I'm studying Bolt and wanted to implement an example program that needs the scatter_if operation (http://thrust.github.io/doc/group__scattering.html#ga1079bc05bcb3d4b5080f1e07444fee37). I started to port the Thrust scatter_if code, which uses permutation_iterator, but came across this (https://groups.google.com/forum/#!topic/thrust-users/Xe2JkFy_hUk). The Google Groups post claims that permutation_iterator is not possible in an AMP kernel because of the restriction AMP puts on the use of pointers in kernels (i.e., restrict(amp)). Is this true? If so, is it possible to implement permutation_iterator in bolt::cl?

    BTW, the Bolt forum on AMD Developer Central does not seem to work correctly. It is set to private, and my post there does not seem to go through. :(

  • Differences in develop and master branch.

    I noticed that there are some differences between the develop branch and the master/v1.0 branch. Is that intentional? Most of the differences are minor documentation differences that probably won't affect much, but it seems like something was missed when syncing the v1.0/master/develop branches in preparation for the v1.0 release.

    I'm just getting started with the Git way of doing things; I've only been using SVN, so it's possible that I'm missing something...

    Also, to submit an entry for the Bolt Sample Code Contest, I should open a pull request against the develop branch, right? Or does it not matter?

  • Fix various CMake issues when building on Linux

    The following commits fix various issues when building Bolt on Linux:

    • Do not enable amp build by default unless VS compiler is used
    • Correctly call bootstrap and b2 when building boost
    • Correct some filename cases

    So far, compilation still fails with g++-4.7, but at least compilation starts.

  • Missing Linux installation instructions

    After downloading the binary tarball for Linux and unpacking it, I find a directory with some stuff in it, but no instructions on installation. For example, should one copy include, lib and lib64 to /usr/local? Or is it intended that Bolt applications should set up -I and -L compiler flags to wherever one unpacked Bolt? Is any special care needed if one already has Boost (and no doubt a different version of Boost) installed to avoid version conflicts?

  • Style checker to use for Bolt

    Does Bolt have a recommended style checker? I noticed that even though Bolt has a coding style guideline, not all of it is followed or enforced. In particular, the use of tab characters and indentation seems wrong and differs depending on who checks in the code.

    The guideline states "Use only spaces, and indent 2 spaces at a time", yet most of the code is indented with 4 spaces and uses tab characters as well as spaces.

  • Bolt 1.2: bolt::cl::sort hangs for larger odd power-of-two buffer sizes with 1000 iterations.

    If we run bolt::cl::sort for 1000 iterations with larger odd-power-of-two buffer sizes such as 2^23 or 2^25 for the double and float data types, it hangs. There are no issues with non-power-of-two sizes or with even-power-of-two sizes such as 2^24 and 2^26.

  • Bolt 1.2: Some AMD APP SDK 2.9 samples fail to build against the Bolt 1.2 package.

    When we try to build the Bolt samples from the AMD APP SDK by linking against the Bolt 1.2 package, two samples (BoxFilterSAT and Stockdataanalysis) fail to build with compilation errors.

  • Problem with gtest download URL in cmake file

    The download URL for gtest in superbuild/ExternalGtest.cmake is not working. It begins with "https://..."; after deleting the "s" character, it works fine.

  • Develop

    I have added the TBB exception code path and restructured the code for the serial, TBB, and default (OpenCL) paths.

    I have added the appropriate Google Test cases so that the exhaustive code paths are exercised.

    Ensured no line exceeds 120 columns.

    Replaced tabs with 4 spaces.

    Reported some issues in the AMP routines.

  • bolt 1.2, typo in transform_reduce.inl

    Bolt 1.2, file include/bolt/cl/detail/transform_reduce.inl, lines 446-447:

    dblog->CodePathTaken(BOLTLOG::BOLT_TRANSFORMREDUCE,BOLTLOG::BOLT_MULTICORE_CPU,"
    ::Transform_Reduce::MULTICORE_CPU");

    Clearly the string literal is split across two lines, which makes the compiler issue unnecessary warnings.

    Z Koza

  • Bolt 1.2: Ubuntu 32-bit, gcc 4.8.1: std::stable_sort API compilation failure

    This occurs when calling the std::sort function on a device_vector, i.e., running the function on the CPU while the data is on the GPU. These calls work fine on Windows 32/64-bit and Linux 64-bit; we see the failure only on 32-bit Linux, which may be due to a compiler restriction.

  • OpenCL kernel compile error when running an OpenCL test case, for example clBolt.Test.StableSort

    Hi, I cloned the Bolt code and built it with ROCm 1.9 and the OpenCL runtime. I configured the project successfully with the cmake command "cmake -DBOOST_LIBRARYDIR=/home/qcxie/software/boost_1_65_1/stage/lib -DBOOST_ROOT=/home/qcxie/software/boost_1_65_1 -DGTEST_ROOT=/home/qcxie/software/boost_1_65_1 -DCMAKE_BUILD_TYPE=Debug -DBolt_BUILD64=1 -DCMAKE_CXX_FLAGS="-std =c++14 -fpermissive -I /opt/rocm/opencl/include -L/opt/rocm/opencl/lib/x86_64 -lOpenCL" ../", but running a test case such as clBolt.Test.StableSort throws an OpenCL kernel compile error: "error: unknown type name 'namespace' namespace bolt { namespace cl {". How should I configure or set the buildProgram options to fix this issue? Thanks very much.

  • Fixed "%d , gx " console spam.

    It looks like there was an accidental printf leftover from debugging. The problem is evident in the MonteCarloPI sample and is fixed by removing the printf.

  • unable to input bolt::cl::transform_iterator into bolt::cl::copy

    #include <iostream>
    #include <vector>
    #include <bolt/cl/iterator/counting_iterator.h>
    #include <bolt/cl/iterator/transform_iterator.h>
    #include <bolt/cl/functional.h>
    #include <bolt/cl/device_vector.h>
    #include <bolt/cl/copy.h>
    
    BOLT_FUNCTOR(GetSquare,
    struct GetSquare
    {
    public:
        int operator()(const int& globalId) const
        {
            return globalId*globalId;
        }
    };);
    
    int main()
    {
        const std::size_t n=10;
    
        bolt::cl::control ctrl = bolt::cl::control::getDefault();
    
        bolt::cl::device_vector<int> debug(n);
    
        auto globalId = bolt::cl::make_counting_iterator(0);
    
        // This is OK
        // bolt::cl::transform(globalId, globalId + n, debug.begin(), GetSquare());
    
        // This causes compilation error
        auto square = bolt::cl::make_transform_iterator(globalId, GetSquare());
        bolt::cl::copy(square, square + n, debug.begin());
    
        for(int i = 0; i < n; i++)
        {
            std::cout << i << ": " << debug[i] << std::endl;
        }
    
        return 0;
    }
    

    This problem seems to be because bolt::cl::transform_iterator has the getContainer() method only as a template:

    template<typename Container >
    Container& getContainer() const
    {
        return this->base().getContainer( );
    }
    

    but bolt::cl::copy needs ITERATOR::getContainer() without any template argument:

    V_OPENCL( kernels[whichKernel].setArg( 0, first.getContainer().getBuffer()), "Error setArg kernels[ 0 ]" );
    

    This can be solved with C++11

    auto getContainer() const -> decltype(base().getContainer())
    {
        return this->base().getContainer();
    }
    

    I don't have any idea in C++03 (boost::result_of?).

  • Don't default to 32 bit builds.

    Linux systems do not have 32-bit headers and libraries installed by default. Debugging the errors that arise because of this causes a lot of overhead on the user's end.
