Triton - a language and compiler for writing highly efficient custom Deep-Learning primitives.

Triton

This is the development repository of Triton, a language and compiler for writing highly efficient custom Deep-Learning primitives. The aim of Triton is to provide an open-source environment to write fast code at higher productivity than CUDA, but also with higher flexibility than other existing DSLs.

The foundations of this project are described in the following MAPL2019 publication: Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. Please consider citing this work if you use Triton!

The official documentation contains installation instructions and tutorials.
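
For a quick feel of the programming model, here is a minimal vector-addition kernel in the style of the tutorials (a sketch that assumes a recent pip release of triton and a CUDA-capable GPU):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements          # guard the tail of the array
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    x = torch.rand(4096, device='cuda')
    y = torch.rand(4096, device='cuda')
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)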

Compatibility

Supported Platforms:

  • Linux

Supported Hardware:

  • NVIDIA GPUs (Compute Capability 7.0+)
  • Under development: AMD GPUs, CPUs

Contributing

Community contributions are more than welcome, whether it be to fix bugs or to add new features. Feel free to open GitHub issues about your contribution ideas, and we will review them. A contributor's guide containing general guidelines is coming soon!

Disclaimer

Triton is a fairly recent project, and it is under active development. We expect it to be pretty useful in a wide variety of cases, but don't be surprised if it's a bit rough around the edges :)

Comments
  • Changes to Triton functions inside a Python script do not trigger recompilation

    When I make changes to a function like this in the Python tutorials and rerun python ..., the new changes are not recompiled.

    @triton.jit
    def leaky_relu(x):
        return tl.where(x >= 0, x, 0.01 * x)
    

    I had to manually clean the cache every time to compile the new code. Is there a better way to do this? Thanks.
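
    For reference, a hedged workaround sketch for clearing the cache by hand, assuming the compiled-kernel cache lives in the default ~/.triton/cache directory (the exact location may differ across versions):

    import os
    import shutil

    # Wipe Triton's on-disk kernel cache so edited @triton.jit functions are
    # recompiled on the next run. The cache path is an assumption; adjust it
    # if your installation uses a different directory.
    cache_dir = os.path.expanduser("~/.triton/cache")
    if os.path.isdir(cache_dir):
        shutil.rmtree(cache_dir)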

  • What is the expected perf for int8 matmul?

    Hi,

    I compared the matmul perf in fp16 and int8, using the tutorial code in https://triton-lang.org/master/getting-started/tutorials/03-matrix-multiplication.html#sphx-glr-getting-started-tutorials-03-matrix-multiplication-py, and got the following result:

    -------------------------------------------------------------------------
     dtype         M          N          K         Time(s)      Rate(TF/s)
    -------------------------------------------------------------------------
     fp16       38400       4096       1024        0.001457     221.026
     int8       38400       4096       1024        0.004148      77.664
    

    This is on an A100 GPU. For fp16 the TF/s figure is reasonable, since the Tensor Core peak is 314 TF/s, but for int8 it seems to be off by a lot. Is this expected?
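
    For reference, the Rate column is consistent with counting 2*M*N*K operations per matmul; a quick sanity check of the numbers above:

    # Recompute the reported rates from the timings: an (M, N, K) matmul
    # performs roughly 2*M*N*K multiply-accumulate operations.
    M, N, K = 38400, 4096, 1024
    for dtype, seconds in [("fp16", 0.001457), ("int8", 0.004148)]:
        tops = 2 * M * N * K / seconds / 1e12
        print(f"{dtype}: {tops:.1f} T(FL)OP/s")   # ~221 for fp16, ~78 for int8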

  • [RFC] triton dequantize instruction

    Implement a dequantize instruction in Triton:

    def dequantize(input, scale, shift, nbit, dst_ty=float16, _builder=None):
    

    input is nbit (8, 4, or 2) integers packed into int16s or int32s. scale and shift are float16 scalars. For example, for nbit = 8, input is of type int32. The instruction will convert [{int8_0, int8_1, int8_2, int8_3}, {int8_4, int8_5, int8_6, int8_7}, ...] (every four int8s packed into one int32) to scale * [int8_0, int8_1, int8_2, int8_3, int8_4, int8_5, int8_6, int8_7, ..., ] + shift in float16s. If the size of input is N, the size of output is 4 * N. Similarly for int4 and int2, eight int4s are packed into one int32 and eight int2s are packed into one int16. See test file https://github.com/yuguo68/triton/blob/dequantize_inst/python/test/unit/language/test_dequantize.py for code examples.
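
    For clarity, here is a NumPy sketch of the intended 8-bit semantics. It is only an illustration of the math, not the proposed implementation, and it assumes the int8 lanes sit in each int32 in little-endian order:

    import numpy as np

    def dequantize_int8_reference(packed_int32, scale, shift):
        # Reinterpret each int32 as four int8 lanes, then dequantize to float16.
        lanes = packed_int32.view(np.int8)
        return np.float16(scale) * lanes.astype(np.float16) + np.float16(shift)

    packed = np.array([0x04030201], dtype=np.int32)     # lanes 1, 2, 3, 4
    print(dequantize_int8_reference(packed, 0.5, 1.0))  # -> [1.5 2.  2.5 3. ]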

    For our use case at Meta, the scale and shift are usually concatenated together with the quantized integers.

    input in memory: scale(16 bits), shift (16bits), int8_0, int8_1, int8_2, ..., 
    output = scale * ([int8_0, int8_1, int8_2, ...]) + shift
    

    similarly for int4 and int2.

    We find that using the existing Triton instructions (bit masks, bitwise casts, etc.) to unpack the quantized integers is slow. Hence we decided to implement an algorithm similar to https://github.com/pytorch/FBGEMM/blob/6a59bb6621ba9ec7d650ccb78b78ea24d62a3904/fbgemm_gpu/include/fbgemm_gpu/fbgemm_cuda_utils.cuh#L1566-L1619. We observe a 2X speedup for our Meta use case.

    During the implementation, we found that it is critical to make the nano tile size (nts_) https://github.com/openai/triton/blob/09cc2d454b442301e88d1df153214732bd8714d8/include/triton/codegen/analysis/layout.h#L232-L233 consistent between the input and output. For example, for 8-bit quantization with an input size of 64 (output size 256), the output layout [0, 0, 0, 0, 1, 1, 1, 1, …, 31, 31, 31, 31, 0, 0, 0, 0, 1, 1, 1, 1, …, 31, 31, 31, 31] does not work with input layout [0, 0, 1, 1, …, 31, 31], but works with input layout [0, 1, …, 31; 0, 1, …, 31]. Input layout [0, 0, 1, 1, …, 31, 31] works with output layout [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, …, 31, 31, 31, 31, 31, 31, 31, 31]. In general, supposing size(output)/size(input) = N, it requires nts_(output) = N * nts_(input).

    Currently we use tl.multiple_of hints https://github.com/yuguo68/triton/blob/2b3ba853a6f641584b0fb4c4ed8e15b772f7549c/python/test/unit/language/test_dequantize.py#L32-L38 to enforce the nano tile size consistency. We would love to hear about better ways to enforce it, for example in populate_starting_multiple_dequantize and populate_max_contiguous_dequantize.

    The PR author is new to the Triton backend and would appreciate feedback/comments for improvement, especially for the changes in lib/codegen/analysis/align.cc and lib/codegen/analysis/axes.cc. We are aware of the new MLIR backend and would love to implement this instruction in the new backend as well. Comments on the feasibility in the new backend are appreciated. Thank you!

    @ngimel @jianyuh @ajtulloch

  • Fix Warnings and Enable Warnings as Errors

    Enable Warnings-as-Errors

    I enabled HandleLLVMOptions in the top-level CMake project and used the imported CMake option LLVM_ENABLE_WERROR to do the heavy lifting of setting compiler options correctly across compiler versions.

    Fixed Warnings

    I built triton-mlir with both GCC and Clang/LLVM and fixed all the resulting warnings. Most were harmless, but I did find a couple of real issues during this work:

    • Signed/unsigned comparisons and unused code were the most common warnings.
    • I fixed a couple of incorrect usages of std::move that were resulting in extra copies.
    • Replaced the usage of tmpnam (unsafe) with built-in LLVM functionality.
    • Fixed a place where, due to a typo, the code checked whether the success function existed rather than whether a function returned success.
  • Support for Sparse-Dense Matrix Multiplication

    Hi all,

    Is there any support for using GPU Tensor Cores in sparse-dense matrix multiplication (SpMM) or sampled dense-dense matrix multiplication (SDDMM)?

    Thanks!

  • TypeError: function takes exactly 16 arguments (13 given)

    (Issue was first posted in torchdynamo, but I'm reposting it here, since it seems like it is potentially an issue with triton instead.)

    The following seems to throw TypeError: function takes exactly 16 arguments (13 given) no matter what I do. I've reproduced it several times now.

    import torch
    from torch import tensor, device
    import torch.fx as fx
    from torchdynamo.testing import rand_strided
    from math import inf
    from torch.fx.experimental.proxy_tensor import make_fx
    
    # torch version: 1.14.0.dev20221009
    # torch cuda version: 11.7
    # torch git version: 0dbefb2414417e80371ef3d8224404d4a522f86e
    
    
    # CUDA Info:
    # nvcc: NVIDIA (R) Cuda compiler driver
    # Copyright (c) 2005-2022 NVIDIA Corporation
    # Built on Wed_Jun__8_16:49:14_PDT_2022
    # Cuda compilation tools, release 11.7, V11.7.99
    # Build cuda_11.7.r11.7/compiler.31442593_0
    
    # GPU Hardware Info:
    # NVIDIA A100-SXM4-40GB : 1
    
    
    from torch.nn import *
    class Repro(torch.nn.Module):
        def __init__(self):
            super().__init__()
    
    
    
        def forward(self, arg0_1, new_zeros_1):
            slice_scatter = torch.ops.aten.slice_scatter.default(new_zeros_1, arg0_1, 2, 0, 2048);  new_zeros_1 = arg0_1 = None
            return (slice_scatter,)
    
    args = [((16, 128, 2048), (262144, 2048, 1), torch.float32, 'cuda'), ((16, 128, 2112), (270336, 2112, 1), torch.float32, 'cuda')]
    args = [rand_strided(sh, st, dt, dev) for (sh, st, dt, dev) in args]
    mod = make_fx(Repro())(*args)
    
    from torchinductor.compile_fx import compile_fx_inner
    from torchdynamo.debug_utils import same_two_models
    
    compiled = compile_fx_inner(mod, args)
    compiled(*args)
    
  • [small] use torch.int for autotuning cache

    For stupid reasons, ops on int8 are 3 times slower than on int, and for another set of stupid reasons we are not using cudaMemset for zero_, so using an int8 buffer in do_bench makes it slow.
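
    A hedged sketch of the change, assuming do_bench flushes the L2 cache by zeroing a large scratch buffer between timed runs (the buffer size here is illustrative):

    import torch

    # Before: an int8 scratch buffer, whose zero_() is reported to be ~3x slower.
    # cache = torch.empty(int(256e6), dtype=torch.int8, device='cuda')

    # After: the same number of bytes, 4x fewer elements, faster zero_().
    cache = torch.empty(int(256e6) // 4, dtype=torch.int, device='cuda')
    cache.zero_()   # called between measurements to flush the L2 cache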

  • triton==1.0.0.dev20210329 no longer installable via pip

    Hi - would it be possible to reinstate triton==1.0.0.dev20210329 on pip, or make it clear how to update to the latest dev branch? The API seems to have changed significantly in the latest nightlies, and some functions in https://github.com/microsoft/DeepSpeed rely on that particular interface.

  • Regression for caffe opencl branch.

    The latest ISAAC code triggers many test failures with Caffe's OpenCL branch. The last good commit is 6ac5e1f55b1cae5 ("Templates/Reduce1D: now properly loading 2D scalars").

    Since that commit, both "General: Internal code generator overhaul" and "JIT: No longer using fallbacks for stride[0] > 1" introduce some regressions.

    It's easy to build Caffe's opencl branch as follows:

    mkdir build
    cd build
    cmake -DUSE_GREENTEA=ON -DUSE_ISAAC=ON ..
    make -j8
    make runtest

    Then you will see many new failures with the above two commits.

    By the way, it's better to use the latest Beignet driver as the OpenCL compiler. The good commit works great with Beignet.

    @ptillet Could you look at this issue? Thanks.

  • Refresh cache when the source code of outlined functions is changed

    Draft proposal to update the cache logic.

    Test cases and documentation are incomplete.

    By revisiting the caching logic, my understanding is that we should rebuild a binary/kernel whenever any of the following characteristics changes:

    1. The signature of the kernel's source code.
    2. The signatures of inlined JIT functions being called.
    3. The signatures of outlined JIT functions being called.
    4. The length of any non-constexpr variables.
    5. The number of variables.
    6. The value of constexpr variables.
    7. The alignment, data type, and ptr length of tensors.
    8. The number of warps and stages.

    This patch tries to fix a corner case of characteristic 3 (C3).
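
    As a simplified, hedged sketch of the idea behind C3 (not Triton's actual caching code; the dependency list is a hypothetical stand-in), the cache key could fold in the source text of every JIT function the kernel calls, so editing an outlined dependency invalidates the cached binary:

    import hashlib
    import inspect

    def cache_key(kernel_fn, called_jit_fns):
        # Hash the kernel source together with the source of every JIT function
        # it calls (inlined or outlined), so a change in any of them misses the cache.
        h = hashlib.sha256(inspect.getsource(kernel_fn).encode())
        for fn in called_jit_fns:            # hypothetical dependency list
            h.update(inspect.getsource(fn).encode())
        return h.hexdigest()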

  • Fix LLVM error for bfloat16

    I was encountering some LLVM "Cannot select" issues when using bf16 with certain ops, even with the latest bf16 patch. I've added a minimal reproducer as a test.

    I don't fully understand the source of the issue here, but switching to representing bfloat16 types as int16 seems to solve it and still gives correct results. I also found that a couple of other "Cannot select" errors were fixed by this as well, and removed those workarounds.
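
    For intuition, bfloat16 and int16 are both 16 bits wide, so carrying values around as int16 and only reinterpreting the bits at the boundaries is lossless. A hedged PyTorch illustration of that reinterpretation (not the actual codegen change):

    import torch

    x = torch.tensor([1.0, -2.5], dtype=torch.bfloat16)
    bits = x.view(torch.int16)            # reinterpret the same 16 bits as int16
    roundtrip = bits.view(torch.bfloat16)
    assert torch.equal(x, roundtrip)      # lossless: only the declared type changes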

    Without this patch, the new test fails with the following error:

    triton/python/test/unit/operators/test_norm.py::test_normalized[dtype0] LLVM ERROR: Cannot select: 0x55b44dbdb690: bf16 = bitcast 0x55b44dbdea60
      0x55b44dbdea60: i16,ch,glue = CopyFromReg 0x55b44dbde9f8, Register:i16 %9, 0x55b44dbde9f8:1
        0x55b44dbde720: i16 = Register %9
        0x55b44dbde9f8: ch,glue = inlineasm 0x55b44dbde928, TargetExternalSymbol:i64'@$1 ld.global.b16 {$0}, [ $2 + 0];', MDNode:ch<null>, TargetConstant:i64<1>, TargetConstant:i32<196618>, Register:i16 %9, TargetConstant:i32<65545>, Register:i1 %10, TargetConstant:i32<851977>, Register:i64 %11, 0x55b44dbde928:1
          0x55b44dbde580: i64 = TargetExternalSymbol'@$1 ld.global.b16 {$0}, [ $2 + 0];'
          0x55b44dbde650: i64 = TargetConstant<1>
          0x55b44dbde6b8: i32 = TargetConstant<196618>
          0x55b44dbde720: i16 = Register %9
          0x55b44dbde858: i32 = TargetConstant<65545>
          0x55b44dbde788: i1 = Register %10
          0x55b44dbde990: i32 = TargetConstant<851977>
          0x55b44dbde8c0: i64 = Register %11
          0x55b44dbde928: ch,glue = CopyToReg 0x55b44dbde7f0, Register:i64 %11, 0x55b44dbde448, 0x55b44dbde7f0:1
            0x55b44dbde8c0: i64 = Register %11
            0x55b44dbde448: i64 = add 0x55b44dbdbbd8, 0x55b44dbe2b08
              0x55b44dbdbbd8: i64 = add 0x55b44dbdb830, 0x55b44dbde3e0
                0x55b44dbdb830: i64,ch = load<(dereferenceable invariant load 8 from `i64 addrspace(101)* null`, addrspace 101)> 0x55b44d28a958, TargetExternalSymbol:i64'_normalized_op_param_0', undef:i64
                  0x55b44dbdacd0: i64 = TargetExternalSymbol'_normalized_op_param_0'
                  0x55b44dbdada0: i64 = undef
                0x55b44dbde3e0: i64 = NVPTXISD::MUL_WIDE_SIGNED 0x55b44dbdb968, Constant:i32<2>
                  0x55b44dbdb968: i32 = mul 0x55b44dbdb900, 0x55b44dbdbb70
                    0x55b44dbdb900: i32 = llvm.nvvm.read.ptx.sreg.ctaid.x TargetConstant:i64<4999>
                      0x55b44dbdb898: i64 = TargetConstant<4999>
                    0x55b44dbdbb70: i32,ch = load<(dereferenceable invariant load 4 from `i32 addrspace(101)* null`, addrspace 101)> 0x55b44d28a958, TargetExternalSymbol:i64'_normalized_op_param_2', undef:i64
                      0x55b44dbdb010: i64 = TargetExternalSymbol'_normalized_op_param_2'
                      0x55b44dbdada0: i64 = undef
                  0x55b44dbe0f60: i32 = Constant<2>
              0x55b44dbe2b08: i64 = NVPTXISD::MUL_WIDE_SIGNED 0x55b44dbe29d0, Constant:i32<2>
                0x55b44dbe29d0: i32 = or 0x55b44dbe2e48, 0x55b44dbe2fe8
                  0x55b44dbe2e48: i32 = shl 0x55b44dbdb7c8, Constant:i32<5>
                    0x55b44dbdb7c8: i32 = srl 0x55b44dbdb558, Constant:i32<5>
                      0x55b44dbdb558: i32 = llvm.nvvm.read.ptx.sreg.tid.x TargetConstant:i64<5057>
    
                      0x55b44dbdee70: i32 = Constant<5>
                    0x55b44dbdee70: i32 = Constant<5>
                  0x55b44dbe2fe8: i32 = sub 0x55b44dbdb558, 0x55b44dbe2e48
                    0x55b44dbdb558: i32 = llvm.nvvm.read.ptx.sreg.tid.x TargetConstant:i64<5057>
                      0x55b44dbdb4f0: i64 = TargetConstant<5057>
                    0x55b44dbe2e48: i32 = shl 0x55b44dbdb7c8, Constant:i32<5>
                      0x55b44dbdb7c8: i32 = srl 0x55b44dbdb558, Constant:i32<5>
    
    
                      0x55b44dbdee70: i32 = Constant<5>
                0x55b44dbe0f60: i32 = Constant<2>
            0x55b44dbde7f0: ch,glue = CopyToReg 0x55b44d28a958, Register:i1 %10, 0x55b44dbde518
              0x55b44dbde788: i1 = Register %10
              0x55b44dbde518: i1 = setcc 0x55b44dbe29d0, 0x55b44dbdbb70, setlt:ch
                0x55b44dbe29d0: i32 = or 0x55b44dbe2e48, 0x55b44dbe2fe8
                  0x55b44dbe2e48: i32 = shl 0x55b44dbdb7c8, Constant:i32<5>
                    0x55b44dbdb7c8: i32 = srl 0x55b44dbdb558, Constant:i32<5>
                      0x55b44dbdb558: i32 = llvm.nvvm.read.ptx.sreg.tid.x TargetConstant:i64<5057>
    
                      0x55b44dbdee70: i32 = Constant<5>
                    0x55b44dbdee70: i32 = Constant<5>
                  0x55b44dbe2fe8: i32 = sub 0x55b44dbdb558, 0x55b44dbe2e48
                    0x55b44dbdb558: i32 = llvm.nvvm.read.ptx.sreg.tid.x TargetConstant:i64<5057>
                      0x55b44dbdb4f0: i64 = TargetConstant<5057>
                    0x55b44dbe2e48: i32 = shl 0x55b44dbdb7c8, Constant:i32<5>
                      0x55b44dbdb7c8: i32 = srl 0x55b44dbdb558, Constant:i32<5>
    
    
                      0x55b44dbdee70: i32 = Constant<5>
                0x55b44dbdbb70: i32,ch = load<(dereferenceable invariant load 4 from `i32 addrspace(101)* null`, addrspace 101)> 0x55b44d28a958, TargetExternalSymbol:i64'_normalized_op_param_2', undef:i64
                  0x55b44dbdb010: i64 = TargetExternalSymbol'_normalized_op_param_2'
                  0x55b44dbdada0: i64 = undef
    In function: _normalized_op
    Fatal Python error: Aborted
    
  • Comparing TritonIR to TVM TensorIR

    As Philippe states, Triton helps researchers with no CUDA experience write expert-level GPU code. In the AI community there are other frameworks that generate high-performance kernels without CUDA code; among these, TVM seems to be a strong competitor.

    TVM separates compute logic from execution scheduling: users describe the logic by writing a lambda function over the whole data, and the code is compiled to TensorIR. After that, users can optimize the execution flow by transforming the TensorIR module with schedule passes (see the diagram in the TVM docs).

    I wonder what the difference is between TritonIR and TensorIR. More specifically, could TritonIR be considered a specialized schedule schema within TensorIR? If not, what are the cases where Triton is more suitable? Thanks!

  • [OPTIMIZER] Update the versionMinor in MMA layout for volta

    Continues the work in https://github.com/openai/triton/pull/990.

    Background

    On Volta, the versionMinor in MmaEncodingAttr holds some state about DotOp's operands, but those operands may be modified by other patterns, making the state out of date.

    This PR helps to correct the states.

    Implementation

    It adds three new patterns:

    1. CollectMmaToUpdateForVolta helps to collect and build a map holding the MmaEncodingAttr instances with wrong states, and to create new correct ones for them.
    2. UpdateMMAVersionMinorForVolta helps to replace the Ops generating the wrong MmaEncodingAttr instances with new correct ones. Currently it supports the following Ops: a. convert_layout[X -> mma], b. arith.constant SplatAttr : !tensor<mma>, c. dot ... : !tensor<mma>.
    3. RematerializeForloop helps to rematerialize the scf.for loops that hold the correct MmaEncodingAttr in InitArgs but return wrong MmaEncodingAttr instances as a result.

    Limitation

    This PR chooses the mapping approach to bypass the IR-walk complexity caused by the circular dependency between dot_operand[parent] and mma. We use the MmaEncodingAttr instance as the mapping key, but there might be multiple DotOps holding different DotOperand(IsMMAv1Row) values that share the same wrong MmaEncodingAttr instance. To make each DotOp's (wrong) MmaEncodingAttr unique, we might need to add an ID field to MmaEncodingAttr.

  • How to leverage hardware intrinsics like convolution in triton

    Background

    We are evaluating how to build a new backend on Triton, but there is a problem: our hardware has some relatively coarse-grained instructions, such as convolution, and the Triton language has no similar mechanism.

    The following is the conv instruction description (refer to https://www.cambricon.com/docs/sdk_1.9.0/cntoolkit_3.1.4/cambricon_bang_c_4.1.3/2Builtin-Functions/Artificial%20Intelligence%20Functions.html#bang-conv for more information):

    void __bang_conv(float *dst, float *src, float *kernel, int channel_input, int height, int width, int kernel_height, int kernel_width, int stride_width, int stride_height, int channel_output)
    

    We notice there is a dot op in the Triton language, which can easily be mapped to the mma instruction in the NV backend. Similarly, we can also see convolution built on the dot operator in PyTorch Inductor. The following is a code snippet from PyTorch Inductor (code that handles conv1x1 specially and prefetching has been removed):

    # compute offset for output and the initial ptr for input
    ...
    # -----------------------------------------------------------
    # allocate accumulator
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE)
    for crs in range(0, CRS, BLOCK_K):
        # load inc ptr of x, update x_ptrs
        off_x_crs = crs + BLOCK_K + tl.arange(0, BLOCK_K)
        delta_xh_ptrs += BLOCK_K
        delta_xw_ptrs += BLOCK_K
        delta_xc_ptrs += BLOCK_K
        delta_xh = tl.load(delta_xh_ptrs, mask=off_x_crs < CRS, other=0)
        delta_xw = tl.load(delta_xw_ptrs, mask=off_x_crs < CRS, other=0)
        delta_xc = tl.load(delta_xc_ptrs, mask=off_x_crs < CRS, other=0)
        off_x_crs_unpacked = (
            delta_xh * stride_xh + delta_xw * stride_xw + delta_xc * stride_xc
        )
        x_ptrs = x + off_x_nhw[:, None] + off_x_crs_unpacked[None, :]

        mask_x = (
            (off_x_n < BATCH)[:, None]
            & (off_x_crs < CRS)[None, :]
            & (off_x_h[:, None] + delta_xh[None, :] >= 0)
            & (off_x_h[:, None] + delta_xh[None, :] < IN_H)
            & (off_x_w[:, None] + delta_xw[None, :] >= 0)
            & (off_x_w[:, None] + delta_xw[None, :] < IN_W)
        )
        mask_w = (off_x_crs < CRS)[:, None] & (off_w_k < KERNEL_N)[None, :]
        # ------ prefetch ------
        # ------ load x ------
        matrix_x = tl.load(x_ptrs, mask=mask_x, other=0.0)
        # ------ load w ------
        matrix_w = tl.load(w_ptrs, mask=mask_w, other=0.0)
        # ------ matrix multiplication ------
        acc += tl.dot(matrix_x, matrix_w)
    
    

    The above code can be divided into several steps:

    1. calculate the offset of the data
    2. load the data
    3. call dot to do multiplication and accumulation

    How can the above logic be converted into a conv instruction? There are several possible solutions below, which may not be complete. I don't know which one the Triton team prefers, or whether there is a better solution.

    Proposal 1: Try to identify convolution instructions on Triton IR

    The principle of this proposal is very simple, but it may be difficult to implement. It starts from the Triton IR obtained from the Triton language and tries to generate a conv instruction by analyzing the instructions.

    But from the above logic, how would the compiler recognize that it is a convolution with [KH, KW, PH, PW, ...]?

    Proposal 2: Add conv ops in triton language

    Similar to the dot op, we introduce a conv op in the Triton language:

    tl.conv(input, weights, IC, ....)
    
  • Any way to synchronize within kernel?

    Hello all

    I am trying to write a kernel that captures the forward pass of an HMM. Basically, this involves taking an emission matrix of shape (sequence_length, batch_size, num_states) and iteratively doing numerically-stable dot products against a transition matrix of size (num_states, num_states). When sequence_length is long and batch_size and num_states are small (1.2M, 1, and 32, respectively, for an example I'm working with), the overhead of launching >1.2M PyTorch ops takes up the majority of the time. I was hoping I could capture the entire forward pass in a Triton kernel to get rid of some of this overhead.

    Today was my first day doing GPU programming of any sort, so apologies if the kernel or the question are silly, but this is what I have so far:

    @triton.jit
    def _forward_kernel(
    	f_ptr, t_ptr,
    	L, M, K,
    	stride_fl, stride_fm, stride_fk,
    	stride_tk, stride_tn,
    	BLOCK_SIZE: tl.constexpr
    ):
    
    	row_idx = tl.program_id(axis=0)
    	row_start_ptr = f_ptr + row_idx * stride_fm
    
    	offsets = tl.arange(0, BLOCK_SIZE)
    
    	row_ptrs = row_start_ptr + offsets
    	col_ptrs = t_ptr + offsets[:, None]*stride_tk + offsets[None, :]*stride_tn
    	out_ptrs = row_ptrs + stride_fl
    
    	cols = tl.load(col_ptrs)
    	row = tl.load(row_ptrs)
    	emissions = tl.load(out_ptrs)
    
    	for i in range(L-1):
    		row_max = tl.max(row, axis=0)
    		row_norm = tl.exp(row - row_max)
    
    		dot = tl.sum(row_norm[:, None] * cols, axis=0)
    		log_dot = tl.log(dot) + row_max
    		
    		tl.store(out_ptrs, log_dot + emissions)
    
    		row_ptrs += stride_fl
    		out_ptrs += stride_fl
    
    		row = log_dot
    		emissions = tl.load(out_ptrs)
    

    I'm almost positive that the underlying logic is correct because of other tests I've run, but I keep getting NaNs in the output. However, if I set the loop to for i in range(1) then the first output is correct. This suggests to me that the issue is due to asynchronous operations. Is there a way to force the kernel to synchronize at the end of each loop iteration? Alternatively, is there a better way to reduce the overhead of tons of small operations?

    Thanks!
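
    On the synchronization question: recent Triton versions expose tl.debug_barrier(), which synchronizes the threads within a single program instance but not across program instances. A hedged sketch of placing it at the end of each loop iteration:

    import triton
    import triton.language as tl

    @triton.jit
    def _looped_kernel(x_ptr, N, BLOCK_SIZE: tl.constexpr):
        offsets = tl.arange(0, BLOCK_SIZE)
        acc = tl.load(x_ptr + offsets)
        for _ in range(N):
            acc = acc * 2.0
            # All threads of this program instance wait here before the next
            # iteration; other program instances are unaffected.
            tl.debug_barrier()
        tl.store(x_ptr + offsets, acc)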

  • Bug when using for-loop in kernel.

    Triton seems to be a nice framework, but I have met a strange bug when using a for-loop in a kernel. Here is my minimal reproducible example.

    import torch
    import triton
    import triton.language as tl
    
    @triton.jit
    def test_kernel(x_ptr, LOOP_NUM: tl.constexpr):
        # pid = tl.program_id(axis=0)
        for i in range(LOOP_NUM):
            print(i)
    
    @triton.jit
    def test_kernel_wrapper(x_ptr, LOOP_NUM: tl.constexpr):
        test_kernel(x_ptr, LOOP_NUM)
    
    def test(loop_num=10):
        x = torch.randn(5).cuda() # avoid bug: RuntimeError: CUDA: Error- context is destroyed
        test_kernel_wrapper[(1, )](x, loop_num)
    
    

    When loop_num >= 11, the following error occurs:

    int32[constexpr[1]]
    Traceback (most recent call last):
      File "test.py", line 754, in <module>
        test(loop_num=11)
      File "test.py", line 752, in test
        test_kernel_wrapper[(1, )](x, loop_num)
      File "/home/junjie/anaconda3/envs/mmnew/lib/python3.8/site-packages/triton/code_gen.py", line 999, in __call__
        return self.kernel(*wargs, **kwargs, grid=self.grid)
      File "/home/junjie/anaconda3/envs/mmnew/lib/python3.8/site-packages/triton/code_gen.py", line 988, in __call__
        return _triton.runtime.launch(wargs, self.fn.do_not_specialize, cache_key, self.fn.arg_names,
      File "/home/junjie/anaconda3/envs/mmnew/lib/python3.8/site-packages/triton/code_gen.py", line 956, in add_to_cache
        return self.fn._warmup(key, arg_types=arg_types, device=device_idx, attributes=attributes, constants=constants, num_warps=num_warps, num_stages=num_stages,
      File "/home/junjie/anaconda3/envs/mmnew/lib/python3.8/site-packages/triton/code_gen.py", line 1285, in _warmup
        binary = self._compile(**compile)
      File "/home/junjie/anaconda3/envs/mmnew/lib/python3.8/site-packages/triton/code_gen.py", line 1320, in _compile
        name, asm, shared_mem = _triton.code_gen.compile_ttir(backend, generator.module, device, num_warps, num_stages, extern_libs)
    IndexError: map::at
    

    If we set loop_num <= 10, everything works well and the output is correct.
