TensorRT is a C++ library for high performance inference on NVIDIA GPUs and deep learning accelerators.

License Documentation

TensorRT Open Source Software

This repository contains the Open Source Software (OSS) components of NVIDIA TensorRT. Included are the sources for TensorRT plugins and parsers (Caffe and ONNX), as well as sample applications demonstrating usage and capabilities of the TensorRT platform. These open source software components are a subset of the TensorRT General Availability (GA) release with some extensions and bug-fixes.

Build

Prerequisites

To build the TensorRT-OSS components, you will first need the following software packages.

TensorRT GA build

System Packages

Optional Packages

Downloading TensorRT Build

  1. Download TensorRT OSS

    git clone -b master https://github.com/nvidia/TensorRT TensorRT
    cd TensorRT
    git submodule update --init --recursive
  2. (Optional - if not using TensorRT container) Specify the TensorRT GA release build

    If using the TensorRT OSS build container, TensorRT libraries are preinstalled under /usr/lib/x86_64-linux-gnu and you may skip this step.

    Else download and extract the TensorRT GA build from NVIDIA Developer Zone.

    Example: Ubuntu 18.04 on x86-64 with cuda-11.4

    cd ~/Downloads
    tar -xvzf TensorRT-8.2.1.8.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz
    export TRT_LIBPATH=`pwd`/TensorRT-8.2.1.8

    Example: Windows on x86-64 with cuda-11.4

    cd ~\Downloads
    Expand-Archive .\TensorRT-8.2.1.8.Windows10.x86_64.cuda-11.4.cudnn8.2.zip
    $Env:TRT_LIBPATH = "$(Get-Location)\TensorRT-8.2.1.8"
    $Env:PATH += ';C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\MSBuild\15.0\Bin\'
  3. (Optional - for Jetson builds only) Download the JetPack SDK

    1. Download and launch the JetPack SDK manager. Log in with your NVIDIA Developer account.
    2. Select the platform and target OS (example: Jetson AGX Xavier, Linux Jetpack 4.6), and click Continue.
    3. Under Download & Install Options change the download folder and select Download now, Install later. Agree to the license terms and click Continue.
    4. Move the extracted files into the docker/jetpack_files folder of the TensorRT OSS checkout.
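
       If you kept the SDK manager's default download location, the move might look like this (the source path below is illustrative and depends on the download folder you chose; the destination is the docker/jetpack_files directory of your TensorRT OSS checkout):

       mv ~/Downloads/nvidia/sdkm_downloads/* ./TensorRT/docker/jetpack_files/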

Setting Up The Build Environment

For Linux platforms, we recommend that you generate a docker container for building TensorRT OSS as described below. For native builds, on Windows for example, please install the prerequisite System Packages.

  1. Generate the TensorRT-OSS build container.

    The TensorRT-OSS build container can be generated using the supplied Dockerfiles and build script. The build container is configured for building TensorRT OSS out-of-the-box.

    Example: Ubuntu 18.04 on x86-64 with cuda-11.4.2 (default)

    ./docker/build.sh --file docker/ubuntu-18.04.Dockerfile --tag tensorrt-ubuntu18.04-cuda11.4

    Example: CentOS/RedHat 7 on x86-64 with cuda-10.2

    ./docker/build.sh --file docker/centos-7.Dockerfile --tag tensorrt-centos7-cuda10.2 --cuda 10.2

    Example: Ubuntu 18.04 cross-compile for Jetson (aarch64) with cuda-10.2 (JetPack SDK)

    ./docker/build.sh --file docker/ubuntu-cross-aarch64.Dockerfile --tag tensorrt-jetpack-cuda10.2 --cuda 10.2

    Example: Ubuntu 20.04 on aarch64 with cuda-11.4.2

    ./docker/build.sh --file docker/ubuntu-20.04-aarch64.Dockerfile --tag tensorrt-aarch64-ubuntu20.04-cuda11.4
  2. Launch the TensorRT-OSS build container.

    Example: Ubuntu 18.04 build container

    ./docker/launch.sh --tag tensorrt-ubuntu18.04-cuda11.4 --gpus all

    NOTE:

    1. Use the --tag corresponding to the build container generated in Step 1.
    2. NVIDIA Container Toolkit is required for GPU access (running TensorRT applications) inside the build container.
    3. The sudo password for Ubuntu build containers is 'nvidia'.
    4. Specify a port number with --jupyter to launch Jupyter notebooks.
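
    For example (the port number below is illustrative), launching the container with GPU access and a Jupyter port could look like:

    ./docker/launch.sh --tag tensorrt-ubuntu18.04-cuda11.4 --gpus all --jupyter 8888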

Building TensorRT-OSS

  • Generate Makefiles or VS project (Windows) and build.

    Example: Linux (x86-64) build with default cuda-11.4.2

     cd $TRT_OSSPATH
     mkdir -p build && cd build
     cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out
     make -j$(nproc)

    NOTE: On CentOS7, the default g++ version does not support C++14. For native builds (not using the CentOS7 build container), first install devtoolset-8 to obtain the updated g++ toolchain as follows:

    yum -y install centos-release-scl
    yum-config-manager --enable rhel-server-rhscl-7-rpms
    yum -y install devtoolset-8
    export PATH="/opt/rh/devtoolset-8/root/bin:${PATH}"
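
    To verify that the updated toolchain is the one picked up by the shell, you can check the compiler version (the exact output will vary):

    g++ --version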

    Example: Linux (aarch64) build with default cuda-11.4.2

     cd $TRT_OSSPATH
     mkdir -p build && cd build
     cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64-native.toolchain
     make -j$(nproc)

    Example: Native build on Jetson (aarch64) with cuda-10.2

    cd $TRT_OSSPATH
    mkdir -p build && cd build
    cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DTRT_PLATFORM_ID=aarch64 -DCUDA_VERSION=10.2
    CC=/usr/bin/gcc make -j$(nproc)

    NOTE: C compiler must be explicitly specified via CC= for native aarch64 builds of protobuf.

    Example: Ubuntu 18.04 Cross-Compile for Jetson (aarch64) with cuda-10.2 (JetPack)

     cd $TRT_OSSPATH
     mkdir -p build && cd build
     cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64.toolchain -DCUDA_VERSION=10.2 -DCUDNN_LIB=/pdk_files/cudnn/usr/lib/aarch64-linux-gnu/libcudnn.so -DCUBLAS_LIB=/usr/local/cuda-10.2/targets/aarch64-linux/lib/stubs/libcublas.so -DCUBLASLT_LIB=/usr/local/cuda-10.2/targets/aarch64-linux/lib/stubs/libcublasLt.so
     make -j$(nproc)

    NOTE: The latest JetPack SDK v4.6 only supports TensorRT 8.0.1.

    Example: Windows (x86-64) build in Powershell

     cd $Env:TRT_OSSPATH
     mkdir -p build ; cd build
     cmake .. -DTRT_LIB_DIR=$Env:TRT_LIBPATH -DTRT_OUT_DIR="$(Get-Location)\out" -DCMAKE_TOOLCHAIN_FILE=..\cmake\toolchains\cmake_x64_win.toolchain
     msbuild ALL_BUILD.vcxproj

    NOTE:

    1. The default CUDA version used by CMake is 11.4.2. To override this, for example to 10.2, append -DCUDA_VERSION=10.2 to the cmake command (see the example after this note).
    2. If samples fail to link on CentOS7, create this symbolic link: ln -s $TRT_OUT_DIR/libnvinfer_plugin.so $TRT_OUT_DIR/libnvinfer_plugin.so.8
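
    For example, the Linux x86-64 configure step above with the CUDA version overridden to 10.2 (paths as in the earlier examples) could look like:

     cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCUDA_VERSION=10.2
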
  • Required CMake build arguments are:

    • TRT_LIB_DIR: Path to the TensorRT installation directory containing libraries.
    • TRT_OUT_DIR: Output directory where generated build artifacts will be copied.
  • Optional CMake build arguments (an example combining several of these follows the list):

    • CMAKE_BUILD_TYPE: Specify whether the generated binaries are release or debug (containing debug symbols) builds. Values consist of [Release] | Debug
    • CUDA_VERSION: The version of CUDA to target, for example [11.4.2].
    • CUDNN_VERSION: The version of cuDNN to target, for example [8.2].
    • PROTOBUF_VERSION: The version of Protobuf to use, for example [3.0.0]. Note: Changing this will not configure CMake to use a system version of Protobuf; it will configure CMake to download and try building that version.
    • CMAKE_TOOLCHAIN_FILE: The path to a toolchain file for cross compilation.
    • BUILD_PARSERS: Specify if the parsers should be built, for example [ON] | OFF. If turned OFF, CMake will try to find precompiled versions of the parser libraries to use in compiling samples. First in ${TRT_LIB_DIR}, then on the system. If the build type is Debug, then it will prefer debug builds of the libraries before release versions if available.
    • BUILD_PLUGINS: Specify if the plugins should be built, for example [ON] | OFF. If turned OFF, CMake will try to find a precompiled version of the plugin library to use in compiling samples. First in ${TRT_LIB_DIR}, then on the system. If the build type is Debug, then it will prefer debug builds of the libraries before release versions if available.
    • BUILD_SAMPLES: Specify if the samples should be built, for example [ON] | OFF.
    • GPU_ARCHS: GPU (SM) architectures to target. By default we generate CUDA code for all major SMs. Specific SM versions can be specified here as a quoted, space-separated list to reduce compilation time and binary size. A table of compute capabilities of NVIDIA GPUs can be found on the NVIDIA developer site. Examples:
      • NVIDIA A100: -DGPU_ARCHS="80"
      • Tesla T4, GeForce RTX 2080: -DGPU_ARCHS="75"
      • Titan V, Tesla V100: -DGPU_ARCHS="70"
      • Multiple SMs: -DGPU_ARCHS="80 75"
    • TRT_PLATFORM_ID: Bare-metal builds (unlike containerized cross-compilation) on non-Linux/x86 platforms must explicitly specify the target platform. Currently supported options: x86_64 (default), aarch64
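
    As an illustration only (the values are examples, not recommendations), a release build restricted to a single SM that skips building the plugins could be configured as:

     cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCMAKE_BUILD_TYPE=Release -DGPU_ARCHS="75" -DBUILD_PLUGINS=OFF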

References

TensorRT Resources

Known Issues

Comments
  • How to use NMS with Pytorch model (that was converted to ONNX -> TensorRT)

    All right, so, I have a PyTorch SSD detector with MobileNet. Since I failed to convert the model with NMS in it (to be more precise, I converted it, but the TRT engine is built incorrectly from that .onnx file), I decided to leave the NMS part to TRT.

    In general, there are several ways to add NMS in TRT:

    1. Use graphsurgeon with TensorFlow model and add NMS as graphsurgeon.create_plugin_node
    2. Use CPP code for plugin (https://github.com/NVIDIA/TensorRT/tree/master/plugin/batchedNMSPlugin)
    3. Use DeepStream that has NMS plugin

    But, I have a PyTorch model that I converted to onnx and then to TRT without any CPP code (Python only). My question is very simple: how can I combine my current pipeline with the CPP plugin for NMS?

  • [08/18/2019-20:58:10] [E] [TRT] UffParser: Validator error: mrcnn_mask_deconv/add_1: Unsupported operation _AddV2 &&&& FAILED TensorRT.sample_maskrcnn # ./sample_uff_maskRCNN -d

    [08/18/2019-20:58:10] [E] [TRT] UffParser: Validator error: mrcnn_mask_deconv/add_1: Unsupported operation _AddV2
    &&&& FAILED TensorRT.sample_maskrcnn # ./sample_uff_maskRCNN -d ~/data
    

    Ubuntu 16.04, TensorRT 6.x (built from source from the git branch release/6.0). Following the tutorial, the Matterport Mask R-CNN model converts to UFF successfully, but inference produces the error above.

  • tensorrt7 load onnx resize ops error

    Description

    When I load the ONNX model, the FPN F.interpolate op fails with the following error:


    While parsing node number 209 [Resize]: ERROR: builtin_op_importers.cpp:2412 In function importResize: [8] Assertion failed: scales.is_weights() && "Resize scales must be an initializer!"


    This error occurs in onnx-tensorrt.

    Environment

    TensorRT Version: 7.0
    GPU Type: 1060
    Nvidia Driver Version: 441.22
    CUDA Version: 10.2
    CUDNN Version: 7.6.5.32
    Operating System + Version: win10

  • How to create one engine that serves multiple input sources?

    How can I create one engine (e.g., one TensorRT detector engine) that can serve 6 or 10 cameras for object detection using threading, without the outputs from these sources getting mixed up?

  • How to add NMS with Tensorflow Model (that was converted to ONNX)

    I have taken an SSDLite MobileNet V2 model from the TensorFlow model zoo.

    Steps:

    1. Generated the ONNX model using the tf2onnx lib:

       python -m tf2onnx.convert --graphdef mv2/ssdlite_mobilenet_v2_coco_2018_05_09/frozen_inference_graph.pb --output MODEL_frozen.onnx \
           --fold_const --opset 11 \
           --inputs image_tensor:0 \
           --outputs num_detections:0,detection_boxes:0,detection_scores:0,detection_classes:0

    2. Added the NMS layers in the ONNX model based on references from this issue:

    import onnx_graphsurgeon as gs
    import onnx
    import numpy as np
    
    input_model_path = "MODEL_frozen.onnx"
    output_model_path = "model_gs.onnx"
    
    @gs.Graph.register()
    def trt_batched_nms(self, boxes_input, scores_input, nms_output,
                        share_location, num_classes):
    
        boxes_input.outputs.clear()
        scores_input.outputs.clear()
        nms_output.inputs.clear()
    
        attrs = {
            "shareLocation": share_location,
            "numClasses": num_classes,
            "backgroundLabelId": 0,
            "topK": 116740,
            "keepTopK": 100,
            "scoreThreshold": 0.3,
            "iouThreshold": 0.6,
            "isNormalized": True,
            "clipBoxes": True
        }
        return self.layer(op="BatchedNMS_TRT", attrs=attrs,
                          inputs=[boxes_input, scores_input],
                          outputs=[nms_output])
    
    
    graph = gs.import_onnx(onnx.load(input_model_path))
    graph.inputs[0].shape=[1,300,300,3]
    print(graph.inputs[0].shape)
    
    for inp in graph.inputs:
        inp.dtype = np.int
    
    input = graph.inputs[0]
    
    tmap = graph.tensors()
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores/NonMaxSuppressionV5__1761:0"],
                          tmap["NonMaxSuppression__1763:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_1/NonMaxSuppressionV5__1737:0"],
                          tmap["NonMaxSuppression__1739:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1713:0"],
                          tmap["NonMaxSuppression__1715:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_3/NonMaxSuppressionV5__1689:0"],
                          tmap["NonMaxSuppression__1691:0"],
                          share_location=False,
                          num_classes=8)
    
    
    # Remove unused nodes, and topologically sort the graph.
    # graph.cleanup()
    # graph.toposort()
    # graph.fold_constants().cleanup()
    
    # Export the ONNX graph from graphsurgeon
    onnx.checker.check_model(gs.export_onnx(graph))
    onnx.save_model(gs.export_onnx(graph), output_model_path)
    
    print("Saving the ONNX model to {}".format(output_model_path))
    
    

    I am not able to figure out which nodes in the ONNX graph I should replace in place of "Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0" and the others:

    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores/NonMaxSuppressionV5__1761:0"],
                          tmap["NonMaxSuppression__1763:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_1/NonMaxSuppressionV5__1737:0"],
                          tmap["NonMaxSuppression__1739:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1713:0"],
                          tmap["NonMaxSuppression__1715:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_3/NonMaxSuppressionV5__1689:0"],
                          tmap["NonMaxSuppression__1691:0"],
                          share_location=False,
                          num_classes=8)
    

    MODEL_frozen.onnx.zip

    I have also attached the ONNX file. Any suggestions on how to find them?

  • BERT fp16 accuracy problem

    Description

    When using TRT to build an FP16 model, the inference accuracy differs too much from FP32. The model is BERT base. Why?

    Environment

    TensorRT Version: 7.2.1
    NVIDIA GPU: T4
    NVIDIA Driver Version: 440.59
    CUDA Version: 10.2
    CUDNN Version: 8.0.4
    Operating System: centos7
    Python Version (if applicable): 3.6
    Tensorflow Version (if applicable): 1.15.4
    PyTorch Version (if applicable):
    Baremetal or Container (if so, version):

    Steps To Reproduce

    Proceed as follows:

    1. tf (freeze mode) -> onnx (version: 1.8.1) -> trt engine
    2. When building the TRT engine, set these parameters:

       with builder.create_builder_config() as config:
           config.flags = config.flags | 1 << int(trt.BuilderFlag.FP16)
           ...

    3. At the same time, I also tried to set the precision on these layers (such as LayerNorm/moments/SquaredDifference, intermediate/dense/Erf, pooler/dense/Tanh, query_head_contrastive/Relu and so on):

       network.get_layer(i).precision = trt.DataType.FLOAT

       BUT it had no effect.

    I also found a very strange thing: when I compared layer0 and layer1, the accuracy is not much different, but at layer2 there is a big difference. This model has 12 layers, and each layer has the same structure.

  • Onnx Dynamic input to TensorRT

    [TensorRT] INTERNAL ERROR: Assertion failed: aMatrix.second == bMatrix.first ../rtExt/cuda/cudaMatrixMultiplyRunner.cpp:35
    Aborting...
    [TensorRT] ERROR: ../rtExt/cuda/cudaMatrixMultiplyRunner.cpp (35) - Assertion Error in assertDimsOkayForMatrixMultiplyLayer: 0 (aMatrix.second == bMatrix.first)

  • ONNX networks can't use INT8 calibration and batching

    Description

    This is due to mutually incompatible changes in the TRT7 release:

    https://docs.nvidia.com/deeplearning/sdk/tensorrt-release-notes/tensorrt-7.html

    ONNX parser with dynamic shapes support: The ONNX parser supports full-dimensions mode only. Your network definition must be created with the explicitBatch flag set.

    versus

    Known Issues: The INT8 calibration does not work with dynamic shapes. To work around this issue, ensure there are two passes in the code: using a fixed-shape input to build the engine in the first pass allows TensorRT to generate the calibration cache.

    This means the ONNX network must be exported at a fixed batch size in order to get INT8 calibration working, but now it's no longer possible to specify the batch size. I also verified that manually fixing up the inputs with setDimensions(...-1...) does not work, you will hit an assertion mg.nodes[mg.regionIndices[outputRegion]].size ==mg.nodes[mg.regionIndices[inputRegion]].size while building.

    One would think there might be sort of a workaround by exporting two different networks, one with a fixed batch size and a second one with a dynamic_axis, and then using the calibration from one for the other. ~~However, even here there are severe pitfalls: a calibration cache that is generated for, say, batch_size=1 won't necessarily work for larger batch sizes, presumably because they will generate a different convolution strategy that causes different accuracy issues.~~ Edit: This might've been another issue.

    Lastly, the calibrator itself appears to be using implicit batch sizes, and breaks on batch size > 1 as follows:

    TRT: Starting Calibration with batch size 16. Calibrated 16 images.
    TRT: Explicit batch network detected and batch size specified, use execute without batch size instead.
    TRT: C:\source\builder\cudnnCalibrator.cpp (707) - Cuda Error in nvinfer1::builder::Histogram::add: 700 (an illegal memory access was encountered)
    TRT: FAILED_ALLOCATION: Unknown exception
    TRT: C:\source\builder\cudnnCalibrator.cpp (703) - Cuda Error in nvinfer1::builder::Histogram::add: 700 (an illegal memory access was encountered)
    TRT: FAILED_ALLOCATION: Unknown exception
    TRT: C:\source\rtSafe\cuda\caskConvolutionRunner.cpp (233) - Cuda Error in nvinfer1::rt::task::CaskConvolutionRunner::allocateContextResources: 700 (an illegal memory access was encountered)
    TRT: FAILED_EXECUTION: Unknown exception
    TRT: Calibrated batch 0 in 2.62865 seconds.
    Cuda failure: 700

    with batch_size == 1, it's also hitting assertions:

    TRT: Explicit batch network detected and batch size specified, use execute without batch size instead.
    TRT: Assertion failed: d.nbDims >= 1 C:\source\rtSafe\safeHelpers.cpp:419
    Aborting...

    The combination of all these failures means that you can't really use ONNX networks in INT8 mode, at least the "Using a fixed shape input to build the engine in the first pass" recommendation hits all kinds of internal assertions as you can see above.

    Environment

    TensorRT Version: 7.0.0.11
    GPU Type: RTX 2080
    Nvidia Driver Version: 441.22
    CUDA Version: 10.2
    CUDNN Version: 7.6.0.5
    Operating System + Version: Windows 10
    Python Version (if applicable): 3.6
    TensorFlow Version (if applicable):
    PyTorch Version (if applicable): 1.3 stable
    Baremetal or Container (if container which image + tag): bare

    Relevant Files

    Steps To Reproduce

  • trt sampleUffMaskrcnn has a different result with maskrcnn implemented in keras

    Hi, I have used sampleUffMaskRCNN on my own dataset and it worked, but the results are different. The result of the TRT sampleUffMaskRCNN depends heavily on the anchor scales and anchor ratios; I set the same params in both test codes. The Keras one performs better, as it can show more objects (instances), but some objects in the TRT Mask R-CNN can't be detected, especially slender objects like a pole. Thanks for the help.

  • [REFERENCE] KeyError: 'mrcnn_mask_bn4/batchnorm/mul_1' in running sampleUffMaskRCNN demo

    This happens while trying to run the Mask R-CNN demo following this page.

    Ubuntu 16.04.6
    CUDA 10.1.168
    TensorRT 5.1.5.0
    uff 0.6.3

    Traceback (most recent call last):
      File "mrcnn_to_trt_single.py", line 165, in <module>
        main()
      File "mrcnn_to_trt_single.py", line 123, in main
        text=True, list_nodes=list_nodes)
      File "mrcnn_to_trt_single.py", line 158, in convert_model
        debug_mode = False
      File "/usr/lib/python3.5/dist-packages/uff/converters/tensorflow/conversion_helpers.py", line 233, in from_tensorflow_frozen_model
        return from_tensorflow(graphdef, output_nodes, preprocessor, **kwargs)
      File "/usr/lib/python3.5/dist-packages/uff/converters/tensorflow/conversion_helpers.py", line 108, in from_tensorflow
        pre.preprocess(dynamic_graph)
      File "./config.py", line 123, in preprocess
        connect(dynamic_graph, timedistributed_connect_pairs)
      File "./config.py", line 113, in connect
        if node_a_name not in dynamic_graph.node_map[node_b_name].input:
    KeyError: 'mrcnn_mask_bn4/batchnorm/mul_1'
    
  • (Upsample) How can I use onnx parser with opset 11 ?

    Description

    onnx-parser is basically built with ir_version 3, opset 7 (https://github.com/onnx/onnx-tensorrt/blob/master/onnx_trt_backend.cpp)

    Is there any way to use the ONNX parser with opset 11 support?

    I mean, the parser works only with opset 7. The parser works well if I use an ir4_opset7 ONNX model, but it doesn't work if I use an ir4_opset11 ONNX model.

    It also cannot parse opset 8 and 9.

    My onnx models are made by pytorch 1.4.0a.

    Can I rebuild the parser by changing only the BACKEND_OPSET constant inside onnx_trt_backend.cpp?

    Environment

    TensorRT Version: 7.0.0
    GPU Type: T4
    Nvidia Driver Version: 440.33.01
    CUDA Version: 10.2.89
    CUDNN Version: 7.6.5
    Operating System + Version: Ubuntu18.04
    Python Version (if applicable): 3.6.9
    TensorFlow Version (if applicable): 1.4.0
    PyTorch Version (if applicable): 1.4.0a

  • Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"

    Description

    Environment

    TensorRT Version: trt7
    NVIDIA GPU: 3060
    NVIDIA Driver Version: 515.57
    CUDA Version: 11.7
    CUDNN Version:
    Operating System: ubuntu18.04
    Python Version (if applicable): 3.9
    Tensorflow Version (if applicable): None
    PyTorch Version (if applicable): 1.10
    Baremetal or Container (if so, version):

    Relevant Files

    Hello, I want to use the "PyTorch to ONNX to TensorRT" pipeline to convert an op that is unsupported by both ONNX and TRT. So I use a fake ONNX op to complete the process from PyTorch to ONNX. After this, the ONNX model looks like this (screenshot not included).

    After that, I copied the msda plugin from TRT 8 into TRT 7. When I convert the model from ONNX to TensorRT, an error occurs (screenshot not included).

    Then I checked the pluginType, pluginVersion and pluginName; they are correct (screenshots not included).

    And in the CMakeLists I have the corresponding entries (screenshot not included).

    So what can I do to solve this problem?

    Steps To Reproduce

  • Error with multiple optimization profiles when fetching runtime dimensions. Assertion slots.size() >= static_cast<size_t>(code.nbSlots) failed. insufficient number of slots provided.

    Description

    After compiling our model with multiple optimization profiles, we create multiple execution contexts for each profile for inference. When using one of the contexts and trying to call context.get_tensor_shape to get the shape of our output, after we have set our input shape with context.set_input_shape, we see the following error:

    Error Code 2: Internal Error (Assertion slots.size() >= static_cast<size_t>(code.nbSlots) failed. insufficient number of slots provided)
    (0) ## this is the dimension printed.
    

    However, when compiling our model with one optimization profile and running the same code, we do not see any issues as the output dimensions are printed properly.

    Could someone explain the error message a bit more and suggest a path toward resolution?

    Environment

    TensorRT Version: 8.5.1.7
    NVIDIA GPU: Tesla T4
    NVIDIA Driver Version: 510.73.08
    CUDA Version: 11.6
    CUDNN Version: 8.6
    Operating System:
    Python Version (if applicable):
    Tensorflow Version (if applicable):
    PyTorch Version (if applicable): 1.10
    Baremetal or Container (if so, version):

    Relevant Files

    Steps To Reproduce

    1. deserialized engine: engine = runtime.deserialize_cuda_engine(serialized_engine)
    2. Create execution context: context = engine.create_execution_context()
    3. Set our input shape and print dimensions as follows
        in0 = engine.get_tensor_name(0)
        in1 = engine.get_tensor_name(1)
        out1 = engine.get_tensor_name(2)
        out2 = engine.get_tensor_name(3)
    
        context.set_input_shape(in0, (1, 10))
        context.set_input_shape(in1, (1,))
        print(context.get_tensor_shape(out1))
    
  • Can not convert auto-regression model to TensorRT engine

    TensorRT Version: 8.2.3.0

    When I convert an auto-regression model with a while_loop operator to a TensorRT engine with trtexec, it gives the following error:

    [01/03/2023-08:43:00] [E] [TRT] parsers/onnx/ModelImporter.cpp:783: --- End node ---
    [01/03/2023-08:43:00] [E] [TRT] parsers/onnx/ModelImporter.cpp:785: ERROR: parsers/onnx/ModelImporter.cpp:166 In function parseGraph: [6] Invalid Node - generic_loop_Loop__352
    [graphShapeAnalyzer.cpp::processCheck::581] Error Code 4: Internal Error ((Unnamed Layer* 3582) [Recurrence]: inputs to IRecurrenceLayer have different dimensions. First input has dimensions [3,1] and second input has dimensions [3,2].)
    [graphShapeAnalyzer.cpp::processCheck::581] Error Code 4: Internal Error ((Unnamed Layer* 3582) [Recurrence]: inputs to IRecurrenceLayer have different dimensions. First input has dimensions [3,1] and second input has dimensions [3,2].)
    [01/03/2023-08:43:00] [E] Failed to parse onnx file
    [01/03/2023-08:43:00] [E] Parsing model failed
    [01/03/2023-08:43:00] [E] Failed to create engine from model.
    [01/03/2023-08:43:00] [E] Engine set up failed

    It seems that the inputs to the while_loop must have a constant shape? How can I solve this problem?

  • stable diffusion demo, run error

    Description

    When running demo-diffusion.py, I met an error.

    Environment

    I used the provided Docker image, nvcr.io/nvidia/tensorrt:22.10-py3.

    TensorRT Version: 8.5.0.12
    NVIDIA GPU: V100
    NVIDIA Driver Version: 515.43.04
    CUDA Version: 11.8
    CUDNN Version: None
    Operating System: Ubuntu
    Python Version (if applicable): 3.8.10
    Tensorflow Version (if applicable):
    PyTorch Version (if applicable): 1.12.0+cu116
    Baremetal or Container (if so, version):

    Relevant Files

    [I] Total Nodes | Original: 1251, After Folding: 1078 | 173 Nodes Folded
    [I] Folding Constants | Pass 3
    [I] Total Nodes | Original: 1078, After Folding: 1078 | 0 Nodes Folded
    CLIP: fold constants .. 1078 nodes, 1812 tensors, 1 inputs, 1 outputs
    CLIP: shape inference .. 1078 nodes, 1812 tensors, 1 inputs, 1 outputs
    CLIP: removed 12 casts .. 1054 nodes, 1788 tensors, 1 inputs, 1 outputs
    CLIP: inserted 25 LayerNorm plugins .. 842 nodes, 1526 tensors, 1 inputs, 1 outputs
    CLIP: final .. 842 nodes, 1526 tensors, 1 inputs, 1 outputs
    Building TensorRT engine for onnx/clip.opt.onnx: engine/clip.plan
    [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
    [W] parsers/onnx/onnx2trt_utils.cpp:375: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
    [E] parsers/onnx/ModelImporter.cpp:740: While parsing node number 7 [LayerNorm -> "LayerNormV-0"]:
    [E] parsers/onnx/ModelImporter.cpp:741: --- Begin node ---
    [E] parsers/onnx/ModelImporter.cpp:742: input: "input.7" input: "LayerNormGamma-0" input: "LayerNormBeta-0" output: "LayerNormV-0" name: "LayerNormN-0" op_type: "LayerNorm" attribute { name: "epsilon" f: 1e-05 type: FLOAT }
    [E] parsers/onnx/ModelImporter.cpp:743: --- End node ---
    [E] parsers/onnx/ModelImporter.cpp:745: ERROR: parsers/onnx/builtin_op_importers.cpp:5365 In function importFallbackPluginImporter: [8] Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"
    [E] In node 7 (importFallbackPluginImporter): UNSUPPORTED_NODE: Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"
    [!] Could not parse ONNX correctly
    Traceback (most recent call last):
      File "demo-diffusion.py", line 482, in
        demo.loadEngines(args.engine_dir, args.onnx_dir, args.onnx_opset,
      File "demo-diffusion.py", line 241, in loadEngines
        engine.build(onnx_opt_path, fp16=True,
      File "/workspace/demo/Diffusion/utilities.py", line 72, in build
        engine = engine_from_network(network_from_onnx_path(onnx_path), config=CreateConfig(fp16=fp16, profiles=[p],
      File "", line 3, in func_impl
      File "/usr/local/lib/python3.8/dist-packages/polygraphy/backend/base/loader.py", line 42, in call
        return self.call_impl(*args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/polygraphy/backend/trt/loader.py", line 183, in call_impl
        trt_util.check_onnx_parser_errors(parser, success)
      File "/usr/local/lib/python3.8/dist-packages/polygraphy/backend/trt/util.py", line 85, in check_onnx_parser_errors
        G_LOGGER.critical("Could not parse ONNX correctly")
      File "/usr/local/lib/python3.8/dist-packages/polygraphy/logger/logger.py", line 597, in critical
        raise PolygraphyException(message) from None
    polygraphy.exception.exception.PolygraphyException: Could not parse ONNX correctly

  • [shuffleBuilder.cpp::addSupportedFormats::50] Error Code 2: Internal Error (Assertion formats.nbInputs() == 1 || formats.nbInputs() == 2 failed.)

    Description

    My model is trained in PyTorch, then quantized using pytorch-quantization in tensorrt/tools, then exported to ONNX; then I build the engine from the ONNX model.

    The error in the title occurred when I parsed the ONNX model to build the engine on my Jetson Xavier NX (I tested this pipeline on a GPU with TensorRT 8.5, and no error occurred). This is the log context:

    [TensorRT] VERBOSE: Eliminating concatenation node_of_outputs_coords
    [TensorRT] VERBOSE: Generating copy for 15813 to outputs_coords because input does not support striding.
    [TensorRT] VERBOSE: Generating copy for 15815 to outputs_coords because input does not support striding.
    [TensorRT] VERBOSE: Generating copy for 15817 to outputs_coords because input does not support striding.
    [TensorRT] VERBOSE: Generating copy for 15819 to outputs_coords because input does not support striding.
    [TensorRT] VERBOSE: Generating copy for 15821 to outputs_coords because input does not support striding.
    [TensorRT] VERBOSE: Generating copy for 15823 to outputs_coords because input does not support striding.
    [TensorRT] VERBOSE: After concat removal: 3085 layers
    [TensorRT] VERBOSE: Graph construction and optimization completed in 220.024 seconds.
    [TensorRT] INFO: ---------- Layers Running on DLA ----------
    [TensorRT] INFO: ---------- Layers Running on GPU ----------
    [TensorRT] INFO: [GpuLayer] node_of_1001_quantize_scale_node
    [TensorRT] INFO: [GpuLayer] node_of_inputs
    ...  ## (too many layers of output, more than 3000 layers)
    [TensorRT] INFO: [GpuLayer] node_of_14036
    [TensorRT] INFO: [GpuLayer] node_of_15819
    [TensorRT] INFO: [GpuLayer] node_of_13155
    [TensorRT] INFO: [GpuLayer] node_of_15817
    [TensorRT] INFO: [GpuLayer] node_of_12274
    [TensorRT] INFO: [GpuLayer] node_of_15815
    [TensorRT] INFO: [GpuLayer] 15813 copy
    [TensorRT] INFO: [GpuLayer] 15815 copy
    [TensorRT] INFO: [GpuLayer] 15817 copy
    [TensorRT] INFO: [GpuLayer] 15819 copy
    [TensorRT] INFO: [GpuLayer] 15821 copy
    [TensorRT] INFO: [GpuLayer] 15823 copy
    [TensorRT] VERBOSE: Using cublas a tactic source
    [TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +227, GPU +209, now: CPU 978, GPU 4493 (MiB)
    [TensorRT] VERBOSE: Using cuDNN as a tactic source
    [TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +307, GPU +306, now: CPU 1285, GPU 4799 (MiB)
    [TensorRT] WARNING: Detected invalid timing cache, setup a local cache instead
    [TensorRT] VERBOSE: Constructing optimization profile number 0 [1/1].
    [TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1285, GPU 4811 (MiB)
    [TensorRT] ERROR: 2: [shuffleBuilder.cpp::addSupportedFormats::50] Error Code 2: Internal Error (Assertion formats.nbInputs() == 1 || formats.nbInputs() == 2 failed.)
    [TensorRT] ERROR: 2: [builder.cpp::buildSerializedNetwork::417] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed.)
    

    This error seems to be an internal error reported by TensorRT. I searched for this error on Google, yet found no meaningful information. I want to know which layer causes this error so that I can substitute some ops, but I cannot understand what this error means; this is where I need help.

    Environment

    TensorRT Version: 8.0.1.6
    NVIDIA GPU: jetson
    NVIDIA Driver Version: jetpack 5.0
    CUDA Version: 10.2
    Operating System: Ubuntu 18.04.5 LTS
    Python Version (if applicable): 3.6.9
    PyTorch Version (if applicable): 1.11.0a0+17540c5
    Baremetal or Container (if so, version):

    Relevant Files

    Steps To Reproduce
