TensorRT is a C++ library for high performance inference on NVIDIA GPUs and deep learning accelerators.

License Documentation

TensorRT Open Source Software

This repository contains the Open Source Software (OSS) components of NVIDIA TensorRT. Included are the sources for TensorRT plugins and parsers (Caffe and ONNX), as well as sample applications demonstrating usage and capabilities of the TensorRT platform. These open source software components are a subset of the TensorRT General Availability (GA) release with some extensions and bug-fixes.

Build

Prerequisites

To build the TensorRT-OSS components, you will first need the following software packages.

TensorRT GA build

System Packages

Optional Packages

Downloading TensorRT Build

  1. Download TensorRT OSS

    git clone -b master https://github.com/nvidia/TensorRT TensorRT
    cd TensorRT
    git submodule update --init --recursive
  2. (Optional - if not using TensorRT container) Specify the TensorRT GA release build

    If using the TensorRT OSS build container, TensorRT libraries are preinstalled under /usr/lib/x86_64-linux-gnu and you may skip this step.

    Else download and extract the TensorRT GA build from NVIDIA Developer Zone.

    Example: Ubuntu 18.04 on x86-64 with cuda-11.4

    cd ~/Downloads
    tar -xvzf TensorRT-8.2.1.8.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz
    export TRT_LIBPATH=`pwd`/TensorRT-8.2.1.8

    Example: Windows on x86-64 with cuda-11.4

    cd ~\Downloads
    Expand-Archive .\TensorRT-8.2.1.8.Windows10.x86_64.cuda-11.4.cudnn8.2.zip
    $Env:TRT_LIBPATH = "$(Get-Location)\TensorRT-8.2.1.8"
    $Env:PATH += ";C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\MSBuild\15.0\Bin\"
  3. (Optional - for Jetson builds only) Download the JetPack SDK

    1. Download and launch the JetPack SDK manager. Login with your NVIDIA developer account.
    2. Select the platform and target OS (example: Jetson AGX Xavier, Linux Jetpack 4.6), and click Continue.
    3. Under Download & Install Options change the download folder and select Download now, Install later. Agree to the license terms and click Continue.
    4. Move the extracted files into the docker/jetpack_files folder of the TensorRT OSS checkout.

Setting Up The Build Environment

For Linux platforms, we recommend that you generate a docker container for building TensorRT OSS as described below. For native builds, on Windows for example, please install the prerequisite System Packages.

  1. Generate the TensorRT-OSS build container.

    The TensorRT-OSS build container can be generated using the supplied Dockerfiles and build script. The build container is configured for building TensorRT OSS out-of-the-box.

    Example: Ubuntu 18.04 on x86-64 with cuda-11.4.2 (default)

    ./docker/build.sh --file docker/ubuntu-18.04.Dockerfile --tag tensorrt-ubuntu18.04-cuda11.4

    Example: CentOS/RedHat 7 on x86-64 with cuda-10.2

    ./docker/build.sh --file docker/centos-7.Dockerfile --tag tensorrt-centos7-cuda10.2 --cuda 10.2

    Example: Ubuntu 18.04 cross-compile for Jetson (aarch64) with cuda-10.2 (JetPack SDK)

    ./docker/build.sh --file docker/ubuntu-cross-aarch64.Dockerfile --tag tensorrt-jetpack-cuda10.2 --cuda 10.2

    Example: Ubuntu 20.04 on aarch64 with cuda-11.4.2

    ./docker/build.sh --file docker/ubuntu-20.04-aarch64.Dockerfile --tag tensorrt-aarch64-ubuntu20.04-cuda11.4
  2. Launch the TensorRT-OSS build container.

    Example: Ubuntu 18.04 build container

    ./docker/launch.sh --tag tensorrt-ubuntu18.04-cuda11.4 --gpus all

    NOTE:

    1. Use the --tag corresponding to the build container generated in Step 1.
    2. NVIDIA Container Toolkit is required for GPU access (running TensorRT applications) inside the build container.
    3. The sudo password for Ubuntu build containers is 'nvidia'.
    4. Specify a port number with --jupyter to launch Jupyter notebooks.

Building TensorRT-OSS

  • Generate Makefiles or VS project (Windows) and build.

    Example: Linux (x86-64) build with default cuda-11.4.2

     cd $TRT_OSSPATH
     mkdir -p build && cd build
     cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out
     make -j$(nproc)

    NOTE: On CentOS7, the default g++ version does not support C++14. For native builds (not using the CentOS7 build container), first install devtoolset-8 to obtain the updated g++ toolchain as follows:

    yum -y install centos-release-scl
    yum-config-manager --enable rhel-server-rhscl-7-rpms
    yum -y install devtoolset-8
    export PATH="/opt/rh/devtoolset-8/root/bin:${PATH}"

    Example: Linux (aarch64) build with default cuda-11.4.2

     cd $TRT_OSSPATH
     mkdir -p build && cd build
     cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64-native.toolchain
     make -j$(nproc)

    Example: Native build on Jetson (aarch64) with cuda-10.2

    cd $TRT_OSSPATH
    mkdir -p build && cd build
    cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DTRT_PLATFORM_ID=aarch64 -DCUDA_VERSION=10.2
    CC=/usr/bin/gcc make -j$(nproc)

    NOTE: C compiler must be explicitly specified via CC= for native aarch64 builds of protobuf.

    Example: Ubuntu 18.04 Cross-Compile for Jetson (aarch64) with cuda-10.2 (JetPack)

     cd $TRT_OSSPATH
     mkdir -p build && cd build
     cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64.toolchain -DCUDA_VERSION=10.2 -DCUDNN_LIB=/pdk_files/cudnn/usr/lib/aarch64-linux-gnu/libcudnn.so -DCUBLAS_LIB=/usr/local/cuda-10.2/targets/aarch64-linux/lib/stubs/libcublas.so -DCUBLASLT_LIB=/usr/local/cuda-10.2/targets/aarch64-linux/lib/stubs/libcublasLt.so
     make -j$(nproc)

    NOTE: The latest JetPack SDK v4.6 only supports TensorRT 8.0.1.

    Example: Windows (x86-64) build in Powershell

     cd $Env:TRT_OSSPATH
     mkdir -p build ; cd build
     cmake .. -DTRT_LIB_DIR=$Env:TRT_LIBPATH -DTRT_OUT_DIR='$(Get-Location)\out' -DCMAKE_TOOLCHAIN_FILE=..\cmake\toolchains\cmake_x64_win.toolchain
     msbuild ALL_BUILD.vcxproj

    NOTE:

    1. The default CUDA version used by CMake is 11.4.2. To override this, for example to 10.2, append -DCUDA_VERSION=10.2 to the cmake command.
    2. If samples fail to link on CentOS7, create this symbolic link: ln -s $TRT_OUT_DIR/libnvinfer_plugin.so $TRT_OUT_DIR/libnvinfer_plugin.so.8
  • Required CMake build arguments are:

    • TRT_LIB_DIR: Path to the TensorRT installation directory containing libraries.
    • TRT_OUT_DIR: Output directory where generated build artifacts will be copied.
  • Optional CMake build arguments:

    • CMAKE_BUILD_TYPE: Specify whether the generated binaries are release or debug (contain debug symbols). Values are [Release] | Debug
    • CUDA_VERSION: The version of CUDA to target, for example [11.4.2].
    • CUDNN_VERSION: The version of cuDNN to target, for example [8.2].
    • PROTOBUF_VERSION: The version of Protobuf to use, for example [3.0.0]. Note: Changing this will not configure CMake to use a system version of Protobuf, it will configure CMake to download and try building that version.
    • CMAKE_TOOLCHAIN_FILE: The path to a toolchain file for cross compilation.
    • BUILD_PARSERS: Specify if the parsers should be built, for example [ON] | OFF. If turned OFF, CMake will try to find precompiled versions of the parser libraries to use in compiling samples. First in ${TRT_LIB_DIR}, then on the system. If the build type is Debug, then it will prefer debug builds of the libraries before release versions if available.
    • BUILD_PLUGINS: Specify if the plugins should be built, for example [ON] | OFF. If turned OFF, CMake will try to find a precompiled version of the plugin library to use in compiling samples. First in ${TRT_LIB_DIR}, then on the system. If the build type is Debug, then it will prefer debug builds of the libraries before release versions if available.
    • BUILD_SAMPLES: Specify if the samples should be built, for example [ON] | OFF.
    • GPU_ARCHS: GPU (SM) architectures to target. By default we generate CUDA code for all major SMs. Specific SM versions can be specified here as a quoted, space-separated list to reduce compilation time and binary size. A table of the compute capabilities of NVIDIA GPUs can be found here. Examples:
      • NVIDIA A100: -DGPU_ARCHS="80"
      • Tesla T4, GeForce RTX 2080: -DGPU_ARCHS="75"
      • Titan V, Tesla V100: -DGPU_ARCHS="70"
      • Multiple SMs: -DGPU_ARCHS="80 75"
    • TRT_PLATFORM_ID: Bare-metal builds (unlike containerized cross-compilation) on non-Linux/x86 platforms must explicitly specify the target platform. Currently supported options: x86_64 (default), aarch64

References

TensorRT Resources

Known Issues

Comments
  • How to use NMS with Pytorch model (that was converted to ONNX -> TensorRT)

    All right, so, I have a PyTorch SSD detector with MobileNet. Since I failed to convert the model with NMS in it (to be more precise, I converted it, but the TRT engine built from that .onnx file is wrong), I decided to leave the NMS part to TRT.

    In general, there are several ways to add NMS in TRT:

    1. Use graphsurgeon with a TensorFlow model and add NMS as graphsurgeon.create_plugin_node
    2. Use the C++ plugin code (https://github.com/NVIDIA/TensorRT/tree/master/plugin/batchedNMSPlugin)
    3. Use DeepStream, which has an NMS plugin

    But I have a PyTorch model that I converted to ONNX and then to TRT without any C++ code (Python only). My question is very simple: how can I combine my current pipeline with the C++ NMS plugin?
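
    For reference, a minimal sketch of the Python-only route, assuming the prebuilt libnvinfer_plugin library that ships BatchedNMS_TRT: register the built-in plugins before parsing so the ONNX parser can resolve a BatchedNMS_TRT node inserted with onnx-graphsurgeon (the file name below is hypothetical).

    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.INFO)

    # Register TensorRT's built-in plugins (including BatchedNMS_TRT) so the
    # ONNX parser can map the inserted node onto the C++ plugin.
    trt.init_libnvinfer_plugins(TRT_LOGGER, "")

    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open("model_with_nms.onnx", "rb") as f:  # hypothetical file name
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))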

  • [08/18/2019-20:58:10] [E] [TRT] UffParser: Validator error: mrcnn_mask_deconv/add_1: Unsupported operation _AddV2 &&&& FAILED TensorRT.sample_maskrcnn # ./sample_uff_maskRCNN -d

    [08/18/2019-20:58:10] [E] [TRT] UffParser: Validator error: mrcnn_mask_deconv/add_1: Unsupported operation _AddV2
    &&&& FAILED TensorRT.sample_maskrcnn # ./sample_uff_maskRCNN -d ~/data
    

    Ubuntu 16.04, TensorRT 6.x (built from source from the git branch release/6.0). Following the tutorial, the Matterport Mask R-CNN model converts to UFF successfully, but inference fails with the result above.

  • TensorRT 7 fails to load the ONNX Resize op

    Description

    When I load the ONNX model, the FPN F.interpolate op fails with this error:


    While parsing node number 209 [Resize]: ERROR: builtin_op_importers.cpp:2412 In function importResize: [8] Assertion failed: scales.is_weights() && "Resize scales must be an initializer!"


    This error occurs in onnx-tensorrt.

    Environment

    TensorRT Version: 7.0
    GPU Type: 1060
    Nvidia Driver Version: 441.22
    CUDA Version: 10.2
    CUDNN Version: 7.6.5.32
    Operating System + Version: win10

  • how to create an engine serve for multiple source inputs?

    How can I create one engine (e.g. one TensorRT detector engine) that can serve 6 or 10 cameras for object detection using threading, without the outputs from these sources getting mixed up?
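
    For reference, the usual pattern here is one shared engine with one execution context (and one CUDA stream) per camera; a minimal sketch assuming the TensorRT Python API, leaving out buffer allocation (the engine file name and the frames_for helper are hypothetical).

    import threading
    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

    with open("detector.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())  # shared by all threads

    def camera_worker(cam_id):
        # A separate execution context per camera keeps per-inference state
        # isolated, so outputs never get mixed up between sources.
        context = engine.create_execution_context()
        for bindings in frames_for(cam_id):  # hypothetical: device buffer pointers per frame
            context.execute_v2(bindings)

    threads = [threading.Thread(target=camera_worker, args=(i,)) for i in range(6)]
    for t in threads:
        t.start()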

  • How to add NMS with Tensorflow Model (that was converted to ONNX)

    I have taken an SSDLite MobileNet V2 model from the TensorFlow model zoo.

    Steps:

    1. Generated the ONNX model using the tf2onnx lib:

       python -m tf2onnx.convert --graphdef mv2/ssdlite_mobilenet_v2_coco_2018_05_09/frozen_inference_graph.pb \
           --output MODEL_frozen.onnx \
           --fold_const --opset 11 \
           --inputs image_tensor:0 \
           --outputs num_detections:0,detection_boxes:0,detection_scores:0,detection_classes:0

    2. Added the NMS layers to the ONNX model based on references from this issue:

    import onnx_graphsurgeon as gs
    import onnx
    import numpy as np
    
    input_model_path = "MODEL_frozen.onnx"
    output_model_path = "model_gs.onnx"
    
    @gs.Graph.register()
    def trt_batched_nms(self, boxes_input, scores_input, nms_output,
                        share_location, num_classes):
    
        boxes_input.outputs.clear()
        scores_input.outputs.clear()
        nms_output.inputs.clear()
    
        attrs = {
            "shareLocation": share_location,
            "numClasses": num_classes,
            "backgroundLabelId": 0,
            "topK": 116740,
            "keepTopK": 100,
            "scoreThreshold": 0.3,
            "iouThreshold": 0.6,
            "isNormalized": True,
            "clipBoxes": True
        }
        return self.layer(op="BatchedNMS_TRT", attrs=attrs,
                          inputs=[boxes_input, scores_input],
                          outputs=[nms_output])
    
    
    graph = gs.import_onnx(onnx.load(input_model_path))
    graph.inputs[0].shape=[1,300,300,3]
    print(graph.inputs[0].shape)
    
    for inp in graph.inputs:
        inp.dtype = np.int
    
    input = graph.inputs[0]
    
    tmap = graph.tensors()
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores/NonMaxSuppressionV5__1761:0"],
                          tmap["NonMaxSuppression__1763:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_1/NonMaxSuppressionV5__1737:0"],
                          tmap["NonMaxSuppression__1739:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1713:0"],
                          tmap["NonMaxSuppression__1715:0"],
                          share_location=False,
                          num_classes=8)
    
    graph.trt_batched_nms(tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0"],
                          tmap["Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_3/NonMaxSuppressionV5__1689:0"],
                          tmap["NonMaxSuppression__1691:0"],
                          share_location=False,
                          num_classes=8)
    
    
    # Remove unused nodes, and topologically sort the graph.
    # graph.cleanup()
    # graph.toposort()
    # graph.fold_constants().cleanup()
    
    # Export the ONNX graph from graphsurgeon
    onnx.checker.check_model(gs.export_onnx(graph))
    onnx.save_model(gs.export_onnx(graph), output_model_path)
    
    print("Saving the ONNX model to {}".format(output_model_path))
    
    

    I am not able to figure out which nodes in the ONNX graph I should use in place of "Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_2/NonMaxSuppressionV5__1712:0" and the other tensors passed to the trt_batched_nms calls above.

    MODEL_frozen.onnx.zip

    I have also attached the ONNX file. Any suggestions on how to find them?
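
    One way to locate the right tensor names (a sketch using the same onnx-graphsurgeon API as above) is to print the inputs and outputs of every NonMaxSuppression node and plug those names into tmap[...]:

    for node in graph.nodes:
        if "NonMaxSuppression" in node.op:
            print(node.name)
            print("  inputs :", [t.name for t in node.inputs])
            print("  outputs:", [t.name for t in node.outputs])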

  • BERT fp16 accuracy problem

    Description

    When using TRT to build an FP16 model, the inference accuracy differs too much from FP32. The model is BERT base. Why?

    Environment

    TensorRT Version: 7.2.1
    NVIDIA GPU: T4
    NVIDIA Driver Version: 440.59
    CUDA Version: 10.2
    CUDNN Version: 8.0.4
    Operating System: centos7
    Python Version (if applicable): 3.6
    Tensorflow Version (if applicable): 1.15.4
    PyTorch Version (if applicable):
    Baremetal or Container (if so, version):

    Steps To Reproduce

    Proceed as follows:

    1. tf (freeze mode) -> onnx (version: 1.8.1) -> trt engine
    2. When building with TRT, set these parameters:

       with builder.create_builder_config() as config:
           config.flags = config.flags | 1 << int(trt.BuilderFlag.FP16)
           ...

    3. At the same time, I also tried to set the precision on these layers (such as LayerNorm/moments/SquaredDifference, intermediate/dense/Erf, pooler/dense/Tanh, query_head_contrastive/Relu and so on):

       network.get_layer(i).precision = trt.DataType.FLOAT

       BUT it had no effect.

    I also found something very strange: when I compared layers 0 and 1, the accuracy difference was small, but at layer 2 there is a big difference. This model has 12 layers, and each layer has the same structure.
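
    For reference, a minimal sketch of the per-layer override described in step 3, assuming TensorRT 7.x, where STRICT_TYPES must be set for layer precisions to be honored (the keyword list is illustrative):

    import tensorrt as trt

    def force_fp32_layers(network, config, keywords=("LayerNorm", "Erf", "Tanh")):
        """Run the network in FP16 but keep the named layers in FP32."""
        config.set_flag(trt.BuilderFlag.FP16)
        # Without strict types, per-layer precisions are only a hint.
        config.set_flag(trt.BuilderFlag.STRICT_TYPES)
        for i in range(network.num_layers):
            layer = network.get_layer(i)
            if any(k in layer.name for k in keywords):
                layer.precision = trt.DataType.FLOAT
                for j in range(layer.num_outputs):
                    layer.set_output_type(j, trt.DataType.FLOAT)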

  • Onnx Dynamic input to TensorRT

    [TensorRT] INTERNAL ERROR: Assertion failed: aMatrix.second == bMatrix.first ../rtExt/cuda/cudaMatrixMultiplyRunner.cpp:35
    Aborting...
    [TensorRT] ERROR: ../rtExt/cuda/cudaMatrixMultiplyRunner.cpp (35) - Assertion Error in assertDimsOkayForMatrixMultiplyLayer: 0 (aMatrix.second == bMatrix.first)

  • ONNX networks can't use INT8 calibration and batching

    Description

    This is due to mutually incompatible changes in the TRT7 release:

    https://docs.nvidia.com/deeplearning/sdk/tensorrt-release-notes/tensorrt-7.html

    ONNX parser with dynamic shapes support: The ONNX parser supports full-dimensions mode only. Your network definition must be created with the explicitBatch flag set.

    versus

    Known Issues: The INT8 calibration does not work with dynamic shapes. To work around this issue, ensure there are two passes in the code: using a fixed shape input to build the engine in the first pass allows TensorRT to generate the calibration cache.

    This means the ONNX network must be exported at a fixed batch size in order to get INT8 calibration working, but now it's no longer possible to specify the batch size. I also verified that manually fixing up the inputs with setDimensions(...-1...) does not work, you will hit an assertion mg.nodes[mg.regionIndices[outputRegion]].size ==mg.nodes[mg.regionIndices[inputRegion]].size while building.

    One would think there might be sort of a workaround by exporting two different networks, one with a fixed batch size and a second one with a dynamic_axis, and then using the calibration from one for the other. ~~However, even here there are severe pitfalls: a calibration cache that is generated for, say, batch_size=1 won't necessarily work for larger batch sizes, presumably because they will generate a different convolution strategy that causes different accuracy issues.~~ Edit: This might've been another issue.

    Lastly, the calibrator itself appears to be using implicit batch sizes, and breaks on batch size > 1 as follows:

    TRT: Starting Calibration with batch size 16. Calibrated 16 images.
    TRT: Explicit batch network detected and batch size specified, use execute without batch size instead.
    TRT: C:\source\builder\cudnnCalibrator.cpp (707) - Cuda Error in nvinfer1::builder::Histogram::add: 700 (an illegal memory access was encountered)
    TRT: FAILED_ALLOCATION: Unknown exception
    TRT: C:\source\builder\cudnnCalibrator.cpp (703) - Cuda Error in nvinfer1::builder::Histogram::add: 700 (an illegal memory access was encountered)
    TRT: FAILED_ALLOCATION: Unknown exception
    TRT: C:\source\rtSafe\cuda\caskConvolutionRunner.cpp (233) - Cuda Error in nvinfer1::rt::task::CaskConvolutionRunner::allocateContextResources: 700 (an illegal memory access was encountered)
    TRT: FAILED_EXECUTION: Unknown exception
    TRT: Calibrated batch 0 in 2.62865 seconds.
    Cuda failure: 700

    with batch_size == 1, it's also hitting assertions:

    TRT: Explicit batch network detected and batch size specified, use execute without batch size instead.
    TRT: Assertion failed: d.nbDims >= 1 C:\source\rtSafe\safeHelpers.cpp:419
    Aborting...

    The combination of all these failures means that you can't really use ONNX networks in INT8 mode, at least the "Using a fixed shape input to build the engine in the first pass" recommendation hits all kinds of internal assertions as you can see above.
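
    For what it's worth, a minimal sketch of the second pass of the quoted workaround, assuming the fixed-shape build already wrote calibration.cache: a cache-only calibrator lets the dynamic-shape build reuse it without feeding any new batches.

    import tensorrt as trt

    class CachedCalibrator(trt.IInt8EntropyCalibrator2):
        """Calibrator that only replays an existing calibration cache."""
        def __init__(self, cache_path):
            super().__init__()
            self.cache_path = cache_path
        def get_batch_size(self):
            return 1
        def get_batch(self, names):
            return None  # no new data: rely entirely on the cache
        def read_calibration_cache(self):
            with open(self.cache_path, "rb") as f:
                return f.read()
        def write_calibration_cache(self, cache):
            pass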

    Environment

    TensorRT Version: 7.0.0.11
    GPU Type: RTX 2080
    Nvidia Driver Version: 441.22
    CUDA Version: 10.2
    CUDNN Version: 7.6.0.5
    Operating System + Version: Windows 10
    Python Version (if applicable): 3.6
    TensorFlow Version (if applicable):
    PyTorch Version (if applicable): 1.3 stable
    Baremetal or Container (if container which image + tag): bare

  • TRT sampleUffMaskRCNN gives different results from Mask R-CNN implemented in Keras

    Hi, I have used sampleUffMaskRCNN on my own dataset and it worked, but the results are different. The result of the TRT sampleUffMaskRCNN depends a lot on the anchor scales and anchor ratios; I set the same params in both test codes. The Keras one performs better, as it can show more objects (instances), while some objects can't be detected by the TRT Mask R-CNN, especially slender objects like a pole. Thanks for the help.

  • [REFERENCE] KeyError: 'mrcnn_mask_bn4/batchnorm/mul_1' in running sampleUffMaskRCNN demo

    This happens while I try to run the Mask R-CNN demo following this page.

    Ubuntu 16.04.6
    CUDA 10.1.168
    TensorRT 5.1.5.0
    uff 0.6.3

    Traceback (most recent call last):
      File "mrcnn_to_trt_single.py", line 165, in <module>
        main()
      File "mrcnn_to_trt_single.py", line 123, in main
        text=True, list_nodes=list_nodes)
      File "mrcnn_to_trt_single.py", line 158, in convert_model
        debug_mode = False
      File "/usr/lib/python3.5/dist-packages/uff/converters/tensorflow/conversion_helpers.py", line 233, in from_tensorflow_frozen_model
        return from_tensorflow(graphdef, output_nodes, preprocessor, **kwargs)
      File "/usr/lib/python3.5/dist-packages/uff/converters/tensorflow/conversion_helpers.py", line 108, in from_tensorflow
        pre.preprocess(dynamic_graph)
      File "./config.py", line 123, in preprocess
        connect(dynamic_graph, timedistributed_connect_pairs)
      File "./config.py", line 113, in connect
        if node_a_name not in dynamic_graph.node_map[node_b_name].input:
    KeyError: 'mrcnn_mask_bn4/batchnorm/mul_1'
    
  • (Upsample) How can I use onnx parser with opset 11 ?

    Description

    onnx-parser is basically built with ir_version 3, opset 7 (https://github.com/onnx/onnx-tensorrt/blob/master/onnx_trt_backend.cpp)

    Is there any way to use the ONNX parser with opset 11 support?

    I mean, the parser works only with opset 7. It works well if I use an ir4/opset 7 ONNX model, but doesn't work if I use an ir4/opset 11 model.

    It also cannot parse opset 8 and 9.

    My ONNX models are exported from PyTorch 1.4.0a.

    Can I rebuild the parser by changing only the BACKEND_OPSET constant inside onnx_trt_backend.cpp?

    Environment

    TensorRT Version: 7.0.0
    GPU Type: T4
    Nvidia Driver Version: 440.33.01
    CUDA Version: 10.2.89
    CUDNN Version: 7.6.5
    Operating System + Version: Ubuntu18.04
    Python Version (if applicable): 3.6.9
    TensorFlow Version (if applicable): 1.4.0
    PyTorch Version (if applicable): 1.4.0a

  • When the same model is loaded with different weights, different performance is obtained using trtexec

    command:

    trtexec --onnx=model.onnx --shapes=input:1x3x299x299

    Issue:

    I got different 'Throughput' numbers when the same model is loaded with different weights, and the difference is not negligible. For example, for Inception V3: a) random initialization - Throughput: 220.133 qps; b) pre-trained weights - Throughput: 208.012 qps.

    I'm curious as to why this is. Is this a normal phenomenon?

  • Faster way to handle duplicated OptimizationProfiles?

    Hello!

    I am working on dynamic shape model with multiple contexts which run concurrently. The dynamic range (min/opt/max) is the same for all contexts, so I added the same OptimizationProfile N times to the BuilderConfig.

     profile = builder.create_optimization_profile()
     profile.set_shape(name, shape_min, shape_opt, shape_max)
     
     config = builder.create_builder_config()
     for _ in range(num_profiles):
         config.add_optimization_profile(profile)
    

    I found that the profiles are built sequentially; is there a faster way to do this? It would be great if I could just duplicate the profile in the serialized model.

    I am using TensorRT 8.2.1.8 now.

    Thank you in advance!
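
    For reference, a minimal sketch of how the N duplicated profiles are consumed at runtime, assuming the engine was built from the config above with one dynamic input per profile (runtime_shape is illustrative):

    contexts = []
    bindings_per_profile = engine.num_bindings // num_profiles
    for i in range(num_profiles):
        ctx = engine.create_execution_context()
        ctx.active_optimization_profile = i  # each concurrent context needs its own profile
        # Bindings are duplicated per profile; profile i's bindings start here.
        first_binding = i * bindings_per_profile
        ctx.set_binding_shape(first_binding, runtime_shape)  # runtime_shape is illustrative
        contexts.append(ctx)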

  • how to convert a static quantized onnx model to tensorrt int8 engine?

    I quantized an ONNX model with the ONNX quantization tool and evaluated the result on onnxruntime. After that, I converted the ONNX model to an engine file with "trtexec.exe --onnx=test.onnx --saveEngine=test.trt --explicitBatch=1 --int8". Then I found that the output of the engine file is wrong.

  • FAILED TensorRT.trtexec [TensorRT v8001]

    RUNNING TensorRT.trtexec [TensorRT v8001] # trtexec --verbose --noDataTransfers --useCudaGraph --separateProfileRun --useSpinWait --nvtxMode=verbose --loadEngine=yolov5n_ptq_int8/yolov5n_ptq_detect_dynamic.onnx.engine --exportTimes=yolov5n_ptq_int8/yolov5n_ptq_detect_dynamic.onnx.engine.timing.json --exportProfile=yolov5n_ptq_int8/yolov5n_ptq_detect_dynamic.onnx.engine.profile.json --exportLayerInfo=yolov5n_ptq_int8/yolov5n_ptq_detect_dynamic.onnx.engine.graph.json --timingCacheFile=./timing.cache --int8
    [E] Unknown option: --exportLayerInfo yolov5n_ptq_int8/yolov5n_ptq_detect_dynamic.onnx.engine.graph.json

    But it reported that the engine was built successfully; it failed to profile the engine.

  • How to use polygraphy to limit the input dtype of the converted model?

    If I want to use TensorRT/tools/Polygraphy/polygraphy/tools/convert/ to convert an ONNX FP32 model to TensorRT FP16, how can I restrict its input dtype to FP16 as well?
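
    One option (a sketch, not necessarily the intended Polygraphy route): --fp16 only sets the TensorRT builder flag, while the engine's input dtype follows the ONNX model, so the model itself can be rewritten to half precision first. This assumes the onnxconverter-common package and its convert_float_to_float16 helper; file names are hypothetical.

    import onnx
    from onnxconverter_common import float16  # assumed package/API

    model = onnx.load("model_fp32.onnx")                  # hypothetical file names
    model_fp16 = float16.convert_float_to_float16(model)  # casts tensors, initializers and graph I/O
    onnx.save(model_fp16, "model_fp16.onnx")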

  • demodiffusion - fMHA_0: could not find any supported formats consistent with input/output data types

    demodiffusion engine creation fails

    I am trying to reproduce the same steps as demoDiffusion on the same container mentioned in the repo.

    Environment

    TensorRT Version: 8.5.1.7
    NVIDIA GPU: 2080Ti
    NVIDIA Driver Version:
    CUDA Version:
    CUDNN Version:
    Operating System: Ubuntu 20.04
    Python Version (if applicable):
    Tensorflow Version (if applicable):
    PyTorch Version (if applicable):
    Baremetal or Container (if so, version):

    (screenshot of the error attached)

