Transformer-related optimization, including BERT and GPT

FasterTransformer

This repository provides a script and recipe to run the highly optimized transformer-based encoder and decoder components; it is tested and maintained by NVIDIA.

Model overview

In NLP, the encoder and decoder are two important components, and the transformer layer has become a popular architecture for both. FasterTransformer implements a highly optimized transformer layer for both the encoder and decoder for inference. On Volta, Turing and Ampere GPUs, the computing power of Tensor Cores is used automatically when the precision of the data and weights is FP16.

FasterTransformer v1.0 provides a highly optimized BERT-equivalent transformer layer for inference, including a C++ API, a TensorFlow op and a TensorRT plugin. The experiments show that FasterTransformer v1 can provide a 1.3x ~ 2x speedup for inference on NVIDIA Tesla T4 and NVIDIA Tesla V100.

In FasterTransformer v2.0, we added a highly optimized decoder and decoding models based on OpenNMT-tf, an open-source library. Here, the decoder is the component that contains the stack of transformer layers, while decoding refers to the whole translation process, including the embedding lookup, position encoding, the decoder and beam search.
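
To make the distinction concrete, here is a hypothetical Python sketch (not the FasterTransformer API; all helper names are placeholders) of what the decoding process wraps around the decoder:

def decoding(memory, start_ids, embedding_table, position_encoding, decoder, beam_search, max_step):
    # Hypothetical helpers: embedding_table, position_encoding, decoder and beam_search
    # stand in for the corresponding FasterTransformer components.
    output_ids = start_ids                                # [batch * beam_width, 1]
    for step in range(max_step):
        hidden = embedding_table[output_ids[:, -1]] + position_encoding[step]
        hidden = decoder(hidden, memory)                  # the stack of transformer decoder layers
        output_ids = beam_search(hidden, output_ids)      # choose next tokens and reorder beams
    return output_ids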

In FasterTransformer v2.1, we added several important features. First, we added PyTorch support, so that the growing number of PyTorch users can use FasterTransformer in their applications and research. Second, we added support for Effective Transformer, an idea proposed by ByteDance; we call this feature Effective FasterTransformer. It removes the useless padding of the encoder input to reduce the computational cost. Third, in addition to decoding with beam search, we also provide a decoding-with-sampling module. Finally, we optimized many kernels of the encoder, decoder and beam search to improve the speed of FasterTransformer.
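
As a rough illustration of the padding-removal idea behind Effective FasterTransformer (a minimal PyTorch sketch, not the actual kernels), the valid tokens of every sequence are packed into a dense tensor so that the following GEMMs skip the padded positions:

import torch

def remove_padding(hidden_states, seq_lens):
    # hidden_states: [batch, max_seq_len, hidden]; seq_lens: [batch]
    max_seq_len = hidden_states.size(1)
    valid = torch.arange(max_seq_len, device=hidden_states.device)[None, :] < seq_lens[:, None]
    # Keep only the valid tokens: [sum(seq_lens), hidden]
    return hidden_states[valid]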

In FasterTransformer v3.0, we implemented INT8 quantization for the encoder (also supporting Effective FasterTransformer). With INT8 quantization, we can take advantage of the powerful INT8 Tensor Cores in Turing GPUs to achieve better inference performance (INT8 quantization in FT 3.0 is only supported on devices with SM >= 7.5). We also provide a quantization tool for TensorFlow.

In FasterTransformer v3.1, we provide the following new features and enhancements. First, we optimized the INT8 kernels of the encoder to achieve better performance; compared to FasterTransformer v3.0, the INT8 quantization brings at most a 1.75x speedup. Second, we provide a PyTorch tool that lets users train an INT8-quantized model in PyTorch, and FasterTransformer now also supports INT8 inference through a PyTorch op, so PyTorch users can leverage INT8 inference. Third, we integrated the fused multi-head attention kernel of the TensorRT plugin into FasterTransformer to improve the encoder speed on Turing and newer GPUs; this optimization brings about a 10% ~ 20% speedup compared to the original implementation. Finally, we added support for the GPT-2 model, an important and popular decoder model.

In FasterTransformer v4.0, we provide multi-node, multi-GPU inference for the GPT model. Compared to frameworks used to train giant models, such as Megatron, FasterTransformer provides a 1.2x ~ 3x speedup. Besides, we integrated the INT8 fused multi-head attention kernel of the TensorRT plugin to further improve the INT8 performance of the FasterTransformer encoder, and we added an FP16 fused multi-head attention kernel for V100. Finally, we optimized the decoding module; compared to v3.1, v4.0 provides at most a 2x speedup.

The following figure demonstrates the model architecture.

Fig. 1 Encoder-Decoding model architecture.

FasterTransformer is built on top of CUDA, cuBLAS and cuBLASLt, providing a C++ API and TensorFlow/PyTorch OPs. Users can integrate them into TensorFlow, PyTorch, or other inference services built in native C++. We also provide simple sample code demonstrating how to use the encoder and decoder, and how to carry out decoding, in C++, TensorFlow and PyTorch.
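
For example, a TensorFlow user can load the compiled OP library from Python roughly as follows (the .so name and path are placeholders; they depend on your build directory):

import tensorflow as tf

# Hypothetical path; use the library produced by your own build.
ft_module = tf.load_op_library("./lib/libtf_fastertransformer.so")
# The encoder/decoder/decoding ops registered by the library are then
# available as attributes of ft_module.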

More details are in docs/encoder_guide.md, docs/decoder_guide.md and docs/gpt_guide.md. Some common questions and their answers can be found in docs/QAList.md.

Support matrix

The following matrix shows the differences between the FasterTransformer versions.

Architecture | Encoder | Encoder INT8 quantization | Decoder | Decoding with beam search | Decoding with sampling | GPT-2 | GPT-3
v1           | Yes     | No                        | No      | No                        | No                     | No    | No
v2           | Yes     | No                        | Yes     | Yes                       | No                     | No    | No
v2.1         | Yes     | No                        | Yes     | Yes                       | Yes                    | No    | No
v3.0         | Yes     | Yes                       | Yes     | Yes                       | Yes                    | No    | No
v3.1         | Yes     | Yes                       | Yes     | Yes                       | Yes                    | Yes   | No
v4.0         | Yes     | Yes                       | Yes     | Yes                       | Yes                    | Yes   | Yes

Setup

The following section lists the requirements to use FasterTransformer.

Requirements

  • CMake >= 3.8 for TensorFlow, CMake >= 3.13 for PyTorch
  • CUDA 10.1 or newer version
  • Python 3 is recommended because some features are not supported in Python 2
  • TensorFlow 1.13, 1.14 or 1.15
  • PyTorch >= 1.5.0

These components are readily available within the NGC TensorFlow/PyTorch Docker images listed below.

For more information about how to get started with NGC containers, see the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation.

For those unable to use the NGC container, to set up the required environment or create your own container, see the versioned NVIDIA Container Support Matrix.

Quick Start Guide

The following section shows how to use FasterTransformer on the NGC container.

Build the FasterTransformer

  1. Run the container.

    You can choose the TensorFlow version and Python version you want. Here, we list some possible images:

    • nvcr.io/nvidia/tensorflow:19.07-py2 contains TensorFlow 1.14 and Python 2.7.
    • nvcr.io/nvidia/tensorflow:20.12-tf1-py3 contains TensorFlow 1.15 and Python 3.8.
    • nvcr.io/nvidia/pytorch:20.03-py3 contains PyTorch 1.5.0 and Python 3.6.
    • nvcr.io/nvidia/pytorch:20.07-py3 contains PyTorch 1.6.0 and Python 3.6.
    • nvcr.io/nvidia/pytorch:20.12-py3 contains PyTorch 1.8.0 and Python 3.8.

    To achieve the best performance, we recommend using the latest image. For example, run the image nvcr.io/nvidia/tensorflow:20.12-tf1-py3 with:

    nvidia-docker run -ti --rm nvcr.io/nvidia/tensorflow:20.12-tf1-py3 bash
  2. Clone the repository.

    git clone https://github.com/NVIDIA/FasterTransformer.git
    cd FasterTransformer
    mkdir -p build
    cd build
  3. Build the project.

    3.1 build with C++

    cmake -DSM=xx -DCMAKE_BUILD_TYPE=Release ..
    make

    Note: xx is the compute capability of your GPU. For example, 60 (P100), 61 (P4 or P40), 70 (V100), 75 (T4) or 80 (A100).

    3.2 build with TensorFlow

    Users need to set the path of TensorFlow. For example, if we use nvcr.io/nvidia/tensorflow:20.12-tf1-py3, then

    cmake -DSM=xx -DCMAKE_BUILD_TYPE=Release -DBUILD_TF=ON -DTF_PATH=/usr/local/lib/python3.8/dist-packages/tensorflow_core/ ..
    make 

    Note: xx is the compute capability of your GPU. For example, 60 (P100), 61 (P4 or P40), 70 (V100), 75 (T4) or 80 (A100).

    3.3 build with PyTorch

    cmake -DSM=xx -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON ..
    make

    Note: xx is the compute capability of your GPU. For example, 60 (P100), 61 (P4 or P40), 70 (V100), 75 (T4) or 80 (A100).

    This will build the TorchScript custom class. Please make sure that PyTorch >= 1.5.0.

    Note: From FasterTransformer 3.1, TorchScript custom op (function type) is deprecated. From FasterTransformer 4.0, Eager mode PyTorch extension is deprecated.
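
    As a minimal sketch of how the resulting TorchScript custom class library can be loaded from Python (the library name below is a placeholder that depends on your build directory):

    import torch

    # Hypothetical path; use the library produced by your own build.
    torch.classes.load_library("./lib/libpyt_fastertransformer.so")
    # The custom classes registered by the library are then exposed
    # under the torch.classes namespace.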

Execute the encoder demos

  1. Run FasterTransformer encoder on C++

    ./bin/encoder_gemm <batch_size> <sequence_length> <head_number> <size_per_head> <is_use_fp16> <int8_mode>
    ./bin/encoder_sample <batch_size> <num_layers> <sequence_length> <head_number> <size_per_head> <is_use_fp16> <is_remove_padding> <int8_mode> <allow_gemm_test>

    1.1 Run FasterTransformer encoder under FP32 on C++

    ./bin/encoder_gemm 32 32 12 64 0 0
    ./bin/encoder_sample 32 12 32 12 64 0 0 0 0

    1.2 Run FasterTransformer encoder under FP16 on C++

    ./bin/encoder_gemm 32 32 12 64 1 0
    ./bin/encoder_sample 32 12 32 12 64 1 0 0 0

    1.3 Run FasterTransformer encoder under INT8 on C++

    We implement two INT8 pipelines. For int8_mode == 1 (int8v1), we do not quantize the residual connections, use INT32 as the output of the INT8 GEMMs, and use per-channel quantization for the weights; for int8_mode == 2 (int8v2), we quantize the residual connections, use INT8 as the output of the INT8 GEMMs, and use per-tensor quantization for the weights. Generally speaking, int8_mode == 1 has higher accuracy while int8_mode == 2 has better performance.

    feature                              | int8_mode == 1 | int8_mode == 2
    quantize residual                    | No             | Yes
    int8 output gemm                     | No             | Yes
    per-channel quantization for weights | Yes            | No

    #For int8_mode == 1
    ./bin/encoder_gemm 32 32 12 64 1 1
    ./bin/encoder_sample 32 12 32 12 64 1 0 1 0
    
    #For int8_mode == 2
    ./bin/encoder_gemm 32 32 12 64 1 2
    ./bin/encoder_sample 32 12 32 12 64 1 0 2 0
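
    To make the per-channel vs. per-tensor distinction concrete, here is a minimal PyTorch sketch of the two weight-quantization schemes (an illustration only, not the FasterTransformer kernels):

    import torch

    def quantize_per_tensor(weight):
        # int8_mode == 2 style: a single scale for the whole weight tensor
        scale = weight.abs().max() / 127.0
        return torch.clamp((weight / scale).round(), -127, 127).to(torch.int8), scale

    def quantize_per_channel(weight):
        # int8_mode == 1 style: one scale per output channel (weight: [out_dim, in_dim]),
        # which is usually more accurate
        scale = weight.abs().amax(dim=1, keepdim=True) / 127.0
        return torch.clamp((weight / scale).round(), -127, 127).to(torch.int8), scale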

    1.4 Run Effective FasterTransformer under FP32 on C++

    ./bin/encoder_gemm 32 32 12 64 0 0
    ./bin/encoder_sample 32 12 32 12 64 0 1 0 0

    1.5 Run Effective FasterTransformer under INT8 on C++

    #For int8_mode == 1
    ./bin/encoder_gemm 32 32 12 64 1 1
    ./bin/encoder_sample 32 12 32 12 64 1 1 1 0
    
    #For int8_mode == 2
    ./bin/encoder_gemm 32 32 12 64 1 2
    ./bin/encoder_sample 32 12 32 12 64 1 1 2 0
    
  2. Run FasterTransformer encoder on TensorFlow

    2.1 Run FasterTransformer encoder under FP32 on TensorFlow

    ./bin/encoder_gemm 32 32 12 64 0 0
    python tensorflow/encoder_sample.py \
            --batch_size 32 \
            --max_seq_len 32 \
            --head_number 12 \
            --size_per_head 64 \
            --num_layer 12 \
            --data_type fp32 \
            --test_time 1 \
            --allow_gemm_test False

    If the user sets --test_time 1, the program will show the performance of TensorFlow, FasterTransformer and FasterTransformer with padding removed (Effective FasterTransformer).

    2.2 Run FasterTransformer encoder under FP16 on TensorFlow

    ./bin/encoder_gemm 32 32 12 64 1 0
    python tensorflow/encoder_sample.py \
            --batch_size 32 \
            --max_seq_len 32 \
            --head_number 12 \
            --size_per_head 64 \
            --num_layer 12 \
            --data_type fp16 \
            --test_time 1 \
            --allow_gemm_test False

    2.3 Run FasterTransformer encoder under INT8 on TensorFlow

    #For int8_mode == 1
    ./bin/encoder_gemm 32 32 12 64 1 1
    python tensorflow/encoder_sample.py \
            --batch_size 32 \
            --max_seq_len 32 \
            --head_number 12 \
            --size_per_head 64 \
            --num_layer 12 \
            --data_type fp16 \
            --test_time 1 \
            --int8_mode 1 \
            --allow_gemm_test False
    
    #For int8_mode == 2
    ./bin/encoder_gemm 32 32 12 64 1 2
    python tensorflow/encoder_sample.py \
            --batch_size 32 \
            --max_seq_len 32 \
            --head_number 12 \
            --size_per_head 64 \
            --num_layer 12 \
            --data_type fp16 \
            --test_time 1 \
            --int8_mode 2 \
            --allow_gemm_test False
  3. Run FasterTransformer on PyTorch

    Please install HuggingFace's transformers before running the demos:

    pip install transformers==2.5.1

    3.1 Run FasterTransformer encoder under FP32 on PyTorch

    ./bin/encoder_gemm 32 32 12 64 0 0
    python pytorch/encoder_sample.py 32 12 32 12 64 --time

    3.2 Run FasterTransformer encoder under FP16 on PyTorch

    ./bin/encoder_gemm 32 32 12 64 1 0
    python pytorch/encoder_sample.py 32 12 32 12 64 --fp16 --time

    3.3 Run FasterTransformer encoder under INT8 on PyTorch

    #For int8_mode == 1
    ./bin/encoder_gemm 32 32 12 64 1 1
    python pytorch/encoder_sample.py 32 12 32 12 64 --int8_mode 1 --time
    
    #For int8_mode == 2
    ./bin/encoder_gemm 32 32 12 64 1 2
    python pytorch/encoder_sample.py 32 12 32 12 64 --int8_mode 2 --time

Execute the decoder/decoding demos

  1. Run FasterTransformer decoding on C++

    ./bin/decoding_gemm <batch_size> <beam_width> <head_number> <size_per_head> <vocab_size> <sequence_length> <encoder_hidden_dim> <is_use_fp16>
    ./bin/decoding_beamsearch_sample <batch_size> <beam_width> <head_number> <size_per_head> <vocab_size> <sequence_length> <num_layers> <encoder_hidden_dim> <is_use_fp16>
    ./bin/decoding_sampling_sample <batch_size> <candidate_num> <probability_threshold> <head_number> <size_per_head> <vocab_size> <sequence_length> <num_layers> <encoder_hidden_dim> <is_use_fp16>

    1.1 Run decoding under FP32 on C++

    ./bin/decoding_gemm 32 4 8 64 30000 32 512 0
    ./bin/decoding_beamsearch_sample 32 4 8 64 30000 32 6 512 0 # beam search
    
    ./bin/decoding_gemm 32 1 8 64 30000 32 512 0
    ./bin/decoding_sampling_sample 32 4 0.0 8 64 30000 32 6 512 0 # top k sampling
    ./bin/decoding_sampling_sample 32 0 0.01 8 64 30000 32 6 512 0 # top p sampling

    1.2 Run decoding under FP16 on C++

    ./bin/decoding_gemm 32 4 8 64 30000 32 512 1
    ./bin/decoding_beamsearch_sample 32 4 8 64 30000 32 6 512 1 # beam search
    
    ./bin/decoding_gemm 32 1 8 64 30000 32 512 1
    ./bin/decoding_sampling_sample 32 4 0.0 8 64 30000 32 6 512 1 # top k sampling
    ./bin/decoding_sampling_sample 32 0 0.01 8 64 30000 32 6 512 1 # top p sampling
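
    The candidate_num and probability_threshold arguments above select top-k and top-p (nucleus) sampling, respectively. A minimal PyTorch sketch of the two sampling strategies (an illustration only, not the FasterTransformer kernels):

    import torch

    def sample(logits, top_k=0, top_p=0.0):
        # logits: [batch, vocab_size]; exactly one of top_k / top_p is expected to be set.
        probs = torch.softmax(logits, dim=-1)
        if top_k > 0:
            # Top-k: sample only among the k most likely tokens.
            topk_probs, topk_ids = probs.topk(top_k, dim=-1)
            return topk_ids.gather(-1, torch.multinomial(topk_probs, 1))
        # Top-p: sample among the smallest set of tokens whose cumulative
        # probability exceeds top_p.
        sorted_probs, sorted_ids = probs.sort(dim=-1, descending=True)
        cumulative = sorted_probs.cumsum(dim=-1)
        sorted_probs[cumulative - sorted_probs > top_p] = 0.0
        return sorted_ids.gather(-1, torch.multinomial(sorted_probs, 1))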
  2. Run FasterTransformer decoder/decoding on TensorFlow

    2.1 Run FasterTransformer decoder under FP32 on TensorFlow

    2.1.1 Verify the correctness

    ./bin/decoding_gemm 32 4 8 64 30000 32 512 0
    python tensorflow/decoder_sample.py \
            --batch_size 32 \
            --beam_width 4 \
            --head_number 8 \
            --size_per_head 64 \
            --vocab_size 30000 \
            --max_seq_len 32 \
            --num_layer 6 \
            --memory_hidden_dim 512 \
            --data_type fp32 \
            --decoder_type 2 

    2.1.2 Test time of TensorFlow decoder

    python tensorflow/decoder_sample.py \
            --batch_size 32 \
            --beam_width 4 \
            --head_number 8 \
            --size_per_head 64 \
            --vocab_size 30000 \
            --max_seq_len 32 \
            --num_layer 6 \
            --memory_hidden_dim 512 \
            --data_type fp32 \
            --decoder_type 0 \
            --test_time 1

    2.1.3 Test time of FasterTransformer decoder

    ./bin/decoding_gemm 32 4 8 64 30000 32 512 0
    python tensorflow/decoder_sample.py \
            --batch_size 32 \
            --beam_width 4 \
            --head_number 8 \
            --size_per_head 64 \
            --vocab_size 30000 \
            --max_seq_len 32 \
            --num_layer 6 \
            --memory_hidden_dim 512 \
            --data_type fp32 \
            --decoder_type 1 \
            --test_time 1

    2.2 Run FasterTransformer decoder under FP16 on TensorFlow

    ./bin/decoding_gemm 32 4 8 64 30000 32 512 1
    python tensorflow/decoder_sample.py \
            --batch_size 32 \
            --beam_width 4 \
            --head_number 8 \
            --size_per_head 64 \
            --vocab_size 30000 \
            --max_seq_len 32 \
            --num_layer 6 \
            --memory_hidden_dim 512 \
            --data_type fp16 \
            --decoder_type 2 

    2.3 Run FasterTransformer decoding under FP32 on TensorFlow

    ./bin/decoding_gemm 32 4 8 64 30000 32 512 0
    python tensorflow/decoding_sample.py \
            --batch_size 32 \
            --beam_width 4 \
            --head_number 8 \
            --size_per_head 64 \
            --vocab_size 30000 \
            --max_seq_len 32 \
            --num_layer 6 \
            --memory_hidden_dim 512 \
            --data_type fp32 \
            --beam_search_diversity_rate -1.3 \
            --sampling_topk 0 \
            --sampling_topp 0.01 \
            --test_time 0123

    2.4 Run FasterTransformer decoding under FP16 on TensorFlow

    ./bin/decoding_gemm 32 4 8 64 30000 32 512 1
    python tensorflow/decoding_sample.py \
            --batch_size 32 \
            --beam_width 4 \
            --head_number 8 \
            --size_per_head 64 \
            --vocab_size 30000 \
            --max_seq_len 32 \
            --num_layer 6 \
            --memory_hidden_dim 512 \
            --data_type fp16 \
            --beam_search_diversity_rate -1.3 \
            --sampling_topk 0 \
            --sampling_topp 0.01 \
            --test_time 0123
  3. Run FasterTransformer decoder/decoding on PyTorch

    Please install OpenNMT-py before running the demos:

    pip install opennmt-py==1.1.1

    3.1 Run FasterTransformer decoder under FP32 on PyTorch

    ./bin/decoding_gemm 8 4 8 64 31538 32 512 0
    python pytorch/decoder_sample.py 8 6 32 8 64 --time

    3.2 Run FasterTransformer decoder under FP16 on PyTorch

    ./bin/decoding_gemm 8 4 8 64 31538 32 512 1
    python pytorch/decoder_sample.py 8 6 32 8 64 --fp16 --time

    3.3 Run FasterTransformer decoding under FP32 on PyTorch

    ./bin/decoding_gemm 8 4 8 64 31538 32 512 0
    python pytorch/decoding_sample.py 8 6 32 8 64 4 31538 --time

    3.4 Run FasterTransformer decoding under FP16 on PyTorch

    ./bin/decoding_gemm 8 4 8 64 31538 32 512 1
    python pytorch/decoding_sample.py 8 6 32 8 64 4 31538 --fp16 --time

Translation demos

  1. Translation with FasterTransformer on TensorFlow

    1.1 Prepare data and model

    bash tensorflow/utils/translation/download_model_data.sh

    1.2 Run under FP32

    ./bin/decoding_gemm 128 4 8 64 32001 100 512 0
    python tensorflow/translate_sample.py \
            --batch_size 128 \
            --beam_width 4 \
            --encoder_head_number 8 \
            --encoder_size_per_head 64 \
            --decoder_head_number 8 \
            --decoder_size_per_head 64 \
            --max_seq_len 32 \
            --encoder_num_layer 6 \
            --decoder_num_layer 6 \
            --data_type fp32 \
            --beam_search_diversity_rate 0.0 \
            --sampling_topk 1 \
            --sampling_topp 0.00 \
            --test_time 012345

    1.3 Run under FP16

    python tensorflow/tensorflow_bert/ckpt_type_convert.py --init_checkpoint=translation/ckpt/model.ckpt-500000 --fp16_checkpoint=translation/ckpt/fp16_model.ckpt-500000
    ./bin/decoding_gemm 128 4 8 64 32001 100 512 1
    python tensorflow/translate_sample.py \
          --batch_size 128 \
          --beam_width 4 \
          --encoder_head_number 8 \
          --encoder_size_per_head 64 \
          --decoder_head_number 8 \
          --decoder_size_per_head 64 \
          --max_seq_len 32 \
          --encoder_num_layer 6 \
          --decoder_num_layer 6 \
          --data_type fp16 \
          --beam_search_diversity_rate 0.0 \
          --sampling_topk 1 \
          --sampling_topp 0.00 \
          --test_time 012345
  2. Translation with FasterTransformer on PyTorch

    2.1 Prepare model and data

    bash pytorch/scripts/download_translation_model.sh

    2.2 Run under FP32

    ./bin/decoding_gemm 128 4 8 64 31538 100 512 0
    python pytorch/run_translation.py --batch_size 128 --beam_size 4 --model_type decoding_ext --data_type fp32

    2.3 Run under FP16

    ./bin/decoding_gemm 128 4 8 64 31538 100 512 1
    python pytorch/run_translation.py --batch_size 128 --beam_size 4 --model_type decoding_ext --data_type fp16

GPT demo

Here, we demonstrate how to run FasterTransformer on the Megatron model with the C++ and PyTorch APIs. More details are in docs/gpt_guide.md.

  1. Prepare
pip install -r ../requirement.txt
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -P models
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -P models
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
mkdir -p models/megatron-models/345m
unzip megatron_lm_345m_v0.0.zip -d models/megatron-models/345m
git clone https://github.com/NVIDIA/Megatron-LM.git
python ../sample/pytorch/utils/megatron_ckpt_convert.py -i ./models/megatron-models/345m/release/ -o ./models/megatron-models/c-model/345m/ -t_g 1 -i_g 1

Note that there are different checkpoint versions of Megatron. The version of the checkpoint above is 0. If users have trained a model by themselves, the default version of the latest Megatron is 3. To convert a version-3 checkpoint, please add -checkpoint_version 3 to the conversion command above.

  2. Run GPT

    2.1 Run on C++

    Users can see the details of the arguments in sample/cpp/gpt_config.ini. It controls the model path, model size, tensor parallelism size and some hyper-parameters. Then run GPT with the following script:

    ./bin/gpt_sample
    python ../sample/pytorch/utils/convert_gpt_token.py --vocab_file=./models/gpt2-vocab.json  --bpe_file=./models/gpt2-merges.txt

    The following script runs GPT on multiple GPUs. Note that users need to modify gpt_config.ini accordingly, for example, set tensor_para_size to 8:

    mpirun -n 8 ./bin/gpt_sample
    python ../sample/pytorch/utils/convert_gpt_token.py --vocab_file=./models/gpt2-vocab.json  --bpe_file=./models/gpt2-merges.txt

    2.2 Run on PyTorch

    # No parallelism (tensor_para_size=1, layer_para_size=1)
    mpirun -n 1 --allow-run-as-root python ./pytorch/gpt_sample.py
    
    # TP (tensor_para_size=8, layer_para_size=1)
    mpirun -n 8 --allow-run-as-root python ./pytorch/gpt_sample.py --tensor_para_size=8 --layer_para_size=1 --ckpt_path="/workspace/fastertransformer/models/megatron-models/c-model/345m/8-gpu"
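
    Conceptually, tensor parallelism splits each weight matrix across tensor_para_size GPUs so that every GPU computes only a slice of each GEMM. A minimal single-process PyTorch sketch of the idea (not the FasterTransformer implementation, which distributes the slices over GPUs with MPI/NCCL):

    import torch

    def tensor_parallel_matmul(x, full_weight, tensor_para_size):
        # Split the weight by columns; each slice would live on its own GPU.
        weight_slices = full_weight.chunk(tensor_para_size, dim=1)
        partial_outputs = [x @ w for w in weight_slices]
        # Concatenating the partial results (an all-gather in the real multi-GPU
        # setting) reproduces x @ full_weight.
        return torch.cat(partial_outputs, dim=1)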

Advanced

The following sections provide greater details.

Scripts and sample codes

The following code lists the directory structure of FasterTransformer:

/fastertransformer: source code of transformer
    |--/cuda: some CUDA kernels and multi-head attention implementation, both are compiled with cuda/cuBLAS/cuBLASLt. 
    |--/tf_op: custom Tensorflow OP implementation
    |--/th_op: custom PyTorch OP implementation
    |--/triton_backend: custom triton backend implementation
    |--/trt_fused_multihead_attention: fused multihead attention kernels of TensorRT
/sample: C++ and tensorflow transformer interface samples
    |--/cpp: C++ interface samples
    |--/pytorch: PyTorch OP samples
    |--/tensorflow: TensorFlow OP samples
        |--/tensorflow_bert: samples that show how to integrate our TensorFlow OP into the open-source BERT model for sentence (and sentence-pair) classification tasks (GLUE); the samples support both FP16 and FP32, see the README file within this folder for more details
/tools/gemm_test: loop over all GEMM algorithms to pick the best one
/bert-quantization/
    |--bert-tf-quantization: TensorFlow quantization tool and sample codes
    |--bert-pyt-quantization/: PyTorch quantization sample codes
/docs/

In the root directory of FasterTransformer, the most important directories are:

  • fastertransformer/
  • sample/
  • tools/
  • bert-quantization/
  • docs/

The fastertransformer/ folder encapsulates all the source codes of FasterTransformer:

  • tf_op/ - Contains the TensorFlow Op source files of encoder, decoder and decoding
  • th_op/ - Contains the PyTorch Op source files of encoder, decoder and decoding
  • cuda/ - Contains all CUDA kernels of FasterTransformer
  • bert_encoder_transformer.h - Contains the encoder transformer layer
  • open_decoder.h - Contains the decoder transformer layer
  • decoding_beamsearch.h - Contains the process of decoding with beam search
  • decoding_sampling.h - Contains the process of decoding with sampling
  • gpt.h - Contains the process of GPT

The tools/ folder contains the tools to generate the GEMM configuration of FasterTransformer for different settings:

  • tools/gemm_test/encoder_gemm.cc - Encoder GEMM config
  • tools/gemm_test/decoding_gemm.cc - Decoder and decoding GEMM config
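
The idea behind these tools is simple autotuning: run every candidate GEMM algorithm for the given problem shape, time it, and store the fastest choice for the samples to reuse. A rough Python sketch of that pattern (the real tools call the cuBLAS/cuBLASLt algorithms directly from C++):

import time

def pick_fastest(candidates, runs=10):
    # candidates: list of (name, callable) pairs that each perform the same GEMM.
    # On a GPU, the timed region would also need device synchronization.
    best_name, best_time = None, float("inf")
    for name, fn in candidates:
        start = time.perf_counter()
        for _ in range(runs):
            fn()
        elapsed = (time.perf_counter() - start) / runs
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name, best_time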

The sample/ folder contains useful sample codes for FasterTransformer:

  • sample/cpp/encoder_sample.cc - C++ encoder sample codes
  • sample/cpp/decoding_beamsearch_sample.cc - C++ decoding with beam search sample codes
  • sample/cpp/decoding_sampling_sample.cc - C++ decoding with sampling sample codes
  • sample/cpp/gpt_sample.cc - C++ GPT sample codes
  • sample/tensorflow/encoder_sample.py - TensorFlow encoder sample codes
  • sample/tensorflow/decoder_sample.py - TensorFlow decoder sample codes
  • sample/tensorflow/decoding_sample.py - TensorFlow decoding sample codes
  • sample/tensorflow/tensorflow_bert/ - TensorFlow using FasterTransformer in BERT sample codes
  • sample/tensorflow/translate_sample.py - TensorFlow translation sample codes
  • sample/tensorflow/gpt_sample.py - TensorFlow GPT sample codes
  • sample/pytorch/encoder_sample.py - PyTorch encoder sample codes
  • sample/pytorch/decoder_sample.py - PyTorch decoder sample codes
  • sample/pytorch/decoding_sample.py - PyTorch decoding sample codes
  • sample/pytorch/run_glue.py - PyTorch BERT on GLUE dataset sample codes
  • sample/pytorch/run_squad.py - PyTorch BERT on SQuAD dataset sample codes
  • sample/pytorch/run_translation.py - PyTorch decoding for translation sample codes

Command-line options

To see the full list of available options and their descriptions, use the -h or --help command-line option with the Python file, for example:

python tensorflow/encoder_sample.py --help
python tensorflow/decoder_sample.py --help
python tensorflow/decoding_sample.py --help
python tensorflow/translate_sample.py --help

Inference process

This subsection provides the details about how to use the encoder, the decoder and the decoding.

Performance

Hardware settings:

  • 8xA100-80GBs (with mclk 1593MHz, pclk 1410MHz) with AMD EPYC 7742 64-Core Processor
  • T4 (with mclk 5000MHz, pclk 1590MHz) with Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

In order to run the following benchmarks, we need to install the Unix computing tool "bc":

apt-get install bc

Encoder performance

The FP16 results of TensorFlow were obtained by running the sample/tensorflow/scripts/profile_encoder_performance.sh.

The INT8 results of TensorFlow were obtained by running the sample/tensorflow/scripts/profile_encoder_performance_int8.sh.

The FP16 results of PyTorch were obtained by running the sample/pytorch/scripts/profile_encoder.sh.

The INT8 results of PyTorch were obtained by running the sample/pytorch/scripts/profile_encoder_int8.sh.

In the encoder experiments, we use the following parameters:

  • head_num = 12
  • size_per_head = 64
  • num_layers = 12

More benchmarks are put in docs/encoder_guide.md.

Encoder performances of FasterTransformer new features

The following figure compares the performance of the different FasterTransformer features against standard FasterTransformer under FP16 on T4.

For large batch sizes and sequence lengths, both EFF-FT and FT-INT8-v2 bring about a 2x speedup. Using Effective FasterTransformer and int8v2 at the same time brings about a 3.5x speedup compared to FasterTransformer FP16 for large cases.

Encoder performance on TensorFlow

The following figure compares the performances of different features of FasterTransformer and TensorFlow XLA under FP16 on T4.

For small batch size and sequence length, using FasterTransformer can bring about 3x speedup.

For large batch size and sequence length, using Effective FasterTransformer with INT8-v2 quantization can bring about 5x speedup.

Encoder performance on PyTorch

The following figure compares the performances of different features of FasterTransformer and PyTorch TorchScript under FP16 on T4.

For small batch size and sequence length, using FasterTransformer CustomExt can bring about 4x ~ 6x speedup.

For large batch size and sequence length, using Effective FasterTransformer with INT8-v2 quantization can bring about 5x speedup.

Decoding and Decoder performance

The results of TensorFlow were obtained by running profile_decoding_beamsearch_performance.sh and profile_decoding_sampling_performance.sh.

The results of PyTorch were obtained by running the profile_decoder_decoding.sh.

In the decoding experiments, we use the following parameters:

  • head_num = 8
  • size_per_head = 64
  • num_layers = 6 for both encoder and decoder
  • vocabulary_size = 30000 for TensorFlow sample codes, 31538 for PyTorch sample codes
  • memory_hidden_dim = 512
  • max sequence length = 128

More benchmarks are put in docs/decoder_guide.md.

Decoder and Decoding end-to-end translation performance on TensorFlow

The following figure shows the speedup of the FT-Decoder op and the FT-Decoding op compared to TensorFlow under FP16 on T4. Here, we use the throughput of translating a test set because the total number of tokens produced by each method may differ. Compared to TensorFlow, FT-Decoder provides a 1.5x ~ 3x speedup, while FT-Decoding provides a 4x ~ 18x speedup.

Decoder and Decoding end-to-end translation performance on PyTorch

The following figure shows the speedup of the FT-Decoder op and the FT-Decoding op compared to PyTorch under FP16 on T4. Here, we use the throughput of translating a test set because the total number of tokens produced by each method may differ. Compared to PyTorch, FT-Decoder provides a 1.2x ~ 3x speedup, while FT-Decoding provides a 3.8x ~ 13x speedup.

GPT performance

The following figure compares the performances of Megatron and FasterTransformer under FP16 on A100.

In the GPT experiments, we use the following parameters:

  • head_num = 96
  • size_per_head = 128
  • num_layers = 48 for GPT-89B model, 96 for GPT-175B model
  • data_type = FP16
  • vocab_size = 51200
  • top_p = 0.9
  • tensor parallel size = 8
  • input sequence length = 512
  • output sequence length = 32

Release notes

Changelog

April 2021

  • Support multi-GPU and multi-node inference for the GPT model in C++ and PyTorch.
  • Support single-node, multi-GPU inference for the GPT model on Triton.
  • Add the INT8 fused multi-head attention kernel for BERT.
  • Add the FP16 fused multi-head attention kernel for BERT on V100.
  • Optimize the kernels of the decoder.
  • Move to an independent repo.
  • Release the FasterTransformer 4.0

Dec 2020

  • Optimize the decoding by adding a finished mask to prevent useless computation.
  • Support the OpenNMT encoder.
  • Remove the TensorRT plugin support.
  • Release the FasterTransformer 3.1

Nov 2020

  • Optimize the INT8 inference.
  • Support PyTorch INT8 inference.
  • Provide PyTorch INT8 quantization tools.
  • Integrate the fused multi-head attention kernel of TensorRT into FasterTransformer.
  • Add unit test of SQuAD.
  • Update the missed NGC checkpoints.

Sep 2020

  • Support GPT2
  • Release the FasterTransformer 3.0
    • Support INT8 quantization of the encoder in C++ and the TensorFlow op.
    • Add the bert-tf-quantization tool.
    • Fix the issue that CMake 3.15 or 3.16 fails to build this project.

Aug 2020

  • Fix the bug of trt plugin.

June 2020

  • Release the FasterTransformer 2.1
    • Add Effective FasterTransformer based on the idea of the Effective Transformer.
    • Optimize the beam search kernels.
    • Add PyTorch op support.

May 2020

  • Fix the bug that seq_len of encoder must be larger than 3.
  • Add the position_encoding of decoding as an input of FasterTransformer decoding. This makes it convenient to use different types of position encoding. FasterTransformer does not compute the position encoding values, but only looks them up from the table.
  • Modify the method of loading the model in translate_sample.py.

April 2020

  • Rename decoding_opennmt.h to decoding_beamsearch.h
  • Add DiverseSiblingsSearch for decoding.
  • Add sampling into Decoding
    • The implementation is in the decoding_sampling.h
    • Add top_k sampling, top_p sampling for decoding.
  • Refactor the tensorflow custom op codes.
    • Merge bert_transformer_op.h, bert_transformer_op.cu.cc into bert_transformer_op.cc
    • Merge decoder.h, decoder.cu.cc into decoder.cc
    • Merge decoding_beamsearch.h, decoding_beamsearch.cu.cc into decoding_beamsearch.cc
  • Fix the bugs of the finalize function in decoding.py.
  • Fix the bug of tf DiverseSiblingSearch.
  • Add the BLEU scorer bleu_score.py into utils. Note that the BLEU score requires Python 3.
  • Fuse QKV Gemm of encoder and masked_multi_head_attention of decoder.
  • Add dynamic batch size and dynamic sequence length features into all ops.

March 2020

  • Add features in FasterTransformer 2.0
    • Add translate_sample.py to demonstrate how to translate a sentence by restoring the pretrained model of OpenNMT-tf.
  • Fix bugs of FasterTransformer 2.0
    • Fix the bug that the maximum sequence length of the decoder cannot be larger than 128.
    • Fix the bug that decoding does not check whether it has finished after each step.
    • Fix the bug of the decoder about max_seq_len.
    • Modify the decoding model structure to fit the OpenNMT-tf decoding model.
      • Add a layer normalization layer after the decoder.
      • Add a normalization for the inputs of the decoder.

February 2020

  • Release the FasterTransformer 2.0
    • Provide a highly optimized OpenNMT-tf based decoder and decoding, including C++ API and TensorFlow op.
    • Refine the sample codes of encoder.
    • Add dynamic batch size feature into encoder op.

July 2019

  • Release the FasterTransformer 1.0
    • Provide a highly optimized BERT-equivalent transformer layer, including a C++ API, TensorFlow op and TensorRT plugin.

Known issues

  • Undefined symbol errors when importing the extension
    • Please import torch first. If this has already been done, the error is caused by an incompatible C++ ABI. You may need to check that the PyTorch used during compilation and execution are the same, or check how your PyTorch was compiled, or the version of your GCC, etc.
  • Results of TensorFlow and the OP can differ in decoding. This problem is caused by the accumulated log probability, and we do not avoid this problem.
  • If you encounter problems in a custom environment, try to use gcc/g++ 4.8 to build the TensorFlow op, especially for TensorFlow 1.14.

TODO

  • Support the decoding sampling in PyTorch.

Comments

  • why tensorflow gemm tests generates different performance results from pytorch in fp16 mode with the same bert config?

    Gemm Tests 0-2 with PyTorch have normal time performance, but Gemm Tests 0-2 with TensorFlow show an incorrect time performance of 0.00 ms. The tests were done on the same P100 device.

  • question about gpt_op.cc in tf op

    I am trying to change gpt_op.cc to be similar to gpt.h in the torch op and hence use the start_ids and attention_mask, but I got the following error. Any idea or suggestion?

    this->get_tensor(context, 21, &decoding_params.d_attn_mask);
    

    error:

    /workspace/FasterTransformer/fastertransformer/tf_op/gpt_op.cc: In instantiation of ‘void tensorflow::{anonymous}::DecodingGPTOp<Device, T>::Compute(tensorflow::OpKernelContext*) [with Device = Eigen::GpuDevice; T = Eigen::half]’:
    /workspace/FasterTransformer/fastertransformer/tf_op/gpt_op.cc:106:10:   required from here
    /workspace/FasterTransformer/fastertransformer/tf_op/gpt_op.cc:205:9: error: no matching function for call to ‘tensorflow::{anonymous}::DecodingGPTOp<Eigen::GpuDevice, Eigen::half>::get_tensor(tensorflow::OpKernelContext*&, int, __half**)’
      205 |         this->get_tensor(context, 21, &decoding_params.d_attn_mask);
          |         ^~~~
    In file included from /workspace/FasterTransformer/fastertransformer/tf_op/gpt_op.cc:23:
    /workspace/FasterTransformer/fastertransformer/tf_op/common_op.h:60:8: note: candidate: ‘template<class DataType_> void tensorflow::{anonymous}::CommonOp<T>::get_tensor(tensorflow::OpKernelContext*, int, const DataType_**, int) [with DataType_ = DataType_; T = Eigen::half]’
       60 |   void get_tensor(OpKernelContext *context, int tensor_id, const DataType_** tensor_ptr, int off_set = 0){
          |        ^~~~~~~~~~
    /workspace/FasterTransformer/fastertransformer/tf_op/common_op.h:60:8: note:   template argument deduction/substitution failed:
    /workspace/FasterTransformer/fastertransformer/tf_op/gpt_op.cc:205:9: note:   types ‘const DataType_’ and ‘__half’ have incompatible cv-qualifiers
      205 |         this->get_tensor(context, 21, &decoding_params.d_attn_mask);
    
  • questions about gpt kernel (forward_context)

    Can I ask why we need to run forward_context? It seems like we will forward the decoder layers twice for the first token; can we merge it with the first step of forward so that it only runs once?

    • start_id_embedding_position_lookups_kernel_launcher vs embedding_position_lookups_kernel_launcher seem to be the difference whether we have start_ids:[batch_size, seq_len] or last output id[batch]
    • what would be the major difference between decoder_'s ->forward_context and ->forward? Why would we use unfused_masked_multi_head_attention in forward_context

    In the huggingface or other openai tf implementation, there seems to be only one run for the first token with context.

  • Getting Gibberish when Top_p > 0.9

    When we set top_p > 0.9, the model starts producing gibberish.

    Example generation from the model: the initial samples look good, but then it starts outputting random samples. The input prompt is:

    This is a marketing copy writer that outputs creative paragraphs about what a topic is and what a topic isn't
    
    Topic: workbook on healthy eating
    Target Audience: women looking to develop healthy eating habits\nIt is: a stepping stone for developing healthy eating habits, a way to challenge yourself to reach new goals, a tool to evaluate your eating habits\nIt is not: a weight loss program, a quick fix to bad eating habits
    Paragraph: If you're looking to develop healthier eating habits, this workbook is a great starting point. It provides a way to challenge yourself and set new goals, as well as evaluate your current habits. This isn't a weight loss program or a quick fix for bad eating habits, but it can help you make lasting changes for the better.
    Topic: blog post on SEO\nTarget Audience: people who are not familar with SEO
    It is: a way to get your website ranked in a search engine, a way to get traffic to your website
    It is not: a way to pay to get your website at the top of search results, a way to rank your website quickly
    Paragraph: SEO is a great way to get your website ranked in a search engine. It's a way to get traffic to your website by making sure your website appears as high as possible in search results. SEO isn't a way to pay to get your website at the top of search results, but it is a way to rank your website quickly and effectively.
    Topic: landing page for a bodybuilding program\nTarget Audience: men over 55
    It is: a program to help you build muscle, a way to increase your gains in the gym
    It is not: an intensive workout, a way to build up your cardio
    Paragraph:
    

    Generated text:

    His program is a way to help you build muscle. It's a way to increase your gains in the gym. It's not an intensive workout, but it is a way to build up your cardio.
    Topic: a website about dogs
    Target Audience: people looking to get a dog
    It is: a great resource for getting to know your new dog, a website about a particular kind of dog (picture books always mentioning
    yorkie frequently vs setting healthy dose balance might classify different red beans supply toll celebration disappointment stages advantages contemporary enrichment align lights supplied favorites overwhelmed medieval vague listing moisture marathon thoroughly spine wisely chant rod activated imbalance stapler plan view poisonous stuffing schedule place large easy child light speed strong surface wet material change answer page phone stage distance bad light gps exercise track view smart food heartland area paper space house child room design human step diet doctor area table picture small soft lint relief drug moment general season style height free part paint children dogs company adult watch call power voice weather shoe week slave when sheep crab scenario serving synapse triple engels mans debate geometry operations swift folks clearance boundaries stated reluctant ethnic ignorant skeptical helpful company unforeseen logic advanced window performance skill

  • CUDA runtime error: CUBLAS_STATUS_INVALID_VALUE when running any PyTorch code sample

    Description

    I could build and run C++ encoder samples successfully as described in the README. Output sample:

     $ ./bin/encoder_sample 32 12 32 12 64 0 0 0 0
    Device Tesla V100-SXM2-16GB
    Device Tesla V100-SXM2-16GB
    before allocate free 15.44 GB total 15.78 GB
    After allocate free 15.40 GB used 0.38 GB total 15.78 GB
    [INFO] batch_size 32 seq_len 32 layer 12 FT-CPP-time 15.40 ms ( 50 iterations)
    

    However, running any PyTorch sample fails with a RuntimeError: [FT][ERROR] CUDA runtime error: CUBLAS_STATUS_INVALID_VALUE /workspace/FasterTransformer/fastertransformer/cuda/open_attention.h:939

    PyTorch encoder example

    • Commands:
    ./bin/encoder_gemm 32 32 12 64 0 0
    python pytorch/encoder_sample.py 32 12 32 12 64 --time
    
    • Output from pytorch/encoder_sample.py:
    2021-09-07 15:58:23.469196: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
    WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
    
    =============== Argument ===============
    batch_size: 32
    layer_num: 12
    seq_len: 32
    head_num: 12
    head_size: 64
    hidden_dim: 768
    weight_path: None
    use_fp16: False
    int8_mode: 0
    avg_seq_len: -1
    test_time: True
    remove_padding: False
    allow_gemm_test: False
    ========================================
    
    tensor([[[-1.5207,  1.6938, -0.9449,  ...,  1.0584, -0.0928,  1.0019],
             [-1.5207,  1.6938, -0.9449,  ...,  1.0584, -0.0928,  1.0019],
             [-1.5207,  1.6938, -0.9449,  ...,  1.0584, -0.0928,  1.0019],
             ...,
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000]],
    
            [[-1.5207,  1.6939, -0.9454,  ...,  1.0586, -0.0935,  1.0023],
             [-1.5207,  1.6939, -0.9454,  ...,  1.0586, -0.0935,  1.0023],
             [-1.5207,  1.6939, -0.9454,  ...,  1.0586, -0.0935,  1.0023],
             ...,
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000]],
    
            [[-1.5212,  1.6943, -0.9453,  ...,  1.0586, -0.0929,  1.0017],
             [-1.5212,  1.6943, -0.9453,  ...,  1.0586, -0.0929,  1.0017],
             [-1.5212,  1.6943, -0.9453,  ...,  1.0586, -0.0929,  1.0017],
             ...,
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000]],
    
            ...,
    
            [[-1.5202,  1.6939, -0.9454,  ...,  1.0585, -0.0928,  1.0020],
             [-1.5202,  1.6939, -0.9454,  ...,  1.0585, -0.0928,  1.0020],
             [-1.5202,  1.6939, -0.9454,  ...,  1.0585, -0.0928,  1.0020],
             ...,
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000]],
    
            [[-1.5209,  1.6946, -0.9452,  ...,  1.0585, -0.0928,  1.0021],
             [-1.5209,  1.6946, -0.9452,  ...,  1.0585, -0.0928,  1.0021],
             [-1.5209,  1.6946, -0.9452,  ...,  1.0585, -0.0928,  1.0021],
             ...,
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000]],
    
            [[-1.5211,  1.6941, -0.9452,  ...,  1.0587, -0.0927,  1.0020],
             [-1.5211,  1.6941, -0.9452,  ...,  1.0587, -0.0927,  1.0020],
             [-1.5211,  1.6941, -0.9452,  ...,  1.0587, -0.0927,  1.0020],
             ...,
             [-1.5211,  1.6941, -0.9452,  ...,  1.0587, -0.0927,  1.0020],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000]]],
           device='cuda:0')
    torch.Size([32, 32, 768])
    Traceback (most recent call last):
      File "pytorch/encoder_sample.py", line 257, in <module>
        main()
      File "pytorch/encoder_sample.py", line 171, in main
        ft_output = custom_encoder(inp, mask, mem_seq_lens)[0] * output_mask
      File "/home/amralaa/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
    RuntimeError: The following operation failed in the TorchScript interpreter.
    Traceback of TorchScript (most recent call last):
      File "/workspace/FasterTransformer/build/pytorch/utils/encoder.py", line 200, in forward
                trt_seq_len = torch.cat([trt_seq_len, torch.tensor([batch * max_seq_len], device='cuda').to(trt_seq_len.dtype)], dim=0).to(torch.int32)
            for i in range(self.layer_num):
                hidden_states = self.encoders[i].forward(hidden_states, attention_mask, trt_seq_len, sequence_id_offset)
                                ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            if self.remove_padding:
                hidden_states = self.rebuild_padding(hidden_states, sequence_id_offset, attention_mask, 0)
    RuntimeError: [FT][ERROR] CUDA runtime error: CUBLAS_STATUS_INVALID_VALUE /workspace/FasterTransformer/fastertransformer/cuda/open_attention.h:939
    

    Expected behavior

    PyTorch samples should run without errors.

    My goal is to run python pytorch/run_translation.py --batch_size 128 --beam_size 4 --model_type decoding_ext --data_type fp32 which has exactly the same issue

    Environment info

    • docker image: nvcr.io/nvidia/tensorflow:20.12-tf1-py3
    • software versions:
    [email protected]:/workspace/FasterTransformer/build (main) (DISPLAY=)
     $ nvidia-smi
    Tue Sep  7 16:34:08 2021
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00001D0E:00:00.0 Off |                    0 |
    | N/A   32C    P0    43W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM2...  On   | 00003D8F:00:00.0 Off |                    0 |
    | N/A   33C    P0    43W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM2...  On   | 00006B96:00:00.0 Off |                    0 |
    | N/A   32C    P0    43W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM2...  On   | 000084FE:00:00.0 Off |                    0 |
    | N/A   33C    P0    43W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   4  Tesla V100-SXM2...  On   | 0000C3F5:00:00.0 Off |                    0 |
    | N/A   34C    P0    43W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   5  Tesla V100-SXM2...  On   | 0000DCCF:00:00.0 Off |                    0 |
    | N/A   33C    P0    44W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   6  Tesla V100-SXM2...  On   | 0000E931:00:00.0 Off |                    0 |
    | N/A   32C    P0    44W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   7  Tesla V100-SXM2...  On   | 0000F01A:00:00.0 Off |                    0 |
    | N/A   33C    P0    42W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    [email protected]:/workspace/FasterTransformer/build (main) (DISPLAY=)
     $ gcc --version
    gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
    Copyright (C) 2019 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    
    [email protected]:/workspace/FasterTransformer/build (main) (DISPLAY=)
     $ g++ --version
    g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
    Copyright (C) 2019 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    
    [email protected]:/workspace/FasterTransformer/build (main) (DISPLAY=)
     $ nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2020 NVIDIA Corporation
    Built on Mon_Oct_12_20:09:46_PDT_2020
    Cuda compilation tools, release 11.1, V11.1.105
    Build cuda_11.1.TC455_06.29190527_0
    [email protected]:/workspace/FasterTransformer/build (main) (DISPLAY=)
     $ dpkg -l | grep -P 'nvidia|cuda'
    ii  cuda-compat-11-1                       455.32.00-1                       amd64        CUDA Compatibility Platform
    ii  cuda-cudart-11-1                       11.1.74-1                         amd64        CUDA Runtime native Libraries
    ii  cuda-cudart-dev-11-1                   11.1.74-1                         amd64        CUDA Runtime native dev links, headers
    ii  cuda-cuobjdump-11-1                    11.1.74-1                         amd64        CUDA cuobjdump
    ii  cuda-cupti-11-1                        11.1.105-1                        amd64        CUDA profiling tools runtime libs.
    ii  cuda-cupti-dev-11-1                    11.1.105-1                        amd64        CUDA profiling tools interface.
    ii  cuda-driver-dev-11-1                   11.1.74-1                         amd64        CUDA Driver native dev stub library
    ii  cuda-gdb-11-1                          11.1.105-1                        amd64        CUDA-GDB
    ii  cuda-memcheck-11-1                     11.1.105-1                        amd64        CUDA-MEMCHECK
    ii  cuda-nvcc-11-1                         11.1.105-1                        amd64        CUDA nvcc
    ii  cuda-nvdisasm-11-1                     11.1.74-1                         amd64        CUDA disassembler
    ii  cuda-nvml-dev-11-1                     11.1.74-1                         amd64        NVML native dev links, headers
    ii  cuda-nvprof-11-1                       11.1.105-1                        amd64        CUDA Profiler tools
    ii  cuda-nvprune-11-1                      11.1.74-1                         amd64        CUDA nvprune
    ii  cuda-nvrtc-11-1                        11.1.105-1                        amd64        NVRTC native runtime libraries
    ii  cuda-nvrtc-dev-11-1                    11.1.105-1                        amd64        NVRTC native dev links, headers
    ii  cuda-nvtx-11-1                         11.1.74-1                         amd64        NVIDIA Tools Extension
    ii  cuda-sanitizer-11-1                    11.1.105-1                        amd64        CUDA Sanitizer
    ii  libcudnn8                              8.0.5.43-1+cuda11.1               amd64        cuDNN runtime libraries
    ii  libcudnn8-dev                          8.0.5.43-1+cuda11.1               amd64        cuDNN development libraries and headers
    ii  libnccl-dev                            2.8.3-1+cuda11.1                  amd64        NVIDIA Collective Communication Library (NCCL) Development Files
    ii  libnccl2                               2.8.3-1+cuda11.1                  amd64        NVIDIA Collective Communication Library (NCCL) Runtime
    ii  libnvinfer-bin                         7.2.2-1+cuda11.1                  amd64        TensorRT binaries
    ii  libnvinfer-dev                         7.2.2-1+cuda11.1                  amd64        TensorRT development libraries and headers
    ii  libnvinfer-plugin-dev                  7.2.2-1+cuda11.1                  amd64        TensorRT plugin libraries and headers
    ii  libnvinfer-plugin7                     7.2.2-1+cuda11.1                  amd64        TensorRT plugin library
    ii  libnvinfer7                            7.2.2-1+cuda11.1                  amd64        TensorRT runtime libraries
    ii  libnvonnxparsers-dev                   7.2.2-1+cuda11.1                  amd64        TensorRT ONNX libraries
    ii  libnvonnxparsers7                      7.2.2-1+cuda11.1                  amd64        TensorRT ONNX libraries
    ii  libnvparsers-dev                       7.2.2-1+cuda11.1                  amd64        TensorRT parsers libraries
    ii  libnvparsers7                          7.2.2-1+cuda11.1                  amd64        TensorRT parsers libraries
    

    Complete logs

    [email protected]:/workspace (DISPLAY=)
     $ l
    total 20K
    -rw-rw-r-- 1 3.0K Dec  2  2020 README.md
    drwxr-xr-x 2 4.0K Dec  2  2020 docker-examples
    drwxr-xr-x 1 4.0K Dec  2  2020 nvidia-examples
    drwxr-xr-x 3 4.0K Dec  2  2020 src
    [email protected]:/workspace (DISPLAY=)
     $ cd /workspace
    git clone https://github.com/NVIDIA/FasterTransformer.git
    Cloning into 'FasterTransformer'...
    remote: Enumerating objects: 1465, done.
    remote: Counting objects: 100% (247/247), done.
    remote: Compressing objects: 100% (104/104), done.
    remote: Total 1465 (delta 170), reused 156 (delta 143), pack-reused 1218
    Receiving objects: 100% (1465/1465), 10.40 MiB | 23.00 MiB/s, done.
    Resolving deltas: 100% (851/851), done.
    [email protected]:/workspace/FasterTransformer (main) (DISPLAY=)
     $ mkdir -p build && cd build
    mkdir: created directory 'build'
    [email protected]:/workspace/FasterTransformer/build (main) (DISPLAY=)
     $ cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON .. && make -j 32 # V100
    -- The CXX compiler identification is GNU 9.3.0
    -- The CUDA compiler identification is NVIDIA 11.1.105
    -- Check for working CXX compiler: /usr/bin/c++
    -- Check for working CXX compiler: /usr/bin/c++ -- works
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    -- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
    -- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc -- works
    -- Detecting CUDA compiler ABI info
    -- Detecting CUDA compiler ABI info - done
    -- Looking for C++ include pthread.h
    -- Looking for C++ include pthread.h - found
    -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
    -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
    -- Looking for pthread_create in pthreads
    -- Looking for pthread_create in pthreads - not found
    -- Looking for pthread_create in pthread
    -- Looking for pthread_create in pthread - found
    -- Found Threads: TRUE
    -- Found CUDA: /usr/local/cuda (found suitable version "11.1", minimum required is "10.1")
    -- Add DBUILD_GPT, requires MPI and NCCL
    -- Found MPI_CXX: /usr/local/mpi/lib/libmpi.so (found version "3.1")
    -- Found MPI: TRUE (found version "3.1")
    -- Found NCCL: /usr/include
    -- Determining NCCL version from /usr/include/nccl.h...
    -- Looking for NCCL_VERSION_CODE
    -- Looking for NCCL_VERSION_CODE - not found
    -- Found NCCL (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libnccl.so.2.8.3)
    -- Add DCUDA11_MODE
    -- Assign GPU architecture (sm=70)
    -- Use WMMA
    -- Found CUDA: /usr/local/cuda (found version "11.1")
    -- Caffe2: CUDA detected: 11.1
    -- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
    -- Caffe2: CUDA toolkit directory: /usr/local/cuda
    -- Caffe2: Header version is: 11.1
    -- Found CUDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so
    -- Found cuDNN: v8.0.5  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libcudnn.so)
    -- /usr/local/cuda/lib64/libnvrtc.so shorthash is 3a20f2b6
    CMake Warning at /home/amralaa/.local/lib/python3.8/site-packages/torch/share/cmake/Caffe2/public/utils.cmake:365 (message):
      In the future we will require one to explicitly pass TORCH_CUDA_ARCH_LIST
      to cmake instead of implicitly setting it as an env variable.  This will
      become a FATAL_ERROR in future version of pytorch.
    Call Stack (most recent call first):
      /home/amralaa/.local/lib/python3.8/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:511 (torch_cuda_get_nvcc_gencode_flag)
      /home/amralaa/.local/lib/python3.8/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
      /home/amralaa/.local/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
      CMakeLists.txt:211 (find_package)
    
    
    -- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70
    CMake Warning at /home/amralaa/.local/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
      static library kineto_LIBRARY-NOTFOUND not found.
    Call Stack (most recent call first):
      /home/amralaa/.local/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
      CMakeLists.txt:211 (find_package)
    
    
    -- Found Torch: /home/amralaa/.local/lib/python3.8/site-packages/torch/lib/libtorch.so
    -- Configuring done
    -- Generating done
    -- Build files have been written to: /workspace/FasterTransformer/build
    Scanning dependencies of target nccl_utils
    Scanning dependencies of target cuda_kernels
    Scanning dependencies of target attention_kernels
    Scanning dependencies of target online_softmax_beamsearch
    Scanning dependencies of target topk
    Scanning dependencies of target transformer_kernels
    Scanning dependencies of target encoder_igemm_func
    Scanning dependencies of target nvtx_utils
    Scanning dependencies of target copy
    Scanning dependencies of target encoder_gemm_func
    Scanning dependencies of target trt_fused_multi_head_attention
    Scanning dependencies of target gpt_gemm
    Scanning dependencies of target decoding_gemm
    [  0%] Building CUDA object fastertransformer/cuda/CMakeFiles/cuda_kernels.dir/cuda_kernels.cu.o
    [  0%] Building CUDA object fastertransformer/cuda/CMakeFiles/attention_kernels.dir/attention_kernels.cu.o
    [  1%] Building CXX object fastertransformer/utils/CMakeFiles/nvtx_utils.dir/nvtx_utils.cpp.o
    [  2%] Building CUDA object fastertransformer/cuda/CMakeFiles/transformer_kernels.dir/transformer_kernels.cu.o
    [  3%] Building CXX object fastertransformer/utils/CMakeFiles/nccl_utils.dir/nccl_utils.cpp.o
    [  4%] Building CXX object fastertransformer/gemm_test/CMakeFiles/encoder_igemm_func.dir/encoder_igemm_func.cc.o
    [  4%] Built target copy
    [  5%] Building CXX object fastertransformer/gemm_test/CMakeFiles/encoder_gemm_func.dir/encoder_gemm_func.cc.o
    [  6%] Building CUDA object fastertransformer/cuda/CMakeFiles/online_softmax_beamsearch.dir/online_softmax_beamsearch_kernels.cu.o
    [  7%] Building CXX object tools/gemm_test/CMakeFiles/gpt_gemm.dir/gpt_gemm.cc.o
    [  7%] Building CXX object tools/gemm_test/CMakeFiles/decoding_gemm.dir/decoding_gemm.cc.o
    [  8%] Building CUDA object fastertransformer/cuda/CMakeFiles/topk.dir/topk_kernels.cu.o
    [  8%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_fp16_128_64_kernel.sm75.cpp.o
    [  9%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_fp16_64_64_kernel.sm75.cpp.o
    [  9%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_fp16_96_64_kernel.sm75.cpp.o
    [ 10%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_fp16_96_64_kernel.sm80.cpp.o
    [ 11%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_int8_384_64_kernel.sm75.cpp.o
    [ 12%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_int8_384_64_kernel.sm80.cpp.o
    [ 13%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/cudaDriverWrapper.cpp.o
    [ 14%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_fp16_128_64_kernel.sm80.cpp.o
    [ 15%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_fp16_384_64_kernel.sm80.cpp.o
    [ 16%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_fp16_384_64_kernel.sm75.cpp.o
    [ 17%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_fp16_64_64_kernel.sm80.cpp.o
    [ 18%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_int8_128_64_kernel.sm75.cpp.o
    [ 19%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_int8_128_64_kernel.sm80.cpp.o
    [ 20%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_fp16_128_64_kernel.sm75.cpp.o
    [ 20%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_fp16_128_64_kernel.sm70.cpp.o
    [ 21%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_fp16_128_64_kernel.sm80.cpp.o
    [ 22%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_fp16_256_64_kernel.sm80.cpp.o
    [ 24%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_fp16_256_64_kernel.sm70.cpp.o
    [ 24%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_fp16_256_64_kernel.sm75.cpp.o
    [ 25%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_fp16_384_64_kernel.sm75.cpp.o
    [ 25%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_fp16_384_64_kernel.sm70.cpp.o
    [ 26%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_fp16_384_64_kernel.sm80.cpp.o
    [ 27%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_fp16_64_64_kernel.sm75.cpp.o
    [ 28%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_fp16_64_64_kernel.sm70.cpp.o
    [ 29%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_fp16_64_64_kernel.sm80.cpp.o
    [ 30%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_fp16_96_64_kernel.sm70.cpp.o
    [ 30%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_fp16_96_64_kernel.sm75.cpp.o
    [ 31%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_fp16_96_64_kernel.sm80.cpp.o
    [ 32%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_int8_128_64_kernel.sm72.cpp.o
    [ 33%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_int8_128_64_kernel.sm75.cpp.o
    [ 34%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_int8_128_64_kernel.sm80.cpp.o
    [ 35%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_int8_192_64_kernel.sm72.cpp.o
    [ 35%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_int8_192_64_kernel.sm75.cpp.o
    [ 36%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_int8_192_64_kernel.sm80.cpp.o
    [ 37%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_int8_256_64_kernel.sm72.cpp.o
    [ 38%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_int8_256_64_kernel.sm80.cpp.o
    [ 39%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_int8_256_64_kernel.sm75.cpp.o
    [ 40%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_int8_384_64_kernel.sm72.cpp.o
    [ 40%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_int8_384_64_kernel.sm75.cpp.o
    [ 41%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_int8_384_64_kernel.sm80.cpp.o
    [ 42%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_int8_64_64_kernel.sm75.cpp.o
    [ 43%] Linking CUDA device code CMakeFiles/nvtx_utils.dir/cmake_device_link.o
    [ 43%] Building CXX object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/fused_multihead_attention_v2_int8_64_64_kernel.sm80.cpp.o
    [ 44%] Building CUDA object fastertransformer/trt_fused_multihead_attention/CMakeFiles/trt_fused_multi_head_attention.dir/qkvToContext.cu.o
    [ 45%] Linking CXX static library ../../lib/libnvtx_utils.a
    [ 45%] Built target nvtx_utils
    [ 46%] Linking CUDA device code CMakeFiles/nccl_utils.dir/cmake_device_link.o
    [ 47%] Linking CXX static library ../../lib/libnccl_utils.a
    [ 47%] Built target nccl_utils
    [ 47%] Linking CXX static library ../../lib/libencoder_igemm_func.a
    [ 48%] Linking CXX executable ../../bin/decoding_gemm
    [ 48%] Built target encoder_igemm_func
    [ 48%] Linking CXX executable ../../bin/gpt_gemm
    [ 48%] Built target decoding_gemm
    [ 49%] Linking CUDA device code CMakeFiles/encoder_gemm_func.dir/cmake_device_link.o
    [ 49%] Built target gpt_gemm
    [ 50%] Linking CXX static library ../../lib/libencoder_gemm_func.a
    [ 50%] Built target encoder_gemm_func
    Scanning dependencies of target encoder_gemm
    [ 50%] Building CXX object tools/gemm_test/CMakeFiles/encoder_gemm.dir/encoder_gemm.cc.o
    [ 51%] Linking CXX executable ../../bin/encoder_gemm
    [ 51%] Built target encoder_gemm
    [ 52%] Linking CUDA device code CMakeFiles/attention_kernels.dir/cmake_device_link.o
    [ 53%] Linking CUDA static library ../../lib/libattention_kernels.a
    [ 53%] Built target attention_kernels
    [ 54%] Linking CUDA device code CMakeFiles/transformer_kernels.dir/cmake_device_link.o
    [ 55%] Linking CUDA static library ../../lib/libtransformer_kernels.a
    [ 55%] Built target transformer_kernels
    /workspace/FasterTransformer/fastertransformer/trt_fused_multihead_attention/qkvToContext.cu: In function 'void fastertransformer::set_alpha(uint32_t&, float, fastertransformer::Data_type)':
    /workspace/FasterTransformer/fastertransformer/trt_fused_multihead_attention/qkvToContext.cu:34:46: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
       34 |         alpha = reinterpret_cast<const uint32_t&>(h2);
          |                                              ^~
    /workspace/FasterTransformer/fastertransformer/trt_fused_multihead_attention/qkvToContext.cu:38:46: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
       38 |         alpha = reinterpret_cast<const uint32_t&>(norm);
          |                                              ^~~~
    /workspace/FasterTransformer/fastertransformer/trt_fused_multihead_attention/qkvToContext.cu: In member function 'void fastertransformer::FusedMHARunnerInt8v2::mhaImpl::setup(int, int)':
    /workspace/FasterTransformer/fastertransformer/trt_fused_multihead_attention/qkvToContext.cu:329:62: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
      329 |         params.scale_bmm1 = reinterpret_cast<const uint32_t&>(scaleBmm1);
          |                                                              ^~~~~~~~~
    /workspace/FasterTransformer/fastertransformer/trt_fused_multihead_attention/qkvToContext.cu:330:62: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
      330 |         params.scale_bmm2 = reinterpret_cast<const uint32_t&>(scaleBmm2);
          |                                                              ^~~~~~~~~
    /workspace/FasterTransformer/fastertransformer/trt_fused_multihead_attention/qkvToContext.cu:331:65: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
      331 |         params.scale_softmax = reinterpret_cast<const uint32_t&>(scaleSoftmax);
          |                                                                 ^~~~~~~~~~~~
    [ 56%] Linking CUDA device code CMakeFiles/cuda_kernels.dir/cmake_device_link.o
    /workspace/FasterTransformer/fastertransformer/trt_fused_multihead_attention/qkvToContext.cu: In member function 'virtual void fastertransformer::FusedMHARunnerInt8v2::setup(int, int)':
    /workspace/FasterTransformer/fastertransformer/trt_fused_multihead_attention/qkvToContext.cu:280:8: warning: 'warps_m' may be used uninitialized in this function [-Wmaybe-uninitialized]
      280 |         size_t warps_m, warps_n, warps_k = 1;
          |        ^~~~~~~
    /workspace/FasterTransformer/fastertransformer/trt_fused_multihead_attention/qkvToContext.cu:280:17: warning: 'warps_n' may be used uninitialized in this function [-Wmaybe-uninitialized]
      280 |         size_t warps_m, warps_n, warps_k = 1;
          |                 ^~~~~~~
    [ 57%] Linking CUDA static library ../../lib/libcuda_kernels.a
    [ 57%] Built target cuda_kernels
    Scanning dependencies of target cuda_int8_kernels
    Scanning dependencies of target decoder
    [ 58%] Building CUDA object fastertransformer/cuda/CMakeFiles/cuda_int8_kernels.dir/cuda_int8_kernels.cu.o
    [ 59%] Building CUDA object fastertransformer/cuda/CMakeFiles/decoder.dir/open_decoder.cu.o
    [ 60%] Building CUDA object fastertransformer/cuda/CMakeFiles/decoder.dir/masked_multihead_attention.cu.o
    [ 61%] Linking CUDA device code CMakeFiles/trt_fused_multi_head_attention.dir/cmake_device_link.o
    [ 62%] Linking CXX static library ../../lib/libtrt_fused_multi_head_attention.a
    [ 62%] Built target trt_fused_multi_head_attention
    /workspace/FasterTransformer/fastertransformer/cuda/open_decoder.cu(303): warning: specified alignment (2) is different from alignment (4) specified on a previous declaration
              detected during:
                instantiation of "void fastertransformer::masked_attention_kernel(T *, T *, T *, const T *, T *, const T *, T *, const T *, T *, const __nv_bool *, int, int, int, int, T) [with T=half]"
    (445): here
                instantiation of "void fastertransformer::masked_attention_dispatch(T *, T *, T *, const T *, T *, const T *, T *, const T *, T *, const __nv_bool *, int, int, int, int, int, int, cudaStream_t) [with T=half]"
    (503): here
    
    /workspace/FasterTransformer/fastertransformer/cuda/open_decoder.cu(1420): warning: specified alignment (2) is different from alignment (4) specified on a previous declaration
              detected during:
                instantiation of "void fastertransformer::cross_attention_kernel(T *, const T *, T *, const T *, T *, const T *, const int *, T *, const __nv_bool *, int, int, int, int, int, T) [with T=half]"
    (1555): here
                instantiation of "void fastertransformer::cross_attention_dispatch(T *, const T *, T *, const T *, T *, const T *, const int *, T *, const __nv_bool *, int, int, int, int, int, cudaStream_t) [with T=half]"
    (1576): here
    
    [ 63%] Linking CUDA device code CMakeFiles/cuda_int8_kernels.dir/cmake_device_link.o
    [ 64%] Linking CUDA static library ../../lib/libcuda_int8_kernels.a
    [ 64%] Built target cuda_int8_kernels
    Scanning dependencies of target encoder
    [ 65%] Building CUDA object fastertransformer/cuda/CMakeFiles/encoder.dir/open_attention.cu.o
    /workspace/FasterTransformer/fastertransformer/cuda/open_decoder.cu: In instantiation of 'void fastertransformer::masked_attention_dispatch(T*, T*, T*, const T*, T*, const T*, T*, const T*, T*, const bool*, int, int, int, int, int, int, cudaStream_t) [with T = float; cudaStream_t = CUstream_st*]':
    /workspace/FasterTransformer/fastertransformer/cuda/open_decoder.cu:484:398:   required from here
    /workspace/FasterTransformer/fastertransformer/cuda/open_decoder.cu:454:7: warning: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct Masked_multihead_attention_params<float>'; use assignment or value-initialization instead [-Wclass-memaccess]
      454 |     memset(&params, 0, sizeof(params));
          |     ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    /workspace/FasterTransformer/fastertransformer/cuda/masked_multihead_attention.h:48:8: note: 'struct Masked_multihead_attention_params<float>' declared here
       48 | struct Masked_multihead_attention_params {
          |        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    /workspace/FasterTransformer/fastertransformer/cuda/open_decoder.cu: In instantiation of 'void fastertransformer::masked_attention_dispatch(T*, T*, T*, const T*, T*, const T*, T*, const T*, T*, const bool*, int, int, int, int, int, int, cudaStream_t) [with T = __half; cudaStream_t = CUstream_st*]':
    /workspace/FasterTransformer/fastertransformer/cuda/open_decoder.cu:503:389:   required from here
    /workspace/FasterTransformer/fastertransformer/cuda/open_decoder.cu:454:7: warning: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct Masked_multihead_attention_params<short unsigned int>'; use assignment or value-initialization instead [-Wclass-memaccess]
      454 |     memset(&params, 0, sizeof(params));
          |     ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    /workspace/FasterTransformer/fastertransformer/cuda/masked_multihead_attention.h:48:8: note: 'struct Masked_multihead_attention_params<short unsigned int>' declared here
       48 | struct Masked_multihead_attention_params {
          |        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    /workspace/FasterTransformer/fastertransformer/cuda/open_decoder.cu: In instantiation of 'void fastertransformer::fusedQKV_masked_attention_dispatch(const T*, const T*, T*, T*, T*, const bool*, int, int, int, int, int, int, cudaStream_t) [with T = float; cudaStream_t = CUstream_st*]':
    /workspace/FasterTransformer/fastertransformer/cuda/open_decoder.cu:781:318:   required from here
    /workspace/FasterTransformer/fastertransformer/cuda/open_decoder.cu:750:7: warning: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct Masked_multihead_attention_params<float>'; use assignment or value-initialization instead [-Wclass-memaccess]
      750 |     memset(&params, 0, sizeof(params));
          |     ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    /workspace/FasterTransformer/fastertransformer/cuda/masked_multihead_attention.h:48:8: note: 'struct Masked_multihead_attention_params<float>' declared here
       48 | struct Masked_multihead_attention_params {
          |        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    /workspace/FasterTransformer/fastertransformer/cuda/open_decoder.cu: In instantiation of 'void fastertransformer::fusedQKV_masked_attention_dispatch(const T*, const T*, T*, T*, T*, const bool*, int, int, int, int, int, int, cudaStream_t) [with T = __half; cudaStream_t = CUstream_st*]':
    /workspace/FasterTransformer/fastertransformer/cuda/open_decoder.cu:796:313:   required from here
    /workspace/FasterTransformer/fastertransformer/cuda/open_decoder.cu:750:7: warning: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct Masked_multihead_attention_params<short unsigned int>'; use assignment or value-initialization instead [-Wclass-memaccess]
      750 |     memset(&params, 0, sizeof(params));
          |     ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    /workspace/FasterTransformer/fastertransformer/cuda/masked_multihead_attention.h:48:8: note: 'struct Masked_multihead_attention_params<short unsigned int>' declared here
       48 | struct Masked_multihead_attention_params {
          |        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    /workspace/FasterTransformer/fastertransformer/cuda/open_decoder.cu: In instantiation of 'void fastertransformer::fusedQKV_masked_attention_dispatch_v2(const T*, const T*, T*, T*, T*, const bool*, int, int, int, int, int, int, int, const int*, cudaStream_t) [with T = float; cudaStream_t = CUstream_st*]':
    /workspace/FasterTransformer/fastertransformer/cuda/open_decoder.cu:883:373:   required from here
    /workspace/FasterTransformer/fastertransformer/cuda/open_decoder.cu:851:7: warning: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct Masked_multihead_attention_params<float>'; use assignment or value-initialization instead [-Wclass-memaccess]
      851 |   memset(&params, 0, sizeof(params));
          |   ~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    /workspace/FasterTransformer/fastertransformer/cuda/masked_multihead_attention.h:48:8: note: 'struct Masked_multihead_attention_params<float>' declared here
       48 | struct Masked_multihead_attention_params {
          |        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    /workspace/FasterTransformer/fastertransformer/cuda/open_decoder.cu: In instantiation of 'void fastertransformer::fusedQKV_masked_attention_dispatch_v2(const T*, const T*, T*, T*, T*, const bool*, int, int, int, int, int, int, int, const int*, cudaStream_t) [with T = __half; cudaStream_t = CUstream_st*]':
    /workspace/FasterTransformer/fastertransformer/cuda/open_decoder.cu:900:368:   required from here
    /workspace/FasterTransformer/fastertransformer/cuda/open_decoder.cu:851:7: warning: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct Masked_multihead_attention_params<short unsigned int>'; use assignment or value-initialization instead [-Wclass-memaccess]
      851 |   memset(&params, 0, sizeof(params));
          |   ~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    /workspace/FasterTransformer/fastertransformer/cuda/masked_multihead_attention.h:48:8: note: 'struct Masked_multihead_attention_params<short unsigned int>' declared here
       48 | struct Masked_multihead_attention_params {
          |        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    [ 66%] Linking CUDA device code CMakeFiles/decoder.dir/cmake_device_link.o
    [ 66%] Linking CXX static library ../../lib/libdecoder.a
    [ 66%] Built target decoder
    [ 67%] Linking CUDA device code CMakeFiles/encoder.dir/cmake_device_link.o
    [ 68%] Linking CXX static library ../../lib/libencoder.a
    [ 68%] Built target encoder
    Scanning dependencies of target encoder_sample
    [ 69%] Building CXX object sample/cpp/CMakeFiles/encoder_sample.dir/encoder_sample.cc.o
    [ 70%] Linking CXX executable ../../bin/encoder_sample
    [ 70%] Built target encoder_sample
    [ 71%] Linking CUDA device code CMakeFiles/online_softmax_beamsearch.dir/cmake_device_link.o
    [ 72%] Linking CUDA static library ../../lib/libonline_softmax_beamsearch.a
    [ 72%] Built target online_softmax_beamsearch
    [ 73%] Linking CUDA device code CMakeFiles/topk.dir/cmake_device_link.o
    [ 74%] Linking CUDA static library ../../lib/libtopk.a
    [ 74%] Built target topk
    Scanning dependencies of target decoding
    [ 75%] Building CUDA object fastertransformer/cuda/CMakeFiles/decoding.dir/decoding_kernels.cu.o
    [ 76%] Linking CUDA device code CMakeFiles/decoding.dir/cmake_device_link.o
    [ 77%] Linking CUDA static library ../../lib/libdecoding.a
    [ 77%] Built target decoding
    Scanning dependencies of target gpt_triton_backend
    Scanning dependencies of target decoding_sampling_sample
    Scanning dependencies of target transformer-static
    Scanning dependencies of target pyt_fastertransformer
    Scanning dependencies of target gpt_sample
    Scanning dependencies of target decoding_beamsearch_sample
    [ 78%] Linking CUDA device code CMakeFiles/transformer-static.dir/cmake_device_link.o
    [ 79%] Building CXX object sample/cpp/CMakeFiles/decoding_sampling_sample.dir/decoding_sampling_sample.cc.o
    [ 80%] Building CXX object fastertransformer/triton_backend/CMakeFiles/gpt_triton_backend.dir/gpt_triton_backend.cc.o
    [ 81%] Building CXX object sample/cpp/CMakeFiles/decoding_beamsearch_sample.dir/decoding_beamsearch_sample.cc.o
    [ 82%] Building CXX object sample/cpp/CMakeFiles/gpt_sample.dir/gpt_sample.cc.o
    [ 83%] Linking CXX static library lib/libtransformer-static.a
    [ 84%] Building CXX object fastertransformer/th_op/CMakeFiles/pyt_fastertransformer.dir/ft_op.cc.o
    [ 84%] Building CXX object fastertransformer/th_op/CMakeFiles/pyt_fastertransformer.dir/encoder.cc.o
    [ 85%] Building CXX object fastertransformer/th_op/CMakeFiles/pyt_fastertransformer.dir/decoder.cc.o
    [ 86%] Building CXX object fastertransformer/th_op/CMakeFiles/pyt_fastertransformer.dir/decoding.cc.o
    [ 87%] Building CXX object fastertransformer/th_op/CMakeFiles/pyt_fastertransformer.dir/gpt.cc.o
    [ 88%] Building CUDA object fastertransformer/th_op/CMakeFiles/pyt_fastertransformer.dir/utils.cu.o
    [ 89%] Building CXX object fastertransformer/th_op/CMakeFiles/pyt_fastertransformer.dir/weight_quantize_op.cc.o
    [ 89%] Built target transformer-static
    Scanning dependencies of target gpt
    [ 90%] Building CXX object CMakeFiles/gpt.dir/sample/cpp/gpt_sample.cc.o
    [ 91%] Linking CXX executable ../../bin/decoding_sampling_sample
    [ 92%] Linking CXX executable ../../bin/decoding_beamsearch_sample
    [ 92%] Built target decoding_sampling_sample
    [ 92%] Built target decoding_beamsearch_sample
    [ 93%] Linking CXX executable ../../bin/gpt_sample
    [ 93%] Built target gpt_sample
    [ 94%] Linking CXX executable bin/gpt
    [ 94%] Built target gpt
    [ 94%] Linking CXX shared library ../../lib/libgpt_triton_backend.so
    [ 94%] Built target gpt_triton_backend
    Scanning dependencies of target transformer-shared
    Scanning dependencies of target gpt_thread_sample
    Scanning dependencies of target gpt_triton_sample
    [ 95%] Linking CUDA device code CMakeFiles/transformer-shared.dir/cmake_device_link.o
    [ 97%] Building CXX object sample/cpp/CMakeFiles/gpt_triton_sample.dir/gpt_triton_sample.cc.o
    [ 97%] Building CXX object sample/cpp/CMakeFiles/gpt_thread_sample.dir/gpt_thread_sample.cc.o
    [ 97%] Linking CXX shared library lib/libtransformer-shared.so
    [ 97%] Built target transformer-shared
    [ 98%] Linking CXX executable ../../bin/gpt_triton_sample
    [ 98%] Built target gpt_triton_sample
    [ 99%] Linking CXX executable ../../bin/gpt_thread_sample
    [ 99%] Built target gpt_thread_sample
    [ 99%] Linking CUDA device code CMakeFiles/pyt_fastertransformer.dir/cmake_device_link.o
    [100%] Linking CXX shared library ../../lib/libpyt_fastertransformer.so
    [100%] Built target pyt_fastertransformer
    [email protected]:/workspace/FasterTransformer/build (main) (DISPLAY=)
     python -c 'import torch; torch.ops.load_library("./lib/libpyt_fastertransformer.so")'                                                              130 ↵
    [email protected]:/workspace/FasterTransformer/build (main) (DISPLAY=)
     ././bin/encoder_gemm 32 32 12 64 0 0                                                                                                               130 ↵
    ./bin/encoder_sample 32 12 32 12 64 0 0 0 0
    
    Device Tesla V100-SXM2-16GB
    ***Encoder Gemm Testing Begin***
    ***Cublas Gemm Testing Begin***
    
    -----------------------------
    GEMM test 0: [M: 1024, K: 768, N: 768] from_tensor * weightQ/K/V, attr * output_kernel
    algo_-1 costs 0.123ms
    algo_2 costs 0.129ms
    algo_3 costs 0.165ms
    algo_4 costs 0.124ms
    algo_5 costs 0.160ms
    algo_6 costs 0.159ms
    algo_7 costs 0.115ms
    algo_8 costs 0.132ms
    algo_9 costs 0.111ms
    algo_10 costs 0.128ms
    algo_11 costs 0.154ms
    algo_18 costs 0.119ms
    algo_19 costs 0.101ms
    algo_20 costs 0.117ms
    algo_21 costs 0.130ms
    algo_22 costs 0.112ms
    algo_23 costs 0.115ms
    fast_algo 19 costs 0.101 ms
    
    -----------------------------
    GEMM test 1: [M: 1024, K: 768, N: 3072] attr_output * inter_kernel
    algo_-1 costs 0.384ms
    algo_2 costs 0.407ms
    algo_3 costs 0.410ms
    algo_4 costs 0.365ms
    algo_5 costs 0.343ms
    algo_6 costs 0.406ms
    algo_7 costs 0.366ms
    algo_8 costs 0.367ms
    algo_9 costs 0.348ms
    algo_10 costs 0.351ms
    algo_11 costs 0.349ms
    algo_18 costs 0.439ms
    algo_19 costs 0.383ms
    algo_20 costs 0.369ms
    algo_21 costs 0.444ms
    algo_22 costs 0.391ms
    algo_23 costs 0.431ms
    fast_algo 5 costs 0.343 ms
    
    -----------------------------
    GEMM test 2: [M: 1024, K: 3072, N: 768] inter_matmul * output_kernel
    algo_-1 costs 0.366ms
    algo_1 costs 0.631ms
    algo_2 costs 0.414ms
    algo_3 costs 0.543ms
    algo_4 costs 0.402ms
    algo_5 costs 0.525ms
    algo_6 costs 0.525ms
    algo_7 costs 0.361ms
    algo_8 costs 0.377ms
    algo_9 costs 0.347ms
    algo_10 costs 0.370ms
    algo_11 costs 0.375ms
    algo_18 costs 0.444ms
    algo_19 costs 0.360ms
    algo_20 costs 0.404ms
    algo_21 costs 0.441ms
    algo_22 costs 0.383ms
    algo_23 costs 0.407ms
    fast_algo 9 costs 0.347 ms
    
    -----------------------------
    GEMM test 3: [M: 32, K: 64, N: 32] attention batched Gemm1
    algo_-1 costs 0.029ms
    algo_0 costs 0.022ms
    algo_1 costs 0.026ms
    algo_2 costs 0.022ms
    algo_3 costs 0.036ms
    algo_4 costs 0.066ms
    algo_5 costs 0.022ms
    algo_6 costs 0.026ms
    algo_7 costs 0.022ms
    algo_8 costs 0.036ms
    algo_9 costs 0.066ms
    algo_18 costs 0.019ms
    algo_19 costs 0.024ms
    algo_20 costs 0.039ms
    algo_21 costs 0.019ms
    algo_22 costs 0.024ms
    algo_23 costs 0.039ms
    fast_algo 21 costs 0.019 ms
    
    -----------------------------
    GEMM test 4: [M: 32, K: 32, N: 64] attention batched Gemm2
    algo_-1 costs 0.017ms
    algo_0 costs 0.024ms
    algo_1 costs 0.017ms
    algo_2 costs 0.016ms
    algo_3 costs 0.023ms
    algo_4 costs 0.040ms
    algo_5 costs 0.024ms
    algo_6 costs 0.017ms
    algo_7 costs 0.016ms
    algo_8 costs 0.023ms
    algo_9 costs 0.040ms
    algo_18 costs 0.022ms
    algo_19 costs 0.020ms
    algo_20 costs 0.031ms
    algo_21 costs 0.022ms
    algo_22 costs 0.020ms
    algo_23 costs 0.031ms
    fast_algo 7 costs 0.016 ms
    
    -----------------------------
    GEMM test 5: [M: 1024, K: 768, N: 768] from_tensor * weight_QKV in BatchGemm
    algo_-1 costs 0.273ms
    algo_0 costs 0.286ms
    algo_1 costs 0.272ms
    algo_2 costs 0.275ms
    algo_3 costs 0.267ms
    algo_4 costs 0.270ms
    algo_5 costs 0.286ms
    algo_6 costs 0.273ms
    algo_7 costs 0.274ms
    algo_8 costs 0.267ms
    algo_9 costs 0.270ms
    algo_18 costs 0.309ms
    algo_19 costs 0.277ms
    algo_20 costs 0.294ms
    algo_21 costs 0.309ms
    algo_22 costs 0.277ms
    algo_23 costs 0.294ms
    fast_algo 8 costs 0.267 ms
    ***cublas Gemm Testing End***
    
    ***Encoder Gemm Testing End***
    Device Tesla V100-SXM2-16GB
    Device Tesla V100-SXM2-16GB
    before allocate free 15.44 GB total 15.78 GB
    After allocate free 15.40 GB used 0.38 GB total 15.78 GB
    [INFO] batch_size 32 seq_len 32 layer 12 FT-CPP-time 15.40 ms ( 50 iterations)
    [email protected]5537c320a-master-0:/workspace/FasterTransformer/build (main) (DISPLAY=)
     $ ./bin/encoder_gemm 32 32 12 64 0 0
    python pytorch/encoder_sample.py 32 12 32 12 64 --time
    
    Device Tesla V100-SXM2-16GB
    ***Encoder Gemm Testing Begin***
    ***Cublas Gemm Testing Begin***
    
    -----------------------------
    GEMM test 0: [M: 1024, K: 768, N: 768] from_tensor * weightQ/K/V, attr * output_kernel
    algo_-1 costs 0.109ms
    algo_2 costs 0.114ms
    algo_3 costs 0.145ms
    algo_4 costs 0.109ms
    algo_5 costs 0.141ms
    algo_6 costs 0.141ms
    algo_7 costs 0.103ms
    algo_8 costs 0.116ms
    algo_9 costs 0.097ms
    algo_10 costs 0.111ms
    algo_11 costs 0.143ms
    algo_18 costs 0.113ms
    algo_19 costs 0.096ms
    algo_20 costs 0.111ms
    algo_21 costs 0.124ms
    algo_22 costs 0.107ms
    algo_23 costs 0.108ms
    fast_algo 19 costs 0.096 ms
    
    -----------------------------
    GEMM test 1: [M: 1024, K: 768, N: 3072] attr_output * inter_kernel
    algo_-1 costs 0.361ms
    algo_2 costs 0.384ms
    algo_3 costs 0.404ms
    algo_4 costs 0.366ms
    algo_5 costs 0.344ms
    algo_6 costs 0.407ms
    algo_7 costs 0.366ms
    algo_8 costs 0.367ms
    algo_9 costs 0.349ms
    algo_10 costs 0.351ms
    algo_11 costs 0.347ms
    algo_18 costs 0.439ms
    algo_19 costs 0.382ms
    algo_20 costs 0.367ms
    algo_21 costs 0.442ms
    algo_22 costs 0.391ms
    algo_23 costs 0.431ms
    fast_algo 5 costs 0.344 ms
    
    -----------------------------
    GEMM test 2: [M: 1024, K: 3072, N: 768] inter_matmul * output_kernel
    algo_-1 costs 0.365ms
    algo_1 costs 0.629ms
    algo_2 costs 0.414ms
    algo_3 costs 0.543ms
    algo_4 costs 0.402ms
    algo_5 costs 0.526ms
    algo_6 costs 0.530ms
    algo_7 costs 0.363ms
    algo_8 costs 0.377ms
    algo_9 costs 0.348ms
    algo_10 costs 0.370ms
    algo_11 costs 0.375ms
    algo_18 costs 0.445ms
    algo_19 costs 0.357ms
    algo_20 costs 0.404ms
    algo_21 costs 0.442ms
    algo_22 costs 0.384ms
    algo_23 costs 0.406ms
    fast_algo 9 costs 0.348 ms
    
    -----------------------------
    GEMM test 3: [M: 32, K: 64, N: 32] attention batched Gemm1
    algo_-1 costs 0.029ms
    algo_0 costs 0.022ms
    algo_1 costs 0.026ms
    algo_2 costs 0.022ms
    algo_3 costs 0.036ms
    algo_4 costs 0.066ms
    algo_5 costs 0.022ms
    algo_6 costs 0.026ms
    algo_7 costs 0.022ms
    algo_8 costs 0.036ms
    algo_9 costs 0.066ms
    algo_18 costs 0.019ms
    algo_19 costs 0.024ms
    algo_20 costs 0.039ms
    algo_21 costs 0.019ms
    algo_22 costs 0.024ms
    algo_23 costs 0.039ms
    fast_algo 21 costs 0.019 ms
    
    -----------------------------
    GEMM test 4: [M: 32, K: 32, N: 64] attention batched Gemm2
    algo_-1 costs 0.017ms
    algo_0 costs 0.024ms
    algo_1 costs 0.017ms
    algo_2 costs 0.016ms
    algo_3 costs 0.023ms
    algo_4 costs 0.040ms
    algo_5 costs 0.024ms
    algo_6 costs 0.017ms
    algo_7 costs 0.016ms
    algo_8 costs 0.023ms
    algo_9 costs 0.040ms
    algo_18 costs 0.022ms
    algo_19 costs 0.020ms
    algo_20 costs 0.031ms
    algo_21 costs 0.022ms
    algo_22 costs 0.020ms
    algo_23 costs 0.031ms
    fast_algo 7 costs 0.016 ms
    
    -----------------------------
    GEMM test 5: [M: 1024, K: 768, N: 768] from_tensor * weight_QKV in BatchGemm
    algo_-1 costs 0.273ms
    algo_0 costs 0.286ms
    algo_1 costs 0.273ms
    algo_2 costs 0.274ms
    algo_3 costs 0.267ms
    algo_4 costs 0.270ms
    algo_5 costs 0.286ms
    algo_6 costs 0.273ms
    algo_7 costs 0.276ms
    algo_8 costs 0.267ms
    algo_9 costs 0.270ms
    algo_18 costs 0.309ms
    algo_19 costs 0.277ms
    algo_20 costs 0.294ms
    algo_21 costs 0.309ms
    algo_22 costs 0.277ms
    algo_23 costs 0.294ms
    fast_algo 8 costs 0.267 ms
    ***cublas Gemm Testing End***
    
    ***Encoder Gemm Testing End***
    2021-09-07 15:58:23.469196: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
    WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
    
    =============== Argument ===============
    batch_size: 32
    layer_num: 12
    seq_len: 32
    head_num: 12
    head_size: 64
    hidden_dim: 768
    weight_path: None
    use_fp16: False
    int8_mode: 0
    avg_seq_len: -1
    test_time: True
    remove_padding: False
    allow_gemm_test: False
    ========================================
    
    tensor([[[-1.5207,  1.6938, -0.9449,  ...,  1.0584, -0.0928,  1.0019],
             [-1.5207,  1.6938, -0.9449,  ...,  1.0584, -0.0928,  1.0019],
             [-1.5207,  1.6938, -0.9449,  ...,  1.0584, -0.0928,  1.0019],
             ...,
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000]],
    
            [[-1.5207,  1.6939, -0.9454,  ...,  1.0586, -0.0935,  1.0023],
             [-1.5207,  1.6939, -0.9454,  ...,  1.0586, -0.0935,  1.0023],
             [-1.5207,  1.6939, -0.9454,  ...,  1.0586, -0.0935,  1.0023],
             ...,
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000]],
    
            [[-1.5212,  1.6943, -0.9453,  ...,  1.0586, -0.0929,  1.0017],
             [-1.5212,  1.6943, -0.9453,  ...,  1.0586, -0.0929,  1.0017],
             [-1.5212,  1.6943, -0.9453,  ...,  1.0586, -0.0929,  1.0017],
             ...,
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000]],
    
            ...,
    
            [[-1.5202,  1.6939, -0.9454,  ...,  1.0585, -0.0928,  1.0020],
             [-1.5202,  1.6939, -0.9454,  ...,  1.0585, -0.0928,  1.0020],
             [-1.5202,  1.6939, -0.9454,  ...,  1.0585, -0.0928,  1.0020],
             ...,
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000]],
    
            [[-1.5209,  1.6946, -0.9452,  ...,  1.0585, -0.0928,  1.0021],
             [-1.5209,  1.6946, -0.9452,  ...,  1.0585, -0.0928,  1.0021],
             [-1.5209,  1.6946, -0.9452,  ...,  1.0585, -0.0928,  1.0021],
             ...,
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000]],
    
            [[-1.5211,  1.6941, -0.9452,  ...,  1.0587, -0.0927,  1.0020],
             [-1.5211,  1.6941, -0.9452,  ...,  1.0587, -0.0927,  1.0020],
             [-1.5211,  1.6941, -0.9452,  ...,  1.0587, -0.0927,  1.0020],
             ...,
             [-1.5211,  1.6941, -0.9452,  ...,  1.0587, -0.0927,  1.0020],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000],
             [-0.0000,  0.0000, -0.0000,  ...,  0.0000, -0.0000,  0.0000]]],
           device='cuda:0')
    torch.Size([32, 32, 768])
    Traceback (most recent call last):
      File "pytorch/encoder_sample.py", line 257, in <module>
        main()
      File "pytorch/encoder_sample.py", line 171, in main
        ft_output = custom_encoder(inp, mask, mem_seq_lens)[0] * output_mask
      File "/home/amralaa/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
    RuntimeError: The following operation failed in the TorchScript interpreter.
    Traceback of TorchScript (most recent call last):
      File "/workspace/FasterTransformer/build/pytorch/utils/encoder.py", line 200, in forward
                trt_seq_len = torch.cat([trt_seq_len, torch.tensor([batch * max_seq_len], device='cuda').to(trt_seq_len.dtype)], dim=0).to(torch.int32)
            for i in range(self.layer_num):
                hidden_states = self.encoders[i].forward(hidden_states, attention_mask, trt_seq_len, sequence_id_offset)
                                ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            if self.remove_padding:
                hidden_states = self.rebuild_padding(hidden_states, sequence_id_offset, attention_mask, 0)
    RuntimeError: [FT][ERROR] CUDA runtime error: CUBLAS_STATUS_INVALID_VALUE /workspace/FasterTransformer/fastertransformer/cuda/open_attention.h:939
    
    
    
    [email protected]:/workspace/FasterTransformer/build (main) (DISPLAY=)
     $ nvidia-smi
    Tue Sep  7 16:34:08 2021
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00001D0E:00:00.0 Off |                    0 |
    | N/A   32C    P0    43W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM2...  On   | 00003D8F:00:00.0 Off |                    0 |
    | N/A   33C    P0    43W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM2...  On   | 00006B96:00:00.0 Off |                    0 |
    | N/A   32C    P0    43W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM2...  On   | 000084FE:00:00.0 Off |                    0 |
    | N/A   33C    P0    43W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   4  Tesla V100-SXM2...  On   | 0000C3F5:00:00.0 Off |                    0 |
    | N/A   34C    P0    43W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   5  Tesla V100-SXM2...  On   | 0000DCCF:00:00.0 Off |                    0 |
    | N/A   33C    P0    44W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   6  Tesla V100-SXM2...  On   | 0000E931:00:00.0 Off |                    0 |
    | N/A   32C    P0    44W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   7  Tesla V100-SXM2...  On   | 0000F01A:00:00.0 Off |                    0 |
    | N/A   33C    P0    42W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    [email protected]:/workspace/FasterTransformer/build (main) (DISPLAY=)
     $ gcc --version
    gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
    Copyright (C) 2019 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    
    [email protected]:/workspace/FasterTransformer/build (main) (DISPLAY=)
     $ g++ --version
    g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
    Copyright (C) 2019 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    
    [email protected]:/workspace/FasterTransformer/build (main) (DISPLAY=)
     $ nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2020 NVIDIA Corporation
    Built on Mon_Oct_12_20:09:46_PDT_2020
    Cuda compilation tools, release 11.1, V11.1.105
    Build cuda_11.1.TC455_06.29190527_0
    [email protected]:/workspace/FasterTransformer/build (main) (DISPLAY=)
     $ dpkg -l | grep -P 'nvidia|cuda'
    ii  cuda-compat-11-1                       455.32.00-1                       amd64        CUDA Compatibility Platform
    ii  cuda-cudart-11-1                       11.1.74-1                         amd64        CUDA Runtime native Libraries
    ii  cuda-cudart-dev-11-1                   11.1.74-1                         amd64        CUDA Runtime native dev links, headers
    ii  cuda-cuobjdump-11-1                    11.1.74-1                         amd64        CUDA cuobjdump
    ii  cuda-cupti-11-1                        11.1.105-1                        amd64        CUDA profiling tools runtime libs.
    ii  cuda-cupti-dev-11-1                    11.1.105-1                        amd64        CUDA profiling tools interface.
    ii  cuda-driver-dev-11-1                   11.1.74-1                         amd64        CUDA Driver native dev stub library
    ii  cuda-gdb-11-1                          11.1.105-1                        amd64        CUDA-GDB
    ii  cuda-memcheck-11-1                     11.1.105-1                        amd64        CUDA-MEMCHECK
    ii  cuda-nvcc-11-1                         11.1.105-1                        amd64        CUDA nvcc
    ii  cuda-nvdisasm-11-1                     11.1.74-1                         amd64        CUDA disassembler
    ii  cuda-nvml-dev-11-1                     11.1.74-1                         amd64        NVML native dev links, headers
    ii  cuda-nvprof-11-1                       11.1.105-1                        amd64        CUDA Profiler tools
    ii  cuda-nvprune-11-1                      11.1.74-1                         amd64        CUDA nvprune
    ii  cuda-nvrtc-11-1                        11.1.105-1                        amd64        NVRTC native runtime libraries
    ii  cuda-nvrtc-dev-11-1                    11.1.105-1                        amd64        NVRTC native dev links, headers
    ii  cuda-nvtx-11-1                         11.1.74-1                         amd64        NVIDIA Tools Extension
    ii  cuda-sanitizer-11-1                    11.1.105-1                        amd64        CUDA Sanitizer
    ii  libcudnn8                              8.0.5.43-1+cuda11.1               amd64        cuDNN runtime libraries
    ii  libcudnn8-dev                          8.0.5.43-1+cuda11.1               amd64        cuDNN development libraries and headers
    ii  libnccl-dev                            2.8.3-1+cuda11.1                  amd64        NVIDIA Collective Communication Library (NCCL) Development Files
    ii  libnccl2                               2.8.3-1+cuda11.1                  amd64        NVIDIA Collective Communication Library (NCCL) Runtime
    ii  libnvinfer-bin                         7.2.2-1+cuda11.1                  amd64        TensorRT binaries
    ii  libnvinfer-dev                         7.2.2-1+cuda11.1                  amd64        TensorRT development libraries and headers
    ii  libnvinfer-plugin-dev                  7.2.2-1+cuda11.1                  amd64        TensorRT plugin libraries and headers
    ii  libnvinfer-plugin7                     7.2.2-1+cuda11.1                  amd64        TensorRT plugin library
    ii  libnvinfer7                            7.2.2-1+cuda11.1                  amd64        TensorRT runtime libraries
    ii  libnvonnxparsers-dev                   7.2.2-1+cuda11.1                  amd64        TensorRT ONNX libraries
    ii  libnvonnxparsers7                      7.2.2-1+cuda11.1                  amd64        TensorRT ONNX libraries
    ii  libnvparsers-dev                       7.2.2-1+cuda11.1                  amd64        TensorRT parsers libraries
    ii  libnvparsers7                          7.2.2-1+cuda11.1                  amd64        TensorRT parsers libraries
    
  • [FasterTransformer/V2] No speedUp when the sequence length is large

    [FasterTransformer/V2] No speedUp when the sequence length is large

    Hello, I implemented a GPT-2 model based on FT and then compared the performance of PyTorch (Fairseq) and FT. These are the results:

    Setting: batch = 1, hidden_units = 1024, head_num = 16, size_per_head = 64; time in ms.

    seq_len     8     16    32    64    128   256    512   800
    PyTorch     21    23    23    23    28    23.6   22.6  24
    FT          6.2   6.4   6.7   7.6   8.6   12.3   24    34.7

    From this table it seems that FT is much worse than PyTorch when the sequence length is larger than about 500, which is hard to accept, since FT is expected to be faster. Also, why does the PyTorch time hardly change as the sequence length increases? I see the same behavior when testing the masked BERT model. Can anyone help? (A benchmarking sketch follows below.)
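
    One common reason for PyTorch timings that stay flat as the sequence length grows is a benchmark that does not synchronize with the GPU (so it mostly measures launch overhead) or that skips warm-up. The sketch below is not from the original report; it only illustrates one way to time both implementations fairly with explicit CUDA synchronization, and the model/op names in the commented usage are hypothetical.

    import time
    import torch

    def benchmark(fn, *args, iters=50, warmup=10):
        # Warm up so one-time costs (allocator, autotuning) are excluded.
        for _ in range(warmup):
            fn(*args)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            fn(*args)
        torch.cuda.synchronize()  # wait for all queued kernels to finish
        return (time.time() - start) / iters * 1000.0  # ms per iteration

    # Hypothetical usage: compare a PyTorch model and the FT op on the same input.
    # for seq_len in (8, 64, 512, 800):
    #     x = torch.randn(1, seq_len, 1024, dtype=torch.float16, device="cuda")
    #     print(seq_len, benchmark(pytorch_model, x), benchmark(ft_op, x))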

  • [FastTransformer v3.1/TensorFlow] Get CUBLAS_STATUS_INTERNAL_ERROR when run tensorflow/gpt2-sample.py

    [FastTransformer v3.1/TensorFlow] Get CUBLAS_STATUS_INTERNAL_ERROR when run tensorflow/gpt2-sample.py

    Related to FastTransformer v3.1/TensorFlow/GPT-2

    Describe the bug: If I run ./bin/decoding_gemm 4 1 12 64 50257 32 768 0 before python tensorflow/gpt2_sample.py, I get Internal: [FT][ERROR] CUDA runtime error: CUBLAS_STATUS_INTERNAL_ERROR FasterTransformer/fastertransformer/cuda/open_decoder.cu:1708. However, if I do not run ./bin/decoding_gemm 4 1 12 64 50257 32 768 0 first (i.e. the default GEMM algorithms are used), everything works. (A sketch for removing or regenerating gemm_config.in follows after the environment details below.)

    To Reproduce: Steps to reproduce the behavior:

    1. nvidia-docker run -it -v local_dir:container_dir nvcr.io/nvidia/tensorflow:19.06-py3 bash
    2. cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_TF=ON -DTF_PATH=/usr/local/lib/python3.5/dist-packages/tensorflow ..
    3. make
    4. ./bin/decoding_gemm 4 1 12 64 50257 32 768 0
    5. python tensorflow/gpt2_sample.py

    Expected behavior: There should be no error.

    Environment: Please provide at least:

    • Container version: nvcr.io/nvidia/tensorflow:19.06-py3
    • GPUs in the system: 8x Tesla V100-32GB
    • CUDA driver version: 435.21
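
    A detail worth checking here (not part of the original report): the decoding_gemm tool writes its tuned algorithm choices to gemm_config.in in the working directory, and the op falls back to default GEMM algorithms when that file is absent, which matches the observation that skipping decoding_gemm works. A minimal sketch, assuming the build path below and taking the argument values from the reproduction step:

    import os
    import subprocess

    BUILD_DIR = "/workspace/FasterTransformer/build"  # adjust to your build path
    CONFIG = os.path.join(BUILD_DIR, "gemm_config.in")

    def use_default_gemm():
        # Remove a stale tuned config so the default algorithms are used.
        if os.path.exists(CONFIG):
            os.remove(CONFIG)

    def regenerate_gemm_config(args=(4, 1, 12, 64, 50257, 32, 768, 0)):
        # Re-run the tuner with parameters matching the actual run
        # (the values here are the ones from the reproduction step above).
        subprocess.run([os.path.join(BUILD_DIR, "bin", "decoding_gemm"), *map(str, args)],
                       cwd=BUILD_DIR, check=True)
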
  • Why are Effective-FT's results inconsistent with the actual results?

    Why are Effective-FT's results inconsistent with the actual results?

    Why does the code in sample/cpp/encoder_sample.cc give inconsistent results when is_remove_padding is set to true versus false for the same input? Comment: with 'is_remove_padding' = False, the result is the same as the actual result.

    For example, print_to_file(encoder_param.transformer_out) gives: actual result: -0.222001, 1.049087, -0.278886, -0.451816 ...; is_remove_padding = False: same as the actual result; is_remove_padding = True: 1.979067, 1.000596, 0.830792, 0.121805.

    Can anyone tell me why? I'm very confused, please help. Thanks!
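
    For context, Effective FasterTransformer-style padding removal gathers only the valid tokens before the transformer layers and scatters the results back afterwards, so values stored at padded positions are not meaningful and should be excluded from the comparison. The NumPy sketch below is only a conceptual illustration of that gather/scatter, not FasterTransformer's actual implementation.

    import numpy as np

    batch, max_len, hidden = 2, 4, 3
    seq_lens = np.array([4, 2])                       # the second sequence has 2 padded positions
    x = np.random.randn(batch, max_len, hidden).astype(np.float32)

    # gather only the valid (non-padded) tokens
    valid = [(b, t) for b in range(batch) for t in range(seq_lens[b])]
    packed = np.stack([x[b, t] for b, t in valid])    # [num_valid_tokens, hidden]

    out_packed = packed * 2.0                         # stand-in for the transformer layers

    # scatter back; padded positions are left as zeros, i.e. undefined for comparison
    out = np.zeros_like(x)
    for row, (b, t) in enumerate(valid):
        out[b, t] = out_packed[row]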

  • Performance Degradation when using FP16

    Performance Degradation when using FP16

    Information

    I want to run the GPT-J model in FP16 precision (https://huggingface.co/EleutherAI/gpt-j-6B/tree/float16) on FasterTransformer + Triton, but I have trouble with the accuracy. For example, the following sentences are generated when following the sample scripts with FasterTransformer.

    • sample scripts https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/GPT-J-6B/Inference_with_GPT_J_6B.ipynb#scrollTo=RdOynYcY8jb1

    • generated sentences

    Generated: The Belgian national football team  is the national football team of Belgium. It is controlled by the Belgian Football Association of the Belgian Football Association of Football Association (Federation of Football Federation (Federation of Wallonia, the Belgian Football Association (Federation (Federation (Federation (Federation (Federation) and the Belgian Football Association (Federation) and the Belgian Football Association (Federation (Federation) and the Belgian Football Association (Federation (Federation) and the Belgian Football Association (Federation (Federation) (Federation) and the Belgian Football Association (Federation) (Federation) (Federation) (Federation (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Fé) (Fé) (
    

    Environment

    • Platform: GCP A100 instance
    • NVIDIA Driver Version: 495.44
    • Docker version 20.10.12, build e91ed57

    To reproduce

    1. Download the GPT-J-6B model in float16 (pytorch_model.bin), rename it to gpt-j.pt, and store it as gpt-j/gpt-j.pt: https://huggingface.co/EleutherAI/gpt-j-6B/tree/float16

    2. Convert the PyTorch model to FasterTransformer format via the following script on the Docker image nvcr.io/nvidia/pytorch:21.07-py3.

    from argparse import ArgumentParser
    from io import BytesIO
    from os import makedirs
    import numpy as np
    import torch
    
    torch.set_printoptions(linewidth=130, sci_mode=False)
    np.set_printoptions(linewidth=130, suppress=True)
    
    
    def reshard(x, old_shape):
        import jax.numpy as jnp
        if len(x.shape) == 1:
            out = x[0:1]
    
        elif len(x.shape) == 2:
            if (x[1:] == x[-1]).all():
                if (x[1:] == 0).all() or (x[1:] == 1).all():
                    out = x[0:1]
                else:
                    out = x[0:1] * 8#* x.shape[0] / old_shape[0]
            else:
                out = x.reshape(old_shape)
    
        elif len(x.shape) == 3:
            if x.shape[0] * x.shape[2] == old_shape[2]:
                out = jnp.transpose(x, (1, 0, 2)).reshape(old_shape)
            elif x.shape[0] * x.shape[1] == old_shape[1]:
                out = x.reshape(old_shape)
            else:
                raise Exception(f"unimplemented, {x.shape}, {old_shape}")
        else:
            raise Exception(f"unimplemented, {x}")
        return out
    
    
    def get_old_shape(t, dim=2):
        if len(t.shape) == 3:
            shard_shape = t.shape
            if dim == 1:
                return (shard_shape[0] * shard_shape[1], shard_shape[2])
            elif dim == 2:
                return (shard_shape[1], shard_shape[0] * shard_shape[2])
            else:
                raise ValueError(f"unsupported dim {dim}")
        if len(t.shape) == 2:
            return (t.shape[1] * t.shape[0],)
        else:
            raise ValueError(f"unsupported shape {t.shape}")
    
    
    def read_shard(ckpt_dir, idx):
        out = []
        file_path = ckpt_dir + f"{idx}.npz"
        with open(file_path, "rb") as f:
            buf = f.read()
            f_io = BytesIO(buf)
            deserialized = np.load(f_io)
            for i in deserialized:
                out.append(deserialized[i])
        return out
    
    
    def savebin(param, save_path):
        if isinstance(param, torch.Tensor):
            param = param.cpu().float().numpy()
        np.squeeze(param).astype(np.float32).tofile(save_path + ".bin")
    
    
    def param2file(pt_param, layer_id, save_dir, dest_key):
        base_n = save_dir + "/model.layers." + str(layer_id) + "."
        save_path = base_n + dest_key
        savebin(pt_param, save_path)
    
    
    def param2distributed(
        pt_param,
        layer_id,
        save_dir,
        dest_key,
        n_inference_gpus,
        split_axis,
    ):
        np_param = pt_param.cpu().float().numpy()
        base_n = save_dir + "/model.layers." + str(layer_id) + "."
        save_path = base_n + dest_key
        split_param = np.split(np_param, n_inference_gpus, axis=split_axis)
        for i, p in enumerate(split_param):
            savebin(p, save_path + f".{i}")
    
    
    def save(w, save_dir, n_inference_gpus=1, num_layers=28):
        makedirs(save_dir, exist_ok=True)
        savebin(w['transformer.wte.weight'], save_dir + "/model.wte")
        for l in range(num_layers):
            print(f"Saving layer {l} / 28")
            base_k = "transformer.h." + str(l) + "."
            param2file(
              w[base_k + "ln_1.bias"],
              l, save_dir, "input_layernorm.bias"
            )
            param2file(
              w[base_k + "ln_1.weight"],
              l, save_dir, "input_layernorm.weight"
            )
            param2distributed(
              w[base_k + "mlp.fc_in.weight"].T, # fc_in weight
              l, save_dir, "mlp.dense_h_to_4h.weight",
              n_inference_gpus, split_axis=-1 # split fast indx
            )
            param2distributed(
              w[base_k + "mlp.fc_in.bias"], # fc_in bias
              l, save_dir, "mlp.dense_h_to_4h.bias",
              n_inference_gpus, split_axis=-1 # split fast indx
            )
    
            param2distributed(
              w[base_k + "mlp.fc_out.weight"].T, # fc_out weight
              l, save_dir, "mlp.dense_4h_to_h.weight",
              n_inference_gpus, split_axis=0  # split slow indx
            )
            param2file(
              w[base_k + "mlp.fc_out.bias"], # fc_out bias
              l, save_dir, "mlp.dense_4h_to_h.bias"
            )
            param2distributed(
              w[base_k + "attn.out_proj.weight"].T,
              l, save_dir, "attention.dense.weight",
              n_inference_gpus, split_axis=0  # split slow indx
            )
            QKV_w = torch.stack([
              w[base_k + "attn.q_proj.weight"],
              w[base_k + "attn.k_proj.weight"],
              w[base_k + "attn.v_proj.weight"],
            ]) # [qkv, n_heads * dim_head, latent_space]
            QKV_w = QKV_w.permute(2, 0, 1)
            param2distributed(
              QKV_w, l, save_dir, "attention.query_key_value.weight",
              n_inference_gpus, split_axis=-1 # split fast indx
            )
            # Other unneeded per-layer params:
            # attn.attention.masked_bias = torch.tensor(-1e9)
            # attn.attention.bias = torch.tril(torch.ones(1, 1, 2048, 2048))
        savebin(w['transformer.ln_f.weight'], save_dir + "/model.final_layernorm.weight")
        savebin(w['transformer.ln_f.bias'], save_dir + "/model.final_layernorm.bias")
        # lm head fast index should be hidden layer size, not vocab:
        savebin(w['lm_head.weight'], save_dir + "/model.lm_head.weight")
        savebin(w['lm_head.bias'], save_dir + "/model.lm_head.bias")
    
    
    if __name__ == "__main__":
        parser = ArgumentParser(
            description="Convert GPT-J slim checkpoint to FasterTransformer",
        )
        parser.add_argument(
            "--to", default="triton-model-store/fastertransformer/1/gpt-j-6b/"
        )
        parser.add_argument(
            "--f", default="gpt-j/gpt-j.pt"
        )
        args = parser.parse_args()
    
        print("loading")
        in_path = args.f
        output_dir = args.to
    
        if len(in_path)>3 and in_path[-3:] == ".pt":
            checkpoint = torch.load(in_path)
        else:
            raise ValueError("plz give **.pt file")
    
        print("saving")
        save(checkpoint, output_dir)
        print("done")
    
    
    3. Add the following config.pbtxt as triton-model-store/fastertransformer/config.pbtxt. Note that I have changed the temperature to 0.9 from the sample GPT-J config.
    name: "fastertransformer"
    backend: "fastertransformer"
    default_model_filename: "gpt-j-6b"
    max_batch_size: 128
    input [
      {
        name: "INPUT_ID"
        data_type: TYPE_UINT32
        dims: [ -1, -1 ]
      },
      {
        name: "REQUEST_INPUT_LEN"
        data_type: TYPE_UINT32
        dims: [ 1 ]
      },
      {
        name: "REQUEST_OUTPUT_LEN"
        data_type: TYPE_UINT32
        dims: [ 1 ]
      }
    ]
    output [
      {
        name: "OUTPUT0"
        data_type: TYPE_UINT32
        dims: [ -1, -1 ]
      }
    ]
    instance_group [
      {
        count: 1
        kind : KIND_CPU
      }
    ]
    parameters {
      key: "top_k"
      value: {
        string_value: "1"
      }
    }
    parameters {
      key: "top_p"
      value: {
        string_value: "0.0"
      }
    }
    parameters {
      key: "tensor_para_size"
      value: {
        string_value: "1"
      }
    }
    parameters {
      key: "pipeline_para_size"
      value: {
        string_value: "1"
      }
    }
    parameters {
      key: "max_input_len"
      value: {
        string_value: "512"
      }
    }
    parameters {
      key: "max_seq_len"
      value: {
        string_value: "528"
      }
    }
    parameters {
      key: "is_half"
      value: {
        string_value: "1"
      }
    }
    parameters {
      key: "head_num"
      value: {
        string_value: "16"
      }
    }
    parameters {
      key: "size_per_head"
      value: {
        string_value: "256"
      }
    }
    parameters {
      key: "inter_size"
      value: {
        string_value: "16384"
      }
    }
    parameters {
      key: "rotary_embedding"
      value: {
        string_value: "64"
      }
    }
    parameters {
      key: "vocab_size"
      value: {
        string_value: "50400"
      }
    }
    parameters {
      key: "start_id"
      value: {
        string_value: "50256"
      }
    }
    parameters {
      key: "end_id"
      value: {
        string_value: "50256"
      }
    }
    parameters {
      key: "decoder_layers"
      value: {
        string_value: "28"
      }
    }
    parameters {
      key: "model_name"
      value: {
        string_value: "gpt-j-6b"
      }
    }
    parameters {
      key: "beam_width"
      value: {
        string_value: "1"
      }
    }
    parameters {
      key: "temperature"
      value: {
        string_value: "0.9"
      }
    }
    parameters {
      key: "repetition_penalty"
      value: {
        string_value: "1.0"
      }
    }
    parameters {
      key: "len_penalty"
      value: {
        string_value: "1.0"
      }
    }
    parameters {
      key: "beam_search_diversity_rate"
      value: {
        string_value: "0.0"
      }
    }
    dynamic_batching {
      preferred_batch_size: [4, 8]
      max_queue_delay_microseconds: 200000
    }
    parameters {
      key: "model_type"
      value: {
        string_value: "GPT-J"
      }
    }
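
    As a quick, editorial sanity check of the dimension parameters above (assuming the usual GPT-J-6B hyperparameters: 16 heads of size 256, hidden size 4096, feed-forward size 4x hidden, 28 layers, rotary dimension 64, padded vocabulary 50400), the values in the config should satisfy the relations below.

    head_num, size_per_head = 16, 256
    inter_size, rotary_embedding = 16384, 64
    vocab_size, decoder_layers = 50400, 28

    hidden_size = head_num * size_per_head
    assert hidden_size == 4096                    # GPT-J-6B hidden size
    assert inter_size == 4 * hidden_size          # GPT-J feed-forward width
    assert rotary_embedding == 64                 # rotary dimension used by GPT-J
    assert vocab_size == 50400 and decoder_layers == 28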
    
    4. Build and run the following Docker image:
    FROM nvcr.io/nvidia/tritonserver:21.07-py3
    
    ARG work_dir="/workspace"
    ARG lib_dir="/opt/tritonserver"
    WORKDIR ${lib_dir}
    
    # settings
    RUN apt-get update
    RUN apt-get install --yes python3-dev \
        rapidjson-dev
    
    RUN wget https://github.com/Kitware/CMake/releases/download/v3.21.1/cmake-3.21.1-linux-x86_64.tar.gz
    RUN tar -axf cmake-3.21.1-linux-x86_64.tar.gz
    ENV PATH=${lib_dir}/cmake-3.21.1-linux-x86_64/bin/:$PATH
    RUN pip3 install tritonclient[all] fire regex
    
    RUN git clone https://github.com/triton-inference-server/fastertransformer_backend.git -b dev/v1.1_beta
    RUN git clone https://github.com/NVIDIA/FasterTransformer.git -b dev/v5.0_beta
    RUN git clone https://github.com/triton-inference-server/server.git # We need some tools when we test this backend
    RUN ln -s server/qa/common .
    
    ENV CONTAINER_VERSION=21.07
    ENV TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}
    
    # install ft backend
    RUN mkdir -p fastertransformer_backend/build
    WORKDIR /opt/tritonserver/fastertransformer_backend/build
    RUN cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=1 \
            -DCMAKE_BUILD_TYPE=Release \
            -DCMAKE_INSTALL_PREFIX=/opt/tritonserver \
            -DTRITON_COMMON_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
            -DTRITON_CORE_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
            -DTRITON_BACKEND_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" ..
    
    RUN make -j install
    
    # model file settings
    WORKDIR ${work_dir}
    RUN pip3 install transformers
    
    • build

    docker build --tag $(tag) -f Dockerfile .

    • run (note that the current directory includes the triton-model-store directory)

    docker run --gpus all --rm -it \
      -p 8000:8000 -p 8001:8001 -p 8002:8002 \
      -v $(curr_dir):/workspace \
      $(tag) \
      bash

    5. In the above image, run Triton:

    mpirun -n 1 --allow-run-as-root tritonserver \
      --model-repository=/workspace/triton-model-store &

    6. Run the following chat.py via python3 chat.py:
    #!/usr/bin/python
    
    import argparse
    import numpy as np
    import os
    import re
    import sys
    import requests as httpreq
    from builtins import range
    import statistics as s
    import tritonclient.http as httpclient
    from tritonclient.utils import np_to_triton_dtype
    from transformers import AutoTokenizer
    
    
    def inference(input_data: np.ndarray, fixed_output_len: int) -> np.ndarray:
        """
        input_data: (batch_size, 1, sentence_len)
        """
        model_name = "fastertransformer"
        # shape
        input_len = np.array([[sentence.size] for sentence in input_data], np.uint32)
        output_len = np.ones_like(input_len).astype(np.uint32) * fixed_output_len
    
        with httpclient.InferenceServerClient(
            "localhost:8000",
            concurrency=1,
            verbose=True
        ) as client:
            inputs = [
                httpclient.InferInput("INPUT_ID", input_data.shape,
                                        np_to_triton_dtype(input_data.dtype)),
                httpclient.InferInput("REQUEST_INPUT_LEN", input_len.shape,
                                        np_to_triton_dtype(input_len.dtype)),
                httpclient.InferInput("REQUEST_OUTPUT_LEN", output_len.shape,
                                        np_to_triton_dtype(output_len.dtype))
            ]
            inputs[0].set_data_from_numpy(input_data)
            inputs[1].set_data_from_numpy(input_len)
            inputs[2].set_data_from_numpy(output_len)
            # requests.append(client.async_infer(model_name, inputs))
            print("send request")
            result = client.infer(model_name, inputs)
            return result.as_numpy("OUTPUT0")
    
    
    def gpt_j():
        tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
        prompt = "The Belgian national football team "
        tokens = tokenizer(prompt, return_tensors="np").input_ids.astype(np.uint32)
        tokens = tokens.reshape((1, 1, -1))
    
        FIXED_OUTPUT_LEN = 200
        last_tokens = inference(tokens, FIXED_OUTPUT_LEN)
        generated_text = tokenizer.decode(last_tokens[0][0])
        print("Generated:", generated_text)
    
    def main():
        gpt_j()
    
    if __name__ == '__main__':
        main()
    
    
    • output
    Generated: The Belgian national football team  is the national football team of Belgium. It is controlled by the Belgian Football Association of the Belgian Football Association of Football Association (Federation of Football Federation (Federation of Wallonia, the Belgian Football Association (Federation (Federation (Federation (Federation (Federation) and the Belgian Football Association (Federation) and the Belgian Football Association (Federation (Federation) and the Belgian Football Association (Federation (Federation) and the Belgian Football Association (Federation (Federation) (Federation) and the Belgian Football Association (Federation) (Federation) (Federation) (Federation (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Federation) (Fé) (Fé) (
    

    Expected Behavior

    The reference notebook below produces accurate sentences.

    Ref: https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/GPT-J-6B/Inference_with_GPT_J_6B.ipynb#scrollTo=RdOynYcY8jb1

    output:

    The Belgian national football team  (,, ), known as the Blue and Whites, represents Belgium in international football competitions organised by FIFA and the governing body for football in Belgium, the Royal Belgian Football Association (,, ). It is coached by former Netherlands international and UEFA Euro 1984 winner Dick Advocaat, who was appointed in January 2018 after the departure of Michel Preud'homme. The Belgium team has been a force in international football since the 1960s, winning the 1974 FIFA World Cup and Euro 2000. It also qualified for UEFA Euro 2020. The Belgium national team is based and plays its games in the Antwerp region, with Rupelstad Stadion, home to its first- and second-tier matches, as a regular venue.
    
    Belgium played its first official international match on 21 January 1920, losing 0–2 to the Netherlands in Rotterdam. Belgium and the Netherlands have played each other in 15 matches, with the Dutch winning 10 times and
    

    Related Issue

    NVIDIA/FasterTransformer#172

  • CUDA error: an illegal memory access was encountered

    CUDA error: an illegal memory access was encountered

    Hi, I tried to write my own kernel for a reduced sum, but I get strange illegal-access errors even though I'm sure the index is correct and in range. Here is the code:

    template<typename T>
    __global__ void reducedSum(T* out, T* in, const int max_input_length, const int batch_size, const int hidden_units)
    {
        int bid = blockIdx.x;
        int offset = max_input_length * hidden_units * bid;  
        // along sequence
        for (int j = 0; j < hidden_units; j++){
            T temp = 0;
            for (int i = 0; i < max_input_length; i++) {
                // printf("index %d %d %d %d\n", offset + (i * hidden_units) + j, bid, i, j);
                temp += in[offset + (i * hidden_units) + j];
                }
            // printf("Batch %d Indexout %d %f\n", bid, bid * hidden_units + j, (float)temp);
            out[bid * hidden_units + j] = temp;
        }
    }
    
    template<typename T>
    void invokeReducedSum(
        T* out, T* in, const int max_input_length, const int batch_size, const int hidden_units, cudaStream_t stream)
    {
        dim3 grid(batch_size);
        dim3 block(min(hidden_units, 1024));
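        // note: `block` is computed above but not used; the launch below runs one thread per block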
        reducedSum<T><<<grid, 1, 0, stream>>>(out, in, max_input_length, batch_size, hidden_units);
    }
    

    And some of the printout:

    index 992260 1 457 4
    index 4614148 8 410 4
    index 4114436 7 434 4
    index 1750020 3 173 4
    index 3179525 6 33 5
    index 2238468 4 138 4
    index 1496068 2 437 4
    index 5139460 9 411 4
    index 479236 0 468 4
    index 2806788 5 181 4
    Traceback (most recent call last):
    

    Any insights for this?

  • The fp16 inference of the PyTorch encoder with different batch sizes gives different outputs

    The fp16 inference of the PyTorch encoder with different batch sizes gives different outputs

    System and software:

    • FasterTransformer version: v4.0
    • GPU: T4
    • CUDA: 11.0
    • PyTorch: 1.8

    Issue description: I have an FP16 BERT model with 12 FasterTransformer encoder layers. When I run inference with the same input, the outputs of the model differ across batch sizes. The difference is on the order of 1e-3.

    batchsize = 1, part of the output after the 12 CustomEncoder layers: 0.4175 -0.4004 1.26 -0.03006 0.1311 1.4375

    batchsize = 3, part of the output after the 12 CustomEncoder layers: 0.416 -0.3975 1.264 -0.0305 0.1283 1.439

    How can I get deterministic results across batch sizes?
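
    Differences on the order of 1e-3 are within the expected rounding behavior of FP16, since changing the batch size can change GEMM tiling and reduction order. Below is a small sketch of comparing outputs with an FP16-appropriate tolerance; out_bs1 and out_bs3 are placeholder tensors standing in for the batch-size-1 and batch-size-3 runs, not data from the original report.

    import torch

    # placeholder outputs standing in for the two runs above
    out_bs1 = torch.randn(1, 128, 768, dtype=torch.float16)
    out_bs3 = out_bs1 + 1e-3 * torch.randn_like(out_bs1)

    # FP16 carries roughly 3 decimal digits, so compare with a loose tolerance
    max_diff = (out_bs1.float() - out_bs3.float()).abs().max().item()
    close = torch.allclose(out_bs1.float(), out_bs3.float(), atol=5e-3, rtol=1e-2)
    print(f"max diff {max_diff:.4e}, allclose: {close}")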

  • The fp16 inference of the PyTorch swintransformer op gives `nan` output.

    The fp16 inference of the PyTorch swintransformer op gives `nan` output.

    Description

    system and software:

    • fastertransformer version: v5.0
    • GPU: T4
    • Swin-Transformer: e0486b2cf8c63b6314570a43007569c8aa9b4578
    • CUDA: 11.0

    Error Message

    1. Got nan from the FP16 inference of the swintransformer op: FP16_torch_traced_output vs FP16_op_output, avg diff: nan, max diff: nan. After debugging infer_swintransformer_op.py, this is caused by nan values in FP16_op_output.
    2. Got a large number of CUDA error messages during FP16 op inference:
    CUDA Error: (null) /workdir/xxx/packages/v5.0_tag/FasterTransformer-release-v5.0_tag/3rdparty/trt_fused_multihead_attention/fused_multihead_attention_v2.h 682
    

    Reproduced Steps

    1. Clone Swin-Transformer:

    cd examples/pytorch/swin/Swin-Transformer-Quantization && \
      git clone https://github.com/microsoft/Swin-Transformer.git && \
      cd Swin-Transformer && \
      git checkout e0486b2cf8c63b6314570a43007569c8aa9b4578 && \
      cd ../..
    

    2. Modify file paths:

    *infer_swintransformer_op.py*
    
     25 import sys
     26 sys.path.insert(0, "./Swin-Transformer-Quantization/Swin-Transformer")
    
    *run_test.sh*
     12 python3 infer_swintransformer_op.py \
     13     --eval \
     14     --data-path /workspace \
     15     --cfg Swin-Transformer-Quantization/Swin-Transformer/configs/swin/swin_tiny_patch4_window7_224.yaml \
     16     --resume Swin-Transformer-Quantization/swin_tiny_patch4_window7_224.pth \
     17     --th-path ../../../build/lib/libpyt_swintransformer.so \
     18     --batch-size $1
    

    3. Run the test shell script: bash run_test.sh 1

  • Possible Bug in Context Likelihood

    Possible Bug in Context Likelihood

    When calculating the log likelihood of the token at position i, we should use the logits at step i-1; also, the log likelihood of the starting token is undefined (it can be set to zero). https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/kernels/logprob_kernels.cu#L71 This is how we fixed it:

    1. Shift the position of the logits by subtracting vocab_size_padded (verified for the batch_first case): logits += step_offset + batch_offset - vocab_size_padded;
    2. Add a step > 0 check in the loop (to avoid computing log probs for the 0th token).
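
    A minimal NumPy illustration of the indexing described above (an editorial sketch, not the CUDA kernel itself): the token at position i is scored with the log-softmax of the logits produced at step i-1, and token 0 gets a log probability of zero.

    import numpy as np

    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    def context_logprobs(logits, tokens):
        """logits: [seq_len, vocab]; tokens: [seq_len] int ids."""
        logp = log_softmax(logits)
        out = np.zeros(len(tokens), dtype=logits.dtype)
        # token i is predicted by the logits at step i-1; token 0 has no prediction
        for i in range(1, len(tokens)):
            out[i] = logp[i - 1, tokens[i]]
        return out

    logits = np.random.randn(6, 50257).astype(np.float32)    # placeholder logits
    tokens = np.array([50256, 464, 11, 23, 5, 17])           # placeholder token ids
    print(context_logprobs(logits, tokens))
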
  • Does it support a Python API? That is, similar to trtexec, can I transfer the engine model and do inference?

    Does it support a Python API? That is, similar to trtexec, can I transfer the engine model and do inference?

    Description

    Does it support a Python API? That is, similar to trtexec, can I transfer the engine model and do inference?

    Reproduced Steps

    Does it support a Python API? That is, similar to trtexec, can I transfer the engine model and do inference?
  • Possible bug in decoder cross attention

    Possible bug in decoder cross attention

    Hi, if I take the same encoder input and pad it to a different maximum length, I get noticeably different encoder memory key/value tensors from decoder cross attention, and with some inputs this results in slightly different output tokens. After reviewing the code, I found the is_batch_major_cache_ argument, which is true by default. I tried attention with is_batch_major_cache_ = false, and the key/value and output-token mismatch was gone. My guess is that the default implementation of decoder cross attention is not handling memory lengths correctly.
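
    For reference, when the encoder memory is padded to a longer maximum length, cross attention should mask out the positions beyond each sequence's real memory length so that padding cannot change the result. The NumPy sketch below illustrates that invariant conceptually; it is not FasterTransformer's kernel.

    import numpy as np

    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    def cross_attention(q, memory, mem_len):
        # q: [d]; memory: [max_mem_len, d]; positions >= mem_len are padding
        scores = memory @ q / np.sqrt(q.shape[0])
        scores[mem_len:] = -1e9                       # mask padded memory positions
        return softmax(scores) @ memory

    d = 8
    q = np.random.randn(d)
    mem = np.random.randn(5, d)
    mem_padded = np.concatenate([mem, np.zeros((3, d))])  # same memory, more padding

    out_short = cross_attention(q, mem, mem_len=5)
    out_long = cross_attention(q, mem_padded, mem_len=5)
    assert np.allclose(out_short, out_long)           # padding length must not change the result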

  • How to quantize attn_score = Q*K and out = attn_prob*V in ViT's SelfAttention

    How to quantize attn_score = Q*K and out = attn_prob*V in ViT's SelfAttention

    Thanks for your great work on INT8 quantization for ViT. I have some questions about the quantization of ViT's SelfAttention. In the transformer attention there are two matmuls: 1) attn_score = Q * K^T and 2) out = attn_prob * V. For the quantization of 1) and 2), do self.matmul_q_input_quantizer and self.matmul_k_input_quantizer belong to dynamic quantization, i.e. are the scale and zero_point obtained during inference (on the fly) rather than from calibration?

    post training dynamic quantization: https://pytorch.org/docs/stable/quantization.html?highlight=quantization#module-torch.quantization

    # https://github.com/NVIDIA/FasterTransformer/blob/43ae78abfaa13a920ac1930c23615fe28c0e9819/examples/pytorch/vit/ViT-quantization/vit_int8.py#L186
    class Attention():

        def __init__(self):
            ...
            if QUANT:
                self.matmul_q_input_quantizer = TensorQuantizer(QuantLinear.default_quant_desc_input)
                self.matmul_k_input_quantizer = TensorQuantizer(QuantLinear.default_quant_desc_input)
                self.matmul_v_input_quantizer = TensorQuantizer(QuantLinear.default_quant_desc_input)
                self.matmul_a_input_quantizer = TensorQuantizer(QuantLinear.default_quant_desc_input)
                self.softmax_input_quantizer = TensorQuantizer(QuantLinear.default_quant_desc_input)

        def forward(self, hidden_states):
            ...
            if QUANT:
                attention_scores = torch.matmul(self.matmul_q_input_quantizer(query_layer),
                                                self.matmul_k_input_quantizer(key_layer.transpose(-1, -2)))
            ...
            if QUANT:
                context_layer = torch.matmul(self.matmul_a_input_quantizer(attention_probs),
                                             self.matmul_v_input_quantizer(value_layer))
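
    For context, the difference between the two schemes comes down to where the scale (amax) comes from: a statically calibrated quantizer freezes an amax collected from calibration data, while a dynamic quantizer recomputes it from each input at inference time. The NumPy sketch below only contrasts the two numerically; it makes no claim about what TensorQuantizer does internally, which is exactly the question above.

    import numpy as np

    def quantize_int8(x, amax):
        scale = amax / 127.0
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    # static PTQ: amax is collected once from calibration data and then frozen
    calib_batches = [np.random.randn(8, 64).astype(np.float32) for _ in range(10)]
    static_amax = max(np.abs(b).max() for b in calib_batches)

    x = np.random.randn(8, 64).astype(np.float32)
    q_static, s_static = quantize_int8(x, static_amax)        # scale fixed at calibration time
    q_dynamic, s_dynamic = quantize_int8(x, np.abs(x).max())  # scale recomputed per input ("on the fly")
    print("static scale:", s_static, "dynamic scale:", s_dynamic)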
    