oneAPI Deep Neural Network Library (oneDNN)

This software was previously known as Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN) and Deep Neural Network Library (DNNL).

With the launch of oneAPI we changed the project name and repository location to be consistent with the rest of oneAPI libraries:

  • Short library name changed to oneDNN.
  • Repository moved from intel/mkl-dnn to oneapi-src/oneDNN. Existing links to the code and documentation will continue to work.

There are no changes to the API, environment variables, or build options planned at this point.

oneAPI Deep Neural Network Library (oneDNN) is an open-source cross-platform performance library of basic building blocks for deep learning applications. The library is optimized for Intel Architecture Processors, Intel Processor Graphics and Xe architecture-based Graphics. oneDNN has experimental support for the following architectures:

  • Arm* 64-bit Architecture (AArch64)
  • NVIDIA* GPU
  • OpenPOWER* Power ISA (PPC64)
  • IBMz* (s390x)

oneDNN is intended for deep learning applications and framework developers interested in improving application performance on Intel CPUs and GPUs. Deep learning practitioners should use one of the applications enabled with oneDNN.
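
A minimal sketch of the basic programming flow, assuming the oneDNN 2.x C++ API from dnnl.hpp (link with -ldnnl): create an engine and a stream, wrap user data in a dnnl::memory object, then create and execute a primitive (here a ReLU eltwise operation applied in place).

    #include <vector>
    #include "dnnl.hpp"

    int main() {
        // Engine (device abstraction) and stream (execution context) on CPU 0.
        dnnl::engine eng(dnnl::engine::kind::cpu, 0);
        dnnl::stream s(eng);

        // Describe a small fp32 NCHW tensor and attach a user buffer to it.
        dnnl::memory::desc md({1, 3, 8, 8}, dnnl::memory::data_type::f32,
                dnnl::memory::format_tag::nchw);
        std::vector<float> data(md.get_size() / sizeof(float), -1.f);
        dnnl::memory mem(md, eng, data.data());

        // ReLU eltwise primitive, executed in place on the same memory object.
        auto relu_d = dnnl::eltwise_forward::desc(
                dnnl::prop_kind::forward_inference,
                dnnl::algorithm::eltwise_relu, md, 0.f);
        auto relu_pd = dnnl::eltwise_forward::primitive_desc(relu_d, eng);
        dnnl::eltwise_forward(relu_pd).execute(
                s, {{DNNL_ARG_SRC, mem}, {DNNL_ARG_DST, mem}});
        s.wait();
        return 0;
    }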

Documentation

  • Developer guide explains programming model, supported functionality, and implementation details, and includes annotated examples.
  • API reference provides a comprehensive reference of the library API.

Installation

Binary distribution of this software is available as Intel oneAPI Deep Neural Network Library in Intel oneAPI.

Pre-built binaries for Linux*, Windows*, and macOS* are available for download in the releases section. Package names use the following convention:

OS | Package name
Linux | dnnl_lnx_<version>_cpu_<cpu runtime>[_gpu_<gpu runtime>].tgz
Windows | dnnl_win_<version>_cpu_<cpu runtime>[_gpu_<gpu runtime>].zip
macOS | dnnl_mac_<version>_cpu_<cpu runtime>.tgz

Several packages are available for each operating system to ensure interoperability with CPU or GPU runtime libraries used by the application.

Configuration | Dependency
cpu_iomp | Intel OpenMP runtime
cpu_gomp | GNU* OpenMP runtime
cpu_vcomp | Microsoft Visual C OpenMP runtime
cpu_tbb | Threading Building Blocks (TBB)
cpu_dpcpp_gpu_dpcpp | Intel oneAPI DPC++ Compiler, TBB, OpenCL runtime, oneAPI Level Zero runtime
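
For example, following the naming convention above, a Linux package built with the cpu_iomp configuration would be named dnnl_lnx_<version>_cpu_iomp.tgz (with <version> filled in by the actual release).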

The packages do not include library dependencies and these need to be resolved in the application at build time. See the System Requirements section below and the Build Options section in the developer guide for more details on CPU and GPU runtimes.

If the configuration you need is not available, you can build the library from source.

System Requirements

oneDNN supports platforms based on the following architectures:

  • Intel 64 or AMD64
  • Arm* 64-bit Architecture (AArch64)
  • OpenPOWER* Power ISA (PPC64)
  • IBMz* (s390x)

WARNING

Arm 64-bit Architecture (AArch64), Power ISA (PPC64), and IBMz (s390x) support is experimental, with limited validation testing.

The library is optimized for the following CPUs:

  • Intel Atom processor with Intel SSE4.1 support
  • 4th, 5th, 6th, 7th, and 8th generation Intel(R) Core(TM) processor
  • Intel(R) Xeon(R) processor E3, E5, and E7 family (formerly Sandy Bridge, Ivy Bridge, Haswell, and Broadwell)
  • Intel(R) Xeon Phi(TM) processor (formerly Knights Landing and Knights Mill)
  • Intel Xeon Scalable processor (formerly Skylake, Cascade Lake, and Cooper Lake)
  • future Intel Xeon Scalable processor (code name Sapphire Rapids)

On a CPU based on Intel 64 or on AMD64 architecture, oneDNN detects the instruction set architecture (ISA) at runtime and uses just-in-time (JIT) code generation to deploy the code optimized for the latest supported ISA. Future ISAs may have initial support in the library disabled by default and require the use of run-time controls to enable them. See CPU dispatcher control for more details.
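
As an illustration of such a run-time control, the sketch below assumes a build with CPU dispatcher control enabled (the DNNL_ENABLE_MAX_CPU_ISA build option, on by default) and caps JIT code generation at AVX2 before any primitives are created; the DNNL_MAX_CPU_ISA environment variable typically provides the same control without code changes.

    #include "dnnl.hpp"

    int main() {
        // Limit JIT code generation to AVX2. This must be called before the
        // first primitive is created; afterwards the setting has no effect.
        dnnl::set_max_cpu_isa(dnnl::cpu_isa::avx2);

        dnnl::engine eng(dnnl::engine::kind::cpu, 0);
        // ... create and execute primitives as usual ...
        (void)eng;
        return 0;
    }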

On a CPU based on Arm AArch64 architecture, oneDNN can be built with Arm Compute Library integration. Arm Compute Library is an open-source library for machine learning applications and provides AArch64 optimized implementations of core functions. This functionality currently requires that Arm Compute Library is downloaded and built separately, see Build from Source.

WARNING

On macOS, applications that use oneDNN may need to request special entitlements if they use the hardened runtime. See the linking guide for more details.

The library is optimized for the following GPUs:

  • Intel HD Graphics
  • Intel UHD Graphics
  • Intel Iris Plus Graphics
  • Xe architecture-based Graphics (code named DG1 and Tiger Lake)

Requirements for Building from Source

oneDNN supports systems meeting the following requirements:

  • Operating system with Intel 64 / Arm 64 / Power / IBMz architecture support
  • C++ compiler with C++11 standard support
  • CMake 2.8.11 or later
  • Doxygen 1.8.5 or later to build the documentation
  • Arm Compute Library for builds using Compute Library on AArch64.

Configurations of CPU and GPU engines may introduce additional build time dependencies.

CPU Engine

oneDNN CPU engine is used to execute primitives on Intel Architecture Processors, 64-bit Arm Architecture (AArch64) processors, 64-bit Power ISA (PPC64) processors, IBMz (s390x) processors, and compatible devices.

The CPU engine is built by default and cannot be disabled at build time. The engine can be configured to use the OpenMP, TBB, or DPCPP runtime; the additional requirements depend on the runtime selected (see the Build Options section in the developer guide).

Some implementations rely on OpenMP 4.0 SIMD extensions. For the best performance results on Intel Architecture Processors we recommend using the Intel C++ Compiler.

GPU Engine

Intel Processor Graphics and Xe architecture-based Graphics are supported by the oneDNN GPU engine. The GPU engine is disabled in the default build configuration. The following additional requirements apply when GPU engine is enabled:

  • OpenCL runtime requires
    • OpenCL* runtime library (OpenCL version 1.2 or later)
    • OpenCL driver (with kernel language support for OpenCL C 2.0 or later) with Intel subgroups extension support
  • DPCPP runtime requires
    • Intel oneAPI DPC++ Compiler
    • OpenCL runtime library (OpenCL version 1.2 or later)
    • oneAPI Level Zero runtime
  • DPCPP runtime with NVIDIA GPU support requires
    • oneAPI DPC++ Compiler
    • OpenCL runtime library (OpenCL version 1.2 or later)
    • NVIDIA CUDA* driver
    • cuBLAS 10.1 or later
    • cuDNN 7.6 or later

WARNING

NVIDIA GPU support is experimental. General information, build instructions, and implementation limitations are available in the NVIDIA backend readme.

Runtime Dependencies

When oneDNN is built from source, the library runtime dependencies and specific versions are defined by the build environment.

Linux

Common dependencies:

  • GNU C Library (libc.so)
  • GNU Standard C++ Library v3 (libstdc++.so)
  • Dynamic Linking Library (libdl.so)
  • C Math Library (libm.so)
  • POSIX Threads Library (libpthread.so)

Runtime-specific dependencies:

Runtime configuration | Compiler | Dependency
DNNL_CPU_RUNTIME=OMP | GCC | GNU OpenMP runtime (libgomp.so)
DNNL_CPU_RUNTIME=OMP | Intel C/C++ Compiler | Intel OpenMP runtime (libiomp5.so)
DNNL_CPU_RUNTIME=OMP | Clang | Intel OpenMP runtime (libiomp5.so)
DNNL_CPU_RUNTIME=TBB | any | TBB (libtbb.so)
DNNL_CPU_RUNTIME=DPCPP | Intel oneAPI DPC++ Compiler | Intel oneAPI DPC++ Compiler runtime (libsycl.so), TBB (libtbb.so), OpenCL loader (libOpenCL.so)
DNNL_GPU_RUNTIME=OCL | any | OpenCL loader (libOpenCL.so)
DNNL_GPU_RUNTIME=DPCPP | Intel oneAPI DPC++ Compiler | Intel oneAPI DPC++ Compiler runtime (libsycl.so), OpenCL loader (libOpenCL.so), oneAPI Level Zero loader (libze_loader.so)

Windows

Common dependencies:

  • Microsoft Visual C++ Redistributable (msvcrt.dll)

Runtime-specific dependencies:

Runtime configuration | Compiler | Dependency
DNNL_CPU_RUNTIME=OMP | Microsoft Visual C++ Compiler | No additional requirements
DNNL_CPU_RUNTIME=OMP | Intel C/C++ Compiler | Intel OpenMP runtime (iomp5.dll)
DNNL_CPU_RUNTIME=TBB | any | TBB (tbb.dll)
DNNL_CPU_RUNTIME=DPCPP | Intel oneAPI DPC++ Compiler | Intel oneAPI DPC++ Compiler runtime (sycl.dll), TBB (tbb.dll), OpenCL loader (OpenCL.dll)
DNNL_GPU_RUNTIME=OCL | any | OpenCL loader (OpenCL.dll)
DNNL_GPU_RUNTIME=DPCPP | Intel oneAPI DPC++ Compiler | Intel oneAPI DPC++ Compiler runtime (sycl.dll), OpenCL loader (OpenCL.dll), oneAPI Level Zero loader (ze_loader.dll)

macOS

Common dependencies:

  • System C/C++ runtime (libc++.dylib, libSystem.dylib)

Runtime-specific dependencies:

Runtime configuration | Compiler | Dependency
DNNL_CPU_RUNTIME=OMP | Intel C/C++ Compiler | Intel OpenMP runtime (libiomp5.dylib)
DNNL_CPU_RUNTIME=TBB | any | TBB (libtbb.dylib)

Validated Configurations

CPU engine was validated on RedHat* Enterprise Linux 7, on Windows Server* 2012 R2, and on macOS 10.13 (High Sierra).

GPU engine was validated on Ubuntu* 18.04 and on Windows Server 2019.

Requirements for Pre-built Binaries

See the README included in the corresponding binary package.

Applications Enabled with oneDNN

Support

Please submit your questions, feature requests, and bug reports on the GitHub issues page.

You may reach out to project maintainers privately at [email protected].

WARNING

This is pre-production software and functionality may change without prior notice.

Contributing

We welcome community contributions to oneDNN. If you have an idea on how to improve the library:

  • For changes impacting the public API or library overall, such as adding new primitives or changes to the architecture, submit an RFC pull request.
  • Ensure that the changes are consistent with the code contribution guidelines and coding style.
  • Ensure that you can build the product and run all the examples with your patch.
  • Submit a pull request.

For additional details, see contribution guidelines.

This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.

License

oneDNN is licensed under Apache License Version 2.0. Refer to the "LICENSE" file for the full license text and copyright notice.

This distribution includes third party software governed by separate license terms.

  • 3-clause BSD license
  • Apache License Version 2.0
  • Boost Software License, Version 1.0
  • MIT License
  • SIL Open Font License (OFL)

This third party software, even if included with the distribution of the Intel software, may be governed by separate license terms, including without limitation, third party license terms, other Intel software license terms, and open source software license terms. These separate license terms govern your use of the third party programs as set forth in the "THIRD-PARTY-PROGRAMS" file.

Security

See Intel's Security Center for information on how to report a potential security issue or vulnerability.

See also: Security Policy

Trademark Information

Intel, the Intel logo, Intel Atom, Intel Core, Intel Xeon Phi, Iris, OpenVINO, the OpenVINO logo, Pentium, VTune, and Xeon are trademarks of Intel Corporation or its subsidiaries.

* Other names and brands may be claimed as the property of others.

Microsoft, Windows, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

(C) Intel Corporation

Comments
  • Intend to package mkl-dnn for Debian (and Ubuntu)

    FYI: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=894411

    It seems that the Apache-2 licensed mkl-dnn can be built and used without MKL, despite suboptimal performance. In this case we can make packages.

  • I can not use gpu

    I built dnnl with the following commands:

    mkdir -p build && cd build
    cmake ..
    make -j16
    make install
    

    ctest runs successfully.

    Part of the output of nvidia-smi:

    NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4   
    

    but dnnl::engine::get_count(dnnl::engine::kind::gpu) returns 0.
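
    A hedged note rather than a confirmed diagnosis: a plain "cmake .." build leaves the GPU runtime disabled (DNNL_GPU_RUNTIME defaults to NONE), so no GPU engines are registered regardless of what nvidia-smi reports, and NVIDIA support additionally requires the experimental DPC++ build described in the NVIDIA backend readme. A small check along these lines shows whether any GPU engine is visible to the library:

    #include <iostream>
    #include "dnnl.hpp"

    int main() {
        // Returns 0 when the library was built without a GPU runtime.
        auto n_gpus = dnnl::engine::get_count(dnnl::engine::kind::gpu);
        if (n_gpus == 0) {
            std::cout << "No GPU engines available; using CPU instead.\n";
            dnnl::engine eng(dnnl::engine::kind::cpu, 0);
            return 0;
        }
        dnnl::engine eng(dnnl::engine::kind::gpu, 0);
        std::cout << "Found " << n_gpus << " GPU engine(s).\n";
        return 0;
    }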

  • cpu: matmul: enable optimized 8-bit gemm for PPC64

    Description

    Greetings! This change allows users running on the Power10 processor, with its "MMA engine," to access those instructions when performing eight-bit matrix multiplications. We had previously enabled such functionality for bfloat16 as well as float32 and float64 in the OpenBLAS repository, but OpenBLAS does not support integer operations, so we are inserting that code here. We noted that you wanted a short 50-character commit line, and we've done that. We noted that you wanted the source code run through a clang formatter, and we have also done that. We noted that you want an RFC process if new code changes the API or adds any new functions, but our code does neither, so we're hoping that proceeding with the pull request directly is ok.

    Fixes # (github issue)

    Checklist

    General

    • [ YES] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
    • [ YES] Have you formatted the code using clang-format?

    Performance improvements

    • [ N/A] Have you submitted performance data that demonstrates performance improvements?

    New features

    • [ N/A] Have you published an RFC for the new feature?
    • [ N/A] Was the RFC approved?
    • [ N/A] Have you added relevant tests?

    Bug fixes

    • [ N/A] Have you included information on how to reproduce the issue (either in a github issue or in this PR)?
    • [ N/A] Have you added relevant regression tests?

    RFC PR

    • [ N/A] Does RFC document follow the template?
    • [ N/A] Have you added a link to the rendered document?
  • build: adds option to link against OpenBLAS or ArmPL on AArch64.

    Description

    This PR adds FindBLAS.cmake (derived from the most recent version, which supports Arm Performance Libraries) and adds build options that use the CBLAS interface to link against an existing BLAS library, either OpenBLAS or ArmPL.

    At present these BLAS libraries, where available, provide improved performance on ResNet50 benchmarks, for example. This PR makes building against these BLAS libraries, on AArch64, easier.

    DNNL_AARCH64_USE_ARMPL=on will use FindBLAS to find the ArmPL lib and fail if a working lib cannot be found. If found, the build will progress with -DUSE_CBLAS set. DNNL_AARCH64_USE_OPENBLAS=on will use FindBLAS to find the OpenBLAS lib and fail if a working lib cannot be found. If found, the build will progress with -DUSE_CBLAS set.

    Both cases require DNNL_TARGET_ARCH="AARCH64", and so do not, as written, enable the CBLAS interface for any other ISA.

    Please let me know if this should be raised as an RFC rather than a PR.

    Fixes # (github issue) None.

    Checklist

    Code-change submissions

    • [x] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally? Tested with GCC9.2 + OpenBLAS 0.3.7 or ArmPL 20.1, and without a vendor BLAS library, on AArch64.
    • [ ] Have you formatted the code using clang-format? Not applicable - no changes to C++ src.

    New features

    • [x] Have you added relevant tests? Functionality of BLAS and CBLAS interface tested during build. Functional testing of oneDNN build with OpenBLAS / ArmPL covered by existing tests.
    • [x] Have you provided motivation for adding a new feature?
  • Question: is it possible to control memory usage?

    When using mkldnn from PyTorch for ResNet inference with various input shapes, we observe memory usage growing quickly by around 7 GB, and then growth mostly stops, or slows down significantly, after around 1000 calls to inference, with input shapes ranging from (3, 320, 200) to (3, 320, 7680). MKLDNN provides nice speedup, but this extra memory usage is a little concerning.

    The questions are:

    • is such memory usage normal in this case?
    • are there any settings or environment variables that can keep memory usage lower, even at the cost of some performance? (I imagine it must be caching some allocations?)

    Thank you!
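
    One knob that may be relevant here (a hedged sketch, assuming a oneDNN/DNNL version that exposes the primitive cache controls): the library caches created primitives, and the cache capacity can be reduced, or set to zero, to trade primitive re-creation time for a smaller memory footprint. Whether this accounts for the growth observed from PyTorch depends on how the framework integrates the library.

    #include "dnnl.hpp"

    int main() {
        // 0 disables the primitive cache entirely; small positive values keep
        // only the most recently used primitives.
        dnnl::set_primitive_cache_capacity(0);
        // ... create and execute primitives as usual ...
        return 0;
    }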

  • Possible performance regression with 1x1 conv on Windows?

    I notice a performance drop when executing a 1x1 convolution using mkl-dnn under Windows 10 (I didn't test other platforms). The performance was better with the previous commit. Is this expected behaviour? I didn't change any of my code.

    thanks

    • Intel Haswell CPU
    • Windows 10 x64 Pro
    • Visual Studio 2017 latest update
  • help wanted for deconv

    Hi,

    I used deconv with a 2x2 kernel and 2x2 stride for 2x upscaling. I also created a wrapper function to build deconv layers (because my model has several of those).

    For whatever reasons, I could not get the same results as tensorflow.

    My code, input, and weights for your testing are available here. The sum of the output should be 62098350.0 (that is what TF gives me), but my DNNL code gives a quite different value.

    I have spent more than a day on this code. Help wanted.

  • Can we force MKLDNN to use cblas always instead of JIT kernels generated at runtime

    Can we force MKLDNN to always use cblas functions instead of JIT kernels generated at runtime? I am using the external library OpenBLAS and want to use it for all GEMM-related work.

  • Using benchdnn test cuda failed.

    I built oneDNN on branch dev-v2 to test the oneDNN examples with an NVIDIA GPU. I found an earlier question which said conv does not support NVIDIA, but that you can run benchdnn with an NVIDIA GPU. I tried to run benchdnn but I get an error. Run command: "./benchdnn --conv --cfg=f32 --dir=FWD_B --batch=inputs/conv/test_conv_all"

    The result is: "error [engine_t::engine_t(dnnl_engine_kind_t):1231]: 'dnnl_engine_create(&inst, engine_kind, 0)' -> invalid_arguments(2)". How can I run benchdnn with an NVIDIA GPU? Thanks very much.

  • CMake can't find OpenMP with Apple Clang

    Summary

    I'm trying to build DNNL on macOS with Apple Clang, but it won't pick up the OpenMP library I have installed.

    Version

    1.1.1

    Environment

    • macOS 10.15.1 Catalina
    • OS version (Darwin Vesuvius 19.0.0 Darwin Kernel Version 19.0.0: Thu Oct 17 16:17:15 PDT 2019; root:xnu-6153.41.3~29/RELEASE_X86_64 x86_64)
    • Compiler version (Clang 11.0.0)
    • CMake version (3.15.4)

    Steps to reproduce

    $ cmake /var/folders/21/hwq39zyj4g36x6zjfyl5l8080000gn/T/Adam/spack-stage/spack-stage-intel-mkl-dnn-1.1.1-6oqj5ixj3u5v24mm4lo4vsbgyc3rmnsq/spack-src -G Unix Makefiles -DCMAKE_INSTALL_PREFIX:PATH=/Users/Adam/spack/opt/spack/darwin-catalina-x86_64/clang-11.0.0-apple/intel-mkl-dnn-1.1.1-6oqj5ixj3u5v24mm4lo4vsbgyc3rmnsq -DCMAKE_BUILD_TYPE:STRING=RelWithDebInfo -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DCMAKE_FIND_FRAMEWORK:STRING=LAST -DCMAKE_FIND_APPBUNDLE:STRING=LAST -DCMAKE_INSTALL_RPATH_USE_LINK_PATH:BOOL=FALSE -DCMAKE_INSTALL_RPATH:STRING=/Users/Adam/spack/opt/spack/darwin-catalina-x86_64/clang-11.0.0-apple/intel-mkl-dnn-1.1.1-6oqj5ixj3u5v24mm4lo4vsbgyc3rmnsq/lib;/Users/Adam/spack/opt/spack/darwin-catalina-x86_64/clang-11.0.0-apple/intel-mkl-dnn-1.1.1-6oqj5ixj3u5v24mm4lo4vsbgyc3rmnsq/lib64;/opt/intel/lib;/Users/Adam/spack/opt/spack/darwin-catalina-x86_64/clang-11.0.0-apple/llvm-openmp-9.0.0-kkckpcgbwofd7nmpry7nq2bdgq3gvdec/lib -DCMAKE_PREFIX_PATH:STRING=/Users/Adam/spack/opt/spack/darwin-catalina-x86_64/clang-11.0.0-apple/cmake-3.15.4-o4vw4x7hc37q6wevpxbijto6fwhwpeiu;/opt/intel;/Users/Adam/spack/opt/spack/darwin-catalina-x86_64/clang-11.0.0-apple/llvm-openmp-9.0.0-kkckpcgbwofd7nmpry7nq2bdgq3gvdec
    

    Observed behavior

    -- The C compiler identification is AppleClang 11.0.0.11000033
    -- The CXX compiler identification is AppleClang 11.0.0.11000033
    -- Check for working C compiler: /Users/Adam/spack/lib/spack/env/clang/clang
    -- Check for working C compiler: /Users/Adam/spack/lib/spack/env/clang/clang -- works
    -- Detecting C compiler ABI info
    -- Detecting C compiler ABI info - done
    -- Detecting C compile features
    -- Detecting C compile features - done
    -- Check for working CXX compiler: /Users/Adam/spack/lib/spack/env/clang/clang++
    -- Check for working CXX compiler: /Users/Adam/spack/lib/spack/env/clang/clang++ -- works
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    -- Looking for pthread.h
    -- Looking for pthread.h - found
    -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
    -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
    -- Found Threads: TRUE  
    -- Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES) 
    -- Could NOT find OpenMP_CXX (missing: OpenMP_CXX_FLAGS OpenMP_CXX_LIB_NAMES) 
    -- Could NOT find OpenMP (missing: OpenMP_C_FOUND OpenMP_CXX_FOUND) 
    CMake Warning at cmake/OpenMP.cmake:77 (message):
      OpenMP library could not be found.  Proceeding might lead to highly
      sub-optimal performance.
    Call Stack (most recent call first):
      CMakeLists.txt:82 (include)
    

    Expected behavior

    I would expect the build to pick up the OpenMP installation I have:

    $ ls /Users/Adam/spack/opt/spack/darwin-catalina-x86_64/clang-11.0.0-apple/llvm-openmp-9.0.0-kkckpcgbwofd7nmpry7nq2bdgq3gvdec/lib/libomp.dylib 
    /Users/Adam/spack/opt/spack/darwin-catalina-x86_64/clang-11.0.0-apple/llvm-openmp-9.0.0-kkckpcgbwofd7nmpry7nq2bdgq3gvdec/lib/libomp.dylib
    
  • Perf of intel-tensorflow vs default tensorflow on BERT

    This is a request to confirm (or not) the performance of "intel-tensorflow", as described here: https://software.intel.com/en-us/articles/intel-optimization-for-tensorflow-installation-guide, vs. the default tensorflow from the official pip repo, for SQuAD BERT inference: https://github.com/google-research/bert


    Environment

    • CPU make and model: Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz

    • OS version: CentOS 7

    • Compiler version: whatever compiler the python pip maintainers used to build TF.

    • MKLDNN version: hard to say. From what I see in /usr/lib/python3.6/site-packages/tensorflow/include/external/mkl_dnn/include/ the mkldnn header copyrights are still from 2018:

      $ ll /usr/lib/python3.6/site-packages/tensorflow/include/external/mkl_dnn/include
      -rw-r--r--. 1 root root   2376 Aug 14 16:50 mkldnn_debug.h
      -rw-r--r--. 1 root root  75949 Aug 14 16:50 mkldnn.h
      -rw-r--r--. 1 root root 142047 Aug 14 16:50 mkldnn.hpp
      -rw-r--r--. 1 root root  48357 Aug 14 16:50 mkldnn_types.h
      $ head /usr/lib/python3.6/site-packages/tensorflow/include/external/mkl_dnn/include/mkldnn.h
      Copyright 2016-2018 Intel Corporation

      and there is no mkldnn_version symbol in the tf lib:

      $ nm /usr/lib/python3.6/site-packages/tensorflow/libtensorflow_framework.so | grep mkldnn_v
      0000000000abfb00 T mkldnn_verbose_set
      0000000000ab57a0 T mkldnn_view_primitive_desc_create
      0000000000abfa40 T _ZN6mkldnn4impl14mkldnn_verboseEv
      0000000001882aa0 b _ZZN6mkldnn4impl14mkldnn_verboseEvE11initialized

    Steps to reproduce

    • train/finetune BERT on the SQuAD task
      https://github.com/google-research/bert#squad-11

    • run inference (a simple batch of 10 questions on 1 paragraph, run twice for warmup) with the default tensorflow as installed with pip:

    pip3 install tensorflow==1.13.1
    $ OMP_NUM_THREADS=1 python3 bert_inference.py
    Found TensortFlow version 1.13.1
    Intel MKLDNN enabled ?  False
    2019-08-20 17:46:52.979731: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
    Prediction 1 run in: 3.614552957005799 secs
    Prediction 2 run in: 3.1976517980219796 secs

    • Uninstall tf and install intel-tensorflow : pip3 install intel-tensorflow and rerun the very same inference:

    $ OMP_NUM_THREADS=1 KMP_BLOCKTIME=0 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 python3 bert_inference.py
    Found TensortFlow version 1.13.1
    Intel MKLDNN enabled ?  True
    OMP: Info #212: KMP_AFFINITY: decoding x2APIC ids.
    OMP: Info #210: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
    OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0-3
    OMP: Info #156: KMP_AFFINITY: 4 available OS procs
    OMP: Info #157: KMP_AFFINITY: Uniform topology
    OMP: Info #179: KMP_AFFINITY: 1 packages x 2 cores/pkg x 2 threads/core (2 total cores)
    OMP: Info #214: KMP_AFFINITY: OS proc to physical thread map:
    OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
    OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 0 thread 1
    OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 thread 0
    OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 1 thread 1
    OMP: Info #250: KMP_AFFINITY: pid 7205 tid 7223 thread 0 bound to OS proc set 0
    OMP: Info #250: KMP_AFFINITY: pid 7205 tid 7223 thread 1 bound to OS proc set 1
    Prediction 1 run in: 5.119017199991504 secs
    Prediction 2 run in: 4.5127738649898674 secs

    Note: add print('Intel MKLDNN enabled ? ', tf.pywrap_tensorflow.IsMklEnabled())

    Actual behavior

    intel-tensorflow slower than default tensorflow

    Expected behavior

    intel-tensorflow faster than default tensorflow

  • Issue with has_training_support API on AArch64 for FP16

    Summary

    Commit https://github.com/oneapi-src/oneDNN/commit/64198427cc3a84fb15e19691cf6477dbe1c4a3d0 is causing an issue on AArch64. The failing tests are the test_batch_normalization/BatchNormalizationSimpleF16 cases, which call into the reference batch normalization kernels for oneDNN builds with ACL.

    Version

    https://github.com/oneapi-src/oneDNN/commit/64198427cc3a84fb15e19691cf6477dbe1c4a3d0

    Environment

    • CPU: aarch64
    • OS version: Ubuntu 20.04.4
    • Compiler version: 10.3.0
    • CMake version: 3.24.1

    Steps to reproduce

    ./tests/gtests/test_batch_normalization  --gtest_filter="BatchNormalizationSimpleF16*" 
    
    

    Observed behavior

    This calls into the ref/ncsp/nscp implementation and returns false for has_training_support (https://github.com/oneapi-src/oneDNN/blob/b4b2d5265ff974779cc7ed2ba878600524366eba/src/cpu/platform.cpp#L154), which causes the ctest to fail because it cannot create a primitive descriptor for this case.

    unknown file: Failure
    C++ exception with description "could not create a primitive descriptor for a batch normalization forward propagation primitive" thrown in SetUp().
    [  FAILED  ] BatchNormalizationSimpleF16/batch_normalization_test_t.TestsBatchNormalization/4, where GetParam() = 56-byte object <01-00 00-00 01-00 00-00 00-00 00-00 05-00 00-00 05-00 00-00 05-00 00-00 60-37 3D-E2 AA-AA 00-00 80-37 3D-E2 AA-AA 00-00 80-37 3D-E2 AA-AA 00-00 00-00 00-00 00-00 00-00> (8 ms)
    [ RUN      ] BatchNormalizationSimpleF16/batch_normalization_test_t.TestsBatchNormalization/5
    

    bnorm-ctest.log

    ACL has FP16 support and we have added a check for it to has_data_type_support (https://github.com/oneapi-src/oneDNN/blob/b4b2d5265ff974779cc7ed2ba878600524366eba/src/cpu/platform.cpp#L127), but ACL does not support training in FP16.

    Should the whole test be skipped on non-x86 builds? For example, by adding a check here: https://github.com/oneapi-src/oneDNN/blob/b4b2d5265ff974779cc7ed2ba878600524366eba/tests/gtests/test_batch_normalization.cpp#L63?

    Also, what is the difference between has_training_support and has_data_type_support? Could you clarify this? Is it just that some Intel hardware supports FP16?

  • Fatal Error when trying to compile example/getting_started.cpp: fatal error: '/CL/sycl.hpp' file not found

    Summary

    Provide a short summary of the issue. Sections below provide guidance on what factors are considered important to reproduce an issue.

    Version

    I followed the "Installation through package manager" guide on the website.

    • CPU : AMD Ryzen 5700x
    • OS version : Ubuntu 20.04.5 LTS
    • Compiler version icpx
    • CMake version GNU Make 4.2.1

    Observed behavior

    Upon executing the command: ~/MyComputer/OneDNN/oneDNN/examples$ icpx -I${DNNLROOT}/include -L${DNNLROOT}/lib test.cpp -ldnnl

    the output is:

    In file included from test.cpp:30:
    In file included from ./example_utils.hpp:36:
    In file included from /opt/intel/oneapi/dnnl/2022.2.1/cpu_dpcpp_gpu_dpcpp/include/dnnl_sycl.hpp:20:
    /opt/intel/oneapi/dnnl/2022.2.1/cpu_dpcpp_gpu_dpcpp/include/oneapi/dnnl/dnnl_sycl.hpp:29:10: fatal error: '/CL/sycl.hpp' file not found

    Expected behavior

    Compile the getting_started.cpp file

  • cpu: aarch64: add clip using ACL bounded relu

    Description

    This PR adds support for oneDNN's clip algorithm using the Bounded Relu from Compute Library for the Arm® architecture (ACL). Benchmarks show that performance improvements compared to ref are similar to other algs.

    Checklist

    General

    • [X] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
    • [X] Have you formatted the code using clang-format?

    Performance improvements

    • [N/A] Have you submitted performance data that demonstrates performance improvements?

    New features

    • [N/A] Have you published an RFC for the new feature?
    • [N/A] Was the RFC approved?
    • [N/A] Have you added relevant tests?

    Bug fixes

    • [N/A] Have you included information on how to reproduce the issue (either in a github issue or in this PR)?
    • [N/A] Have you added relevant regression tests?

    RFC PR

    • [N/A] Does RFC document follow the template?
    • [N/A] Have you added a link to the rendered document?
  • What is the status of support for deconvolution in TensorFlow and PyTorch

    For TensorFlow I have tried running a simple script:

    import tensorflow as tf
    output_shape = [3, 8, 8, 128]
    strides = [1, 2, 2, 1]
    l = tf.constant(0.1, shape=[3, 4, 4, 4])
    w = tf.constant(0.1, shape=[7, 7, 128, 4])
    h1 = tf.nn.conv2d_transpose(l, w, output_shape=output_shape, strides=strides, padding='SAME')
    

    and ran it with ONEDNN_VERBOSE=1. I have seen #193, and it looks like deconvolution is still implemented as a backward convolution in TF, since I see

    onednn_verbose,exec,cpu,convolution,gemm:ref,backward_data,src_f32::blocked:acdb:f0 wei_f32::blocked:cdba:f0 bia_undef::undef::f0 dst_f32::blocked:acdb:f0,,alg:convolution_direct,mb3_ic128oc4_ih8oh4kh7sh2dh0ph2_iw8ow4kw7sw2dw0pw2,0.666992
    

    For PyTorch I have tried running

    import torch
    import torch.nn as nn
    m = nn.ConvTranspose2d(16, 33, 3, stride=2)
    input = torch.randn(20, 16, 50, 100)
    output = m(input)
    

    and see no message from oneDNN at all.

    I have been trying to understand support for oneDNN deconvolution at the framework level. Thank you.

  • Relative install dir breaks installations from source

    I'm trying to use oneDNN as a dependency and to install it from source. The goal is to package Flashlight (which has oneDNN as a dependency) via the Nix package manager.

    The problem I'm facing is that I can't have Flashlight consume oneDNN as a source dependency due to having a relative install dir at https://github.com/oneapi-src/oneDNN/blob/8859a1a0c17679799a1530dbc822f59100ef0721/src/CMakeLists.txt#L169.

    The issue with this approach is explained with more detail in this document.

    Could this problem be addressed, so the users downstream aren't forced to keep repatching?

  • Add Accelerate support as a BLAS vendor

    Description

    Add macOS Accelerate as a BLAS vendor. This significantly improves performance for matmul ops.

    CMake links macOS frameworks via find_library, which does a bunch of other machinery -- having INTERFACE_LINK_LIBRARIES in the exported DNNL::dnnl target actually breaks downstream cmake projects because compilers don't like -lAccelerate.framework on macOS.

    The solution here is to create a distinct set of exported shared libs, EXTRA_SHARED_LIBS_BUILD, which is only used in the build interface, since the install interface generated by CMake will incorrectly use INTERFACE_LINK_LIBRARIES as described above. In the case of downstream projects that use oneDNN via dnnl-config.cmake, call find_library for Accelerate to ensure the framework is present as a prereq to importing targets.

    Checklist

    General

    • [x] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
    • [x] Have you formatted the code using clang-format?

    Performance improvements

    • [x] Have you submitted performance data that demonstrates performance improvements?

    @bwasti has full benchdnn results — showing 2.28 TFlops for n = 1024 on matmul on an M1, a huge improvement over MKL impls.
