The Eclipse Deeplearning4J (DL4J) ecosystem is a set of projects intended to support all the needs of a JVM-based deep learning application

The Eclipse Deeplearning4J (DL4J) ecosystem is a set of projects intended to support all the needs of a JVM-based deep learning application. This means starting with raw data, loading and preprocessing it from wherever it is and whatever format it is in, all the way to building and tuning a wide variety of simple and complex deep learning networks.

Because Deeplearning4J runs on the JVM, you can use it with a wide variety of JVM-based languages other than Java, such as Scala, Kotlin, Clojure and many more.

The DL4J stack comprises:

  • DL4J: High level API to build MultiLayerNetworks and ComputationGraphs with a variety of layers, including custom ones. Supports importing Keras models from h5, including tf.keras models (as of 1.0.0-beta7) and also supports distributed training on Apache Spark
  • ND4J: General purpose linear algebra library with over 500 mathematical, linear algebra and deep learning operations. ND4J is based on the highly-optimized C++ codebase LibND4J, which provides CPU (AVX2/AVX512) and GPU (CUDA) support and acceleration by libraries such as OpenBLAS, OneDNN (MKL-DNN), cuDNN, cuBLAS, etc. (a minimal ND4J sketch follows this list)
  • SameDiff: Part of the ND4J library, SameDiff is our automatic differentiation / deep learning framework. SameDiff uses a graph-based (define-then-run) approach, similar to TensorFlow graph mode. Eager execution (as in TensorFlow 2.x / PyTorch) is planned. SameDiff supports importing TensorFlow frozen model format .pb (protobuf) models. Import for ONNX, TensorFlow SavedModel and Keras models is planned. Deeplearning4j also has full SameDiff support for easily writing custom layers and loss functions.
  • DataVec: ETL for machine learning data in a wide variety of formats and files (HDFS, Spark, Images, Video, Audio, CSV, Excel etc)
  • LibND4J: C++ library that underpins everything. For more information on how the JVM accesses native arrays and operations, refer to JavaCPP.
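
As a quick taste of ND4J, here is a minimal sketch (Nd4j.rand, Nd4j.ones, mmul and add are core ND4J API; the class name is illustrative, and defaults such as data type can vary between versions):

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class Nd4jTaste {
    public static void main(String[] args) {
        INDArray a = Nd4j.rand(2, 3);        // 2x3 matrix of uniform random values
        INDArray b = Nd4j.ones(3, 2);        // 3x2 matrix of ones
        INDArray c = a.mmul(b);              // matrix multiplication -> 2x2 result
        System.out.println(c.add(1.0));      // element-wise addition of a scalar
    }
}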

All projects in the DL4J ecosystem support Windows, Linux and macOS. Hardware support includes CUDA GPUs (CUDA 10.0, 10.1 and 10.2; not on macOS), x86 CPUs (x86_64, avx2, avx512), ARM CPUs (arm, arm64, armhf) and PowerPC (ppc64le).

Community Support

For support, please visit the community forum at https://community.konduit.ai/

Using Eclipse Deeplearning4J in your project

Deeplearning4J has quite a few dependencies. For this reason, we only support usage with a build tool.

<dependencies>
  <dependency>
      <groupId>org.deeplearning4j</groupId>
      <artifactId>deeplearning4j-core</artifactId>
      <version>1.0.0-M1.1</version>
  </dependency>
  <dependency>
      <groupId>org.nd4j</groupId>
      <artifactId>nd4j-native-platform</artifactId>
      <version>1.0.0-M1.1</version>
  </dependency>
</dependencies>

Add these dependencies to your pom.xml file to use Deeplearning4J with the CPU backend. A full standalone project example is available in the example repository, if you want to start a new Maven project from scratch.
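
If you use Gradle instead of Maven, the equivalent declarations should look roughly like this (a sketch, assuming the same artifacts and version as above):

dependencies {
    implementation 'org.deeplearning4j:deeplearning4j-core:1.0.0-M1.1'
    implementation 'org.nd4j:nd4j-native-platform:1.0.0-M1.1'
}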

A taste of code

Deeplearning4J offers a very high level API for defining even complex neural networks. The following example code shows you how LeNet, a convolutional neural network, is defined in DL4J.

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .seed(seed)
                .l2(0.0005)
                .weightInit(WeightInit.XAVIER)
                .updater(new Adam(1e-3))
                .list()
                .layer(new ConvolutionLayer.Builder(5, 5)
                        .stride(1,1)
                        .nOut(20)
                        .activation(Activation.IDENTITY)
                        .build())
                .layer(new SubsamplingLayer.Builder(PoolingType.MAX)
                        .kernelSize(2,2)
                        .stride(2,2)
                        .build())
                .layer(new ConvolutionLayer.Builder(5, 5)
                        .stride(1,1)
                        .nOut(50)
                        .activation(Activation.IDENTITY)
                        .build())
                .layer(new SubsamplingLayer.Builder(PoolingType.MAX)
                        .kernelSize(2,2)
                        .stride(2,2)
                        .build())
                .layer(new DenseLayer.Builder().activation(Activation.RELU)
                        .nOut(500).build())
                .layer(new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                        .nOut(outputNum)
                        .activation(Activation.SOFTMAX)
                        .build())
                .setInputType(InputType.convolutionalFlat(28,28,1))
                .build();
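
This configuration can then be wrapped in a MultiLayerNetwork and trained. A minimal sketch continuing the snippet above (it assumes seed and outputNum were defined earlier, e.g. outputNum = 10 for MNIST, and uses MnistDataSetIterator from the deeplearning4j-datasets artifact):

MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();
model.setListeners(new ScoreIterationListener(100));  // log the score every 100 iterations

// MNIST training set: batch size 64, training split, RNG seed 12345
DataSetIterator mnistTrain = new MnistDataSetIterator(64, true, 12345);
model.fit(mnistTrain, 1);  // train for one epoch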

Documentation, Guides and Tutorials

You can find the official documentation for Deeplearning4J and the other libraries of its ecosystem at http://deeplearning4j.konduit.ai/.

Want some examples?

We have a separate repository with various examples available: https://github.com/eclipse/deeplearning4j-examples

Building from source

It is preferred to use the official pre-compiled releases (see above). But if you want to build from source, first take a look at the prerequisites for building from source here: https://deeplearning4j.konduit.ai/multi-project/how-to-guides/build-from-source.

To build everything, we can use commands like

./change-cuda-versions.sh x.x
./change-scala-versions.sh 2.xx
./change-spark-versions.sh x
mvn clean install -Dmaven.test.skip -Dlibnd4j.cuda=x.x -Dlibnd4j.compute=xx

or

mvn -B -V -U clean install -pl <modules> -Dlibnd4j.platform=linux-x86_64 -Dlibnd4j.chip=cuda -Dlibnd4j.cuda=11.0 \
    -Dlibnd4j.compute=<your GPU CC> -Djavacpp.platform=linux-x86_64 -Dmaven.test.skip=true

where <modules> is the list of modules to build and <your GPU CC> is your GPU's compute capability (see below).

An example of GPU "CC" or compute capability is 61 for Titan X Pascal.

License

Apache License 2.0

Commercial Support

Deeplearning4J is actively developed by the team at Konduit K.K.

If you need any commercial support, feel free to reach out to us at [email protected]

Comments
  • [WIP] Keras upgrades

    Work in progress...

    Upgrades deeplearning4j-keras to be a little better structured and expands the API from Keras to DL4J. Encourages hijacking model methods rather than implementing an actual Keras backend, for efficiency and performance.

    Main goals of this PR include:

    • expanding Keras to better support DL4J
    • model saving methods via save_model
    • supporting Keras functional API
  • Implement new UI functionality using Play framework

    _WIP DO NOT MERGE_

    Play framework UI: builds upon earlier StatsListener and StatsStorage work implemented here: https://github.com/deeplearning4j/deeplearning4j/pull/2143

  • Fix RBMs and AE

    • Set up vb params to persist and be updated when in pretraining mode; previously the update step was being skipped
    • Added a pretraining flag to the configuration at the layer level, with a trigger to turn it off after the layer pretrains. LayerUpdater will skip vb params when running outside pretrain. In the previous setup, backprop was hard-coded to true in many cases when setting params or gradients, and vb (visible bias) would be skipped during the pretrain phase. With this change, getting the count for params or gradients, or updating them, takes vb into account; the updater simply applies no changes to vb when not in pretrain mode.
    • HiddenUnit is the activation in RBM - added backpropGradient and derivative for hidden unit in RBM to account for this fact
    • RBM needed a reverse sign on application of gradients for the step function
    • Deprecated unused code in RBM and cleaned up functions in AE that appeared out of date
    • Expanded RBM tests and fixed gradient checks
  • "A fatal error has been detected by the Java Runtime Environment" when running ParagraphVectors.inferVector(), 1.0.0-alpha

    Issue Description

    I submitted this issue before for DL4J v0.8.0, and thought it was resolved after upgrading to 1.0.0-alpha. However, when I built a new ParagraphVectors model and called the method inferVector() to infer a batch of new texts, the error came back again. The information about the issue is as follows:

    I'm running DL4J on my personal laptop, within Eclipse IDE. If I saved the ParagraphVectors model to a file and then loaded the model from the same file to call ParagraphVectors.inferVector, I received the error message of "A fatal error has been detected by the Java Runtime Environment". One error report is in attachment.

    I noticed that this issue appears to be more likely to happen when the new text is a (slightly) longer sentence. The data for training the model and the new texts are in Simplified Chinese, all properly processed before being passed to DL4J.

    The code snippet causing this issue is as follows, within a next() function of a DataSetIterator:

            for(int j=0; j<report.size(); j++){
                String stc = report.get(j);
                // this is where the problem is
                // m_SWV is loaded from a saved model, and proper TokenizerFactory has been set
                INDArray vector = ((ParagraphVectors)m_SWV).inferVector(stc);  
    
                features.put(new INDArrayIndex[]{NDArrayIndex.point(i), NDArrayIndex.all(), NDArrayIndex.point(j)}, vector);
                temp[1] = j;
                featuresMask.putScalar(temp, 1.0); 
            }
    

    Version Information

    Please indicate relevant versions, including, if relevant:

    • Deeplearning4j 1.0.0-alpha
    • platform information (OS, etc): DELL Inspiron 15 laptop with Windows 8 as OS
    • Java version: jdk1.8.0_60

    hs_err_pid4712_jdk1.8_60.log

  • Word2Vec/ParagraphVectors/DeepWalk Spark

    WIP; DO NOT MERGE;

    Word2Vec/ParagraphVectors/DeepWalk implementation for Spark, using VoidParameterServer available in ND4j

    Do not merge before this: https://github.com/deeplearning4j/nd4j/pull/1551

  • DL4J Hanging after "Loaded [JCublasBackend] backend"

    DL4J Hanging after "Loaded [JCublasBackend] backend"

    Hi,

    We are running some DL4J code as part of a wider system. This code runs fine on an Alienware development PC with CUDA 9.1 on Ubuntu, run from Eclipse.

    However, when we package this application and run it on a RHEL ppc64le server with CUDA 9.1, we see that ND4J is not doing anything after the following output:

    2309 [pool-8-thread-1] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [JCublasBackend] backend

    I have verified we are running the latest NVIDIA drivers and CUDA 9.1 is installed successfully. Below is the output from running the CUDA 9.1 sample deviceQuery, which lists the GPU devices:

     CUDA Device Query (Runtime API) version (CUDART static linking)
    
    Detected 4 CUDA Capable device(s)
    
    Device 0: "Tesla P100-SXM2-16GB"
      CUDA Driver Version / Runtime Version          9.1 / 9.1
      CUDA Capability Major/Minor version number:    6.0
      Total amount of global memory:                 16276 MBytes (17066885120 bytes)
      (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
      GPU Max Clock rate:                            1481 MHz (1.48 GHz)
      Memory Clock rate:                             715 Mhz
      Memory Bus Width:                              4096-bit
      L2 Cache Size:                                 4194304 bytes
      Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
      Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  2048
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
      Run time limit on kernels:                     No
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Enabled
      Device supports Unified Addressing (UVA):      Yes
      Supports Cooperative Kernel Launch:            Yes
      Supports MultiDevice Co-op Kernel Launch:      Yes
      Device PCI Domain ID / Bus ID / location ID:   2 / 1 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    
    Device 1: "Tesla P100-SXM2-16GB"
      CUDA Driver Version / Runtime Version          9.1 / 9.1
      CUDA Capability Major/Minor version number:    6.0
      Total amount of global memory:                 16276 MBytes (17066885120 bytes)
      (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
      GPU Max Clock rate:                            1481 MHz (1.48 GHz)
      Memory Clock rate:                             715 Mhz
      Memory Bus Width:                              4096-bit
      L2 Cache Size:                                 4194304 bytes
      Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
      Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  2048
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
      Run time limit on kernels:                     No
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Enabled
      Device supports Unified Addressing (UVA):      Yes
      Supports Cooperative Kernel Launch:            Yes
      Supports MultiDevice Co-op Kernel Launch:      Yes
      Device PCI Domain ID / Bus ID / location ID:   3 / 1 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    
    Device 2: "Tesla P100-SXM2-16GB"
      CUDA Driver Version / Runtime Version          9.1 / 9.1
      CUDA Capability Major/Minor version number:    6.0
      Total amount of global memory:                 16276 MBytes (17066885120 bytes)
      (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
      GPU Max Clock rate:                            1481 MHz (1.48 GHz)
      Memory Clock rate:                             715 Mhz
      Memory Bus Width:                              4096-bit
      L2 Cache Size:                                 4194304 bytes
      Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
      Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  2048
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
      Run time limit on kernels:                     No
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Enabled
      Device supports Unified Addressing (UVA):      Yes
      Supports Cooperative Kernel Launch:            Yes
      Supports MultiDevice Co-op Kernel Launch:      Yes
      Device PCI Domain ID / Bus ID / location ID:   6 / 1 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    
    Device 3: "Tesla P100-SXM2-16GB"
      CUDA Driver Version / Runtime Version          9.1 / 9.1
      CUDA Capability Major/Minor version number:    6.0
      Total amount of global memory:                 16276 MBytes (17066885120 bytes)
      (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
      GPU Max Clock rate:                            1481 MHz (1.48 GHz)
      Memory Clock rate:                             715 Mhz
      Memory Bus Width:                              4096-bit
      L2 Cache Size:                                 4194304 bytes
      Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
      Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  2048
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
      Run time limit on kernels:                     No
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Enabled
      Device supports Unified Addressing (UVA):      Yes
      Supports Cooperative Kernel Launch:            Yes
      Supports MultiDevice Co-op Kernel Launch:      Yes
      Device PCI Domain ID / Bus ID / location ID:   7 / 1 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    > Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU1) : Yes
    > Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU2) : No
    > Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU3) : No
    > Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU0) : Yes
    > Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU2) : No
    > Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU3) : No
    > Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU0) : No
    > Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU1) : No
    > Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU3) : Yes
    > Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU0) : No
    > Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU1) : No
    > Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU2) : Yes
    
    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.1, NumDevs = 4
    Result = PASS
    
    

    Can someone please help us with diagnosing this issue? It seems CUDA is installed correctly but DL4J is not producing any output and the following Java code is just hanging when calling Nd4j.create() for the first time:

    ...
    Nd4j.create()
    ...
    

    Note that this same code works fine on the AlienWare on Ubuntu 64 bit.

    Aha! Link: https://skymindai.aha.io/features/ND4J-143

  • Feature Request: Add Support for Apple Silicon M1

    Issue Description

    The new Apple Silicon M1 processor yields a javacpp.platform of macosx-arm64. These artifacts aren't available in the Maven Central Repository, which causes builds and IDEs on this new hardware to error or complain.

    See these two forum topics for more information: https://community.konduit.ai/t/support-for-apple-silicon-m1/1168 https://community.konduit.ai/t/compiling-on-arm/283

    Expected behavior: prebuilt jars for macosx-arm64 should exist in the Maven Central repository.

  • Convert Mat image to INDArray: when trying to convert a Mat image to an INDArray, it returns a null INDArray

    I have this code and I do not understand why my INDArray image is null when I try to convert a Mat to an INDArray. I am using Android Studio 3.0.1.

        //************************* Digit classification *******************************************************************
        for (int i = 0; i < rects.size(); i++) {
            Rect rect = rects.get(i);
            digit = inverted.submat(rect.y, rect.y + rect.height, rect.x, rect.x + rect.width);
            Imgproc.resize(digit, digit, new Size(28, 28));

            // use the NativeImageLoader to convert the Mat to a numerical matrix
            NativeImageLoader nativeImageLoader = new NativeImageLoader(digit.height(), digit.width(), digit.channels());
            INDArray image = nativeImageLoader.asMatrix(digit); // put image into INDArray

            System.out.println("carregar modelo matrixes  " + image);
        }

    output: carregar modelo matrixes NULL

  • Add CenterLossOutputLayer for efficient training

    Work in progress...

    Center loss has proven to be more efficient than triplet loss, and it enables classifier training that is also faster than with triplets.

    @AlexDBlack can you take a look at CenterLossParamInitializer and confirm it's on the right track? Also, should we just specify numClasses in layer conf? Let's keep discussion in Gitter :)

  • Can not run CUDA example on Jetson TX1

    Issue Description

    deeplearning4jtest-1.0/bin/deeplearning4jtest 10000 10
    09:07:35.540 [main] INFO deeplearning4jtest.CSVExample - Build model....
    09:07:35.652 [main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [JCublasBackend] backend
    Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.nd4j.jita.concurrency.CudaAffinityManager.getNumberOfDevices(CudaAffinityManager.java:173)
        at org.nd4j.jita.constant.ConstantProtector.purgeProtector(ConstantProtector.java:36)
        at org.nd4j.jita.constant.ConstantProtector.(ConstantProtector.java:29)
        at org.nd4j.jita.constant.ConstantProtector.(ConstantProtector.java:19)
        at org.nd4j.jita.constant.ProtectedCudaConstantHandler.(ProtectedCudaConstantHandler.java:45)
        at org.nd4j.jita.constant.CudaConstantHandler.(CudaConstantHandler.java:17)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:264)
        at org.nd4j.linalg.factory.Nd4j.initWithBackend(Nd4j.java:5753)
        at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5694)
        at org.nd4j.linalg.factory.Nd4j.(Nd4j.java:184)
        at org.deeplearning4j.nn.conf.NeuralNetConfiguration$Builder.seed(NeuralNetConfiguration.java:677)
        at deeplearning4jtest.CSVExample.main(CSVExample.java:54)
    Caused by: java.lang.RuntimeException: ND4J is probably missing dependencies. For more information, please refer to: http://nd4j.org/getstarted.html
        at org.nd4j.nativeblas.NativeOpsHolder.(NativeOpsHolder.java:51)
        at org.nd4j.nativeblas.NativeOpsHolder.(NativeOpsHolder.java:19)
        ... 13 more
    Caused by: java.lang.UnsatisfiedLinkError: no jnind4jcuda in java.library.path
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
        at java.lang.Runtime.loadLibrary0(Runtime.java:870)
        at java.lang.System.loadLibrary(System.java:1122)
        at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:963)
        at org.bytedeco.javacpp.Loader.load(Loader.java:764)
        at org.bytedeco.javacpp.Loader.load(Loader.java:671)
        at org.nd4j.nativeblas.Nd4jCuda.(Nd4jCuda.java:10)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.bytedeco.javacpp.Loader.load(Loader.java:726)
        at org.bytedeco.javacpp.Loader.load(Loader.java:671)
        at org.nd4j.nativeblas.Nd4jCuda$NativeOps.(Nd4jCuda.java:62)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:264)
        at org.nd4j.nativeblas.NativeOpsHolder.(NativeOpsHolder.java:29)
        ... 14 more
    Caused by: java.lang.UnsatisfiedLinkError: no nd4jcuda in java.library.path
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
        at java.lang.Runtime.loadLibrary0(Runtime.java:870)
        at java.lang.System.loadLibrary(System.java:1122)
        at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:963)
        at org.bytedeco.javacpp.Loader.load(Loader.java:752)
        ... 24 more

    Version Information

    Please indicate relevant versions, including, if relevant:

    • Deeplearning4j version - 0.8.0
    • platform information (OS, etc) - Ubuntu 16.04, arm64, Jetson TX1
    • CUDA version, if used - 8.0
    • NVIDIA driver version, if in use -

    Contributing

    If you'd like to help us fix the issue by contributing some code, but would like guidance or help in doing so, please mention it! - I could help, if I can.

  • libopenblas_nolapack.so.0: cannot open shared object file: No such file or directory

    Hello,

    I've just tried to run my application on beta2 and I've got the following exception: Caused by: java.lang.UnsatisfiedLinkError: /app/.javacpp/cache/openblas-0.3.0-1.4.2-linux-x86_64.jar/org/bytedeco/javacpp/linux-x86_64/libjniopenblas_nolapack.so: libopenblas_nolapack.so.0: cannot open shared object file: No such file or directory

    You can find full stacktrace here - https://gist.github.com/sergmain/0685cda1456721595637def8ca347662

    A few days ago, I opened issue https://github.com/deeplearning4j/deeplearning4j/issues/6083. Since then, the issue was fixed and beta2 was released.

    I rolled back to beta and my application started to work.

    There is a stub project for reproducing this problem on Heroku: https://github.com/sergmain/dl4j-uber-jar. It doesn't contain an actual Keras model, but you can use any.

    Summary: beta - working; beta2 - not working. Target OS: Heroku's PaaS. The target platform for DL4J is specified in /.mvn/jvm.config.

  • Synchronization of loss variables between SameDiff and TrainingConfig

    Issue Description

    This is a follow-up of issue #9684. When fitting a SameDiff graph, the reported loss (as per ScoreListener and LossCurve) is constantly zero although the learning succeeds. The workaround is to manually pass the names of the loss variables to the used TrainingConfig instance. It would still be nice if this wasn't necessary.

    Example:

    int batchSize = 4;
    int modelDim = 8;
    
    SameDiff sd = SameDiff.create();
    
    SDVariable features = sd.placeHolder("features", FLOAT, batchSize, modelDim);
    SDVariable labels = sd.placeHolder("labels", FLOAT, batchSize, modelDim);
    SDVariable bias = sd.var("bias", new OneInitScheme('c'), FLOAT, modelDim);
    SDVariable predictions = features.add("predictions", bias);
    sd.loss.meanSquaredError("loss", labels, predictions, null);
    
    TrainingConfig config = new TrainingConfig.Builder()
            .updater(new Adam(0.1))
            .dataSetFeatureMapping("features")
            .dataSetLabelMapping("labels")
            // .lossVariables(List.of("loss")) // <<< this line fixes the problem
            .build();
    sd.setTrainingConfig(config);
    
    // Task: output must be equal to input
    RecordReader reader = new CollectionRecordReader(
            Collections.nCopies(batchSize, Collections.nCopies(2 * modelDim, new IntWritable(1))));
    DataSetIterator iterator = new RecordReaderDataSetIterator(
            reader, batchSize, modelDim, 2 * modelDim - 1, true);
    
    // ScoreListener will consistently report a loss of 0
    History hist = sd.fit(iterator, 10, new ScoreListener(1));
    
    // The recorded loss curve is also constantly 0
    LossCurve curve = hist.lossCurve();
    System.out.println("Loss curve:\n" + curve.getLossValues());
    
    // However, the loss calculated here is > 0
    Map<String, INDArray> map = sd.output(iterator, "loss");
    System.out.println("Final loss: " + map.get("loss"));
    

    I'm not entirely sure if this is a bug or a feature. The reason for the described behavior seems to be that the SameDiff and TrainingConfig instances store the loss variables separately (and ScoreListener/LossCurve report the losses from the latter). When the TrainingConfig is assigned to the SameDiff via setTrainingConfig() (see here), the loss variables of the TrainingConfig are copied to the SameDiff, but not vice versa. Copying in the opposite direction as well could be problematic if the user wants to use one TrainingConfig for multiple different SameDiff graphs. But you can judge this way better.

    Version Information

    • Deeplearning4j version: 1.0.0-SNAPSHOT
    • Platform information (OS, etc): Linux Mint 21
    • CUDA version, if used: N/A
    • NVIDIA driver version, if in use: N/A
  • NLP fixes 2 (better batching)

    What changes were proposed in this pull request?

    Adds better batching (accumulate instead of exec) during training, relative to a configured batch size. Adds vector calculation threads as a configuration option. Removes unused fields. Reduces the frequency of updates to just thread locals in fixed places, to improve training performance. Moves learning rate decay to after updates, to ensure slower decay.

    How was this patch tested?

    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

    Quick checklist

    The following checklist helps ensure your PR is complete:

    • [ X] Eclipse Contributor Agreement signed, and signed commits - see IP Requirements page for details
    • [ X] Reviewed the Contributing Guidelines and followed the steps within.
    • [ X] Created tests for any significant new code additions.
    • [ X] Relevant tests for your changes are passing.
  • Support using ParallelInference to get the output of an internal layer

    Issue Description

    Currently I am relying on the method

    public INDArray[] output(List<String> layers, boolean train, INDArray[] features, INDArray[] featureMasks)
    

    to get the outputs of internal ComputationGraph layers.

    I would like to use the ParallelInference wrapper but it does not support that method.

    Reference to community post (https://community.konduit.ai/t/using-parallelinference-to-get-a-specific-layer/2054)

    Version Information

    • Deeplearning4j version 1.0.0-M2.1
    • Platform information - Arch Linux
    • CPU Only
  • Python4J Further Performance Optimizations

    Follow up ticket for review findings from: https://github.com/eclipse/deeplearning4j/issues/9595#issuecomment-1146085342

    • PythonTypes still initializes an array of types for each conversion; this was one of the biggest performance bottlenecks that I removed in my version by introducing static fields for these (they are immutable anyhow)
    • UncheckedPythonInterpreter should not always store retrieved variables in a map. First, this introduces a memory leak because the map has no eviction; second, it slows down the path for retrievals where caching is not needed. It would be better to introduce a CachedPythonInterpreter (as a wrapper delegate) as a separate class, so callers can opt in to caching, or to leave caching in the hands of the client that calls the interpreter (see the sketch below). Creating an entry in the map and creating a Pair object is also garbage-collector overhead (further away from zero-allocation principles). Also, I don't know if it is thread-safe to share variable instances between interpreters, because the ConcurrentHashMap is static instead of inside the ThreadLocal.
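
    A hypothetical sketch of the suggested opt-in caching wrapper; the PythonInterpreter interface and getVariable method here are illustrative stand-ins, not the actual Python4J API:

        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;

        // Illustrative interface standing in for the real interpreter type.
        interface PythonInterpreter {
            Object getVariable(String name);
        }

        // Opt-in caching as a wrapper delegate: callers that don't need caching
        // keep using the plain interpreter and pay no map/Pair allocation cost.
        final class CachedPythonInterpreter implements PythonInterpreter {
            private final PythonInterpreter delegate;
            private final Map<String, Object> cache = new ConcurrentHashMap<>();

            CachedPythonInterpreter(PythonInterpreter delegate) {
                this.delegate = delegate;
            }

            @Override
            public Object getVariable(String name) {
                // compute once per name, then serve from the cache
                return cache.computeIfAbsent(name, delegate::getVariable);
            }

            // Explicit eviction keeps the cache from growing without bound.
            void evict(String name) {
                cache.remove(name);
            }
        }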
  • Support for using sd.grad output as an intermediate variable

    Issue Description

    Currently, it is impossible to use the output of sd.grad as a variable for further computations. Consider the following class:

    package com.valb3r.idr.networks;
    
    import org.nd4j.autodiff.samediff.SDVariable;
    import org.nd4j.autodiff.samediff.SameDiff;
    import org.nd4j.linalg.api.buffer.DataType;
    import org.nd4j.weightinit.impl.XavierInitScheme;
    
    public class Issue {
    
        public static void main(String[] args) {
            SameDiff sd = SameDiff.create();
            //Create input and label variables
            SDVariable sdfPoint = sd.placeHolder("point", DataType.FLOAT, -1, 3); //Shape: [?, 3]
            SDVariable ray = sd.placeHolder("ray", DataType.FLOAT, -1, 3); //Shape: [?, 3]
            SDVariable expectedColor = sd.placeHolder("expected-color", DataType.FLOAT, -1, 3); //Shape: [?, 3]
    
            SDVariable sdfInput = denseLayer(sd, 10, 3, sdfPoint);
            SDVariable sdf = denseLayer(sd, 3, 10, sdfInput);
            sdf.markAsLoss();
    
            SDVariable idrRenderGradient = sd.grad(sdfPoint.name());
            SDVariable dotGrad = idrRenderGradient.dot(ray); // org.nd4j.autodiff.util.SameDiffUtils.validateDifferentialFunctionSameDiff(SameDiffUtils.java:134)
    
            sd.loss().meanSquaredError(expectedColor, dotGrad, null);
        }
    
        private static SDVariable denseLayer(SameDiff sd, int nOut, int nIn, SDVariable input) {
            SDVariable w = sd.var(input.name() + "-w1", new XavierInitScheme('c', nIn, nOut), DataType.FLOAT, nIn, nOut);
            SDVariable b = sd.zero(input.name() + "-b1", 1, nOut);
            SDVariable z = input.mmul(w).add(b);
            return sd.nn().tanh(z);
        }
    }
    

    The variable idrRenderGradient is expected to be the gradient of the sdf variable and should be usable in the computation graph, but unfortunately that is not the case; the line SDVariable dotGrad = idrRenderGradient.dot(ray); throws an exception:

    Exception in thread "main" java.lang.IllegalStateException
    	at org.nd4j.common.base.Preconditions.checkState(Preconditions.java:253)
    	at org.nd4j.autodiff.util.SameDiffUtils.validateDifferentialFunctionSameDiff(SameDiffUtils.java:134)
    	at org.nd4j.linalg.api.ops.BaseReduceOp.<init>(BaseReduceOp.java:85)
    	at org.nd4j.linalg.api.ops.BaseReduceOp.<init>(BaseReduceOp.java:114)
    

    For self-contained minimum reproducible example, please see: https://github.com/valb3r/same-diff/blob/master/src/main/java/com/valb3r/idr/networks/Issue.java

    For more details on discussion, please see: https://community.konduit.ai/t/using-gradient-as-an-intermediate-sdvariable/1890

    Version Information

    Please indicate relevant versions, including, if relevant:

    • Deeplearning4j version: 1.0.0-M1.1, 1.0.0-M2
    • Platform information: MacOS, Apple Silicon, CPU