Deep Scalable Sparse Tensor Network Engine (DSSTNE) is an Amazon-developed library for building deep learning (DL) machine learning (ML) models

Amazon DSSTNE: Deep Scalable Sparse Tensor Network Engine

DSSTNE (pronounced "Destiny") is an open source software library for training and deploying recommendation models with sparse inputs, fully connected hidden layers, and sparse outputs. Models with weight matrices that are too large for a single GPU can still be trained on a single host. DSSTNE has been used at Amazon to generate personalized product recommendations for our customers at Amazon's scale. It is designed for production deployment of real-world applications that need to emphasize speed and scale over experimental flexibility.

DSSTNE was built with a number of features for production recommendation workloads:

  • Multi-GPU Scale: Training and prediction both scale out to use multiple GPUs, spreading out computation and storage in a model-parallel fashion for each layer.
  • Large Layers: Model-parallel scaling enables larger networks than are possible with a single GPU.
  • Sparse Data: DSSTNE is optimized for fast performance on sparse datasets, common in recommendation problems. Custom GPU kernels perform sparse computation on the GPU, without filling in lots of zeroes.
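
As a rough illustration of the last point (a simplified sketch, not DSSTNE's actual kernel; the kernel name and memory layout here are illustrative): with sparse binary inputs, each hidden unit's pre-activation reduces to a sum of the weight rows selected by the active input indices, so the kernel never materializes the zeros:

    #include <cstdint>

    // Simplified sketch of a sparse forward pass for binary inputs:
    // one block per example, threads striding across hidden units.
    __global__ void kSparseBinaryZ(const uint32_t* start,       // offset of each example's indices
                                   const uint32_t* end,         // one past the example's last index
                                   const uint32_t* sparseIndex, // active (nonzero) input indices
                                   const float* weight,         // [inputs x hidden], row-major
                                   float* unit,                 // [examples x hidden] pre-activations
                                   uint32_t hidden)
    {
        uint32_t example = blockIdx.x;
        for (uint32_t h = threadIdx.x; h < hidden; h += blockDim.x)
        {
            float sum = 0.0f;
            // Only the active indices are visited; zeros are never touched.
            for (uint32_t i = start[example]; i < end[example]; i++)
                sum += weight[sparseIndex[i] * hidden + h]; // input value is implicitly 1
            unit[example * hidden + h] = sum;
        }
    }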

Benchmarks

Scaling up

License

License

Setup

  • Follow Setup for step-by-step instructions on installing and setting up DSSTNE

User Guide

  • Check User Guide for detailed information about the features in DSSTNE

Examples

  • Check Examples to get started modeling your first neural network with DSSTNE

Q&A

FAQ

Comments
  • Use a smart pointer instead of a raw pointer.

    I split the patch from #85 into one commit per class.

    All commits in this PR can be applied separately. Let me know if there is a problem with a commit, if it conflicts with your development, or if you think it needs a closer look; I'll update the PR, leaving those commits out.

  • Low ranking accuracy of the example with MovieLens20M?

    Hi,

    I've been playing around today with DSSTNE, with the goal of running the MovieLens20M example and comparing the NN in the example against some state-of-the-art CF algorithms that I have implemented here. From my evaluation (which is by no means exhaustive or perfect), the example provided by DSSTNE does not seem to be competitive with state-of-the-art CF algorithms.

    To summarise: I downloaded the original MovieLens 20M dataset and performed a random 80%-20% partition. I transformed the training subset to the DSSTNE format, with the one difference that I do not include the timestamps from the dataset, but 1's for all movies (is this actually important?). I generated recommendations with my CF algorithms (popularity, user-based and matrix factorisation) and, following the steps in the example, the predictions of DSSTNE. Finally, I evaluated the performance on the testing subset using precision at cutoff 10.
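
    For clarity, by precision at cutoff 10 I mean the fraction of the 10 recommended items that appear in the user's test subset, averaged over users. A minimal sketch of the metric (not my actual evaluation code):

    #include <algorithm>
    #include <cstddef>
    #include <unordered_set>
    #include <vector>

    // Precision@k for a single user: the number of the top-k recommended
    // items that appear in that user's held-out test items, divided by k.
    double precisionAtK(const std::vector<int>& ranked,
                        const std::unordered_set<int>& testItems,
                        std::size_t k = 10)
    {
        std::size_t hits = 0;
        for (std::size_t i = 0; i < std::min(k, ranked.size()); i++)
            hits += testItems.count(ranked[i]);
        return static_cast<double>(hits) / static_cast<double>(k);
    }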

    These are the results; the configuration provided in your example does not seem to work very well:

    • pop: 0.10974162112149495
    • ub: 0.24097987334078072
    • mf: 0.25135912784469483
    • dsstne: 0.056956854920365056

    I am no expert in ANNs, so I cannot easily tell whether I should modify the parameters in the config.json provided in the example to make it work better. Have you compared the performance of the example with similar CF algorithms? If so, could you please share some results/insights?

    Cheers Saúl

  • Changes to allow NetCDFhelper unit tests to compile without CUDA-related dependencies

    This PR includes a set of changes that will allow tests for NetCDFhelper.cpp/h to be included in the unit test suite without pulling in CUDA-related dependencies. I recommend reviewing the PR by commit, as each commit neatly captures a set of changes that I have made.

    The first and most significant change was to lift the nested enum types from NNDataSet (NNTypes.h) into a separate header file (NNEnum.h); a simplified sketch of the lifted header follows the list below. This has two advantages:

    • NNEnum.h has no external dependencies other than the STL, so it can easily be used outside of engine code. As it is, these enum values were the only declarations from NNTypes.h that were used in NetCDFhelper.
    • It becomes clear which namespace/class the enum values should be qualified with. Most of the changes in this commit are to correctly qualify usage of those enum values.
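
    For illustration, the lifted header looks roughly like this (a simplified sketch; the actual header carries the full set of enum values):

    // NNEnum.h -- simplified sketch: the enums lifted out of NNDataSet so
    // that STL-only code such as NetCDFhelper can use them without pulling
    // in the CUDA-dependent engine headers.
    #ifndef NNENUM_H
    #define NNENUM_H

    namespace NNDataSetEnums {

    enum Attributes
    {
        Sparse  = 0x1,
        Boolean = 0x2
        // ... remaining attribute flags elided
    };

    enum DataType
    {
        UInt = 0,
        Int  = 1
        // ... remaining data types elided
    };

    } // namespace NNDataSetEnums

    #endif // NNENUM_H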

    The second commit tidies up NetCDFhelper.h. This should all be pretty self-explanatory.

    Finally, the third commit updates the unit test build and Travis CI configuration to bring in the necessary NetCDF dependencies to compile NetCDFhelper.cpp.

  • Segmentation fault

    Hi,

    I completed the training and then ran the prediction, but got the exception below. Do you have any suggestions?

    BTW, just one neural network got this error; the others are OK.

    =========== exception messages =========
    Exported gl_input_predict.samplesIndex with 65075 entries.
    Raw max index is: 65064
    Rounded up max index to: 65152
    Created NetCDF file gl_input_predict.nc for dataset gl_input
    Number of network input nodes: 65064
    Number of entries to generate predictions for: 65075
    LoadNetCDF: Loading UInt data set
    NNDataSet::NNDataSet: Name of data set: gl_input
    NNDataSet::NNDataSet: Attributes: Sparse Boolean
    NNDataSet::NNDataSet: 1-dimensional data comprised of (65152, 1, 1) datapoints.
    NNDataSet::NNDataSet: 3778407 total datapoints.
    NNDataSet::NNDataSet: 65075 examples.
    [snx-dsstne:02608] *** Process received signal ***
    [snx-dsstne:02608] Signal: Segmentation fault (11)
    [snx-dsstne:02608] Signal code: Address not mapped (1)
    [snx-dsstne:02608] Failing at address: 0xb3a1840
    [snx-dsstne:02608] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f02cc834330]
    [snx-dsstne:02608] [ 1] predict[0x430d26]
    [snx-dsstne:02608] [ 2] predict[0x453fa0]
    [snx-dsstne:02608] [ 3] predict[0x42a87b]
    [snx-dsstne:02608] [ 4] predict[0x408307]
    [snx-dsstne:02608] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f02cc480f45]
    [snx-dsstne:02608] [ 6] predict[0x40aab1]
    [snx-dsstne:02608] *** End of error message ***
    Segmentation fault (core dumped)

  • Questions about Dataset

    Hi, I was playing with the sample data, and now I have 3 questions.

    Q1. How do I make a dataset with multiple feature values? Currently, each feature has one feature value. Is it possible for a feature to have multiple values? If so, how can I do that?

    Q2. Manually changing all timestamps to 1 gives me a different result. ml20m-all is the dataset of userId and movieId with timestamp:

    userId movieId,timestamp: movieId,timestamp: movieId,timestamp…

    On issue #21, Mr. Rejith said “Currently no movie features are taken. Currently only 1/0 signals are supported from the wrapper script even though the Engine supports analog signals.” So I changed all timestamps in ml20m-all to 1 and ran DSSTNE with the modified data, e.g. 2,1112486027:29,1112484676:32,1112484819 becomes 2,1:29,1:32,1. I thought the results would be the same, but they were not. I am guessing that DSSTNE treats the feature value as a continuous value. Is this right? Then why did DSSTNE give me a different result?

    Q3. Does DSSTNE support digital inputs? On issue #11, Mr. Rejith said “DSSTNE Engine supports analog inputs but we have not exposed it in the wrapper. If the rating comes it could be viewed as an analog signal.” Analog inputs like ratings are continuous values, so I wondered whether DSSTNE supports digital inputs, like category IDs, which are discrete values.

    DSSTNE is wonderful. I feel like it has so much potential. But I couldn't figure out how to use it well, and I couldn't find detailed documentation online.

    Thank you, yuasa

  • How to continue learning?

    Hi,

    I have used 'train -n gl.nc ...' to get a network file. In the future I may have new sample data that can be used for training. How do I continue training? I ran the command 'train -n gl.nc ...' again, but got the error below:

    Error: Network file already exists: gl.nc

    Thanks.

  • Not able to compile dsstne due to the gpu architecture

    Hi,

    I am trying to set up DSSTNE on AWS following this guide, but I have run into problems with my GPU architecture:

    [email protected]:~/amazon-dsstne/src/amazon/dsstne$ make
    cd engine && make
    ************ RELEASE mode ************
    make[1]: Entering directory '/home/ubuntu/amazon-dsstne/src/amazon/dsstne/engine'
    nvcc -use_fast_math --ptxas-options="-v" -gencode arch=compute_50,code=sm_50 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_60,code=sm_60 -DOMPI_SKIP_MPICXX -std=c++11 -I/usr/local/cuda/include -IB40C -IB40C/KernelCommon -I/usr/local/include -I/usr/local/openmpi/include -I/usr/include/jsoncpp -I../utils -I../engine -c kernels.cu
    nvcc fatal : Unsupported gpu architecture 'compute_60'
    make[1]: *** [kernels.o] Error 1
    make[1]: Leaving directory '/home/ubuntu/amazon-dsstne/src/amazon/dsstne/engine'
    make: *** [lib/libdsstne.a] Error 2

    On my EC2 instance I have previously installed TensorFlow plus CUDA, cuDNN and Bazel. Versions are:

    • tensorflow-0.11.0rc0-cp27-none-linux_x86_64
    • cuda-repo-ubuntu1404_7.5-18_amd64
    • cudnn-7.5-linux-x64-v5.1
    • bazel-0.3.2

    Everything seems to work fine except DSSTNE.
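
    For what it's worth, here is a small check I can run to see what my GPU actually supports (a generic CUDA snippet, not part of DSSTNE), so the -gencode flags can be matched to the device and the installed toolkit:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Prints each visible GPU's compute capability so the Makefile's
    // -gencode arch=compute_XX flags can be matched to the hardware
    // and to what the installed CUDA toolkit actually supports.
    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int i = 0; i < count; i++)
        {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            std::printf("GPU %d: %s, compute capability %d.%d\n",
                        i, prop.name, prop.major, prop.minor);
        }
        return 0;
    }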

  • hdf5 1.8.9 does not support gcc 4.9.

    The setup guide says to use hdf5 1.8.9, but it is broken and cannot be compiled (with either gcc-4.9 or gcc-5).

    th5s.c:733:9: error: C++ style comments are not allowed in ISO C90
             // ret = H5Pset_alloc_time(plist_id, alloc_time);
             ^
    th5s.c:733:9: error: (this will be reported only once per input file)
    

    I think you should update the document to use 1.8.12, which fixes this compile error.

  • Error: invalid device function launching kernel kScaleAndBias_kernel

    Hi All,

    I am following the MovieLens example and got the error below:

    train -c config.json -i gl_input.nc -o gl_output.nc -n gl.nc -b 256 -e 10

    NNNetwork::NNNetwork: 1 input layer
    NNNetwork::NNNetwork: 1 output layer
    NNWeight::NNWeight: Allocating 13697024 bytes (128, 26752) for weights between Input and Hidden
    Error: invalid device function launching kernel kScaleAndBias_kernel
    GpuContext::Shutdown: Shutting down cuBLAS on GPU for process 0
    GpuContext::Shutdown: CuBLAS shut down on GPU for process 0
    GpuContext::Shutdown: Shutting down cuRand on GPU for process 0
    GpuContext::Shutdown: CuRand shut down on GPU for process 0
    GpuContext::Shutdown: Process 0 out of 1 finalized.

    Could you suggest some hints to help me figure out what I did wrong?

    Thanks a lot.

  • Trouble with AWS GPU instance and Docker

    Hello,

    I created an AWS GPU instance from the AMI called "amazon-dsstne" (ami-7a0df81a) and built a Docker image of DSSTNE there. I followed the example in the documentation. Then when running "train", it fails with the following error message:

    modprobe: ERROR: ../libkmod/libkmod.c:556 kmod_search_moddep() could not open moddep file '/lib/modules/3.13.0-83-generic/modules.dep.bin'
    cudaGetDeviceCount failed unknown error

    Thanks.

  • Issue when following the example

    Hello all,

    I am following the MovieLens example and got this error:

    generateNetCDF -d gl_input -i ml20m-all -o gl_input.nc -f features_input -s samples_input -c
    generateNetCDF: error while loading shared libraries: libnetcdf_c++4.so.1: cannot open shared object file: No such file or directory
    

    I installed all the libraries listed on the homepage (I am running Ubuntu 14.04 with CUDA on the GPU). Could you suggest some hints to help me figure out what I did wrong?

    Thanks a lot,

  • AMI in the setup guide is not available

    Hi team,

    The AMI ami-fe173884 provided in the setup guide is no longer available. https://github.com/amzn/amazon-dsstne/blob/master/docs/getting_started/setup.md#ami-with-nvidia-docker

    I also tried the EC2 Launch Instance Wizard, and it results in the error below when trying to launch an EC2 instance:


    We cannot proceed with your requested configuration. You cannot use this AMI (ami-25c0eb32). This AMI has been deregistered, or you do not have permission to use it. Try again with another AMI for which you have permissions, or request permission to use this AMI from its owner.

  • Test dsstne module via python fails

    Hi, I'm trying to test the dsstne module from Python, following this document. I also changed the alpha to a lower value, but am still getting a malloc error. Any suggestions? Here's the complete error log:

    NNLayer::Allocate: Allocating 3538944 bytes (864, 1024) of delta data for layer P3
    NNLayer::Deallocate: Deallocating all data for layer Hidden10
    NNLayer::Allocate: Allocating 524288 bytes (128, 1024) of unit data for layer Hidden10
    NNLayer::Allocate: Allocating 524288 bytes (128, 1024) of delta data for layer Hidden10
    NNLayer::Allocate: Allocating 524288 bytes (128, 1024) of dropout data for layer Hidden10
    NNLayer::Deallocate: Deallocating all data for layer Output
    NNLayer::Allocate: Allocating 40960 bytes (10, 1024) of unit data for layer Output
    NNLayer::Allocate: Allocating 40960 bytes (10, 1024) of delta data for layer Output
    NNDataSet<T>::Shard: Model Sharding sparse dataset output across all GPUs.
    Getting algorithm between Input and C1
    Getting algorithm between C1 and C1a
    Getting algorithm between P1 and C2
    Getting algorithm between C2 and C2a
    Getting algorithm between P2 and C3
    Getting algorithm between C3 and C3a
    NNNetwork::RefreshState: Setting cuDNN workspace size to 4442259456 bytes.
    GpuBuffer::Allocate failed (cudaMalloc) out of memory
    python: GpuTypes.h:522: void GpuBuffer<T>::Allocate() [with T = unsigned char]: Assertion `0' failed.
    [e501c14cf80e:19359] *** Process received signal ***
    [e501c14cf80e:19359] Signal: Aborted (6)
    [e501c14cf80e:19359] Signal code:  (-6)
    [e501c14cf80e:19359] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fbe10a30390]
    [e501c14cf80e:19359] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7fbe1068a428]
    [e501c14cf80e:19359] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7fbe1068c02a]
    [e501c14cf80e:19359] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2dbd7)[0x7fbe10682bd7]
    [e501c14cf80e:19359] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2dc82)[0x7fbe10682c82]
    [e501c14cf80e:19359] [ 5] /usr/local/lib/python2.7/dist-packages/dsstne.so(_ZN9NNNetwork12RefreshStateEv+0x4a0)[0x7fbdffeebe80]
    [e501c14cf80e:19359] [ 6] /usr/local/lib/python2.7/dist-packages/dsstne.so(_ZN9NNNetwork5TrainEjfffff+0x69)[0x7fbdffef5289]
    [e501c14cf80e:19359] [ 7] /usr/local/lib/python2.7/dist-packages/dsstne.so(_ZN18NNNetworkFunctions5TrainEP7_objectS1_+0xfb)[0x7fbdffe3cccb]
    [e501c14cf80e:19359] [ 8] python(PyEval_EvalFrameEx+0x5ca)[0x4bc9ba]
    [e501c14cf80e:19359] [ 9] python(PyEval_EvalCodeEx+0x306)[0x4ba036]
    [e501c14cf80e:19359] [10] python[0x4eb32f]
    [e501c14cf80e:19359] [11] python(PyRun_FileExFlags+0x82)[0x4e5592]
    [e501c14cf80e:19359] [12] python(PyRun_SimpleFileExFlags+0x186)[0x4e3e46]
    [e501c14cf80e:19359] [13] python(Py_Main+0x54e)[0x493ade]
    [e501c14cf80e:19359] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fbe10675830]
    [e501c14cf80e:19359] [15] python(_start+0x29)[0x4934a9]
    [e501c14cf80e:19359] *** End of error message ***
    Aborted (core dumped)
    
  • Input stream data and updating model

    So I have some questions about input data and updating the trained model:

    I've seen in the documentation that we should provide a CSV file, which is then converted to the desired input format using generateNetCDF. But I want to know whether it's possible to feed streaming data to this engine, and if so, how? Assume a website with many users whose data changes and updates every moment, while new users also join the website, create new profiles, and produce new data. How is the model then going to be updated?

    Regards,

  • Fast Forward DSSTNE to NVLINK and faster sparse kernels

    Issue #, if available: No issue

    Description of changes: Adds NVLINK support through DGX-1V and enables more efficient sparse kernels for smaller embedding widths.

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

  • build error due to legacy shuffle API

    It threw:

    nvcc -O3 -std=c++11 --compiler-options=-fPIC -use_fast_math --ptxas-options="-v" -gencode arch=compute_70,code=sm_70 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_30,code=sm_30 -DOMPI_SKIP_MPICXX --keep-dir /test/workspace/ctr/amazon-dsstne/build/tmp/engine/cuda -I/usr/local/include -isystem /usr/local/cuda/include -isystem /usr/lib/openmpi/include -isystem /usr/include/jsoncpp -IB40C -IB40C/KernelCommon -I/test/workspace/ctr/amazon-dsstne/build/include -I../utils  -c kernels.cu -o /test/workspace/ctr/amazon-dsstne/build/tmp/engine/cuda/kernels.o
    ptxas /tmp/tmpxft_00000a17_00000000-8_kernels.compute_70.ptx, line 61962; error   : Instruction 'shfl' without '.sync' is not supported on .target sm_70 and higher from PTX ISA version 6.4
    

    The old CUDA shuffle API was deprecated in CUDA 9.0 and is no longer available after CUDA 10 (link).
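
    The migration is mostly mechanical: each legacy intrinsic gains a _sync suffix and an explicit mask of participating lanes. A generic sketch (not a patch against DSSTNE's kernels):

    // Warp-wide sum across 32 lanes (launch with one 32-thread block).
    __global__ void kWarpSumExample(const float* in, float* out)
    {
        float value = in[threadIdx.x];
        for (int offset = 16; offset > 0; offset >>= 1)
        {
            // Legacy form (rejected for sm_70+): value += __shfl_xor(value, offset);
            value += __shfl_xor_sync(0xffffffff, value, offset); // CUDA 9.0+ API
        }
        if (threadIdx.x == 0)
            *out = value;
    }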

    • CUDA: 10.1
    • GPU: NVIDIA Tesla V100