Toy path tracer for my own learning purposes (CPU/GPU, C++/C#, Win/Mac/Wasm, DX11/Metal, also Unity)

Toy Path Tracer

Toy path tracer for my own learning purposes, using various approaches/techs. Somewhat based on Peter Shirley's Ray Tracing in One Weekend minibook (highly recommended!), and on Kevin Beason's smallpt.

Screenshot

I decided to write blog posts about things I discover as I do this.

Right now: it can only do spheres, no bounding volume hierarchy of any sort, and a lot of stuff is hardcoded.

Implementations I'm playing with (again, everything is in toy/learning/WIP state; likely suboptimal) are below. These are all on a scene with ~50 spheres and two light sources, measured in Mray/s.

  • CPU. Testing on "PC" AMD ThreadRipper 1950X 3.4GHz (SMT disabled, 16c/16t) and "Mac" mid-2018 MacBookPro i9 2.9GHz (6c/12t):
    • C++ w/ some SSE SIMD: PC 187, Mac 74, iPhone X (A11) 12.9, iPhone SE (A9) 8.5
    • C++: PC 100, Mac 35.7
    • C# (Unity with Burst compiler w/ some 4-wide SIMD): PC 133, Mac 60. Note that this is an early version of Burst.
    • C# (Unity with Burst compiler): PC 82, Mac 36. Note that this is an early version of Burst.
    • C# (.NET Core): PC 53, Mac 23.6
    • C# (Mono with optimized settings): Mac 22.0
    • C# (Mono defaults): Mac 6.1
    • WebAssembly (single threaded, no SIMD): 4.5-5.5 Mray/s on PCs, 2.0-4.0 Mray/s on mobiles.
  • GPU. Simplistic ports to compute shader:
    • PC D3D11. GeForce GTX 1080 Ti: 1854
    • Mac Metal. AMD Radeon Pro 560X: 246
    • iOS Metal. A11 GPU (iPhone X): 46.6, A9 GPU (iPhone SE): 19.8

A lot of stuff in the implementation is totally suboptimal or using the tech in a "wrong" way. I know it's just a simple toy, ok :)

Building

  • C++ projects:
    • Windows (Visual Studio 2017) in Cpp/Windows/TestCpu.sln. DX11 Win32 app that displays result as a fullscreen CPU-updated or GPU-rendered texture.
    • Mac/iOS (Xcode 10) in Cpp/Apple/Test.xcodeproj. Metal app that displays result as a fullscreen CPU-updated or GPU-rendered texture. Should work on both Mac (Test Mac target) and iOS (Test iOS target).
    • WebAssembly in Cpp/Emscripten/build.sh. CPU, single threaded, no SIMD.
  • C# project in Cs/TestCs.sln. A command line app that renders some frames and dumps out final TGA screenshot at the end.
  • Unity project in Unity. I used Unity 2020.3.8.

Comments
  • How do you run/update the numbers on Mono?

    Hello Aras,

    How are you running/updating the numbers on Mono, since there is no MathF there yet?

    I did a port from MathF to the DesktopCLR API, which is supported in Mono, and put my changes here:

    https://github.com/migueldeicaza/ToyPathTracer/tree/desktop-clr

    The odd thing is that on my system, .NET Core 2.1.4 seems slower; perhaps we have different versions?

    $ dotnet --version
    2.1.4
    $ mono --version
    Mono JIT compiler version 5.13.0 (master/4723e6603e6 Sun Apr  1 21:34:34 EDT 2018)
    Copyright (C) 2002-2014 Novell, Inc, Xamarin Inc and Contributors. www.mono-project.com
    	TLS:           normal
    	SIGSEGV:       altstack
    	Notification:  kqueue
    	Architecture:  amd64
    	Disabled:      none
    	Misc:          softdebug 
    	Interpreter:   yes
    	LLVM:          supported, not enabled.
    	GC:            sgen (concurrent by default)
    $ dotnet run
    2752.30ms 4.2Mrays/s 11.47Mrays/frame frames 1
    ^c
    $ mono demo.exe
    1819.82ms 6.3Mrays/s 11.47Mrays/frame frames 1
    ^c
    $ mono -O=float32 demo.exe
    1462.56ms 7.8Mrays/s 11.47Mrays/frame frames 1
    ^c
    $
    

    That said, to make it more apples to apples, I just submitted a pull request to Mono to get MathF:

    https://github.com/mono/mono/pull/7941

    With that patch, I can compare apples of the same species, and now I get:

    $ dotnet run
    2876.87ms 4.1Mrays/s 11.76Mrays/frame frames 1
    2874.08ms 4.1Mrays/s 11.76Mrays/frame frames 2
    2876.00ms 4.1Mrays/s 11.76Mrays/frame frames 3
    2860.35ms 4.1Mrays/s 11.76Mrays/frame frames 4
    2852.96ms 4.1Mrays/s 11.75Mrays/frame frames 5
    2842.87ms 4.1Mrays/s 11.76Mrays/frame frames 6
    2840.46ms 4.1Mrays/s 11.76Mrays/frame frames 7
    2835.01ms 4.1Mrays/s 11.76Mrays/frame frames 8
    2829.51ms 4.2Mrays/s 11.76Mrays/frame frames 9
    2829.30ms 4.2Mrays/s 11.75Mrays/frame frames 10
    2833.15ms 4.2Mrays/s 11.76Mrays/frame frames 11
    2831.47ms 4.2Mrays/s 11.76Mrays/frame frames 12
    2841.67ms 4.1Mrays/s 11.76Mrays/frame frames 13
    2851.48ms 4.1Mrays/s 11.76Mrays/frame frames 14
    2857.88ms 4.1Mrays/s 11.76Mrays/frame frames 15
    2881.28ms 4.1Mrays/s 11.75Mrays/frame frames 16
    2888.21ms 4.1Mrays/s 11.76Mrays/frame frames 17
    2894.16ms 4.1Mrays/s 11.76Mrays/frame frames 18
    2896.44ms 4.1Mrays/s 11.75Mrays/frame frames 19
    2904.69ms 4.0Mrays/s 11.76Mrays/frame frames 20
    2937.43ms 4.0Mrays/s 11.76Mrays/frame frames 21
    2936.99ms 4.0Mrays/s 11.76Mrays/frame frames 22
    2937.48ms 4.0Mrays/s 11.76Mrays/frame frames 23
    2935.11ms 4.0Mrays/s 11.76Mrays/frame frames 24
    2933.07ms 4.0Mrays/s 11.76Mrays/frame frames 25
    2932.50ms 4.0Mrays/s 11.76Mrays/frame frames 26
    2928.50ms 4.0Mrays/s 11.76Mrays/frame frames 27
    2928.81ms 4.0Mrays/s 11.76Mrays/frame frames 28
    2927.48ms 4.0Mrays/s 11.76Mrays/frame frames 29
    2925.44ms 4.0Mrays/s 11.76Mrays/frame frames 30
    $ mono mathf.exe
    1815.47ms 6.5Mrays/s 11.76Mrays/frame frames 1
    1825.87ms 6.4Mrays/s 11.76Mrays/frame frames 2
    1813.91ms 6.5Mrays/s 11.76Mrays/frame frames 3
    1836.47ms 6.4Mrays/s 11.76Mrays/frame frames 4
    1849.84ms 6.4Mrays/s 11.75Mrays/frame frames 5
    1843.00ms 6.4Mrays/s 11.76Mrays/frame frames 6
    1870.65ms 6.3Mrays/s 11.76Mrays/frame frames 7
    1873.14ms 6.3Mrays/s 11.76Mrays/frame frames 8
    1871.27ms 6.3Mrays/s 11.76Mrays/frame frames 9
    1873.10ms 6.3Mrays/s 11.75Mrays/frame frames 10
    1871.02ms 6.3Mrays/s 11.76Mrays/frame frames 11
    1868.86ms 6.3Mrays/s 11.76Mrays/frame frames 12
    1870.36ms 6.3Mrays/s 11.76Mrays/frame frames 13
    1872.45ms 6.3Mrays/s 11.76Mrays/frame frames 14
    1871.38ms 6.3Mrays/s 11.76Mrays/frame frames 15
    1870.67ms 6.3Mrays/s 11.76Mrays/frame frames 16
    1873.83ms 6.3Mrays/s 11.76Mrays/frame frames 17
    1876.38ms 6.3Mrays/s 11.76Mrays/frame frames 18
    1878.16ms 6.3Mrays/s 11.75Mrays/frame frames 19
    1879.80ms 6.3Mrays/s 11.76Mrays/frame frames 20
    1880.18ms 6.3Mrays/s 11.76Mrays/frame frames 21
    1882.34ms 6.2Mrays/s 11.76Mrays/frame frames 22
    1878.91ms 6.3Mrays/s 11.76Mrays/frame frames 23
    1880.97ms 6.3Mrays/s 11.76Mrays/frame frames 24
    1879.36ms 6.3Mrays/s 11.76Mrays/frame frames 25
    1879.97ms 6.3Mrays/s 11.76Mrays/frame frames 26
    1878.91ms 6.3Mrays/s 11.76Mrays/frame frames 27
    1878.21ms 6.3Mrays/s 11.76Mrays/frame frames 28
    1879.25ms 6.3Mrays/s 11.76Mrays/frame frames 29
    1879.26ms 6.3Mrays/s 11.76Mrays/frame frames 30
    $ mono -O=float32 mathf.exe
    1633.95ms 7.2Mrays/s 11.76Mrays/frame frames 1
    1545.29ms 7.6Mrays/s 11.76Mrays/frame frames 2
    1509.87ms 7.8Mrays/s 11.76Mrays/frame frames 3
    1550.70ms 7.6Mrays/s 11.76Mrays/frame frames 4
    1565.38ms 7.5Mrays/s 11.75Mrays/frame frames 5
    1551.46ms 7.6Mrays/s 11.76Mrays/frame frames 6
    1567.24ms 7.5Mrays/s 11.76Mrays/frame frames 7
    1579.61ms 7.4Mrays/s 11.76Mrays/frame frames 8
    1565.49ms 7.5Mrays/s 11.76Mrays/frame frames 9
    1558.81ms 7.5Mrays/s 11.75Mrays/frame frames 10
    1570.96ms 7.5Mrays/s 11.76Mrays/frame frames 11
    1583.83ms 7.4Mrays/s 11.76Mrays/frame frames 12
    1587.42ms 7.4Mrays/s 11.76Mrays/frame frames 13
    1590.42ms 7.4Mrays/s 11.76Mrays/frame frames 14
    1596.51ms 7.4Mrays/s 11.76Mrays/frame frames 15
    1596.27ms 7.4Mrays/s 11.75Mrays/frame frames 16
    1591.59ms 7.4Mrays/s 11.76Mrays/frame frames 17
    1591.84ms 7.4Mrays/s 11.76Mrays/frame frames 18
    1585.52ms 7.4Mrays/s 11.75Mrays/frame frames 19
    1578.45ms 7.4Mrays/s 11.76Mrays/frame frames 20
    1575.85ms 7.5Mrays/s 11.76Mrays/frame frames 21
    1571.86ms 7.5Mrays/s 11.76Mrays/frame frames 22
    1568.69ms 7.5Mrays/s 11.76Mrays/frame frames 23
    1565.06ms 7.5Mrays/s 11.76Mrays/frame frames 24
    1562.82ms 7.5Mrays/s 11.76Mrays/frame frames 25
    1560.53ms 7.5Mrays/s 11.76Mrays/frame frames 26
    1558.02ms 7.5Mrays/s 11.76Mrays/frame frames 27
    1555.59ms 7.6Mrays/s 11.76Mrays/frame frames 28
    1554.55ms 7.6Mrays/s 11.76Mrays/frame frames 29
    1552.10ms 7.6Mrays/s 11.76Mrays/frame frames 30
    
  • Cosine distribution

    Hi Aras,

    For Lambertian scattering we have this line: float3 target = rec.pos + rec.normal + RandomInUnitSphere(); and then we normalize target.

    The result of RandomInUnitSphere() should also be normalized; otherwise we won't get a cosine distribution for the target direction. I believe Peter Shirley mentioned a potential bug in the image generation in the third book; it could be because of this.
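
    For reference, a minimal sketch of the suggested change (assuming float3, normalize and RandomInUnitSphere helpers as in the C++ source; the RNG state argument is omitted). Normalizing the random offset places it on the unit sphere around the hit point, which makes the scattered direction cosine-distributed:

    // sketch only: normalize the random point before adding it to the normal
    float3 target = rec.pos + rec.normal + normalize(RandomInUnitSphere());
    float3 scatterDir = normalize(target - rec.pos);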

  • Am I missing an assembly reference?

    Even though Burst is installed in the package manager, something is missing: `/Users/ibicha/Library/Unity/cache/packages/staging-packages.unity.com/[email protected].2.3/Editor/BurstReflection.cs(19,67): error CS0246: The type or namespace name 'AssembliesType' could not be found. Are you missing an assembly reference?` (Yes, this is probably related to the package manager, not the path tracer itself, I guess.)

  • Add support for .NET CoreRT (AOT) and net461 (Desktop/Mono)

    Hey, just a small PR to add support for a preview of .NET CoreRT AOT. On my machine it gives slightly better performance (about +5%) than .NET CoreCLR and outputs a single exe file (not tiny though, still 4 MB!).

  • Add thread support to Emscripten version

    Results with Chrome 70 in Mray/s:

    | OS         | CPU                           | single-thread | multi-thread     |
    |------------|-------------------------------|---------------|------------------|
    | Windows 10 | Intel Core i7-7700, 3.6GHz    | 4.8           | 19.5 (8 threads) |
    | ArchLinux  | 2x Intel Xeon E5-2698, 2.3GHz | 4.1           | ~37 (64 threads) |

    Requires enabling thread support in chrome://flags ("WebAssembly threads support").

    The threaded implementation does a somewhat ugly copy on every frame to display the image in the Canvas; I could not figure out how to create ImageData from a SharedArrayBuffer.

    You will probably want to set the PTHREAD_POOL_SIZE variable to a reasonable number of threads if deploying this publicly. Currently it uses the CPU core count of the machine where it is being built as the thread count.
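
    For context, a minimal sketch (not the actual build.sh/worker setup) of how per-frame work might be split across std::thread workers; under Emscripten each std::thread is backed by a Web Worker taken from the pthread pool, so PTHREAD_POOL_SIZE should be at least the number of threads spawned here. TraceRows is a hypothetical per-thread kernel name:

    #include <algorithm>
    #include <thread>
    #include <vector>

    void TraceRows(int firstRow, int lastRow); // hypothetical: traces rows [firstRow, lastRow)

    void TraceFrame(int height)
    {
        // one worker per logical core (hardware_concurrency may report 0)
        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        int rowsPerThread = (height + (int)n - 1) / (int)n;
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < n; ++i)
        {
            int first = (int)i * rowsPerThread;
            int last = std::min(height, first + rowsPerThread);
            if (first >= last)
                break;
            workers.emplace_back(TraceRows, first, last);
        }
        for (auto& t : workers)
            t.join();
    }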

  • C++: SSE intrinsics for HitSpheres, switch to larger scene

    Switch to a larger scene: (9 spheres, 1 light) -> (46 spheres, 2 lights).

    SSE intrinsics implementation of HitSpheres (a rough sketch of the idea follows the numbers below):

    • PC 79 -> 107 Mray/s
    • Mac 17 -> 30 Mray/s
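
    For illustration, a rough sketch (an assumption about the general shape, not the actual HitSpheres code) of the 4-wide idea: with sphere centers and squared radii stored in SoA arrays, one ray can be tested against four spheres per SSE iteration. It only handles the near root of the quadratic and skips the "ray origin inside sphere" case for brevity:

    #include <xmmintrin.h>

    struct SpheresSoA
    {
        const float* centerX;  // sphere centers, one component per array
        const float* centerY;
        const float* centerZ;
        const float* sqRadius; // radius * radius
    };

    // Tests one ray (origin ro*, normalized direction rd*, splatted into SSE
    // registers) against spheres [i, i+4); returns a 4-bit hit mask and the
    // candidate t values. Assumes the arrays are padded to a multiple of 4.
    static inline int HitSpheres4(const SpheresSoA& s, int i,
                                  __m128 rox, __m128 roy, __m128 roz,
                                  __m128 rdx, __m128 rdy, __m128 rdz,
                                  __m128 tMin, __m128 tMax, __m128* outT)
    {
        // co = center - rayOrigin
        __m128 cox = _mm_sub_ps(_mm_loadu_ps(s.centerX + i), rox);
        __m128 coy = _mm_sub_ps(_mm_loadu_ps(s.centerY + i), roy);
        __m128 coz = _mm_sub_ps(_mm_loadu_ps(s.centerZ + i), roz);
        // b = dot(co, rayDir)
        __m128 b = _mm_add_ps(_mm_mul_ps(cox, rdx),
                   _mm_add_ps(_mm_mul_ps(coy, rdy), _mm_mul_ps(coz, rdz)));
        // c = dot(co, co) - radius^2
        __m128 c = _mm_sub_ps(_mm_add_ps(_mm_mul_ps(cox, cox),
                   _mm_add_ps(_mm_mul_ps(coy, coy), _mm_mul_ps(coz, coz))),
                   _mm_loadu_ps(s.sqRadius + i));
        // discriminant = b^2 - c; near root t = b - sqrt(discriminant)
        __m128 disc = _mm_sub_ps(_mm_mul_ps(b, b), c);
        __m128 hitMask = _mm_cmpgt_ps(disc, _mm_setzero_ps());
        __m128 t = _mm_sub_ps(b, _mm_sqrt_ps(_mm_max_ps(disc, _mm_setzero_ps())));
        hitMask = _mm_and_ps(hitMask, _mm_and_ps(_mm_cmpgt_ps(t, tMin),
                                                 _mm_cmplt_ps(t, tMax)));
        *outT = t;
        return _mm_movemask_ps(hitMask);
    }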
  • C++: Simple SIMD / SoA experiment

    • Let's try to use SSE for the float3 struct, i.e. the "baby's first SIMD" approach :)
    • Play around with SoA layout for all sphere data
    • Play around with compiler settings wrt floating point math

    PC: 135 -> 186 Mray/s
    Mac: 38.7 -> 49.8 Mray/s

    The largest win is not from the SSE float3, but from: 1) the SoA layout of the sphere data and 2) the fp:fast setting on MSVC.
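
    For reference, a minimal sketch (an illustration of the general idea, not the project's actual float3) of the "baby's first SIMD" approach: keep the three components in one SSE register (fourth lane unused) so component-wise adds and multiplies become single instructions; dot products and the like still need shuffles, which is one reason the win from this alone is modest:

    #include <xmmintrin.h>

    struct float3
    {
        __m128 m; // x, y, z in the low three lanes; fourth lane unused

        float3() : m(_mm_setzero_ps()) {}
        float3(float x, float y, float z) : m(_mm_set_ps(0.0f, z, y, x)) {}
        explicit float3(__m128 v) : m(v) {}
    };

    inline float3 operator+(float3 a, float3 b) { return float3(_mm_add_ps(a.m, b.m)); }
    inline float3 operator-(float3 a, float3 b) { return float3(_mm_sub_ps(a.m, b.m)); }
    inline float3 operator*(float3 a, float b)  { return float3(_mm_mul_ps(a.m, _mm_set1_ps(b))); }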

  • GPU: naïve D3D11 implementation

    Similarly to #5, a trivial impl for D3D11/HLSL.

    AMD ThreadRipper, 3.4GHz, no SMT (C++ CPU 135 Mray/s):

    • GeForce GTX 1080 Ti: 2780 Mray/s
    • Radeon Pro WX 9100: 3700 Mray/s
    • Radeon HD 7700: 417 Mray/s
  • GPU: naïve Metal implementation

    Trivially "direct" port of C++ implementation to Apple Metal, as a compute shader. It's not optimal for GPU at all, to be optimized at some later time!

    MacBookPro, late 2013, 2.3GHz (C++ CPU 38.4 Mray/s):

    • Intel Iris Pro: 204 Mray/s
    • GeForce GT 750M: 157 Mray/s

    iMac 5K, 2017, 3.8GHz (C++ CPU 59.0 Mray/s):

    • Radeon Pro 580: 1650 Mray/s
  • GPU Tracer

    Hi Aras, I read your post about adapting the ray_color function of P. Shirley to the GPU.

    I thought implementing it in the following manner also fits the underlying equation:

    vec3 ray_color(in Ray r, in Scene scene, in vec3 background, int depth) {
      //
      // partly taken from here
      // http://aras-p.info/blog/2018/04/03/Daily-Pathtracer-Part-5-Metal-GPU/
      // by Aras Pranckevičius
      Ray r_in;
      r_in.origin = r.origin;
      r_in.direction = r.direction;
      vec3 result = vec3(0); // radiance accumulated along the path so far
      vec3 color = vec3(1);  // throughput: product of attenuations so far
    
      while (true) {
        if (depth <= 0) {
          // bounce limit reached: the recursive form returns black here, so
          // keep the emission already accumulated instead of discarding it
          return result;
        }
        HitRecord rec;
        if (hit(scene, r_in, 0.001, INFINITY, rec)) {
          Ray r_out;
          vec3 atten;
          vec3 emitColor = emit(rec.mat_ptr, rec.u, rec.v, rec.point);
          if (scatter(rec.mat_ptr, r_in, rec, atten, r_out)) {
            // add emission weighted by the current throughput, then continue the path
            r_in = r_out;
            result += (color * emitColor);
            color *= atten;
            depth--;
          } else {
            // absorbed: add emission and terminate
            result += (color * emitColor);
            return result;
          }
        } else {
          // missed everything: add background weighted by the throughput
          result += (color * background);
          return result;
        }
      }
    }
    

    I thought I was being smart, but then I met this guy.
