A demonstration of various different techniques for implementing 'threaded code,' a technique used in Forth and in virtual machines like the JVM.

Threaded code is a technique used in the implementation of virtual machines (VMs). It avoids the overhead of calling subroutines repeatedly by 'threading' them together with jumps. This was originally done in assembly, where it is the most straightforward to implement. In C, it is harder to implement threaded code that is also performant.

These files are a test of three different techniques for produced fast threaded code. Each .c file implements a simple VM that counts to 2 billion, printing a star whenever the loop counter is evenly divisible by 200 million. The test settings can be changed in threaded_code.h. They all can be compiled and run with 'make benchmark'.

The VMs

No threading:

Doesn't use threaded code. This technique is often used to emulate switch statements in languages like Python. It is going to be the slowest of any of the VMs, and so acts a a control.

Computed goto:

The equivalent in Fortran is assigned goto, but the Python VM calls it computed goto. GCC, Clang, and ICC allow for the location of a label to be stored in a variable. Since goto can only be used in local scope, the code implementing the VM instructions has to fit into one large function.

Continuation-passing:

This is a pretty unusual technique to find in a C program, so I'll spend more time on it. It starts by calling a function, but then it never returns from it. That function ends by calling another function, which also ends by calling another function, and—

This is a recursive technique. As such, it would only be able to keep calling new functions a few hundred or thousand times before it causes a stack overflow. However, GCC, Clang, and ICC have support for tail call optimization, which replaces those function calls with jumps. This means that it's more or less equivelant to computed goto at an assembly level.

Switch statement:

This is the traditional way to implement a VM in C. It is similar in structure to computed goto, but is regarded as slower due to several factors. One is that it requires branching, another is that C performs bounds checking on switch statements. On the other hand, switch statements are used so often that compiler writers focus on making them faster.

For more thorough explanations, see https://www.complang.tuwien.ac.at/forth/threaded-code.html.

So, how do these examples compare to each other?

Test results:

I tested these VMs on a desktop computer running Debian with an Intel Core i3-3220 CPU (Ivy Bridge) at 3.30GHz. Each was run 5 times, with their times to complete in seconds being averaged. This was then performed again at different optimization levels.

Average time to complete execution (seconds), based on the compiler and optimization level

Clang 7.0.1 -O0 -O1 -O2 -O3 -Ofast
No threading (control) 32.2s 18.4 16.4 16.2 45.8
Computed goto 18.8s 5.4 5.4 5.2 5.4
Continuation passing * 7.8 5.6 5.8 4.6
Switch statement 22.0s 13.4 7.2 6.6 6.6
GCC 8.3.0 -O0 -O1 -O2 -O3 -Ofast
No threading (control) 30.4 17.8 33.2 14.8 74.2
Computed goto 15.8 6.8 4.8 4.8 5.0
Continuation passing * 9.0 6.2 6.2 6.2
Switch statement 22.0 13.6 12.8 24.6 24.2
ICC 2021.5.0 -O0 -O1 -O2 -O3 -Ofast
No threading (control) 39.0 20.4 19.8 20.4 21.0
Computed goto 23.8 5.2 5.4 5.2 5.2
Continuation passing * 8.8 6.4 6.4 6.6
Switch statement 26.4 15.4 18.4 18.4 17.6

*Requires -foptimize-sibling-calls and -Og at the minimum to be set. Otherwise, a stack overflow will result.

Discussion

Only one of these techniques is guaranteed to work in all compilers, the switch statement. However, projects like the Python VM have started using this option only as a fallback, preferring computed goto instead. It's easy to see why, it's the fastest in almost every test.

Of the techniques, I'd say the continuation-passing one was my favorite to write. It's much closer to a normal program in structure than the computed goto or switch statement example, and you could split the code implementing the VM into separate files without having to rely on macros to stitch them back together. It's only slightly less performant than computed goto, while being more in line with structured programming.

That said, clang did shockingly well at optimizing that switch statement. I'm also on rather old hardware at this point. Give a few more years for other compilers to catch up, and alternatives to the switch statement may not even matter.

Similar Resources

`lv_lib_100ask` is a reference for various out of the box schemes based on lvgl library or an enhanced interface for various components of lvgl library.

Introduction lv_lib_100ask is a reference for various out of the box schemes based on lvgl library or an enhanced interface for various components of

Dec 15, 2022

Professor Terence Parr has taught us how to create a virtual machine Now it is time to pwn virtual machine

My First real world CTF Simple Virtual Machine Challenge description Professor Terence Parr has taught us how to create a virtual machine Now it is ti

Feb 17, 2022

chia-plotter (pipelined multi-threaded)

chia-plotter (pipelined multi-threaded) This is a new implementation of a chia plotter which is designed as a processing pipeline, similar to how GPUs

Dec 31, 2022

Firmware update for XeniumOS used on Xenium and OpenXenium modchips to provide software fixes and various improvements.

Firmware update for XeniumOS used on Xenium and OpenXenium modchips to provide software fixes and various improvements.

Firmware update for XeniumOS used on Xenium and OpenXenium modchips to provide software fixes and various improvements. About • Features • Installatio

Dec 9, 2022

New lateral movement technique by abusing Windows Perception Simulation Service to achieve DLL hijacking code execution.

New lateral movement technique by abusing Windows Perception Simulation Service to achieve DLL hijacking code execution.

BOF - Lateral movement technique by abusing Windows Perception Simulation Service to achieve DLL hijacking ServiceMove is a POC code for an interestin

Nov 14, 2022

This is an experimental OS-from-scratch project. Just for demonstration, not useful at all.

This is an experimental OS-from-scratch project. Just for demonstration, not useful at all.

OS Playground This is an experimental OS-from-scratch project. Just for demonstration, not useful at all. Different from OS in other projects, this OS

Nov 5, 2022

ADTW demonstration application - implemented with EAP

Amerced DTW demonstration application Amerced DTW (ADTW) is a variant of DTW, "amercing" ("penalizing") warping steps by a fixed penalty. This is a NN

Nov 26, 2021

A demonstration PoC for CVE-2022-21877 (storage spaces controller memory leak)

A demonstration PoC for CVE-2022-21877 (storage spaces controller memory leak)

POC CVE-2022-21877 This repository contains a POC for the CVE-2022-21877, found by Quang Linh, working at STAR Labs. This is an information leak found

Mar 8, 2022

Flavors are typically used to build your app for different environments such as dev and prod.

Flavors are typically used to build your app for different environments such as dev and prod.

Flutter flavor example Flavors are typically used to build your app for different environments such as dev and prod. A sample flutter project to showc

Feb 11, 2022
Comments
  • The reasons continuation-passing fared worse, analysed

    The reasons continuation-passing fared worse, analysed

    When I saw this on HN, I got immediately hooked as I am a big proponent of implementing interpreters in CP-style. For such a trivial interpreter, there must not be any difference, yet CP-style performed significantly worse.

    I've compiled the sources with clang 10.0.0 at -O3. Got similar results.

    Looking at the disassembly of computed_goto.c, some interesting quirks are immediately apparent:

    print_i:
        if (i % PRINT_INTERVAL == 0) {
            printf("* ");
            fflush(stdout);
        }
        goto *dispatcher[prog[++ip]];
    
        testl   %r13d, %r13d
        je  .LBB0_4
    # %bb.5:                                #   in Loop: Header=BB0_3 Depth=1
        addl    $1, %ebx
        leaq    (%r14,%rbx,4), %rax
        movl    (%rax), %eax
        jmpq    *run_program.dispatcher(,%rax,8)
    

    The compiler was able to figure out that the condition is most likely false. Also modulo is computed somewhere else, presumably reducing latency. r13d is zero if computing modulo would yield 0.

    branch:
        if (i == MAX_ITERATIONS)
            ip += 1;
        goto *dispatcher[prog[++ip]];
    
        addl    %r12d, %ebx
        addl    $1, %ebx
        leaq    (%r14,%rbx,4), %rax
        movl    (%rax), %eax
        jmpq    *run_program.dispatcher(,%rax,8)
    

    Interestingly, there are no branches whatsoever! It's easy to infer that %ebx holds ip. So it doesip += %r12d + 1. Presumably, %r12d holds either 0 or 1.

    To summarise, it looks like the compiler introduced two temporary variables

    %r13d = i % PRINT_INTERVAL == 0
    %r12d = i == MAX_ITERATIONS
    

    The following excerpt proves our findings:

    incr_i:
        i += 1;
        goto *dispatcher[prog[++ip]];
    
        addl    $1, %r15d   # <-------------- ++i
        addl    $1, %ebx
        movl    %ebx, %eax
        leaq    (%r14,%rax,4), %rax
        xorl    %r12d, %r12d
        cmpl    $2000000000, %r15d      # imm = 0x77359400
        sete    %r12b    # <-------------------------- %r12d = i == MAX_ITERATIONS
        movl    %r15d, %ecx
        imulq   $1441151881, %rcx, %rcx # imm = 0x55E63B89
        shrq    $58, %rcx   # <---------- %rcx = i / PRINT_INTERVAL
        imull   $200000000, %ecx, %ecx  # imm = 0xBEBC200
        movl    %r15d, %r13d
        subl    %ecx, %r13d  # <---------- %r13d = i - i / PRINT_INTERVAL
        movl    (%rax), %eax
        jmpq    *run_program.dispatcher(,%rax,8)
    

    We can see that the compiler is quite sophisticated. The reasons this transformation is beneficial are quite apparent: fewer branches, latency reduction. I must mention that this optimisation opportunity is specific to the toy interpreter under the benchmark and is typically not applicable to more complex interpreters.

    CP-style implementation has branches in print_i and branch_i as expected. Even though they should be accurately predicted, for some reasons they are not. Providing hints to the compiler makes both implementation exhibit the same performance on my compiler/CPU:

    diff --git a/continuation_passing.c b/continuation_passing.c
    index 2272860..647cdf9 100644
    --- a/continuation_passing.c
    +++ b/continuation_passing.c
    @@ -26,7 +26,7 @@ void incr_i(unsigned ip, unsigned i, program prog)
    
     void print_i(unsigned ip, unsigned i, program prog)
     {
    -       if (i % PRINT_INTERVAL == 0) {
    +       if (__builtin_expect(i % PRINT_INTERVAL == 0, 0)) {
                    printf("* ");
                    fflush(stdout);
            }
    @@ -35,7 +35,7 @@ void print_i(unsigned ip, unsigned i, program prog)
    
     void branch(unsigned ip, unsigned i, program prog)
     {
    -       if (i == MAX_ITERATIONS)
    +       if (__builtin_expect(i == MAX_ITERATIONS, 0))
                    ip += 1;
            next(ip, i, prog);
     }
    
  • It would be nice to benchmark more bytecode sequences with different properties

    It would be nice to benchmark more bytecode sequences with different properties

    The current bytecode sequence has an important property: no instruction appears twice. Hence for each instruction handler the next instruction handler to dispatch to is perfectly predictable. It amplifies the advantage of implementations that replicate instruction dispatch.

    In most if not all VMs the next instruction handler to dispatch to is seldom perfectly predictable. It would be nice to benchmark a more complex instruction sequence to get a more realistic estimate.

  • Benchmark should use a time source with sub-second resolution

    Benchmark should use a time source with sub-second resolution

    The benchmark uses time() from time.h to measure elapsed time. It measures time with 1 second resolution. As certain runs complete in 6 seconds or less, the not so fine grained time source introduces significant inaccuracy.

rdtsc x86 instruction to detect virtual machines
rdtsc x86 instruction to detect virtual machines

rdtsc_detector rdtsc x86 instruction to detect virtual machines What is rdtsc? The Time Stamp Counter (TSC) is a 64-bit register present on all x86 pr

Apr 29, 2022
This project aims to provide a framework and a solid implementation of different techniques
This project aims to provide a framework and a solid implementation of different techniques

This project aims to provide a framework and a solid implementation of different techniques for generating complete seamless procedural cities with interiors for all buildings.

Dec 27, 2022
A Windows user-mode shellcode execution tool that demonstrates various techniques that malware uses
A Windows user-mode shellcode execution tool that demonstrates various techniques that malware uses

Jektor Toolkit v1.0 This utility focuses on shellcode injection techniques to demonstrate methods that malware may use to execute shellcode on a victi

Sep 5, 2022
A tiny Forth I built in a week.
A tiny Forth I built in a week.

( Wrote a blog post about this here ) It was raining hard, a week ago. And what could you possibly do on a rainy Saturday afternoon? Well... You can m

Dec 5, 2022
A C++ implementation of the Forth programming language

onward A C++ implementation of the Forth programming language Build # Clone repository git clone https://github.com/tetsuo-cpp/onward.git cd onward/

Mar 6, 2022
Research tool able to detect and mitigate evasion techniques used by malware in-the-wild

JuanLesPIN IntelPin tool to detect and mitigate Windows malware evasion techniques. This tool is a prototype developed for a research project whose pa

Dec 30, 2022
jvm-monitor is a lightweight monitoring tool that logs all the local variables whenever exceptions occur.

jvm-monitor jvm-monitor is a Java agent attached to a Java VM (virtual machine), which logs all the local variables when exceptions occur. Rationales

Nov 21, 2021
Simple native jvm class dumper written in C by hook ClassLoader
Simple native jvm class dumper written in C by hook ClassLoader

JVM Native Class Dumper Simple native jvm class dumper written in C by hook ClassLoader What is used for? This tool allows you to dump all java classe

Nov 7, 2022
A cross platform shader language with multi-threaded offline compilation or platform shader source code generation
A cross platform shader language with multi-threaded offline compilation or platform shader source code generation

A cross platform shader language with multi-threaded offline compilation or platform shader source code generation. Output json reflection info and c++ header with your shaders structs, fx-like techniques and compile time branch evaluation via (uber-shader) "permutations".

Dec 14, 2022
x64 Windows kernel code execution via user-mode, arbitrary syscall, vulnerable IOCTLs demonstration
x64 Windows kernel code execution via user-mode, arbitrary syscall, vulnerable IOCTLs demonstration

anycall x64 Windows kernel code execution via user-mode, arbitrary syscall, vulnerable IOCTLs demonstration Read: https://www.godeye.club/2021/05/14/0

Dec 30, 2022