A microbenchmark support library

Benchmark

A library to benchmark code snippets, similar to unit tests. Example:

#include <benchmark/benchmark.h>

static void BM_SomeFunction(benchmark::State& state) {
  // Perform setup here
  for (auto _ : state) {
    // This code gets timed
    SomeFunction();
  }
}
// Register the function as a benchmark
BENCHMARK(BM_SomeFunction);
// Run the benchmark
BENCHMARK_MAIN();

To get started, see Requirements and Installation. See Usage for a full example and the User Guide for a more comprehensive feature overview.

It may also help to read the Google Test documentation as some of the structural aspects of the APIs are similar.

Resources

Discussion group

IRC channel: freenode #googlebenchmark

Additional Tooling Documentation

Assembly Testing Documentation

Requirements

The library can be used with C++03. However, it requires C++11 to build, including compiler and standard library support.

The following minimum versions are required to build the library:

  • GCC 4.8
  • Clang 3.4
  • Visual Studio 14 2015
  • Intel 2015 Update 1

See Platform-Specific Build Instructions.

Installation

This describes the installation process using cmake. As pre-requisites, you'll need git and cmake installed.

See dependencies.md for more details regarding supported versions of build tools.

# Check out the library.
$ git clone https://github.com/google/benchmark.git
# Benchmark requires Google Test as a dependency. Add the source tree as a subdirectory.
$ git clone https://github.com/google/googletest.git benchmark/googletest
# Go to the library root directory
$ cd benchmark
# Make a build directory to place the build output.
$ cmake -E make_directory "build"
# Generate build system files with cmake.
$ cmake -E chdir "build" cmake -DCMAKE_BUILD_TYPE=Release ../
# or, starting with CMake 3.13, use a simpler form:
# cmake -DCMAKE_BUILD_TYPE=Release -S . -B "build"
# Build the library.
$ cmake --build "build" --config Release

This builds the benchmark and benchmark_main libraries and tests. On a unix system, the build directory should now look something like this:

/benchmark
  /build
    /src
      /libbenchmark.a
      /libbenchmark_main.a
    /test
      ...

Next, you can run the tests to check the build.

$ cmake -E chdir "build" ctest --build-config Release

If you want to install the library globally, also run:

sudo cmake --build "build" --config Release --target install

Note that Google Benchmark requires Google Test to build and run the tests. This dependency can be provided in two ways:

  • Check out the Google Test sources into benchmark/googletest as above.
  • Otherwise, if -DBENCHMARK_DOWNLOAD_DEPENDENCIES=ON is specified during configuration, the library will automatically download and build any required dependencies.

If you do not wish to build and run the tests, add -DBENCHMARK_ENABLE_GTEST_TESTS=OFF to CMAKE_ARGS.

Debug vs Release

By default, benchmark builds as a debug library. You will see a warning in the output when this is the case. To build it as a release library instead, add -DCMAKE_BUILD_TYPE=Release when generating the build system files, as shown above. The use of --config Release in build commands is needed to properly support multi-configuration tools (like Visual Studio for example) and can be skipped for other build systems (like Makefile).

To enable link-time optimisation, also add -DBENCHMARK_ENABLE_LTO=true when generating the build system files.

If you are using gcc, you might need to set GCC_AR and GCC_RANLIB cmake cache variables, if autodetection fails.

If you are using clang, you may need to set LLVMAR_EXECUTABLE, LLVMNM_EXECUTABLE and LLVMRANLIB_EXECUTABLE cmake cache variables.

Stable and Experimental Library Versions

The main branch contains the latest stable version of the benchmarking library; the API of which can be considered largely stable, with source breaking changes being made only upon the release of a new major version.

Newer, experimental, features are implemented and tested on the v2 branch. Users who wish to use, test, and provide feedback on the new features are encouraged to try this branch. However, this branch provides no stability guarantees and reserves the right to change and break the API at any time.

Usage

Basic usage

Define a function that executes the code to measure, register it as a benchmark function using the BENCHMARK macro, and ensure an appropriate main function is available:

#include <benchmark/benchmark.h>

static void BM_StringCreation(benchmark::State& state) {
  for (auto _ : state)
    std::string empty_string;
}
// Register the function as a benchmark
BENCHMARK(BM_StringCreation);

// Define another benchmark
static void BM_StringCopy(benchmark::State& state) {
  std::string x = "hello";
  for (auto _ : state)
    std::string copy(x);
}
BENCHMARK(BM_StringCopy);

BENCHMARK_MAIN();

To run the benchmark, compile and link against the benchmark library (libbenchmark.a/.so). If you followed the build steps above, this library will be under the build directory you created.

# Example on linux after running the build steps above. Assumes the
# `benchmark` and `build` directories are under the current directory.
$ g++ mybenchmark.cc -std=c++11 -isystem benchmark/include \
  -Lbenchmark/build/src -lbenchmark -lpthread -o mybenchmark

Alternatively, link against the benchmark_main library and remove BENCHMARK_MAIN(); above to get the same behavior.

The compiled executable will run all benchmarks by default. Pass the --help flag for option information or see the guide below.
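
If you need custom setup code to run before the benchmarks execute, you can write your own main instead of using BENCHMARK_MAIN(). The sketch below shows roughly what the macro provides; the exact expansion may differ between library versions.

#include <benchmark/benchmark.h>

int main(int argc, char** argv) {
  // Parse --benchmark_* flags and remove them from argv.
  ::benchmark::Initialize(&argc, argv);
  if (::benchmark::ReportUnrecognizedArguments(argc, argv)) return 1;
  // Run every benchmark registered with the BENCHMARK macros.
  ::benchmark::RunSpecifiedBenchmarks();
  return 0;
}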

Usage with CMake

If using CMake, it is recommended to link against the project-provided benchmark::benchmark and benchmark::benchmark_main targets using target_link_libraries. It is possible to use find_package to import an installed version of the library.

find_package(benchmark REQUIRED)

Alternatively, add_subdirectory will incorporate the library directly into your CMake project.

add_subdirectory(benchmark)

Either way, link to the library as follows.

target_link_libraries(MyTarget benchmark::benchmark)

Platform Specific Build Instructions

Building with GCC

When the library is built using GCC it is necessary to link with the pthread library due to how GCC implements std::thread. Failing to link to pthread will lead to runtime exceptions (unless you're using libc++), not linker errors. See issue #67 for more details. You can link to pthread by adding -pthread to your linker command. Note, you can also use -lpthread, but there are potential issues with ordering of command line parameters if you use that.

Building with Visual Studio 2015 or 2017

The shlwapi library (-lshlwapi) is required to support a call to CPUInfo which reads the registry. Either add shlwapi.lib under [ Configuration Properties > Linker > Input ], or use the following:

// Alternatively, can add libraries using linker options.
#ifdef _WIN32
#pragma comment ( lib, "Shlwapi.lib" )
#ifdef _DEBUG
#pragma comment ( lib, "benchmarkd.lib" )
#else
#pragma comment ( lib, "benchmark.lib" )
#endif
#endif

You can also use the graphical version of CMake:

  • Open the CMake GUI.
  • Under Where to build the binaries, use the source path plus build.
  • Set CMAKE_INSTALL_PREFIX to the source path plus install.
  • Click Configure, Generate, and Open Project.
  • If the build fails, try deleting the entire directory and starting again, or unticking options to build less.

Building with Intel 2015 Update 1 or Intel System Studio Update 4

See the instructions for building with Visual Studio. Once built, right-click on the solution and change the build to Intel.

Building on Solaris

If you're running benchmarks on Solaris, you'll want to link in the kstat library too (-lkstat).

User Guide

Command Line

Output Formats

Output Files

Running Benchmarks

Running a Subset of Benchmarks

Result Comparison

Library

Runtime and Reporting Considerations

Passing Arguments

Calculating Asymptotic Complexity

Templated Benchmarks

Fixtures

Custom Counters

Multithreaded Benchmarks

CPU Timers

Manual Timing

Setting the Time Unit

Preventing Optimization

Reporting Statistics

Custom Statistics

Using RegisterBenchmark

Exiting with an Error

A Faster KeepRunning Loop

Disabling CPU Frequency Scaling

Output Formats

The library supports multiple output formats. Use the --benchmark_format=<console|json|csv> flag (or set the BENCHMARK_FORMAT=<console|json|csv> environment variable) to set the format type. console is the default format.

The Console format is intended to be a human readable format. By default the format generates color output. Context is output on stderr and the tabular data on stdout. Example tabular output looks like:

Benchmark                               Time(ns)    CPU(ns) Iterations
----------------------------------------------------------------------
BM_SetInsert/1024/1                        28928      29349      23853  133.097kB/s   33.2742k items/s
BM_SetInsert/1024/8                        32065      32913      21375  949.487kB/s   237.372k items/s
BM_SetInsert/1024/10                       33157      33648      21431  1.13369MB/s   290.225k items/s

The JSON format outputs human readable json split into two top level attributes. The context attribute contains information about the run in general, including information about the CPU and the date. The benchmarks attribute contains a list of every benchmark run. Example json output looks like:

{
  "context": {
    "date": "2015/03/17-18:40:25",
    "num_cpus": 40,
    "mhz_per_cpu": 2801,
    "cpu_scaling_enabled": false,
    "build_type": "debug"
  },
  "benchmarks": [
    {
      "name": "BM_SetInsert/1024/1",
      "iterations": 94877,
      "real_time": 29275,
      "cpu_time": 29836,
      "bytes_per_second": 134066,
      "items_per_second": 33516
    },
    {
      "name": "BM_SetInsert/1024/8",
      "iterations": 21609,
      "real_time": 32317,
      "cpu_time": 32429,
      "bytes_per_second": 986770,
      "items_per_second": 246693
    },
    {
      "name": "BM_SetInsert/1024/10",
      "iterations": 21393,
      "real_time": 32724,
      "cpu_time": 33355,
      "bytes_per_second": 1199226,
      "items_per_second": 299807
    }
  ]
}

The CSV format outputs comma-separated values. The context is output on stderr and the CSV itself on stdout. Example CSV output looks like:

name,iterations,real_time,cpu_time,bytes_per_second,items_per_second,label
"BM_SetInsert/1024/1",65465,17890.7,8407.45,475768,118942,
"BM_SetInsert/1024/8",116606,18810.1,9766.64,3.27646e+06,819115,
"BM_SetInsert/1024/10",106365,17238.4,8421.53,4.74973e+06,1.18743e+06,

Output Files

Write benchmark results to a file with the --benchmark_out=<filename> option (or set BENCHMARK_OUT). Specify the output format with --benchmark_out_format={json|console|csv} (or set BENCHMARK_OUT_FORMAT={json|console|csv}). Note that the 'csv' reporter is deprecated and the saved .csv file is not parsable by csv parsers.

Specifying --benchmark_out does not suppress the console output.

Running Benchmarks

Benchmarks are executed by running the produced binaries. Benchmark binaries, by default, accept options that may be specified either through their command line interface or by setting environment variables before execution. For every --option_flag=<value> CLI switch, a corresponding environment variable OPTION_FLAG=<value> exists and is used as the default if set (CLI switches always prevail). A complete list of CLI options is available by running the benchmarks with the --help switch.

Running a Subset of Benchmarks

The --benchmark_filter=<regex> option (or BENCHMARK_FILTER=<regex> environment variable) can be used to only run the benchmarks that match the specified <regex>. For example:

$ ./run_benchmarks.x --benchmark_filter=BM_memcpy/32
Run on (1 X 2300 MHz CPU )
2016-06-25 19:34:24
Benchmark              Time           CPU Iterations
----------------------------------------------------
BM_memcpy/32          11 ns         11 ns   79545455
BM_memcpy/32k       2181 ns       2185 ns     324074
BM_memcpy/32          12 ns         12 ns   54687500
BM_memcpy/32k       1834 ns       1837 ns     357143

Result Comparison

It is possible to compare benchmarking results. See the Additional Tooling Documentation.

Runtime and Reporting Considerations

When the benchmark binary is executed, each benchmark function is run serially. The number of iterations to run is determined dynamically by running the benchmark a few times, measuring the time taken, and ensuring that the ultimate result is statistically stable. As such, faster benchmark functions will be run for more iterations than slower benchmark functions, which is why the number of iterations is reported.

In all cases, the number of iterations for which the benchmark is run is governed by the amount of time the benchmark takes. Concretely, the benchmark runs for at least one and at most 1e9 iterations, until the CPU time exceeds the minimum time or the wallclock time exceeds 5x the minimum time. The minimum time is set per benchmark by calling MinTime on the registered benchmark object.
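
For example, a minimal sketch (reusing the BM_test name from elsewhere in this guide) that raises the minimum time for one benchmark to 2 seconds:

// Run BM_test until at least 2 seconds have been spent, instead of the default minimum time.
BENCHMARK(BM_test)->MinTime(2.0);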

Average timings are then reported over the iterations run. If multiple repetitions are requested using the --benchmark_repetitions command-line option, or at registration time, the benchmark function will be run several times and statistical results across these repetitions will also be reported.

As well as the per-benchmark entries, a preamble in the report will include information about the machine on which the benchmarks are run.

Passing Arguments

Sometimes a family of benchmarks can be implemented with just one routine that takes an extra argument to specify which one of the family of benchmarks to run. For example, the following code defines a family of benchmarks for measuring the speed of memcpy() calls of different lengths:

static void BM_memcpy(benchmark::State& state) {
  char* src = new char[state.range(0)];
  char* dst = new char[state.range(0)];
  memset(src, 'x', state.range(0));
  for (auto _ : state)
    memcpy(dst, src, state.range(0));
  state.SetBytesProcessed(int64_t(state.iterations()) *
                          int64_t(state.range(0)));
  delete[] src;
  delete[] dst;
}
BENCHMARK(BM_memcpy)->Arg(8)->Arg(64)->Arg(512)->Arg(1<<10)->Arg(8<<10);

The preceding code is quite repetitive, and can be replaced with the following short-hand. The following invocation will pick a few appropriate arguments in the specified range and will generate a benchmark for each such argument.

BENCHMARK(BM_memcpy)->Range(8, 8<<10);

By default the arguments in the range are generated in multiples of eight and the command above selects [ 8, 64, 512, 4k, 8k ]. In the following code the range multiplier is changed to multiples of two.

BENCHMARK(BM_memcpy)->RangeMultiplier(2)->Range(8, 8<<10);

Now arguments generated are [ 8, 16, 32, 64, 128, 256, 512, 1024, 2k, 4k, 8k ].

The preceding code shows a method of defining a sparse range. The following example shows a method of defining a dense range. It is then used to benchmark the performance of std::vector initialization for uniformly increasing sizes.

static void BM_DenseRange(benchmark::State& state) {
  for(auto _ : state) {
    std::vector<int> v(state.range(0), state.range(0));
    benchmark::DoNotOptimize(v.data());
    benchmark::ClobberMemory();
  }
}
BENCHMARK(BM_DenseRange)->DenseRange(0, 1024, 128);

Now arguments generated are [ 0, 128, 256, 384, 512, 640, 768, 896, 1024 ].

You might have a benchmark that depends on two or more inputs. For example, the following code defines a family of benchmarks for measuring the speed of set insertion.

static void BM_SetInsert(benchmark::State& state) {
  std::set<int> data;
  for (auto _ : state) {
    state.PauseTiming();
    data = ConstructRandomSet(state.range(0));
    state.ResumeTiming();
    for (int j = 0; j < state.range(1); ++j)
      data.insert(RandomNumber());
  }
}
BENCHMARK(BM_SetInsert)
    ->Args({1<<10, 128})
    ->Args({2<<10, 128})
    ->Args({4<<10, 128})
    ->Args({8<<10, 128})
    ->Args({1<<10, 512})
    ->Args({2<<10, 512})
    ->Args({4<<10, 512})
    ->Args({8<<10, 512});

The preceding code is quite repetitive, and can be replaced with the following short-hand. The following macro will pick a few appropriate arguments in the product of the two specified ranges and will generate a benchmark for each such pair.

BENCHMARK(BM_SetInsert)->Ranges({{1<<10, 8<<10}, {128, 512}});

Some benchmarks may require specific argument values that cannot be expressed with Ranges. In this case, ArgsProduct offers the ability to generate a benchmark input for each combination in the product of the supplied vectors.

BENCHMARK(BM_SetInsert)
    ->ArgsProduct({{1<<10, 3<<10, 8<<10}, {20, 40, 60, 80}})
// would generate the same benchmark arguments as
BENCHMARK(BM_SetInsert)
    ->Args({1<<10, 20})
    ->Args({3<<10, 20})
    ->Args({8<<10, 20})
    ->Args({3<<10, 40})
    ->Args({8<<10, 40})
    ->Args({1<<10, 40})
    ->Args({1<<10, 60})
    ->Args({3<<10, 60})
    ->Args({8<<10, 60})
    ->Args({1<<10, 80})
    ->Args({3<<10, 80})
    ->Args({8<<10, 80});

For more complex patterns of inputs, passing a custom function to Apply allows programmatic specification of an arbitrary set of arguments on which to run the benchmark. The following example enumerates a dense range on one parameter, and a sparse range on the second.

static void CustomArguments(benchmark::internal::Benchmark* b) {
  for (int i = 0; i <= 10; ++i)
    for (int j = 32; j <= 1024*1024; j *= 8)
      b->Args({i, j});
}
BENCHMARK(BM_SetInsert)->Apply(CustomArguments);

Passing Arbitrary Arguments to a Benchmark

In C++11 it is possible to define a benchmark that takes an arbitrary number of extra arguments. The BENCHMARK_CAPTURE(func, test_case_name, ...args) macro creates a benchmark that invokes func with the benchmark::State as the first argument followed by the specified args.... The test_case_name is appended to the name of the benchmark and should describe the values passed.

template <class ...ExtraArgs>
void BM_takes_args(benchmark::State& state, ExtraArgs&&... extra_args) {
  [...]
}
// Registers a benchmark named "BM_takes_args/int_string_test" that passes
// the specified values to `extra_args`.
BENCHMARK_CAPTURE(BM_takes_args, int_string_test, 42, std::string("abc"));

Note that elements of ...args may refer to global variables. Users should avoid modifying global state inside of a benchmark.
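
As a concrete sketch (the BM_CompareStrings function and its arguments below are purely illustrative), the captured values simply arrive as the extra parameters of the benchmark function:

#include <benchmark/benchmark.h>
#include <string>

static void BM_CompareStrings(benchmark::State& state,
                              std::string a, std::string b) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(a.compare(b));
  }
}
// Registers a benchmark named "BM_CompareStrings/short_vs_long".
BENCHMARK_CAPTURE(BM_CompareStrings, short_vs_long,
                  std::string("hello"), std::string("hello, world"));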

Calculating Asymptotic Complexity (Big O)

Asymptotic complexity might be calculated for a family of benchmarks. The following code will calculate the coefficient for the high-order term in the running time and the normalized root-mean-square error of string comparison.

static void BM_StringCompare(benchmark::State& state) {
  std::string s1(state.range(0), '-');
  std::string s2(state.range(0), '-');
  for (auto _ : state) {
    benchmark::DoNotOptimize(s1.compare(s2));
  }
  state.SetComplexityN(state.range(0));
}
BENCHMARK(BM_StringCompare)
    ->RangeMultiplier(2)->Range(1<<10, 1<<18)->Complexity(benchmark::oN);

As shown in the following invocation, asymptotic complexity might also be calculated automatically.

BENCHMARK(BM_StringCompare)
    ->RangeMultiplier(2)->Range(1<<10, 1<<18)->Complexity();

The following code will specify asymptotic complexity with a lambda function, which can be used to customize the high-order term calculation.

BENCHMARK(BM_StringCompare)->RangeMultiplier(2)
    ->Range(1<<10, 1<<18)->Complexity([](benchmark::IterationCount n)->double{return n; });

Templated Benchmarks

This example produces and consumes messages of size sizeof(v), state.range(0) times per iteration. It also outputs throughput in the absence of multiprogramming.

template <class Q> void BM_Sequential(benchmark::State& state) {
  Q q;
  typename Q::value_type v;
  for (auto _ : state) {
    for (int i = state.range(0); i--; )
      q.push(v);
    for (int e = state.range(0); e--; )
      q.Wait(&v);
  }
  // actually messages, not bytes:
  state.SetBytesProcessed(
      static_cast<int64_t>(state.iterations())*state.range(0));
}
BENCHMARK_TEMPLATE(BM_Sequential, WaitQueue<int>)->Range(1<<0, 1<<10);

Three macros are provided for adding benchmark templates.

#ifdef BENCHMARK_HAS_CXX11
#define BENCHMARK_TEMPLATE(func, ...) // Takes any number of parameters.
#else // C++ < C++11
#define BENCHMARK_TEMPLATE(func, arg1)
#endif
#define BENCHMARK_TEMPLATE1(func, arg1)
#define BENCHMARK_TEMPLATE2(func, arg1, arg2)
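
For instance, a brief sketch of the two-argument form, using a hypothetical conversion benchmark templated on a source and a destination type:

#include <benchmark/benchmark.h>

template <class From, class To>
void BM_StaticCast(benchmark::State& state) {
  From from{};
  for (auto _ : state) {
    // Prevent the cast from being optimized away.
    benchmark::DoNotOptimize(static_cast<To>(from));
  }
}
// Instantiates and registers BM_StaticCast<int, double>.
BENCHMARK_TEMPLATE2(BM_StaticCast, int, double);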

Fixtures

Fixture tests are created by first defining a type that derives from ::benchmark::Fixture and then creating/registering the tests using the following macros:

  • BENCHMARK_F(ClassName, Method)
  • BENCHMARK_DEFINE_F(ClassName, Method)
  • BENCHMARK_REGISTER_F(ClassName, Method)

For Example:

class MyFixture : public benchmark::Fixture {
public:
  void SetUp(const ::benchmark::State& state) {
  }

  void TearDown(const ::benchmark::State& state) {
  }
};

BENCHMARK_F(MyFixture, FooTest)(benchmark::State& st) {
   for (auto _ : st) {
     ...
  }
}

BENCHMARK_DEFINE_F(MyFixture, BarTest)(benchmark::State& st) {
   for (auto _ : st) {
     ...
  }
}
/* BarTest is NOT registered */
BENCHMARK_REGISTER_F(MyFixture, BarTest)->Threads(2);
/* BarTest is now registered */

Templated Fixtures

You can also create templated fixtures using the following macros:

  • BENCHMARK_TEMPLATE_F(ClassName, Method, ...)
  • BENCHMARK_TEMPLATE_DEFINE_F(ClassName, Method, ...)

For example:

template<typename T>
class MyFixture : public benchmark::Fixture {};

BENCHMARK_TEMPLATE_F(MyFixture, IntTest, int)(benchmark::State& st) {
   for (auto _ : st) {
     ...
  }
}

BENCHMARK_TEMPLATE_DEFINE_F(MyFixture, DoubleTest, double)(benchmark::State& st) {
   for (auto _ : st) {
     ...
  }
}

BENCHMARK_REGISTER_F(MyFixture, DoubleTest)->Threads(2);

Custom Counters

You can add your own counters with user-defined names. The example below will add columns "Foo", "Bar" and "Baz" in its output:

static void UserCountersExample1(benchmark::State& state) {
  double numFoos = 0, numBars = 0, numBazs = 0;
  for (auto _ : state) {
    // ... count Foo,Bar,Baz events
  }
  state.counters["Foo"] = numFoos;
  state.counters["Bar"] = numBars;
  state.counters["Baz"] = numBazs;
}

The state.counters object is a std::map with std::string keys and Counter values. The latter is a double-like class, via an implicit conversion to double&. Thus you can use all of the standard arithmetic assignment operators (=,+=,-=,*=,/=) to change the value of each counter.

In multithreaded benchmarks, each counter is set on the calling thread only. When the benchmark finishes, the counters from each thread will be summed; the resulting sum is the value which will be shown for the benchmark.

The Counter constructor accepts three parameters: the value as a double; a bit flag which allows you to show counters as rates, per-thread iterations, per-thread averages, iteration invariants, and/or to invert the result; and a flag specifying the 'unit', i.e. whether 1k means 1000 (the default, benchmark::Counter::OneK::kIs1000) or 1024 (benchmark::Counter::OneK::kIs1024).

  // sets a simple counter
  state.counters["Foo"] = numFoos;

  // Set the counter as a rate. It will be presented divided
  // by the duration of the benchmark.
  // Meaning: per one second, how many 'foo's are processed?
  state.counters["FooRate"] = Counter(numFoos, benchmark::Counter::kIsRate);

  // Set the counter as a rate. It will be presented divided
  // by the duration of the benchmark, and the result inverted.
  // Meaning: how many seconds it takes to process one 'foo'?
  state.counters["FooInvRate"] = Counter(numFoos, benchmark::Counter::kIsRate | benchmark::Counter::kInvert);

  // Set the counter as a thread-average quantity. It will
  // be presented divided by the number of threads.
  state.counters["FooAvg"] = Counter(numFoos, benchmark::Counter::kAvgThreads);

  // There's also a combined flag:
  state.counters["FooAvgRate"] = Counter(numFoos,benchmark::Counter::kAvgThreadsRate);

  // This says that we process with the rate of state.range(0) bytes every iteration:
  state.counters["BytesProcessed"] = Counter(state.range(0), benchmark::Counter::kIsIterationInvariantRate, benchmark::Counter::OneK::kIs1024);

When you're compiling in C++11 mode or later you can use insert() with std::initializer_list:

  // With C++11, this can be done:
  state.counters.insert({{"Foo", numFoos}, {"Bar", numBars}, {"Baz", numBazs}});
  // ... instead of:
  state.counters["Foo"] = numFoos;
  state.counters["Bar"] = numBars;
  state.counters["Baz"] = numBazs;

Counter Reporting

When using the console reporter, by default, user counters are printed at the end after the table, the same way as bytes_processed and items_processed. This is best for cases in which there are few counters, or where there are only a couple of lines per benchmark. Here's an example of the default output:

------------------------------------------------------------------------------
Benchmark                        Time           CPU Iterations UserCounters...
------------------------------------------------------------------------------
BM_UserCounter/threads:8      2248 ns      10277 ns      68808 Bar=16 Bat=40 Baz=24 Foo=8
BM_UserCounter/threads:1      9797 ns       9788 ns      71523 Bar=2 Bat=5 Baz=3 Foo=1024m
BM_UserCounter/threads:2      4924 ns       9842 ns      71036 Bar=4 Bat=10 Baz=6 Foo=2
BM_UserCounter/threads:4      2589 ns      10284 ns      68012 Bar=8 Bat=20 Baz=12 Foo=4
BM_UserCounter/threads:8      2212 ns      10287 ns      68040 Bar=16 Bat=40 Baz=24 Foo=8
BM_UserCounter/threads:16     1782 ns      10278 ns      68144 Bar=32 Bat=80 Baz=48 Foo=16
BM_UserCounter/threads:32     1291 ns      10296 ns      68256 Bar=64 Bat=160 Baz=96 Foo=32
BM_UserCounter/threads:4      2615 ns      10307 ns      68040 Bar=8 Bat=20 Baz=12 Foo=4
BM_Factorial                    26 ns         26 ns   26608979 40320
BM_Factorial/real_time          26 ns         26 ns   26587936 40320
BM_CalculatePiRange/1           16 ns         16 ns   45704255 0
BM_CalculatePiRange/8           73 ns         73 ns    9520927 3.28374
BM_CalculatePiRange/64         609 ns        609 ns    1140647 3.15746
BM_CalculatePiRange/512       4900 ns       4901 ns     142696 3.14355

If this doesn't suit you, you can print each counter as a table column by passing the flag --benchmark_counters_tabular=true to the benchmark application. This is best for cases in which there are a lot of counters, or a lot of lines per individual benchmark. Note that this will trigger a reprinting of the table header any time the counter set changes between individual benchmarks. Here's an example of corresponding output when --benchmark_counters_tabular=true is passed:

---------------------------------------------------------------------------------------
Benchmark                        Time           CPU Iterations    Bar   Bat   Baz   Foo
---------------------------------------------------------------------------------------
BM_UserCounter/threads:8      2198 ns       9953 ns      70688     16    40    24     8
BM_UserCounter/threads:1      9504 ns       9504 ns      73787      2     5     3     1
BM_UserCounter/threads:2      4775 ns       9550 ns      72606      4    10     6     2
BM_UserCounter/threads:4      2508 ns       9951 ns      70332      8    20    12     4
BM_UserCounter/threads:8      2055 ns       9933 ns      70344     16    40    24     8
BM_UserCounter/threads:16     1610 ns       9946 ns      70720     32    80    48    16
BM_UserCounter/threads:32     1192 ns       9948 ns      70496     64   160    96    32
BM_UserCounter/threads:4      2506 ns       9949 ns      70332      8    20    12     4
--------------------------------------------------------------
Benchmark                        Time           CPU Iterations
--------------------------------------------------------------
BM_Factorial                    26 ns         26 ns   26392245 40320
BM_Factorial/real_time          26 ns         26 ns   26494107 40320
BM_CalculatePiRange/1           15 ns         15 ns   45571597 0
BM_CalculatePiRange/8           74 ns         74 ns    9450212 3.28374
BM_CalculatePiRange/64         595 ns        595 ns    1173901 3.15746
BM_CalculatePiRange/512       4752 ns       4752 ns     147380 3.14355
BM_CalculatePiRange/4k       37970 ns      37972 ns      18453 3.14184
BM_CalculatePiRange/32k     303733 ns     303744 ns       2305 3.14162
BM_CalculatePiRange/256k   2434095 ns    2434186 ns        288 3.1416
BM_CalculatePiRange/1024k  9721140 ns    9721413 ns         71 3.14159
BM_CalculatePi/threads:8      2255 ns       9943 ns      70936

Note above the additional header printed when the benchmark changes from BM_UserCounter to BM_Factorial. This is because BM_Factorial does not have the same counter set as BM_UserCounter.

Multithreaded Benchmarks

In a multithreaded test (benchmark invoked by multiple threads simultaneously), it is guaranteed that none of the threads will start until all have reached the start of the benchmark loop, and all will have finished before any thread exits the benchmark loop. (This behavior is also provided by the KeepRunning() API.) As such, any global setup or teardown can be wrapped in a check against the thread index:

static void BM_MultiThreaded(benchmark::State& state) {
  if (state.thread_index == 0) {
    // Setup code here.
  }
  for (auto _ : state) {
    // Run the test as normal.
  }
  if (state.thread_index == 0) {
    // Teardown code here.
  }
}
BENCHMARK(BM_MultiThreaded)->Threads(2);
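
Besides Threads(n), the registered benchmark also accepts a few other thread configurations; a short sketch reusing the BM_MultiThreaded function above:

BENCHMARK(BM_MultiThreaded)->ThreadRange(1, 8);  // run with 1, 2, 4 and 8 threads
BENCHMARK(BM_MultiThreaded)->ThreadPerCpu();     // one benchmark thread per CPU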

If the benchmarked code itself uses threads and you want to compare it to single-threaded code, you may want to use real-time ("wallclock") measurements for latency comparisons:

BENCHMARK(BM_test)->Range(8, 8<<10)->UseRealTime();

Without UseRealTime, CPU time is used by default.

CPU Timers

By default, the CPU timer only measures the time spent by the main thread. If the benchmark itself uses threads internally, this measurement may not be what you are looking for. Instead, there is a way to measure the total CPU usage of the process, by all the threads.

void callee(int i);

static void MyMain(int size) {
#pragma omp parallel for
  for(int i = 0; i < size; i++)
    callee(i);
}

static void BM_OpenMP(benchmark::State& state) {
  for (auto _ : state)
    MyMain(state.range(0));
}

// Measure the time spent by the main thread, use it to decide for how long to
// run the benchmark loop. Depending on internal implementation details, this may
// measure anywhere from near-zero (the overhead spent before/after work
// handoff to worker thread[s]) to the whole single-thread time.
BENCHMARK(BM_OpenMP)->Range(8, 8<<10);

// Measure the user-visible time, the wall clock (literally, the time that
// has passed on the clock on the wall), use it to decide for how long to
// run the benchmark loop. This will always be meaningful, and will match the
// time spent by the main thread in the single-threaded case, in general
// decreasing with the number of internal threads doing the work.
BENCHMARK(BM_OpenMP)->Range(8, 8<<10)->UseRealTime();

// Measure the total CPU consumption, use it to decide for how long to
// run the benchmark loop. This will always measure to no less than the
// time spent by the main thread in the single-threaded case.
BENCHMARK(BM_OpenMP)->Range(8, 8<<10)->MeasureProcessCPUTime();

// A mixture of the last two. Measure the total CPU consumption, but use the
// wall clock to decide for how long to run the benchmark loop.
BENCHMARK(BM_OpenMP)->Range(8, 8<<10)->MeasureProcessCPUTime()->UseRealTime();

Controlling Timers

Normally, the entire duration of the work loop (for (auto _ : state) {}) is measured. But sometimes, it is necessary to do some work inside that loop on every iteration without counting that time toward the benchmark time. That is possible, although it is not recommended, since it has high overhead.

static void BM_SetInsert_With_Timer_Control(benchmark::State& state) {
  std::set<int> data;
  for (auto _ : state) {
    state.PauseTiming(); // Stop timers. They will not count until they are resumed.
    data = ConstructRandomSet(state.range(0)); // Do something that should not be measured
    state.ResumeTiming(); // And resume timers. They are now counting again.
    // The rest will be measured.
    for (int j = 0; j < state.range(1); ++j)
      data.insert(RandomNumber());
  }
}
BENCHMARK(BM_SetInsert_With_Timer_Control)->Ranges({{1<<10, 8<<10}, {128, 512}});

Manual Timing

For benchmarking something for which neither CPU time nor real time is correct or accurate enough, completely manual timing is supported using the UseManualTime function.

When UseManualTime is used, the benchmarked code must call SetIterationTime once per iteration of the benchmark loop to report the manually measured time.

An example use case for this is benchmarking GPU execution (e.g. OpenCL or CUDA kernels, OpenGL or Vulkan or Direct3D draw calls), which cannot be accurately measured using CPU time or real-time. Instead, they can be measured accurately using a dedicated API, and these measurement results can be reported back with SetIterationTime.

static void BM_ManualTiming(benchmark::State& state) {
  int microseconds = state.range(0);
  std::chrono::duration<double, std::micro> sleep_duration {
    static_cast<double>(microseconds)
  };

  for (auto _ : state) {
    auto start = std::chrono::high_resolution_clock::now();
    // Simulate some useful workload with a sleep
    std::this_thread::sleep_for(sleep_duration);
    auto end = std::chrono::high_resolution_clock::now();

    auto elapsed_seconds =
      std::chrono::duration_cast<std::chrono::duration<double>>(
        end - start);

    state.SetIterationTime(elapsed_seconds.count());
  }
}
BENCHMARK(BM_ManualTiming)->Range(1, 1<<17)->UseManualTime();

Setting the Time Unit

If a benchmark runs for a few milliseconds, it may be hard to visually compare the measured times, since the output is reported in nanoseconds by default. To set the time unit manually, specify it at registration time:

BENCHMARK(BM_test)->Unit(benchmark::kMillisecond);

Preventing Optimization

To prevent a value or expression from being optimized away by the compiler the benchmark::DoNotOptimize(...) and benchmark::ClobberMemory() functions can be used.

static void BM_test(benchmark::State& state) {
  for (auto _ : state) {
      int x = 0;
      for (int i=0; i < 64; ++i) {
        benchmark::DoNotOptimize(x += i);
      }
  }
}

DoNotOptimize(<expr>) forces the result of <expr> to be stored in either memory or a register. For GNU-based compilers it acts as a read/write barrier for global memory. More specifically, it forces the compiler to flush pending writes to memory and reload any other values as necessary.

Note that DoNotOptimize(<expr>) does not prevent optimizations on <expr> in any way. <expr> may even be removed entirely when the result is already known. For example:

  /* Example 1: `<expr>` is removed entirely. */
  int foo(int x) { return x + 42; }
  while (...) DoNotOptimize(foo(0)); // Optimized to DoNotOptimize(42);

  /*  Example 2: Result of '<expr>' is only reused */
  int bar(int) __attribute__((const));
  while (...) DoNotOptimize(bar(0)); // Optimized to:
  // int __result__ = bar(0);
  // while (...) DoNotOptimize(__result__);

The second tool for preventing optimizations is ClobberMemory(). In essence, ClobberMemory() forces the compiler to perform all pending writes to global memory. Memory managed by block-scope objects must be "escaped" using DoNotOptimize(...) before it can be clobbered. In the example below, ClobberMemory() prevents the call to v.push_back(42) from being optimized away.

static void BM_vector_push_back(benchmark::State& state) {
  for (auto _ : state) {
    std::vector<int> v;
    v.reserve(1);
    benchmark::DoNotOptimize(v.data()); // Allow v.data() to be clobbered.
    v.push_back(42);
    benchmark::ClobberMemory(); // Force 42 to be written to memory.
  }
}

Note that ClobberMemory() is only available for GNU or MSVC based compilers.

Statistics: Reporting the Mean, Median and Standard Deviation of Repeated Benchmarks

By default each benchmark is run once and that single result is reported. However benchmarks are often noisy and a single result may not be representative of the overall behavior. For this reason it's possible to repeatedly rerun the benchmark.

The number of runs of each benchmark is specified globally by the --benchmark_repetitions flag or on a per benchmark basis by calling Repetitions on the registered benchmark object. When a benchmark is run more than once the mean, median and standard deviation of the runs will be reported.

Additionally, the --benchmark_report_aggregates_only={true|false} and --benchmark_display_aggregates_only={true|false} flags, or the ReportAggregatesOnly(bool) and DisplayAggregatesOnly(bool) functions, can be used to change how repeated tests are reported. By default the result of each repeated run is reported. When the report-aggregates-only option is true, only the aggregates of the runs (i.e. mean, median and standard deviation, plus complexity measurements if they were requested) are reported, to both reporters: standard output (console) and the file. When the display-aggregates-only option is true, only the aggregates are displayed on standard output, while the file output still contains everything. Calling ReportAggregatesOnly(bool) / DisplayAggregatesOnly(bool) on a registered benchmark object overrides the value of the appropriate flag for that benchmark.
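
For instance, a minimal sketch (again reusing the BM_test name) that requests 10 repetitions at registration time and reports only the aggregates:

// Run 10 repetitions and report only the aggregate statistics, on the console
// and in any requested output file.
BENCHMARK(BM_test)->Repetitions(10)->ReportAggregatesOnly(true);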

Custom Statistics

While having mean, median and standard deviation is nice, this may not be enough for everyone. For example you may want to know what the largest observation is, e.g. because you have some real-time constraints. This is easy. The following code will specify a custom statistic to be calculated, defined by a lambda function.

void BM_spin_empty(benchmark::State& state) {
  for (auto _ : state) {
    for (int x = 0; x < state.range(0); ++x) {
      benchmark::DoNotOptimize(x);
    }
  }
}

BENCHMARK(BM_spin_empty)
  ->ComputeStatistics("max", [](const std::vector<double>& v) -> double {
    return *(std::max_element(std::begin(v), std::end(v)));
  })
  ->Arg(512);

Using RegisterBenchmark(name, fn, args...)

The RegisterBenchmark(name, func, args...) function provides an alternative way to create and register benchmarks. RegisterBenchmark(name, func, args...) creates, registers, and returns a pointer to a new benchmark with the specified name that invokes func(st, args...) where st is a benchmark::State object.

Unlike the BENCHMARK registration macros, which can only be used at global scope, RegisterBenchmark can be called anywhere. This allows benchmarks to be registered programmatically.

Additionally, RegisterBenchmark allows any callable object to be registered as a benchmark, including capturing lambdas and function objects.

For Example:

auto BM_test = [](benchmark::State& st, auto Inputs) { /* ... */ };

int main(int argc, char** argv) {
  for (auto& test_input : { /* ... */ })
      benchmark::RegisterBenchmark(test_input.name(), BM_test, test_input);
  benchmark::Initialize(&argc, argv);
  benchmark::RunSpecifiedBenchmarks();
}

Exiting with an Error

When errors caused by external influences, such as file I/O or network communication, occur within a benchmark, the State::SkipWithError(const char* msg) function can be used to skip that run of the benchmark and report the error. Note that only future iterations of the KeepRunning() loop are skipped. For the ranged-for version of the benchmark loop, users must explicitly exit the loop, otherwise all iterations will be performed. Users may explicitly return to exit the benchmark immediately.

The SkipWithError(...) function may be used at any point within the benchmark, including before and after the benchmark loop. Moreover, if SkipWithError(...) has been used, it is not required to reach the benchmark loop and one may return from the benchmark function early.

For example:

static void BM_test(benchmark::State& state) {
  auto resource = GetResource();
  if (!resource.good()) {
    state.SkipWithError("Resource is not good!");
    // KeepRunning() loop will not be entered.
  }
  while (state.KeepRunning()) {
    auto data = resource.read_data();
    if (!resource.good()) {
      state.SkipWithError("Failed to read data!");
      break; // Needed to skip the rest of the iteration.
    }
    do_stuff(data);
  }
}

static void BM_test_ranged_for(benchmark::State& state) {
  auto resource = GetResource();
  if (!resource.good()) {
    state.SkipWithError("Resource is not good!");
    return; // Early return is allowed when SkipWithError() has been used.
  }
  for (auto _ : state) {
    auto data = resource.read_data();
    if (!resource.good()) {
      state.SkipWithError("Failed to read data!");
      break; // REQUIRED to prevent all further iterations.
    }
    do_stuff(data);
  }
}

A Faster KeepRunning Loop

In C++11 mode, a range-based for loop should be used in preference to the KeepRunning loop for running the benchmarks. For example:

static void BM_Fast(benchmark::State &state) {
  for (auto _ : state) {
    FastOperation();
  }
}
BENCHMARK(BM_Fast);

The range-based for loop is faster than using KeepRunning because KeepRunning requires a memory load and store of the iteration count every iteration, whereas the range-based variant is able to keep the iteration count in a register.

For example, an empty inner loop using the range-based for method looks like:

# Loop Init
  mov rbx, qword ptr [r14 + 104]
  call benchmark::State::StartKeepRunning()
  test rbx, rbx
  je .LoopEnd
.LoopHeader: # =>This Inner Loop Header: Depth=1
  add rbx, -1
  jne .LoopHeader
.LoopEnd:

Compared to an empty KeepRunning loop, which looks like:

.LoopHeader: # in Loop: Header=BB0_3 Depth=1
  cmp byte ptr [rbx], 1
  jne .LoopInit
.LoopBody: # =>This Inner Loop Header: Depth=1
  mov rax, qword ptr [rbx + 8]
  lea rcx, [rax + 1]
  mov qword ptr [rbx + 8], rcx
  cmp rax, qword ptr [rbx + 104]
  jb .LoopHeader
  jmp .LoopEnd
.LoopInit:
  mov rdi, rbx
  call benchmark::State::StartKeepRunning()
  jmp .LoopBody
.LoopEnd:

Unless C++03 compatibility is required, the range-based variant of writing the benchmark loop should be preferred.

Disabling CPU Frequency Scaling

If you see this error:

***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.

you might want to disable the CPU frequency scaling while running the benchmark:

sudo cpupower frequency-set --governor performance
./mybench
sudo cpupower frequency-set --governor powersave

Comments
  • Add user-defined counters.

    This PR adds user-defined counters, as discussed in issue #240. I've added usage notes to the README.md file.

    This was implemented mainly by adding two classes: BenchmarkCounters which is responsible for managing the collection of counters and Counter which contains logic for each individual counter. BenchmarkCounters objects can be summed up to aggregate multithread benchmark results. Then I added a BenchmarkCounters member to State and Run, which are accessible as needed.

    Though the changes in this PR generally adhere to the gist of the discussion in issue #240, I've opted to leave the bytes_processed and items_processed as they previously were. In an early commit I did move these quantities to the BenchmarkCounters class, but otherwise noted that these changes were maybe a bit too far-reaching for what I wanted to do here. As an illustration of the resulting approach, I left some of this code commented out (e.g., see src/benchmark.cc:470).

    I added two methods to the Fixture class: InitState() and TerminateState(). These were needed to change the state before and after the benchmark. They receive a non-const reference to benchmark::State so that they start and stop hardware counters as needed. I did this instead of changing the State constness in Setup() and TearDown() to retain compatibility. See src/benchmark.cc:758.

    The JSON reporter was the easiest to adapt. The Console and CSV reporters had the part printing the header moved out of Reporter::PrintContext(). In the console reporter, I store the previously-printed header, and check on each Run whether the header is different; if it is then a new header is printed. The printing of header inside the table was not immediately clear, so I added another separator line filled with '-' before the names; this makes the in-table headers more visible by separating them more clearly from the results above. In the CSV reporter, I gather first all the fields in a set, and then print counters (or nothing) based on this set. This set is local so that part is done outside of the PrintRunData() method which I promptly recognize as awkward. See src/csv_reporter.cc:80 below.

    Note that I needed to add #includes of <utility>,<vector>,<string> and <initializer_list> (this last only when compiling in C++11) to the benchmark_api.h file. If you do not want this, let me know, or just feel free to suggest alternative approaches that would obviate these includes.

    All the existing unit tests are passed successfully. I added also another test showing use with the PAPI library, but will do a separate PR for that, after this one is closed. Also, I did not add new specific tests for the user counters. If this PR is approved, then we'll discuss what to test for.

    Please let me know if something in the code of this PR should be changed. For example, maybe BenchmarkCounters could be called UserCounters. Also, I tried to be faithful to the style of this project, but if for any reason something was overlooked feel free to point it out; I'll gladly correct it.

  • CI: Add Conan testing, mass building and publishing

    This

    • is adding a recipe for the Conan package manager
    • also adds a small test to the CI to check that Conan package creation is working
    • updated: also adds an upload script to the CI for uploading the Conan recipe to a Conan repository

    ~~Please note that for future releases of benchmark the version string needs to be updated in conanfile.py~~ (not anymore, see discussion)


    ~~Future work could include~~ Remaining to do list:

    • [x] setting up a free repository on Bintray: https://bintray.com/dominichamon/benchmark
    • [ ] including the package in the official Conan repository conan-center
    • [x] and extending the CI to upload new versions of the recipe and built packages to the Bintray repository automatically

    To use the recipe after this pull request is merged, you first need to clone this git repository; once the recipe is included in conan-center, every Conan client will be able to use it via a single install command.


    Please let me know if you have any further questions regarding supporting Conan 😄

    Fixes #635

    //cc @raulbocanegra @p-groarke @danimtb @Mikayex @mpusz @iblis-ms

  • Benchmark ID for Better Tooling

    This Pull Request implements a Benchmark ID that is output for the various reporting formats.

    What it does

    This pull request outputs an additional column (console and csv) and an additional field (json) called "ID" ("id" for csv and json). It looks like this:

    (Screenshots: Multi-Pretty ID Printing; Pretty ID Printing)

    Rationale

    Suppose one registers 2 benchmark functions, as below (implementation purely for exposition):

    #include <benchmark/benchmark.h>
    
    static void BM_StringCreation(benchmark::State& state) {
      for (auto _ : state)
        std::string empty_string;
    }
    // Register the function as a benchmark
    BENCHMARK(BM_StringCreation);
    
    // Define another benchmark
    static void BM_StringCreation_Append(benchmark::State& state) {
      for (auto _ : state) {
        std::string empty_string;
        empty_string += "hello";
      }
    }
    BENCHMARK(BM_StringCreation_Append);
    
    BENCHMARK_MAIN();
    

    Now consider that someone runs this code with --benchmark_repetitions=2. We get the report for BM_StringCreation and BM_StringCreation_Append, giving us stddev, mean, and median for the benchmarks. The problem then becomes...

    How do you associate the name of a benchmark statistic with the benchmark it belongs to?

    Normally, this wouldn't be a problem... but in the presence of custom statistics allowed by ComputeStatistics, problems arise. For example, if a statistic called my_stat is added, a tool cannot reliably determine which stats belong to which benchmark:

    --------------------------------------------...
    Name
    --------------------------------------------...
    BM_StringCreation
    BM_StringCreation
    BM_StringCreation_stddev
    BM_StringCreation_median
    BM_StringCreation_mean
    BM_StringCreation_Append
    BM_StringCreation_Append
    BM_StringCreation_Append_stddev
    BM_StringCreation_Append_median
    BM_StringCreation_Append_mean
    BM_StringCreation_Append_my_stat
    

    Is the statistic BM_StringCreation_Append_my_stat called Append_my_stat and does it belong to BM_StringCreation? Or is the statistic's name my_stat and does it belong to BM_StringCreation_Append?

    A unique identifier solves this ambiguity, paving the way for tools to appropriately group statistic measurements with the right benchmark with a very small, simple change. It also allows tools to use order-independent parsers for json and csv, which is important in tooling languages where the default dictionary types and parsers do not guarantee read-order to be the order of submission.

    Code Submitted

    The code submitted here adds a few extra utility functions and a few new members. Convention was followed as much as possible, and int was preferred except where existing convention (such as inside ConsoleReporter, where the saved width types are size_t) called for otherwise.

  • [BUG] requirement of C++11 is incompatible with Google Test

    Google Benchmark insists on being compiled with the standard not exceeding C++11 https://github.com/google/benchmark/blob/974cd5a5c5a78e76ebc50961f4dbf3bf6d4ade4e/CMakeLists.txt#L174

    At the same time, Google Benchmark requires Google Test, which is incompatible with C++11 https://github.com/google/googletest/blob/5b909beeec178f338be997830b6c31a80cda7a93/googletest/include/gtest/internal/gtest-internal.h#L635

      typedef ::std::map<std::string, CodeLocation, std::less<>> RegisteredTestsMap;
    

    std::less without template parameters doesn't compile in C++11.

  • C++11 Regular Expressions

    I would like to get benchmark working on Windows. I see that #29 has done a lot of work in that respect but resulted in #30 about the replacement of the Regex class with std::regex.

    This PR implements a C++11 backend to the Regex class that passes the unit tests. This is more to start a discussion of how to proceed with this branch.

    From what I can see Regex is only used in one place in benchmark.cc so the Regex class could be dropped all together. However this is opposed by @dominichamon

    i actually had some issues when i tried using std::regex for this project as the matching wasn't quite the same. I don't remember the details, i'm afraid, but i'd want much more testing before making this change.

    So I need to make the unit tests much more thorough than what they are now. I'm not sure what strings I should be testing to make sure that the C++11 backend is actually doing the same thing (it uses the same extended matching flags).

    There are other issues here: whilst __cplusplus >= 201103L is the correct check to work out if we are in C++11 mode, gcc 4.7 and 4.8 don't actually have <regex>; it only arrived in 4.9. So switching to just using std::regex would mean dropping support for anything before 4.9. A much better check would be for CMake to compile a snippet to work out if we have regular expressions and fall back to POSIX regular expressions as necessary.

    Soooooo, where shall we go from here? I'm happy to put in as much effort as needed until benchmark builds from CMake for MinGW. VS should drop out nicely from this as CMake is awesome...but MSVC is crap so who knows what will happen there. Maybe clang to the rescue :wink:

  • Openmp compatibility

    This patch makes Google Benchmark compatible with OpenMP and other user-level thread management. Until now, google benchmark would only report the CPU usage of the master thread if the code being benchmarked used OpenMP or otherwise spawned multiple threads internally.

    This version reports the total process CPU usage if the number of google-benchmark threads is set to <= 1, but reverts to the existing behaviour otherwise.

    It may actually be preferable to report the total process CPU usage in all cases, but this is sufficient for my needs.

    We have been using google benchmark in our parallel programming class, however, every term students are confused when the CPU time roughly reflects the wall-clock time for parallelized codes doing the same amount of work. This version is also advantageous because it can better demonstrate the overhead of threading, that some tasks take more total CPU time when multi-threaded, and, sometimes, tasks may actually take less overall CPU time.

    If my feature patch cannot be merged, I would like to request that the maintainers implement this. It is very important to us.

  • Make `PauseTiming()` and `ResumeTiming()` per thread.

    Currently we time benchmarks using a single global timer that tracks per-process CPU usage. Pausing and resuming this timer have to act as a barrier to all threads. This has crippling effects on multi-threaded benchmarks. If you pause every iteration you synchronize the entire benchmark. It's effectively no longer multi-threaded.

    This patch changes to a per-thread timer. Instead of measuring process CPU time we sum thread CPU time and we pause on a per-thread basis.

    Below is a comparison of the new and old results from basic_test.cc. Note that the BM_spin_pause_during tests get 95% faster.

    Benchmark                                                    Time           CPU
    -------------------------------------------------------------------------------
    BM_empty_mean                                               +0.00         +0.00
    BM_empty/threads:4_mean                                     +0.00         +0.00
    BM_spin_empty/8_mean                                        +0.00         +0.00
    BM_spin_empty/512_mean                                      +0.01         +0.00
    BM_spin_empty/8k_mean                                       -0.00         -0.00
    BM_spin_empty/8/threads:4_mean                              +0.00         -0.17
    BM_spin_empty/512/threads:4_mean                            +0.01         +0.00
    BM_spin_empty/8k/threads:4_mean                             -0.01         +0.00
    BM_spin_pause_before/8_mean                                 +0.00         +0.00
    BM_spin_pause_before/512_mean                               +0.03         +0.02
    BM_spin_pause_before/8k_mean                                +0.01         +0.01
    BM_spin_pause_before/8/threads:4_mean                       +0.00         +0.00
    BM_spin_pause_before/512/threads:4_mean                     +0.04         +0.01
    BM_spin_pause_before/8k/threads:4_mean                      -0.03         -0.00
    BM_spin_pause_during/8_mean                                 -0.24         -0.25
    BM_spin_pause_during/512_mean                               -0.24         -0.24
    BM_spin_pause_during/8k_mean                                -0.13         -0.13
    BM_spin_pause_during/8/threads:4_mean                       -0.97         -0.90
    BM_spin_pause_during/512/threads:4_mean                     -0.96         -0.89
    BM_spin_pause_during/8k/threads:4_mean                      -0.95         -0.85
    BM_pause_during_mean                                        -0.23         -0.20
    BM_pause_during/threads:4_mean                              -0.97         -0.90
    BM_pause_during/real_time_mean                              -0.24         -0.26
    BM_pause_during/real_time/threads:4_mean                    -0.97         -0.90
    BM_spin_pause_after/8_mean                                  +0.00         +0.00
    BM_spin_pause_after/512_mean                                +0.00         +0.00
    BM_spin_pause_after/8k_mean                                 -0.00         -0.00
    BM_spin_pause_after/8/threads:4_mean                        +0.00         +0.00
    BM_spin_pause_after/512/threads:4_mean                      -0.01         -0.02
    BM_spin_pause_after/8k/threads:4_mean                       +0.01         +0.01
    BM_spin_pause_before_and_after/8_mean                       +0.00         +0.00
    BM_spin_pause_before_and_after/512_mean                     +0.00         +0.00
    BM_spin_pause_before_and_after/8k_mean                      +0.01         +0.01
    BM_spin_pause_before_and_after/8/threads:4_mean             +0.00         +0.00
    BM_spin_pause_before_and_after/512/threads:4_mean           +0.06         +0.04
    BM_spin_pause_before_and_after/8k/threads:4_mean            -0.00         +0.02
    BM_empty_stop_start_mean                                    +0.00         +0.00
    BM_empty_stop_start/threads:4_mean                          +0.00         +0.00
    

    There's still work to do on this, but I was hoping for initial feedback on the direction.
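
    For reference, a minimal sketch of the pause-every-iteration pattern that benefits here (hypothetical benchmark name, not the actual basic_test.cc code):

    #include <benchmark/benchmark.h>
    #include <vector>

    static void BM_pause_every_iteration(benchmark::State& state) {
      for (auto _ : state) {
        state.PauseTiming();          // under the global process timer this acts
        std::vector<int> v(1024, 1);  // as a barrier across all running threads
        state.ResumeTiming();         // per-thread timers avoid that stall
        benchmark::DoNotOptimize(v.data());
      }
    }
    BENCHMARK(BM_pause_every_iteration)->Threads(4);
    BENCHMARK_MAIN();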

  • [BUG] benchmark's CXX feature check fails to detect POSIX_REGEX | GNU_POSIX_REGEX

    [BUG] benchmark's CXX feature check fails to detect POSIX_REGEX | GNU_POSIX_REGEX

    Describe the bug: Trying to compile LLVM 11 from source with CMake 3.19 and an LLVM 11 toolchain on an x86 Debian-userland node. The LLVM source code pulls in Google's benchmark library, but configuring the build ends with:

    LLVM FileCheck Found: /usr/lib/llvm-11/bin/FileCheck
     -- Version: 0.0.0
     -- Performing Test HAVE_THREAD_SAFETY_ATTRIBUTES -- failed to compile
     -- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile
     -- Performing Test HAVE_POSIX_REGEX -- failed to compile
     CMake Warning at utils/benchmark/CMakeLists.txt:244 (message):
       Using std::regex with exceptions disabled is not fully supported
    

    System (OS, compiler, and compiler version):

    • OS: Linux 5.10 rc5 X64
    • package distro: Debian
    • Compiler and version: llvm11 toolchain
    • CMake version: 3.19.0
    • libpcre2-dev & libpcre2-posix2 version: 10.34-7

    Steps to reproduce the behaviour:

    1. install the llvm 11 toolchain from distro (and all other relevant dependencies for compilation, incl. libpcre2)
    2. git clone the llvm source code
    3. cmake for the llvm source code
    4. See failure

    Expected behaviour: CXXFeatureCheck.cmake should properly detect POSIX_REGEX as provided through libpcre2.

    Additional context: https://pcre.org provides its own native API, as well as a set of wrapper functions that correspond to the POSIX regular expression API.

    The CXX feature check is not verbose enough to show what is actually failing when checking for GNU_POSIX_REGEX | POSIX_REGEX.
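
    For reference, each check boils down to whether a small probe translation unit compiles and links. A minimal sketch of such a probe, assuming the standard POSIX <regex.h> API rather than quoting benchmark's actual probe file, looks roughly like this; if the POSIX wrappers live in a PCRE2 library that is not on the link line, the try_compile fails and benchmark falls back to std::regex:

    #include <regex.h>
    #include <string>

    int main() {
      std::string str = "test0159";
      regex_t re;
      // Compile and run a trivial pattern; any compile or link failure here
      // makes the HAVE_POSIX_REGEX feature check report "failed to compile".
      if (regcomp(&re, "^[a-z]+[0-9]+$", REG_EXTENDED | REG_NOSUB) != 0) return -1;
      const int ret = regexec(&re, str.c_str(), 0, nullptr, 0) ? -1 : 0;
      regfree(&re);
      return ret;
    }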

  • Add PyPI upload job to wheel building workflow

    Add PyPI upload job to wheel building workflow

    This commit adds a job, running after the wheel-building job, that is responsible for uploading the built wheels to PyPI. The job only runs on successful completion of all build jobs, and uploads to PyPI using a secret added to the Google Benchmark repo (TBD). Also, the setup-python action has been bumped to the latest version, v3.

    TODO:

    • Verify that the credential setup is correct (by trying a wheel building CI job)
    • Make sure the token is added to this repo's GitHub secrets under the name pypi_password (or I can change the name to whatever it is now)
  • Implementation of random interleaving.

    Implementation of random interleaving.

    See http://github.com/google/benchmark/issues/1051 for the feature request.

    Committer: Hai Huang (http://github.com/haih-g)

    On branch fr-1051. Changes to be committed:

      modified:  include/benchmark/benchmark.h
      modified:  src/benchmark.cc
      new file:  src/benchmark_adjust_repetitions.cc
      new file:  src/benchmark_adjust_repetitions.h
      modified:  src/benchmark_api_internal.cc
      modified:  src/benchmark_api_internal.h
      modified:  src/benchmark_register.cc
      modified:  src/benchmark_runner.cc
      modified:  src/benchmark_runner.h
      modified:  test/CMakeLists.txt
      new file:  test/benchmark_random_interleaving_gtest.cc
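
    For context, a sketch of how a user would opt in once this lands; the flag name is taken from the linked feature request and may differ in the merged implementation:

    #include <benchmark/benchmark.h>

    // Hypothetical noisy micro-benchmark whose repetitions benefit from being
    // interleaved with other benchmarks rather than run back to back.
    static void BM_noisy(benchmark::State& state) {
      for (auto _ : state) {
        int x = 0;
        for (int i = 0; i < 1000; ++i) x += i;
        benchmark::DoNotOptimize(x);
      }
    }
    BENCHMARK(BM_noisy);
    BENCHMARK_MAIN();

    // Invocation sketch:
    //   ./my_benchmark --benchmark_enable_random_interleaving=true --benchmark_repetitions=8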

  • Add support for GTest based unit tests.

    Add support for GTest based unit tests.

    As Dominic and I have previously discussed, there is some need/desire to improve the testing situation in Google Benchmark.

    One step to fixing this problem is to make it easier to write unit tests by adding support for GTest, which is what this patch does.

    By default it looks for an installed version of GTest. However, the user can specify -DBENCHMARK_BUILD_EXTERNAL_GTEST=ON to instead download, build, and use a copy of GTest from source. This is quite useful when Benchmark is being built in non-standard configurations, such as against libc++ or in 32-bit mode.
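
    A minimal sketch of the kind of GTest-based unit test this enables (the test name and the checked behaviour are illustrative, not taken from the actual test suite):

    #include "benchmark/benchmark.h"
    #include "gtest/gtest.h"

    // Illustrative only; link against gtest_main (or provide a main that calls
    // RUN_ALL_TESTS) and the benchmark library. Library internals can now be
    // checked directly with assertions instead of full benchmark runs.
    TEST(BenchmarkInternalsTest, CounterStoresValue) {
      benchmark::Counter c(42.0);
      EXPECT_DOUBLE_EQ(42.0, c.value);
    }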

  • [FR] Consider switching to `nanobind` Python bindings

    [FR] Consider switching to `nanobind` Python bindings

    Is your feature request related to a problem? Please describe: Not a problem, more an idea. For Python >= 3.8 bindings using C++17, the nanobind project (from the same author as pybind11) promises binary-size reductions and runtime performance increases (~40% in their benchmarks) over pybind11 for frequently re-called functions in a loop.

    For a benchmark library, runtime performance benefits like these could be interesting - especially since the quoted runtime performance increase for "calling functions repeatedly in a loop" fits the bill exactly in this case.

    Describe the solution you'd like: Implement the GBM Python bindings in nanobind instead, measure the performance and binary-size differences, and decide where to go from there after interpreting the results.

    Describe alternatives you've considered: Staying with the current pybind11-based bindings.

    Additional context: I want to do this as a personal experiment, to learn how to implement C++ Python bindings. If something comes out of it beyond the nanobind-specific results (better/any docstrings, type hints, etc.), it could be added to the existing bindings, too.

    Not sure how stable nanobind is right now, but since it has pybind11 as its direct predecessor project, it might already work sufficiently for this scope.
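
    For a sense of scale, a nanobind extension module is about this small. The module and function names here are made up, not the real google_benchmark extension, and the sketch assumes C++17 and Python >= 3.8:

    #include <nanobind/nanobind.h>

    // Placeholder for "call a bound function repeatedly in a loop", the case
    // where nanobind advertises its largest overhead reduction over pybind11.
    int run_once() { return 42; }

    NB_MODULE(nanobind_sketch, m) {
      m.def("run_once", &run_once);
    }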

  • Fixing issue with ARM64EC and MSVC

    Fixing issue with ARM64EC and MSVC

    The goal of this pull request is to allow MSVC's ARM64EC target to be detected as if it were ARM64.

    See: https://learn.microsoft.com/en-us/windows/arm/arm64ec

    Thanks in advance for your review!
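
    For context, this kind of fix usually amounts to accepting MSVC's _M_ARM64EC predefined macro wherever _M_ARM64 is already accepted. A generic sketch, not the actual diff (the BENCHMARK_SKETCH_TARGET_ARM64 name is made up):

    // ARM64EC code runs inside x64 processes but is ARM64 for the purposes of
    // cycle counters and similar architecture checks.
    #if defined(_M_ARM64) || defined(_M_ARM64EC)
    #define BENCHMARK_SKETCH_TARGET_ARM64 1
    #else
    #define BENCHMARK_SKETCH_TARGET_ARM64 0
    #endif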

  • [FR] clang static analyzer: `src/include/benchmark/benchmark.h:1069:3: Potential memory leak [cplusplus.NewDeleteLeaks]`

    [FR] clang static analyzer: `src/include/benchmark/benchmark.h:1069:3: Potential memory leak [cplusplus.NewDeleteLeaks]`

    In https://github.com/darktable-org/rawspeed/actions/runs/3574871458/jobs/6011864553#step:23:314, the code at https://github.com/google/benchmark/blob/2257fa4d6afb8e5a2ccd510a70f38fe7fcdf1edf/include/benchmark/benchmark.h#L1272-L1276 triggers the following clang static analyzer report:

    [HIGH] /__w/rawspeed/rawspeed/rawspeed-build/src/external/googlebenchmark/src/include/benchmark/benchmark.h:1069:3: Potential memory leak [cplusplus.NewDeleteLeaks]
      return internal::RegisterBenchmarkInternal(
      ^
    
    Found 1 defect(s) in benchmark.h
    

    It would be really good to either completely rewrite that code to avoid the issue in the first place, or at least silence it as a known false-positive.
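
    One common way to silence a known false positive without changing the shipped code is to guard the allocation with the analyzer-only __clang_analyzer__ macro. A generic sketch with hypothetical names (Registry, MakeAndRegister), not the actual RegisterBenchmarkInternal code:

    // The registry takes ownership of the allocation, which the analyzer cannot
    // prove, so it reports a leak at the call site.
    class Registry {
     public:
      static int* Register(int* obj);  // takes ownership of `obj`
    };

    inline int* MakeAndRegister() {
    #ifdef __clang_analyzer__
      // __clang_analyzer__ is defined only during static analysis, so this branch
      // never reaches the compiled binary; it only hides the allocation from the
      // known false-positive NewDeleteLeaks check.
      return nullptr;
    #else
      return Registry::Register(new int(0));
    #endif
    }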

  • [FR] cc_libraries should use "includes", not "strip_include_prefix"

    [FR] cc_libraries should use "includes", not "strip_include_prefix"

    So that when users import google-benchmark, the headers are included as "system headers", instead of regular headers.

    If this is not done, compilers will throw warnings at google-benchmark code, which users have no way to act upon. Users should be able to choose a highly restrictive set of warnings without having to fix them in google-benchmark.

    Note that this pattern is already applied in the googletest repo.
