Code for the paper Succinct k-mer Set Representations Using Subset Rank Queries on the Spectral Burrows-Wheeler Transform (SBWT)

SBWT

This is the code for the paper Succinct k-mer Set Representations Using Subset Rank Queries on the Spectral Burrows-Wheeler Transform (SBWT). The repository includes implementations of the various SBWT variants described in the paper. Note that contrary to many other k-mer membership data structures, our code is not aware of DNA reverse complements. That is, it considers a k-mer and its reverse complement as separate k-mers.

This construction algorithm is based on the lightning-fast k-mer counter KMC. Our code links directly to the KMC binaries. We have made slight changes to the KMC codebase to make this possible. Our modified version of KMC is included in this repository.

We are currently actively working on the code. Top items on the to-do list are the following:

  • Support for construction and queries from multiple input files.
  • Saving space by indexing only canonical k-mers.

Compiling

git submodule init
git submodule update

# Build the KMC components
cd KMC
make -j4
cd ..

# Build the SBWT code
cd build
cmake .. -DMAX_KMER_LENGTH=32
make -j4

Change the parameter -DMAX_KMER_LENGTH=32 to increase the maximum allowed k-mer length, up to 255.

Troubleshooting: If you run into problems involving the <filesystem> header, you probably need to update your compiler. The compiler g++-8 should be sufficient. Install a new compiler and direct CMake to use it with the -DCMAKE_CXX_COMPILER option. For example, to set the compiler to g++-8, run CMake with the option -DCMAKE_CXX_COMPILER=g++-8.

Note: the Elias-Fano variants make use of the _pext_u64 instruction in the BMI2 instruction set. Older CPUs might not support this instruction. In that case, we fall back to a simple software implementation, which will ruin the performance of the Elias-Fano variants (those whose variant name starts with "mef").

Index construction

Below is the command to build the SBWT for input data example_data/coli3.fna provided in this repository, with k = 30. The index is written to the file index.sbwt.

./build/bin/sbwt build -i example_data/coli3.fna -o index.sbwt -k 30

This builds the default variant, which is the plain matrix SBWT. Other variant can be specified with the --variant option. The list of all command line options and parameters is below:

Construct an SBWT variant.
Usage:
  build [OPTION...]

  -i, --in-file arg           The input sequences in FASTA or FASTQ format.
  -o, --out-file arg          Output file for the constructed index.
  -k, --kmer-length arg       The k-mer length.
      --variant arg           The SBWT variant to build. Available 
                              variants: plain-matrix rrr-matrix mef-matrix 
                              plain-split rrr-split mef-split plain-concat 
                              mef-concat plain-subsetwt rrr-subsetwt 
                              (default: plain-matrix)
      --no-streaming-support  Save space by not building the streaming 
                              query support bit vector. This leads to 
                              slower queries.
  -t, --n-threads arg         Number of parallel threads. (default: 1)
  -a, --min-abundance arg     Discard all k-mers occurring fewer than this 
                              many times. By default we keep all k-mers. 
                              Note that we consider a k-mer distinct from 
                              its reverse complement. (default: 1)
  -m, --ram-gigas arg         RAM budget in gigabytes (not strictly 
                              enforced). Must be at least 2. (default: 2)
      --temp-dir arg          Location for temporary files. (default: .)
  -h, --help                  Print usage

Running queries

To query for existence of all k-mers in an index for all sequences in a fasta-file, run the following command:

./build/bin/sbwt search -i index.sbwt -q example_data/queries.fna -o out.txt

This prints for each query of length n in the input a line containing n-k+1 space-separated integers, which are the ranks of the columns representing the k-mer in the index. If the k-mer is not found, -1 is printed. If the index was built with --streaming-support, the faster streaming query algorithm is automatically used. The full options are:

Query all k-mers of all input reads.
Usage:
  ./build/bin/sbwt search [OPTION...]

  -o, --out-file arg    Output filename.
  -i, --index-file arg  Index input file.
  -q, --query-file arg  The query in FASTA format.
  -h, --help            Print usage

For developers: building and running the tests

git submodule init
git submodule update

# Build googletest
cd googletest
mkdir build
cd build
cmake ..
make

# Build SBWT
cd ../../build
cmake .. -DCMAKE_BUILD_TYPE=Debug -DMAX_KMER_LENGTH=32
make

This will build the executable ./build/bin/sbwt_tests. Make sure to run the test executable from the root of the repository, or otherwise it will not find the example data in ./example_data.

Owner
Algorithmic Bioinformatics Group @ University of Helsinki
Algorithmic Bioinformatics Group @ University of Helsinki
Similar Resources

oZKS (Ordered Zero-Knowledge Set) is a library that provides an implementation of an Ordered (and Append Only) Zero-Knowledge Set.

Ordered Zero-Knowledge Set - oZKS Introduction oZKS is a library that provides an implementation of an Ordered (and Append Only) Zero Knowledge Set. A

Dec 20, 2022

6D - Pose Annotation Tool (6D-PAT) - is a tool that allows the user to load a set of images and also a set of 3D models and annotate where in the 2D image the 3D object ist placed.

6D - Pose Annotation Tool (6D-PAT) - is a tool that allows the user to load a set of images and also a set of 3D models and annotate where in the 2D image the 3D object ist placed.

6D - Pose Annotation Tool (6D-PAT) For detiled explanations checkout the WikiPage. What is it? With 6D-PAT you can create 6D annotations on images for

Nov 20, 2022

Supplementary code for SIGGRAPH 2021 paper: Discovering Diverse Athletic Jumping Strategies

Supplementary code for SIGGRAPH 2021 paper: Discovering Diverse Athletic Jumping Strategies

SIGGRAPH 2021: Discovering Diverse Athletic Jumping Strategies project page paper demo video Prerequisites Important Notes We suspect there are bugs i

Dec 6, 2022

Repository Containing the Code associated with the Paper: "Learning High-Speed Flight in the Wild"

Repository Containing the Code associated with the Paper:

Learning High-Speed Flight in the Wild This repo contains the code associated with the paper Learning Agile Flight in the Wild. For more information,

Jan 3, 2023

Code accompanying our SIGGRAPH 2021 Technical Communications paper "Transition Motion Tensor: A Data-Driven Approach for Versatile and Controllable Agents in Physically Simulated Environments"

Code accompanying our SIGGRAPH 2021 Technical Communications paper

SIGGRAPH ASIA 2021 Technical Communications Transition Motion Tensor: A Data-Driven Framework for Versatile and Controllable Agents in Physically Simu

Apr 21, 2022

This repo contains source code of our paper presented in IROS2021 "Single-Shot is Enough: Panoramic Infrastructure Based Calibration of Multiple Cameras and 3D LiDARs"

This repo contains source code of our paper presented in IROS2021

Single-Shot is Enough: Panoramic Infrastructure Based Calibration of Multiple Cameras and 3D LiDARs Updates [2021/09/01] first commit, source code of

Dec 19, 2022

Sandbox binary and source code for the Siggraph 2017 paper "Water Wave Packets" by Stefan Jeschke (NVIDIA) and Chris Wojtan (IST Austria)

----------------------------- Manual for wave packet viewer ----------------------------- System requirements: Windows8/8.1/10 with DirectX runtime e

Nov 28, 2022

Example code for the research paper "Masked Software Occlusion Culling"; implements an efficient alternative to the hierarchical depth buffer algorithm.

MaskedOcclusionCulling This code accompanies the research paper "Masked Software Occlusion Culling", and implements an efficient alternative to the hi

Dec 22, 2022
Comments
  • Compilation issue with old _pext_u64 instruction

    Compilation issue with old _pext_u64 instruction

    The build may fail due to missing support for the instruction _pext_u64:

    
    In file included from /usr/lib/gcc/x86_64-linux-gnu/9/include/immintrin.h:105,
                     from /home/analanko/code/SBWT/include/MEF.hpp:29,
                     from /home/analanko/code/SBWT/include/variants.hh:15,
                     from /home/analanko/code/SBWT/src/sbwt_build.cpp:9:
    /usr/lib/gcc/x86_64-linux-gnu/9/include/bmi2intrin.h: In member function ‘uint64_t mod_ef_vector<t_rank_1>::shrink(long long unsigned int) [with t_rank_1 = sdsl::rank_support_v<1, 1>]’:
    /usr/lib/gcc/x86_64-linux-gnu/9/include/bmi2intrin.h:76:1: error: inlining failed in call to always_inline ‘long long unsigned int _pext_u64(long long unsigned int, long long unsigned int)’: target specific option mismatch
       76 | _pext_u64 (unsigned long long __X, unsigned long long __Y)
          | ^~~~~~~~~
    In file included from /home/analanko/code/SBWT/include/variants.hh:15,
                     from /home/analanko/code/SBWT/src/sbwt_build.cpp:9:
    /home/analanko/code/SBWT/include/MEF.hpp:314:36: note: called from here
      314 |         return((uint64_t) _pext_u64( ((x & 0xAAAAAAAAAAAAAAAAULL) >> 1 | x), 0x5555555555555555ULL));
    
  • Variants with broken deserialization

    Variants with broken deserialization

    Deserializing following variants from disk doesn't work:

    mef-matrix: Segmentation fault mef-split: terminate called after throwing an instance of 'std::bad_alloc'

RemixDB: A read- and write-optimized concurrent KV store. Fast point and range queries. Extremely low write-amplification.

REMIX and RemixDB The REMIX data structure was introduced in paper "REMIX: Efficient Range Query for LSM-trees", FAST'21. This repository maintains a

Dec 3, 2022
Dynamic array supporting Range Minimum Queries

dynamic-RMQ Dynamic array supporting Range Minimum Queries. Data structure that represent a dynamic array supporting Range Minimum Queries. The data s

Aug 24, 2022
RLibm for 32-bit representations (float and posit32)

RLIBM-32 RLIBM-32 is both a library that provides correctly rounded result for all inputs and a collection of tools used to generate the correct polyn

Nov 1, 2022
A run-time C++ library for working with units of measurement and conversions between them and with string representations of units and measurements

Units What's new Some of the CMake target names have changed in the latest release, please update builds appropriately Documentation A library that pr

Dec 14, 2022
A FASTA - Fourier Transform based Sequence in Sequence Finder

A tool that finds a nucleic sub-sequence string ( from a FASTA file ) in a FASTA file using the fourier transform.

Sep 12, 2022
Example program demonstrating the Meijster's distance transform algorithm.
Example program demonstrating the Meijster's distance transform algorithm.

Distance Transform The distance transform operation consist in finding the shortest distance of a black pixel to a white one. This project demonstrate

Nov 15, 2021
FFTW is a free collection of fast C routines for computing the Discrete Fourier Transform in one or more dimensions

FFTW is a free collection of fast C routines for computing the Discrete Fourier Transform in one or more dimensions

Aug 12, 2022
Elk is a tiny embeddable JavaScript engine that implements a small but usable subset of ES6
Elk is a tiny embeddable JavaScript engine that implements a small but usable subset of ES6

Elk is a tiny embeddable JavaScript engine that implements a small but usable subset of ES6. It is designed for microcontroller development. Instead of writing firmware code in C/C++, Elk allows to develop in JavaScript. Another use case is providing customers with a secure, protected scripting environment for product customisation.

Jan 8, 2023
A subset of WidgetsFlutterBinding specifically for initializing the ServicesBinding.

flutter_services_binding A subset of WidgetsFlutterBinding specifically for initializing the ServicesBinding. When executing runApp within a custom Zo

Nov 17, 2022
TinyGL: a Small, Free and Fast Subset of OpenGL
TinyGL: a Small, Free and Fast Subset of OpenGL

TinyGL A major overhaul of Fabrice Bellard's TinyGL to be more useful as a software rasterizer. Now with limited multithreading support Tightly tweake

Dec 30, 2022