RocksDB: A Persistent Key-Value Store for Flash and RAM Storage


RocksDB is developed and maintained by the Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat ([email protected]) and Jeff Dean ([email protected]).

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/master/examples
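
For a first look at the API, here is a minimal sketch of basic usage (the path and option values are illustrative):

    #include <cassert>
    #include <string>

    #include "rocksdb/db.h"
    #include "rocksdb/options.h"

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;  // create the DB if it does not exist yet

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_example", &db);
      assert(s.ok());

      // Write a key-value pair, then read it back.
      s = db->Put(rocksdb::WriteOptions(), "key1", "value1");
      assert(s.ok());

      std::string value;
      s = db->Get(rocksdb::ReadOptions(), "key1", &value);
      assert(s.ok() && value == "value1");

      delete db;
      return 0;
    }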

See the GitHub wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Design discussions are conducted in https://www.facebook.com/groups/rocksdb.dev/ and https://rocksdb.slack.com/

License

RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.

Comments
  • Memory grows without limit

    Note: Please use Issues only for bug reports. For questions, discussions, feature requests, etc. post to dev group: https://www.facebook.com/groups/rocksdb.dev

    Expected behavior

    Process consumes about 10 megabytes

    Actual behavior

    Memory grows without limit

    Steps to reproduce the behavior

    Run this code:

    https://pastebin.com/Ch8RhsSB

    Sorry RocksDB team, but this is a huge problem.

    This is a trivial test, and I expect it to work like a finite state machine: populate memory, flush, re-use memory.

    Instead, I see memory keep growing.

  • undefined symbol: clock_gettime

    Environment: CentOS 6, GCC 4.9 (devtools-3), Java 7, kernel 2.6.32-504.12.2.el6.x86_64.

    I am getting the following error after running a test for a while:

     /usr/java/latest/bin/java: symbol lookup error: /tmp/librocksdbjni2974434001434564758..so: undefined symbol: clock_gettime

    The same test runs fine on RocksDB 3.6. My investigation shows that librt.so is not linked with RocksDB correctly. My test (with 3.10.2) worked correctly after I ran export LD_PRELOAD=/lib64/rtkaio/librt.so.1.

    To investigate the 3.10.2 library, I ran nm:

     $ nm /tmp/librocksdbjni2974434001434564758..so | grep clock
     0000000000332260 T _ZNSt6chrono3_V212steady_clock3nowEv
     000000000037e33c R _ZNSt6chrono3_V212steady_clock9is_steadyE
     0000000000332230 T _ZNSt6chrono3_V212system_clock3nowEv
     000000000037e33d R _ZNSt6chrono3_V212system_clock9is_steadyE
                      U clock_gettime

    I ran nm on the older rocksdb-3.6:

     $ nm /tmp/librocksdbjni1323312933457066341..so | grep clock
     0000000000287390 T _ZNSt6chrono3_V212steady_clock3nowEv
     00000000002c783c R _ZNSt6chrono3_V212steady_clock9is_steadyE
     0000000000287360 T _ZNSt6chrono3_V212system_clock3nowEv
     00000000002c783d R _ZNSt6chrono3_V212system_clock9is_steadyE

    You can see that clock_gettime is undefined in 3.10.2, shown in the output of the first nm command. Looking at the code, the single call to this function is only included in the C++ code if OS_LINUX or OS_FREEBSD is defined.

    Judging from the above nm results, do you think that in 3.6 neither of the two flags above was set, whereas in 3.10 at least one of them somehow gets set?

  • Improve point-lookup performance using a data block hash index

    Summary:

    Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with data blocks created without the hash index. It is disabled by default unless BlockBasedTableOptions::data_block_index_type is set to kDataBlockBinaryAndHash.

    The DB size is larger with the hash index option, as a hash table is added at the end of each data block. At a 1:1 hash utilization ratio, the space overhead is one byte per key. The hash table utilization ratio is adjustable via BlockBasedTableOptions::data_block_hash_table_util_ratio; a lower utilization ratio improves point-lookup efficiency further but also takes more space.
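
    As a rough illustration of wiring this up (option names as described above; the numeric values are illustrative, not recommendations):

     #include "rocksdb/options.h"
     #include "rocksdb/table.h"

     rocksdb::Options MakeOptionsWithDataBlockHashIndex() {
       rocksdb::BlockBasedTableOptions table_options;
       // Opt in to the per-data-block hash index (off by default).
       table_options.data_block_index_type =
           rocksdb::BlockBasedTableOptions::kDataBlockBinaryAndHash;
       // Lower ratio => faster point lookups, but more space per block.
       table_options.data_block_hash_table_util_ratio = 0.75;

       rocksdb::Options options;
       options.table_factory.reset(
           rocksdb::NewBlockBasedTableFactory(table_options));
       return options;
     }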

    Test Plan:

    Added unit tests; ran make -j32 check and made sure all tests pass.

    Some performance numbers follow. These experiments were run against SSDs. CPU Util is the CPU utilization percentage of the DataBlockIter point-lookup within db_bench, captured by perf.

    # large cache 20GB
           index | Throughput |             | fallback | cache miss | DB Space
    (util_ratio) |     (MB/s) | CPU Util(%) |    ratio |      ratio |     (GB)
    ------------ | -----------| ----------- | -------- | ---------- | --------
          binary |        116 |       27.17 |    1.000 |   0.000494 |     5.41
           hash1 |        123 |       22.21 |    0.524 |   0.000502 |     5.59
         hash0.9 |        126 |       22.89 |    0.559 |   0.000502 |     5.61
         hash0.8 |        129 |       21.65 |    0.487 |   0.000504 |     5.63
         hash0.7 |        127 |       21.12 |    0.463 |   0.000504 |     5.65
         hash0.6 |        130 |       20.62 |    0.423 |   0.000506 |     5.69
         hash0.5 |        132 |       19.34 |    0.311 |   0.000510 |     5.75
    
    
    # small cache 1GB
           index | Throughput |             | fallback | cache miss | DB Space
    (util_ratio) |     (MB/s) | CPU Util(%) |    ratio |      ratio |     (GB)
    ------------ | -----------| ----------- | -------- | ---------- | --------
          binary |       26.8 |        2.02 |    1.000 |   0.923345 |     5.41
           hash1 |       25.9 |        1.49 |    0.524 |   0.924571 |     5.59
         hash0.9 |       27.5 |        1.59 |    0.559 |   0.924561 |     5.61
         hash0.8 |       27.4 |        1.52 |    0.487 |   0.924868 |     5.63
         hash0.7 |       27.7 |        1.44 |    0.463 |   0.924858 |     5.65
         hash0.6 |       26.8 |        1.36 |    0.423 |   0.925160 |     5.69
         hash0.5 |       28.0 |        1.22 |    0.311 |   0.925779 |     5.75
    
    

    We also compare against the master-branch commit that the feature PR is based on, to make sure there is no performance regression for the default binary-seek case. These experiments were run against tmpfs, without perf.

    master: b271f956c Fix a TSAN failure (#4250)
    feature: bf411a50b DataBlockHashIndex: inline SeekForGet() to speedup the fallback path
    
    # large cache 20GB
        branch | Throughput | cache miss | DB Space ||       branch | Throughput | cache miss | DB Space
          #run |     (MB/s) |      ratio |     (GB) ||         #run |     (MB/s) |      ratio |     (GB)
    ---------- | -----------| ---------- | -------- || ------------ | -----------| ---------- | --------
    master/1   |      127.5 |   0.000494 |     5.41 ||  feature/1   |      129.9 |   0.000494 |     5.41
    master/2   |      130.7 |   0.000494 |     5.41 ||  feature/2   |      126.3 |   0.000494 |     5.41
    master/3   |      128.7 |   0.000494 |     5.41 ||  feature/3   |      128.7 |   0.000494 |     5.41
    master/4   |      105.4 |   0.000494 |     5.41 ||  feature/4   |      131.1 |   0.000494 |     5.41
    master/5   |      135.8 |   0.000494 |     5.41 ||  feature/5   |      132.7 |   0.000494 |     5.41
    master/avg |      125.6 |   0.000494 |     5.41 ||  feature/avg |      129.7 |   0.000494 |     5.41
    
    
    # small cache 1GB
        branch | Throughput | cache miss | DB Space ||       branch | Throughput | cache miss | DB Space
          #run |     (MB/s) |      ratio |     (GB) ||         #run |     (MB/s) |      ratio |     (GB)
    ---------- | -----------| ---------- | -------- || ------------ | -----------| ---------- | --------
    master/1   |       36.9 |   0.923190 |     5.41 ||  feature/1   |       37.1 |   0.923189 |     5.41
    master/2   |       36.8 |   0.923184 |     5.41 ||  feature/2   |       35.8 |   0.923196 |     5.41
    master/3   |       35.8 |   0.923190 |     5.41 ||  feature/3   |       36.4 |   0.923183 |     5.41
    master/4   |       27.8 |   0.923200 |     5.41 ||  feature/4   |       36.6 |   0.923191 |     5.41
    master/5   |       37.7 |   0.923162 |     5.41 ||  feature/5   |       36.7 |   0.923141 |     5.41
    master/avg |       35.0 |   0.923185 |     5.41 ||  feature/avg |       36.5 |   0.923180 |     5.41
    							
    
    
    # benchmarking command
    # setting: num=200 million, reads=100 million, key_size=8B, value_size=40B, threads=16
    $DB_BENCH  --data_block_index_type=${block_index} \
               --db=${db} \
               --block_size=16000 --level_compaction_dynamic_level_bytes=1 \
               --num=$num \
               --key_size=$ks \
               --value_size=$vs \
               --benchmarks=fillseq --compression_type=snappy \
               --statistics=false --block_restart_interval=1 \
               --compression_ratio=0.4 \
               --data_block_hash_table_util_ratio=${util_ratio} \
               --statistics=true \
               >${write_log}
    
    $DB_BENCH  --data_block_index_type=${block_index} \
               --db=${db} \
               --block_size=16000 --level_compaction_dynamic_level_bytes=1 \
               --use_existing_db=true \
               --num=${num} \
               --reads=${reads} \
               --key_size=$ks \
               --value_size=$vs \
               --benchmarks=readtocache,readrandom \
               --compression_type=snappy \
               --block_restart_interval=16 \
               --compression_ratio=0.4 \
               --cache_size=${cache_size} \
               --data_block_hash_table_util_ratio=${util_ratio} \
               --use_direct_reads \
               --disable_auto_compactions \
               --threads=${threads} \
               --statistics=true \
               > ${read_log}
    
    
  • [4/4][ResourceMngmt] Account Bloom/Ribbon filter construction memory in global memory limit

    Note: This PR is the 4th part of a bigger PR stack (https://github.com/facebook/rocksdb/pull/9073) and will rebase/merge only after the first three PRs (https://github.com/facebook/rocksdb/pull/9070, https://github.com/facebook/rocksdb/pull/9071, https://github.com/facebook/rocksdb/pull/9130) merge.

    Context: Similar to https://github.com/facebook/rocksdb/pull/8428, this PR is to track memory usage during (new) Bloom Filter (i.e., FastLocalBloom) and Ribbon Filter (i.e., Ribbon128) construction by charging dummy entries to the block cache, moving toward the goal of a single global memory limit using block cache capacity. It also constrains the size of the banding portion of the Ribbon Filter during construction by falling back to the Bloom Filter if that banding is, at some point, larger than the available space in the cache under LRUCacheOptions::strict_capacity_limit=true.

    The option to turn on this feature is BlockBasedTableOptions::reserve_table_builder_memory = true, which defaults to false. We decided not to have a separate option for each memory user in table building; therefore their memory accounting is all bundled under one general option.
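
    A minimal sketch of enabling this (assuming the option names described in this PR; the shared block cache with a strict capacity limit is what makes the Ribbon-to-Bloom fallback possible):

     #include "rocksdb/cache.h"
     #include "rocksdb/filter_policy.h"
     #include "rocksdb/options.h"
     #include "rocksdb/table.h"

     rocksdb::Options MakeOptionsWithFilterMemoryCharging() {
       rocksdb::LRUCacheOptions cache_opts;
       cache_opts.capacity = 1 << 30;            // 1 GB global memory budget (illustrative)
       cache_opts.strict_capacity_limit = true;  // enforce the limit; enables the fallback
       std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(cache_opts);

       rocksdb::BlockBasedTableOptions table_options;
       table_options.block_cache = cache;
       table_options.filter_policy.reset(rocksdb::NewRibbonFilterPolicy(10));
       // Charge Bloom/Ribbon filter construction memory against the block cache.
       table_options.reserve_table_builder_memory = true;

       rocksdb::Options options;
       options.table_factory.reset(
           rocksdb::NewBlockBasedTableFactory(table_options));
       return options;
     }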

    Summary:

    • Reserved/released cache for creation/destruction of three main memory users using CacheReservationManager:
      • hash entries (i.e., hash_entries.size(); we bucket-charge hash entries during insertion for performance),
      • banding (Ribbon Filter only, bytes_coeff_rows + bytes_result_rows + bytes_backtrack),
      • final filter (i.e., mutable_buf's size).
        • Implementation details: in order to use CacheReservationManager::CacheReservationHandle to account for the final filter's memory, we have to store the CacheReservationManager object and CacheReservationHandle for the final filter in XXPH3BitsFilterBuilder, as well as explicitly delete the filter bits builder when done with the final filter in the block-based table.
    • Added an option to run filter_bench with this memory reservation feature

    Test:

    • Added new tests in db_bloom_filter_test to verify filter construction peak cache reservation under combinations of BlockBasedTable::Rep::FilterType (e.g., kFullFilter, kPartitionedFilter), BloomFilterPolicy::Mode (e.g., kFastLocalBloom, kStandard128Ribbon, kDeprecatedBlock), and BlockBasedTableOptions::reserve_table_builder_memory
      • To address the concern about slow tests: tests with memory reservation under kFullFilter + kStandard128Ribbon and kPartitionedFilter take around 3000 - 6000 ms and others take around 1500 - 2000 ms, in total adding 20000 - 25000 ms to the test suite when running locally
    • Added a new test in bloom_test to verify the Ribbon Filter fallback on large banding in FullFilter
    • Added a test in filter_bench to verify that this feature does not significantly slow down Bloom/Ribbon Filter construction speed. Local results averaged over 20 runs are below:
      • FastLocalBloom

        • baseline ./filter_bench -impl=2 -quick -runs 20 | grep 'Build avg':
          • Build avg ns/key: 29.56295 (DEBUG_LEVEL=1), 29.98153 (DEBUG_LEVEL=0)
        • new feature (expected to be similar to the above) ./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg':
          • Build avg ns/key: 30.99046 (DEBUG_LEVEL=1), 30.48867 (DEBUG_LEVEL=0)
        • new feature of RibbonFilter with fallback (expected to be similar to the above) ./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg':
          • Build avg ns/key: 31.146975 (DEBUG_LEVEL=1), 30.08165 (DEBUG_LEVEL=0)
      • Ribbon128

        • baseline ./filter_bench -impl=3 -quick -runs 20 | grep 'Build avg':
          • Build avg ns/key: 129.17585 (DEBUG_LEVEL=1), 130.5225 (DEBUG_LEVEL=0)
        • new feature (expected to be similar to the above) ./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg':
          • Build avg ns/key: 131.61645 (DEBUG_LEVEL=1), 132.98075 (DEBUG_LEVEL=0)
        • new feature of RibbonFilter with fallback (expected to be a lot faster than the above due to fallback) ./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg':
          • Build avg ns/key: 52.032965 (DEBUG_LEVEL=1), 52.597825 (DEBUG_LEVEL=0)
          • And the warning message of "Cache reservation for Ribbon filter banding failed due to cache full" is indeed logged to console.
  • RocksDB crash when used via JNI. Version: 6.20.3

    Note: Please use Issues only for bug reports. For questions, discussions, feature requests, etc. post to dev group: https://groups.google.com/forum/#!forum/rocksdb or https://www.facebook.com/groups/rocksdb.dev

    Expected behavior

    RocksDB continues to serve reads and writes.

    Actual behavior

    RocksDB crashes.

    Steps to reproduce the behavior

    For now, I can provide the backtrace dump that was reported. Let me know what other pieces of information are needed.

    
    Host: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz, 32 cores, 376G, Red Hat Enterprise Linux Server release 7.8 (Maipo)
    Time: Mon Aug  2 17:21:25 2021 PDT elapsed time: 480.081057 seconds (0d 0h 8m 0s)
    
    ---------------  T H R E A D  ---------------
    
    Current thread (0x00007f0e02e36800):  JavaThread "grpc-default-executor-24" daemon [_thread_in_native, id=48415, stack(0x00007f0defb58000,0x00007f0defc59000)]
    
    Stack: [0x00007f0defb58000,0x00007f0defc59000],  sp=0x00007f0defc56bc0,  free space=1018k
    Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
    C  [librocksdbjni15379259497632127760.so+0x29bb93]  rocksdb::LegacyFileSystemWrapper::NewSequentialFile(std::string const&, rocksdb::FileOptions const&, std::unique_ptr<rocksdb::FSSequentialFile, std::default_delete<rocksdb::FSSequentialFile> >*, rocksdb::IODebugContext*)+0x33
    C  [librocksdbjni15379259497632127760.so+0x3d2df0]  rocksdb::ReadFileToString(rocksdb::FileSystem*, std::string const&, std::string*)+0x90
    C  [librocksdbjni15379259497632127760.so+0x37fc0b]  rocksdb::VersionSet::GetCurrentManifestPath(std::string const&, rocksdb::FileSystem*, std::string*, unsigned long*)+0x5b
    C  [librocksdbjni15379259497632127760.so+0x397b24]  rocksdb::VersionSet::ListColumnFamilies(std::vector<std::string, std::allocator<std::string> >*, std::string const&, rocksdb::FileSystem*)+0x64
    C  [librocksdbjni15379259497632127760.so+0x27e7eb]  rocksdb::DB::ListColumnFamilies(rocksdb::DBOptions const&, std::string const&, std::vector<std::string, std::allocator<std::string> >*)+0x5b
    C  [librocksdbjni15379259497632127760.so+0x1dfda9]  Java_org_rocksdb_RocksDB_listColumnFamilies+0x89
    J 4117  org.rocksdb.RocksDB.listColumnFamilies(JLjava/lang/String;)[[B (0 bytes) @ 0x00007f0e38785fa2 [0x00007f0e38785ec0+0x00000000000000e2]
    J 16357 c2 org.apache.hadoop.hdds.utils.db.RDBStore.<init>(Ljava/io/File;Lorg/rocksdb/DBOptions;Lorg/rocksdb/WriteOptions;Ljava/util/Set;Lorg/apache/hadoop/hdds/utils/db/CodecRegistry;Z)V (717 bytes) @ 0x00007f0e3938e460 [0x00007f0e3938d9e0+0x0000000000000a80]
    J 14139 c2 org.apache.hadoop.ozone.container.metadata.DatanodeStoreSchemaOneImpl.<init>(Lorg/apache/hadoop/hdds/conf/ConfigurationSource;JLjava/lang/String;Z)V (18 bytes) @ 0x00007f0e3906a668 [0x00007f0e39066b00+0x0000000000003b68]
    J 15249 c2 org.apache.hadoop.ozone.container.common.utils.ContainerCache.getDB(JLjava/lang/String;Ljava/lang/String;Ljava/lang/String;Lorg/apache/hadoop/hdds/conf/ConfigurationSource;)Lorg/apache/hadoop/ozone/container/common/utils/ReferenceCountedDB; (339 bytes) @ 0x00007f0e391e30d0 [0x00007f0e391e2ac0+0x0000000000000610]
    J 18254 c2 org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl.getBlock(Lorg/apache/hadoop/ozone/container/common/interfaces/Container;Lorg/apache/hadoop/hdds/client/BlockID;)Lorg/apache/hadoop/ozone/container/common/helpers/BlockData; (299 bytes) @ 0x00007f0e38f40858 [0x00007f0e38f406e0+0x0000000000000178]
    J 17900 c2 org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandRequestProto;Lorg/apache/hadoop/ozone/container/common/interfaces/Container;Lorg/apache/hadoop/ozone/container/common/transport/server/ratis/DispatcherContext;)Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandResponseProto; (11 bytes) @ 0x00007f0e3966b42c [0x00007f0e39668340+0x00000000000030ec]
    J 17904 c2 org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandRequestProto;Lorg/apache/hadoop/ozone/container/common/transport/server/ratis/DispatcherContext;)Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandResponseProto; (1105 bytes) @ 0x00007f0e39658378 [0x00007f0e39656ba0+0x00000000000017d8]
    J 8487 c2 org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(Ljava/lang/Object;Lorg/apache/hadoop/hdds/function/FunctionWithServiceException;Ljava/lang/Object;Ljava/lang/String;)Ljava/lang/Object; (205 bytes) @ 0x00007f0e38c0e908 [0x00007f0e38c0e6e0+0x0000000000000228]
    J 12956 c2 org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandRequestProto;Lorg/apache/hadoop/ozone/container/common/transport/server/ratis/DispatcherContext;)Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandResponseProto; (38 bytes) @ 0x00007f0e38dd88d0 [0x00007f0e38dd8740+0x0000000000000190]
    J 17877 c2 org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(Ljava/lang/Object;)V (9 bytes) @ 0x00007f0e3962efd8 [0x00007f0e3962ef60+0x0000000000000078]
    J 17878 c2 org.apache.hadoop.hdds.tracing.GrpcServerInterceptor$1.onMessage(Ljava/lang/Object;)V (155 bytes) @ 0x00007f0e39631718 [0x00007f0e39631380+0x0000000000000398]
    J 17572 c2 org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext()V (77 bytes) @ 0x00007f0e3923dc40 [0x00007f0e3923d8c0+0x0000000000000380]
    J 14423 c2 org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run()V (35 bytes) @ 0x00007f0e3889bf28 [0x00007f0e3889be80+0x00000000000000a8]
    J 14467 c2 org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run()V (99 bytes) @ 0x00007f0e38bbb27c [0x00007f0e38bbb180+0x00000000000000fc]
    J 15549 c2 java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V [email protected] (187 bytes) @ 0x00007f0e39278620 [0x00007f0e39278460+0x00000000000001c0]
    J 6773 c1 java.util.concurrent.ThreadPoolExecutor$Worker.run()V [email protected] (9 bytes) @ 0x00007f0e31304be4 [0x00007f0e31304b40+0x00000000000000a4]
    J 6760 c1 java.lang.Thread.run()V [email protected] (17 bytes) @ 0x00007f0e31303934 [0x00007f0e313037c0+0x0000000000000174]
    v  ~StubRoutines::call_stub
    V  [libjvm.so+0x88abd6]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*)+0x366
    V  [libjvm.so+0x888bdd]  JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, Thread*)+0x1ed
    V  [libjvm.so+0x935d0c]  thread_entry(JavaThread*, Thread*)+0x6c
    V  [libjvm.so+0xe2c91a]  JavaThread::thread_main_inner()+0x1fa
    V  [libjvm.so+0xe2933f]  Thread::call_run()+0x14f
    V  [libjvm.so+0xc6fb9e]  thread_native_entry(Thread*)+0xee
    
    Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
    J 4117  org.rocksdb.RocksDB.listColumnFamilies(JLjava/lang/String;)[[B (0 bytes) @ 0x00007f0e38785f2d [0x00007f0e38785ec0+0x000000000000006d]
    J 16357 c2 org.apache.hadoop.hdds.utils.db.RDBStore.<init>(Ljava/io/File;Lorg/rocksdb/DBOptions;Lorg/rocksdb/WriteOptions;Ljava/util/Set;Lorg/apache/hadoop/hdds/utils/db/CodecRegistry;Z)V (717 bytes) @ 0x00007f0e3938e460 [0x00007f0e3938d9e0+0x0000000000000a80]
    J 14139 c2 org.apache.hadoop.ozone.container.metadata.DatanodeStoreSchemaOneImpl.<init>(Lorg/apache/hadoop/hdds/conf/ConfigurationSource;JLjava/lang/String;Z)V (18 bytes) @ 0x00007f0e3906a668 [0x00007f0e39066b00+0x0000000000003b68]
    J 15249 c2 org.apache.hadoop.ozone.container.common.utils.ContainerCache.getDB(JLjava/lang/String;Ljava/lang/String;Ljava/lang/String;Lorg/apache/hadoop/hdds/conf/ConfigurationSource;)Lorg/apache/hadoop/ozone/container/common/utils/ReferenceCountedDB; (339 bytes) @ 0x00007f0e391e30d0 [0x00007f0e391e2ac0+0x0000000000000610]
    J 18254 c2 org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl.getBlock(Lorg/apache/hadoop/ozone/container/common/interfaces/Container;Lorg/apache/hadoop/hdds/client/BlockID;)Lorg/apache/hadoop/ozone/container/common/helpers/BlockData; (299 bytes) @ 0x00007f0e38f40858 [0x00007f0e38f406e0+0x0000000000000178]
    J 17900 c2 org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandRequestProto;Lorg/apache/hadoop/ozone/container/common/interfaces/Container;Lorg/apache/hadoop/ozone/container/common/transport/server/ratis/DispatcherContext;)Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandResponseProto; (11 bytes) @ 0x00007f0e3966b42c [0x00007f0e39668340+0x00000000000030ec]
    J 17904 c2 org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandRequestProto;Lorg/apache/hadoop/ozone/container/common/transport/server/ratis/DispatcherContext;)Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandResponseProto; (1105 bytes) @ 0x00007f0e39658378 [0x00007f0e39656ba0+0x00000000000017d8]
    J 8487 c2 org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(Ljava/lang/Object;Lorg/apache/hadoop/hdds/function/FunctionWithServiceException;Ljava/lang/Object;Ljava/lang/String;)Ljava/lang/Object; (205 bytes) @ 0x00007f0e38c0e908 [0x00007f0e38c0e6e0+0x0000000000000228]
    J 12956 c2 org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandRequestProto;Lorg/apache/hadoop/ozone/container/common/transport/server/ratis/DispatcherContext;)Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandResponseProto; (38 bytes) @ 0x00007f0e38dd88d0 [0x00007f0e38dd8740+0x0000000000000190]
    J 17877 c2 org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(Ljava/lang/Object;)V (9 bytes) @ 0x00007f0e3962efd8 [0x00007f0e3962ef60+0x0000000000000078]
    J 17878 c2 org.apache.hadoop.hdds.tracing.GrpcServerInterceptor$1.onMessage(Ljava/lang/Object;)V (155 bytes) @ 0x00007f0e39631718 [0x00007f0e39631380+0x0000000000000398]
    J 17572 c2 org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext()V (77 bytes) @ 0x00007f0e3923dc40 [0x00007f0e3923d8c0+0x0000000000000380]
    J 14423 c2 org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run()V (35 bytes) @ 0x00007f0e3889bf28 [0x00007f0e3889be80+0x00000000000000a8]
    J 14467 c2 org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run()V (99 bytes) @ 0x00007f0e38bbb27c [0x00007f0e38bbb180+0x00000000000000fc]
    J 15549 c2 java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V [email protected] (187 bytes) @ 0x00007f0e39278620 [0x00007f0e39278460+0x00000000000001c0]
    J 6773 c1 java.util.concurrent.ThreadPoolExecutor$Worker.run()V [email protected] (9 bytes) @ 0x00007f0e31304be4 [0x00007f0e31304b40+0x00000000000000a4]
    J 6760 c1 java.lang.Thread.run()V [email protected] (17 bytes) @ 0x00007f0e31303934 [0x00007f0e313037c0+0x0000000000000174]
    v  ~StubRoutines::call_stub
    
     siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000050
    [hs_err_pid70046.log](https://github.com/facebook/rocksdb/files/6926076/hs_err_pid70046.log)
    
  • Track WAL in MANIFEST: persist WALs to and recover WALs from MANIFEST

    This PR makes it possible to LogAndApply VersionEdits related to WALs, and to Recover WAL-related VersionEdits from the MANIFEST.

    The VersionEdits related to WALs are treated similarly to those related to column family operations: they are not applied to versions, but they can be part of a commit group. Mixing WAL-related VersionEdits with other types of edits would make the logic in ProcessManifestWrite more complicated, so a WAL-related VersionEdit can contain either WAL additions or WAL deletions, similar to column family add and drop.

    Test Plan:

    A set of unit tests is added in version_set_test.cc.

  • RocksDB JNI Maven publication

    Once the ongoing RocksDB JNI work is committed (https://github.com/facebook/rocksdb/pull/235), I'm looking at publishing RocksDB's JNI libraries to Maven central so that the work can be consumed by other Java projects.

    My proposal is to use Sonatype's open source Maven repo as the host for RocksDB JARs. This repo gets mirrored into Maven central. The artifacts will be published under the 'org.rocksdb' group.

    Does this sound OK with everyone?

  • Java: Support Apple Silicon/M1 machines

    When I run an application that depends on the RocksDB Java library on an Apple Silicon based JVM, it fails with an UnsatisfiedLinkError.

    Expected behavior

    Maven dependency includes Mac Arm64 compilation

    Actual behavior

    Maven dependency does not include Mac Arm64 compilation

    Steps to reproduce the behavior

    Run RocksDB inside an Apple Silicon JVM application. An UnsatisfiedLinkError is thrown.

    Depends on #7710

    Note: RocksDB for Java works normally when running in Rosetta mode, but as soon as the environment is started in non-Rosetta mode, the program fails.

  • Added mechanism to track deadlock chain

    Changes:

    • extended the wait_txn_map to track additional information
    • designed a circular buffer to store the n latest deadlocks' information (a consumer-side sketch follows this list)
    • added test coverage to verify that the additional information tracked is accurately stored in the buffer
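
    A rough sketch of how a caller might read the buffered deadlock chains (this assumes the TransactionDB deadlock-info API in rocksdb/utilities/transaction_db.h; the output format is illustrative):

     #include <cstdio>
     #include <vector>

     #include "rocksdb/utilities/transaction_db.h"

     void DumpRecentDeadlocks(rocksdb::TransactionDB* txn_db) {
       // The circular buffer keeps the most recent deadlock chains.
       std::vector<rocksdb::DeadlockPath> paths = txn_db->GetDeadlockInfoBuffer();
       for (const auto& path : paths) {
         for (const auto& info : path.path) {
           // Each entry records a waiting transaction and the key it waits on.
           std::fprintf(stderr, "txn %llu waiting on cf %u, key %s\n",
                        static_cast<unsigned long long>(info.m_txn_id),
                        info.m_cf_id, info.m_waiting_key.c_str());
         }
       }
     }
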
  • db_bench v5.2 shows fillrandom performance degradation on Windows

    Compared with 4.8. I am wondering whether this reproduces on Linux.

    5.2 bench command line (everything is the same minus deprecated options):

    db_bench_je --benchmarks=fillrandom --disable_seek_compaction=1 --mmap_read=0 --statistics=1 --histogram=1 --num=10000000 --threads=1 --value_size=800 --block_size=65536 --cache_size=1048576 --bloom_bits=10 --cache_numshardbits=4 --open_files=500000 --verify_checksum=1 --db=k:\data\BulkLoadRandom_10M --sync=0 --disable_wal=1 --compression_type=snappy --stats_interval=1000000 --compression_ratio=0.5 --write_buffer_size=268435456 --target_file_size_base=1073741824 --max_write_buffer_number=30 --max_background_compactions=20 --max_background_flushes=4 --level0_file_num_compaction_trigger=10000000 --level0_slowdown_writes_trigger=10000000 --level0_stop_writes_trigger=10000000 --num_levels=2 --delete_obsolete_files_period_micros=300000000 --min_level_to_compress=2 --max_compaction_bytes=0 --stats_per_interval=1 --max_bytes_for_level_base=10485760 --memtablerep=vector --use_existing_db=0 --disable_auto_compactions=1 --use_direct_reads=1 --use_direct_writes=1 --compaction_readahead_size=3145728 --writable_file_max_buffer_size=1048576 --random_access_max_buffer_size=0 --new_table_reader_for_compaction_inputs=1

    4.8 bench command:

    db_bench_je --benchmarks=fillrandom --disable_seek_compaction=1 --mmap_read=0 --statistics=1 --histogram=1 --num=10000000 --threads=1 --value_size=800 --block_size=65536 --cache_size=1048576 --bloom_bits=10 --cache_numshardbits=4 --open_files=500000 --verify_checksum=1 --db=k:\data\BulkLoadRandom_10M --sync=0 --disable_wal=1 --compression_type=snappy --stats_interval=1000000 --compression_ratio=0.5 --disable_data_sync=1 --write_buffer_size=268435456 --target_file_size_base=1073741824 --max_write_buffer_number=30 --max_background_compactions=20 --max_background_flushes=4 --level0_file_num_compaction_trigger=10000000 --level0_slowdown_writes_trigger=10000000 --level0_stop_writes_trigger=10000000 --num_levels=2 --delete_obsolete_files_period_micros=300000000 --min_level_to_compress=2 --max_grandparent_overlap_factor=10 --stats_per_interval=1 --max_bytes_for_level_base=10485760 --memtablerep=vector --use_existing_db=0 --disable_auto_compactions=1 --source_compaction_factor=10000000 --bufferedio=0 --compaction_readahead_size=3145728 --writable_file_max_buffer_size=1048576 --random_access_max_buffer_size=0 --new_table_reader_for_compaction_inputs=1 --skip_table_builder_flush=1

    The results show the following.

    4.8:

     DB path: [k:\data\BulkLoadRandom_10M]
     fillrandom : 1.723 micros/op 580378 ops/sec; 451.6 MB/s
     Microseconds per write: Count: 10000000 Average: 1.7227 StdDev: 46.11 Min: 0 Median: 0.9637 Max: 26036
     Percentiles: P50: 0.96 P75: 1.50 P99: 2.63 P99.9: 9.00 P99.99: 17.34

    Latest GitHub at the time of this writing:

     DB path: [k:\data\BulkLoadRandom_10M]
     fillrandom : 2.769 micros/op 361160 ops/sec; 281.1 MB/s
     Microseconds per write: Count: 10000000 Average: 2.7686 StdDev: 37.23 Min: 1 Median: 2.0641 Max: 17129
     Percentiles: P50: 2.06 P75: 2.55 P99: 3.66 P99.9: 11.10 P99.99: 28.33

  • Major Compactions

    I've been doing some testing on a service that uses RocksDB for storage internally, and I am finding that major compactions are sometimes causing outages that last for a few minutes (since major compactions seem to block everything else). Also, when I restart the process, sometimes a major compaction is triggered, which causes the DB to take many minutes to open.

    Wondering where I should start looking to alleviate these issues. Thanks!

  • VerifySstUniqueIds status is overridden for multiple CFs

    Summary: There's a bug where we only report the last CF's VerifySstUniqueIds() result: https://github.com/facebook/rocksdb/pull/9990#discussion_r877268810

    Test Plan: CI

  • Trivial Move

    I printed the log message "Enter db_impl_compaction_flush.cc move." in this part of the code (see the attached trivial_move image), but the logic did not print the log during the stress test. Is the "Trivial Move" logic not here?

  • Update/clarify required properties for prefix extractors

    Summary: Most of the properties listed as required for prefix extractors are not really required but offer some conveniences. This updates API comments to clarify actual requirements, and adds tests to demonstrate how previously presumed requirements can be safely violated.

    This might seem like a useless exercise, but this relaxing of requirements would be needed if we generalize prefixes to group keys not just at the byte level but also based on bits or arbitrary value ranges. For applications without a "natural" prefix size, having only byte-level granularity often means one prefix size to the next differs in magnitude by a factor of 256.
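
    For context, a minimal sketch of configuring a byte-level prefix extractor and using it for prefix seeks (the 4-byte size and the "user" prefix are illustrative):

     #include <memory>

     #include "rocksdb/db.h"
     #include "rocksdb/options.h"
     #include "rocksdb/slice_transform.h"

     void PrefixSeekExample(rocksdb::DB* db) {
       // At DB open time, something like
       //   options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(4));
       // groups keys by their first 4 bytes.
       rocksdb::ReadOptions read_options;
       read_options.prefix_same_as_start = true;  // stop once the prefix changes

       std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_options));
       for (it->Seek("user"); it->Valid(); it->Next()) {
         // Visits only keys whose 4-byte prefix equals "user".
       }
     }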

    Test Plan: Tests added, also covering missing Iterator cases from #10244

  • Add API for writing wide-column entities

    Summary: The patch builds on https://github.com/facebook/rocksdb/pull/9915 and adds a new API called PutEntity that can be used to write a wide-column entity to the database. The new API is added to both DB and WriteBatch. Note that currently there is no way to retrieve these entities; more precisely, all read APIs (Get, MultiGet, and iterator) return NotSupported when they encounter a wide-column entity that is required to answer a query. Read-side support (as well as other missing functionality like Merge, compaction filter, and timestamp support) will be added in later PRs.
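
    A rough sketch of what a write through the new API might look like (the key, column names, and values are illustrative):

     #include "rocksdb/db.h"
     #include "rocksdb/wide_columns.h"

     rocksdb::Status WriteEntityExample(rocksdb::DB* db,
                                        rocksdb::ColumnFamilyHandle* cf) {
       // A wide-column entity: one key mapped to multiple named columns.
       rocksdb::WideColumns columns{{"name", "alice"}, {"city", "paris"}};
       return db->PutEntity(rocksdb::WriteOptions(), cf, "user:1", columns);
     }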

    Test Plan: make check

Related Projects

  • Kvrocks is a distributed key-value NoSQL database based on RocksDB and compatible with the Redis protocol.
  • 🥑 ArangoDB is a native multi-model database with flexible data models for documents, graphs, and key-values. Build high performance applications using a convenient SQL-like query language or JavaScript extensions.
  • Scylla is the real-time big data database that is API-compatible with Apache Cassandra and Amazon DynamoDB. Scylla embraces a shared-nothing approach that increases throughput and storage capacity to realize order-of-magnitude performance improvements and reduce hardware costs.
  • Nebula Graph is a distributed, fast open-source graph database capable of hosting super large-scale graphs with billions of vertices (nodes) and trillions of edges, with milliseconds of latency. It delivers enterprise-grade high performance to simplify the most complex data sets imaginable into meaningful and useful information.
  • FEDB is a NewSQL database optimised for realtime inference and decisioning applications. These applications put real-time features extracted from multiple time windows through a pre-trained model to evaluate new data to support decision making. Existing in-memory databases cost hundreds or even thousands of milliseconds, so they cannot meet the requirements of inference and decisioning applications.
  • RediSearch is a query and indexing engine for Redis, providing secondary indexing, full-text search, and aggregations.
  • Kreon is a key-value store library optimized for flash-based storage, where CPU overhead and I/O amplification are more significant bottlenecks compared to I/O randomness.
  • KVDK (Key-Value Development Kit) is a key-value store library implemented in C++. It is designed for persistent memory and provides unified APIs for both volatile and persistent scenarios. It also demonstrates several optimization methods for high performance with persistent memory. Besides the basic key-value store APIs, it offers several advanced features such as transactions and snapshots.
  • BerylDB is a data structure data manager that can be used to store data as key-value entries. The server allows channel subscription and is optimized to be used as a cache repository. Supported structures include lists, sets, and keys.
  • FoundationDB is an open-source, distributed, transactional key-value store designed to handle large volumes of structured data across clusters of commodity servers.
  • SimDB is a high performance, shared memory, lock free, cross platform, single file, no dependencies, C++11 key-value store.
  • MMKV is an efficient, small, easy-to-use mobile key-value storage framework developed by WeChat. It works on Android, iOS, macOS, Windows, and POSIX. (Documentation is also available in Chinese.)
  • Sparkey is a simple constant key/value storage library, mostly suited for read-heavy systems with infrequent large bulk inserts.
  • LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
  • Sophia is an advanced transactional MVCC key-value/row storage library implemented as a RAM-disk hybrid.
  • immer is a library of postmodern immutable and persistent data structures for C++, bringing value semantics at scale.