OpenMLDB is an open-source database designed to efficiently provide consistent data for machine-learning-driven applications.



1. Introduction

OpenMLDB is an open-source database designed to efficiently provide consistent data for machine learning. Data provisioning for machine learning involves two major tasks, feature extraction and feature access, which serve both offline training and online inference. Without OpenMLDB, online and offline data provisioning run on two separate systems, and verifying online-offline consistency between them takes significant effort. In contrast, OpenMLDB offers unified SQL programming and a single execution engine for both online and offline data provisioning, so online-offline consistency is inherently guaranteed. The system is also designed and optimized for efficiency. With OpenMLDB, database engineers can write only SQL scripts to efficiently provide consistent data for machine learning, and an offline model can be deployed for online serving immediately at little cost.

[Figure: OpenMLDB workflow]

The figure above illustrates the OpenMLDB workflow. SQL engineers first write SQL scripts for offline feature extraction, which provides data for offline model training. Once the model quality is satisfactory, online feature extraction and access can be enabled immediately for online serving with no additional effort. Because the SQL programming interface and the execution engine are unified, online-offline consistency is inherently guaranteed and no separate consistency verification is needed. Furthermore, optimization techniques such as data skew optimization for offline feature extraction and in-memory indexing for online feature extraction help meet the performance requirements of both offline training and online inference. In summary, OpenMLDB makes SQL the only programming interface needed for consistent and efficient data provisioning for both offline model training and online inference serving.

2. Highlight Features

2.1. SQL Programming APIs

We believe SQL is the most suitable programming API for feature engineering because of its elegant design and popularity. OpenMLDB provides SQL as the programming API for both offline and online feature extraction. In addition, we extend standard SQL to make it more powerful for feature extraction.

2.2. Online-Offline Consistency

Based on the SQL programming API, we designed a unified execution engine for both online and offline feature extraction. As a result, online-offline consistency is inherently guaranteed by OpenMLDB at no extra cost.
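
The idea can be illustrated with a minimal Python sketch. This is not OpenMLDB's engine: the feature `sum_amount_last_3`, the helper functions, and the sample rows are all hypothetical. It only shows that when one feature definition serves both the batch path and the request path, the two cannot drift apart.

```python
from typing import Dict, List

Row = Dict[str, int]


def sum_amount_last_3(history: List[Row], current: Row) -> int:
    """Hypothetical feature: sum of `amount` over the current row and
    up to two preceding rows of the same user, ordered by `ts`."""
    same_user = [r for r in history
                 if r["user"] == current["user"] and r["ts"] < current["ts"]]
    same_user.sort(key=lambda r: r["ts"])
    window = same_user[-2:] + [current]
    return sum(r["amount"] for r in window)


def offline_features(table: List[Row]) -> List[int]:
    """Batch path: compute the feature for every historical row."""
    return [sum_amount_last_3(table, row) for row in table]


def online_feature(table: List[Row], request: Row) -> int:
    """Request path: compute the feature for one incoming row."""
    return sum_amount_last_3(table, request)


table = [
    {"user": 1, "ts": 1, "amount": 10},
    {"user": 1, "ts": 2, "amount": 20},
    {"user": 2, "ts": 2, "amount": 5},
    {"user": 1, "ts": 3, "amount": 30},
]

batch = offline_features(table)
# Replaying the last stored row as an online request yields the same value,
# because both paths call the same definition.
replayed = online_feature(table[:-1], table[-1])
```

In a real deployment the shared definition would be a SQL script rather than a Python function; the sketch only mirrors the structure.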

2.3. Efficiency

We propose several techniques to improve the performance of both offline and online feature extraction. As a result, our offline feature extraction can be significantly faster than existing open-source big data processing frameworks. Moreover, our online service provides low latency (tens of milliseconds) to meet the performance requirements of online inference.

See the section below (7. Publications & Blogs) for more technical details.

2.4. Integrated CLI

We provide a powerful integrated CLI for SQL programming, job management, online and offline deployment, and database administration. Developers familiar with database CLIs should find our tool comfortable to use.

Note that the CLI of the current release, 0.3.0, only partially supports the cluster mode. It will be fully supported in the next release, 0.4.0.

3. Build & Install

👉 Read more

4. Demo & QuickStart

Since OpenMLDB v0.3.0, there are two operating modes: cluster mode and standalone mode. The cluster mode suits large-scale datasets and real-world applications, and provides scalability and high availability. The lightweight standalone mode runs on a single node and is ideal for small businesses and demonstrations.

We demonstrate the workflow using the cluster and standalone modes:

5. Roadmap

We list a few highlight features planned for future releases. Please join our community to learn more about our plans and to discuss your ideas.

| Version | Est. release date | Highlight features |
|---------|-------------------|--------------------|
| 0.4.0 | End of 2021 | Full support of standalone and cluster modes in the integrated CLI |
| 0.5.0 | 2022 Q1 | Monitoring APIs and tools for online serving; efficient queries over a fairly long period of time by window functions; Kafka/Pulsar connector support for online data source |

6. Community

Join our community for feedback and discussion:

  • Email: [email protected]

  • Slack Workspace: Our Slack channels carry release notes, user support, development discussion, and more.

  • GitHub Issues and Discussions: Developers are most welcome to join our discussions on GitHub. GitHub Issues are used to report bugs and collect new requirements. GitHub Discussions are mostly used by our project maintainers to publish and comment on RFCs.

  • Blogs (Chinese)

  • WeChat Groups (Chinese): [image]

7. Publications & Blogs

Owner
4Paradigm
4Paradigm Open Source Community
Comments
  • feat: support the SQL RLIKE expression

    close #862

    Development

    • [x] SQL syntax https://github.com/4paradigm/zetasql/pull/41
    • [x] extend BinaryExpr to support RLIKE type
    • [x] add regexp_like builtin function
    • [x] Codegen: deal with RLIKE BinaryExpr

    Test

    • [x] ~~logical plan test in ast_node_converter_test to verify that the correct plan is generated for the rlike expr (discussion needed)~~
    • [x] unit test for new functions in udf.cc -> udf_test.cc
    • [x] udf_ir_builder_test for the registered regexp_like udf function
    • [x] expr_ir_builder_test for the supported rlike expression
    • [x] end-to-end test: a full SQL example that checks both the rlike expression and the regexp_like function (discussion needed)
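
As a rough illustration of the behavior this PR targets, the Python sketch below mimics a `regexp_like(target, pattern)` function and the equivalent `target RLIKE pattern` expression. It assumes partial-match semantics (as in MySQL's RLIKE) and NULL propagation, and it uses Python's regex dialect. None of these details are stated in the PR description, so treat them as assumptions.

```python
import re
from typing import Optional


def regexp_like(target: Optional[str], pattern: Optional[str]) -> Optional[bool]:
    """Return True if `pattern` matches anywhere in `target`.

    None stands in for SQL NULL: a NULL operand yields NULL (assumption).
    """
    if target is None or pattern is None:
        return None
    return re.search(pattern, target) is not None


def rlike(target: Optional[str], pattern: Optional[str]) -> Optional[bool]:
    """`target RLIKE pattern`, expressed through the same function."""
    return regexp_like(target, pattern)
```

Routing the expression and the function through one implementation mirrors the work list above, where the RLIKE BinaryExpr and the regexp_like builtin are added together.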
  • refactor: rm cmake modules

    • Please check if the PR fulfills these requirements
    • [ ] The commit message follows our guidelines
    • [ ] Tests for the changes have been added (for bug fixes / features)
    • [ ] Docs have been added / updated (for bug fixes / features)
    • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)

    • What is the current behavior? (You can also link to an open issue here)

    • What is the new behavior (if this is a feature change)?

    • Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

    • Other information:

  • `lag()/at()/lead()` should return the offset-th row, independent of the window frame bound

    Bug Description

      - id: 6
        desc: window merge optimization
        inputs:
          - columns: [ "row_id int","ts timestamp","group1 string","val1 int" ]
            indexs: [ "index1:group1:ts" ]
            name: t1
            data: |
              1, 1612130400000, g1, 1
              2, 1612130401000, g1, 2
              3, 1612130402000, g1, 3
              4, 1612130403000, g1, 4
              5, 1612130404000, g1, 5
        sql: |
          select
          `row_id` as row_id_1,
          `row_id` as t1_row_id_original_0,
          case when !isnull(at(`val1`, 0)) over t1_group1_ts_0s_5s_10 then count_where(`val1`, `val1` = at(`val1`, 0)) over t1_group1_ts_0s_5s_10 else null end as t1_val1_window_count_1,
          case when !isnull(at(`val1`, 0)) over t1_group1_ts_1s_5s_10 then count_where(`val1`, `val1` = at(`val1`, 0)) over t1_group1_ts_1s_5s_10 else null end as t1_val1_window_count_2
          from
            `t1` WINDOW
            t1_group1_ts_0s_5s_10 as (partition by `group1` order by `ts` rows_range between 5s preceding and 0s preceding MAXSIZE 10),
            t1_group1_ts_1s_5s_10 as (partition by `group1` order by `ts` rows_range between 5s preceding and 1s preceding MAXSIZE 10);
        expect:
          columns: ["row_id_1 int", "t1_row_id_original_0 int", "t1_val1_window_count_1 int64", "t1_val1_window_count_2 int64"]
          order: row_id_1
          data: |
            1, 1, 1, 0
            2, 2, 1, 0
            3, 3, 1, 0
            4, 4, 1, 0
            5, 5, 1, 0
    

    Expected Behavior

    The case should pass, but the current result does not match:

    +----------+----------------------+------------------------+------------------------+
    | row_id_1 | t1_row_id_original_0 | t1_val1_window_count_1 | t1_val1_window_count_2 |
    +----------+----------------------+------------------------+------------------------+
    | 1        | 1                    | 1                      | NULL                   |
    | 2        | 2                    | 1                      | NULL                   |
    | 3        | 3                    | 1                      | NULL                   |
    | 4        | 4                    | 1                      | NULL                   |
    | 5        | 5                    | 1                      | NULL                   |
    +----------+----------------------+------------------------+------------------------+
    

    Work List

    • [x] add extra handling in the logical plan for lag functions
      • working in #1605
    • [x] fix lag correctness in request mode and cluster environments
    • [x] fix window merge result correctness (may be related to #1587)
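
The expected result of the test case can be reproduced with a small Python sketch. It assumes, as the issue title describes, that `at(val1, 0)` always refers to the current row even when the window frame excludes it. The helper names are hypothetical and the frame logic is simplified (MAXSIZE is ignored, which does not matter for five rows).

```python
# Rows of table t1 from the test case: (row_id, ts in ms, group1, val1).
ROWS = [
    (1, 1612130400000, "g1", 1),
    (2, 1612130401000, "g1", 2),
    (3, 1612130402000, "g1", 3),
    (4, 1612130403000, "g1", 4),
    (5, 1612130404000, "g1", 5),
]


def frame(rows, current, start_ms, end_ms):
    """Rows of the same group with ts in [ts - start_ms, ts - end_ms]."""
    _, ts, group, _ = current
    return [r for r in rows
            if r[2] == group and ts - start_ms <= r[1] <= ts - end_ms]


def count_where_equals_current(rows, current, start_ms, end_ms):
    """count_where(val1, val1 = at(val1, 0)) over the given frame.

    at(val1, 0) is taken from the current row, not from the frame.
    """
    current_val = current[3]
    return sum(1 for r in frame(rows, current, start_ms, end_ms)
               if r[3] == current_val)


result = [
    (row[0],
     count_where_equals_current(ROWS, row, 5000, 0),     # 5s preceding .. 0s preceding
     count_where_equals_current(ROWS, row, 5000, 1000))  # 5s preceding .. 1s preceding
    for row in ROWS
]
```

The second count is 0 rather than NULL for every row: the frame ending at 1s preceding never contains the current row, yet `at(val1, 0)` is still defined.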
  • style: enforce cpp style convention in hybridse

    This PR addresses APIs that do not follow the Google C++ style.

    What this PR does:

    • [x] Rename the methods of EngineOptions and JitOptions to follow Google C++ style naming

    Related issue: Gitee issue 286

  • style: enforce cpp style convention in hybridse

    This PR addresses APIs that do not follow the Google C++ style.

    Background: C++ code should follow our style guide: https://github.com/4paradigm/rfcs/blob/main/style-guide/code-convention.md

    What this PR does:

    • [x] Rename the methods of EngineOptions and JitOptions to follow Google C++ style naming, leaving the corresponding part in openmldb-batch unchanged

    Related issue: Gitee issue 286

  • segmentation fault when trying demo under the standalone mode

    A segmentation fault occurs when running the demo in standalone mode with ../openmldb/bin/openmldb --host 127.0.0.1 --port 6527. It may be caused by insufficient memory on the machine.
  • feat: support dayofyear() built-in function

    • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)

    Adds the dayofyear() function which returns the day of the year for a given date (a number from 1 to 366).

    • What is the current behavior? (You can also link to an open issue here)

    dayofyear function doesn't exist, see issue for more details: #785
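
For reference, the described behavior (a number from 1 to 366) corresponds to the following Python sketch. It is an illustration only, not the PR's implementation, and it does not cover NULL or invalid dates.

```python
from datetime import date


def dayofyear(d: date) -> int:
    """Day of the year for `d`, from 1 to 366."""
    return d.timetuple().tm_yday
```

The value 366 is only reached on December 31 of a leap year.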

  • feat: implement function last_day

    • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)

    Feature: implement built-in function last_day

    • What is the new behavior (if this is a feature change)?

    Closes #821
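
The PR description does not spell out the semantics. Assuming `last_day` behaves like the function of the same name in MySQL and Hive (the last day of the month that contains the given date), a minimal Python sketch looks like this:

```python
import calendar
from datetime import date


def last_day(d: date) -> date:
    """Last day of the month containing `d` (assumed semantics)."""
    _, days_in_month = calendar.monthrange(d.year, d.month)
    return d.replace(day=days_in_month)
```

Leap years are handled by `calendar.monthrange`, so February needs no special case.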

  • feat: add a udf function similar to Hive get_json_object

    • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)

    feature

    • What is the current behavior? (You can also link to an open issue here)

    https://github.com/4paradigm/OpenMLDB/issues/1639

    • What is the new behavior (if this is a feature change)?

    Adds a UDF function similar to Hive's get_json_object.
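
To make the intended behavior concrete, here is a hedged Python sketch of a `get_json_object`-style function. It supports only a small subset of path syntax (`$`, `.key`, `[index]`) and returns None where SQL would return NULL. The exact path grammar and return formatting of the PR's UDF are not described here, so these choices are assumptions.

```python
import json
import re
from typing import Optional

# One path step: either ".key" or "[index]".
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def get_json_object(json_text: str, path: str) -> Optional[str]:
    """Extract the value at `path` (e.g. "$.a.b[0]") from a JSON string.

    Returns None for invalid JSON, an unsupported path, or a missing value.
    """
    if not path.startswith("$"):
        return None
    try:
        node = json.loads(json_text)
    except ValueError:
        return None
    pos = 1
    while pos < len(path):
        step = _STEP.match(path, pos)
        if step is None:
            return None  # Path syntax outside this sketch's subset.
        key, index = step.group(1), step.group(2)
        if key is not None:
            if not isinstance(node, dict) or key not in node:
                return None
            node = node[key]
        else:
            if not isinstance(node, list) or int(index) >= len(node):
                return None
            node = node[int(index)]
        pos = step.end()
    if node is None:
        return None  # JSON null is mapped to NULL.
    # Strings are returned as-is, everything else as JSON text.
    return node if isinstance(node, str) else json.dumps(node)
```

For example, with the document `{"store": {"fruit": [{"weight": 8}]}}`, the path `$.store.fruit[0].weight` yields the string `8`.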
  • feat: like predicate and like udf

    Features

    Resolves #224. The detailed SQL rules can be found in #224 and #686.

    What is not implemented yet:

    • converting the target and pattern expressions into strings where possible; currently only string or null is supported
    • data exception: an invalid escape sequence is not checked

    Further work

    • Implicit conversion for like's <target expression> & <pattern expression>
    • support RLIKE/SIMILAR TO predicate (regexp_match)
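
The matching rules of a LIKE predicate can be sketched in Python as follows. This is an illustration under assumptions, not the PR's code: the backslash default for `escape` follows MySQL, `None` stands in for NULL, and a dangling escape character raises an error here even though the PR notes that this check is not implemented yet.

```python
import re
from typing import Optional


def like(target: Optional[str], pattern: Optional[str],
         escape: str = "\\") -> Optional[bool]:
    """SQL-style LIKE: `%` matches any run of characters, `_` exactly one.

    None stands in for SQL NULL and propagates to the result.
    """
    if target is None or pattern is None:
        return None
    parts = []
    chars = iter(pattern)
    for ch in chars:
        if ch == escape:
            # The escaped character is matched literally.
            nxt = next(chars, None)
            if nxt is None:
                raise ValueError("pattern ends with a dangling escape character")
            parts.append(re.escape(nxt))
        elif ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    # DOTALL lets the wildcards match newlines as well.
    return re.fullmatch("".join(parts), target, flags=re.DOTALL) is not None
```

Unlike RLIKE, the whole target must match the pattern, which is why the sketch uses `re.fullmatch`.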
  • docs(udf): how to generate udf documents (udfs_8h.md)

    Not yet complete.

    • adds the document for the second task in #1707: udf doc generation
      • extra code is needed to make the steps work
    • also mentions #807, which is related to problems with linked udf docs
  • feat: update docker version

    • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)

    • What is the current behavior? (You can also link to an open issue here)

    • What is the new behavior (if this is a feature change)?

  • docs: move and optimize founction_boundary.md

    • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)

    • What is the current behavior? (You can also link to an open issue here)

    • What is the new behavior (if this is a feature change)?

  • add example for CallablePreparedStatement

    CallablePreparedStatement -> RequestPreparedStatement -> PreparedStatement. It can be used like this:

    stmt = create()
    for i:
      stmt.setXX()
      stmt.executeQuery() # in dataBuild, it'll clearParameters()
    
    • [ ] Add a best practice and a unit test for it.

    Creating the stmt every time is not good, because the constructor is somewhat complex.

    https://github.com/4paradigm/OpenMLDB/blob/f349d7242b1f08ac3b467bdbc2deb507ab73883e/java/openmldb-jdbc/src/main/java/com/_4paradigm/openmldb/jdbc/CallablePreparedStatement.java#L31

  • missing `disk table` configs in the english version of `conf.md`

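    Bug Description

    The following example configs should also be added to docs/en/deploy/conf.md:

    #
    # HDD table data file path (optional, empty by default)
    # Data directory; separate multiple disks with an ASCII comma (,)
    --hdd_root_path=./db_hdd
    # Recycle bin directory; data of dropped tables is placed here
    --recycle_bin_hdd_root_path=./recycle_hdd
    #
    # SSD table data file path (optional, empty by default)
    # Data directory; separate multiple disks with an ASCII comma (,)
    --ssd_root_path=./db_ssd
    # Recycle bin directory; data of dropped tables is placed here
    --recycle_bin_ssd_root_path=./recycle_ssd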
    

    Expected Behavior

    Relation Case

    Steps to Reproduce
