Text utilities, including beam search decoding, tokenizing, and more, built for use in Flashlight.

Flashlight Text: Fast, Lightweight Utilities for Text

Flashlight Text is a fast, minimal library for text-based operations, including beam search decoding and tokenizing.

Quickstart

Flashlight Text has Python bindings for decoder and Dictionary components. To install the bindings from source, install KenLM, then clone the repo and build:

git clone https://github.com/flashlight/text && cd text
cd bindings/python
python3 setup.py install

To install without KenLM, set the environment variable USE_KENLM=0 when running setup.py.

See the full Python binding documentation for examples and more.

Building and Installing

Requirements

At minimum, compilation requires:

  • A C++ compiler with good C++17 support (e.g. gcc/g++ >= 7)
  • CMake — version 3.10 or later, and make
  • A Linux-based operating system.

KenLM Support: If building with KenLM support, KenLM is required. To toggle KenLM support use the FL_TEXT_USE_KENLM CMake option or the USE_KENLM environment variable when building the Python bindings.

Tests: If building tests, Google Test >= 1.10 is required. The FL_TEXT_BUILD_TESTS CMake option toggles building tests.

Instructions for building and installing the Python bindings from source are given in the Quickstart section above.

Building from Source

Building the C++ project from source is simple:

git clone https://github.com/flashlight/text && cd text
mkdir build && cd build
cmake ..
make -j$(nproc)
make test    # run tests
make install # install at the CMAKE_INSTALL_PREFIX

To disable KenLM while building, pass -DFL_TEXT_USE_KENLM=OFF to CMake. To disable building tests, pass -DFL_TEXT_BUILD_TESTS=OFF.

KenLM can be downloaded and installed automatically if not found on the local system. The FL_TEXT_BUILD_STANDALONE option controls this behavior — if disabled, dependencies won't be downloaded and built when building.

Adding Flashlight Text to a C++ Project

Given a simple project.cpp file that includes and links to Flashlight Text:

#include <iostream>

#include <flashlight/lib/text/dictionary/Dictionary.h>

int main() {
  fl::lib::text::Dictionary myDict("someFile.dict");
  std::cout << "Dictionary has " << myDict.entrySize()
            << " entries."  << std::endl;
  return 0;
}

The following CMake configuration links Flashlight Text and sets include directories:

cmake_minimum_required(VERSION 3.10)
project(myProject)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

add_executable(myProject project.cpp)

find_package(flashlight-text CONFIG REQUIRED)
target_link_libraries(myProject PRIVATE flashlight::flashlight-text)

Contributing and Contact

Contact: [email protected]

Flashlight Text is actively developed. See CONTRIBUTING for more on how to help out.

Citing

You can cite Flashlight using:

@misc{kahn2022flashlight,
      title={Flashlight: Enabling Innovation in Tools for Machine Learning},
      author={Jacob Kahn and Vineel Pratap and Tatiana Likhomanenko and Qiantong Xu and Awni Hannun and Jeff Cai and Paden Tomasello and Ann Lee and Edouard Grave and Gilad Avidov and Benoit Steiner and Vitaliy Liptchinsky and Gabriel Synnaeve and Ronan Collobert},
      year={2022},
      eprint={2201.12465},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

Flashlight Text is under an MIT license. See LICENSE for more information.

Comments
  • Clarify how multi-words are scored from kenlm

    Question

    I have a wav2vec2 model that outputs uppercase characters. I trained a word-level KenLM model that looks like:

    ...
    -2.3280644      NOTE TAKING     -0.27668077
    -1.5878379      NOTE TAKING </s>        0
    -1.636697       NOTE TAKING IS  -0.04556015
    -1.6240125      NOTE TAKING A   0
    -2.2470238      NOTE TAKING THAT        0
    -1.671782       NOTE TAKING IN  -0.07720215
    -2.126965       NOTE TAKING YOU 0
    -1.4909623      NOTE TAKING THE 0
    -1.1026655      NOTE TAKING AND -0.007588452
    -1.9699614      NOTE TAKING IT  0
    ...
    -3.9249933      NO TAKING       -0.08624279
    -3.9410322      CASINO TAKING   0
    -3.6370378      PIANO TAKING    0
    -1.0506308      NO TAKING A     0
    -0.9676584      NO TAKING THE   0
    -1.3194044      NO TAKING IT    -0.08496775
    -4.113138       <s> NO TAKING   0
    -4.1317         IS NO TAKING    0
    -4.251807       WAS NO TAKING   0
    -3.4926496      THERE'S NO TAKING       0
    ...
    

    But it generates: THIS IS GOING TO ACTIVATE THAT PANEL AND IN A FEW SECONDS YOU'LL NOTICE OUR FRIENDLY FATHOM NO TAKING BUT WILL JOIN US AS WELL

    It seems that multi-word sequences are not scored properly during beam search.

    This is my lexicon file, generated from KenLM:

    THERE   T H E R E |
    IS      I S |
    A       A |
    LOT     L O T |
    THAT    T H A T |
    GOES    G O E S |
    INTO    I N T O |
    DRUG    D R U G |
    LAB     L A B |
    CLEANUP C L E A N U P |
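
Each lexicon line above pairs a word with its space-separated spelling, ending in the | token. A minimal sketch of splitting one such line, assuming whitespace-delimited fields; parseLexiconLine is a hypothetical helper and not part of Flashlight Text:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: splits one lexicon line into (word, spelling tokens).
// Assumes fields are separated by spaces or tabs, as in the listing above.
std::pair<std::string, std::vector<std::string>> parseLexiconLine(
    const std::string& line) {
  std::istringstream in(line);
  std::string word;
  in >> word;
  assert(!word.empty() && "a lexicon line must start with a word");

  // Everything after the word is the spelling, including the trailing "|".
  std::vector<std::string> tokens;
  for (std::string tok; in >> tok;) {
    tokens.push_back(tok);
  }
  return {word, tokens};
}
```

For the line `LAB     L A B |` this yields the word LAB and the tokens L, A, B, |. Every entry is a single word, so a phrase such as NOTE TAKING would be spelled out as two separate lexicon words.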
    

    Notice that NOTE TAKING is much more common than NO TAKING. I wonder whether the unigram is preferred.
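
As background for the question above, here is a minimal C++ sketch of how a backoff n-gram model in ARPA format is conventionally queried: look up the longest matching n-gram and, when it is missing, add the context's backoff weight and retry with a shorter context. This is not the KenLM or Flashlight Text implementation. The names NGramEntry, NGramTable, backoffScore, and exampleTable are hypothetical, and the table holds only a handful of rows copied from the listing above.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <vector>

// One ARPA row: log10 probability and log10 backoff weight.
struct NGramEntry {
  double logProb;
  double backoff;
};

using NGramTable = std::map<std::vector<std::string>, NGramEntry>;

// Returns log10 P(word | context) with standard backoff, or nullopt when the
// word is not in the table even as a unigram.
std::optional<double> backoffScore(
    const NGramTable& table,
    std::vector<std::string> context,
    const std::string& word) {
  assert(!word.empty());
  double penalty = 0.0;
  while (true) {
    std::vector<std::string> ngram = context;
    ngram.push_back(word);
    auto hit = table.find(ngram);
    if (hit != table.end()) {
      return penalty + hit->second.logProb;
    }
    if (context.empty()) {
      return std::nullopt;
    }
    // N-gram missing: pay the context's backoff weight (if the context is
    // listed), then drop the oldest context word and try again.
    auto ctx = table.find(context);
    if (ctx != table.end()) {
      penalty += ctx->second.backoff;
    }
    context.erase(context.begin());
  }
}

// A few rows copied from the listing above; a real model has many more.
NGramTable exampleTable() {
  return {
      {{"NOTE", "TAKING"}, {-2.3280644, -0.27668077}},
      {{"NOTE", "TAKING", "IS"}, {-1.636697, -0.04556015}},
      {{"NO", "TAKING"}, {-3.9249933, -0.08624279}},
      {{"NO", "TAKING", "A"}, {-1.0506308, 0.0}},
      {{"IS", "NO", "TAKING"}, {-4.1317, 0.0}},
  };
}
```

With these rows, `backoffScore(table, {"IS", "NO", "TAKING"}, "A")` finds no 4-gram, adds the backoff weight of IS NO TAKING (0), and returns the score of NO TAKING A. A beam search decoder typically combines such language model scores with acoustic scores, which this sketch does not model.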

  • Fail to find KenLM when installing the python binding

    Question

    I tried to install the Python bindings with python setup.py install. However, it fails to find KenLM even though it is already installed.

    Additional Context

    The error log is as follows. How should I set environment variables like CMAKE_LIBRARY_PATH, KENLM_LIB or KENLM_ROOT?

    python setup.py install
    running install
    /DB/rhome/chenyuyang/miniconda3/envs/cuda113/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
      warnings.warn(
    /DB/rhome/chenyuyang/miniconda3/envs/cuda113/lib/python3.9/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
      warnings.warn(
    running bdist_egg
    running egg_info
    writing flashlight_text.egg-info/PKG-INFO
    writing dependency_links to flashlight_text.egg-info/dependency_links.txt
    writing top-level names to flashlight_text.egg-info/top_level.txt
    package init file 'flashlight/__init__.py' not found (or not a regular file)
    package init file 'flashlight/lib/text/__init__.py' not found (or not a regular file)
    reading manifest file 'flashlight_text.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'flashlight_text.egg-info/SOURCES.txt'
    installing library code to build/bdist.linux-x86_64/egg
    running install_lib
    running build_py
    running build_ext
    -- Looking for KenLM
    -- kenlm library not found; if you already have kenlm installed, please set CMAKE_LIBRARY_PATH, KENLM_LIB or KENLM_ROOT environment variable
    -- kenlm utils library not found; if you already have kenlm installed, please set CMAKE_LIBRARY_PATH, KENLM_UTIL_LIB or KENLM_ROOT environment variable
    -- kenlm model.hh not found; if you already have kenlm installed, please set CMAKE_INCLUDE_PATH, KENLM_MODEL_HEADER or KENLM_ROOT environment variable
    -- Could NOT find kenlm (missing: KENLM_LIBRARIES) 
    CMake Error at flashlight/lib/text/decoder/lm/CMakeLists.txt:21 (message):
      KenLM not found but FL_TEXT_USE_KENLM enabled.  Install KenLM or set the
      KENLM_ROOT environment variable.
    Call Stack (most recent call first):
      flashlight/lib/text/decoder/CMakeLists.txt:3 (include)
      flashlight/lib/text/CMakeLists.txt:8 (include)
      CMakeLists.txt:54 (include)
    

    I've added export KENLM_ROOT=/DB/rhome/chenyuyang/tools/kenlm/build/bin, but it does not seem to work.

  • Update flashlight text decoder extension for TorchAudio adoption

    Summary: Make changes to fill some gaps for adopting FL Text decoder in TorchAudio.

    1. Make the extension module flashlight_lib_text_decoder expose KenLM and ZeroLM.
    2. Add ZeroLMPtr alias in ZeroLM.h.

    Differential Revision: D37983766

  • Add Codecov

    Add codecov build; rename codecov flag; set standalone CI build to test with standalone disabled

    Test plan: CI + https://app.codecov.io/gh/flashlight/text

  • [build] Fix MSVC build

    Fix MSVC nitpicks with std::filesystem::path implicit casting to std::string
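
For context, a small sketch of the portability issue, assuming the fix follows the usual pattern: std::filesystem::path converts implicitly only to its native string type, which is std::wstring with MSVC, so code that needs a std::string calls path::string() explicitly. The helper fileNameOf is hypothetical and is not taken from the PR.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

// Hypothetical helper: returns the final path component as a std::string.
std::string fileNameOf(const std::filesystem::path& p) {
  assert(p.has_filename() && "expected a path that ends in a file name");
  // Portable: string() yields std::string on every platform.
  // Writing `std::string s = p.filename();` instead relies on the implicit
  // conversion to path::string_type, which is std::wstring on Windows and
  // therefore does not compile with MSVC.
  return p.filename().string();
}
```

This requires C++17, which matches the compiler requirement stated for the library.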

    Test plan: local build on MSVC + CI baseline test

    Checklist

    • [x] Test coverage
    • [x] Tests pass
    • [x] Code formatted
    • [x] Rebased on latest master
    • [x] Code documented
  • [build] Use FetchContent to get gtest

    Use FetchContent to download and build gtest. This requires a bump to CMake 3.14 minimum, but it's worth the cross platform and build speedups. Ubuntu 20.04 ships with CMake 3.16, so support's there. May augment this PR to enforce CMake 3.14 project-wide

    Test plan: CI

    Checklist

    • [x] Test coverage
    • [x] Tests pass
    • [x] Code formatted
    • [x] Rebased on latest master
    • [x] Code documented
  • [ci] Add macOS CI baseline

    CI baseline for macOS - matrix across KenLM and yes/no Python bindings.

    Test plan: CI

    Checklist

    • [x] Test coverage
    • [x] Tests pass
    • [x] Code formatted
    • [x] Rebased on latest master
    • [x] Code documented
  • Bump gtest to 1.12.1

    Update gtest to 1.12.1 — fixes build issues on some macOS LLVM versions.

    Test plan: CI

    Checklist

    • [x] Test coverage
    • [X] Tests pass
    • [X] Code formatted
    • [X] Rebased on latest master
    • [X] Code documented
  • Include <string> in Trie.cpp

    Summary: std::to_string is defined in <string> header. https://en.cppreference.com/w/cpp/string/basic_string/to_string

    gcc and clang are fine without the header, but cl.exe is not.

    C:\...\flashlight\lib\text\decoder\Trie.cpp(32): error C2039: 'to_string': is not a member of 'std'
    C:\...\flashlight\lib\text\decoder\Trie.cpp(32): error C3861: 'to_string': identifier not found
    C:\...\flashlight\lib\text\decoder\Trie.cpp(31): error C2512: 'std::out_of_range': no appropriate default constructor available
    C:\...\flashlight\lib\text\decoder\Trie.cpp(32): note: No constructor could take the source type, or constructor overload resolution was ambiguous
    C:\...\flashlight\lib\text\decoder\Trie.cpp(54): error C2039: 'to_string': is not a member of 'std'
    C:\...\flashlight\lib\text\decoder\Trie.cpp(54): error C3861: 'to_string': identifier not found
    C:\...\flashlight\lib\text\decoder\Trie.cpp(53): error C2512: 'std::out_of_range': no appropriate default constructor available
    C:\...\flashlight\lib\text\decoder\Trie.cpp(54): note: No constructor could take the source type, or constructor overload resolution was ambiguous
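
The pattern behind these errors can be sketched as follows. This is an illustration rather than the code in Trie.cpp; the function checkChildIndex and its message are hypothetical. The point is that a translation unit that calls std::to_string includes <string> itself instead of relying on another header to pull it in.

```cpp
#include <cassert>
#include <stdexcept>
#include <string> // declares std::to_string; do not rely on transitive includes

// Hypothetical bounds check that builds its error message with std::to_string.
void checkChildIndex(int idx, int maxChildren) {
  assert(maxChildren > 0);
  if (idx < 0 || idx >= maxChildren) {
    throw std::out_of_range(
        "child index out of range: " + std::to_string(idx));
  }
}
```

Without the <string> include, the argument expression fails to compile under cl.exe, which would also explain the follow-on C2512 error about std::out_of_range in the log above.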
    

    Differential Revision: D38194039

  • Fix gcc 12 build

    Summary: torchaudio builders report https://github.com/pytorch/audio/issues/2445 with gcc 12. Fix it upstream and add a CI baseline for gcc 12

    Differential Revision: D36952141

  • CI Baselines

    Add CircleCI baselines on Ubuntu. macOS and MSVC baselines coming soon.

    Add baselines across {[static, shared] x [KenLM, no KenLM] x [Python bindings, no bindings]}.
