Text utilities, including beam search decoding, tokenizing, and more, built for use in Flashlight.

Flashlight Text: Fast, Lightweight Utilities for Text

Quickstart | Installation | Python Documentation | Citing

Flashlight Text is a fast, minimal library for text-based operations, including beam search decoding and tokenizing.

Quickstart

Flashlight Text has Python bindings for decoder and Dictionary components. To install the bindings from source, install KenLM, then clone the repo and build:

git clone https://github.com/flashlight/text && cd text
cd bindings/python
python3 setup.py install

To install without KenLM, set the environment variable USE_KENLM=0 when running setup.py.
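
If the install is driven from a script, the same toggle can be applied through the process environment. The following is a minimal Python sketch, not part of Flashlight Text: the function name build_bindings and its dry_run parameter are hypothetical, and only USE_KENLM=0 and the setup.py install command come from the text above.

```python
import os
import subprocess
import sys


def build_bindings(bindings_dir, use_kenlm=True, dry_run=False):
    """Run `setup.py install` in bindings_dir, optionally with USE_KENLM=0.

    Hypothetical helper for illustration; returns the command and the
    USE_KENLM value that the child process sees.
    """
    env = dict(os.environ)
    if not use_kenlm:
        # The README documents USE_KENLM=0 as the way to build without KenLM.
        env["USE_KENLM"] = "0"
    cmd = [sys.executable, "setup.py", "install"]
    if not dry_run:
        subprocess.run(cmd, cwd=bindings_dir, env=env, check=True)
    return cmd, env.get("USE_KENLM")


if __name__ == "__main__":
    # Dry run: show what would be executed without building anything.
    print(build_bindings("bindings/python", use_kenlm=False, dry_run=True))
```

Copying os.environ rather than mutating it keeps the setting local to the build process.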

See the full Python binding documentation for examples and more.

Building and Installing

From Source (C++) | From Source (Python) | Adding to Your Own Project (C++)

Requirements

At minimum, compilation requires:

  • A C++ compiler with good C++17 support (e.g. gcc/g++ >= 7)
  • CMake — version 3.10 or later, and make
  • A Linux-based operating system.

KenLM Support: If building with KenLM support, KenLM is required. To toggle KenLM support use the FL_TEXT_USE_KENLM CMake option or the USE_KENLM environment variable when building the Python bindings.

Tests: If building tests, Google Test >= 1.10 is required. The FL_TEXT_BUILD_TESTS CMake option toggles building tests.

Instructions for building/installing the Python bindings from source can be found here.

Building from Source

Building the C++ project from source is simple:

git clone https://github.com/flashlight/text && cd text
mkdir build && cd build
cmake ..
make -j$(nproc)
make test    # run tests
make install # install at the CMAKE_INSTALL_PREFIX

To disable KenLM while building, pass -DFL_TEXT_USE_KENLM=OFF to CMake. To disable building tests, pass -DFL_TEXT_BUILD_TESTS=OFF.

KenLM can be downloaded and built automatically if it is not found on the local system. The FL_TEXT_BUILD_STANDALONE option controls this behavior: if it is disabled, missing dependencies are not downloaded or built.

Adding Flashlight Text to a C++ Project

Given a simple project.cpp file that includes and links to Flashlight Text:

#include <iostream>

#include <flashlight/lib/text/dictionary/Dictionary.h>

int main() {
  fl::lib::text::Dictionary myDict("someFile.dict");
  std::cout << "Dictionary has " << myDict.entrySize()
            << " entries."  << std::endl;
  return 0;
}

The following CMake configuration links Flashlight Text and sets include directories:

cmake_minimum_required(VERSION 3.10)
project(myProject CXX)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

add_executable(myProject project.cpp)

find_package(flashlight-text CONFIG REQUIRED)
target_link_libraries(myProject PRIVATE flashlight::flashlight-text)

Contributing and Contact

Contact: [email protected]

Flashlight Text is actively developed. See CONTRIBUTING for more on how to help out.

Citing

You can cite Flashlight using:

@misc{kahn2022flashlight,
      title={Flashlight: Enabling Innovation in Tools for Machine Learning},
      author={Jacob Kahn and Vineel Pratap and Tatiana Likhomanenko and Qiantong Xu and Awni Hannun and Jeff Cai and Paden Tomasello and Ann Lee and Edouard Grave and Gilad Avidov and Benoit Steiner and Vitaliy Liptchinsky and Gabriel Synnaeve and Ronan Collobert},
      year={2022},
      eprint={2201.12465},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

Flashlight Text is under an MIT license. See LICENSE for more information.

Comments
  • Clarify how multi-words are scored from kenlm

    Question

    I have a wav2vec2 model that outputs uppercase characters. I trained a word-level kenlm that looks like

    ...
    -2.3280644      NOTE TAKING     -0.27668077
    -1.5878379      NOTE TAKING </s>        0
    -1.636697       NOTE TAKING IS  -0.04556015
    -1.6240125      NOTE TAKING A   0
    -2.2470238      NOTE TAKING THAT        0
    -1.671782       NOTE TAKING IN  -0.07720215
    -2.126965       NOTE TAKING YOU 0
    -1.4909623      NOTE TAKING THE 0
    -1.1026655      NOTE TAKING AND -0.007588452
    -1.9699614      NOTE TAKING IT  0
    ...
    -3.9249933      NO TAKING       -0.08624279
    -3.9410322      CASINO TAKING   0
    -3.6370378      PIANO TAKING    0
    -1.0506308      NO TAKING A     0
    -0.9676584      NO TAKING THE   0
    -1.3194044      NO TAKING IT    -0.08496775
    -4.113138       <s> NO TAKING   0
    -4.1317 IS NO TAKING    0
    -4.251807       WAS NO TAKING   0
    -3.4926496      THERE'S NO TAKING       0
    ...
    

    But it generates THIS IS GOING TO ACTIVATE THAT PANEL AND IN A FEW SECONDS YOU'LL NOTICE OUR FRIENDLY FATHOM NO TAKING BUT WILL JOIN US AS WELL

    It seems that multi-word sequences are not scored properly by the beam search.

    This is my lexicon file that is generated from kenlm

    THERE   T H E R E |
    IS      I S |
    A       A |
    LOT     L O T |
    THAT    T H A T |
    GOES    G O E S |
    INTO    I N T O |
    DRUG    D R U G |
    LAB     L A B |
    CLEANUP C L E A N U P |
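
A lexicon in this layout can be produced from a word list with a few lines of Python. This is only a sketch, assuming every word is spelled character by character and terminated with the | word separator, as in the excerpt above; the function names are hypothetical and the tab between the word and its spelling is an assumption.

```python
def lexicon_line(word, word_sep="|"):
    """Return one lexicon row: the word, a tab, then its space-separated spelling."""
    return f"{word}\t{' '.join(word)} {word_sep}"


def build_lexicon(words):
    """Build lexicon rows for the unique words, keeping first-seen order."""
    unique_words = dict.fromkeys(words)
    return [lexicon_line(word) for word in unique_words]


if __name__ == "__main__":
    print("\n".join(build_lexicon(["THERE", "IS", "A", "LOT"])))
```

Each row covers a single word, so a multi-word phrase such as NOTE TAKING is only ever assembled from separate lexicon entries.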
    

    Notice that NOTE TAKING is much more common than NO TAKING. I wonder whether unigrams are preferred.
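
When reading the ARPA excerpt in this question, it helps to remember that each n-gram line stores a conditional log10 probability: the NOTE TAKING line is log10 P(TAKING | NOTE), not the probability of the phrase as a whole. The Python sketch below parses a few lines copied from the excerpt and compares the two bigrams. The helper names are hypothetical, and the comparison is not how any particular decoder scores a hypothesis: a full sentence score also depends on how likely NOTE and NO are after the preceding word, and on the scores from the emitting model.

```python
# Entries copied from the ARPA excerpt in the question:
# log10 probability, n-gram, log10 backoff weight.
ARPA_LINES = [
    "-2.3280644\tNOTE TAKING\t-0.27668077",
    "-3.9249933\tNO TAKING\t-0.08624279",
    "-1.6240125\tNOTE TAKING A\t0",
    "-1.0506308\tNO TAKING A\t0",
]


def parse_arpa_line(line):
    """Split one n-gram line into (log10_prob, words, log10_backoff).

    Assumes the backoff column is present, as it is in the excerpt.
    """
    fields = line.split()
    return float(fields[0]), tuple(fields[1:-1]), float(fields[-1])


def load_table(lines):
    """Map each n-gram (a tuple of words) to (log10_prob, log10_backoff)."""
    table = {}
    for line in lines:
        log_prob, words, backoff = parse_arpa_line(line)
        table[words] = (log_prob, backoff)
    return table


def conditional_log10(table, ngram):
    """log10 P(last word | preceding words) for an n-gram listed in the table."""
    return table[tuple(ngram.split())][0]


table = load_table(ARPA_LINES)
gap = conditional_log10(table, "NOTE TAKING") - conditional_log10(table, "NO TAKING")

if __name__ == "__main__":
    # A gap of g means TAKING is 10**g times more likely after NOTE than after NO.
    print(f"log10 gap: {gap:.4f}, ratio: {10 ** gap:.1f}")
```

The gap only says that TAKING is far more likely after NOTE than after NO; it says nothing about whether NOTE or NO is itself the more likely word in that position.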

  • Add Python build to MSVC CI

    Summary

    See title. Fixed by adding -DCMAKE_WINDOWS_EXPORT_ALL_SYMBOLS=ON to the Windows build which actually exported everything without having to change the source, hooray

    Enables building Flashlight Text wheels on Windows

    Test plan: CI

  • Fail to find KenLM when installing the python binding

    Question

    I tried to install the Python bindings with python setup.py install. However, it fails to find KenLM even though it is already installed.

    Additional Context

    The error log is as follows. How should I set environment variables such as CMAKE_LIBRARY_PATH, KENLM_LIB or KENLM_ROOT?

    python setup.py install
    running install
    /DB/rhome/chenyuyang/miniconda3/envs/cuda113/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
      warnings.warn(
    /DB/rhome/chenyuyang/miniconda3/envs/cuda113/lib/python3.9/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
      warnings.warn(
    running bdist_egg
    running egg_info
    writing flashlight_text.egg-info/PKG-INFO
    writing dependency_links to flashlight_text.egg-info/dependency_links.txt
    writing top-level names to flashlight_text.egg-info/top_level.txt
    package init file 'flashlight/__init__.py' not found (or not a regular file)
    package init file 'flashlight/lib/text/__init__.py' not found (or not a regular file)
    reading manifest file 'flashlight_text.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'flashlight_text.egg-info/SOURCES.txt'
    installing library code to build/bdist.linux-x86_64/egg
    running install_lib
    running build_py
    running build_ext
    -- Looking for KenLM
    -- kenlm library not found; if you already have kenlm installed, please set CMAKE_LIBRARY_PATH, KENLM_LIB or KENLM_ROOT environment variable
    -- kenlm utils library not found; if you already have kenlm installed, please set CMAKE_LIBRARY_PATH, KENLM_UTIL_LIB or KENLM_ROOT environment variable
    -- kenlm model.hh not found; if you already have kenlm installed, please set CMAKE_INCLUDE_PATH, KENLM_MODEL_HEADER or KENLM_ROOT environment variable
    -- Could NOT find kenlm (missing: KENLM_LIBRARIES) 
    CMake Error at flashlight/lib/text/decoder/lm/CMakeLists.txt:21 (message):
      KenLM not found but FL_TEXT_USE_KENLM enabled.  Install KenLM or set the
      KENLM_ROOT environment variable.
    Call Stack (most recent call first):
      flashlight/lib/text/decoder/CMakeLists.txt:3 (include)
      flashlight/lib/text/CMakeLists.txt:8 (include)
      CMakeLists.txt:54 (include)
    

    I've added export KENLM_ROOT=/DB/rhome/chenyuyang/tools/kenlm/build/bin but it does not seem to work.
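
One way to narrow down a problem like this is to check where the three things named in the CMake messages (the kenlm library, the kenlm utils library, and model.hh) actually live under a candidate directory. The Python sketch below is a diagnostic aid only: the function name and the file name patterns are assumptions, and it does not reproduce how the CMake find module searches.

```python
import os
from pathlib import Path

# Labels follow the CMake messages in the log above; the file name patterns
# are assumptions about how a KenLM build names its outputs.
PATTERNS = {
    "kenlm library": "libkenlm.*",
    "kenlm utils library": "libkenlm_util.*",
    "model.hh": "model.hh",
}


def locate_kenlm_files(root):
    """Return, for each label, the matching paths found anywhere under root."""
    root = Path(root)
    return {
        label: sorted(p.relative_to(root).as_posix() for p in root.rglob(pattern))
        for label, pattern in PATTERNS.items()
    }


if __name__ == "__main__":
    candidate = os.environ.get("KENLM_ROOT", ".")
    for label, paths in locate_kenlm_files(candidate).items():
        print(f"{label}: {paths or 'not found under ' + candidate}")
```

If a label comes back empty for the directory that KENLM_ROOT points to, running the same check one or two directory levels higher shows whether the variable is simply set too deep in the tree.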

  • Fix python namespace packaging + Flashlight pkgs

    See title. Stop using namespace_packages since it's deprecated, and follow the official docs which stipulate removing __init__.py from every namespace dir that doesn't actually have package assets.

    Test plan: tested install combos, i.e. text:

    conda create -n flashlight-python python=3.10
    conda activate flashlight-python
    cd text/bindings/python
    USE_KENLM=0 python setup.py install
    

    sequence:

    cd sequence/bindings/python
    USE_CUDA=0 python setup.py install
    

    then test incremental install

    python
    > from flashlight.lib.text import decoder
    > from flashlight.lib.sequence import criterion
    

    then remove things:

    pip uninstall flashlight-text
    python
    > from flashlight.lib.sequence import criterion
    

    remove everything

    pip uninstall flashlight-sequence
    python
    > import flashlight
    >> ModuleNotFoundError: No module named 'flashlight'
    

    Checklist

    • [x] Test coverage
    • [x] Tests pass
    • [x] Code formatted
    • [x] Rebased on latest master
    • [x] Code documented
  • Python bindings for seq2seq decoders

    Summary

    Bind Seq2Seq/autoregressive beam search decoders from Flashlight Text to Python.

    Two notable subtleties in how the bindings are structured to avoid overhead:

    • EmittingModelStatePtr (a typedef'ed std::shared_ptr<void>) is exposed to Python interop via std::shared_ptr<py::object> (which itself is a reference-counted wrapper around PyObject*).
      • shared_ptr will properly modify refcounts of the py::object such that there aren't lifetime issues round-trip -- if Python garbage collects autoregressive model state, refcount will be > 0 if being used in the decoder.
      • get_obj_from_emitting_model_state and create_emitting_model_state can create this type from arbitrary Python objects with ~no overhead.
      • This approach also avoids intermediate copies given that the passed py::object refers to the same underlying memory/handle/is COW
    • EmittingModelUpdateFunc, which is the autoregressive callback defined in Python but called in C++ to get incremental model token scores and model state. This closure is passed from Python --> C++ once at decoder construction; a function pointer's stored in C++ to the Python callable. Opaque types preclude copies of scores from args or return vals -- this will be more carefully investigated/improved over time.

    Tests are self-documenting for now for the LexiconFreeSeq2Seq variant.

    Checklist

    • [x] Test coverage
    • [x] Tests pass
    • [x] Code formatted
    • [x] Rebased on latest master
    • [x] Code documented
  • Error in installing Python binding

    Hi, I am trying to install the Python bindings for Flashlight. I first installed the following packages (I am using Python 3.9.12): pip install packaging, then pip install cmake (which installed CMake version 3.25.0).

    Now if I run the following command (I set BUILD_SHARED_LIBS to ON at the beginning of the CMakeLists.txt file, as it was complaining that BUILD_SHARED_LIBS is off):

    cmake .

    I get the following error

    CMake Warning (dev) in CMakeLists.txt: No project() command is present. The top-level CMakeLists.txt file must contain a literal, direct call to the project() command. Add a line of code such as

    project(ProjectName)
    

    near the top of the file, but after cmake_minimum_required().

    CMake is pretending there is a "project(Project)" command on the first line. This warning is for project developers. Use -Wno-dev to suppress it.

    CMake Error at CMakeLists.txt:8 (include): include could not find requested file:

    /Buildpybind11.cmake
    

    CMake Error at CMakeLists.txt:9 (include): include could not find requested file: /pybind11Tools.cmake

    CMake Error at CMakeLists.txt:21 (pybind11_add_module): Unknown CMake command "pybind11_add_module". Call Stack (most recent call first): CMakeLists.txt:45 (add_pybind11_extension)

    Can you please help me?

  • Add Pickle support for LexiconFreeDecoder and options

    Summary: Adds support for pickling instances of LexiconFreeDecoderOptions and LexiconFreeDecoder which is needed for pyper training/integration.

    Lexicon-free decoding is the only decoding type currently supported for serialization; it's also the only type for which serialization of any kind makes sense given that decoding state is implemented with opaque pointer types, and reproducing it is expensive and requires breaking a lot of abstraction. Serializing a Lexicon/Trie is also difficult due to how they're efficiently constructed in memory, so it is likely more efficient to simply serialize an uncompressed token set, then deserialize when using a decoder.

    Since there's no way to reliably serialize LMs, only LexiconFreeDecoders with ZeroLMs can be serialized.
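
The idea of serializing only a flat token set and rebuilding the in-memory structure on load can be sketched in Python. TokenTrie below is a hypothetical stand-in written for this illustration, not a Flashlight Text class, and the sketch says nothing about how the actual decoder bindings implement pickling.

```python
import pickle

_END = "<end>"  # marker key for "a token ends here"; longer than any single character


class TokenTrie:
    """Hypothetical in-memory trie that pickles as a flat token list."""

    def __init__(self, tokens):
        self.tokens = sorted(set(tokens))
        self.root = {}
        for token in self.tokens:
            node = self.root
            for ch in token:
                node = node.setdefault(ch, {})
            node[_END] = True

    def __contains__(self, token):
        node = self.root
        for ch in token:
            if ch not in node:
                return False
            node = node[ch]
        return _END in node

    def __reduce__(self):
        # Serialize only the uncompressed token list; the nested trie
        # is rebuilt by the constructor when the object is unpickled.
        return (TokenTrie, (self.tokens,))


if __name__ == "__main__":
    restored = pickle.loads(pickle.dumps(TokenTrie(["NOTE", "NO", "TAKING"])))
    print("NOTE" in restored, "NOT" in restored)
```

The pickled payload grows with the number of tokens rather than with the size of the trie, at the cost of rebuilding the structure on every load.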

    Reviewed By: redraven984

    Differential Revision: D40951537

  • Fix casing of emittingModelUpdateFunc + docs formatting fixes

    Summary

    Fix some rename issues with EmittingModelUpdateFunc and fix docs formatting issues.

    Test Plan (required)

    CI

    Checklist

    • [ ] Test coverage
    • [ ] Tests pass
    • [ ] Code formatted
    • [ ] Rebased on latest master
    • [ ] Code documented
  • Rename acoustic-model-specific callbacks

    Summary: Rename speech-specific things (mostly acoustic model) to refer to an "emitting model" which might not be a speech-based model.

    Differential Revision: D39797755

  • Update flashlight text decoder extension for TorchAudio adoption

    Summary: Make changes to fill some gaps for adopting FL Text decoder in TorchAudio.

    1. Make the extension module flashlight_lib_text_decoder expose KenLM and ZeroLM.
    2. Add ZeroLMPtr alias in ZeroLM.h.

    Differential Revision: D37983766

  • Add Codecov

    Add codecov build; rename codecov flag; set standalone CI build to test with standalone disabled

    Test plan: CI + https://app.codecov.io/gh/flashlight/text

  • Is the sil token necessary?

    Question

    There are cases where there is no sil token, such as with a sentencepiece tokenizer or with Asian languages where sequences consist only of characters (Chinese, Japanese, etc.). And in cases where there is a sil token, I think it is no different from a meaningful token like a-z.

    AFAIK, other implementations like pyctcdecode and NVIDIA NeMo do not consider sil in beam search. Can you explain why it is introduced in Flashlight?
