Perl Incompatible Regular Expressions library

This is PIRE, Perl Incompatible Regular Expressions library.

This library is aimed at checking a huge amount of text against
relatively many regular expressions. Roughly speaking, it can just
check whether given text maches the certain regexp, but can do it
really fast (more than 400 MB/s on our hardware is common). Even more,
multiple regexps can be combined together, giving capability to
check the text against apx.10 regexps in a single pass (and mantaining
the same speed).

Since Pire examines each character only once, without any lookaheads
or rollbacks, spending about five machine instructions per each character,
it can be used even in realtime tasks.

On the other hand, Pire has very limited functionality (compared to
other regexp libraries). Pire does not have any Perlish conditional
regexps, lookaheads & backtrackings, greedy/nongreedy matches; neither
has it any capturing facilities.

Pire was developed in Yandex (http://company.yandex.ru/) as a part of its
web crawler.

More information can be found in README.ru (in Russian), which is
yet to be translated.

Please report bugs to [email protected] or [email protected]

Quick Start
=============

#include <stdio.h>
#include <vector>
#include <pire/pire.h>

Pire::NonrelocScanner CompileRegexp(const char* pattern)
{
	// Transform the pattern from UTF-8 into UCS4
	std::vector<Pire::wchar32> ucs4;
	Pire::Encodings::Utf8().FromLocal(pattern, pattern + strlen(pattern), std::back_inserter(ucs4));

	return Pire::Lexer(ucs4.begin(), ucs4.end())
		.AddFeature(Pire::Features::CaseInsensitive())	// enable case insensitivity
		.SetEncoding(Pire::Encodings::Utf8())		// set input text encoding
		.Parse() 					// create an FSM 
		.Surround()					// PCRE_ANCHORED behavior
		.Compile<Pire::NonrelocScanner>();		// compile the FSM
}

bool Matches(const Pire::NonrelocScanner& scanner, const char* ptr, size_t len)
{
	return Pire::Runner(scanner)
		.Begin()	// '^'
		.Run(ptr, len)	// the text 
		.End();		// '$'
		// implicitly cast to bool
}

int main()
{
	char re[] = "hello\\s+w.+d$";
	char str[] = "Hello world";

	Pire::NonrelocScanner sc = CompileRegexp(re);

	bool res = Matches(sc, str, strlen(str));

	printf("String \"%s\" %s \"%s\"\n", str, (res ? "matches" : "doesn't match"), re);
		
	return 0;
}

Owner
Yandex
Yandex open source projects and technologies
Yandex
Comments
  • DoRun: fix possible invalid read

    DoRun: fix possible invalid read

    If a string passed to DoRun is less than sizeof(size_t), then its content can't be read as if its type is size_t*. DoRun splits the string to 3 parts: pre-aligned, aligned and post-aligned. If there is no aligned part of a string, then whole string is treated as pre- and post-aligned.

    The error was caused by casting pre- and post-aligned parts to size_t*. Now they are processed by code operating on char* (function SafeRunChunk). (Actually pre-aligned part is allowed to be casted to size_t*, if aligned part exists, ie. length of the string > size(size_t).)

    Target strings:

    • length <= 6 - error
    • length >= 7 - no error

    String of length 7 fits in size_t (+1 byte for terminating '\0').

    Valgrind output before this patch applied:

    ==26448== Invalid read of size 8                                                                      
    ==26448==    at 0x60439F5: void Pire::Impl::DoRun<Pire::Impl::Scanner<Pire::Impl::Relocatable, Pire::Impl::ExitMasks<2ul> >, Pire::Impl::RunPred<Pire::Impl::Scanner<Pire::Impl::Relocatable, Pire::Impl::ExitMasks<2ul> > > >(Pire::Impl::Scanner<Pire::Impl::Relocatable, Pire::Impl::ExitMasks<2ul> > const&, Pire::Impl::Scanner<Pire::Impl::Relocatable, Pire::Impl::ExitMasks<2ul> >::State&, char const*, char const*, Pire::Impl::RunPred<Pire::Impl::Scanner<Pire::Impl::Relocatable, Pire::Impl::ExitMasks<2ul> > >) (run.h:129)                                                                                          
    ==26448==    by 0x603C906: void Pire::Run<Pire::Impl::Scanner<Pire::Impl::Relocatable, Pire::Impl::ExitMasks<2ul> > >(Pire::Impl::Scanner<Pire::Impl::Relocatable, Pire::Impl::ExitMasks<2ul> > const&, Pire::Impl::Scanner<Pire::Impl::Relocatable, Pire::Impl::ExitMasks<2ul> >::State&, char const*, char const*) (run.h:266)
    ==26448==    by 0x60438C6: Pire::RunHelper<Pire::Impl::Scanner<Pire::Impl::Relocatable, Pire::Impl::ExitMasks<2ul> > >::Run(char const*, char const*) (run.h:341)                                           
    ==26448==    by 0x603C899: Pire::RunHelper<Pire::Impl::Scanner<Pire::Impl::Relocatable, Pire::Impl::ExitMasks<2ul> > >::Run(char const*, unsigned long) (run.h:342) 
    

    Applying this patch the valgrind error was fixed.

  • Encoding issues

    Encoding issues

    I have a very simple regexp

    std::string r;
    r  = "(post|get|put|delete).*http/1\\.(1|0)\r\n.*\r\n\r\n";
    

    The test data is:

    GET / HTTP/1.1 User-Agent: chrome Host: ya.ru Accept: / Proxy-Connection: Keep-Alive

    Which has a proper format i.e. it contains \r\n after each line and extra \r\n after the header. Creation of the scanner is done with the following function:

    Pire::Scanner flow::scannerFor( std::string regexp ){
        if ( !regexp.size() ) {
            throw common::error( _ERR_EMPTY_REGEXP );
        }
    
        std::vector<Pire::wchar32> pattern;
        Pire::Scanner s;
    
        try {
            Pire::Encodings::Utf8().FromLocal( regexp.c_str(), regexp.c_str() + regexp.size(), std::back_inserter(pattern) );
    
            s = Pire::Lexer( pattern.begin(), pattern.end() )
                .SetEncoding( Pire::Encodings::Utf8() )
                .AddFeature( Pire::Features::CaseInsensitive() )
                .Parse()
                .Surround()
                .Compile<Pire::Scanner>();
    
        } catch ( ... ) {
            throw common::error( _ERR_REGEXP_COMPILE
                .arg( regexp ) 
            );
        }
    
        return s;
    }
    

    When I use scanner which is compiled for my regexp with the test data it fails to match. If I comment the line

    .SetEncoding( Pire::Encodings::Utf8() )

    in the scanner's creation function, scanner starts to match. Could you comment this situation?

  • Fix build (with newer autotools?)

    Fix build (with newer autotools?)

    On FreeBSD-9.1, with automake-1.14 autoconf-2.69, in generated pire/Makefile:in:

    ...
    pire_hdr_HEADERS = \
            align.h \
            any.h \
            defs.h \
            determine.h \
            easy.h \
            encoding.h \
            extra.h \
            fsm.h \
            fwd.h \
            glue.h \
            partition.h \
            pire.h \
            re_lexer.h \
            re_parser.h \
            run.h \
            static_assert.h \
            platform.h \
            vbitset.h
    ...
    BUILT_SOURCES = re_parser.h re_parser.cpp
    CLEANFILES = re_parser.h re_parser.cpp
    ...
    re_parser.hpp: re_parser.cpp
            @if test ! -f [email protected]; then rm -f re_parser.cpp; else :; fi
            @if test ! -f [email protected]; then $(MAKE) $(AM_MAKEFLAGS) re_parser.cpp; else :; fi
    ...
    

    e.g. a rule for generating re_parser.hpp is generated, but re_parser.h is expected.

    To keep header file naming uniform, either

    • all headers should be renamed to .hpp as well
    • maybe all sources may be renamed to .cc instead of this patch (haven't tested, just a wild guess)
    • there may be a way to tell autotools explicitly that .h should be generated
  • AdvancedCountingScanner as an alternative to CountingScanner

    AdvancedCountingScanner as an alternative to CountingScanner

    There are 3 main differences, see unit tests.

    1. AdvancedCountingScanner never enters a dead state, see test Count("[a-z\320\260-\321\217]+", " +", " \320\260\320\260\320\220 abc def \320\260 cd")
    2. AdvancedCountingScanner returns 1 for Count("a", "b", "aaa"), while CountingScanner returns 0 there.
    3. AdvancedCountingScanner better handles complex regular expressions, see test CountGreedy.
  • Compile Error

    Compile Error

    [email protected]:~/test/pire$ uname -a Linux ubuntu 2.6.28-11-server #42-Ubuntu SMP Fri Apr 17 02:45:36 UTC 2009 x86_64 GNU/Linux [email protected]:~/test/pire$ g++ -v Using built-in specs. Target: x86_64-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.3.3-5ubuntu4' --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3 --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --enable-mpfr --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu Thread model: posix gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4) [email protected]:~/test/pire$ make make all-recursive make[1]: Entering directory /home/yaoweibin/test/pire' Making all in pire make[2]: Entering directory/home/yaoweibin/test/pire/pire' make all-am make[3]: Entering directory /home/yaoweibin/test/pire/pire' /bin/bash ../ylwrap inline.lpp .c inline.cpp -- : make[3]: *** [inline.cpp] Error 1 make[3]: Leaving directory/home/yaoweibin/test/pire/pire' make[2]: *** [all] Error 2 make[2]: Leaving directory /home/yaoweibin/test/pire/pire' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory/home/yaoweibin/test/pire' make: *** [all] Error 2

  • Make a new release

    Make a new release

    Hi!

    Thank you for the such cool library! I'm interested in packaging the library into some dependecy managers. It will be much easier if you make a release on GitHub, so in a package recipe I will rely on some "stable" version instead of specific commit. E.g. it much easier to create a package for Conan with release.

    I found that from the last release (in 2013 - 7 years ago) there are a lot of changes. Will be fine if they will be released.

    Thank you!

  • Fix Parse() memory leak on invalid regexps

    Fix Parse() memory leak on invalid regexps

    All Any elements took place on bison stack. Normally they are deleting during parsing process but if error happens they will remain on stack. Bison provides destructor operator for cleaning stack. We can use it we but should assure that all stack elements are valid for delete operator.

  • Counting scanner with no limit on number of glued expressions

    Counting scanner with no limit on number of glued expressions

    Added new class NoGlueLimitCountingScanner. It works exactly like AdvancedCountingScanner, but does not have limit on the number of glued regular expressions. New class is added to existing unit tests.

    Added unit test for gluing many regular expressions. Current problem: classes CountingScanner and AdvancedCountingScanner don't check whether the limit on number of expressions is exceeded. Glueing too many expressions just leads to incorrect count results. Fix made: now Glue() function returns empty scanner when limit is exceeded. This corresponds to return value when glueing fails due to limit on number of states.

    Some minor fixes and improvements: added move constructors, fixed compiler warnings, etc.

  • Add SlowCapturingScanner

    Add SlowCapturingScanner

    Add the new version of CapturingScanner based on NFA, which is more correct than the previous one. Add new operators of 'non-greedy' repetitions, for example, '+?', '??', '*?', which give a chance for users to tell, that the text, captured by those repetitions, should be the shortest one Added c++11 flag to configure.ac

  • Clarify the comment on precedence of features

    Clarify the comment on precedence of features

    IMHO the wording "the higher the priority" is somewhat misleading, as if it naturally means "the greater the priority", the statement is wrong, and if it means "the less the priority", it is not obvious.

    Also fixed a typo in "eariler".

  • Python binding

    Python binding

    This is a cython-based python binding of PIRE. Mako template engine is used heavily to address repetitiveness of similar interfaces.

    What is not wrapped:

    • It is impossible yet to subclass Features and Scanners in python and pass the extension back to C++;
    • Most low-level operations with Fsm;
    • mmap- and Action-related methods and functions;
    • Run for pair of scanners;
    • Feature and Encoding classes.

    Interface of the binding is similar to the original one. Differences:

    • All C++-space global template functions are wrapped as python instance methods.
    • Fsm::operator * () is wrapped as Fsm.Iterated().
    • All scanners' states are represented as classes similar to Pire::RunHelper.
    • Encoding, Feature and Option abstractions are replaced with single Options abstraction, which is used to tweak Lexer behavior. Options can be either parsed from string such as "aiyu" or composed of predefined constants such as I and UTF8.
    • Instead of lexer::AddFeature(Capture(42)) you use lexer.AddCapturing(42).
    • An OverflowError is raised on unsuccessful glue operation instead of returning empty scanner.
  • Conan package

    Conan package

    Hello, Do you know about Conan? Conan is modern dependency manager for C++. And will be great if your library will be available via package manager for other developers.

    Here you can find example, how you can create package for the library.

    If you have any questions, just ask :-)

  • fix possible warning in multi.h

    fix possible warning in multi.h

    Fix the following warning:

    '*((void*)& s +72)' may be used uninitialized in this function
    /multi.h:243:11: note: '*((void*)& s +72)' was declared here
    Scanner s;
    
  • add LettersCount() to SlowScanner

    add LettersCount() to SlowScanner

    Other scanner types have this method. SlowScanner can implement it easily. I write template code parametrized with scanner type, in which I need LettersCount.

Onigmo is a regular expressions library forked from Oniguruma.

Onigmo (Oniguruma-mod) https://github.com/k-takata/Onigmo Onigmo is a regular expressions library forked from Oniguruma. It focuses to support new exp

Dec 6, 2022
C++ regular expressions made easy

CppVerbalExpressions C++ Regular Expressions made easy VerbalExpressions is a C++11 Header library that helps to construct difficult regular expressio

Nov 29, 2022
A non-backtracking NFA/DFA-based Perl-compatible regex engine matching on large data streams

Name libsregex - A non-backtracking NFA/DFA-based Perl-compatible regex engine library for matching on large data streams Table of Contents Name Statu

Dec 22, 2022
SRL-CPP is a Simple Regex Language builder library written in C++11 that provides an easy to use interface for constructing both simple and complex regex expressions.

SRL-CPP SRL-CPP is a Simple Regex Language builder library written in C++11 that provides an easy to use interface for constructing both simple and co

Mar 9, 2022
High-performance regular expression matching library

Hyperscan Hyperscan is a high-performance multiple regex matching library. It follows the regular expression syntax of the commonly-used libpcre libra

Jan 1, 2023
regular expression library

Oniguruma https://github.com/kkos/oniguruma Oniguruma is a modern and flexible regular expressions library. It encompasses features from different reg

Jan 3, 2023
A Compile time PCRE (almost) compatible regular expression matcher.

Compile time regular expressions v3 Fast compile-time regular expressions with support for matching/searching/capturing during compile-time or runtime

Jan 5, 2023
A small implementation of regular expression matching engine in C

cregex cregex is a compact implementation of regular expression (regex) matching engine in C. Its design was inspired by Rob Pike's regex-code for the

Dec 6, 2022
The approximate regex matching library and agrep command line tool.

Introduction TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzz

Dec 26, 2022
Glob pattern to regex translator in C++11. Optionally, directory traversal with glob pattern in C++17. Header-only library.

Glob pattern to regex translator in C++11. Optionally, directory traversal with glob pattern in C++17. Header-only library.

Oct 27, 2021
RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.

This is the source code repository for RE2, a regular expression library. For documentation about how to install and use RE2, visit https://github.co

Jan 4, 2023
Onigmo is a regular expressions library forked from Oniguruma.

Onigmo (Oniguruma-mod) https://github.com/k-takata/Onigmo Onigmo is a regular expressions library forked from Oniguruma. It focuses to support new exp

Dec 6, 2022
A system to flag anomalous source code expressions by learning typical expressions from training data
A system to flag anomalous source code expressions by learning typical expressions from training data

A friendly request: Thanks for visiting control-flag GitHub repository! If you find control-flag useful, we would appreciate a note from you (to niran

Dec 30, 2022
C++ regular expressions made easy

CppVerbalExpressions C++ Regular Expressions made easy VerbalExpressions is a C++11 Header library that helps to construct difficult regular expressio

Nov 29, 2022
A powerful and fast search tool using regular expressions
A powerful and fast search tool using regular expressions

A powerful and fast search tool using regular expressions

Jan 8, 2023
A non-backtracking NFA/DFA-based Perl-compatible regex engine matching on large data streams

Name libsregex - A non-backtracking NFA/DFA-based Perl-compatible regex engine library for matching on large data streams Table of Contents Name Statu

Dec 22, 2022
A C++ Web Framework built on top of Qt, using the simple approach of Catalyst (Perl) framework.

Cutelyst - The Qt Web Framework A Web Framework built on top of Qt, using the simple and elegant approach of Catalyst (Perl) framework. Qt's meta obje

Dec 19, 2022
A single-header C/C++ library for parsing and evaluation of arithmetic expressions

ceval A C/C++ header for parsing and evaluation of arithmetic expressions. [README file is almost identical to that of the ceval library] Functions ac

Oct 10, 2022
A single-header C/C++ library for parsing and evaluation of arithmetic expressions

ceval A C/C++ header for parsing and evaluation of arithmetic expressions. [README file is almost identical to that of the ceval library] Functions ac

Oct 10, 2022
SRL-CPP is a Simple Regex Language builder library written in C++11 that provides an easy to use interface for constructing both simple and complex regex expressions.

SRL-CPP SRL-CPP is a Simple Regex Language builder library written in C++11 that provides an easy to use interface for constructing both simple and co

Mar 9, 2022