This is PIRE, Perl Incompatible Regular Expressions library. This library is aimed at checking a huge amount of text against relatively many regular expressions. Roughly speaking, it can just check whether given text maches the certain regexp, but can do it really fast (more than 400 MB/s on our hardware is common). Even more, multiple regexps can be combined together, giving capability to check the text against apx.10 regexps in a single pass (and mantaining the same speed). Since Pire examines each character only once, without any lookaheads or rollbacks, spending about five machine instructions per each character, it can be used even in realtime tasks. On the other hand, Pire has very limited functionality (compared to other regexp libraries). Pire does not have any Perlish conditional regexps, lookaheads & backtrackings, greedy/nongreedy matches; neither has it any capturing facilities. Pire was developed in Yandex (http://company.yandex.ru/) as a part of its web crawler. More information can be found in README.ru (in Russian), which is yet to be translated. Please report bugs to [email protected] or [email protected] Quick Start ============= #include <stdio.h> #include <vector> #include <pire/pire.h> Pire::NonrelocScanner CompileRegexp(const char* pattern) { // Transform the pattern from UTF-8 into UCS4 std::vector<Pire::wchar32> ucs4; Pire::Encodings::Utf8().FromLocal(pattern, pattern + strlen(pattern), std::back_inserter(ucs4)); return Pire::Lexer(ucs4.begin(), ucs4.end()) .AddFeature(Pire::Features::CaseInsensitive()) // enable case insensitivity .SetEncoding(Pire::Encodings::Utf8()) // set input text encoding .Parse() // create an FSM .Surround() // PCRE_ANCHORED behavior .Compile<Pire::NonrelocScanner>(); // compile the FSM } bool Matches(const Pire::NonrelocScanner& scanner, const char* ptr, size_t len) { return Pire::Runner(scanner) .Begin() // '^' .Run(ptr, len) // the text .End(); // '$' // implicitly cast to bool } int main() { char re[] = "hello\\s+w.+d$"; char str[] = "Hello world"; Pire::NonrelocScanner sc = CompileRegexp(re); bool res = Matches(sc, str, strlen(str)); printf("String \"%s\" %s \"%s\"\n", str, (res ? "matches" : "doesn't match"), re); return 0; }
Perl Incompatible Regular Expressions library
Owner
Yandex
Comments
-
DoRun: fix possible invalid read
If a string passed to
DoRun
is less thansizeof(size_t)
, then its content can't be read as if its type issize_t*
.DoRun
splits the string to 3 parts: pre-aligned, aligned and post-aligned. If there is no aligned part of a string, then whole string is treated as pre- and post-aligned.The error was caused by casting pre- and post-aligned parts to
size_t*
. Now they are processed by code operating onchar*
(functionSafeRunChunk
). (Actually pre-aligned part is allowed to be casted tosize_t*
, if aligned part exists, ie. length of the string >size(size_t)
.)Target strings:
- length <= 6 - error
- length >= 7 - no error
String of length 7 fits in size_t (+1 byte for terminating '\0').
Valgrind output before this patch applied:
==26448== Invalid read of size 8 ==26448== at 0x60439F5: void Pire::Impl::DoRun<Pire::Impl::Scanner<Pire::Impl::Relocatable, Pire::Impl::ExitMasks<2ul> >, Pire::Impl::RunPred<Pire::Impl::Scanner<Pire::Impl::Relocatable, Pire::Impl::ExitMasks<2ul> > > >(Pire::Impl::Scanner<Pire::Impl::Relocatable, Pire::Impl::ExitMasks<2ul> > const&, Pire::Impl::Scanner<Pire::Impl::Relocatable, Pire::Impl::ExitMasks<2ul> >::State&, char const*, char const*, Pire::Impl::RunPred<Pire::Impl::Scanner<Pire::Impl::Relocatable, Pire::Impl::ExitMasks<2ul> > >) (run.h:129) ==26448== by 0x603C906: void Pire::Run<Pire::Impl::Scanner<Pire::Impl::Relocatable, Pire::Impl::ExitMasks<2ul> > >(Pire::Impl::Scanner<Pire::Impl::Relocatable, Pire::Impl::ExitMasks<2ul> > const&, Pire::Impl::Scanner<Pire::Impl::Relocatable, Pire::Impl::ExitMasks<2ul> >::State&, char const*, char const*) (run.h:266) ==26448== by 0x60438C6: Pire::RunHelper<Pire::Impl::Scanner<Pire::Impl::Relocatable, Pire::Impl::ExitMasks<2ul> > >::Run(char const*, char const*) (run.h:341) ==26448== by 0x603C899: Pire::RunHelper<Pire::Impl::Scanner<Pire::Impl::Relocatable, Pire::Impl::ExitMasks<2ul> > >::Run(char const*, unsigned long) (run.h:342)
Applying this patch the valgrind error was fixed.
-
Encoding issues
I have a very simple regexp
std::string r; r = "(post|get|put|delete).*http/1\\.(1|0)\r\n.*\r\n\r\n";
The test data is:
GET / HTTP/1.1 User-Agent: chrome Host: ya.ru Accept: / Proxy-Connection: Keep-Alive
Which has a proper format i.e. it contains \r\n after each line and extra \r\n after the header. Creation of the scanner is done with the following function:
Pire::Scanner flow::scannerFor( std::string regexp ){ if ( !regexp.size() ) { throw common::error( _ERR_EMPTY_REGEXP ); } std::vector<Pire::wchar32> pattern; Pire::Scanner s; try { Pire::Encodings::Utf8().FromLocal( regexp.c_str(), regexp.c_str() + regexp.size(), std::back_inserter(pattern) ); s = Pire::Lexer( pattern.begin(), pattern.end() ) .SetEncoding( Pire::Encodings::Utf8() ) .AddFeature( Pire::Features::CaseInsensitive() ) .Parse() .Surround() .Compile<Pire::Scanner>(); } catch ( ... ) { throw common::error( _ERR_REGEXP_COMPILE .arg( regexp ) ); } return s; }
When I use scanner which is compiled for my regexp with the test data it fails to match. If I comment the line
.SetEncoding( Pire::Encodings::Utf8() )
in the scanner's creation function, scanner starts to match. Could you comment this situation?
-
Fix build (with newer autotools?)
On FreeBSD-9.1, with automake-1.14 autoconf-2.69, in generated pire/Makefile:in:
... pire_hdr_HEADERS = \ align.h \ any.h \ defs.h \ determine.h \ easy.h \ encoding.h \ extra.h \ fsm.h \ fwd.h \ glue.h \ partition.h \ pire.h \ re_lexer.h \ re_parser.h \ run.h \ static_assert.h \ platform.h \ vbitset.h ... BUILT_SOURCES = re_parser.h re_parser.cpp CLEANFILES = re_parser.h re_parser.cpp ... re_parser.hpp: re_parser.cpp @if test ! -f [email protected]; then rm -f re_parser.cpp; else :; fi @if test ! -f [email protected]; then $(MAKE) $(AM_MAKEFLAGS) re_parser.cpp; else :; fi ...
e.g. a rule for generating re_parser.hpp is generated, but re_parser.h is expected.
To keep header file naming uniform, either
- all headers should be renamed to .hpp as well
- maybe all sources may be renamed to .cc instead of this patch (haven't tested, just a wild guess)
- there may be a way to tell autotools explicitly that .h should be generated
-
AdvancedCountingScanner as an alternative to CountingScanner
There are 3 main differences, see unit tests.
- AdvancedCountingScanner never enters a dead state, see test
Count("[a-z\320\260-\321\217]+", " +", " \320\260\320\260\320\220 abc def \320\260 cd")
- AdvancedCountingScanner returns 1 for
Count("a", "b", "aaa")
, while CountingScanner returns 0 there. - AdvancedCountingScanner better handles complex regular expressions, see test CountGreedy.
- AdvancedCountingScanner never enters a dead state, see test
-
Compile Error
[email protected]:~/test/pire$ uname -a Linux ubuntu 2.6.28-11-server #42-Ubuntu SMP Fri Apr 17 02:45:36 UTC 2009 x86_64 GNU/Linux [email protected]:~/test/pire$ g++ -v Using built-in specs. Target: x86_64-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.3.3-5ubuntu4' --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3 --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --enable-mpfr --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu Thread model: posix gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4) [email protected]:~/test/pire$ make make all-recursive make[1]: Entering directory
/home/yaoweibin/test/pire' Making all in pire make[2]: Entering directory
/home/yaoweibin/test/pire/pire' make all-am make[3]: Entering directory/home/yaoweibin/test/pire/pire' /bin/bash ../ylwrap inline.lpp .c inline.cpp -- : make[3]: *** [inline.cpp] Error 1 make[3]: Leaving directory
/home/yaoweibin/test/pire/pire' make[2]: *** [all] Error 2 make[2]: Leaving directory/home/yaoweibin/test/pire/pire' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory
/home/yaoweibin/test/pire' make: *** [all] Error 2 -
Make a new release
Hi!
Thank you for the such cool library! I'm interested in packaging the library into some dependecy managers. It will be much easier if you make a release on GitHub, so in a package recipe I will rely on some "stable" version instead of specific commit. E.g. it much easier to create a package for Conan with release.
I found that from the last release (in 2013 - 7 years ago) there are a lot of changes. Will be fine if they will be released.
Thank you!
-
Fix Parse() memory leak on invalid regexps
All Any elements took place on bison stack. Normally they are deleting during parsing process but if error happens they will remain on stack. Bison provides destructor operator for cleaning stack. We can use it we but should assure that all stack elements are valid for delete operator.
-
Counting scanner with no limit on number of glued expressions
Added new class NoGlueLimitCountingScanner. It works exactly like AdvancedCountingScanner, but does not have limit on the number of glued regular expressions. New class is added to existing unit tests.
Added unit test for gluing many regular expressions. Current problem: classes CountingScanner and AdvancedCountingScanner don't check whether the limit on number of expressions is exceeded. Glueing too many expressions just leads to incorrect count results. Fix made: now Glue() function returns empty scanner when limit is exceeded. This corresponds to return value when glueing fails due to limit on number of states.
Some minor fixes and improvements: added move constructors, fixed compiler warnings, etc.
-
Add SlowCapturingScanner
Add the new version of CapturingScanner based on NFA, which is more correct than the previous one. Add new operators of 'non-greedy' repetitions, for example, '+?', '??', '*?', which give a chance for users to tell, that the text, captured by those repetitions, should be the shortest one Added c++11 flag to configure.ac
-
Clarify the comment on precedence of features
IMHO the wording "the higher the priority" is somewhat misleading, as if it naturally means "the greater the priority", the statement is wrong, and if it means "the less the priority", it is not obvious.
Also fixed a typo in "eariler".
-
Python binding
This is a cython-based python binding of PIRE. Mako template engine is used heavily to address repetitiveness of similar interfaces.
What is not wrapped:
- It is impossible yet to subclass
Feature
s andScanner
s in python and pass the extension back to C++; - Most low-level operations with Fsm;
mmap
- andAction
-related methods and functions;Run
for pair of scanners;Feature
andEncoding
classes.
Interface of the binding is similar to the original one. Differences:
- All C++-space global template functions are wrapped as python instance methods.
Fsm::operator * ()
is wrapped asFsm.Iterated()
.- All scanners' states are represented as classes similar to
Pire::RunHelper
. Encoding
,Feature
andOption
abstractions are replaced with singleOptions
abstraction, which is used to tweak Lexer behavior.Options
can be either parsed from string such as"aiyu"
or composed of predefined constants such asI
andUTF8
.- Instead of
lexer::AddFeature(Capture(42))
you uselexer.AddCapturing(42)
. - An
OverflowError
is raised on unsuccessful glue operation instead of returning empty scanner.
- It is impossible yet to subclass
-
fix possible warning in multi.h
Fix the following warning:
'*((void*)& s +72)' may be used uninitialized in this function /multi.h:243:11: note: '*((void*)& s +72)' was declared here Scanner s;
-
add LettersCount() to SlowScanner
Other scanner types have this method. SlowScanner can implement it easily. I write template code parametrized with scanner type, in which I need
LettersCount
.
Related tags
Onigmo is a regular expressions library forked from Oniguruma.
Onigmo (Oniguruma-mod) https://github.com/k-takata/Onigmo Onigmo is a regular expressions library forked from Oniguruma. It focuses to support new exp
C++ regular expressions made easy
CppVerbalExpressions C++ Regular Expressions made easy VerbalExpressions is a C++11 Header library that helps to construct difficult regular expressio
A non-backtracking NFA/DFA-based Perl-compatible regex engine matching on large data streams
Name libsregex - A non-backtracking NFA/DFA-based Perl-compatible regex engine library for matching on large data streams Table of Contents Name Statu
SRL-CPP is a Simple Regex Language builder library written in C++11 that provides an easy to use interface for constructing both simple and complex regex expressions.
SRL-CPP SRL-CPP is a Simple Regex Language builder library written in C++11 that provides an easy to use interface for constructing both simple and co
High-performance regular expression matching library
Hyperscan Hyperscan is a high-performance multiple regex matching library. It follows the regular expression syntax of the commonly-used libpcre libra
regular expression library
Oniguruma https://github.com/kkos/oniguruma Oniguruma is a modern and flexible regular expressions library. It encompasses features from different reg
A Compile time PCRE (almost) compatible regular expression matcher.
Compile time regular expressions v3 Fast compile-time regular expressions with support for matching/searching/capturing during compile-time or runtime
A small implementation of regular expression matching engine in C
cregex cregex is a compact implementation of regular expression (regex) matching engine in C. Its design was inspired by Rob Pike's regex-code for the
The approximate regex matching library and agrep command line tool.
Introduction TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzz
Glob pattern to regex translator in C++11. Optionally, directory traversal with glob pattern in C++17. Header-only library.
Glob pattern to regex translator in C++11. Optionally, directory traversal with glob pattern in C++17. Header-only library.
RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.
This is the source code repository for RE2, a regular expression library. For documentation about how to install and use RE2, visit https://github.co
Onigmo is a regular expressions library forked from Oniguruma.
Onigmo (Oniguruma-mod) https://github.com/k-takata/Onigmo Onigmo is a regular expressions library forked from Oniguruma. It focuses to support new exp
A system to flag anomalous source code expressions by learning typical expressions from training data
A friendly request: Thanks for visiting control-flag GitHub repository! If you find control-flag useful, we would appreciate a note from you (to niran
C++ regular expressions made easy
CppVerbalExpressions C++ Regular Expressions made easy VerbalExpressions is a C++11 Header library that helps to construct difficult regular expressio
A powerful and fast search tool using regular expressions
A powerful and fast search tool using regular expressions
A non-backtracking NFA/DFA-based Perl-compatible regex engine matching on large data streams
Name libsregex - A non-backtracking NFA/DFA-based Perl-compatible regex engine library for matching on large data streams Table of Contents Name Statu
A C++ Web Framework built on top of Qt, using the simple approach of Catalyst (Perl) framework.
Cutelyst - The Qt Web Framework A Web Framework built on top of Qt, using the simple and elegant approach of Catalyst (Perl) framework. Qt's meta obje
A single-header C/C++ library for parsing and evaluation of arithmetic expressions
ceval A C/C++ header for parsing and evaluation of arithmetic expressions. [README file is almost identical to that of the ceval library] Functions ac
A single-header C/C++ library for parsing and evaluation of arithmetic expressions
ceval A C/C++ header for parsing and evaluation of arithmetic expressions. [README file is almost identical to that of the ceval library] Functions ac
SRL-CPP is a Simple Regex Language builder library written in C++11 that provides an easy to use interface for constructing both simple and complex regex expressions.
SRL-CPP SRL-CPP is a Simple Regex Language builder library written in C++11 that provides an easy to use interface for constructing both simple and co