languages/regex: Compile regular expressions into optimized bytecode. NOTE: This Is Not The Matching Engine You're Looking For. See the bottom of this readme, in the section "WHAT ABOUT P6GE?" RUNNING ======= To run a bunch of tests: make test To compile an arbitrary regexp down to code: perl regex.pl '(a*|b)a' (Note that on some platforms, including Windows, you may need to use double quotes in place of the single quotes.) To turn off optimization so you tell what's going on a little better: perl regex.pl --no-optimize '(a*|b)a' Actually, there are two optimization phases: one for tree-based optimizations (reorganizing the parse tree, adding and deleting nodes) and list-based optimizations (really just peephole optimizations). You can enable just one or the other using 't' for Tree and 'l' for List: perl regex.pl --optimize=t '(a*|b)a' perl regex.pl --optimize=l '(a*|b)a' To do the same thing, only generating perl5 backend code instead of PIR code: perl regex.pl --language=perl5 '(a*|b)a' For development, it may also be useful to dump out the tree (remember, by default it will be the optimized tree): perl regex.pl '(a*|b)a' dump Or you can have it render the tree back to a (Perl5 syntax) regular expression: perl regex.pl '(a*|b)a' render All of this really ought to be in a usage message. To run any of the generated code, you need the Match and MatchRange PMCs. They should be generated automatically now by the root makefile, but if for some reason they aren't, run 'make' in src/dynpmc/ to generate them. The above commands only generate subroutines for matching regular expressions. You will need to provide your own code to call the matching routine with some input. Or, just to try it out, you can use the --main argument to regex.pl to generate a sample driver program that accepts an input string on the command line: perl regex.pl --main -o mytest.pir 'a*x' ../../parrot mytest.pir baax (output) Match found 0: 1..3 Or for the Perl5 backend: perl regex.pl --language=perl5 --main -o mytest.pl 'a*x' perl mytest.pl baax Note that the output in this case is a bunch of nasty Perl data structures encoding the matching information. New stuff: now you can use the compiler as an embedded Parrot compiler. Run 'make regex-compiler.pbc' to generate regex-compiler.pbc. Then run ../../parrot 01_basic.pir, ../../parrot 02_date.pir, etc. These will dynamically load in the regex-compiler as the compiler for the "regex" language, and then use it to compile some code and execute it. Also, I have hacked multiple rule definitions into the silly frontend parser. I don't intend to stick with this syntax for long, since I want to get to a point soon where I can bootstrap this engine's own parser, but for now you can define rules with '&rulename=...'. So to compile a simple expression parser, put this in a file 'expr.rx': &expr = \* | \/ | &term = \+ | \- | &atom = [0-9]+ | \( \) &default = and run (on Unix): perl regex.pl --main --language=perl5 --file=expr.rx > expr.pl perl expr.pl '3+(45*2)' All of the things are there so that the rules get nicely captured -- by default, nested rule calls are *not* capturing. You can also insert actions (coded in Perl5, or the language of whichever backend you are using), but currently there is no nice way of setting the value of a rule, so they are of limited usefulness. An example: perl regex.pl --language=perl5 --main 'a* { print "Matched some as\n" } b' In some cases, you can also access the current state of the match: perl regex.pl --language=perl5 --main 'a* { print "Matched $MATCH{0}\n" } b' The first example should now also work with --language=pir (the default) as well. The second example wouldn't, because it isn't valid PIR. Examples of using regex.pl to generate code that is compiled on the fly are in 01_basic.pir and 02_date.pir. If you use the -d flag, you'll get a very verbose description of the matching progress as you run. STATUS ====== Everything should be more or less working, though many operators are untested. Theoretically, this release should support (using Perl5 syntax here): RS - sequences R|S - alternation R* - greedy Kleene closure R*? - nongreedy (parsimonious) Kleene closure R? - greedy optional R?? - nongreedy (parsimonious) optional R+ - greedy one or more ("more or one") R+? - nongreedy one or more (R) - capturing groups (?:R) - noncapturing grouping a - codepoint literals [a-z] - character classes {n,m} - greedily match n..m times {n,m}? - nongreedily match n..m times - rule invocation (Perl6 feature) (?{ }) - embedded code (just {...} in Perl6) Missing perl5 features: \w - named character classes \U - arbitrary codepoints \p - properties \b - word boundary /ism - flags \1 - back references \G - start position (?=R) - look-ahead assertion (?!R) - negative look-ahead assertion (?<=) - look-behind assertion (?R) - independent subexpression (?(cond)R|S) - conditional expression (R?)* - empty match suppression Missing perl6 features: : :: ::: hypotheticals - code refs with two entry points (enter and backtrack) @var - run-time alternatives <@var> - run-time alternative regexps (requires run-time regex compiler) %var - hash key alternatives <%var> - as above, but values are always regexps - negation (what is this?) (is this just negative lookahead?) ... - and many, many, many more Other features I'd like to have: @A =~ - array matching ? - reentrant, suspendable, coroutinish regexes ? - two-dimensional regexes Regular expressions are compiled down to regular opcodes (not the rx_* set of opcodes). A *Array PMC of some sort is used as the backtracking stack. See rx.ops for a good description of how operators are converted to code sequences (except I don't use the rx_* ops that it describes). Marks are the value -1; indices are nonnegative integers. (Except in debugging mode, when marks are instead strings describing what they're marking.) For details on exactly how things are translated with this compiler, read Rewrite.pm. "Local" variables: * One integer temporary, not preserved between tree op rewrites (it's just a very short-term temp register) * The current position within the input string * The length of the input string * A Match PMC, for holding start and stop positions for matched groups * Other stuff, but this list is hopelessly out of date at the moment Optimizations implemented (notation: parentheses here are non-capturing): aR|aS -> a(R|S) Rb|Sb -> (R|S)b R| -> R? |R -> R?? Future plans: Relatively soon, I would like to add array-based regular expressions. A simple cut of this should be nearly trivial. Near-term optimizations planned: Simple subexpression alternation: the code for alternations can be simplified if the subexpressions do not contain backtrack points. Disjunctive alternation: if you see R|S, and know that only one of R or S will ever hold at a given point in any input, then no backtracking information needs to be kept. For example, consider cat|fish (or somewhat more generally, cR|fS). The input cannot both start with c and f, so just matching 'c' first. If it matches, keep it and never go back to trying 'f'. Otherwise, forget about it completely and try 'f'. As a follow-on to the above, implement jump tables. c -> $start_R f -> $start_S else -> backtrack Length check suppression: if you try to match /abc/, the unoptimized engine will check to be sure there's at least one character left, then match 'a', then check to be sure there's still at least one character left, then match 'b', etc. The tree optimization pass will transform this into: check if there are at least 3 characters left, then match 'a', then match 'b', then match 'c'. Obviously, this particular case might be further optimized by doing some kind of fixed substring match (or a Boyer-Moore "skip a number of characters determined by the input and the fixed substring"), but this optimization works for more general cases (such as C<< a.[0-9](x|\w) >>). Longer-term optimization vague ideas: Find maximal subtrees of regex ops that can be converted to DFAs. Translate them into in-line DFAs. The jump tables above are a primitive form of this. The hard part is figuring out whether a DFA would produce exactly the same results as an NFA for a given expression. (It's a cost/benefit game. Some expressions trivially behave the same, some trivially behave differently, and some are difficult to determine.) BUGS ==== - The miserable state of documentation - The ad-hoc list op data structure - The lack of complex test cases DEVELOPER NOTES =============== If you make changes to Grammar.yp, you'll need Parse::Yapp to regenerate Grammar.pm. Run 'make' with no options to pass the correct command-line parameters. For a painfully detailed, blow-by-blow description of the latter part of the development of this compiler, see http://0xdeadbeef.net/wiki/wiki.pl?FinkBlog Original author: Steve Fink WHAT ABOUT P6GE? ================ This rule compiler is not the officially blessed one. That would be Patrick Michaud's PGE (Parrot Grammar Engine) , which you can find in compilers/pge. This engine (in languages/regex, referred to in the following text as l/rx) predates pge by a year or two, but never managed to generate sufficient interest to get anyone else involved. I am assuming that PGE is going to get the momentum of the community behind it, so I would advise anyone interested in working on a rule engine to look there first. (Look for discussion on the parrot-porters and perl6-compiler mailing lists.) However, I still intend to work on this engine for a while longer, and welcome any interested participants. (Send any requests/comments/suggestions either to parrot-porters or directly to me at steve@fink.com.) So far, I have only briefly looked at PGE, but I think this languages/regex engine has enough of a different approach that it is still valuable for gathering lessons -- and may still make the most sense in the long run. My take on the comparison between l/rx and PGE as of the first public release of PGE: It sounds like l/rx handles pretty much exactly the same things as PGE, probably a few more and a few less. I haven't actually looked at the code, but from the description I'd guess that the main differences are: - PGE is implemented in C; l/rx in Perl5. Both are reaching towards the "bootstrap point", when they'll be implemented in PIR. - PGE generates PIR; l/rx has both PIR and Perl5 backends - PGE uses coroutines and continuations; I have always been too wary of their stability, so I use plain subs (with a 'mode' parameter to tell it whether to try to match or backtrack) - Both allow you to "continue" a match to find all other possible matches, but that's really because it falls out of any sane implementation (you have to keep all that state around somehow anyway) - l/rx uses match objects (src/dynpmc/match.pmc) and automatically generates a parse tree out of them. PGE has a dynamically created Parrot class "PGE::Match" that I assume does something similar. - PGE has built-in "dump out the matching info" routines; I make my test harnesses generate their own. I'm jealous. - The feature sets are nearly identical. Makes sense, I suppose -- low-hanging fruit and all that. - It sounds like the internal design is rather different. I try hard to compile down to very minimalistic PIR ops. It sounds like PGE uses lots of higher-level operations, to do things like processing a whole chunk of input at a time. (Although on the other hand, PGE uses more native Parrot flow control mechanisms than I do.) (And I really haven't looked closely enough to substantiate this.) - Closely related to the above, I have a number of optimizations already implemented, but I suspect PGE will end up with a very different set of optimizations.