9fans archive / 2008 / 10 / 384 /    prev next

From: Eris Discordia <eris.discordia@gma...>
Subject: Re: [9fans] non greedy regular expressions
Date: Mon, 27 Oct 2008 13:13:57 +0000

> practical application.  now there are big books on `regular expressions'
> mainly because they are no longer regular but a big collection of ad-hoc

I thought they were "regular" because they "regularly" occurred in the 
target text. Turns out other interpretations are possible. Though, mine has 
the advantage of being correct :-D

The set of "big books on regular expressions" includes Jeffrey Friedl's 
"Mastering Regular Expressions" that happens to contain a chapter by the 
title "NFA, DFA, and POSIX" wherein he says:

> DFA Speed with NFA Capabilities: Regex Nirvana?
>
> I've said several times that a DFA can't provide capturing parentheses or
> backreferences. This is quite true, but it certainly doesn't preclude
> hybrid approaches that mix technologies in an attempt to reach regex
> nirvana. The sidebar in Section 4.6.3.1 told how NFAs have diverged from
> the theoretical straight and narrow in search of more power, and it's
> only natural that the same happens with DFAs. A DFA's construction makes
> it more difficult, but that doesn't mean impossible.
>
> GNU grep takes a simple but effective approach. It uses a DFA when
> possible, reverting to an NFA when backreferences are used. GNU awk does
> something similar—it uses GNU grep's fast shortest-leftmost DFA engine
> for simple "does it match" checks, and reverts to a different engine for
> checks where the actual extent of the match must be known. Since that
> other engine is an NFA, GNU awk can conveniently offer capturing
> parentheses, and it does via its special gensub function.
>
> Tcl's regex engine is a true hybrid, custom built by Henry Spencer (whom
> you may remember having played an important part in the early development
> and popularization of regular expressions see Section 3.1.1.6). The Tcl
> engine sometimes appears to be an NFA— it has lookaround, capturing
> parentheses, backreferences, and lazy quantifiers. Yet, it has true POSIX
> longest-leftmost match (see Section 4.6.1), and doesn't suffer from some
> of the NFA problems that we'll see in Chapter 6. It really seems quite
> wonderful.
>
> Currently, this engine is available only to Tcl, but Henry tells me that
> it's on his to-do list to break it out into a separate package that can
> be used by others.

Again, turns out the "big books on regular expressions" can give the 
lowlife--that's me--things "hackers" deny them.

--On Monday, October 27, 2008 10:18 AM +0000 Charles Forsyth 
<forsyth@ter...> wrote:

>> Both systems are complex enough that essentially no one completely
>> understands them.
>
> this touches on an important point. the first introduction of regular
> expressions to editors was great, because it took some formal language
> theory and made it useful in an `every day' way. conversely, the theory
> provided significant underpinning (in a structural sense) to that
> practical application.  now there are big books on `regular expressions'
> mainly because they are no longer regular but a big collection of ad-hoc
> additions. (i think there is a similarity to Perl's relationship to the
> history of programming languages.) the result really is hard to reason
> about ... because it hasn't got that underpinning and the algebra has
> gone. it's worse than that: the code is a mess too, i supposed in
> consequence.
>