9fans archive / 2008 / 10 / 384 / prev next From: Eris Discordia <eris.discordia@gma...> Subject: Re: [9fans] non greedy regular expressions Date: Mon, 27 Oct 2008 13:13:57 +0000 > practical application. now there are big books on `regular expressions' > mainly because they are no longer regular but a big collection of ad-hoc I thought they were "regular" because they "regularly" occurred in the target text. Turns out other interpretations are possible. Though, mine has the advantage of being correct :-D The set of "big books on regular expressions" includes Jeffrey Friedl's "Mastering Regular Expressions" that happens to contain a chapter by the title "NFA, DFA, and POSIX" wherein he says: > DFA Speed with NFA Capabilities: Regex Nirvana? > > I've said several times that a DFA can't provide capturing parentheses or > backreferences. This is quite true, but it certainly doesn't preclude > hybrid approaches that mix technologies in an attempt to reach regex > nirvana. The sidebar in Section 4.6.3.1 told how NFAs have diverged from > the theoretical straight and narrow in search of more power, and it's > only natural that the same happens with DFAs. A DFA's construction makes > it more difficult, but that doesn't mean impossible. > > GNU grep takes a simple but effective approach. It uses a DFA when > possible, reverting to an NFA when backreferences are used. GNU awk does > something similar—it uses GNU grep's fast shortest-leftmost DFA engine > for simple "does it match" checks, and reverts to a different engine for > checks where the actual extent of the match must be known. Since that > other engine is an NFA, GNU awk can conveniently offer capturing > parentheses, and it does via its special gensub function. > > Tcl's regex engine is a true hybrid, custom built by Henry Spencer (whom > you may remember having played an important part in the early development > and popularization of regular expressions see Section 3.1.1.6). The Tcl > engine sometimes appears to be an NFA— it has lookaround, capturing > parentheses, backreferences, and lazy quantifiers. Yet, it has true POSIX > longest-leftmost match (see Section 4.6.1), and doesn't suffer from some > of the NFA problems that we'll see in Chapter 6. It really seems quite > wonderful. > > Currently, this engine is available only to Tcl, but Henry tells me that > it's on his to-do list to break it out into a separate package that can > be used by others. Again, turns out the "big books on regular expressions" can give the lowlife--that's me--things "hackers" deny them. --On Monday, October 27, 2008 10:18 AM +0000 Charles Forsyth <forsyth@ter...> wrote: >> Both systems are complex enough that essentially no one completely >> understands them. > > this touches on an important point. the first introduction of regular > expressions to editors was great, because it took some formal language > theory and made it useful in an `every day' way. conversely, the theory > provided significant underpinning (in a structural sense) to that > practical application. now there are big books on `regular expressions' > mainly because they are no longer regular but a big collection of ad-hoc > additions. (i think there is a similarity to Perl's relationship to the > history of programming languages.) the result really is hard to reason > about ... because it hasn't got that underpinning and the algebra has > gone. it's worse than that: the code is a mess too, i supposed in > consequence. >