Talk:Comparison of regular expression engines
This article is rated List-class on Wikipedia's content assessment scale.
python regex module missing
https://pypi.python.org/pypi/regex supports a broader set of features than Python's standard re module, especially recursive pattern matching. --ThomasKalka (talk) 11:02, 28 March 2016 (UTC)
- Linked in the "Remarks" column for Python in the Languages table. rootsmusic (talk) 22:36, 14 December 2023 (UTC)
Jan-2010 Update
Made some updates -- not enough room to document them on the 1 line summary. Some of these address comments made previously (below).
Updates include:
- removing notes that something has only been available since '2007' (version was mentioned, but released 5 years ago)....
- remove notes on Unicode support where the note was (supports ALL, including binary )... if it supports it, it supports it, a special note should not be required for 'all', rather, only 'partial' support cases should be noted.
- modified ref-note for 'partial support' to indicate that such status was transient and that 'complete compliance' was dependent on updates after new versions are released.
- removed comparison columns for 'partial' matches and 'fuzzy matches' -- neither is a feature of *Regular* Expressions. The latter, which was defined, is specifically NOT a feature of regular expressions but rather an extra feature of some *products* that include a regex engine. Comparing products that include regexes isn't the same thing as comparing regex engines. Though 'partial' matching wasn't defined (how can you compare something that is 'undefined'?), I believe it referred to a *product feature* in, for example, something like 'vim', where it matches in your source text whatever you have typed into a *search string*, in real time. This isn't a feature of a regex engine, but a product feature of some *EDITORS* -- i.e. it's not about the 'RegEx Engine'. I would wager that all of the products can match a valid substring of a larger RegEx. Equally, I would wager any program using those regexes could be programmed to ignore errors while things were being 'typed in'. Thus -- irrelevant to this article.
- Notes: 1) 'Fuzzy' matching involves fuzzy logic, which by definition is not a state-driven logic, and thus by definition cannot be part of a *Regular* expression engine. It may, however, be a product feature to allow various levels of 'mistakes' when matching the given RegEx, but the magnitude of a mistake is not a binary property (thus fuzzy) -- thus not Regular. Fuzzy logic is generally in the domain of "AI" - not regexes.
- Note 2: (not mentioned in the article) -- PCRE gets its code from Perl -- so its features generally track Perl's. PCRE is an acronym for Perl Compatible Regular Expressions, and the engine in Ruby derives from PCRE -- and tries to track its features. The Ruby engine was done specifically to add Japanese support BEFORE UTF-8 started becoming prevalent. Thus it supported 16 bits early on, but for locale-based charsets for Japanese. It wasn't really until it added UTF-8 support that it got full Unicode support.
(This is written after updating "Part 2". I'm looking at "Part 3" to see what is salvageable there... started to write up comments, but better I do it and then say what was done, as if I get hung up on saying what I'll do, I may not get it done... am getting a bit tired of this update stuff already.) Astara Athenea (talk) 21:44, 22 January 2012 (UTC)
Ill-defined terms
Too many of the terms used as headings are vague or apply only to the terminology used for one RE engine. What this article really needs is a glossary of its terms.
There's also a fair point to be made that many of the tables here could be prose, and that would facilitate citing them. -Harmil 19:47, 27 April 2007 (UTC)
- I agree a terminology description would be useful. However, I strongly disagree that some of the tables should be converted to text. First, because that takes away this article's main feature - the ability to see differences within seconds without reading for hours - and secondly, citing Wikipedia is discouraged anyway. // Sping 17:20, 11 July 2007 (UTC)
- Care to give examples of terminology you consider too vague or applicable only to "the terminology used for one RE engine?" (I'm not really sure what you mean by that.) I think the terms are fairly straightforward. IMO, a bigger problem is that a very large number of significant features supported by some regex libraries are not currently represented in the comparison tables here. --Monger 04:03, 12 July 2007 (UTC)
- What on earth is a "Lazy Quantifier"? I can't find mention of it anywhere else. 72.220.174.159 20:24, 26 July 2007 (UTC)
- You must not have looked very far. In the context of regular expression quantifiers, lazy is the opposite of greedy. See http://www.regular-expressions.info/repeat.html for more info. I've also seen lazy quantifiers described as "non-greedy" or "reluctant". --Monger 01:17, 27 July 2007 (UTC)
- Lazy is not the opposite of greedy; that is a poor name. Also, I've never seen it called "lazy" before; non-greedy is the standard term. mathrick (talk) 23:42, 2 February 2008 (UTC)
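To illustrate the distinction discussed in this thread, here is a minimal sketch using Python's standard re module. The sample string and patterns are made up for illustration; other engines use the same `?` suffix but may differ in detail. A greedy quantifier takes as much as it can, while the lazy (non-greedy, reluctant) form takes as little as it can:

```python
import re

text = "<a><b>"  # hypothetical sample input

# Greedy: .+ consumes as much as possible, then backtracks to the last '>'.
greedy = re.search(r"<.+>", text).group()

# Lazy (non-greedy): .+? consumes as little as possible, stopping at the first '>'.
lazy = re.search(r"<.+?>", text).group()

print(greedy)  # <a><b>
print(lazy)    # <a>
```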
Removing flavors with no information
Unless others disagree, I plan to remove from the comparison tables any flavors and engines which currently have no information about their features listed. Currently, this includes the following:
- ActionScript3.0
- Boost.Xpressive
- Grep
- GRETA
- Jakarta/Regexp
- Oniguruma
- SubEthaEdit
- Tcl 8.1
- TextMate
I would encourage others to list information about these engines' features, especially since a few of them are very significant and commonly used. However, I do not see any value in listing them without any information (none include any more than a couple "no"s). --Monger 00:54, 17 July 2007 (UTC)
- I've gone ahead and done this. --Monger 01:00, 20 July 2007 (UTC)
Unicode property support
I have not found any evidence that Python supports Unicode properties (like \p{L}). I'm not sure about other implementations, so I am fixing only the Python item. See e.g. [1]. Mykhal (talk) 21:10, 9 January 2008 (UTC)
Only ICU and Perl offer full Unicode property support as of this writing; notes added. I cannot find any evidence that vim supports Unicode properties (like \pL, \p{Lu}, \p{Alphabetic}, \p{Script=Latin}, or \p{Line_Break=A_Letter}). I have removed its support.
I strongly suggest that just mentioning Unicode property support is far too broad a brush for usefulness. The most important thing is whether a regex system is or is not compliant with the requirements spelt out in Unicode Regular Expressions. This is quite specific about formal requirements, such as Level 1, Level 2, or Level 3. Suggestions? Standards compliance is easily referenceable through specific claims in each language's documentation.
Even mentioning whether things like \w, \s, and \b work with Unicode or whether they're ASCII-only would be much more useful than the current column features. 17:50, 5 February 2010 (UTC) --Preceding unsigned comment added by 98.245.82.12 (talk)
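As one concrete data point for the two questions raised in this thread, the following is a small sketch against Python 3's standard re module only (it says nothing about the third-party regex package or about other engines). The sample word is an arbitrary choice. It checks whether \w is Unicode-aware and whether \p{L} is accepted at all:

```python
import re

word = "na\u00efve"  # "naive" spelled with U+00EF, a non-ASCII letter

# For str patterns in Python 3, \w is Unicode-aware by default.
unicode_match = re.fullmatch(r"\w+", word) is not None

# With re.ASCII, \w is restricted to [a-zA-Z0-9_].
ascii_match = re.fullmatch(r"\w+", word, flags=re.ASCII) is not None

# The standard re module rejects \p{...} property escapes at compile time.
try:
    re.compile(r"\p{L}")
    property_supported = True
except re.error:
    property_supported = False

print(unicode_match, ascii_match, property_supported)  # True False False
```

A table column along these lines (Unicode-aware shorthand classes versus property escapes) would separate the two capabilities that the current "Unicode property support" heading lumps together.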
Languages?
What exactly is the Languages table supposed to show? Languages which have regexes built in? Languages for which a regex library exists? Something else? As it stands today, it's completely meaningless. mathrick (talk) 00:30, 3 February 2008 (UTC)
- Languages which have regexes builtin --208.80.119.67 (talk) 03:32, 14 July 2011 (UTC)
Table footnotes
I found the footnotes on these tables to be nigh on useless. Why are they using refun? I can see using refun when there are only one or two notes, but not when there are 7. I was forced to compare the link names on the endnotes with the notes themselves to figure out which note I was interested in reading. Argonel (talk) 21:43, 28 May 2008 (UTC)
Speed
Another interesting point of comparison could of course be speed (or type of implementation); there are some references in the paper "Regular Expression Matching Can Be Simple And Fast". --Lapo Luchini (talk) 14:58, 31 August 2008 (UTC)
Inconsistency between Language Features Part 1 and Note above
The note above Language Features part 1 states:
- NOTE: An application using a library for regular expression support does not necessarily offer the full set of features of the library, e.g. GNU Grep which uses PCRE does not offer lookahead support, though PCRE does.
However, the table shows that GNU Grep does support lookahead. Unfortunately, I'm not sure which is true, perhaps someone else who knows can correct it. — Preceding unsigned comment added by 99.42.116.61 (talk) 22:03, 28 April 2013 (UTC)
External links modified
Hello fellow Wikipedians,
I have just modified 4 external links on Comparison of regular expression engines. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
- Added archive https://archive.is/20081203133158/http://jeff.bleugris.com/journal/projects/ to http://jeff.bleugris.com/journal/projects/
- Added archive https://web.archive.org/web/20081201072631/http://www.regexlab.com/en/deelx/ to http://www.regexlab.com/en/deelx/
- Added archive https://web.archive.org/web/20131122023923/http://www2.tcl.tk/461 to http://www2.tcl.tk/461
- Added archive https://web.archive.org/web/20110715032327/https://www.p6r.com/software/rgx.html to https://www.p6r.com/software/rgx.html
When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.
This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 5 June 2024).
- If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
- If you found an error with any archives or the URLs themselves, you can fix them with this tool.
Cheers.—InternetArchiveBot (Report bug) 17:41, 11 August 2017 (UTC)
Javascript regular expression features recently added.
In newer implementations, support for named capture groups has been added, as seen in a proposal.[1]
ES2018 has a proposal for lookbehind, which was already implemented in some engines.[2]
Also, V8 has had Unicode support for a while now.[3][4]
References
- ^ "Issue 2050343002: [regexp] Experimental support for regexp named captures - Code Review". codereview.chromium.org. Retrieved 2018-02-02.
- ^ Hablich, Michael (2016-02-26). "V8 JavaScript Engine: RegExp lookbehind assertions". V8 JavaScript Engine. Retrieved 2018-02-02.
- ^ "3a2fbc3a4ed2802b52659df2209b930200d63b29 - v8/v8 - Git at Google". chromium.googlesource.com. Retrieved 2018-02-02.
- ^ "e1c645d1f41febae014b4d0dfe7dc6e4549fab5e - v8/v8 - Git at Google". chromium.googlesource.com. Retrieved 2018-02-02.
Engines could be categorized
- There are official types of engines: DFA / NFA, with the further distinction between Traditional NFA and POSIX NFA (see Mastering Regular Expressions, 3rd Edition by Jeffrey E.F. Friedl, chapter 4).
- And there is a strong grouping by Perl compatibility (which drove regex developments some years ago). Perl 5.005 introduced new features (Perl 5.005 Regular Expression improvements) like Lookbehinds, Conditional Expressions, and Atomic Groups. Perl 5.10 introduced other new features (Perl 5.1 Regular Expression improvements) many years later, like Named Capture Buffers, Possessive Quantifiers, Relative Backreferences, and \K, among others. The regex engine in version 5.10 was developed in collaboration with the PCRE project; the most interesting features were added between 1997 and 2007 (Curated PCRE history).
As Perl is/was the de facto standard for regex, most of the engines in this Wikipedia article have a grammar and feature set that clearly dates from before the Perl 5.005 release, from between Perl 5.005 and 5.10, or from after Perl 5.10.
Sebastian --88.217.185.170 (talk) 21:47, 12 October 2019 (UTC)
- A bit of digging provided: PCRE was created in 1997, when Perl 5.004 was out. PCRE 2.0 was created in 1998, when Perl 5.005 (with regex updates) was released. PCRE 7.0-7.3 were done in 2006-2007 in co-development with Perl 5.10 (with more groundbreaking regex updates). Sebastian --88.217.185.170 (talk) 22:20, 12 October 2019 (UTC)
About Java Regex Variable-length lookaround
I just tested the regex "(?<=[a-z]+)[0-9]+" against the text "abcd12345" on several Java platforms, including Oracle JDK and OpenJDK 1.8 and Oracle JDK and OpenJDK 17, and it gave the correct result "12345"; the variable-width look-ahead regex "[0-9]+(?=[a-z]+)" also works fine in Java on my machine. But on the Android platform with API level 29 and Java source/target compatibility version 1.8, the look-behind expression fails to compile with the error "non-fixed width look-behind". It also fails on the website regex test with the Java 8 flavor.
I don't know how these are supposed to work in Java, and the differing results above leave me very confused. --Preceding unsigned comment added by Bczhc (talk • contribs) 02:52, 5 February 2022 (UTC)
- And is there a need to add a "variable-length lookahead" feature column to regular expression features part 2? Bczhc (talk) 02:57, 5 February 2022 (UTC)
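For comparison with another engine, and not as an explanation of the Java and Android results above, here is a sketch of the same two patterns in Python's standard re module. It shows why variable-length look-behind and variable-length look-ahead are worth treating as separate features: this engine rejects the former at compile time but accepts the latter. The fixed-width variant and the second sample string are additions for illustration.

```python
import re

text = "abcd12345"

# Variable-width look-behind: Python's re refuses to compile it.
try:
    re.compile(r"(?<=[a-z]+)[0-9]+")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Fixed-width look-behind (a single letter) compiles and matches.
fixed = re.search(r"(?<=[a-z])[0-9]+", text).group()

# Variable-width look-ahead is accepted without complaint.
lookahead = re.search(r"[0-9]+(?=[a-z]+)", "12345abcd").group()

print(lookbehind_ok, fixed, lookahead)  # False 12345 12345
```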
possessive quantifiers
I'm missing the feature "possessive quantifiers" that some RegEx dialects have. I can only find the distinction between greedy and non-greedy, but technically there are three kinds: greedy, lazy/reluctant, and possessive.
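A minimal sketch of the three quantifier kinds, assuming Python 3.11 or later, where the standard re module accepts the `++` possessive syntax (older versions reject it, hence the version guard). The pattern and input are made up for illustration. Each pattern needs one final "a" after the quantified part, so it only succeeds if the quantifier is willing to leave one for it:

```python
import re
import sys

# Possessive quantifiers were added to re in Python 3.11; guard for older versions.
if sys.version_info >= (3, 11):
    text = "aaa"
    # Greedy: a+ takes all three 'a's, then gives one back so the final 'a' can match.
    greedy = re.fullmatch(r"a+a", text) is not None
    # Lazy: a+? starts with one 'a' and expands only as far as needed.
    lazy = re.fullmatch(r"a+?a", text) is not None
    # Possessive: a++ takes all three 'a's and never gives any back, so the final 'a' fails.
    possessive = re.fullmatch(r"a++a", text) is not None
    print(greedy, lazy, possessive)  # True True False
else:
    print("possessive quantifiers need Python 3.11+")
```

Greedy and lazy differ only in which match they prefer; possessive differs in whether a match is found at all, which is why it does not fit into a greedy/non-greedy column.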