php - What literal characters should be escaped in a regex?


Keywords:php 


Question: 

I just wrote a regex for use with the php function preg_match that contains the following part:

[\w-.]

This is meant to match any word character, as well as a minus sign and the dot. While it seems to work in preg_match, I tried to put it into a utility called Reggy and it complains about "Empty range in char class". Trial and error taught me that this issue was solved by escaping the minus sign, turning the regex into

[\w\-.]

Since the original appears to work in PHP, I am wondering why I should or should not be escaping the minus sign, and - since the dot is also a character with a meaning in PHP - why I would not need to escape the dot. Is the utility I am using just being silly, is it working with another regex dialect or is my regex really incorrect and am I just lucky that preg_match lets me get away with it?


5 Answers: 

In many regex implementations, the following rules apply:

Meta characters inside a character class are:

  • ^ (negation)
  • - (range)
  • ] (end of the class)
  • \ (escape char)

So these should all be escaped. There are some corner cases though:

  • - needs no escaping if placed at the very start or end of the class ([abc-] or [-abc]). In quite a few regex implementations, it also needs no escaping when placed directly after a range ([a-c-abc]) or a shorthand character class ([\w-abc]). This is what you observed.
  • ^ needs no escaping when it's not at the start of the class: [^a] means any char except a, and [a^] matches either a or ^, which equals: [\^a]
  • ] needs no escaping if it's the first character in the class (right after the opening [ or [^), at least in PCRE and POSIX-style engines: []] matches the char ]
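
To make the corner cases above concrete, here is a minimal sketch for PHP's preg functions (PCRE). The helper name classMatches is made up for this illustration and is not a PHP built-in. Each entry checks that the character in question is matched literally. The hyphen-after-shorthand case ([\w-abc]) is left out on purpose, because engines disagree on it.

```php
<?php
// Hypothetical helper (not a PHP built-in): true when $pattern matches $subject.
function classMatches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

// Each check asks: is the special character treated as a literal here?
$checks = [
    'hyphen at end [abc-]'          => classMatches('/^[abc-]$/', '-'),
    'hyphen at start [-abc]'        => classMatches('/^[-abc]$/', '-'),
    'hyphen after a range [a-c-x]'  => classMatches('/^[a-c-x]$/', '-'),
    'caret not at start [a^]'       => classMatches('/^[a^]$/', '^'),
    'escaped caret at start [\^a]'  => classMatches('/^[\^a]$/', '^'),
    'bracket first in class []]'    => classMatches('/^[]]$/', ']'),
];

// A caret at the very start negates the class instead of matching itself.
$negationWorks = classMatches('/^[^a]$/', 'b') && !classMatches('/^[^a]$/', 'a');

foreach ($checks as $label => $isLiteral) {
    echo $label, ': ', $isLiteral ? 'literal' : 'not literal', PHP_EOL;
}
echo 'negation [^a]: ', $negationWorks ? 'works' : 'broken', PHP_EOL;
```

Other dialects (JavaScript, for instance) handle some of these differently, so treat this as a check of PCRE behaviour only.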
 
[\w.-]
  • The . usually means any character, but between [] it has no special meaning
  • - between [] indicates a range unless it's escaped or is the first or last character between []
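
A small sketch of those two points, assuming PHP's preg_match (which returns 1 for a match and 0 for no match). The subjects are arbitrary one-character strings chosen for illustration.

```php
<?php
// 1 means "matched", 0 means "did not match".
$results = [
    'dot outside a class vs "a"'     => preg_match('/^.$/', 'a'),      // dot is a wildcard
    'dot inside a class vs "a"'      => preg_match('/^[.]$/', 'a'),    // dot is a literal
    'dot inside a class vs "."'      => preg_match('/^[.]$/', '.'),
    'range [a-c] vs "b"'             => preg_match('/^[a-c]$/', 'b'),  // hyphen forms a range
    'range [a-c] vs "-"'             => preg_match('/^[a-c]$/', '-'),
    'escaped hyphen [a\-c] vs "b"'   => preg_match('/^[a\-c]$/', 'b'), // hyphen is a literal
    'escaped hyphen [a\-c] vs "-"'   => preg_match('/^[a\-c]$/', '-'),
];

foreach ($results as $label => $count) {
    echo $label, ': ', $count === 1 ? 'match' : 'no match', PHP_EOL;
}
```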
 

While there are indeed some characters that should be escaped in a regex, you're asking not about regexes in general but about character classes, where the dash is one of the special characters.

Instead of escaping it, you could put it at the end of the class: [\w.-]

 

The full stop loses its meta meaning in the character class.

The - has special meaning in the character class. If it isn't placed at the start or at the end of the square brackets, it must be escaped. Otherwise it denotes a character range (A-Z).

You triggered another special case, however. [\w-.] works because \w does not denote a single character, so PCRE cannot possibly create a character range from it. \w is a possibly non-contiguous class of symbols, so it has no end character that could serve as the start of a range running up to the full stop. Also, the full stop . precedes every ASCII character that \w can match. No range can be constructed, which is why - worked without escaping for you.
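
Since this special case is engine-dependent, a hedged sketch may help. It compares the two spellings that should be safe everywhere with the original [\w-.], and shows what a genuinely broken range looks like from PHP. As far as I know, PHP 7.3 and later bundle PCRE2, which is stricter about a hyphen following a shorthand such as \w and may reject [\w-.] with an "invalid range in character class" error, so the last probe only reports what your installation does. The subject strings are made up for illustration.

```php
<?php
// Two spellings that do not rely on the "\w-." special case.
$portablePatterns = ['/^[\w\-.]+$/', '/^[\w.-]+$/'];
$portableResults = [];
foreach ($portablePatterns as $pattern) {
    $portableResults[] = preg_match($pattern, 'file-name.txt');
}

// A hyphen between two ordinary characters is a real range. "z-a" runs
// backwards, so the pattern fails to compile and preg_match() returns false.
// The @ only silences the warning for this demonstration.
$backwardsRange = @preg_match('/[z-a]/', 'x');

// Probe the unescaped spelling; the outcome depends on the PCRE version in use.
$probe = @preg_match('/^[\w-.]+$/', 'a-b.c');

var_dump($portableResults);
var_dump($backwardsRange);
echo '[\w-.] is ', $probe === false ? 'rejected' : 'accepted', ' by this PCRE build', PHP_EOL;
```

Either of the first two spellings avoids the question entirely, which is probably the simplest fix if the pattern has to work in more than one tool.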

 

If you are using PHP and you need to escape special regex characters, just use preg_quote:

An example from php.net:

<?php
// In this example, preg_quote($word) is used to keep the
// asterisks from having special meaning to the regular
// expression.

$textbody = "This book is *very* difficult to find.";
$word = "*very*";
$textbody = preg_replace ("/" . preg_quote($word, '/') . "/",
                          "<i>" . $word . "</i>",
                          $textbody);
?>
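
The same function can help with the character-class case from the question. Below is a minimal sketch, assuming PHP 5.3 or later (where preg_quote also escapes the - character). The helper buildCharClass is hypothetical and only here for illustration; it turns an arbitrary list of single-byte characters into a class in which none of them keeps a special meaning.

```php
<?php
// Hypothetical helper (not part of PHP): build a character class that
// matches any one of the characters in $chars, all taken literally.
function buildCharClass(string $chars, string $delimiter = '/'): string
{
    // preg_quote() escapes -, ], ^, \ and the delimiter, among others.
    return '[' . preg_quote($chars, $delimiter) . ']';
}

$class   = buildCharClass('a-z.]^');   // hyphen here is meant literally
$pattern = '/^' . $class . '+$/';

echo $class, PHP_EOL;
var_dump(preg_match($pattern, 'z-a.]^')); // only characters from the list
var_dump(preg_match($pattern, 'b'));      // "b" is not in the list: no range a-z was created
```

Escaping more characters than strictly necessary is harmless inside a class, so this trades a slightly noisier pattern for not having to remember the corner cases.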