Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

I'm already aware that \w in PCRE (particularly PHP's implementation) can sometimes match non-ASCII characters depending on the system locale, but what about [a-z]?

I wouldn't think so, but I noticed these lines in one of Drupal's core files (includes/theme.inc, simplified):

// To avoid illegal characters in the class,
// we're removing everything disallowed. We are not using 'a-z' as that might leave
// in certain international characters (e.g. German umlauts).
$body_classes[] = preg_replace('![^abcdefghijklmnopqrstuvwxyz0-9-_]+!s', '', $class);

Is this true, or did someone simply get [a-z] confused with \w?


1 Answer

Long story short: maybe. It depends on the system the app is deployed to and on how PHP was compiled. Welcome to the CF of localization and internationalization.

The underlying PCRE engine takes locale into account when determining what "a-z" means. In a Spanish-based locale, ñ would be caught by a-z. The semantic meaning of a-z is "all the letters between a and z", and ñ is a separate letter in Spanish.

However, because PHP blindly handles strings as collections of bytes rather than collections of Unicode code points, you have a situation where a-z MIGHT match an accented character. Given the variety of systems Drupal gets deployed to, it makes sense that they would choose to be explicit about the allowed characters rather than just trust a-z to do the right thing.
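To make the "bytes, not code points" point concrete, here is a minimal sketch. It assumes PCRE was built with UTF-8 support (needed for the /u modifier), and the variable names are just for illustration:

```php
<?php
// "ü" written as raw UTF-8 bytes: one code point, two bytes.
$u_umlaut = "\xC3\xBC";

// strlen() counts bytes, not characters.
$byte_length = strlen($u_umlaut);

// Without /u, "." consumes one byte per match.
$units_without_u = preg_match_all('/./s', $u_umlaut);

// With /u, the subject is treated as UTF-8 and "." consumes one code point per match.
$units_with_u = preg_match_all('/./su', $u_umlaut);

echo "$byte_length $units_without_u $units_with_u\n";
```

Without /u, the regex engine sees an umlaut as two separate high bytes, so whether a pattern "matches an accented character" is really a question about individual bytes.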

I'd also conjecture that the existence of this regular expression is the result of a bug report being filed about German umlauts not being filtered.

Update in 2014: Per JimmiTh's answer below, it looks like [a-z] will only match the characters abcdefghijklmnopqrstuvwxyz a proverbial 99% of the time, despite some "confusing-to-non-pcre-core-developers" documentation. That said, framework developers tend to get twitchy about vagueness in their code, especially when the code relies on systems (locale-specific strings) that PHP doesn't handle as gracefully as you'd like, and on servers the developers have no control over. So while the anonymous Drupal developer's comment is incorrect, it wasn't a matter of "getting [a-z] confused with \w". It was a Drupal developer being unsure of how PCRE handled [a-z] and choosing the more specific form abcdefghijklmnopqrstuvwxyz to ensure the specific behavior they wanted.
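If you want to check this on your own server rather than take anyone's word for it, something like the following rough sketch should do. The function names are made up for illustration (this is not Drupal's code), and the hyphen is moved to the end of the character class purely to keep it unambiguous:

```php
<?php
// Hypothetical helpers for illustration; not Drupal's actual code.

// Strip everything outside a-z, 0-9, "_" and "-" using the range form.
function strip_with_range(string $class): string {
    return preg_replace('![^a-z0-9_-]+!s', '', $class);
}

// Same filter with the alphabet spelled out, as in the Drupal snippet.
function strip_with_explicit_list(string $class): string {
    return preg_replace('![^abcdefghijklmnopqrstuvwxyz0-9_-]+!s', '', $class);
}

// Collect every single byte value (0-255) that [a-z] matches without /u.
function bytes_matched_by_range(): array {
    $matched = [];
    for ($i = 0; $i < 256; $i++) {
        if (preg_match('/\A[a-z]\z/', chr($i)) === 1) {
            $matched[] = chr($i);
        }
    }
    return $matched;
}

// "müller-straße_2" as raw UTF-8 bytes, so the file encoding does not matter.
$input = "m\xC3\xBCller-stra\xC3\x9Fe_2";

echo strip_with_range($input), "\n";
echo strip_with_explicit_list($input), "\n";
echo implode('', bytes_matched_by_range()), "\n";
```

On a default setup I'd expect both filters to drop the umlaut and ß bytes identically, and the last line to print exactly the 26 lowercase ASCII letters. If your server prints anything else, that's the case the Drupal comment was worried about.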

