
\b+ Vs [\b]+ Vs [^\b]+ In Python Regex

I ran across an issue I don't understand while answering an SO question. I've created a simplified example to illustrate the problem: THE SCENARIO: I'm testing that two tokens (no

Solution 1:

The \B+ pattern triggers a "nothing to repeat" error, which is the usual result of quantifying a zero-width assertion or any other construct that leaves the quantifier with nothing to apply to. Any of these patterns causes it: (*, |*, \b+, \B+. Repeating a zero-width assertion makes no sense, because it consumes no characters and the regex index stays at the same position. Note that a{1,2}+ and f*+ cause a different but similar error, "multiple repeat", in Python versions before 3.11. From 3.11 on, re parses them as possessive quantifiers.
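The error is easy to reproduce. The snippet below is a minimal sketch; compile_error is a hypothetical helper name introduced only for this illustration, and it reports the message carried by re.error for each of the patterns listed above.

```python
import re

def compile_error(pattern):
    """Return the re.error message for pattern, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return exc.msg
    return None

# Each pattern leaves the quantifier with nothing it can repeat.
for pattern in (r'\b+', r'\B+', r'(*', r'|*'):
    print(repr(pattern), '->', compile_error(pattern))

# A lone assertion is fine; only quantifying it is rejected.
print(repr(r'\b'), '->', compile_error(r'\b'))
```

If repetition of the surrounding text is what you want, quantify a group that actually consumes characters, for example (?:\bfoo)+, rather than the assertion itself.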

Now, \b and \B cannot be used as boundary assertions inside a character class. See the Python re reference:

Note that \b is used to represent word boundaries, and means “backspace” only inside character classes. ... Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.

Also, FYI,

\number ... Inside the [ and ] of a character class, all numeric escapes are treated as characters.
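Both quoted rules can be checked directly. This is a small sketch with made-up sample strings; the variable names are only for illustration.

```python
import re

# Outside a class, \b is a zero-width word boundary: whole words only.
word_hits = re.findall(r'\bcat\b', 'cat concat cat')

# Inside a class, \b is the backspace character (U+0008).
backspace_hits = re.findall(r'[\b]', 'a\bb')

# Outside a class, \1 refers back to whatever group 1 matched.
backref = re.search(r'(a)\1', 'xaay')

# Inside a class, \1 is a numeric (octal) escape for the character U+0001.
octal_hits = re.findall(r'[\1]', 'a\x01b')

print(word_hits)
print(backspace_hits == ['\x08'])
print(backref.group(0))
print(octal_hits == ['\x01'])
```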

In the same way, you cannot use \B, \A, \Z or backreferences like \1 inside character classes: they lose their special regex meaning there. In Python versions before 3.6, an unknown escape such as \B inside a class is parsed as \ followed by the character, so [\B] matches only the character B, since the backslash is escaping a literal symbol and the symbol is matched as such. Since Python 3.6, unknown escapes made of \ and an ASCII letter are errors, so [\B] no longer compiles. Thus, on such an older version,

import re
print(re.findall(r'[\B]+', "BBB \\Bash"))  # Python < 3.6; newer versions raise re.error

outputs ['BBB', 'B'] only.
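To see how the interpreter at hand treats [\B], a check along these lines may help. It is a sketch only: try_pattern is a hypothetical helper, and the explicit classes at the end are suggested spellings that do not depend on how unknown escapes are handled.

```python
import re
import sys

def try_pattern(pattern, text):
    """Return findall() results, or the re.error message if compilation fails."""
    try:
        return re.findall(pattern, text)
    except re.error as exc:
        return 'error: ' + exc.msg

text = 'BBB \\Bash'  # the characters: B B B space backslash B a s h

# Python 3.6+ is expected to report a "bad escape" error here.
result = try_pattern(r'[\B]+', text)
print(sys.version_info[:2], result)

# Version-independent spellings: a literal B, or a backslash-or-B class.
only_b = re.findall(r'[B]+', text)
backslash_or_b = re.findall(r'[\\B]+', text)
print(only_b)
print(backslash_or_b)
```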

Likewise, r"[^\b]+" matches runs of one or more characters other than the backspace character:

print(re.findall(r'[^\b]+', "bbb \\bash\baaa"))

outputs ['bbb \\bash', 'aaa'].

Solution 2:

  • \B+ causes an error because there's no point in repeating a boundary - one boundary is the same as two boundaries. It's more likely that you've done this by mistake, so the error makes sense.
  • [\B]+ is something completely different. Most escape sequences lose their special meaning inside a character class, so this is a character set rather than a boundary. In Python versions before 3.6 it matches the character B, so repeating it is obviously possible; newer versions reject the unknown escape \B inside a class.
