python - Why does this re.split() return white space as separate items in the list?


Keywords: python


Question: 


1 Answer: 

To get your expected output, you'd want to use a non-greedy match, changing:

re.split(r'(\\.*[\s$])', markup)

to:

re.split(r'(\\.*?[\s$])', markup)

The reason is that .* is greedy: it matches as much of the string as it can while still letting the rest of the pattern match. The fixed parts of your pattern are very simple (a leading backslash, then any characters, then a single whitespace character or literal $), so the match runs from the first backslash to the last whitespace or $ character on the line (by default . does not match a newline). The non-greedy .*? instead stops at the first whitespace or $ character after each backslash.
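To see the difference side by side, here is a minimal sketch. The markup string is an assumption, reconstructed from the output quoted in this answer, so the input in the original question may differ.

```python
import re

# Assumed sample input, reconstructed from the output shown in this answer;
# the string in the original question may differ.
markup = r'\{caption Figure 1: Leaf shapes\} \image:leaf_shapes.tiff'

# Greedy: .* runs on to the LAST whitespace (or literal $) it can reach.
greedy = re.split(r'(\\.*[\s$])', markup)

# Non-greedy: .*? stops at the FIRST whitespace (or literal $) after each backslash.
lazy = re.split(r'(\\.*?[\s$])', markup)

print(greedy)
# ['', '\\{caption Figure 1: Leaf shapes\\} ', '\\image:leaf_shapes.tiff']
print(lazy)
# ['', '\\{caption ', 'Figure 1: Leaf shapes', '\\} ', '\\image:leaf_shapes.tiff']
```

With this input the greedy pattern swallows the caption text and the closing \} in a single match, while the non-greedy pattern produces one match per backslash command.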

With the non-greedy version, the output is:

['', '\\{caption ', 'Figure 1: Leaf shapes', '\\} ', '\\image:leaf_shapes.tiff']

which is almost what you want (aside from the leading empty string, created because your regex matches at the very beginning of the string). You can manually pop it off if needed, e.g. to remove leading and trailing empty strings:

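import re

# The capture group makes re.split() keep the matched commands in the result
tokens = re.split(r'(\\.*?[\s$])', markup)
# Drop the empty string produced when a match sits at the very start
if tokens and not tokens[0]:
    tokens.pop(0)
# Likewise for a match at the very end
if tokens and not tokens[-1]:
    tokens.pop()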

Note: If your intent was to match until whitespace or end of string, not whitespace or literal $, you need to change [\s$] to (?:\s|$); inside a character class $ isn't special, so you need to use a (non-capturing) grouped alternation instead.
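A short sketch of that last point, again using an assumed markup string reconstructed from the output above (the original input may differ). It shows that $ inside a character class only matches a literal dollar sign, and what changes when the alternation is used instead.

```python
import re

# Assumed sample input; the string in the original question may differ.
markup = r'\{caption Figure 1: Leaf shapes\} \image:leaf_shapes.tiff'

# Inside [...], $ is just a dollar sign, not an end-of-string anchor.
dollar_hit = re.search(r'[\s$]', 'price$')   # matches the literal '$'
no_hit = re.search(r'[\s$]', 'price')        # nothing to match, returns None

# With (?:\s|$), the final command can also end at end of string,
# so \image:leaf_shapes.tiff now counts as a match of its own.
anchored = re.split(r'(\\.*?(?:\s|$))', markup)
print(anchored)
# ['', '\\{caption ', 'Figure 1: Leaf shapes', '\\} ', '', '\\image:leaf_shapes.tiff', '']

# Adjacent matches leave empty strings in the middle too, so filtering
# is simpler here than popping only the first and last items.
tokens = [t for t in anchored if t]
print(tokens)
# ['\\{caption ', 'Figure 1: Leaf shapes', '\\} ', '\\image:leaf_shapes.tiff']
```

Note that filtering drops every empty string, including ones in the middle of the list. That is fine for this input, but it is a different choice from removing only the leading and trailing items.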