ruby - How do I split a string without getting an empty string inserted in the array


Keywords:ruby 


Question: 

I'm having trouble splitting a character from a string using a regular expression, assuming there is a match.

I want to split off either an "m" or an "f" character from the start of a string, provided it is followed by one or more digits, then optional space characters, then a string from an array I have.

I tried:

2.4.0 :006 > MY_SEPARATOR_TOKENS = ["-", " to "]
 => ["-", " to "] 
2.4.0 :008 > str = "M14-19"
 => "M14-19" 
2.4.0 :011 > str.split(/^(m|f)\d+[[:space:]]*#{Regexp.union(MY_SEPARATOR_TOKENS)}/i)
 => ["", "M", "19"] 

Notice the extraneous "" element at the beginning of my array, and also notice that the last element is just "19", whereas I want everything else in the string ("14-19").

How do I adjust my regular expression so that only the parts of the expression that get split end up in the array?


4 Answers: 

The empty element will always be there when you get a match: the match starts at the beginning of the string, and split adds whatever lies between the start of the string and the match to the resulting array, whether that is an empty or a non-empty string. Either shift/drop it once you know there was a match, or just remove all empty array elements with .reject { |c| c.empty? } (see How do I remove blank elements from an array?).
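A minimal sketch of both clean-up options, using the pattern from the question. The variable names parts, without_first and without_empty are illustrative only.

```ruby
MY_SEPARATOR_TOKENS = ["-", " to "]
pattern = /^(m|f)\d+[[:space:]]*#{Regexp.union(MY_SEPARATOR_TOKENS)}/i

# split returns the text before the match (empty here), then the captured
# group, then the text after the match.
parts = "M14-19".split(pattern)          # => ["", "M", "19"]

without_first = parts.drop(1)            # removes the first element unconditionally
without_empty = parts.reject(&:empty?)   # removes every empty string, wherever it is

p without_first   # prints ["M", "19"]
p without_empty   # prints ["M", "19"]
```

Note that drop(1) is only safe after a match: when nothing matches, split returns the whole string as a single element and drop(1) would discard it.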

As for the missing text, 14- is eaten up (consumed) by the \d+[[:space:]]... part of the pattern. Put that part into a (?=...) lookahead, which only checks that the pattern matches and does not consume the characters.

Use something like

MY_SEPARATOR_TOKENS = ["-", " to "]
s = "M14-19"
p s.split(/^(m|f)(?=\d+[[:space:]]*#{Regexp.union(MY_SEPARATOR_TOKENS)})/i).drop(1)
#=> ["M", "14-19"]

See Ruby demo

 

I find match to be a bit more elegant when extracting parts of a string with regular expressions in Ruby:

string = "M14-19"
string.match(/\A(?<m>[MF])(?<digits>\d{2}(-| to )\d{2})/)[1, 2]
=> ["M", "14-19"]
# the named groups can also be read from the match
extract_string = string.match(/\A(?<m>[MF])(?<digits>\d{2}(-| to )\d{2})/)
[extract_string[:m], extract_string[:digits]]
=> ["M", "14-19"]
string = 'M14 to 14'
extract_string = string.match(/\A(?<m>[MF])(?<digits>\d{2}(-| to )\d{2})/)[1, 2]
=> ["M", "14 to 14"]
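Since match returns nil when the string does not fit the pattern, indexing the result directly raises NoMethodError in that case. The following is only a sketch of one way to guard against that. The helper name split_prefix, the constant SPLIT_PATTERN and the group names letter and rest are hypothetical. The pattern is loosened to \d+ and made case-insensitive to follow the question's wording, and the &. operator assumes Ruby 2.3 or later.

```ruby
# Hypothetical pattern: one leading m/f (any case), then digits, optional
# whitespace, a separator token, and whatever follows.
SPLIT_PATTERN = /\A(?<letter>[mf])(?<rest>\d+[[:space:]]*(?:-| to ).*)\z/i

# Returns [letter, rest] on a match, or nil when the string does not fit.
def split_prefix(str)
  str.match(SPLIT_PATTERN)&.captures
end

p split_prefix("M14-19")     # prints ["M", "14-19"]
p split_prefix("f7 to 12")   # prints ["f", "7 to 12"]
p split_prefix("X14-19")     # prints nil
```

With named groups, captures returns only the named groups, so the separator alternation does not show up in the result.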
 
 TOKENS = ["-", " to "]

 r = /
     (?<=\A[mMfF])             # match the beginning of the string and then one
                               # of the 4 characters in a positive lookbehind
     (?=                       # begin positive lookahead
       \d+                     # match one or more digits
       [[:space:]]*            # match zero or more spaces
       (?:#{TOKENS.join('|')}) # match one of the tokens
     )                         # close the positive lookahead
     /x                        # free-spacing regex definition mode

After interpolation, (?:#{TOKENS.join('|')}) becomes (?:-| to ).

This can of course be written in the usual single-line way.

r = /(?<=\A[mMfF])(?=\d+[[:space:]]*(?:#{TOKENS.join('|')}))/

When splitting on r you are splitting between two characters (between a positive lookbehind and a positive lookahead) so no characters are consumed.

"M14-19".split r
  #=> ["M", "14-19"]
"M14     to 19".split r
  #=> ["M", "14     to 19"]
"M14     To 19".split r
  #=> ["M14     To 19"]

If it is desired that ["M", "14     To 19"] be returned in the last example, change [mMfF] to [mf] and /x to /xi.
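One caveat may be worth checking, assuming the spaces in the " to " token are meant to be significant: in free-spacing (/x) mode, unescaped whitespace in the pattern is ignored, and that includes whitespace that arrives through interpolation. The sketch below compares the interpolated form with one where each token is passed through Regexp.escape first. The names loose, strict and escaped are illustrative.

```ruby
TOKENS = ["-", " to "]

# Interpolated as-is: under /x the spaces around "to" are ignored, so the
# token effectively becomes "to".
loose = /(?<=\A[mf])(?=\d+[[:space:]]*(?:#{TOKENS.join('|')}))/xi

# Escaped first: Regexp.escape backslash-escapes the spaces, and an escaped
# space is still a literal space under /x.
escaped = TOKENS.map { |t| Regexp.escape(t) }.join('|')
strict = /(?<=\A[mf])(?=\d+[[:space:]]*(?:#{escaped}))/xi

p "M14to19".split(loose)     # prints ["M", "14to19"]
p "M14to19".split(strict)    # prints ["M14to19"]
p "M14 to 19".split(strict)  # prints ["M", "14 to 19"]
```

The examples in this answer are unaffected because [[:space:]]* absorbs the spaces before "to", but the two forms differ on input such as "M14to19".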

 

You have a bug brewing in your code. Don't get in the habit of doing this:

#{Regexp.union(MY_SEPARATOR_TOKENS)}

You're setting yourself up with a very hard to debug problem.

Here's what's happening:

regex = Regexp.union(%w(a b)) # => /a|b/
/#{regex}/ # => /(?-mix:a|b)/
/#{regex.source}/ # => /a|b/

/(?-mix:a|b)/ is an embedded sub-pattern with its own set of the regex flags m, i and x, which are independent of the surrounding pattern's settings.

Consider this situation:

'CAT'[/#{regex}/i] # => nil

We'd expect the regular expression's i flag to make this match because it ignores case, but the sub-expression still allows only lowercase, causing the match to fail.

Using the bare (a|b) or adding source succeeds because the inner expression gets the main expression's i:

'CAT'[/(a|b)/i] # => "A"
'CAT'[/#{regex.source}/i] # => "A"

See "How to embed regular expressions in other regular expressions in Ruby" for additional discussion of this.
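Applied to the pattern from the question, the source approach might look like the sketch below. One assumption to flag: the source of a Regexp.union is a bare alternation, so it is wrapped in a non-capturing group here to keep the | from splitting the whole surrounding pattern. The names separators, with_source and embedded are illustrative.

```ruby
MY_SEPARATOR_TOKENS = ["-", " to "]

# The source string carries the escaped tokens but no flags of its own.
separators = Regexp.union(MY_SEPARATOR_TOKENS).source

# (?:...) keeps the alternation local; the outer /i now applies to the tokens.
with_source = /^(m|f)(?=\d+[[:space:]]*(?:#{separators}))/i

# Embedding the Regexp object itself keeps the tokens case-sensitive.
embedded = /^(m|f)(?=\d+[[:space:]]*#{Regexp.union(MY_SEPARATOR_TOKENS)})/i

p "M14 TO 19".split(with_source).drop(1)  # prints ["M", "14 TO 19"]
p "M14 TO 19".split(embedded)             # prints ["M14 TO 19"]
```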