SplitSubtleties

July 30, 2006
I've been tinkering recently with a new, tiny DSL in Ruby for generating use cases from combinations of variables, and I've uncovered some subtleties of using regexps with String#split that I hadn't noticed before.

With a simple regexp, split consumes whatever the regexp matches, so the character you're splitting on never shows up in the result. In some cases this is desirable, like parsing a simple comma-delimited string:
irb(main):001:0> "a, b, c, d".split(/,/)
=> ["a", " b", " c", " d"]


In parsing my DSL, I'm finding cases where I need to not consume the piece I'm splitting on. For example:
irb(main):001:0> "foo = bar".split(/=/)
=> ["foo ", " bar"]


That's great: I got the foo and the bar, but I've lost the operator. Obviously, in this case I know what the operator is, since I just split on it, but I'd really like the = to remain in the resulting array. A simple tweak to the regexp gets it: wrap the delimiter in a capturing group, and split includes the captured text in its output:
irb(main):002:0> "foo = bar".split(/(=)/)
=> ["foo ", "=", " bar"]


Putting the spaces before and after the operator into the regexp even gets me a bit of trimming, since they sit outside the group and are consumed as usual:
irb(main):003:0> "foo = bar".split(/ (=) /)
=> ["foo", "=", "bar"]
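The literal spaces in that pattern only match exactly one space on each side. As a minimal sketch, assuming the DSL should tolerate any amount of whitespace around the operator, the spaces can be swapped for `\s*`. The helper name `split_assignment` is hypothetical and not part of the original DSL:

```ruby
# Hypothetical helper: split a line on "=" while keeping the operator.
# The capture group (=) is retained by split; the surrounding \s* is
# outside the group, so that whitespace is consumed and discarded.
def split_assignment(line)
  line.strip.split(/\s*(=)\s*/)
end

puts split_assignment("foo = bar").inspect      # ["foo", "=", "bar"]
puts split_assignment("foo=bar").inspect        # ["foo", "=", "bar"]
puts split_assignment("  foo   =  bar ").inspect # ["foo", "=", "bar"]
```

The `strip` call is there because split only removes whitespace next to the operator, not at the ends of the line.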


In another case, I want to split up paragraphs, splitting each time a new line starts without any indentation. For example, with this input:
exclude foo = bar
 should foo equal bar, we want to exclude that combination

exclude bar = foo
 as well, if bar equals foo,
 that case has got to go


I want this output:
["exclude foo = bar\n should foo equal bar, we want to exclude that combination\n\n", 
"exclude bar = foo\n as well, if bar equals foo,\n that case has got to go\n"]


Not having this newfound split awareness, I started dealing with scan and a regexp like so:
data.scan(/(^\S.*$)?/m)


Well ... I actually got closer than that regexp gets me (I can't remember the exact pattern now), but it wasn't working well. I wanted to use split, but I knew I'd lose the character I was splitting on:
data.split(/^\S/)


Gives me:
["", 
"xclude foo = bar\n should foo equal bar, we want to exclude that combination\n\n",
"xclude bar = foo\n as well, if bar equals foo,\n that case has got to go\n"]


...and using the parens for grouping doesn't give me what I need either:
data.split(/(^\S)/)

["",
"e",
"xclude foo = bar\n should foo equal bar, we want to exclude that combination\n\n",
"e",
"xclude bar = foo\n as well, if bar equals foo,\n that case has got to go\n"]
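The grouped output is recoverable, if clumsily: each captured character is followed by the rest of its paragraph, so the pieces can be glued back together in pairs. A rough sketch, assuming the input starts with an unindented line (so the first element is the empty string) and using sample data reconstructed from the example above:

```ruby
# Sample input; continuation lines are indented by one space.
data = "exclude foo = bar\n" \
       " should foo equal bar, we want to exclude that combination\n" \
       "\n" \
       "exclude bar = foo\n" \
       " as well, if bar equals foo,\n" \
       " that case has got to go\n"

pieces = data.split(/(^\S)/)
# pieces is ["", "e", "xclude foo ...", "e", "xclude bar ..."]

# Drop the leading empty string, then join each captured first
# character with the text that follows it.
stitched = pieces.drop(1).each_slice(2).map(&:join)

stitched.each { |paragraph| puts paragraph.inspect }
```

It works, but it is two extra steps to undo what the split did, and `drop(1)` would throw away real text if the input began with an indented line.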


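But in the process of re-reading some more advanced material on regexps, I re-learned about ‘zero-width positive lookahead’. A lookahead, written (?=...), asserts that the pattern matches at the current position without consuming any characters, so split breaks the string just before each unindented line and leaves that line's first character in place. Using it, my split works perfectly and the regexp is nice and tidy: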
data.split(/(?=^\S)/)
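To see the lookahead split end to end, here is a small self-contained sketch. The `data` string is reconstructed from the example input above, with continuation lines indented by one space; that exact indentation is an assumption:

```ruby
# Sample input; continuation lines are indented by one space.
data = "exclude foo = bar\n" \
       " should foo equal bar, we want to exclude that combination\n" \
       "\n" \
       "exclude bar = foo\n" \
       " as well, if bar equals foo,\n" \
       " that case has got to go\n"

# (?=^\S) matches the empty position just before any line whose first
# character is non-whitespace. Nothing is consumed, so nothing is lost.
paragraphs = data.split(/(?=^\S)/)

paragraphs.each { |paragraph| puts paragraph.inspect }
# "exclude foo = bar\n should foo equal bar, we want to exclude that combination\n\n"
# "exclude bar = foo\n as well, if bar equals foo,\n that case has got to go\n"
```

Because the match is zero-width, joining the pieces back together reproduces the original string exactly, and there is no leading empty string to discard.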


tags: ComputersAndTechnology