cLabs: SplitSubtleties

I've been tinkering recently with a new, tiny DSL in Ruby for generating use cases from combinations of variables, and along the way I uncovered some subtleties of how String#split handles regexps that I hadn't appreciated before.

With a simple regexp, split consumes whatever text the regexp matches. In some cases this is desirable, like parsing a simple comma-delimited string:

irb(main):001:0> "a, b, c, d".split(/,/)
=> ["a", " b", " c", " d"]
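If the leftover spaces in that result are unwanted, the whitespace can be folded into the delimiter itself. A minimal sketch (the `fields` name is just for illustration):

```ruby
# The delimiter is a comma plus any whitespace around it,
# so split consumes the padding along with the comma.
fields = "a, b, c, d".split(/\s*,\s*/)
p fields  # prints ["a", "b", "c", "d"]
```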

In parsing my DSL, I'm finding cases where I need to not consume the piece I'm splitting on. For example:

irb(main):001:0> "foo = bar".split(/=/)
=> ["foo ", " bar"]

That's great, I got the foo and the bar, but I've lost the operator. Obviously, in this case, I know what the operator is, since I just split on it, but I'd really like the = to remain in the resulting array. Wrapping the pattern in a capture group does it, because split includes any captured text in its result:

irb(main):002:0> "foo = bar".split(/(=)/)
=> ["foo ", "=", " bar"]

Putting the surrounding spaces into the regexp, outside the group, even gets me a bit of trimming, since they are consumed but not captured:

irb(main):003:0> "foo = bar".split(/ (=) /)
=> ["foo", "=", "bar"]
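The same trick extends to more than one operator by putting an alternation inside the capture group. A rough sketch, assuming a hypothetical `!=` operator alongside `=` (the real DSL's operators may well differ) and a made-up helper named `split_condition`:

```ruby
# Hypothetical operator list, for illustration only.
OPERATOR = /\s*(!=|=)\s*/

# Returns [left, operator, right]. The capture group keeps the operator,
# while the \s* outside the group is consumed and so trimmed away.
def split_condition(text)
  text.split(OPERATOR)
end

p split_condition("foo = bar")   # prints ["foo", "=", "bar"]
p split_condition("foo != bar")  # prints ["foo", "!=", "bar"]
```

Keeping the whitespace outside the parentheses is what separates "consumed" from "kept": only the captured part comes back in the array.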

In another case, I want to split up paragraphs, splitting each time a new line starts without any indentation. For example, with this input:

exclude foo = bar
  should foo equal bar, we want to exclude that combination

exclude bar = foo
  as well, if bar equals foo,
  that case has got to go

I want this output:

["exclude foo = bar\n  should foo equal bar, we want to exclude that combination\n\n", 
 "exclude bar = foo\n  as well, if bar equals foo,\n  that case has got to go\n"]

Not having this newfound split awareness, I started dealing with scan and a regexp like so:

data.scan(/(^\S.*$)?/m)

Well ... my real attempt got closer than that regexp does, though I can't remember it now; either way, it wasn't working well. I wanted to use split, but I knew I'd lose the character I was splitting on:

data.split(/^\S/)

Gives me:

["", 
 "xclude foo = bar\n  should foo equal bar, we want to exclude that combination\n\n", 
 "xclude bar = foo\n  as well, if bar equals foo,\n  that case has got to go\n"]

...and using the parens for grouping doesn't give me what I need either:

data.split(/(^\S)/)

["", 
 "e", 
 "xclude foo = bar\n  should foo equal bar, we want to exclude that combination\n\n", 
 "e", 
 "xclude bar = foo\n  as well, if bar equals foo,\n  that case has got to go\n"]

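But in the process of re-reading some more advanced material on regexps, I re-learned about ‘zero-width positive lookahead’. A lookahead matches a position rather than any characters, so split cuts just before each unindented line and consumes nothing. Using it, my split works perfectly and the regexp is nice and tidy: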

data.split(/(?=^\S)/)
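For completeness, here is a small self-contained run of that split against the sample input from earlier. The `data` and `paragraphs` variable names are assumptions for the sketch; the regexp is the one above.

```ruby
# Sample input: two paragraphs, each starting with an unindented line.
data = "exclude foo = bar\n" \
       "  should foo equal bar, we want to exclude that combination\n" \
       "\n" \
       "exclude bar = foo\n" \
       "  as well, if bar equals foo,\n" \
       "  that case has got to go\n"

# (?=^\S) matches the empty position just before a non-space character
# at the start of a line, so no text is consumed by the split.
paragraphs = data.split(/(?=^\S)/)

p paragraphs.size          # prints 2
p paragraphs.join == data  # prints true, nothing was lost
```

Because the match is zero-width, joining the pieces back together reproduces the original string exactly, leading "e" and all.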

tags: ComputersAndTechnology