I've been dinking around recently with a new, tiny DSL in Ruby for generating use cases from combinations of variables, and I've uncovered some subtleties of using regexps with String#split that I hadn't noticed before.
With a simple regexp, the split method consumes whatever you're splitting on. In some cases this is desirable, like when parsing a simple comma-delimited string:
irb(main):001:0> "a, b, c, d".split(/,/)
=> ["a", " b", " c", " d"]
In parsing my DSL, I'm finding cases where I need to not consume the piece I'm splitting on. For example:
irb(main):001:0> "foo = bar".split(/=/)
=> ["foo ", " bar"]
That's great: I got the foo and the bar, but I've lost the operator. Obviously, in this case I know what the operator is, since I just split on it, but I'd really like the = to remain in the resulting array. Wrapping the pattern in a capture group gets it, because split also returns the text matched by any groups in the regexp:
irb(main):002:0> "foo = bar".split(/(=)/)
=> ["foo ", "=", " bar"]
Putting the spaces before and after the operator into the regexp even gets me a bit of trimming:
irb(main):003:0> "foo = bar".split(/ (=) /)
=> ["foo", "=", "bar"]
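The literal spaces in that last regexp only match exactly one space on each side. A minimal sketch of a more forgiving variant, assuming the whitespace around the operator may vary: `split_assignment` is a made-up helper name for illustration, not something from my DSL.

```ruby
# Hypothetical helper, for illustration only.
# The capture group (=) keeps the operator in the result;
# \s* on each side absorbs zero or more whitespace characters,
# so the surrounding pieces come back already trimmed.
def split_assignment(line)
  line.split(/\s*(=)\s*/)
end

p split_assignment("foo = bar")   # => ["foo", "=", "bar"]
p split_assignment("foo=bar")     # => ["foo", "=", "bar"]
p split_assignment("foo   =bar")  # => ["foo", "=", "bar"]
```

Only the part inside the parens is returned; the whitespace matched outside the group is still consumed like any other delimiter.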
In another case, I want to split up paragraphs, splitting each time a new line starts without any indentation. For example, with this input:
exclude foo = bar
 should foo equal bar, we want to exclude that combination

exclude bar = foo
 as well, if bar equals foo,
 that case has got to go
I want this output:
["exclude foo = bar\n should foo equal bar, we want to exclude that combination\n\n",
"exclude bar = foo\n as well, if bar equals foo,\n that case has got to go\n"]
Not having this newfound split awareness, I started dealing with scan and a regexp like so:
data.scan(/(^\S.*$)?/m)
Well ... I actually got closer than that regexp gets me (I can't remember the exact one now), but it wasn't working well. I wanted to use split, but I knew I'd lose the thing I was splitting on:
data.split(/^\S/)
Gives me:
["",
"xclude foo = bar\n should foo equal bar, we want to exclude that combination\n\n",
"xclude bar = foo\n as well, if bar equals foo,\n that case has got to go\n"]
...and using the parens for grouping doesn't give me what I need either:
data.split(/(^\S)/)
["",
"e",
"xclude foo = bar\n should foo equal bar, we want to exclude that combination\n\n",
"e",
"xclude bar = foo\n as well, if bar equals foo,\n that case has got to go\n"]
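But in the process of re-reading some more advanced material on regexps, I re-learned about ‘zero-width positive lookahead’. A lookahead only asserts that what follows matches; it consumes no characters itself, so split cuts at that position and nothing is lost. Using it, my split works perfectly and the regexp is nice and tidy: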
data.split(/(?=^\S)/)
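Here is a small, self-contained sketch of that split against the sample input from above. The `split_paragraphs` method name and the `sample` variable are just names picked for this example; the regexp is the one I ended up with.

```ruby
# Hypothetical wrapper name; the regexp is the lookahead from the post.
# (?=^\S) matches the empty position at the start of any line whose
# first character is not whitespace, without consuming that character.
def split_paragraphs(text)
  text.split(/(?=^\S)/)
end

# The sample input, with indented continuation lines and a blank line.
sample = "exclude foo = bar\n" +
         " should foo equal bar, we want to exclude that combination\n" +
         "\n" +
         "exclude bar = foo\n" +
         " as well, if bar equals foo,\n" +
         " that case has got to go\n"

paragraphs = split_paragraphs(sample)
p paragraphs.length                   # => 2
p paragraphs.first.start_with?("ex")  # => true (the leading "e" survives)
```

Note there's no empty string at the front of the array this time, unlike the two earlier attempts, and each paragraph keeps its first character.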