TheCaseOfTheIntermittentBuffer

June 23, 2008
At my current gig, we have a custom Java .jar that we also cross-compile to .NET with IKVM. Recently, we ran across a bug that occurred only intermittently with the .NET assembly (and very infrequently at that: maybe 3 times out of 100). The .jar never had the problem. Here's some example code that demonstrates the issue:
package org.clabs.bufferexample;

public class Processor {

    private IDataProvider data;
    private StringBuffer dataRead;

    public Processor(IDataProvider data) {
        this.data = data;
    }

    public void execute() {
        byte[] buffer = new byte[10];
        StringBuffer stringData = new StringBuffer();
        while (true) {
            int bytesRead = data.read(buffer);
            if (bytesRead == -1) {
                break;
            }
            stringData.append(new String(buffer));
        }
        processXML(stringData.toString());
    }

    private void processXML(String string) {
        // pretend this is real code
        dataRead = new StringBuffer();
        int fromIndex = 0;
        while (true) {
            int index = string.indexOf("foo=", fromIndex);
            if (index > -1) {
                String s = string.substring(index + 4, index + 5);
                if (dataRead.length() > 0) {
                    dataRead.append(",");
                }
                dataRead.append(s);
                fromIndex = index + 4;
            } else {
                break;
            }
        }
    }

    public String getDataRead() {
        return dataRead.toString();
    }
}


Here's a passing test showing the code works fine:
package org.clabs.bufferexample;

import java.util.ArrayList;

import junit.framework.TestCase;

public class ProcessorTest extends TestCase {

    public class MockProvider implements IDataProvider {

        private ArrayList<String> reads = new ArrayList<String>();
        private int readsIndex;

        public MockProvider(ArrayList<String> reads) {
            this.reads = reads;
            readsIndex = 0;
        }

        public int read(byte[] buffer) {
            if (readsIndex < reads.size()) {
                String thisRead = reads.get(readsIndex++);
                byte[] theseBytes = thisRead.getBytes();
                System.arraycopy(theseBytes, 0, buffer, 0, theseBytes.length);
                return theseBytes.length;
            }

            return -1;
        }
    }

    public void testBufferedRead() {
        ArrayList<String> reads = new ArrayList<String>();
        String data = "<data><bar foo=1/><bar foo=2/></data>";

        reads.add(data.substring(0, 10));
        reads.add(data.substring(10, 20));
        reads.add(data.substring(20, 30));
        reads.add(data.substring(30, data.length()));

        MockProvider mockProvider = new MockProvider(reads);
        Processor processor = new Processor(mockProvider);
        processor.execute();
        assertEquals("1,2", processor.getDataRead());
    }
}


(Since the production IDataProvider is essentially an InputStream, the MockProvider here is set up to hand over the data in chunks via multiple calls to read(byte[]). Also, the real code deals with much larger chunks of data and uses a 1 KiB buffer.)
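
For reference, the IDataProvider interface itself isn't shown above. Judging from how Processor and MockProvider use it, it presumably looks something along these lines (a sketch inferred from the calls, not the actual interface from our codebase):

package org.clabs.bufferexample;

// Inferred contract: read fills the buffer with whatever data is available,
// returns the number of bytes read, or -1 when there's no more data; the
// same shape as InputStream.read(byte[]).
public interface IDataProvider {
    int read(byte[] buffer);
}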

Trouble with this test is that it's not real enough. The test passes, in Java or .NET, 100% of the time. “1,2” is always returned, but in production, when the problem occurred, we'd only get “2”. So ... where's the problem?

After looking over this a couple of times, my eye was drawn to these lines in the production code:
            int bytesRead = data.read(buffer);
            if (bytesRead == -1) {
                break;
            }
            stringData.append(new String(buffer));


When the read method reaches the end of the data, the buffer isn't always going to be full; it's only full if the total amount of data is a multiple of 10. Yet the last line in the above snippet appends the complete buffer every time. So, the last time through, there'll be some leftover bytes from the prior call added to the end of the data.

That would explain the intermittency, wouldn't it? Well ... no. If the trailing garbage were the cause, the problem would have to be far more common than it was. The production data had no consistency in length, so with our example's 10-byte buffer you'd expect garbage tacked onto the end of the data roughly 90% of the time, and with the production code's 1 KiB buffer, only about 1 run in 1,024 would have the data align itself just so. And our test, which doesn't have aligned data, passes 100% of the time.

Breaking out the debugger at this point proved futile, since the bug would never occur in the debugger. This could either be because it happened so rarely, or because we had a heisenbug on our hands**. It was time for some old-school debugging.

I added some logging to the code:
...
    private void processXML(String string) {
        // pretend this is real code
        System.out.println(string);
...


...then set up a Ruby script that would run the code in question over and over again, parsing the console output and breaking whenever the console output was not what it should be.
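
(The original harness was a quick Ruby script; here's a rough Java sketch of the same idea, assuming a hypothetical run-processor command that launches the code under test and prints the parsed string to stdout. The command name and the expected prefix are made up for illustration.)

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class RepeatRunner {

    public static void main(String[] args) throws Exception {
        String expected = "<data><bar foo=1/><bar foo=2/></data>";
        for (int run = 1; ; run++) {
            // Launch the code under test; "run-processor" is a stand-in for
            // whatever actually kicks off Processor.execute().
            Process p = new ProcessBuilder("run-processor").start();
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(p.getInputStream()));
            StringBuilder out = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line);
            }
            p.waitFor();
            // Break on the first run whose output doesn't start with the
            // expected XML, so the bad run can be inspected.
            if (!out.toString().startsWith(expected)) {
                System.out.println("Run " + run + " produced: " + out);
                break;
            }
        }
    }
}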

Here's the string when the output is correct:
<data><bar foo=1/><bar foo=2/></data>2/>


(There's the garbage at the end, which doesn't seem to cause any problems for our parser.)

And here's an incorrect string when the bug was finally caught in the act:
<data><bar fata><baroo=1/><bar foo=2/></data>2/></


Ahhh ... interesting. Now we're getting somewhere. Naturally, with that string, we'd only be able to parse out the foo=2, since somehow the foo=1 was corrupted. And I realized I'd been a bit short-sighted about the flaw in the buffer append.

Some more debug output to confirm my new thought:
            int bytesRead = data.read(buffer);
            System.out.println(bytesRead);


Here's the new console output when it works:
10
10
10
7
-1
<data><bar foo=1/><bar foo=2/></data>2/>


And when it doesn't:
10
2
10
10
5
-1
<data><bar fata><baroo=1/><bar foo=2/></data>2/></


Proof of my short-sightedness.

Earlier, I'd stated, “When the read method reaches the end of the data, the buffer isn't always going to be full.” True, but incomplete. The read method is never guaranteed to return the full buffer of data.

Here's a failing test to flush out the problem:
    public void testBufferedReadJagged() {
        ArrayList<String> reads = new ArrayList<String>();
        String data = "<data><bar foo=1/><bar foo=2/></data>";

        reads.add(data.substring(0, 10));
        reads.add(data.substring(10, 12));
        reads.add(data.substring(12, 22));
        reads.add(data.substring(22, 32));
        reads.add(data.substring(32, data.length()));

        MockProvider mockProvider = new MockProvider(reads);
        Processor processor = new Processor(mockProvider);
        processor.execute();
        assertEquals("1,2", processor.getDataRead());
    }


The production code in question was reading data over HTTP. Apparently, the Java code never looped over the read calls fast enough to catch a less-than-full buffer in the middle of a run, but the IKVM'd code running on .NET occasionally would.
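
For completeness, the fix follows directly from that failing test: append only the bytes actually read on each call, rather than the whole buffer. Something along these lines (a sketch of the shape of the change, not necessarily the exact code we shipped); it also takes care of the harmless garbage at the end:

    public void execute() {
        byte[] buffer = new byte[10];
        StringBuffer stringData = new StringBuffer();
        while (true) {
            int bytesRead = data.read(buffer);
            if (bytesRead == -1) {
                break;
            }
            // Only the first bytesRead bytes are valid for this read; the
            // rest of the buffer is leftover from earlier calls.
            stringData.append(new String(buffer, 0, bytesRead));
        }
        processXML(stringData.toString());
    }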


** We had a heisenbug. The reason we could never catch this in the debugger was that stepping through slowed down the loop calling read over and over, so there was always a full packet of data waiting to be read by the time we reached that line of code.

tags: ComputersAndTechnology