TheCaseOfTheMissingCertificate

November 15, 2010
At a client recently I was asked to help troubleshoot a problem in a .NET app that they could only recreate in load testing. The original problem only happened occasionally while trying to load an X.509 certificate from the Windows Certificate store. The exception thrown complained that it simply couldn't find the certificate in question. The certificate was obviously there and everything worked fine on the same box when running under a lighter load. Thinking they were perhaps expecting too much of the Cert store, they tried caching the cert to see if that fixed the problem.

It appeared to fix the first problem, but now a second exception was occurring: “Safe handle has been closed.” The person describing the problem to me wondered if there was some issue re-using the certificate now that it was cached and perhaps the instance needed to be cloned each time.

I wanted to see if I could recreate the problem on my dev box, and not just because the load test could only be run in a certain QA environment that took three promotions to reach. Taking pot shots at bugs is usually a waste of time, and sometimes dangerous if the attempt merely masks the symptoms without fixing the actual problem.

I had stack traces from both problems, and they were very similar. I decided to start by recreating the call as best I could from the top of the trace. It took a while, but I was able to set up an integration test that passed. Next I needed to throw some threads at it to see if that would cause the error to occur.

Spinning up a few threads to call the method under test is not a hard thing to do as long as you keep a couple of things in mind:

- join all threads so the test doesn't fall through before the threads themselves have finished.

- catch exceptions inside the threads and push them back to the main thread. Unhandled exceptions on threads crash things. The following will likely crash the test runner process:


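using System;
using System.Threading;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class UnhandledTestClass
{
    [TestMethod]
    public void UnhandledTest()
    {
        // Nothing catches this exception on the new thread, so it is unhandled.
        new Thread(() => { throw new Exception(); }).Start();
    }
}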


I created a simple helper class for this which reduced the test code down to this:


[TestMethod]
public void TestThreadSafe()
{
    // Call the loader from 10 threads at once and collect anything they throw.
    var runner = new ConcurrentMethodRunner(() => _loader.LoadFromThumbprint(ServerThumbprint), 10);
    runner.Execute();
    Assert.AreEqual(0, runner.Exceptions.Count, runner.CombinedExceptionStrings());
}


Here's the helper class:


using System;
using System.Collections.Generic;
using System.Text;
using System.Threading;

public class ConcurrentMethodRunner
{
    private readonly ThreadStart _threadStart;
    private readonly int _threadCount;
    private readonly List<Exception> _exceptions;

    public ConcurrentMethodRunner(ThreadStart threadStart, int threadCount)
    {
        _threadStart = threadStart;
        _threadCount = threadCount;
        _exceptions = new List<Exception>();
    }

    public void Execute()
    {
        var threads = new List<Thread>();
        for (int i = 0; i < _threadCount; i++)
        {
            var thread = new Thread(ExecuteThreadStart);
            threads.Add(thread);
            thread.Start();
        }

        // Join every thread so the test doesn't finish before the work does.
        foreach (var thread in threads)
        {
            thread.Join();
        }
    }

    private void ExecuteThreadStart()
    {
        try
        {
            _threadStart.Invoke();
        }
        catch (Exception ex)
        {
            // List<T> is not thread-safe, and several threads may fail at once.
            lock (_exceptions)
            {
                _exceptions.Add(ex);
            }
        }
    }

    public List<Exception> Exceptions { get { return _exceptions; } }

    public string CombinedExceptionStrings()
    {
        var sb = new StringBuilder();
        foreach (var exception in _exceptions)
        {
            sb.AppendLine(exception.ToString());
        }
        return sb.ToString();
    }
}


(This could probably be reduced even further with the Parallel class in the System.Threading.Tasks namespace in .NET 4.0.)
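As a rough idea of what a Parallel-based version might look like, here is a minimal sketch. ParallelMethodRunner and its Run method are hypothetical names, not part of the original helper. One caveat: Parallel.For schedules the calls on the thread pool and does not promise that all of them run at the same moment, so it may produce less contention than dedicated threads.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical alternative to ConcurrentMethodRunner, for illustration only.
public static class ParallelMethodRunner
{
    // Invokes the action callCount times in parallel and returns every
    // exception that was thrown. Parallel.For blocks until all iterations
    // have finished, so no explicit Join is needed.
    public static Exception[] Run(Action action, int callCount)
    {
        // ConcurrentQueue is safe to write to from several threads at once.
        var exceptions = new ConcurrentQueue<Exception>();

        Parallel.For(0, callCount, i =>
        {
            try
            {
                action();
            }
            catch (Exception ex)
            {
                exceptions.Enqueue(ex);
            }
        });

        return exceptions.ToArray();
    }
}
```

A test would then assert that the returned array is empty, in the same way the test above asserts on runner.Exceptions.Count.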

Keep in mind, proving thread safety in this way is not very reliable. If the test fails, it's likely a race condition exists; if it passes, you just have more testing to do. The number of threads being run, the number of cores on the machine and many other factors can influence race conditions.
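One factor that can be nudged is how closely the calls overlap. The sketch below is a variation on the helper, not something from the original test: each thread parks on a gate and all are released together once every thread has been started. GatedRunner is a hypothetical name, and this only improves the odds of a collision; it does not guarantee one.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical variation on the helper class, for illustration only.
public static class GatedRunner
{
    public static List<Exception> Run(Action action, int threadCount)
    {
        var exceptions = new List<Exception>();
        var threads = new List<Thread>();

        using (var gate = new ManualResetEventSlim(false))
        {
            for (int i = 0; i < threadCount; i++)
            {
                var thread = new Thread(() =>
                {
                    // Park here until the main thread opens the gate.
                    gate.Wait();
                    try
                    {
                        action();
                    }
                    catch (Exception ex)
                    {
                        // List<T> is not thread-safe, so guard the Add.
                        lock (exceptions)
                        {
                            exceptions.Add(ex);
                        }
                    }
                });
                threads.Add(thread);
                thread.Start();
            }

            // Every thread has been started; release them together.
            gate.Set();

            // Join before leaving the using block so the gate outlives its waiters.
            foreach (var thread in threads)
            {
                thread.Join();
            }
        }

        return exceptions;
    }
}
```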

Now that I was able to reproduce the problem on my own dev box, I was in a good position to dig deeper into the diagnosis. I ran the test several times, adjusting the number of threads until I found the smallest number that appeared to reproduce the failure reliably. This had the added benefit of revealing a new bit of information: occasionally both the original problem (can't find the certificate) and the subsequent one (safe handle is closed) would occur. It turns out caching the certificate added the second exception to the mix without fixing the first one.

The safe handle problem occurred more frequently, and I had a feeling it would be simpler to unravel. The exception was being thrown inside a helper class that decrypted text using the private key of the loaded certificate (here's a simplified version):


using System;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;
using System.Text;

namespace CLabs.Certificate
{
    public class CertificateRsaCrypter
    {
        private readonly X509Certificate2 _certificate;

        public CertificateRsaCrypter(X509Certificate2 certificate)
        {
            _certificate = certificate;
        }

        public string Decrypt(byte[] encrypted)
        {
            byte[] bytes;
            using (var algorithm = (RSACryptoServiceProvider) _certificate.PrivateKey)
            {
                bytes = algorithm.Decrypt(encrypted, false);
            }
            var decoded = Encoding.ASCII.GetString(bytes);
            return decoded;
        }
    }
}
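The post stops before spelling out a diagnosis, so what follows is only one reading that fits the "Safe handle has been closed" message, not a confirmed conclusion: if the PrivateKey property hands back an object that the certificate itself holds on to, then a using block around it disposes something a cached certificate will hand out again on the next call. The sketch below shows that pattern with stand-in types. FakeKey, FakeCertificate and DisposeDemo are all hypothetical, and no real certificate or crypto API is involved.

```csharp
using System;

// Stand-in for a key object that wraps a native handle (hypothetical).
public sealed class FakeKey : IDisposable
{
    private bool _disposed;

    public string Decrypt(string input)
    {
        if (_disposed)
        {
            throw new ObjectDisposedException("FakeKey", "Safe handle has been closed");
        }
        return input.ToUpperInvariant();
    }

    public void Dispose()
    {
        _disposed = true;
    }
}

// Stand-in for a certificate that always returns the same key instance (hypothetical).
public sealed class FakeCertificate
{
    private readonly FakeKey _key = new FakeKey();

    public FakeKey PrivateKey { get { return _key; } }
}

public static class DisposeDemo
{
    // Mirrors the shape of Decrypt above: the using block disposes a key
    // that the shared certificate still owns.
    public static string DecryptAndDispose(FakeCertificate certificate, string input)
    {
        using (var key = certificate.PrivateKey)
        {
            return key.Decrypt(input);
        }
    }

    // Leaves the key's lifetime to the certificate that owns it.
    public static string DecryptWithoutDispose(FakeCertificate certificate, string input)
    {
        return certificate.PrivateKey.Decrypt(input);
    }
}
```

With a single shared FakeCertificate, the first call to DecryptAndDispose succeeds and the second throws ObjectDisposedException, while DecryptWithoutDispose can be called repeatedly.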





[The integration test depended on magic data, namely a particular certificate installed on the machine. What I needed was a self-contained unit test.]
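One way such a self-contained test could avoid the installed certificate is to build a throwaway certificate in memory. The sketch below assumes a much newer runtime than the one this post was written against: the CertificateRequest class is only available in .NET Framework 4.7.2 and later and in .NET Core 2.0 and later. InMemoryCertificateDemo and the "CN=UnitTestOnly" subject are hypothetical names chosen for illustration.

```csharp
using System;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;
using System.Text;

// Hypothetical helper, for illustration only.
public static class InMemoryCertificateDemo
{
    // Creates a self-signed certificate in memory, encrypts the text with its
    // public key and decrypts it again with its private key.
    public static string RoundTrip(string plainText)
    {
        using (var rsa = RSA.Create(2048))
        {
            var request = new CertificateRequest(
                "CN=UnitTestOnly", rsa, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);
            var now = DateTimeOffset.UtcNow;

            using (var certificate = request.CreateSelfSigned(now.AddMinutes(-5), now.AddDays(1)))
            using (var publicKey = certificate.GetRSAPublicKey())
            using (var privateKey = certificate.GetRSAPrivateKey())
            {
                // GetRSAPublicKey and GetRSAPrivateKey return new objects on
                // each call, so disposing them here is the caller's job.
                byte[] encrypted = publicKey.Encrypt(
                    Encoding.ASCII.GetBytes(plainText), RSAEncryptionPadding.Pkcs1);
                byte[] decrypted = privateKey.Decrypt(encrypted, RSAEncryptionPadding.Pkcs1);
                return Encoding.ASCII.GetString(decrypted);
            }
        }
    }
}
```

Nothing here touches the Windows certificate store, so a test built this way can run on any machine without setup.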






Lessons:
- Diagnose your problems thoroughly. (chen link)
- Don't be cowed by a strange problem. See if you can recreate it.
- Unit testing threads and UIs may not be straightforward, but it's not as difficult as it may seem.
- Environments that are hard to get to could perhaps benefit from some continuous delivery.
- Once you've got a diagnosis, don't pile in a quick fix with an awkward test. Focus the test further, down to just the bug itself.

tags: ComputersAndTechnology