## Statistics of partial test runs

In software development, the number of test cases can grow dramatically, especially when you generate test cases automatically. It becomes a problem when the execution time of the test suite is the bottleneck of development. For example, suppose you have a user interface with 20 checkboxes. The number of test cases needed to cover all possible combinations is roughly one million (2^20 = 1,048,576). If one test takes one second, the whole test suite takes about 12 days to complete!
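The back-of-the-envelope arithmetic can be checked with a couple of lines of Ruby (the 20-checkbox count and one-second-per-test figure are the assumptions from the example above):

```ruby
# 20 independent checkboxes => 2^20 parameter combinations.
total_cases = 2 ** 20
seconds_per_test = 1

# Total runtime in days at one second per test.
days = total_cases * seconds_per_test / (60.0 * 60 * 24)
puts "#{total_cases} test cases, about #{days.round(1)} days"
```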

This problem led me to the idea of using statistics. Ultimately, we want to be able to say something like:

“I’ve run n test cases (where n is small enough to finish quickly) and the success rate was 98%. Therefore, the overall success rate is estimated to be 98% +/- 3% at the 95% confidence level.”

Each sampled test can be considered a Bernoulli trial where the probability p is the success rate of the overall test suite. The binomial proportion confidence interval is known to be:

$\hat{p} \pm z_{0.95} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$

where $\hat{p}$ is the proportion of successes in the sample and $n$ is the sample size (see: binomial proportion confidence interval).

For example, suppose you pick 100 test cases randomly from the test suite, and the success rate is 80% (80 of the 100 cases passed). You can say the estimated success rate of the whole test suite, at the 95% confidence level, is:

$0.8 \pm z_{0.95} \sqrt{\frac{0.8(1 - 0.8)}{100}}$
$= 0.8 \pm 0.0784$
$= 80\% \pm 7.8\%$
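The same arithmetic can be wrapped in a small Ruby helper (a sketch; the function name is mine, not from any library):

```ruby
# Wald (normal approximation) confidence interval for a binomial proportion.
# Returns [point estimate, error range] for the given z value (1.96 ~ 95%).
def wald_interval(successes, n, z = 1.96)
  p_hat = successes.to_f / n
  e = z * Math.sqrt(p_hat * (1 - p_hat) / n)
  [p_hat, e]
end

p_hat, e = wald_interval(80, 100)
puts "#{(p_hat * 100).round(1)}% +/- #{(e * 100).round(1)}%"
```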

Now, does it really work? (Actually, it doesn’t. Read on!)

To test this, I wrote a simple Ruby script. It plots the estimated success rate and the error range when the actual success rate is 95% and the sample size is 1000. The script repeats the experiment 64 times.

```ruby
SAMPLE_SIZE = 1000
PROPORTION = 0.95
TRIAL_SIZE = 64

(1..TRIAL_SIZE).each {|trial|
  success = 0
  (1..SAMPLE_SIZE).each {
    if rand() < PROPORTION
      success += 1
    end
  }

  p = success.to_f / SAMPLE_SIZE
  e = 1.96 * Math.sqrt(p * (1 - p) / SAMPLE_SIZE)
  puts "#{trial} #{p} #{p - e} #{p + e}"
}
```


Here is the plot. You can see that most of the intervals correctly cover the true success rate (95%). This is expected: with a 95% confidence level, about 95% of the estimates should contain the true value.

Let’s verify the confidence level. Increase the number of trials for better accuracy, and record how many of the estimated intervals contain the true value to show the overall estimation accuracy.

```ruby
SAMPLE_SIZE = 1000
PROPORTION = 0.95
TRIAL_SIZE = 1000

hit = 0
(1..TRIAL_SIZE).each {|trial|
  success = 0
  (1..SAMPLE_SIZE).each {
    if rand() < PROPORTION
      success += 1
    end
  }

  p = success.to_f / SAMPLE_SIZE
  e = 1.96 * Math.sqrt(p * (1 - p) / SAMPLE_SIZE)
  if (p - e < PROPORTION) and (PROPORTION < p + e)
    hit += 1
  end
}
puts "Successful Estimate: #{hit} / #{TRIAL_SIZE} = #{hit * 100.0 / TRIAL_SIZE} %"
```


Here are the results of three runs. They meet the expectation (95% confidence level).

```
Successful Estimate: 950 / 1000 = 95.0 %
Successful Estimate: 947 / 1000 = 94.7 %
Successful Estimate: 951 / 1000 = 95.1 %
```

So far so good. However, a 95% test success rate is pretty bad from a product quality perspective. At the end of product development, almost all tests should pass (say, 99.9%). Let’s see how the script estimates the success rate in that situation.

```ruby
SAMPLE_SIZE = 1000
PROPORTION = 0.999
TRIAL_SIZE = 1000

hit = 0
(1..TRIAL_SIZE).each {|trial|
  success = 0
  (1..SAMPLE_SIZE).each {
    if rand() < PROPORTION
      success += 1
    end
  }

  p = success.to_f / SAMPLE_SIZE
  e = 1.96 * Math.sqrt(p * (1 - p) / SAMPLE_SIZE)
  if (p - e < PROPORTION) and (PROPORTION < p + e)
    hit += 1
  end
}
puts "Successful Estimate: #{hit} / #{TRIAL_SIZE} = #{hit * 100.0 / TRIAL_SIZE} %"
```


Three runs give pretty bad results.

```
Successful Estimate: 653 / 1000 = 65.3 %
Successful Estimate: 622 / 1000 = 62.2 %
Successful Estimate: 636 / 1000 = 63.6 %
```

The plot shows that the estimated error range is sometimes zero: when every sampled test passes, $\hat{p} = 1$ and the error term vanishes, so the interval collapses to a single point and can never contain the true value.
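How often this happens is easy to quantify. With a true success rate of 0.999 and 1000 samples, the probability that every sampled test passes (so the interval collapses) is 0.999^1000, roughly 37% — which lines up with the roughly 62–65% hit rates above, since those collapsed intervals always miss. A quick check:

```ruby
# Probability that all 1000 sampled tests pass when the true rate is 0.999,
# i.e. the fraction of trials where the Wald interval has zero width.
p_all_pass = 0.999 ** 1000
puts "P(all pass) = #{(p_all_pass * 100).round(1)} %"
```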

The problem is that this formula for the binomial proportion confidence interval is not suitable when p is extreme (very close to 0 or 1). We can use a better formula: the Agresti-Coull interval.

$\tilde{n} = n + z^2_{0.95}$
$\tilde{p} = \frac{X + z^2_{0.95} / 2}{\tilde{n}}$
$\tilde{p} \pm z_{0.95} \sqrt{\frac{\tilde{p}(1-\tilde{p})}{\tilde{n}}}$

Let’s see the plot for the Agresti-Coull formula. The Ruby script looks like this.

```ruby
SAMPLE_SIZE = 1000
PROPORTION = 0.999
TRIAL_SIZE = 64

(1..TRIAL_SIZE).each {|trial|
  success = 0
  (1..SAMPLE_SIZE).each {
    if rand() < PROPORTION
      success += 1
    end
  }

  nt = SAMPLE_SIZE + 1.96 ** 2
  p = (success + (1.96 ** 2) / 2) / nt
  e = 1.96 * Math.sqrt(p * (1 - p) / nt)
  puts "#{trial} #{p} #{p - e} #{p + e}"
}
```


Here is the plot. You can see that the majority of the estimates accurately cover the real success rate.

The actual rates of successful estimates over three runs are shown here:

```
Successful Estimate: 975 / 1000 = 97.5 %
Successful Estimate: 988 / 1000 = 98.8 %
Successful Estimate: 982 / 1000 = 98.2 %
```

To make sure the Agresti-Coull formula also works when the success rate is low (30%), here is a plot of both the simple estimate (green) and Agresti-Coull (red) over 1000 trials. You can see both results match pretty well.
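The closeness at moderate rates can also be spot-checked directly. This sketch assumes a sample of 1000 tests with 300 passes (the z value 1.96 is the same one used throughout):

```ruby
n, successes, z = 1000, 300, 1.96

# Simple (Wald) interval
p_hat = successes.to_f / n
e_wald = z * Math.sqrt(p_hat * (1 - p_hat) / n)

# Agresti-Coull interval
nt = n + z ** 2
pt = (successes + (z ** 2) / 2) / nt
e_ac = z * Math.sqrt(pt * (1 - pt) / nt)

puts "Wald:          #{p_hat.round(4)} +/- #{e_wald.round(4)}"
puts "Agresti-Coull: #{pt.round(4)} +/- #{e_ac.round(4)}"
```

Both point estimates and both error ranges agree to about three decimal places, so far from 0 and 1 the simpler formula is perfectly serviceable.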

In conclusion, here is the Ruby code that meets the original goal. Given many test cases, you randomly pick N tests and observe that SUCCESS of them pass. The estimated overall success rate at the 95% confidence level is:

```ruby
N = ...       # Number of test cases actually performed
SUCCESS = ... # Number of successful test cases

nt = N + 1.96 ** 2
pt = (SUCCESS + (1.96 ** 2) / 2) / nt
e = 1.96 * Math.sqrt(pt * (1 - pt) / nt)

puts "The estimated success rate is #{pt * 100}% with error range +/- #{e * 100}% at the confidence level 95%"
```
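Packaged as a reusable method for convenience (the method name and return shape are my own choice, not from the original snippet):

```ruby
# Agresti-Coull estimate of the overall success rate.
# Returns [estimated rate, error range] at ~95% confidence for z = 1.96.
def estimate_success_rate(n, successes, z = 1.96)
  nt = n + z ** 2
  pt = (successes + (z ** 2) / 2) / nt
  e = z * Math.sqrt(pt * (1 - pt) / nt)
  [pt, e]
end

# Example: 999 of 1000 sampled tests passed.
rate, err = estimate_success_rate(1000, 999)
puts "Estimated success rate: #{(rate * 100).round(2)}% +/- #{(err * 100).round(2)}%"
```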