Minimum sample size for 'n' allowed errors, given required accuracy and confidence (statistics)

SerenityNetworks · Oct 19, 2017

I use the workbook referenced below to provide me with the minimum sample sizes needed for a range of allowed errors, given a required accuracy and required confidence.

In a roundabout way, I fill down the BINOMDIST function in a column to get the probability/confidence that my accuracy requirement will be met for a given sample size and number of errors. I can then look up the sample size at which the probability/confidence I'm looking for has been met (by the BINOMDIST formula).

The solution works perfectly. The results have been validated many times. However, it produces a huge (100+ MB) Excel file that is slow to calculate. In the example file provided I'm only giving 5 columns of formulas that are filled down 200 rows. To accommodate the ranges I need to work with, I must fill down 50k or more rows and across 75 or more columns.

I'm wondering if anyone can show me a more elegant method to accomplish my goal. I've tried, but haven't been able to get away from using a massive number of columns and rows. It seems there should be a single formula that will do what I need, but I'm not mathematically inclined enough to be able to figure it out. Any help will be appreciated.

Thanks in advance,
Andrew

Dropbox link to workbook directory

or link to workbook directly if above doesn't work.

shg · Oct 19, 2017

First, I'd like to say that this is not the right solution to your problem; there is surely a closed-form solution that would be much, much faster. If you go ask on math.stackexchange.com, someone will likely give you the answer in minutes, and if you link to it, someone here can help you implement it in Excel. (I suggest you compose the question very carefully and thoroughly; they are not likely to go look at your workbook.)

That said,

	A	B	C
1	Accuracy	90%
2	Confidence	80%
3
4	Allowed Errors	Min Samples
5	0	16	B5: {=MATCH(TRUE, 1 - BINOMDIST($A5, RowVec(1, 50000), 1-$B$1, TRUE) >= $B$2, 0)}
6	1	29
7	2	42
8	3	54
9	4	66
10	5	78
11	6	90
12	7	101
13	8	113
14	9	124
15	10	135
465	460	4,780
466	461	4,791
467	462	4,801
468	4943	50,000

The (brutal) formula in B5 does exactly what your workbook does without the intermediate data. It uses a UDF (RowVec) just to generate a constant array:

Code:

Function RowVec(iBeg As Long, iEnd As Long, _
                Optional ByVal iStep As Long = 1&) As Variant

  ' shg 2006
  ' Returns a 1-based, 1D array from iBeg to iEnd stepping iStep

  Dim nOut          As Long
  Dim aiOut()       As Long
  Dim iOut          As Long

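  If iStep <> 0 Then
    ' Point the step from iBeg toward iEnd. When the two are equal, leave it
    ' alone so the division below cannot hit a zero step.
    If iEnd <> iBeg Then iStep = Sgn(iEnd - iBeg) * Abs(iStep)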

    nOut = (iEnd - iBeg) \ iStep + 1
    ReDim aiOut(1 To nOut)
    aiOut(1) = iBeg

    For iOut = 2 To nOut
      aiOut(iOut) = aiOut(iOut - 1) + iStep
    Next iOut

    RowVec = aiOut
  End If
End Function

The UDF avoids the ROW(INDIRECT("1:50000")) construct, which would make the formula volatile, and you seriously do not want that.
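For the zero-errors row only, the closed form is easy to write down, because BINOMDIST(0, n, 1 - accuracy, TRUE) is just accuracy ^ n. A minimal sketch, assuming both inputs lie strictly between 0 and 1; the function name MinSamplesNoErrors is made up for illustration, and it does not extend to one or more allowed errors:

```vba
Function MinSamplesNoErrors(Accuracy As Double, Conf As Double) As Long
  ' Hypothetical helper, zero allowed errors only.
  ' P(no errors in n samples) = Accuracy ^ n, so the requirement
  '   1 - Accuracy ^ n >= Conf
  ' rearranges to
  '   n >= Log(1 - Conf) / Log(Accuracy)
  ' (both logs are negative, so dividing flips the inequality).
  ' Assumes 0 < Accuracy < 1 and 0 < Conf < 1. Log is the natural log in VBA.
  Dim x As Double

  x = Log(1 - Conf) / Log(Accuracy)
  MinSamplesNoErrors = -Int(-x)   ' round up; VBA has no built-in ceiling
End Function
```

With the table's inputs, =MinSamplesNoErrors($B$1, $B$2) works out to Log(0.2) / Log(0.9), about 15.3, which rounds up to the 16 shown in B5.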

SerenityNetworks · Oct 19, 2017

Thank you very much. Your UDF helps a lot, even if no other solutions are provided. But I will definitely check out math.stackexchange.com. I'm aware that the brute force solution I created is not ideal. But it works and believe it or not has been a great improvement over previous tools I was given to use.

I'll post back what I find out.

Thanks again,
Andrew

SerenityNetworks · Oct 20, 2017

I haven't received a reply yet at stackexchange, but I'm still trying to understand your UDF.

The array formula in the worksheet is pretty straightforward. But in the UDF I'm not seeing how iBeg is any value other than 1. Obviously it increments, but how? If iBeg increments then why not iEnd? How does iStep increment? I'm not following.

In the same vein, it would seem that I could increase the array from 1,50000 to 1,70000 if I wanted to increase the permissible range of allowed errors. That's not the case. Changing 50k to 70k actually decreases the range from 4943 to 428. I'm clueless as to why.

Would you mind helping me in understanding the UDF?

Thanks,
Andrew

shg · Oct 20, 2017

The UDF is brainless -- RowVec(3,7) returns the sequence {3,4,5,6,7}
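A small test routine makes this visible in the Immediate window. This is only a sketch: it assumes RowVec from the earlier post is in the same VBA project, and the name DemoRowVec is made up for illustration.

```vba
Sub DemoRowVec()
  ' Assumes the RowVec function posted earlier is available in this project.
  Dim v As Variant
  Dim i As Long
  Dim s As String

  v = RowVec(3, 7)                  ' iStep is omitted, so it defaults to 1

  ' Join() does not accept an array of Longs, so build the text by hand
  For i = LBound(v) To UBound(v)
    s = s & v(i) & " "
  Next i

  Debug.Print Trim$(s)              ' prints: 3 4 5 6 7
End Sub
```

Nothing in the function increments iBeg or iEnd. They are only the first and last values of the list, and iStep is the fixed gap between neighbours.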

shg · Oct 20, 2017

Post a link to your stackexchange question?

SerenityNetworks · Oct 20, 2017

shg said:
The UDF is brainless -- RowVec(3,7) returns the sequence {3,4,5,6,7}

Aaaaah. Okay. Now I can follow - I think.

Still, I'm at a loss as to why it throws #N/A when I enter larger values. If RowVec(1,50000) returns {1,2,3,...49999,50000} then why do I get #N/A if I use RowVec(1,70000)?

For example, I enter an accuracy of 99.95% with a confidence of 80%. I use RowVec(1,65536). I start in row 6 with 23. It makes sense that I get #N/A at Allowed Errors of 28 as the minimum sample would be 66814 and that is higher than my defined sequence. But if I enter RowVec(1,66815) or even RowVec(1,70000) then I get #N/A. I am not understanding why.

My post at stackexchange is here.

shg · Oct 20, 2017

Plan B:

	A	B	C
1	Accuracy	99.00%
2	Confidence	80.00%
3
4	Allowed Errors	Min Samples
5	0	161	B5: =NumTrials(A5, $B$2, 1-$B$1)
6	1	299
7	2	427
8	5	790
9	10	1,364
10	20	2,471
11	50	5,686
12	100	10,930
13	200	21,277
14	500	51,964
15	1,000	102,739
16	2,000	203,836
17	5,000	506,012
18	10,000	1,008,465

Code:

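Function NumTrials(numSucc As Long, Conf As Double, p As Double) As Long
  ' Returns the smallest number of trials n for which
  ' 1 - Binom_Dist(numSucc, n, p, True) >= Conf.
  ' numSucc is the allowed error count and p the per-sample error rate.
  Dim cdf           As Double   ' cumulative distribution function
  Dim n             As Long     ' trials

  n = numSucc   ' with no more trials than allowed errors, the test cannot pass

  With WorksheetFunction
    Do
      cdf = .Binom_Dist(numSucc, n, p, True)
      n = n + 1
    Loop While 1 - cdf < Conf
  End With

  NumTrials = n - 1   ' undo the increment made after the last test
End Function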

The good news is, it only calculates as many trials as it needs. The bad news is, worksheet functions called from VBA are not as fast as when called from formulas.
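Because the confidence 1 - Binom_Dist(k, n, p, True) never decreases as n grows, the one-at-a-time scan in NumTrials could in principle be swapped for a doubling search followed by bisection, which needs far fewer worksheet-function calls for large error counts. A minimal sketch, assuming 0 < p < 1 and 0 < Conf < 1; the name NumTrialsBisect and its variables are made up for illustration, and it has not been checked against the table above:

```vba
Function NumTrialsBisect(k As Long, Conf As Double, p As Double) As Long
  ' Hypothetical variant of NumTrials. Assumes 0 < p < 1 and 0 < Conf < 1,
  ' otherwise the doubling loop may never end.
  Dim nLo           As Long     ' largest n known to fail the test
  Dim nHi           As Long     ' smallest n known to pass the test
  Dim nMid          As Long

  With WorksheetFunction
    nLo = k            ' F(k; k, p) = 1, so the confidence here is 0
    nHi = k + 1

    ' Double nHi until the confidence requirement is met
    Do While 1 - .Binom_Dist(k, nHi, p, True) < Conf
      nLo = nHi
      nHi = 2 * nHi
    Loop

    ' Bisect: nLo always fails, nHi always passes
    Do While nHi - nLo > 1
      nMid = nLo + (nHi - nLo) \ 2
      If 1 - .Binom_Dist(k, nMid, p, True) >= Conf Then
        nHi = nMid
      Else
        nLo = nMid
      End If
    Loop
  End With

  NumTrialsBisect = nHi
End Function
```

It would be called the same way as NumTrials, for example =NumTrialsBisect(A5, $B$2, 1-$B$1).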

I still feel certain there's a mathematical simplification, like the one user Dap gave me at https://math.stackexchange.com/questions/2446752/erlang-c-for-large-numbers.

SerenityNetworks · Oct 21, 2017

Excellent! And it's even a UDF I can easily follow. Thank you very much. The speed is not an issue. With the ability to apply it to just the allowed errors I desire to use, I usually won't need to use it more than four times in a workbook. Eight uses in a workbook would be the maximum.

Did you see the update I made to my post on stackexchange? It contains the original function used by the online tool I employed in validating my Excel tool. I've just not been able to reduce it to where I could apply it successfully. Forty plus years ago it was within my ability, but not today. If you can apply it that would be cool. Ultimately, I'd like to (1) calculate minimum sample size given allowed errors, required accuracy, and required confidence, (2) calculate the confidence level given the count of errors, required accuracy, and sample size, and (3) calculate the number of errors allowed given the sample size, required accuracy, and required confidence. All of that should be able to be derived from the function I provided. I just don't know how.
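Going only by the condition used earlier in this thread (confidence = 1 - BINOMDIST(errors, samples, 1 - accuracy, TRUE)), and not by the function from the stackexchange post, items (2) and (3) can be sketched directly. The names ConfLevel and MaxErrors are hypothetical, p is the error rate (1 - accuracy), and the code assumes 0 <= k <= n:

```vba
Function ConfLevel(k As Long, n As Long, p As Double) As Double
  ' Item (2): confidence for k errors in n samples at error rate p.
  ' Assumes 0 <= k <= n and 0 <= p <= 1.
  ConfLevel = 1 - WorksheetFunction.Binom_Dist(k, n, p, True)
End Function

Function MaxErrors(n As Long, Conf As Double, p As Double) As Long
  ' Item (3): largest error count k whose confidence is still >= Conf.
  ' Returns -1 if even zero errors does not reach Conf with n samples.
  Dim k As Long

  k = -1
  Do While k < n
    ' Confidence only falls as k rises, so stop at the first failure
    If ConfLevel(k + 1, n, p) < Conf Then Exit Do
    k = k + 1
  Loop

  MaxErrors = k
End Function
```

As a sanity check against the first table (accuracy 90%, confidence 80%), MaxErrors(29, 0.8, 0.1) should come out as 1, which agrees with 29 being the minimum sample for one allowed error. Item (1) is what NumTrials already does.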

Thanks again,
Andrew

shg · Oct 21, 2017

Glad it helps.

Yes, I did see the update. I posted a question of my own at https://math.stackexchange.com/q/2483017/ for a recurrence relation for the binomial CDF, and got one. I was excited because the UDF runs about 60 times faster than the last one:

Code:

Function NumTrials2(k As Long, Conf As Double, p As Double) As Long
  Dim cdf           As Double   ' cumulative distribution function
  Dim pmf           As Double
  Dim n             As Long     ' trials

  n = k
  cdf = 1#      ' F(k, k, p)
  pmf = p ^ k   ' f(k, k, p)

  Do
    ' F(k, n+1, p) = F(k, n, p) - p * f(k, n, p)
    cdf = cdf - p * pmf
    ' pmf(k, n+1, p) via recurrence relation
    pmf = (n + 1) / (n + 1 - k) * (1 - p) * pmf
    If pmf = 0 Then Stop
    n = n + 1
  Loop While 1 - cdf < Conf

  NumTrials2 = n
End Function

It gives the same answers up to a point, but dies when the PMF falls below the level at which it can be subtracted from the CDF.
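One plausible reading of the failure is that the starting value p ^ k underflows to zero for large k, after which the recurrence can never recover. Carrying the logarithm of the PMF instead keeps that starting value representable. This is only a sketch of the idea, assuming 0 < p < 1 and 0 < Conf < 1; the name NumTrialsLog is made up, and the result has not been validated against the NumTrials table:

```vba
Function NumTrialsLog(k As Long, Conf As Double, p As Double) As Long
  ' Hypothetical variant of NumTrials2 that tracks Log(pmf) rather than pmf.
  ' Assumes 0 < p < 1 and 0 < Conf < 1.
  Dim cdf           As Double   ' F(k, n, p)
  Dim logPmf        As Double   ' natural log of f(k, n, p)
  Dim n             As Long     ' trials

  n = k
  cdf = 1#               ' F(k, k, p)
  logPmf = k * Log(p)    ' Log of p ^ k; stays finite even when p ^ k is 0

  Do
    ' F(k, n+1, p) = F(k, n, p) - p * f(k, n, p)
    ' Exp of a very negative number is 0, so tiny terms simply drop out
    cdf = cdf - p * Exp(logPmf)
    ' Same recurrence as NumTrials2, with the product turned into a sum
    logPmf = logPmf + Log((n + 1) / (n + 1 - k)) + Log(1 - p)
    n = n + 1
  Loop While 1 - cdf < Conf

  NumTrialsLog = n
End Function
```

Rounding error in logPmf accumulates over many iterations, so answers that sit right on the confidence boundary are worth cross-checking against NumTrials.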
