Thanks:  0
Likes:  0

# Thread: random sampling simulation model problem/challenge

1. I've got a simulation model that I'm using to simulate detection of ones (i.e., positives) from a sample taken from a population. For the sake of the discussion, please assume that the detection tool is 100% sensitive (i.e., if there is a one (positive) present in the sample, the detection tool will know it) and 100% specific (i.e., it won't call any of the points in the sample positive if they are not really a one). Sorry if that's a bit confusing, but if you want to understand sensitivity and specificity better, I can suggest some epidemiology texts to look at.

First, I have a sheet that creates a column of 0's or 1's based on two user inputs (N=population size, p=expected prevalence of 1's). For example, if N=20 and p=5% then the column contains 1 one and 19 zeros.

Second, I have another sheet in the workbook that uses RANDBETWEEN in conjunction with HLOOKUP, AND, & IF to randomly select & display the contents of one of the 20 cells from the first sheet, something like this:

=IF(AND(\$C27<>"",D\$7<>""),HLOOKUP(\$D\$4,\$DC\$7:\$DC\$27,RANDBETWEEN(2,\$D\$2+1) ,TRUE),"")

Where:
C27 = the 20th member of the population of 20
D7 = the 1st run in my set of "X" runs
D4 = the percent of cells from my population (of 20, in this case) in the first sheet that contain a "1" (5% or 1 of 20 cells, in this case)
DC7:DC27 = the range containing the 19 zeros and 1 one for my example population of 20
D2 = the size of my population (20, in this case)

I'm trying to simulate detection of ones (i.e., a "positive", where positive=1 and negative=0) with this model, and using the model to, among other things, demonstrate the relationship between the confidence for detection and: [1] sample size, [2] population size, and [3] true prevalence of positives (ones) in that population.

Excel seems to use a "sample with replacement" method, which can result in the same cell(s) sometimes being sampled multiple times in the same simulation run. When I use a percent of "positives" greater than around 5%-10% and/or when the population and sample size are large (500 & 500 or higher), this isn't much of a problem, and the model stays right on the expected detecton (of ones, or positives) frequency (i.e., "confidence", or probability of detection) described in the statistical sample size tables (like those described by Cannon and Roe, 1982). However, when I use a sample size equal to the size of the entire population at smaller populations (300 & 300 or smaller) and especially where I use relatively low prevalence values (2% or less), I don't get the expected confidence probabilities (again, we're assuming 100% sensitive detection tool).

For example, I ran a simulation of 1000 replicates using the model, and the frequency of sampling the same cell containing the "1" out of the 20 in the example population (where the other 19 cells were zero) was:

1 = 371 of 1000
2 = 214 of 1000
3 = 64 of 1000
4 = 14 of 1000
5 = 1 of 1000
6 = 0 of 1000
7 = 1 of 1000

This problem occurs because Excel seems to use the "sample with replacement" method using the above equation and sheet structure. Using a "sample without replacement" method would resolve this problem, and what I'm looking for is a way to do "sampling without replacement" using Excel equations & functions, or macros/VBA.

I'd appreciate any wisdom from anyone on how to get Excel set up to do "sampling without replacement".

Dale

2. Hi Dale,

Posted below are two user-defined functions created by Dave Braden, an Excel MVP many times over, for sampling without replacement.

The posts can be found at this link:

Dave is awesome. He is leading the charge to fix the statistical functions in the Analysis ToolPak. He, together with Jerry W. Lewis, have posted some amazing code. (I worked on the Hypergeometric function for them, but put it away for awhile.)

Anyway, please post back if this isn't what you need.

First function uses a sheet range. Second is more flexible. Please see Dave's instructions at the end.

-------------------------------
Function HGSample(data As Range, ByVal SampleSize As Integer) As Variant
' Returns a hypergeometric sample of Samplesize from a range, assumed to be columnar
'Written 1999/4/1 David J. Braden tmy@fiastl.net

Dim hiP1 As Long, i As Long, j As Long
Dim ret() As Variant, temp As Variant

'Application.Volatile 'to make this a variate generator, uncomment this line

temp = data
hiP1 = data.Rows.Count + 1
If SampleSize > UBound(temp) Then SampleSize = UBound(temp)
ReDim ret(1 To SampleSize, 1 To 1)
For i = 1 To SampleSize
j = i + Int(Rnd * (hiP1 - i))
ret(i, 1) = temp(j, 1): temp(j, 1) = temp(i, 1)
Next i
HGSample = ret
End Function

Function HGSample2a(ByVal lo As Long, ByVal hi As Long, ByVal SampleSize As Long) As Variant
'Returns a sample, without replacement, of SampleSize from the range lo to hi
'Written 1999/7/7 David J. Braden tmy@fiastl.net

Dim hiP1 As Long, i As Long, j As Long
Dim ret() As Variant, temp As Variant

Application.Volatile 'to make this a variate generator, uncomment this line

'The following allows more flexible specification of the support.
If lo > hi Then temp = lo: lo = hi - 1: hi = temp Else lo = lo - 1

ReDim temp(1 To hi - lo)
For i = hi - lo To 1 Step -1
temp(i) = i
Next
hiP1 = UBound(temp) + 1
If SampleSize > UBound(temp) Then SampleSize = UBound(temp)
ReDim ret(1 To SampleSize, 1 To 1)
For i = 1 To SampleSize
j = i + Int(Rnd * (hiP1 - i))
ret(i, 1) = temp(j) + lo: temp(j) = temp(i)
Next i
HGSample2a = ret
End Function

'As with the earlier function, this returns an array, so you need to do a
'ctrl-shift-enter, no matter how you use it (in another function, or for explicit
'display of the sample, or whatever). To see a draw, select B1:B25, array-enter
'=HGsample2a(1,50,25), and you have it. This function allows flexible
'specification of the set of integers being sampled from. It could be more
'robust still by specifying lo and hi as variants, then checking that they are in
'fact integers, but I'll leave that overhead to others.
---------------------------------

HTH,
Jay

3. Thanks Jay.

I'll digest this and let you know. Appreciate the help.

Dale

4. I set up the HGSample & HGSample2a functions, debugged, and compiled.

When I use the HGSample function =HGSample(data,samplesize) I get duplication of values, which was expected.

But when I use the HGSample2a function =HGSample2a(lo,hi,samplesize) I still get duplication.

I've rechecked the code and am pretty sure I got it plugged in right -

Here is the HGSample code I put in the function pasted right from the edit screen:

Function HGSample(Data As Range, ByVal SampleSize As Integer) As Variant
'Written 4/1/99 (no fooling!) ny David J. Braden
'Plz retain the following comment:
' Posted to microsoft.public.excel.misc
Dim hiP1 As Long, i As Long, j As Long
Dim ret() As Variant, temp As Variant

'Application.Volatile 'to make this a variate generator, uncomment this line

temp = Data
hiP1 = Data.Rows.Count + 1
SampleSize = Application.Min(Data.Rows.Count, SampleSize)
ReDim ret(1 To SampleSize, 1 To 1)
For i = 1 To SampleSize
j = i + Int(Rnd * (hiP1 - i))
ret(i, 1) = temp(j, 1): temp(j, 1) = temp(i, 1)
Next i
HGSample = ret

End Function

Here is the HGSample2a code I put in the function pasted right from the edit screen:

Function HGSample2a(ByVal lo As Long, ByVal hi As Long, ByVal SampleSize As Long) As Variant
'Returns a sample, without replacement, of SampleSize from the range lo to hi
'Written 1999/7/7 David J. Braden tmy@fiastl.net

Dim hiP1 As Long, i As Long, j As Long
Dim ret() As Variant, temp As Variant

Application.Volatile 'to make this a variate generator, uncomment this line

'The following allows more flexible specification of the support.
If lo > hi Then temp = lo: lo = hi - 1: hi = temp Else lo = lo - 1

ReDim temp(1 To hi - lo)
For i = hi - lo To 1 Step -1
temp(i) = i
Next
hiP1 = UBound(temp) + 1
If SampleSize > UBound(temp) Then SampleSize = UBound(temp)
ReDim ret(1 To SampleSize, 1 To 1)
For i = 1 To SampleSize
j = i + Int(Rnd * (hiP1 - i))
ret(i, 1) = temp(j) + lo: temp(j) = temp(i)
Next i
HGSample2a = ret

End Function

Also, I'm not seeing what reason the samplesize input is needed in either function. If I copy the function down a column, specifying sample size doesn't seem to matter - only the number of cells containing the HGSample2a function equation seems to matter (along with the lo and hi bounds).

Any thought on what I'm missing here, or am doing wrong?

Also, with the HGSample function, F-9 doesn't seem to generate a new set of values, but if I change any value in the data range it does...this doen't make sense to me. I was expecting that F-9 would generate a new set of random values from the same data range values. Same question...any thoughts on what I am missing here, or am doing wrong?

Dale

5. Hi Dale,

On 2002-04-14 06:01, dpolson wrote:
I set up the HGSample & HGSample2a functions, debugged, and compiled.

When I use the HGSample function =HGSample(data,samplesize) I get duplication of values, which was expected.

----------------
JP: Expected? Why? How so? I cannot replicate this duplication. No matter how many times I sample this, my returned set has unique values.
----------------

But when I use the HGSample2a function =HGSample2a(lo,hi,samplesize) I still get duplication.

-------------
JP: I cannot replicate this, either.
-------------

I've rechecked the code and am pretty sure I got it plugged in right -

Here is the HGSample code I put in the function pasted right from the edit screen:

-------------
JP: You changed the comments Dave would like retained. Not an issue, but why?
-------------

---> SNIP functions previously posted <---

Also, I'm not seeing what reason the samplesize input is needed in either function. If I copy the function down a column, specifying sample size doesn't seem to matter - only the number of cells containing the HGSample2a function equation seems to matter (along with the lo and hi bounds).

------------
JP: This might be the problem. You have to group the results in a worksheet array, which I do not think you have done (I could be wrong here).
------------

Any thought on what I'm missing here, or am doing wrong?

---------------------
JP: How are you entering this? Suppose you have the following:
1. In A1:A150 enter =ROW() for each cell.
2. Highlight B1:B10, with B1 the active cell
3. In the formula box type =HGSAMPLE(A1:A50,10) and enter it with Ctrl+Shift+Enter. All 10 highlighted cells should fill up with values from the list.
4. In C1 enter =COUNTIF(\$B\$1:\$B\$10,B1) and copy down the list.

All the values in column C should be 1.
-----------------------

Also, with the HGSample function, F-9 doesn't seem to generate a new set of values, but if I change any value in the data range it does...this doen't make sense to me. I was expecting that F-9 would generate a new set of random values from the same data range values. Same question...any thoughts on what I am missing here, or am doing wrong?

------------------
JP: Uncomment (delete the single quote ') the Application.Volatile line.
-------------------

Dale
--------------------
JP: These unique random numbers/sampling without replacement seem to be causing a bit of confusion in a number of threads these days. I cannot replicate these duplications in functions/procedures which were designed to return a set of unique elements.

What am I missing? Could it be that the duplicates are from a resampled run, meaning that you have 100 data points and take 10, then you want another 10 from the remaining 90, and so on? If that is the case, then I understand how you are getting duplication.
---------------------

Regards,
Jay

[ This Message was edited by: Jay Petrulis on 2002-04-14 08:25 ]

6. Thanks, Jay. You've helped me recognize some gaps in my understanding. I've inserted some comments below following yours. Hopefully others can learn to avoid the mistakes I made not grouping the results of these functions as worksheet arrays. Obviously I've got some more learning to do.

Hi Dale,

On 2002-04-14 06:01, dpolson wrote:
I set up the HGSample & HGSample2a functions, debugged, and compiled.

When I use the HGSample function =HGSample(data,samplesize) I get duplication of values, which was expected.

----------------
JP: Expected? Why? How so? I cannot replicate this duplication. No matter how many times I sample this, my returned set has unique values.

Dale: I think I incorrectly understood the HGSample function to be sampling with replacement and the HGSample2a to be sampling without replacement. I get the duplication when I treat the HGSample and HGSample2a like a regular function (instead of an array). My mistake.
----------------

But when I use the HGSample2a function =HGSample2a(lo,hi,samplesize) I still get duplication.

-------------
JP: I cannot replicate this, either.

Dale: I think I'm not using the arrays correctly (see further comments below).
-------------

I've rechecked the code and am pretty sure I got it plugged in right -

Here is the HGSample code I put in the function pasted right from the edit screen:

-------------
JP: You changed the comments Dave would like retained. Not an issue, but why?

Dale: I went to the Google site to read the context for the HGSample and HGSample2a code you posted, and I used the HGSample code from message 5 from David Braden dated 1999/04/01 that was posted in the Google thread, and also the Google code for HGSample2a code from Dave's response to Anne on 2000/02/17. I went back and looked again and compared it to what you sent me and it looks like the comments are a bit different. I'll change the comments in the HGSample and HGSample2a function code I input into my Excel to the most recent ones in the code that you sent me. It also looks like one line of the HGSample code you sent me has changed sowewhat from what I found at Google in message 5 (1999/04/01): was... SampleSize = Application.Min(Data.Rows.Count, SampleSize) but looks to now be... If SampleSize > UBound(temp) Then SampleSize = UBound(temp) I'll make those changes also to be consistent with what you posted.
-------------

---> SNIP functions previously posted <---

Also, I'm not seeing what reason the samplesize input is needed in either function. If I copy the function down a column, specifying sample size doesn't seem to matter - only the number of cells containing the HGSample2a function equation seems to matter (along with the lo and hi bounds).

------------
JP: This might be the problem. You have to group the results in a worksheet array, which I do not think you have done (I could be wrong here).
------------

Any thought on what I'm missing here, or am doing wrong?

---------------------
JP: How are you entering this? Suppose you have the following:
1. In A1:A150 enter =ROW() for each cell.
2. Highlight B1:B10, with B1 the active cell
3. In the formula box type =HGSAMPLE(A1:A50,10) and enter it with Ctrl+Shift+Enter. All 10 highlighted cells should fill up with values from the list.
4. In C1 enter =COUNTIF(\$B\$1:\$B\$10,B1) and copy down the list.
All the values in column C should be 1.

Dale: I followed steps 1-4 and got what you described. In doing so, I've recognized that I wasn't correctly using the HGSample and HGSample2a functions or control-shift-enter. Thanks for pointing that out. I haven't worked enough with arrays (only a bit with conditional sums), so the problem was my ingnorance. Thanks for the help.
-----------------------

Also, with the HGSample function, F-9 doesn't seem to generate a new set of values, but if I change any value in the data range it does...this doen't make sense to me. I was expecting that F-9 would generate a new set of random values from the same data range values. Same question...any thoughts on what I am missing here, or am doing wrong?

------------------
JP: Uncomment (delete the single quote ') the Application.Volatile line.

Dale: I understand, did the uncomment, and that took care of it. Thanks.
-------------------

Dale

--------------------
JP: These unique random numbers/sampling without replacement seem to be causing a bit of confusion in a number of threads these days. I cannot replicate these duplications in functions/procedures which were designed to return a set of unique elements.

What am I missing? Could it be that the duplicates are from a resampled run, meaning that you have 100 data points and take 10, then you want another 10 from the remaining 90, and so on? If that is the case, then I understand how you are getting duplication.
---------------------

Regards,
Jay

[ This Message was edited by: Jay Petrulis on 2002-04-14 08:25 ]

7. Hi Dale,

No need to do anything regarding the comments. Dave posted these in a few posts, and they changed.

The real substantive comments were intact, so that's the important part for others to use.

Glad everything is working for you. Please post back if you have any other troubles.

In addition, when you get really, really, really good with arrays on the worksheet, you can begin to get into the realm of Alan Beban and Harlan Grove. Search google for their posts on MMULT and Alan's array functions (he has a download from his website which is awesome for this). I don't pretend to understand the details of what they argue about.

They are *the* array giants, IMO.

Bye,
Jay

## User Tag List

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•