macro for stratified random sampling?

notwhoiam

New Member
Joined
Nov 29, 2004
Messages
30
Hi,

Does anyone know how to get Excel to pull a stratified random sample?

For example, let's say my dataset contains records/measurements from sixteen groups of buffalo, representing nine entirely different populations across the US. The data is organised into 4 columns ("Population", "Family size", "Adult weight", & "Density") and 16 rows.
Now, what I'd want to do is to generate a subsample by randomly pulling ONE record per population (=9). Then I'd want to repeat that 99 times to get 100 randomly chosen subsamples, each containing just 9 (not 16!) records.

Now I suppose I could use labels to identify the various populations but I worry that it'd get really complicated once I apply the macro to my real (& much more complex) dataset, especially since I'd be wanting each subsample to contain data from ALL the columns (i.e. each of the 9 records per subsample would still include "Fam Size", Weight" & "Density" estimates).

Understand what I'm needing to do? Unfortunately, the usual answers (simple random sampling between X & Y, or using a random number generator) won't work b/c I need everything to be stratified by population. I'd also prefer to avoid having to run random sampling functions/macros for EACH population individually.

I know this can be done via S+ & R, but I'd really rather stick to excel if at all possible.
Any suggestions you can give will be enthusiastically & gratefully appreciated!! :pray:
Thanks!!
 
:LOL: IT WORKS! IT WORKS! IT WORKS!

I'm SO excited!! You've helped me to no end PA HS Teacher and I worship the ground you type on :wink:

Apparently the macro needed those formulas to work correctly before it'd run successfully (tho' I still don't understand why) - meaning that I now have two versions of the spreadsheet to play around with - yay!

The only (temporary) glitch was that my comp didn't like the thought of pulling 10,000 subsamples & so kept freezing mid-macro. Yet it worked like a dream when I only asked it to run three iterations... meaning that your macro works & my comp lacks the serious power &/or space needed to run a complex code that many times. But that's not such a big deal since I can always run it 10x, asking for 1000 subsamples each time - or even 50x asking for 200 each. Sure it's more time consuming, but NO WHERE NEAR as slow as doing this all by hand (which I had nearly resigned myself to do - until you came along).

I've also been playing around with the randomisation step since (apparently) excel's rnd() function isn't truly random - especially when you call it over & over in the course of a single macro, or when you break the analysis up into a series of separate steps. Haven't found a sure-fire way to ensure perfect randomness yet, but am scanning online forums etc etc for ideas. Most likely, I'll just end up using:
Code:
'Store Data Sets
Randomize 1234   ' or Randomize Timer?
For i = 1 To iNumDataSets
For j = 1 To iNumPopulations
since this keeps the 'Randomize' code outside of the For-To loop and so (hopefully) ensures the computer only re-seeds once per push of the macro button.
However I definitely don't expect you to attempt this extra step - I only mention it for the benefit of anyone who might read this thread later on...

Other than fiddling with the randomisation, I suppose the only piece I'm still a littttttle confused about is the weighting - and, apart from saving me some time, it isn't even a vital piece for me to understand or solve. The file you sent me helped clear away some of the cobwebs but I'm still not entirely sure how your 'Weight factor' differs from the three columns of 'Value's:
RandomBuffalowithvarand.xls
EFGHIJ
16***IwasthinkingofusinganarrayofWeightingFactorsinthecodesuchas
17WeightFactorValueValueValue
1811.052.452.45
1920.873.453.45
2030.734.214.21
2142.72.562.56
2251.343.13.1
23
24*Youcouldsavealotoftimewiththeweightingfactors,bydoingitwithcode
25*Factorsfromatablesuchasthisone,couldeasilybeputinto
26anarray.Forexample.WhenW=4intheloopingofthecode
27youcouldhaveaWeightFactorArray,where
28WeightFactorArray(w,3)wouldbe=2.56
29WeightFactorArray(w,1)wouldbe2.7
Population Table


As I've said, it isn't actually vital for me to sort out the weighting issue but, if you've got a bit more time on your hands, could you pls explain what the rows/columns in your workbook refer to & how the various cells differ from each other?
Note: I'm planning to weight each observation by a specific value, w (based on 20+ measures of individual observation quality) and this value changes depending on which of the four weighting schemes is used... ranging from 'scheme1' (in which all quality scores are identical, so w=1.00 for all observations) to 'scheme4' (in which quality scores vary considerably, so w might be anything from 0.2 to 1.00). Thus to weight variable K, as recorded for observation Z, I'd multiply it by the appropriate value of w, as calculated via scheme S. And then I'd repeat that process for all the other observations (& their w's) in the subsample. Do you follow?
So -if possible- what I'd still like to find out is how your 'weight factors' work and how I might use them to incorporate my weighting schemes into your fabulous resampling macro....


But even if nothing more is done on this macro or to add to this forum thread, I'll still be EXTREMELY GRATEFUL b/c I can finally resample my data the way I need to and (knock on wood) can possibly even finish ALL of my analyses before Xmas. That's AMAZING and you've earned yourself a mention in the acknowledgements of my dissertation. Honestly.

:pray: Thanks again for EVERYTHING.
 
Upvote 0

Excel Facts

Round to nearest half hour?
Use =MROUND(A2,"0:30") to round to nearest half hour. Use =CEILING(A2,"0:30") to round to next half hour.
I've got to go teach now, but the table is non functional right now. It was to illustrate how easy it would be to incorporate weighting factors into your calculations.

I assumed you would know the weighting factors before you ran the code:
- In which case you could include in the code a section that would pull the values from the chart into an array such as:
WeightFactorArr(w, a, b, c, ...)
where:
w is the weighting scheme (ranging from 1 to iNumWeightFactors)
a is the first weighting factor in scheme w
b is the second weighting factor in scheme w
c is the third weithing factor in scheme w etc.

You could then weight an observation by using say:

Taking the existing line which chooses a random observation of population j:
DataSetsArr(i, j, k) = DataPoolArr(j, iRandom, k)

and modify it to something like:
DataSetsArr(i, j, 4) = DataPoolArr(j, iRandom, 4) * WeightFactorArray(w,4)
DataSetsArr(i, j, 5) = DataPoolArr(j, iRandom, 5) * WeightFactorArray(w,5)
etc.

This would completely automate your weighting process too.

If you do not know the weighting factors ahead of time:
You can perform some calculations on DataPoolArr to calculate your weighting factors, right after DatPoolArr is filled.

As for the computer power, I reccommend making the w loop do 1 to 1,
then run code again with w from 2 to 2, 3 to 3 etc.
or alternatively, just make iNumWeightFactors and iNumVariables = 1.
play around with it and see how many subsamples you can get before the macro freezes. (btw, to pause the macro at any time, hit escape, you can mouse over the control variables, w,v,i,j,k to see how far along the macro is)

Look forward to hearing about your progress, I have to run and teach class.
 
Upvote 0
I hope your classes went well. Your students are certainly lucky to have a patient, helpful teacher like you! (what do you teach?)

Thanks for the clarification about your weighting table but I still don't understand what the 'Values' columns are supposed to be; i.e. how does the 1st differ from the 2nd & the 3rd?? Where will those values come from? And, either way, are they equivalent to 'a', 'b', & 'c' in your most recent post?
Basically, I don't understand how the a/b/c columns are applied. If the 'Weight Factor' (w) column represents the four weighting schemes I plan to use, then do 'Values'/a/b/c represent the individual weights of each observation? If that's the case, then wouldn't I need separate columns for all ~1105 observations since I need each observation to get its OWN quality-based weight? If so, then I wouldn't think the macro would save me all that much time after all... or have I completely misunderstood the process??

Nonetheless, this is a FANTASTIC code and I can't tell you how grateful I am for ALL of your help (although I hope my gushing praise conveys just a fraction of my enthusiasm & appreciation for your work). I definitely couldn't have done any of this without you!

:pray:
 
Upvote 0
We had a "momenutous" day. I'm a high school physics teacher. Today we did the old Egg into a sheet routine. You have a student throw an egg as hard as they can at a sheet. (it does not break) I then "accidentally" drop one on the floor it breaks (& brakes). It has to do with momentum and the amount of time it takes the egg to stop. The egg is analogous to a person's head, and the sheet is analogous to an air bag.

There are multiple columns only to provide flexibility. They don't actually mean anything yet. I don't really know what you mean by weighting scheme. Is it the same number for each field. I left it open.

WeightFactorArr(w, a, b, c, ...)
where:
w is the weighting scheme (ranging from 1 to iNumWeightFactors)
a is the first weighting factor in scheme w
b is the second weighting factor in scheme w
c is the third weithing factor in scheme w etc.
Yes, a,b,c are equivalent to the three columns.

For example:

Say you want to to manipulate the data in the following way:

Scheme 1:
Multiply Adult Size Field by 1.2
Multiply Density by .95
Subtract 1 from Family Size

Put the following values in the chart:
Random Buffalo with var and weight 12-5-04.xls
EFGH
17WeightFactorValueValueValue
1811.20.95-1
1920.873.453.45
2030.734.214.21
Population Table






Then add the following Code to the existing code
Code:
' Pulls Weighting Factors From the Population Table Worksheet to an Array
ReDim WeightFactorArray(1 to iNumWeightFactors,1 to 3)
For w = 1 to iNumWeightFactors
   WeightFactorArray(w,1) = Worksheets("Population Table").Cells(17 + w,6)
   WeightFactorArray(w,2) = Worksheets("Population Table").Cells(17 + w,7)
   WeightFactorArray(w,3) = Worksheets("Population Table").Cells(17 + w,8)
Next w

Then take the existing line:
Code:
DataSetsArr(i, j, k) = DataPoolArr(j, iRandom, k)
Which chooses a random observation of population j:
and modify it to something like:

Code:
For k = 1 to iNumFields
' Perform the desired Data Manipulations/Weightings
If k<> 3 And k<> 4 And k<> 6 Then
    DataSetsArr(i, j, 5) = DataPoolArr(j, iRandom, 6) * WeightFactorArray(w,1) ' Adult Size Manipulation
    DataSetsArr(i, j, 6) = DataPoolArr(j, iRandom, 6) * WeightFactorArray(w,2) ' Density Manipulation
    DataSetsArr(i, j, 4) = DataPoolArr(j, iRandom, 4) + WeightFactorArray(w,3) ' Family Size Manipulation
Next k

Now this assumes you will be randomly choosing populations as you apply these weighting factors. That is, rather than pick one DataSet of say 10,000 subsamples and manipulate it (iNumWeightingFactors * iNumVariables) times. This addition would choose a completely new data set for each manipulation. I believe if the samples are truly random that shouldn't really matter.

Otherwise, we could pretty easily change the code so that it picks one DataSetsArr() and put only the manipulations inside of the w and v loops.
I have also heard that pseudorandom number generators can break down if called too often, quickly etc, but I leave that to the real experts and professionals around here. Excel is a hobby of mine and VBA is a recently aquired addiction.

I hope I have conveyed useful information in a meaningful way.
 
Upvote 0
For the sanctity of your data, or at least because its cool, you could use real random numbers. I've been looking through several sites that provide real numbers in many forms, though many like to give them in bytes.

Here are two:
http://www.fourmilab.ch/hotbits/
http://www.random.org/nform.html

The first one appears to be a site associated with fourmilab
- they use the timing between geiger counts of an element undergoing radioactive decay, a natural and inherently spontaneous process. Unfortunately you can only get 2k of bytes at a time.

The second site uses "atmopheric noise" to get real random integers, you can get 10,000 at a time. ( it may be an archive, I'm not sure, but in any case you don't want to reuse numbers) You would need to download more random numbers every time you run a new simulation.

I'm sure there are people here familiar with the ins and out of random numbers, when they are necessary and when pseudorandom numbers are appropriate. As your work seems to require mass quantities of truly random numbers, you may consider downloading some random numbers, and writing an array that pulls in these random numbers.
 
Upvote 0
Sorry - I have been and still am quite sick. Flu.
I'll reply in a day or two. Thanks though! You've been amazing!! :pray:
 
Upvote 0
I'm back & feeling better than ever. I've finally managed to get my ENTIRE code to work and am VERY excited (& relieved). In terms of the random number generator etc etc, I've decided to use an excel add-in for the Mersenne Twister algorithm which seems to be easier to impliment & more efficient than downloading bytes (which I attempted, struggled with & abandoned yesterday). Anyhoo, everthing works WONDERFULLY and I'm so grateful for all of your help along the way PA HS Teacher.
Thank you!!!! Thank you!!!! Thank you!!!! :pray:

Have a great holiday break & good luck next year!
F
 
Upvote 0

Forum statistics

Threads
1,214,979
Messages
6,122,550
Members
449,088
Latest member
davidcom

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top