# macro for stratified random sampling?

#### notwhoiam

##### New Member
Hi,

Does anyone know how to get Excel to pull a stratified random sample?

For example, let's say my dataset contains records/measurements from sixteen groups of buffalo, representing nine entirely different populations across the US. The data is organised into 4 columns ("Population", "Family size", "Adult weight", & "Density") and 16 rows.
Now, what I'd want to do is to generate a subsample by randomly pulling ONE record per population (=9). Then I'd want to repeat that 99 times to get 100 randomly chosen subsamples, each containing just 9 (not 16!) records.

Now I suppose I could use labels to identify the various populations but I worry that it'd get really complicated once I apply the macro to my real (& much more complex) dataset, especially since I'd be wanting each subsample to contain data from ALL the columns (i.e. each of the 9 records per subsample would still include "Fam Size", Weight" & "Density" estimates).

Understand what I'm needing to do? Unfortunately, the usual answers (simple random sampling between X & Y, or using a random number generator) won't work b/c I need everything to be stratified by population. I'd also prefer to avoid having to run random sampling functions/macros for EACH population individually.

I know this can be done via S+ & R, but I'd really rather stick to excel if at all possible.
Any suggestions you can give will be enthusiastically & gratefully appreciated!! ray:
Thanks!!

### Excel Facts

How to fill five years of quarters?
Type 1Q-2023 in a cell. Grab the fill handle and drag down or right. After 4Q-2023, Excel will jump to 1Q-2024. Dash can be any character.

#### tusharm

##### MrExcel MVP
I assume the Population column identifies which population each of the 16 groups belongs to.

If so, why not sort the data so that all the groups from each population are together. Then, select a random number between x and y for each of the 9 populations. That will give you one sample with one representative from each population.

#### notwhoiam

##### New Member
True, that would work... in theory. But not once I applied it to my real dataset: 75 populations containing 1-15 records/observations each. That's a lot of fussy work, especially since this is'll be only one of several complex analytical/statistical steps I'll need to do.

So what I'm really looking for is some way to sample randomly according to a secondary criterion (="Population") - and to do so via a single macro or command, instead of 75 (example=16) separate steps...
I hope someone's got some ideas b/c I sure don't

Thx

ps: as I confusingly mentioned earlier, I'd like to get data for the entire row (using 'SampleRow'??) instead of for a single column or cell per record...

#### PA HS Teacher

##### Well-known Member
I believe I've come up with a way to create 100 (or more) Random Data Sets as you've described with each Data Set containing one Row of information for each Population. (With no VBA, addins (other than the analysis tool pack), or user defined functions!)

My Assumptions:
- You will have up to 75 Observations from 1 to 15 populations
- In Each Data Set you would like 1 Observation for each of up to 15 populations.

I appologize, I will have to present my table in two posts:

Your Pool of up to 75 records should be entered as follows:
Notice, I added a column called Unique Data ID.
Book1.xls
ABCDE
2
TypeABuffalo_2
TypeABuffalo32543.4
3
TypeABuffalo_1
TypeABuffalo42434.5
4
TypeBBuffalo_1
TypeBBuffalo22625.6
Sheet1

I've named the following Ranges
A2:A76 is Named Unique_Data_ID
B2:B76 is Named Population
A2:E76 is Named DataPool

The Unique ID column Creates a Unique ID Number by Counting the Instance of this population in the data pool and concatenating that with the Population Name. This will be used later by a Match Function.

If recreating this spreadsheet, Copy Row 2 and Paste Down to Row 76

For the cells that pick the desired Random Sample, Please see my next post.

 Whoops! Cell A2 should be:
=B2& "_" &COUNTIF(\$B\$2:\$B2,B2) notice:3 changed to 2.

#### PA HS Teacher

##### Well-known Member
Because the table in my last post extends more than 30 cells, I couldn't fit it all in the same post, even if I hid rows.

This Next Table is Data Set Number 1
Column A Picks at random, one of the Observation Numbers (vertical position within Data Pool) Coresponding to the Population Name in Column B.

The Match function matches (????) in Unique_Data_ID (A\$1:A\$76)
where: ??? is the population Name in column B concatenated with a Random number between 1 and the number of times this population appears in the DataPool.

The other columns use this randomly generated observation number (column A) in an index function to return information from the desired column of DataPool.
Book1.xls
ABCDE
78
DataSet1
79
80Observation#
Population
FamilySize
Density
81
2
TypeABuffalo
4
243
4.5
82
3
TypeBBuffalo
2
262
5.6
83
4
TypeCBuffalo
5
197
2.5
Sheet1

To recreate this sheet, copy Row 81 and Paste down to Row 95.

To generate more Data Sets Copy Rows 78 to 97 and paste down as far as you need.

(Just remember that the data sets change every time excel recalculates)
I recommend that after you've copied down and have the desired data sets, copy all the data sets and paste special as values somewhere else.

#### notwhoiam

##### New Member
ray: That's fantastic 'PA HS Teacher'!
Thank you for your very clear & useful suggestion - I think it'll be a huge help to my work since it means that I won't have to pull out a million random samples by hand.

However, as great as your idea is :wink: , it still contains two possible snags:
- You will have up to 75 Observations from 1 to 15 populations
- In Each Data Set you would like 1 Observation for each of up to 15 populations.
In fact, I have data from 75 populations and then, for each population, I've got 1-15 observations. But, from what I can tell, this shouldn't cause any/many problems provided I'm careful how your formulas are applied to my dataset.

(2) More of an issue for me is the fact that (I think) your plan would require that I copy & paste the 'Dataset' formulae/cells (=your 2nd table) over & over & over. In this example I'd need to do it 100 times (for the 100 random subsamples I requested) but, in actual fact, I need 10,000 subsamples x 8 different combinations of variables x 5 weighting schemes... ouch. :banghead: Don't worry - I know how do do all the rest of those steps Anyway, with so many subsamples, I'd also have to either manually 'stack' the subsamples vertically (i.e. all in the same columns, as in your example) or else I'd have to repeat them horizontally (in the same rows but in neighbouring columns). Either way it'd be really complicated & really slow to set up, especially given the size of my dataset & the number of repetitions I need. That's why I was considering macros (tho' don't ask me how they'd get around the stacking dilemma...).

Anyway, you've been a HUGE help to me & given me a bunch of great ideas. Hopefully I can find a way to automate or simplify the process - but even if I can't, your plan will have already eliminated a TON of extra steps for me.

THANKS!!![/b]

#### tusharm

##### MrExcel MVP
I suppose that if you, notwhoami, are happy, then I should probably leave this discussion alone. So, here are some parting thoughts...

You want 10,000 subsamples x 8 combinations of variables x 5 weighting schemes. That is 400,000 subsamples. An Excel worksheet only has 65,536 rows.

Even if you look at just the 10,000 subsamples for 1 combination of variables and 1 weighting scheme, my understanding is that each subsample will have one observation drawn from each of the 75 populations. That means you will have 750,000 observations in the 10,000 subsamples. Again, way outside XL's limits.

A final note. Not to take anything away from PA HS Teacher's XL-native solution (it's quite creative), but I don't understand your enthusiasm, notwhoami. S/he had 3 MATCH functions to get 3 observations, one from each population. For 75 populations, you would need 75 MATCH functions, one in each of 75 rows. And, that will give you one subsample.

The bottom line is that no matter how you cut it, you have to develop a method to pick a random element from each of the 75 populations. Whether you do it in a loop in a macro or through 75 rows in a worksheet, that fact remains unchanged. For that matter, even if you buy a package that will do this kind of sampling (not that I know of any), it still doesn't change the fact. Someone or something has to make 75 selections, one from each of the 75 populations and someone -- you -- will have to describe how the sampling has to be done.

#### notwhoiam

##### New Member
hehe - in fact I DID realise just how inefficient PA HS Teacher's plan really was, but it was far better than the options I'd come up with - so for that I was extremely grateful.

However I'd DEFINITELY still be interested in hearing other ideas & suggestions, particularly if they could speed-up or automate the process for me.

As for the problem of excel's limited sheet size, I had already resigned myself to using multiple sheets: perhaps having 100 or 1000 subsamples per sheet or workbook (depending on how many columns etc I end up needing), so I wasn't surprised by the fact that I'd still have to do that. The procedure doesn't need to make a beautiful spreadsheet - it just needs to work & to be relatively quick to carry out.

Do you have any better suggestions Tusharm?? Does anyone else?? I could DEFINITELY use your help!!

Thanks - for anything & everything!

##### MrExcel MVP
a question, notwhoiam. Other than satisfying a fondness for large datasets, what problem is all this data actually designed to help you solve?

#### notwhoiam

##### New Member
Ok, well you asked for it.... :wink:

I'm studying wildlife populations but, because of the unusual nature of the data itself (it's non-normal etc etc), I can't run standard statistical tests on it. Instead I'm planning to use a non-parametric resampling method to test the data against all its possible random permutations; the distribution of test statistics (each calculated for one randomly re-paired combo of X & Y variables) will then let me test whether the original/real X-Y relationship could have occured via chance alone or if there's some deeper pattern in the data.

But b/c the data is also non-independent (there's spatio-temporal autocorrelation between all the observations from a single population), I can't just run a standard bootstrap or permutation test. And, b/c the populations don't each contain the same number of observations, I can't just tell excel to randomly sample 1 observation from every block of, say, 10 rows in the dataset.

So, after consulting various eminent statisticians, I came up with a protocol where only ONE randomly-chosen observation is pulled per population per subsample, and where 10000 (a standard #) of those subsamples would then be analysed before their results are pooled into a comparison "chance" distribution.

And you're right Paddy D, there isn't any one computer program that will do ALL of the steps I need done. Knowing that, I'd planned to do this all in three steps: get excel to pull random subsamples for me. get a stats program called BLOSSOM to run the individual analyses, and then get excel to condense all the BLOSSOM results into a single coherent answer... I've already worked out the 2nd & 3rd parts of that process, but still need to find a simple yet efficient way to pull the subsamples in the first place.

Hope that helps explain all my questions & why I still need help sorting this out... Thx!