Random Number based on Normal Distribution

daveWP

New Member
Joined
Sep 29, 2009
Messages
28
I'm trying to reverse engineer tabular data points in order to arrange data that converges to a certain point.

Lets say there are 150,000 participants in a survey that I have to plug values for. 100,000 are coded as 5. 50,000 are coded as 3.

I found this thread: http://www.mrexcel.com/forum/showthread.php?t=425799, but couldn't make sense of it.

I'm trying to generate random numbers for those 100,000 participants between 1-5, so that the average would converge towards 5, and generate numbers for the 50,000 that converge toward 3.

Anyone have any idea how to use 3 and 5 as seats in a rand() function that include normal distribution consideration?
 

Excel Facts

Show numbers in thousands?
Use a custom number format of #,##0,K. Each comma after the final 0 will divide the displayed number by another thousand
An Introduction to Excel's Normal Distribution Functions
http://www.exceluser.com/explore/statsnormal.htm

#VALUE!
Excel 2003

The formulas are:
Cell Formulas
RangeFormula
F1=STDEVP(INDIRECT("F5:F"&D2+4))
F2=AVERAGE(INDIRECT("F5:F"&D2+4))
F5=VLOOKUP(RAND(),INDIRECT("C$5:D$"&D$2+4),2)
A3=SUM(INDIRECT("A5:A"&D2+4))
A5=NORMDIST(ROW()-5+D$1,B$2,B$1,0)
B3=SUM(INDIRECT("B5:B"&D2+4))
B5=A5/A$3
E3=SUM(INDIRECT("E5:E"&D2+4))
C5=IF(ROW()=5,0,SUM(B4:B$5))
D5=ROW()-5+D$1
#VALUE!
Entered with Ctrl+Shift+Enter. If entered correctly, Excel will surround with curly braces {}. Note: Do not try and enter these manually yourself

Note that the array formula in Cells E5:E13 is a special case.
First delete any array formula starting in Cell E5.
Then select Range E5:E13, paste the formula, and CONTROL SHIFT ENTER to get the { }.

In Cell A5 to D5, the formulas are copied down to row 13.

In Cell F5, the formula is copied down so that 400 rows are filled.

For your example with 100,000 participants instead of 400 you will need to either use Excel 2007 with more rows than Excel 2003, or make other arrangements.

The sheet will recalculate with any Cell change because volatile FUNCTIONS are used.

Directions for plotting two charts:

Create two named ranges Insert Name Define:
Norm1 =INDIRECT("B$5:B$"&$D$2+4)Norm2 =INDIRECT("E$5:E$"&$F$2+4)
Norm2 =INDIRECT("E$5:E$"&$D$2+4)
Also set the Range for the Category X Labels if it is not a set of integers starting with 1. In that case set the Range for the column that starts in Cell D5.

Two charts can be created with plot ranges of:
=NormDistSimulate.xls!Norm1
=NormDistSimulate.xls!Norm2
Where 'NormDistSimulate' is the name of the Excel file.

The first chart plots the PDF and the second plots the randomly generated data.
 
Upvote 0
Here is another version where the INDIRECT Function has been eliminated and replaced with INDEX.

Also the Cell Array Formula in Column E starting in Cell E5 has been replaced with a formula that does not use the FREQUENCY Function.

Excel Workbook
ABCDEF
1STDin2Start1STDout1.7159181
2AVGin5End9AVGout4.825
30.97701581Select40Selected40
4NormDistNormalizeNormSumBinCountDraw
50.02699550.02763060106
60.06475880.06628220.0276306234
70.12098540.12383150.0939128362
80.17603270.18017380.21774434103
90.19947110.20416370.3979182576
100.17603270.18017380.6020818624
110.12098540.12383150.7822557774
120.06475880.06628220.9060872815
130.02699550.02763060.9723694913
Sheet1

Cell Formulas:
Excel Workbook
ABCDEF
1STDin2Start1STDout1.7159181
2AVGin5End9AVGout4.825
30.97701581Select40Selected40
4NormDistNormalizeNormSumBinCountDraw
Sheet1
Cell Formulas
RangeFormula
A3=SUM(A5:INDEX(A:A,D2+4))
B3=SUM(B5:INDEX(B:B,D2+4))
F1=STDEVP(F5:INDEX(F:F,D3+4))
F2=AVERAGE(F5:INDEX(F:F,D3+4))
F3=COUNT(F5:INDEX(F:F,D3+4))

Cell Formulas in Row 5:
Excel Workbook
ABCDEF
50.02699550.02763060106
Sheet1
Excel 2003
Cell Formulas
RangeFormula
A5=NORMDIST(ROW()-5+D$1,B$2,B$1,0)
B5=A5/A$3
C5=IF(ROW()=5,0,SUM(B4:B$5))
D5=ROW()-5+D$1
E5=COUNTIF(F5:INDEX(F:F,D$3+4),D5)
F5=VLOOKUP(RAND(),C$5:INDEX(D:D,D$2+4),2)

The formulas in Cell A5 to E5 are copied and pasted for as many rows as are needed based on the Start and End in Cell D1 and D2.
The formula in Cell F5 is copied and pasted for as many rows as are needed based the Select number in D3.

NAMED FORMULAS FOR RANGES FOR PLOTS:
Norm1 =$B$5:INDEX($B:$B,$D$2+4)
Norm2 =$E$5:INDEX($E:$E,$D$2+4)

TESTING For:
STDin = 1
AVGin = 5
Select = 1000
where 0.9999970 of the bell curve is modeled
The calculation was done 1000 times with these resulting averages of the 1000 STDout and AVGout:
STDout = 0.981297785
AVGout = 4.99860900

NOTE:
In the previous post the cell formulas in Cell F1 and F2 should have had "D3" not "D2".

ILLUSTRATIONS BELOW:

This below shows the PDF charted in the upper plot, and the Resulting Count in Column E charted in the lower plot, with 40 Selected.



This below shows that the plot of the Resulting Count in Column E more closely resembles the PDF as the number Selected goes up (500 here).



Sometimes a #N/A Error results in Column E as below after the calculation repeats hundreds of times, possibly from the RAND Function.
If this occurs it may be necessary to exit Excel and restart the program.

 
Last edited:
Upvote 0
CORRECTION<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:eek:ffice:eek:ffice" /><o:p></o:p>
<o:p></o:p>
The Cell Formula in Cell E5 in the previous post was not correct.

Incorrect:
<o:p></o:p>
<o:p>
Cell Formulas
RangeFormula
E5=COUNTIF(F5:INDEX(F:F,D$3+4),D5)

</o:p>
<o:p>Corrected:</o:p>
<o:p></o:p>
<o:p><o:p>
Cell Formulas
RangeFormula
E5=COUNTIF(F$5:INDEX(F:F,D$3+4),D5)


This caused the numbers in column E to be undercounted.

The Start, End, and Select numbers should be integers.
Start should be at least 1 for the formulas posted to work.
If Start = 0 then the End number will not occur.</o:p>
</o:p>
 
Upvote 0
This question has come up several times.<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:eek:ffice:eek:ffice" /><o:p></o:p>
This is a more complete solution than previous post.

Adjustments have been made so that:<o:p></o:p>
1) Start (and End) can be negative<o:p></o:p>
2) Bin width can be any number (previously limited to 1 –- depends now on Start, End, Bins)<o:p></o:p>
3) Input Cells have green background and blue font<o:p></o:p>
4) Conditional formatting Error color for Cell B3 and F3<o:p></o:p>
4a) For Cell B3 if Cell B3 contents<> Sum of Column E or F<o:p></o:p>
4b) For Cell F3 if SUM Column B<> 1<o:p></o:p>
<o:p></o:p>
Excel Workbook
ABCDEF
1STDin3Start-5STDout2.3343562
2AVGin0End5AVGout-0.9375
3Select20Bins9NormSum0.7528544
4NormDistNormalizeCumulativeBinCountDraw
50.0331590.04404440-51-5
60.0608830.08086960.0440444-3.7530
70.09397060.12481910.124914-2.541.25
80.12192410.16194910.2497331-1.2525
90.13298080.17663540.4116823050
100.12192410.16194910.58831771.254-2.5
110.09397060.12481910.75026692.50-3.75
120.0608830.08086960.8750863.7501.25
130.0331590.04404440.9559556510
141.25
15-3.75
16-2.5
17-3.75
18-2.5
191.25
20-1.25
21-1.25
22-2.5
230
240
...
Excel 2003


<o:p></o:p>
Cell Formulas:<o:p></o:p>
<o:p></o:p>
Excel Workbook
ABCDEF
1STDin3Start-5STDout2.3343562
2AVGin0End5AVGout-0.9375
3Select20Bins9NormSum0.7528544
4NormDistNormalizeCumulativeBinCountDraw
...


Cell Formulas
RangeFormula
F1=STDEVP(F5:INDEX(F:F,B3+4))
F2=AVERAGE(F5:INDEX(F:F,B3+4))
F3=SUM(A5:INDEX(A:A,D3+4))

<o:p></o:p>
Cell A5:E5 copied down so the # Rows filled = Bins in Cell D3<o:p></o:p>
Cell F5 copied down so the # Rows filled = Select in Cell B3<o:p></o:p>
Cell Formulas to be copied down the column:<o:p></o:p>
<o:p></o:p>
Excel Workbook
ABCDEF
50.0331590.04404440-51-5
...


Cell Formulas
RangeFormula
A5=NORMDIST(D5,B$2,B$1,0)
B5=A5/F$3
C5=SUM(B4:B$5)*(ROW()>5)
D5=D$1+(ROW()-5)*(D$2-D$1)/(D$3-1)
E5=COUNTIF(F$5:INDEX(F:F,B$3+4),D5)
F5=VLOOKUP(RAND(),C$5:INDEX(D:D,D$3+4),2)

<o:p></o:p>
<o:p></o:p>
Named Formulas for chart ranges:<o:p></o:p>
<o:p></o:p>
X axis for both charts (Column D):<o:p></o:p>
Norm0 =$D$5:INDEX($D:$D,$D$3+4)<o:p></o:p>
Y axis for chart 1 (Column B):<o:p></o:p>
Norm1 =$B$5:INDEX($B:$B,$D$3+4)<o:p></o:p>
Y axis for chart 2 (Column E):<o:p></o:p>
Norm2 =$E$5:INDEX($E:$E,$D$3+4)<o:p></o:p>
<o:p></o:p>
Conditional Formatting Formulas:<o:p></o:p>
<o:p></o:p>
Cell B3<o:p></o:p>
=OR($B$3<>SUM($E$5:INDEX($E:$E,$D$3+4)),$B$3<>COUNT($F$5:INDEX($F:$F,$B$3+4)))<o:p></o:p>
Cell interior color = 19 (yellow)<o:p></o:p>
Cell font color = 3 (red)<o:p></o:p>
<o:p></o:p>
Cell F3<o:p></o:p>
=SUM($B$5:INDEX($B:$B,$D$3+4))<>1<o:p></o:p>
Cell interior color = 19 (yellow)<o:p></o:p>
Cell font color = 3 (red)<o:p></o:p>
<o:p></o:p>
Otherwise colors are:<o:p></o:p>
Cell interior color = 35 (green)<o:p></o:p>
Cell font color = 5 (blue)

Input PDF in Column B:


Output Count in Column E:<o:p></o:p><o:p></o:p><o:p></o:p>
 
Upvote 0
Hi Jack,

I've tried following your solution and I've the following question:
- I'm trying to get 20 random no.s. These random no.s will be able to comply with a predefined mean and standard deviation values.

- Am I correct to say that the no.s from cells F5:F24 are the random no.s I could use?

- I could not match STDout with STDin and these 2 values are always quite different from each other. Any solution for this?

Sorry for the noob question.
 
Upvote 0

The idea of random means that the Input AVERAGE and STD are unlikely to match the Output exactly.

The more that the tails of the input PDF are cut off then the less the full bell curve is modeled.

The NormSum in Cell F3 is used for the normalization and is not the % of the bell curve that is modeled.
To get the % that is modeled it may have to be integrated, as it depends on the % of the area under the curve, which goes to plus or minus infinity.

To get an idea of how much of the curve is being modeled, plot the Input and Output as previously shown.
Look at how much of the tails are cut off on both sides of the Input curve.

Note that Excel has both STDEVP and STDEV Functions.

Note that the Output will tend to match the Input more closely with an Input AVERAGE and STD that allows more of the bell curve to be modeled, and also as the number selected goes up.

For example, for 5000 selected with 10 bins:
Excel Workbook
ABCDEF
1STDin1.5Start1STDout1.4926332
2AVGin5End10AVGout5.0068
3Select5000# Bins10NormSum0.9987831
4NormDistNormalizeCumulativeBinCountDraw
50.00759730.007606601345
60.0359940.03603780.007606621805
70.109340.10947330.043644435447
80.21296530.21322480.1531177411026
90.26596150.26628560.3663425513145
100.21296530.21322480.6326281610224
110.109340.10947330.845852975827
120.0359940.03603780.955326181815
130.00759730.00760660.9913649343
140.00102820.00102940.99897061074
...


 
Upvote 0

Forum statistics

Threads
1,214,914
Messages
6,122,211
Members
449,074
Latest member
cancansova

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top