Extracting unique data from Large Datasets - Excel Magic Trick 698

Chris Pucknell

New Member
Joined
Oct 31, 2014
Messages
11
Hey

I couldn't find an answer for this so thought I'd sign up!

As you can guess, i've been trying out the method in this video: https://www.youtube.com/watch?v=IhuURsu0jdI

A series of formulas for extracting unique data, compiling it into a list, and then using this list to make a dynamicly updating DV List.

It works just great. The trouble is, when it is applied to large data sets, it seriously ties up the cpu.

My question is how can I reduce or eliminate this problem? What elements can be altered or removed to improve performance?

The below formula extracts a list of unique names in sequence. From there it's fairly simple to make the DV List.

=IF(ROWS(B$13:B13)>$C$13,"",INDEX(User_Name_Extract,
SMALL(IF(FREQUENCY(IF(User_Name_Extract<>"",
MATCH(User_Name_Extract,User_Name_Extract&"",0)),
ROW(User_Name_Extract)-ROW('Data Source'!$A$2)+1),ROW(User_Name_Extract)-ROW('Data Source'!$A$2)+1),ROWS(B$13:B13))))

$C$13 contains a unique count formula.

B$13:B13 is the position on the list (so if there were only 10 unique names you wanted to extract, you could get away with dragging it to B$13:B23).

"User_Name_Extract" is the range, which in my case is 32000 rows xD

You can probably get a clearer understanding from the video than I can give. Is there any solution when applying this to such large datasets? And what particularly is causing the trouble?

Any help would be grand :)
 

Excel Facts

Excel Joke
Why can't spreadsheets drive cars? They crash too often!
Hi.

I'm afraid there's not much better a set-up using worksheet formulas than the one you're currently using: the data range is simply so large as to make it inevitably resource-heavy.

Regards
 
Upvote 0
Hi.

I'm afraid there's not much better a set-up using worksheet formulas than the one you're currently using: the data range is simply so large as to make it inevitably resource-heavy.

Regards

What about if I limited the range with a formula which drew on specified data? I'm trying to extract usernames from an access log, and they have an associated date... is that just going to make things worse? xD
 
Upvote 0
Just a unique list from a range of values, in the order that they appear?

How does this compare?


Excel 2013
ABC
1ValuesUnique
2aa
3ab
4ac
5bd
6ce
7c
8c
9d
10d
11d
12e
13e
Sheet1
Cell Formulas
RangeFormula
C2{=IFERROR(INDEX($A$2:$A$13,MATCH(0,COUNTIF($C$1:$C1,$A$2:$A$13),0)),"")}
Press CTRL+SHIFT+ENTER to enter array formulas.
 
Upvote 0
@Jon von der Heyden

A construction with IFERROR is likely to be much worse, if anything.

Generally, such a set-up is extremely inefficient, particularly if the dataset in question is a dynamic, potentially expanding one.

The point is that, if you have a set-up such as:

=IFERROR([some_large_array_formula],"")

which is intended to be copied down a sufficient number of rows so as to encapsulate all desired returns, then you will be faced with two choices.

Firstly, you can perform some calculation to determine precisely how many such returns you will have at any given time, and then drag this formula down that number of rows. Obviously this is not ideal, even less so if you have, as I said, a dynamically-changing dataset.

Secondly, we can copy the formulas down to an arbitrarily large number of rows such that we are guaranteed to cover all possible returns even should our dataset expand at some future time, and so need not worry ourselves about it again.

Obviously this second method is preferable in practice. The problem with the IFERROR construction (even worse if you're on 2003 or earlier and have to use a repeated IF(ISERROR clause) is that, in rows to which the formula is copied beyond that which is effectively necessary, there is nothing to prevent the large, resource-heavy array formula calculating needlessly.

The point is that, in the above construction, even in rows beyond that containing our last expected return Excel still has to spend all the resource on calculating the array formula part before it can then decide for itself whether it is in fact an error or not.

Far, far better than this IFERROR approach is, as in the link the OP provides, to use a single "helper" cell to first determine the number of rows which we expect to have returns, and then reference this in the formula instead.

Granted, we still have the question of how far to copy the formula down. And even though we have a calculation in place to determine the required number of rows, we still don't want to have to manually re-adjust the number of cells containing formulas every time we want to update results. This should be a one-off initial job.

I certainly wouldn't recommend copying down to the very end of the spreadsheet. However, provided a suitably large upper bound can be chosen, then it should not matter a great deal in terms of performance if we end up with even several thousand extraneous cells containing formulas. The reason being, and the massive difference between this set-up and the IFERROR approach, that here the initial clause:

=IF(ROWS(B$13:B13)>$C$13,"",

means that, in rows beyond the expected number of returns, the IF clause returns TRUE and so a blank is returned. And the nice part about the IF function is that, if the clause passed to it is TRUE, then the FALSE part - here a large, resource-heavy array formula - does not even get considered for calculation.

This is not at all the case with the IFERROR version, which carries on churning away, oblivious to the fact that its calculations are needless and a burden on resource.

Regards
 
Upvote 0
I wasn't especially advocating use of IFERROR.

I wish Excel would offer a unique function, and emulate behaviour of Google Spreadsheets. I would probably opt form a UDF. I especially like Colins addin (free to download and all code is exposed). See his LISTUNIQUE here, using his self-extending UDF addina and specifically rxlRESIZE function. I wonder how it compares. It certainly solves the problem of not knowing how many cells to span to produce the results.
 
Upvote 0
I wasn't especially advocating use of IFERROR.
In fact I'd happily leave the errors in the results. They will appear at the end of the list anyway - making it easy to construct a DNR for the list, whereby errors are ignored. And this would be a single formula and hence not add much in terms of overhead to the model, and it is a range that would perfectly suitable to use in DV.

E.g:

Excel 2013
ABCDEF
1ValuesUniqueFormula for DNR5
2aa
3ab
4ac
5bd
6ce
7c#N/A
8c#N/A
9d
10d
11d
12e
13e
Sheet1
Cell Formulas
RangeFormula
F1=COUNTA(INDEX(C:C,2,1):INDEX(C:C,MATCH(REPT("z",255),C:C,1),1))
C2{=INDEX($A$2:$A$13,MATCH(0,COUNTIF($C$1:$C1,$A$2:$A$13),0))}
Press CTRL+SHIFT+ENTER to enter array formulas.
 
Upvote 0
I wish Excel would offer a unique function, and emulate behaviour of Google Spreadsheets. I would probably opt form a UDF. I especially like Colins addin (free to download and all code is exposed). See his LISTUNIQUE here, using his self-extending UDF addina and specifically rxlRESIZE function. I wonder how it compares. It certainly solves the problem of not knowing how many cells to span to produce the results.

Agreed. That would be a very useful feature to have in Excel. Not that I know much about Google Sheets, but it seems that, not only in this case but in terms of other functions as well, that application has the upper hand over Excel. Perhaps Microsoft should start worrying!

In fact I'd happily leave the errors in the results. They will appear at the end of the list anyway - making it easy to construct a DNR for the list, whereby errors are ignored. And this would be a single formula and hence not add much in terms of overhead to the model, and it is a range that would perfectly suitable to use in DV.

But those formulas will still have to calculate in that scenario. The point is that there's no IF clause there to tell formulas in rows beyond necessary to "not bother" calculating the resource-heavy array formula part. As a result, that's potentially an awful lot more calculation than necessary, especially if the dataset is large (as here) and, as I said, we cannot be guaranteed of copying the formula to just the precise number of cells as required each time.

Regards
 
Upvote 0
Perhaps Microsoft should start worrying!
No not at all, not yet at least. Google exposes a few nice functions but otherwise it is seriously lacking in tools/functionality, and as far as capacity and performance is concerned Excel has a very big lead.
 
Upvote 0
we cannot be guaranteed of copying the formula to just the precise number of cells as required each time.
I would opt for Colins self-extending UDF if it were my model. But as it's not I would recommend a flag. One cell to count unique values and then a comparison flag to check that the number of results yielded by the unique formula agrees with the unique count. I'm a big fan of having these sorts of flags revealing status on the landing page, so that if it is off it is blooming obvious.

I wouldn't over-complicate the formula to produce unique values - over 32000 iterations is will just slow it down.

Edit: By the way - I'm not arguing about the IF to check if the formula evaluation is needed. What I was proposing was a different way to pull back the next unique item from the list, i.e.: =INDEX($A$2:$A$13,MATCH(0,COUNTIF($C$1:$C1,$A$2:$A$13),0)). You can still nest this within the IF().
 
Last edited:
Upvote 0

Forum statistics

Threads
1,214,870
Messages
6,122,021
Members
449,060
Latest member
LinusJE

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top