Extracting unique data from Large Datasets - Excel Magic Trick 698

Chris Pucknell · Oct 31, 2014

Hey

I couldn't find an answer for this so thought I'd sign up!

As you can guess, i've been trying out the method in this video: https://www.youtube.com/watch?v=IhuURsu0jdI

A series of formulas for extracting unique data, compiling it into a list, and then using this list to make a dynamicly updating DV List.

It works just great. The trouble is, when it is applied to large data sets, it seriously ties up the cpu.

My question is how can I reduce or eliminate this problem? What elements can be altered or removed to improve performance?

The below formula extracts a list of unique names in sequence. From there it's fairly simple to make the DV List.

=IF(ROWS(B$13:B13)>$C$13,"",INDEX(User_Name_Extract,
SMALL(IF(FREQUENCY(IF(User_Name_Extract<>"",
MATCH(User_Name_Extract,User_Name_Extract&"",0)),
ROW(User_Name_Extract)-ROW('Data Source'!$A$2)+1),ROW(User_Name_Extract)-ROW('Data Source'!$A$2)+1),ROWS(B$13:B13))))

$C$13 contains a unique count formula.

B$13:B13 is the position on the list (so if there were only 10 unique names you wanted to extract, you could get away with dragging it to B$13:B23).

"User_Name_Extract" is the range, which in my case is 32000 rows xD

You can probably get a clearer understanding from the video than I can give. Is there any solution when applying this to such large datasets? And what particularly is causing the trouble?

Any help would be grand

XOR LX · Oct 31, 2014

Hi.

I'm afraid there's not much better a set-up using worksheet formulas than the one you're currently using: the data range is simply so large as to make it inevitably resource-heavy.

Regards

Chris Pucknell · Oct 31, 2014

XOR LX said:
Hi.

I'm afraid there's not much better a set-up using worksheet formulas than the one you're currently using: the data range is simply so large as to make it inevitably resource-heavy.

Regards

What about if I limited the range with a formula which drew on specified data? I'm trying to extract usernames from an access log, and they have an associated date... is that just going to make things worse? xD

Jon von der Heyden · Oct 31, 2014

Just a unique list from a range of values, in the order that they appear?

How does this compare?

Excel 2013

A

B

C

1

Values

Unique

2

a

3

a

b

4

a

c

5

b

d

6

c

e

7

c

8

c

9

d

10

d

11

d

12

e

13

e

Sheet1

Cell Formulas
Range	Formula
C2	{=IFERROR(INDEX($A$2:$A$13,MATCH(0,COUNTIF($C$1:$C1,$A$2:$A$13),0)),"")}
Press CTRL+SHIFT+ENTER to enter array formulas.

XOR LX · Oct 31, 2014

@Jon von der Heyden

A construction with IFERROR is likely to be much worse, if anything.

Generally, such a set-up is extremely inefficient, particularly if the dataset in question is a dynamic, potentially expanding one.

The point is that, if you have a set-up such as:

=IFERROR([some_large_array_formula],"")

which is intended to be copied down a sufficient number of rows so as to encapsulate all desired returns, then you will be faced with two choices.

Firstly, you can perform some calculation to determine precisely how many such returns you will have at any given time, and then drag this formula down that number of rows. Obviously this is not ideal, even less so if you have, as I said, a dynamically-changing dataset.

Secondly, we can copy the formulas down to an arbitrarily large number of rows such that we are guaranteed to cover all possible returns even should our dataset expand at some future time, and so need not worry ourselves about it again.

Obviously this second method is preferable in practice. The problem with the IFERROR construction (even worse if you're on 2003 or earlier and have to use a repeated IF(ISERROR clause) is that, in rows to which the formula is copied beyond that which is effectively necessary, there is nothing to prevent the large, resource-heavy array formula calculating needlessly.

The point is that, in the above construction, even in rows beyond that containing our last expected return Excel still has to spend all the resource on calculating the array formula part before it can then decide for itself whether it is in fact an error or not.

Far, far better than this IFERROR approach is, as in the link the OP provides, to use a single "helper" cell to first determine the number of rows which we expect to have returns, and then reference this in the formula instead.

Granted, we still have the question of how far to copy the formula down. And even though we have a calculation in place to determine the required number of rows, we still don't want to have to manually re-adjust the number of cells containing formulas every time we want to update results. This should be a one-off initial job.

I certainly wouldn't recommend copying down to the very end of the spreadsheet. However, provided a suitably large upper bound can be chosen, then it should not matter a great deal in terms of performance if we end up with even several thousand extraneous cells containing formulas. The reason being, and the massive difference between this set-up and the IFERROR approach, that here the initial clause:

=IF(ROWS(B$13:B13)>$C$13,"",

means that, in rows beyond the expected number of returns, the IF clause returns TRUE and so a blank is returned. And the nice part about the IF function is that, if the clause passed to it is TRUE, then the FALSE part - here a large, resource-heavy array formula - does not even get considered for calculation.

This is not at all the case with the IFERROR version, which carries on churning away, oblivious to the fact that its calculations are needless and a burden on resource.

Regards

Jon von der Heyden · Oct 31, 2014

I wasn't especially advocating use of IFERROR.

I wish Excel would offer a unique function, and emulate behaviour of Google Spreadsheets. I would probably opt form a UDF. I especially like Colins addin (free to download and all code is exposed). See his LISTUNIQUE here, using his self-extending UDF addina and specifically rxlRESIZE function. I wonder how it compares. It certainly solves the problem of not knowing how many cells to span to produce the results.

Jon von der Heyden · Oct 31, 2014

Jon von der Heyden said:
I wasn't especially advocating use of IFERROR.

In fact I'd happily leave the errors in the results. They will appear at the end of the list anyway - making it easy to construct a DNR for the list, whereby errors are ignored. And this would be a single formula and hence not add much in terms of overhead to the model, and it is a range that would perfectly suitable to use in DV.

E.g:

Excel 2013

A

B

C

D

E

F

1

Values

Unique

Formula for DNR

5

2

a

3

a

b

4

a

c

5

b

d

6

c

e

7

c

#N/A

8

c

#N/A

9

d

10

d

11

d

12

e

13

e

Sheet1

Cell Formulas
Range	Formula
F1	=COUNTA(INDEX(C:C,2,1):INDEX(C:C,MATCH(REPT("z",255),C:C,1),1))
C2	{=INDEX($A$2:$A$13,MATCH(0,COUNTIF($C$1:$C1,$A$2:$A$13),0))}
Press CTRL+SHIFT+ENTER to enter array formulas.

XOR LX · Oct 31, 2014

Jon von der Heyden said:
I wish Excel would offer a unique function, and emulate behaviour of Google Spreadsheets. I would probably opt form a UDF. I especially like Colins addin (free to download and all code is exposed). See his LISTUNIQUE here, using his self-extending UDF addina and specifically rxlRESIZE function. I wonder how it compares. It certainly solves the problem of not knowing how many cells to span to produce the results.

Agreed. That would be a very useful feature to have in Excel. Not that I know much about Google Sheets, but it seems that, not only in this case but in terms of other functions as well, that application has the upper hand over Excel. Perhaps Microsoft should start worrying!

Jon von der Heyden said:
In fact I'd happily leave the errors in the results. They will appear at the end of the list anyway - making it easy to construct a DNR for the list, whereby errors are ignored. And this would be a single formula and hence not add much in terms of overhead to the model, and it is a range that would perfectly suitable to use in DV.

But those formulas will still have to calculate in that scenario. The point is that there's no IF clause there to tell formulas in rows beyond necessary to "not bother" calculating the resource-heavy array formula part. As a result, that's potentially an awful lot more calculation than necessary, especially if the dataset is large (as here) and, as I said, we cannot be guaranteed of copying the formula to just the precise number of cells as required each time.

Regards

Jon von der Heyden · Oct 31, 2014

XOR LX said:
Perhaps Microsoft should start worrying!

No not at all, not yet at least. Google exposes a few nice functions but otherwise it is seriously lacking in tools/functionality, and as far as capacity and performance is concerned Excel has a very big lead.

Jon von der Heyden · Oct 31, 2014

XOR LX said:
we cannot be guaranteed of copying the formula to just the precise number of cells as required each time.

I would opt for Colins self-extending UDF if it were my model. But as it's not I would recommend a flag. One cell to count unique values and then a comparison flag to check that the number of results yielded by the unique formula agrees with the unique count. I'm a big fan of having these sorts of flags revealing status on the landing page, so that if it is off it is blooming obvious.

I wouldn't over-complicate the formula to produce unique values - over 32000 iterations is will just slow it down.

Edit: By the way - I'm not arguing about the IF to check if the formula evaluation is needed. What I was proposing was a different way to pull back the next unique item from the list, i.e.: =INDEX($A$2:$A$13,MATCH(0,COUNTIF($C$1:$C1,$A$2:$A$13),0)). You can still nest this within the IF().

Extracting unique data from Large Datasets - Excel Magic Trick 698

Chris Pucknell

New Member

Excel Facts

XOR LX

Well-known Member

Chris Pucknell

New Member

Jon von der Heyden

MrExcel MVP, Moderator

XOR LX

Well-known Member

Jon von der Heyden

MrExcel MVP, Moderator

Jon von der Heyden

MrExcel MVP, Moderator

XOR LX

Well-known Member

Jon von der Heyden

MrExcel MVP, Moderator

Jon von der Heyden

MrExcel MVP, Moderator

Similar threads

Forum statistics

Share this page

Extracting unique data from Large Datasets - Excel Magic Trick 698

New Member

Excel Facts

Well-known Member

New Member

MrExcel MVP, Moderator

Well-known Member

MrExcel MVP, Moderator

MrExcel MVP, Moderator

Well-known Member

MrExcel MVP, Moderator

MrExcel MVP, Moderator

Similar threads

Forum statistics

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock