Extracting unique data from Large Datasets - Excel Magic Trick 698

Chris Pucknell · Oct 31, 2014

Hey

I couldn't find an answer for this so thought I'd sign up!

As you can guess, i've been trying out the method in this video: https://www.youtube.com/watch?v=IhuURsu0jdI

A series of formulas for extracting unique data, compiling it into a list, and then using this list to make a dynamicly updating DV List.

It works just great. The trouble is, when it is applied to large data sets, it seriously ties up the cpu.

My question is how can I reduce or eliminate this problem? What elements can be altered or removed to improve performance?

The below formula extracts a list of unique names in sequence. From there it's fairly simple to make the DV List.

=IF(ROWS(B$13:B13)>$C$13,"",INDEX(User_Name_Extract,
SMALL(IF(FREQUENCY(IF(User_Name_Extract<>"",
MATCH(User_Name_Extract,User_Name_Extract&"",0)),
ROW(User_Name_Extract)-ROW('Data Source'!$A$2)+1),ROW(User_Name_Extract)-ROW('Data Source'!$A$2)+1),ROWS(B$13:B13))))

$C$13 contains a unique count formula.

B$13:B13 is the position on the list (so if there were only 10 unique names you wanted to extract, you could get away with dragging it to B$13:B23).

"User_Name_Extract" is the range, which in my case is 32000 rows xD

You can probably get a clearer understanding from the video than I can give. Is there any solution when applying this to such large datasets? And what particularly is causing the trouble?

Any help would be grand

XOR LX · Oct 31, 2014

Jon von der Heyden said:
No not at all, not yet at least. Google exposes a few nice functions but otherwise it is seriously lacking in tools/functionality, and as far as capacity and performance is concerned Excel has a very big lead.

Ah, ok. Thanks. One day I shall have to investigate further. So better on functions but worse on overall functionality would be a fair summary?

All I know is that, when I run my weekly Advanced Formula Challenge over at my blogsite, every week I get a guy who offers a Google Sheets solution. Of course, I can't verify these solutions, but just on appearance, and going on his explanations, it does seem to be far more effective (and flexible) in a lot of worksheet formula situations than the equivalent Excel set-up.

Jon von der Heyden said:
Edit: By the way - I'm not arguing about the IF to check if the formula evaluation is needed. What I was proposing was a different way to pull back the next unique item from the list, i.e.: =INDEX($A$2:$A$13,MATCH(0,COUNTIF($C$1:$C1,$A$2:$A$13),0)). You can still nest this within the IF().

Yes, that's true. To be honest I always opt for the FREQUENCY set-up myself (I personally don't like the idea that a formula "cannot begin in row 1"?!) but I've never run any tests to see which of this and the MATCH/COUNTIF set-up perform better. I've a feeling I might have seen a post in which Mr. Akyurek expressed the opinion that the FREQUENCY set-up performed slightly better, though I may be imagining it.

Regards

Jon von der Heyden · Oct 31, 2014

XOR LX said:
Ah, ok. Thanks. One day I shall have to investigate further. So better on functions but worse on overall functionality would be a fair summary?

I think they have some very useful functions missing from Excel - and some existing functions are flexible (e.g. coutif can handle arrays) - but if I knit-pick your comments I wouldn't say they are better. All I know is that formula evaluation in Google is terribly slow compared with the equivalent in Excel. I do hope though that MS recognise some cool improvements to make in Excel.

XOR LX said:
To be honest I always opt for the FREQUENCY set-up myself (I personally don't like the idea that a formula "cannot begin in row 1"?!) but I've never run any tests to see which of this and the MATCH/COUNTIF set-up perform better.

I confess I don't know which method is quicker, nor why. I just presumed that what I prescribed would be better mostly because of omitting the SMALL(IF()). I'm hoping that the OP will give it a shot and let us know. I'm too lazy (busy) to set-up a sample and test.

Or maybe Aladin will just tell us.

XOR LX · Oct 31, 2014

Jon von der Heyden said:
I'm too lazy (busy) to set-up a sample and test. Or maybe Aladin will just tell us.

That's nothing. I don't even know how to time functions!

Jon von der Heyden · Oct 31, 2014

XOR LX said:
That's nothing. I don't even know how to time functions!

I recommend the timeGetTime API function (heaps of tutorials online and very easy to set-up). Apparently it's accurate to a significance of 10 milliseconds. To be honest I find it entirely unnecessary to use a high-resolution timer - I've just upsized the sample so to increase the time it takes. For me either one is significantly faster than the nother - or it isn't. The actual time doesn't matter - different machines will vary anyway.

XOR LX · Oct 31, 2014

Jon von der Heyden said:
I recommend the timeGetTime API function (heaps of tutorials online and very easy to set-up). Apparently it's accurate to a significance of 10 milliseconds. To be honest I find it entirely unnecessary to use a high-resolution timer - I've just upsized the sample so to increase the time it takes. For me either one is significantly faster than the nother - or it isn't. The actual time doesn't matter - different machines will vary anyway.

Thanks for the link. And perhaps I'll give it a go sometime, though to be honest I'm of a similar opinion to you re calculation timings. In fact, I put the task of shaving a few milliseconds off a particular construction as quite low on my priority list, to be honest. There are seemingly more important (and fascinating) sides to Excel to spend too much time on such matters.

Regards

Aladin Akyurek · Oct 31, 2014

@Chris Pucknell

1.

Better to define Ivec by means of Formulas | Name Manager as referring to:

Rich (BB code):

=ROW(User_Name_Extract)-ROW(INDEX(User_Name_Extract,1,1))+1

C13, control+shift+enter (CSE), not just enter:

Rich (BB code):

=SUM(IF(FREQUENCY(IF(User_Name_Extract<>"",
  MATCH(User_Name_Extract,User_Name_Extract&"",0)),
  Ivec),1))

B13, CSE and copy down:

Rich (BB code):

=IF(ROWS(B$13:B13)>$C$13,"",INDEX(User_Name_Extract,
  SMALL(IF(FREQUENCY(IF(User_Name_Extract<>"",
  MATCH(User_Name_Extract,User_Name_Extract&"",0)),
  Ivec),Ivec),ROWS(B$13:B13))))

The foregoing is sensitive to the size of the range(s) it must array-process. A possible alternative for processing a very large range is to trade off memory against speed as shown below.

2. Data Source

B1: 0 (required)

B2, just enter and copy down as far as needed:

Rich (BB code):

=IF(ISNUMBER(MATCH($A2,$A$1:A1,0)),"",
  LOOKUP(9.99999999999999E+307,$B$1:B1)+1)

C1, just enter:

Rich (BB code):

=LOOKUP(9.99999999999999E+307,A:A)

C2, just enter and copy down:

Rich (BB code):

=IF(ROWS($C$2:C1)<=$C$1,LOOKUP(ROWS($C$2:C2),B:B),"")

Now define DVlist as referring to:

Rich (BB code):

='Data Source'!$B$2:INDEX('Data Source'!$C:$C,
  MATCH("*",'Data Source'!$C:$C,-1))

Aladin Akyurek · Oct 31, 2014

Jon von der Heyden said:
I think they have some very useful functions missing from Excel - and some existing functions are flexible (e.g. coutif can handle arrays) - but if I knit-pick your comments I wouldn't say they are better. All I know is that formula evaluation in Google is terribly slow compared with the equivalent in Excel. I do hope though that MS recognise some cool improvements to make in Excel.

I confess I don't know which method is quicker, nor why. I just presumed that what I prescribed would be better mostly because of omitting the SMALL(IF()). I'm hoping that the OP will give it a shot and let us know. I'm too lazy (busy) to set-up a sample and test. Or maybe Aladin will just tell us.

MATCH (even ISNUMBER followed by MATCH) is faster than COUNTIF. See the profiling using FastExcel:
http://www.mrexcel.com/forum/excel-questions/38643-auto-flag-list.html#post178581

Chris Pucknell · Nov 4, 2014

Woah! This thread got way more air time than I was expecting

I'm going to have to digest all this and give it a try, thanks all for pitching in :D

Jon von der Heyden · Nov 4, 2014

Start by implementing post #16.

Chris Pucknell · Nov 4, 2014

Jon von der Heyden said:
Start by implementing post #16.

Oh don't worry I'm getting my hands dirty!

I'm trying out the first one first, but I'll do both and time them against eachother on this low spec work computer.

Have run into trouble.

I've defined User_Name_Extract:

Code:

='Data Source'!A2:INDEX('Data Source'!A2:A32001,COUNTA('Data Source'!A2:A32001))

and Ivec:

Code:

=ROW(User_Name_Extract)-ROW(INDEX(User_Name_Extract,1,1))+1

C13 (Unique Count):

Code:

{=SUM(IF(FREQUENCY(IF(User_Name_Extract<>"",MATCH(User_Name_Extract,User_Name_Extract&"",0)),Ivec),1))}

This returns #N/A Value Not Available Error, as does the list formula:

Code:

{=IF(ROWS(B$13:B13)>$C$13,"",INDEX(User_Name_Extract,
SMALL(IF(FREQUENCY(IF(User_Name_Extract<>"",
MATCH(User_Name_Extract,User_Name_Extract&"",0)),[SIZE=2][FONT=lucida console]
Ivec),Ivec),ROWS(B$13:B13))))[/FONT]}
[/SIZE]

So where have I gone wrong here? I'm guessing something to do with the User_Name_Extract.

Extracting unique data from Large Datasets - Excel Magic Trick 698

Chris Pucknell

New Member

XOR LX

Well-known Member

Excel Facts

Jon von der Heyden

MrExcel MVP, Moderator

XOR LX

Well-known Member

Jon von der Heyden

MrExcel MVP, Moderator

XOR LX

Well-known Member

Aladin Akyurek

MrExcel MVP

Aladin Akyurek

MrExcel MVP

Chris Pucknell

New Member

Jon von der Heyden

MrExcel MVP, Moderator

Chris Pucknell

New Member

Similar threads

Forum statistics

Share this page

Extracting unique data from Large Datasets - Excel Magic Trick 698

New Member

Well-known Member

Excel Facts

MrExcel MVP, Moderator

Well-known Member

MrExcel MVP, Moderator

Well-known Member

MrExcel MVP

MrExcel MVP

New Member

MrExcel MVP, Moderator

New Member

Similar threads

Forum statistics

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock