Extracting unique data from Large Datasets - Excel Magic Trick 698

Chris Pucknell

New Member
Joined
Oct 31, 2014
Messages
11
Hey

I couldn't find an answer for this so thought I'd sign up!

As you can guess, i've been trying out the method in this video: https://www.youtube.com/watch?v=IhuURsu0jdI

A series of formulas for extracting unique data, compiling it into a list, and then using this list to make a dynamicly updating DV List.

It works just great. The trouble is, when it is applied to large data sets, it seriously ties up the cpu.

My question is how can I reduce or eliminate this problem? What elements can be altered or removed to improve performance?

The below formula extracts a list of unique names in sequence. From there it's fairly simple to make the DV List.

=IF(ROWS(B$13:B13)>$C$13,"",INDEX(User_Name_Extract,
SMALL(IF(FREQUENCY(IF(User_Name_Extract<>"",
MATCH(User_Name_Extract,User_Name_Extract&"",0)),
ROW(User_Name_Extract)-ROW('Data Source'!$A$2)+1),ROW(User_Name_Extract)-ROW('Data Source'!$A$2)+1),ROWS(B$13:B13))))

$C$13 contains a unique count formula.

B$13:B13 is the position on the list (so if there were only 10 unique names you wanted to extract, you could get away with dragging it to B$13:B23).

"User_Name_Extract" is the range, which in my case is 32000 rows xD

You can probably get a clearer understanding from the video than I can give. Is there any solution when applying this to such large datasets? And what particularly is causing the trouble?

Any help would be grand :)
 
No not at all, not yet at least. Google exposes a few nice functions but otherwise it is seriously lacking in tools/functionality, and as far as capacity and performance is concerned Excel has a very big lead.

Ah, ok. Thanks. One day I shall have to investigate further. So better on functions but worse on overall functionality would be a fair summary?

All I know is that, when I run my weekly Advanced Formula Challenge over at my blogsite, every week I get a guy who offers a Google Sheets solution. Of course, I can't verify these solutions, but just on appearance, and going on his explanations, it does seem to be far more effective (and flexible) in a lot of worksheet formula situations than the equivalent Excel set-up.

Edit: By the way - I'm not arguing about the IF to check if the formula evaluation is needed. What I was proposing was a different way to pull back the next unique item from the list, i.e.: =INDEX($A$2:$A$13,MATCH(0,COUNTIF($C$1:$C1,$A$2:$A$13),0)). You can still nest this within the IF().

Yes, that's true. To be honest I always opt for the FREQUENCY set-up myself (I personally don't like the idea that a formula "cannot begin in row 1"?!) but I've never run any tests to see which of this and the MATCH/COUNTIF set-up perform better. I've a feeling I might have seen a post in which Mr. Akyurek expressed the opinion that the FREQUENCY set-up performed slightly better, though I may be imagining it.

Regards
 
Last edited:
Upvote 0

Excel Facts

Best way to learn Power Query?
Read M is for (Data) Monkey book by Ken Puls and Miguel Escobar. It is the complete guide to Power Query.
Ah, ok. Thanks. One day I shall have to investigate further. So better on functions but worse on overall functionality would be a fair summary?
I think they have some very useful functions missing from Excel - and some existing functions are flexible (e.g. coutif can handle arrays) - but if I knit-pick your comments I wouldn't say they are better. All I know is that formula evaluation in Google is terribly slow compared with the equivalent in Excel. I do hope though that MS recognise some cool improvements to make in Excel.

To be honest I always opt for the FREQUENCY set-up myself (I personally don't like the idea that a formula "cannot begin in row 1"?!) but I've never run any tests to see which of this and the MATCH/COUNTIF set-up perform better.
I confess I don't know which method is quicker, nor why. I just presumed that what I prescribed would be better mostly because of omitting the SMALL(IF()). I'm hoping that the OP will give it a shot and let us know. I'm too lazy (busy) to set-up a sample and test. ;) Or maybe Aladin will just tell us.
 
Upvote 0
That's nothing. I don't even know how to time functions! :)
I recommend the timeGetTime API function (heaps of tutorials online and very easy to set-up). Apparently it's accurate to a significance of 10 milliseconds. To be honest I find it entirely unnecessary to use a high-resolution timer - I've just upsized the sample so to increase the time it takes. For me either one is significantly faster than the nother - or it isn't. The actual time doesn't matter - different machines will vary anyway.
 
Upvote 0
I recommend the timeGetTime API function (heaps of tutorials online and very easy to set-up). Apparently it's accurate to a significance of 10 milliseconds. To be honest I find it entirely unnecessary to use a high-resolution timer - I've just upsized the sample so to increase the time it takes. For me either one is significantly faster than the nother - or it isn't. The actual time doesn't matter - different machines will vary anyway.

Thanks for the link. And perhaps I'll give it a go sometime, though to be honest I'm of a similar opinion to you re calculation timings. In fact, I put the task of shaving a few milliseconds off a particular construction as quite low on my priority list, to be honest. There are seemingly more important (and fascinating) sides to Excel to spend too much time on such matters.

Regards
 
Upvote 0
@Chris Pucknell

1.

Better to define Ivec by means of Formulas | Name Manager as referring to:
Rich (BB code):
=ROW(User_Name_Extract)-ROW(INDEX(User_Name_Extract,1,1))+1

C13, control+shift+enter (CSE), not just enter:
Rich (BB code):
=SUM(IF(FREQUENCY(IF(User_Name_Extract<>"",
  MATCH(User_Name_Extract,User_Name_Extract&"",0)),
  Ivec),1))

B13, CSE and copy down:
Rich (BB code):
=IF(ROWS(B$13:B13)>$C$13,"",INDEX(User_Name_Extract,
  SMALL(IF(FREQUENCY(IF(User_Name_Extract<>"",
  MATCH(User_Name_Extract,User_Name_Extract&"",0)),
  Ivec),Ivec),ROWS(B$13:B13))))

The foregoing is sensitive to the size of the range(s) it must array-process. A possible alternative for processing a very large range is to trade off memory against speed as shown below.

2. Data Source

B1: 0 (required)

B2, just enter and copy down as far as needed:
Rich (BB code):
=IF(ISNUMBER(MATCH($A2,$A$1:A1,0)),"",
  LOOKUP(9.99999999999999E+307,$B$1:B1)+1)

C1, just enter:
Rich (BB code):
=LOOKUP(9.99999999999999E+307,A:A)

C2, just enter and copy down:
Rich (BB code):
=IF(ROWS($C$2:C1)<=$C$1,LOOKUP(ROWS($C$2:C2),B:B),"")

Now define DVlist as referring to:
Rich (BB code):
='Data Source'!$B$2:INDEX('Data Source'!$C:$C,
  MATCH("*",'Data Source'!$C:$C,-1))
 
Upvote 0
I think they have some very useful functions missing from Excel - and some existing functions are flexible (e.g. coutif can handle arrays) - but if I knit-pick your comments I wouldn't say they are better. All I know is that formula evaluation in Google is terribly slow compared with the equivalent in Excel. I do hope though that MS recognise some cool improvements to make in Excel.


I confess I don't know which method is quicker, nor why. I just presumed that what I prescribed would be better mostly because of omitting the SMALL(IF()). I'm hoping that the OP will give it a shot and let us know. I'm too lazy (busy) to set-up a sample and test. ;) Or maybe Aladin will just tell us.

MATCH (even ISNUMBER followed by MATCH) is faster than COUNTIF. See the profiling using FastExcel:
http://www.mrexcel.com/forum/excel-questions/38643-auto-flag-list.html#post178581
 
Upvote 0
Woah! This thread got way more air time than I was expecting :)

I'm going to have to digest all this and give it a try, thanks all for pitching in :D
 
Upvote 0
Start by implementing post #16. :)

Oh don't worry I'm getting my hands dirty!

I'm trying out the first one first, but I'll do both and time them against eachother on this low spec work computer.

Have run into trouble.

I've defined User_Name_Extract:
Code:
='Data Source'!A2:INDEX('Data Source'!A2:A32001,COUNTA('Data Source'!A2:A32001))

and Ivec:
Code:
=ROW(User_Name_Extract)-ROW(INDEX(User_Name_Extract,1,1))+1

C13 (Unique Count):
Code:
{=SUM(IF(FREQUENCY(IF(User_Name_Extract<>"",MATCH(User_Name_Extract,User_Name_Extract&"",0)),Ivec),1))}

This returns #N/A Value Not Available Error, as does the list formula:

Code:
{=IF(ROWS(B$13:B13)>$C$13,"",INDEX(User_Name_Extract,
SMALL(IF(FREQUENCY(IF(User_Name_Extract<>"",
MATCH(User_Name_Extract,User_Name_Extract&"",0)),[SIZE=2][FONT=lucida console]
Ivec),Ivec),ROWS(B$13:B13))))[/FONT]}
[/SIZE]

So where have I gone wrong here? I'm guessing something to do with the User_Name_Extract.
 
Upvote 0

Forum statistics

Threads
1,215,487
Messages
6,125,080
Members
449,205
Latest member
Healthydogs

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top