Comparing millions for records for similar entries

AMAS

Active Member
Joined
Apr 11, 2010
Messages
472
Hi,

This is a theoretical question more than an actual problem I been thinking about as I view questions and responses on the forum. I have a database with millions of entries. Is it feasible to:

1) Compare each entry for identical matches.

2) Use Fuzzy logic to give me the best matches.

I am using Excel 2010 (64-bit).

I have read several great approaches including AlphaFrog's use of arrays to attack a similar problem. And Alan's excellent code for Fuzzy logic. But the looming question is whether or not this is a feasible task and what potential problems this could cause.

AMAS
 

Excel Facts

Easy bullets in Excel
If you have a numeric keypad, press Alt+7 on numeric keypad to type a bullet in Excel.
Well I would say no, but I'm sure others will disagree.

It just doesn't sound like Excel us the best tool for the job.

Also, you've mentioned you have a database

Why not try using a 'real' database?

Of course this is based on the details you've posted.

There could be other factors.
 
Upvote 0
Don't even think about it for a database of that size - that's what Access is for.
 
Upvote 0
Hi Norie,

You are probably right. And for clarification, its a dataset from a large database. It can be chopped up into smaller bits if that's the more operationalizable way, but I am only an amateur compared some of the great minds on this forum and I wanted to see what others think.

AMAS
 
Upvote 0
Hi Ruddles,

I have access, but am not very proficient with it. Is this the way to go? Can this be accomplished much better in Access rather than excel.

AMAS
 
Upvote 0
Assuming the effort involved in implementing fuzzy matching is more or less the same in Excel or Access, Access scores because it can handle millions of records in a single table, plus it has all the tools for finding duplicates - in fact there's a wizard for it - and counting/sorting.
 
Upvote 0
AMAS

Is this a one-off or relatively rare exercise?

If it is then perhaps Access but if it's some sort of ongoing process probably not Access.

You can work with millions of records in Access but it's not ideal.

Even running pretty simple queries can take a long time.

As for 'fuzzy' matching, wherever you do that it's going to be pretty intensive.
 
Upvote 0
If you can run SQL on the live database, that would be my first choice - provided you clear what you're going to do with your database admin people.
 
Upvote 0
Ruddles

Something tells me that isn't going to happen.

Allow a common user near the actual data.:eek:
 
Upvote 0

Forum statistics

Threads
1,224,551
Messages
6,179,473
Members
452,915
Latest member
hannnahheileen

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top