I have a large dataset (70,000 rows) on property ownership, which includes personal identifying information (i.e. people's names). I want to anonymize the data so that other users cannot identify the property owners. However, I need to preserve the internal fidelity of the names, so that users can still tell when multiple properties have the same owner. To further complicate things, the data are littered with typos, so the same name may appear several times with slightly different spellings. Here's an example:
| Property number | Owner name |
|---|---|
| 1 | Jeremiah Wilson |
| 2 | Emily Chang |
| 3 | Emily Chang |
| 4 | Jeremaih Wilson |
| 5 | Jeremiah W. Brown |
In the above example data, I want to achieve the following:
- the names in Column 2 cannot be identified
- the names for properties 2 and 3 are identical to each other
- the names for properties 1 and 4 are similar enough to know this is a typo
- the names for properties 1 and 5 are clearly different
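To make the target concrete, here is a minimal sketch of one possible output format, not a method I'm committed to. It assumes Python's standard-library `difflib` for string similarity. The `pseudonymize` helper, the `Owner-<group>.<spelling>` label format, and the 0.9 similarity threshold are all hypothetical choices made for illustration. Names that are near-matches share a group number, and each distinct spelling within a group gets its own suffix.

```python
from difflib import SequenceMatcher


def pseudonymize(names, threshold=0.9):
    """Replace each name with a label 'Owner-<group>.<spelling>'.

    Names whose similarity to a group's first spelling is at least
    `threshold` (an assumed, untuned value) are put in the same group.
    """
    clusters = []  # each cluster: list of distinct spellings, in order seen
    labels = {}    # original name -> anonymous label
    for name in names:
        if name in labels:
            continue  # identical spelling already labelled
        for idx, spellings in enumerate(clusters):
            score = SequenceMatcher(
                None, name.lower(), spellings[0].lower()
            ).ratio()
            if score >= threshold:
                spellings.append(name)  # likely typo of an existing owner
                break
        else:
            # No sufficiently similar group found: start a new one
            clusters.append([name])
            idx = len(clusters) - 1
            spellings = clusters[idx]
        labels[name] = f"Owner-{idx + 1}.{len(spellings)}"
    return [labels[n] for n in names]


owners = [
    "Jeremiah Wilson",
    "Emily Chang",
    "Emily Chang",
    "Jeremaih Wilson",
    "Jeremiah W. Brown",
]
print(pseudonymize(owners))
# ['Owner-1.1', 'Owner-2.1', 'Owner-2.1', 'Owner-1.2', 'Owner-3.1']
```

In this sketch, properties 2 and 3 get the identical label, properties 1 and 4 share a group but differ in spelling suffix, and property 5 lands in its own group. The labels follow the order of first appearance, and every new spelling is compared against each existing group, so this would likely need shuffling of group numbers and a faster matching strategy before being used on 70,000 rows.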
I apologize if I'm not using the correct terminology for aspects of this problem. Thank you for your help.