comparing text strings (problem formula-any ideas, Mark W or Aladin)


Posted by carl on June 20, 2001 5:59 PM

Thank you for your help earlier in the year with the formula below. The formula works well however it has problems with words that have a double letter in common. For 'doll' and 'roll' it should return a value of 3 letters in common (position and identity independent), however the formula returns a value of 2. Sorry for not making the identity independence parameter clear in my initial request for help. Any ideas? There were a couple of other similar formulas suggested previously however all have the same problem. Thanks, Carl.

{=SUM(IF(FREQUENCY(CODE(MID(UPPER(A3),ROW(INDIRECT("1:"&LEN(A3))),1)),ROW($64:$90))*FREQUENCY(CODE(MID(UPPER(B3),ROW(INDIRECT("1:"&LEN(B3))),1)),ROW($64:$90))*MID(0&REPT(1,26)&0,ROW($1:$28),1),1))}

Posted by Mark W. on June 21, 2001 5:55 AM

Back to Square One

It would be helpful if you could provide some cases
and their expected result. For example, "roof" and
"rotor"... by your new requirement are there 3 or 4
letters in common?

Posted by Carl on June 21, 2001 5:00 PM

Re: Back to Square One

Sorry for not being clear about this. What I am interested in the amount of letter overlap. examples:

ROOF ROTOR 3
PIT TIT 2
ROLL DOLL 3
LOLL ROLL 3 (see below)

I have been working on a two step process that puts the letters in each string into separate columns using MID, and then uses a formula to count the number of letters in the first string that are in the second. The problem is that it does not work for pairs where the first letter string has more occurrences of a particular letter than the second, eg for LOLL and ROLL it gives 4 not 3 as I would like (I am dealing with both real words and non-words).

Thanks.

Posted by Mark W. on June 22, 2001 6:26 AM

Re: Back to Square One

So, what logic did you use to determine that the
overlap for "roof" and "rotor" was 3? Did you
only use "roto" for the comparison? Will longer
words always be truncated to length of the
a shorter one?

Posted by Mark W. on June 22, 2001 7:06 AM

Also...

In the case of "loll" and "roll", which of the
letters in "loll" are included in the count of 3?
And, why? So, what logic did you use to determine that the

Posted by Mark W. on June 22, 2001 7:32 AM

Upon further review...

...of your posting I now understand that the 1st
word is used as the standard for the comparison.
I think the morning coffee is starting to "kick in". So, what logic did you use to determine that the

Posted by Carl on June 24, 2001 4:31 PM

Re: Also...

The current set up, which has turned into a collaborative effort, is as follows:
S1 appears in column A and S2 appears in column C. The MID function (thanks Steve) then separates S1 into individual letters appearing in columns G:K depending on its length (at present we are only looking at letter strings with a max of 5 letters but this may change); and S2 is represented by its individual letters in columns M:Q. The following formula is then applied:
IF(AND(OR(G2=M2,G2=N2,G2=O2,G2=P2,G2=Q2),
G2<>""),1,0).
This formula is used in the cells S:W (in row 2 of course in this case) with G2 being replaced by H2 in the formula as you move from column S to column T and so on until you reach column W. So what the formula is doing as you can probably tell is assigning the value 1 to a cell for each appearance of a letter in S1 that also appears in S2. Empty cells in S1 (G:K) or letters in S1 that are not in S2 get coded 0. We then sum from S:W to get the amount of orthographic overlap in column Y.

So ROOF:ROTOR will give an output in cells S:W of 1 1 1 0 0 summing to 3 which is correct. But LOLL:DOLL will give 1 1 1 1 summing to a value of 4 which is incorrect. So as mentioned the formula at present works for all strings except in the case where there are more exemplars of a particular letter in S1 than in S2. As far as the actual column assignment is concerned from A:Y we are using the columns not mentioned for other menial things like letter counts and spacing etc. Also as I requested in my first message it would be best if we could come up with a formula that required only columns A:C, with S1 in ColA and S2 in ColB and the formula in ColC. Hope I'm making things clearer. I wouldn't be surprised if my colleage Steve chips in to help my clarity problem.

Carl.



Posted by steve on June 24, 2001 5:54 PM

Re: Also...

Hi, Carl has explained the approach we have taken so far, but to make it absolutely clear what we mean by "orthographic overlap", here's what you would do if you did it by hand: Take two letter strings e.g. ROLL and LOLL. Choose one word and cross out letters, a pair at a time, that appear in both words, the number of pairs you cross out is the amount of orthographic overlap of the two strings. For example,first you cross out one L from LOLL, then one L from ROLL, that's a pair.Then you cross out O from LOLL and O from ROLL, that's another pair. Then you cross out the next L from LOLL and L from ROLL. Then there are no more letters in common between the two words, as only L and R are left, and they don't match, so you stop. You then count the number of pairs you have crossed out, and that is 3, and that is the amount of orthographic overlap. This may give you other ideas about how to approach it. Ideally it should work for any pairs of strings e.g. WWWFF and KWFWQ.Here the overlap is 3. Or WWWFF and KFW. Here the overlap is 2.Don't want much, huh? steve

steve

: In the case of "loll" and "roll", which of the : letters in "loll" are included in the count of 3? : And, why?