A book titled The Bible Code introduced the topic of equidistant letter sequences (ELS), described below, for finding words “hidden” in text. That book dealt with the Hebrew Bible, but it prompts a question about finding any given word in any text, say an English-language one.
For simplicity, and to better match the Hebrew, spaces and punctuation are removed. A particular text that I have in mind, thus crunched, has 284,939 characters remaining (letters and digits). How many times would you expect to find the word FLOOBLE as an equidistant letter sequence in the text? Ignore case. The word can start at any of the 284,939 characters and proceed by skipping any constant number of characters, forward or backward. So, for example, if the 11,000th character were an F, the 10,000th an L, the 9,000th an O, and so on, that would be one occurrence (reading backward at a spacing of 1,000). Of course we don’t expect always to find such decimally round spacings. The question again: how many do we expect to find?
The absolute and relative frequencies of the relevant letters in the text are:
B 4771 0.016744
E 36232 0.127157
F 7167 0.025153
L 9563 0.033562
O 22486 0.078915
That is, each row shows a letter, the number of times it occurs in the text, and that number divided by the total number of characters in the text.
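To make the definition concrete, here is a minimal Python sketch of an ELS search. The names crunch and find_els are hypothetical helpers of my own, not anything from the book or the puzzle, and the approach (pairing each occurrence of the first letter with each occurrence of the second to fix the spacing) is just one convenient way to enumerate candidates. It assumes the word has at least two letters.

```python
def crunch(raw):
    """Drop spaces and punctuation, keep letters and digits, ignore case."""
    return "".join(ch for ch in raw.upper() if ch.isalnum())


def find_els(text, word):
    """Return (start, step) pairs, 0-based, where word occurs as an ELS.

    step is the signed distance between consecutive letters of the word;
    a negative step means the word reads backward through the text.
    Assumes text is already crunched and len(word) >= 2.
    """
    word = word.upper()
    n, k = len(text), len(word)
    firsts = [i for i, ch in enumerate(text) if ch == word[0]]
    seconds = [j for j, ch in enumerate(text) if ch == word[1]]
    hits = []
    for i in firsts:
        for j in seconds:
            step = j - i  # the first two letters fix the spacing
            last = i + (k - 1) * step  # where the final letter would fall
            if step == 0 or not (0 <= last < n):
                continue
            if all(text[i + m * step] == word[m] for m in range(2, k)):
                hits.append((i, step))
    return hits


# Toy text with FLOOBLE planted at every second character.
sample = crunch("F.x L,x O x O x B x L x E")
print(find_els(sample, "flooble"))        # [(0, 2)]
print(find_els(sample[::-1], "flooble"))  # [(12, -2)]
```

On a text of 284,939 characters this pairing still means tens of millions of candidate checks, so it is meant as an illustration of the rule rather than an efficient search.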
(In reply to incorrect solution?)
From my point of view, the way one occurrence of flooble affects another is that the presence of, say, flooble at locations 100, 200, ..., 700 means that location 200, for example, is an L. That of course makes flooble more likely at, say, 198, 200, ..., 210, but that's already taken into consideration in the letter frequencies.
Admittedly, the above prevents, to take another example, a flooble starting at 200 and going at increments of 100, but the actual occurrences are so few and far between that I don't foresee a large effect. In this case we see 8 actual or 5 expected sequences occupying 56 characters, or about as many disallowed sequence positions, out of 284,939 character positions and 13,531,705,620 pseudorandom sequences.
Put another way, when there are hundreds of thousands of balls in an urn, and 8 are drawn out, I don't think it matters much whether there is sampling with or without replacement.
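For anyone wanting to check the figure of about 5 expected sequences, here is a rough Python sketch. It assumes each character is an independent draw from the letter frequencies given in the puzzle and counts every (start, spacing) pair that fits inside the text, in both directions. The function names are hypothetical, and my way of counting placements at the boundaries is an assumption, so the total may differ very slightly from the 13,531,705,620 quoted above; the expected value is not sensitive to that.

```python
N = 284_939  # characters in the crunched text
COUNTS = {"B": 4771, "E": 36232, "F": 7167, "L": 9563, "O": 22486}
WORD = "FLOOBLE"


def count_placements(n, k):
    """Count (start, spacing) pairs that fit k letters in n characters.

    Assumption: spacing d >= 1 leaves n - (k - 1) * d valid starts,
    and every forward placement has a backward twin.
    """
    span = k - 1
    max_spacing = (n - 1) // span
    forward = sum(n - span * d for d in range(1, max_spacing + 1))
    return 2 * forward


def chance_per_placement(word, counts, n):
    """Probability that one fixed placement spells the word."""
    p = 1.0
    for ch in word:
        p *= counts[ch] / n
    return p


placements = count_placements(N, len(WORD))
expected = placements * chance_per_placement(WORD, COUNTS, N)
print(placements, round(expected, 2))
```

Under these assumptions the expected count comes out a little above 5, in line with the figure used above.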
Even without this argument from large numbers, however, there is the fact that the expected number (as opposed to the probability of a given number) should not be affected by lack of independence. For example, in the matching of 9 labels with 9 cans, we expect 1 match, which is the 1/9 chance that any given label will match multiplied by the 9 opportunities for a match. But the occurrences of matches are not independent: if one matches, it's more likely that others do also. In the 9-can problem, the probability of exactly 1 match is .367882, and of exactly 2 matches is .183929, about 1/2 that of one. If the can of peas matches its label, the can of corn has a 1/8 probability of matching rather than 1/9. Still, the expected number of matches is 1.
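The can-and-label numbers can be checked with a short Python sketch. It uses the standard derangement count (the number of ways to place items so that none is in its own spot) to get the exact distribution of matches; the function names are my own, chosen only for illustration.

```python
from fractions import Fraction
from math import comb, factorial


def derangements(n):
    """Number of permutations of n items with no item in its own place."""
    if n == 0:
        return 1
    prev, cur = 1, 0  # D(0), D(1)
    for i in range(2, n + 1):
        prev, cur = cur, (i - 1) * (prev + cur)
    return cur


def match_distribution(n):
    """P(exactly k labels match) for k = 0..n, as exact fractions."""
    total = factorial(n)
    return [Fraction(comb(n, k) * derangements(n - k), total)
            for k in range(n + 1)]


dist = match_distribution(9)
expected_matches = sum(k * p for k, p in enumerate(dist))

print(expected_matches)          # 1
print(round(float(dist[1]), 6))  # 0.367882
print(round(float(dist[2]), 6))  # 0.183929
```

The matches are clearly dependent on one another, yet the expectation is exactly 1, the same as 9 opportunities times a 1/9 chance each.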
Posted by Charlie
on 2003-03-25 10:24:42