A book titled The Bible Code introduced the topic of equidistant letter sequences (ELS), described below, for finding words “hidden” in text. That book dealt with the Hebrew Bible, but it prompts the question of how often any given word would turn up this way in any text, say one in English.
For simplicity, and to better match the Hebrew, spaces and punctuation are removed. A particular text that I have in mind, thus crunched, has 284,939 characters remaining (letters and digits). How many times would you expect to find the word FLOOBLE as an equidistant letter sequence in the text? Ignore case. The word can start at any of the 284,939 characters and proceed by skipping any constant number of letters forward or backward. So, for example, if the 11,000th character were an F, the 10,000th an L, the 9,000th an O, and so on, that would be one occurrence. Of course, we don’t expect always to find such decimally round spacings. The question again: how many do we expect to find?
The absolute and relative frequencies of the relevant letters in the text are:
B 4771 0.016744
E 36232 0.127157
F 7167 0.025153
L 9563 0.033562
O 22486 0.078915
that is, for each letter is shown the number of occurrences in the text and that number divided by the total of characters in the text.
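To make the definition concrete, here is a minimal brute-force sketch of an ELS search. The function names `crunch` and `count_els` are hypothetical, and it assumes that a step of 1 (ordinary adjacent letters) counts as a valid spacing. It is far too slow for the full 284,939-character text and is meant only to illustrate what is being counted.

```python
import re

def crunch(raw):
    """Drop everything except letters and digits, and ignore case."""
    return re.sub(r"[^A-Za-z0-9]", "", raw).upper()

def count_els(text, word):
    """Count ELS occurrences of word in text, reading forward or backward.

    Assumption: a step of 1 (adjacent letters) is allowed.
    """
    n, k = len(text), len(word)
    if k < 2 or n < k:
        return 0
    backward = word[::-1]
    count = 0
    for step in range(1, (n - 1) // (k - 1) + 1):
        span = step * (k - 1)  # distance from first letter to last
        for start in range(n - span):
            letters = text[start : start + span + 1 : step]
            if letters == word:      # read left to right
                count += 1
            if letters == backward:  # same positions read right to left
                count += 1
    return count

print(count_els(crunch("c.x-a x t"), "CAT"))
```

The backward case reuses the same slice: reading a set of positions from right to left spells the word exactly when reading them left to right spells its reverse.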
(In reply to re(3): incorrect solution? by Charlie)
... continued:
2. The presence of a solution starting at position n with a skip of s precludes another one at, say, position n+s with that same skip. You could also say that the L at position n+s or at position n+5s precludes its use as an F, or an O, etc., but that’s just part of the ordinary distribution of letters, already accounted for. What does deserve consideration is the set of sequences with the same skip that begin s, 2s, …, 6s positions before or after where the given occurrence does. This is analogous to counting occurrences of the sequence 12345 in the decimal expansion of pi: one occurrence rules out a match at each of the eight overlapping starting positions, so nine sequences in all are used up. Here each occurrence uses up 13 sequences (itself plus six on either side). If we expect 5 occurrences, this precludes 5 × 13 = 65 sequences out of 13,531,705,620, leaving 13,531,705,555. We could iterate this, so that if the modified calculation lowered the expected number, we could retry with the new value, but we can see the change is only in the ninth significant digit.
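The arithmetic can be checked with a short sketch. It assumes letters behave as independent draws at the frequencies in the table, and it takes the number of candidate sequences as N²/6 rounded down, which reproduces the 13,531,705,620 figure used here. The function names and the exact-count comparison are my own additions for illustration.

```python
# Counts taken from the frequency table in the problem statement.
N = 284_939
COUNTS = {"B": 4771, "E": 36232, "F": 7167, "L": 9563, "O": 22486}
WORD = "FLOOBLE"

def match_probability(word, counts, n):
    """Chance that one fixed sequence spells word, assuming independence."""
    p = 1.0
    for letter in word:
        p *= counts[letter] / n
    return p

def exact_sequences(n, k):
    """Exact number of (start, step, direction) choices for a k-letter word."""
    total = 0
    for step in range(1, (n - 1) // (k - 1) + 1):
        total += 2 * (n - step * (k - 1))  # forward and backward
    return total

approx_sequences = N * N // 6  # the figure quoted in the post
p = match_probability(WORD, COUNTS, N)
expected = approx_sequences * p

# Each expected occurrence uses up 13 sequences: itself plus six either side.
blocked = 13 * round(expected)
adjusted = (approx_sequences - blocked) * p

print(f"sequences (approx): {approx_sequences:,}")
print(f"sequences (exact):  {exact_sequences(N, len(WORD)):,}")
print(f"expected: {expected:.4f}  adjusted: {adjusted:.10f}")
print(f"relative change: {(expected - adjusted) / expected:.2e}")
```

The exact count comes out slightly below N²/6, but the difference is small enough that it does not affect the estimate of about 5.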
Hopefully this shows that this approximation is as legitimate as using the normal distribution as an approximation to the binomial in the 2,000,000 coins problem.
As the non-independence effects would tend to lower the number of occurrences, it is interesting to note that the 8 occurrences observed in the actual text are near the high tail of the expected distribution rather than the low.
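To put a rough number on "near the high tail", one could treat the count as Poisson-distributed. That is an assumption of this sketch, not something stated above, and the mean of 5.08 is my own estimate from the frequency table (the post only says about 5). The function name `poisson_tail` is hypothetical.

```python
import math

def poisson_tail(lam, k):
    """P(X >= k) for X ~ Poisson(lam), via the complement of the lower sum."""
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - below

LAM = 5.08  # assumed mean: candidate sequences times per-sequence probability
print(f"P(X >= 8) = {poisson_tail(LAM, 8):.3f}")
```

Under these assumptions, 8 or more occurrences is uncommon but not extreme, which fits the description of being toward the high side of the distribution.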
Posted by Charlie on 2003-03-26 14:00:29