Repeat-Finding
A. First Edition
This is a small practice of "radix sort". Actually it is what I read from "Battle of Wits" which gave a brief
introduction of the application of "automaton" which was invented by IBM in "code-breaking" area.
If you have taken "data structure" course, I guess you know "radix sort". But I am not sure if you
understand it or know how to use it. For example, I personally don't know what I can use for "radix
sort" at all before I read my favourite book---"Battle of Wits".
In World War II, American secret service needed to break the code of Japanese. One way to do it is
to find out so-called "double hits", which means the repeated string in scrambled code. So, here is
the question: How to find out a certain length of repeated string in a text file?
It seems easy at first glance. However, what is your way if you take looking at following text:
THEALPHATESTDISCARDSFRAGMENTSDEPENDINGONTHEOUTCOMEOFACOMPARISONBETWEENTHEINCOMINGFRAGMENTSALPHAVA
LUESANDACONSTANTREFERENCEVALUETHEGLALPHAFUNCFUNCTIONSPECIFIESTHEREFERENCEANDCOMPARISONFUNCTIONTHE
COMPARISONISPERFORMEDONLYIFALPHATESTINGISENABLEDFORMOREINFORMATIONONGLALPHATESTSEEGLENABLETHEFUNC
ANDREFPARAMETERSSPECIFYTHECONDITIONSUNDERWHICHTHEPIXELISDRAWNTHEINCOMINGALPHAVALUEISCOMPAREDTOREF
USINGTHEFUNCTIONSPECIFIEDBYFUNCIFTHECOMPARISONPASSESTHEINCOMINGFRAGMENTISDRAWNCONDITIONALONSUBSEQ
UENTSTENCILANDDEPTHBUFFERTESTSIFTHECOMPARISONFAILSNOCHANGEISMADETOTHEFRAMEBUFFERATTHATPIXELLOCATI
ONTHEGLALPHAFUNCFUNCTIONOPERATESONALLPIXELWRITESINCLUDINGTHOSERESULTINGFROMTHESCANCONVERSIONOFPOI
NTSLINESPOLYGONSANDBITMAPSANDFROMPIXELDRAWANDCOPYOPERATIONSTHEGLALPHAFUNCFUNCTIONDOESNOTAFFECTSCR
EENCLEAROPERATIONSALPHATESTINGISDONEONLYINRGBAMODETHEFOLLOWINGFUNCTIONSRETRIEVEINFORMATIONRELATED
TOTHEGLALPHAFUNCFUNCTION
In order to make the text look like real enciphered, I removed all punctuations and white spaces.
The IBM machine can sort punch cards automatically which was used to find out repeated strings. The sorting
method is exactly "radix sort" which clarify each string by its indexed characters. (I don't know what I am
talking about at all.)
First, you sort all string by its first character. Second, you sort them by its second character...Finally, you
just look those strings who are neighbours. You compare them to see if they are truly repeating each other.
(After I finished my little practice, I called Mr. Zhu and tried to impress him by asking the question. Quite
to my surprise, he quickly gave me the most obvious solution which should be used by almost 95% of c/c++
programmer: You start from the first character in the stream, then your second pointer move from first to end
to try to find the repeated one. That's it. Even though it is square(n), it works!!)
E.Further improvement
It is a small game and no one would bother to treat them seriously, I presume.
F.File listing
1. repeat.cpp(main)
file name: repeat.cpp
#include <stdio.h>
#include <stdlib.h>
const int MaxRepeatNumber=300;
const int MaxStringLen=20;
const int MaxFileLength=2048;
const int MaxBinNumber=26;
class Repeat
{
private:
char buffer[MaxFileLength];
int current;
int source[MaxBinNumber][MaxRepeatNumber];
int target[MaxBinNumber][MaxRepeatNumber];
int sourceCounter[MaxBinNumber];
int targetCounter[MaxBinNumber];
public:
void readFile(char* fileName);
void findRepeat();
void report();
};
int main()
{
Repeat R;
R.readFile("nick.txt");
R.findRepeat();
R.report();
return 0;
}
void Repeat::findRepeat()
{
int pos;
for (pos=0; pos<current; pos++)
{
source[buffer[pos]-'A'][sourceCounter[buffer[pos]-'A']++]=pos;
}
for (int i=1; i<MaxStringLen; i++)
{
//for all character
for (int bin=0; bin<MaxBinNumber; bin++)
{
//for each character which has similar character
for (int count=0; count<sourceCounter[bin]; count++)
{
pos=source[bin][count];
if (pos+i<current)
{
target[buffer[pos+i]-'A'][targetCounter[buffer[pos+i]-'A']++]=pos;
}
}
}
//now copy target back to source
for (bin=0; bin<MaxBinNumber; bin++)
{
for (int count=0; count<targetCounter[bin]; count++)
{
source[bin][count]=target[bin][count];
}
sourceCounter[bin]=targetCounter[bin];
targetCounter[bin]=0;
}
}
}
void Repeat::report()
{
int first, second;
bool found;
printf("first let me show you the strings\n%s\n\n", buffer);
for (int i=0; i<MaxBinNumber; i++)
{
first=0;
found=true;
for (second=1; second<sourceCounter[i]; second++)
{
found=true;
for (int index=0; index<MaxStringLen; index++)
{
if (source[i][second]+index>=current)
{
found=false;
break;
}
if (buffer[source[i][first]+index]!=buffer[source[i][second]+index])
{
found=false;
break;
}
}
if (found&&first!=second)
{
printf("found repeat string of length %d at first=%d, second=%d\n",
MaxStringLen, source[i][first], source[i][second]);
printf("the string is:");
for (int j=0; j<MaxStringLen; j++)
{
printf("%c", buffer[source[i][first]+j]);
}
printf("\n");
}
first=second;
}
}
}
void Repeat::readFile(char* fileName)
{
FILE* stream;
char ch;
current=0;
stream=fopen(fileName, "r");
ch=fgetc(stream);
if (ch>='A'&&ch<='Z'||ch>='a'&&ch<='z')
{
buffer[current]=toupper(ch);
}
while (!feof(stream))
{
ch=fgetc(stream);
if (ch>='A'&&ch<='Z'||ch>='a'&&ch<='z')
{
current++;
buffer[current]=toupper(ch);
}
}
current++;
buffer[current]='\0';
for (int i=0; i<MaxBinNumber; i++)
{
sourceCounter[i]=targetCounter[i]=0;
}
}
The result is like following:
First let me show you the strings
THEALPHATESTDISCARDSFRAGMENTSDEPENDINGONTHEOUTCOMEOFACOMPARISONBETWEENTHEINCOMINGFRAGMENTSALPHAVALUESANDACONSTANTREFERENCE
VALUETHEGLALPHAFUNCFUNCTIONSPECIFIESTHEREFERENCEANDCOMPARISONFUNCTIONTHECOMPARISONISPERFORMEDONLYIFALPHATESTINGISENABLEDFO
RMOREINFORMATIONONGLALPHATESTSEEGLENABLETHEFUNCANDREFPARAMETERSSPECIFYTHECONDITIONSUNDERWHICHTHEPIXELISDRAWNTHEINCOMINGALP
HAVALUEISCOMPAREDTOREFUSINGTHEFUNCTIONSPECIFIEDBYFUNCIFTHECOMPARISONPASSESTHEINCOMINGFRAGMENTISDRAWNCONDITIONALONSUBSEQUEN
TSTENCILANDDEPTHBUFFERTESTSIFTHECOMPARISONFAILSNOCHANGEISMADETOTHEFRAMEBUFFERATTHATPIXELLOCATIONTHEGLALPHAFUNCFUNCTIONOPER
ATESONALLPIXELWRITESINCLUDINGTHOSERESULTINGFROMTHESCANCONVERSIONOFPOINTSLINESPOLYGONSANDBITMAPSANDFROMPIXELDRAWANDCOPYOPER
ATIONSTHEGLALPHAFUNCFUNCTIONDOESNOTAFFECTSCREENCLEAROPERATIONSALPHATESTINGISDONEONLYINRGBAMODETHEFOLLOWINGFUNCTIONSRETRIEV
EINFORMATIONRELATEDTOTHEGLALPHAFUNCFUNCTION
found repeat string of length 20 at first=127, second=584
the string is:THEGLALPHAFUNCFUNCTI
found repeat string of length 20 at first=584, second=738
the string is:THEGLALPHAFUNCFUNCTI
found repeat string of length 20 at first=738, second=875
the string is:THEGLALPHAFUNCFUNCTI
found repeat string of length 20 at first=129, second=586
the string is:EGLALPHAFUNCFUNCTION
found repeat string of length 20 at first=586, second=740
the string is:EGLALPHAFUNCFUNCTION
found repeat string of length 20 at first=740, second=877
the string is:EGLALPHAFUNCFUNCTION
found repeat string of length 20 at first=128, second=585
the string is:HEGLALPHAFUNCFUNCTIO
found repeat string of length 20 at first=585, second=739
the string is:HEGLALPHAFUNCFUNCTIO
found repeat string of length 20 at first=739, second=876
the string is:HEGLALPHAFUNCFUNCTIO