RepeatFinding

         Repeat-Finding

A. First Edition
This is a small practice of "radix sort". Actually it is what I read from "Battle of Wits" which gave a brief
introduction of the application of "automaton" which was invented by IBM in "code-breaking" area. 
B.The problem

If you have taken "data structure" course, I guess you know "radix sort". But I am not sure if you

understand it or know how to use it. For example, I personally don't know what I can use for "radix

sort" at all before I read my favourite book---"Battle of Wits".

In World War II, American secret service needed to break the code of Japanese. One way to do it is

to find out so-called "double hits", which means the repeated string in scrambled code. So, here is

the question: How to find out a certain length of repeated string in a text file?

It seems easy at first glance. However, what is your way if you take looking at following text:

 

THEALPHATESTDISCARDSFRAGMENTSDEPENDINGONTHEOUTCOMEOFACOMPARISONBETWEENTHEINCOMINGFRAGMENTSALPHAVA

LUESANDACONSTANTREFERENCEVALUETHEGLALPHAFUNCFUNCTIONSPECIFIESTHEREFERENCEANDCOMPARISONFUNCTIONTHE

COMPARISONISPERFORMEDONLYIFALPHATESTINGISENABLEDFORMOREINFORMATIONONGLALPHATESTSEEGLENABLETHEFUNC

ANDREFPARAMETERSSPECIFYTHECONDITIONSUNDERWHICHTHEPIXELISDRAWNTHEINCOMINGALPHAVALUEISCOMPAREDTOREF

USINGTHEFUNCTIONSPECIFIEDBYFUNCIFTHECOMPARISONPASSESTHEINCOMINGFRAGMENTISDRAWNCONDITIONALONSUBSEQ

UENTSTENCILANDDEPTHBUFFERTESTSIFTHECOMPARISONFAILSNOCHANGEISMADETOTHEFRAMEBUFFERATTHATPIXELLOCATI

ONTHEGLALPHAFUNCFUNCTIONOPERATESONALLPIXELWRITESINCLUDINGTHOSERESULTINGFROMTHESCANCONVERSIONOFPOI

NTSLINESPOLYGONSANDBITMAPSANDFROMPIXELDRAWANDCOPYOPERATIONSTHEGLALPHAFUNCFUNCTIONDOESNOTAFFECTSCR

EENCLEAROPERATIONSALPHATESTINGISDONEONLYINRGBAMODETHEFOLLOWINGFUNCTIONSRETRIEVEINFORMATIONRELATED

TOTHEGLALPHAFUNCFUNCTION

 

In order to make the text look like real enciphered, I removed all punctuations and white spaces.

 

C.The idea of program
 
The IBM machine can sort punch cards automatically which was used to find out repeated strings. The sorting
method is exactly "radix sort" which clarify each string by its indexed characters. (I don't know what I am 
talking about at all.)
First, you sort all string by its first character. Second, you sort them by its second character...Finally, you
just look those strings who are neighbours. You compare them to see if they are truly repeating each other.
 
(After I finished my little practice, I called Mr. Zhu and tried to impress him by asking the question. Quite
to my surprise, he quickly gave me the most obvious solution which should be used by almost 95% of c/c++ 
programmer: You start from the first character in the stream, then your second pointer move from first to end
to try to find the repeated one. That's it. Even though it is square(n), it works!!)
D.The major functions
 
E.Further improvement
It is a small game and no one would bother to treat them seriously, I presume.
F.File listing
1. repeat.cpp(main)
file name: repeat.cpp
#include <stdio.h>
#include <stdlib.h>

const int MaxRepeatNumber=300;
const int MaxStringLen=20;
const int MaxFileLength=2048;
const int MaxBinNumber=26;

class Repeat
{
private:
char buffer[MaxFileLength];
int current;
int source[MaxBinNumber][MaxRepeatNumber];
int target[MaxBinNumber][MaxRepeatNumber];
int sourceCounter[MaxBinNumber];
int targetCounter[MaxBinNumber];

public:
void readFile(char* fileName);
void findRepeat();
void report();

};

int main()
{
Repeat R;
R.readFile("nick.txt");
R.findRepeat();
R.report();
return 0;
}




void Repeat::findRepeat()
{
int pos;
for (pos=0; pos<current; pos++)
{
source[buffer[pos]-'A'][sourceCounter[buffer[pos]-'A']++]=pos;
}
for (int i=1; i<MaxStringLen; i++)
{
//for all character
for (int bin=0; bin<MaxBinNumber; bin++)
{
//for each character which has similar character
for (int count=0; count<sourceCounter[bin]; count++)
{
pos=source[bin][count];
if (pos+i<current)
{
target[buffer[pos+i]-'A'][targetCounter[buffer[pos+i]-'A']++]=pos;
}
}
}
//now copy target back to source
for (bin=0; bin<MaxBinNumber; bin++)
{
for (int count=0; count<targetCounter[bin]; count++)
{
source[bin][count]=target[bin][count];
}
sourceCounter[bin]=targetCounter[bin];
targetCounter[bin]=0;
}
}
}

void Repeat::report()
{
int first, second;
bool found;
printf("first let me show you the strings\n%s\n\n", buffer);
for (int i=0; i<MaxBinNumber; i++)
{
first=0;
found=true;
for (second=1; second<sourceCounter[i]; second++)
{
found=true;
for (int index=0; index<MaxStringLen; index++)
{
if (source[i][second]+index>=current)
{
found=false;
break;
}

if (buffer[source[i][first]+index]!=buffer[source[i][second]+index])
{
found=false;
break;
}
}

if (found&&first!=second)
{
printf("found repeat string of length %d at first=%d, second=%d\n",
MaxStringLen, source[i][first], source[i][second]);
printf("the string is:");
for (int j=0; j<MaxStringLen; j++)
{
printf("%c", buffer[source[i][first]+j]);
}
printf("\n");
}
first=second;
}
}
}





void Repeat::readFile(char* fileName)
{
FILE* stream;
char ch;
current=0;
stream=fopen(fileName, "r");
ch=fgetc(stream);
if (ch>='A'&&ch<='Z'||ch>='a'&&ch<='z')
{
buffer[current]=toupper(ch);
}
while (!feof(stream))
{
ch=fgetc(stream);
if (ch>='A'&&ch<='Z'||ch>='a'&&ch<='z')
{
current++;

buffer[current]=toupper(ch);
}

}
current++;
buffer[current]='\0';
for (int i=0; i<MaxBinNumber; i++)
{
sourceCounter[i]=targetCounter[i]=0;
}
}

The result is like following: 
First let me show you the strings
THEALPHATESTDISCARDSFRAGMENTSDEPENDINGONTHEOUTCOMEOFACOMPARISONBETWEENTHEINCOMINGFRAGMENTSALPHAVALUESANDACONSTANTREFERENCE
VALUETHEGLALPHAFUNCFUNCTIONSPECIFIESTHEREFERENCEANDCOMPARISONFUNCTIONTHECOMPARISONISPERFORMEDONLYIFALPHATESTINGISENABLEDFO
RMOREINFORMATIONONGLALPHATESTSEEGLENABLETHEFUNCANDREFPARAMETERSSPECIFYTHECONDITIONSUNDERWHICHTHEPIXELISDRAWNTHEINCOMINGALP
HAVALUEISCOMPAREDTOREFUSINGTHEFUNCTIONSPECIFIEDBYFUNCIFTHECOMPARISONPASSESTHEINCOMINGFRAGMENTISDRAWNCONDITIONALONSUBSEQUEN
TSTENCILANDDEPTHBUFFERTESTSIFTHECOMPARISONFAILSNOCHANGEISMADETOTHEFRAMEBUFFERATTHATPIXELLOCATIONTHEGLALPHAFUNCFUNCTIONOPER
ATESONALLPIXELWRITESINCLUDINGTHOSERESULTINGFROMTHESCANCONVERSIONOFPOINTSLINESPOLYGONSANDBITMAPSANDFROMPIXELDRAWANDCOPYOPER
ATIONSTHEGLALPHAFUNCFUNCTIONDOESNOTAFFECTSCREENCLEAROPERATIONSALPHATESTINGISDONEONLYINRGBAMODETHEFOLLOWINGFUNCTIONSRETRIEV
EINFORMATIONRELATEDTOTHEGLALPHAFUNCFUNCTION

found repeat string of length 20 at first=127, second=584
the string is:THEGLALPHAFUNCFUNCTI
found repeat string of length 20 at first=584, second=738
the string is:THEGLALPHAFUNCFUNCTI
found repeat string of length 20 at first=738, second=875
the string is:THEGLALPHAFUNCFUNCTI
found repeat string of length 20 at first=129, second=586
the string is:EGLALPHAFUNCFUNCTION
found repeat string of length 20 at first=586, second=740
the string is:EGLALPHAFUNCFUNCTION
found repeat string of length 20 at first=740, second=877
the string is:EGLALPHAFUNCFUNCTION
found repeat string of length 20 at first=128, second=585
the string is:HEGLALPHAFUNCFUNCTIO
found repeat string of length 20 at first=585, second=739
the string is:HEGLALPHAFUNCFUNCTIO
found repeat string of length 20 at first=739, second=876
the string is:HEGLALPHAFUNCFUNCTIO
 
 
				back.gif (341 bytes)       up.gif (335 bytes)         next.gif (337 bytes)