inexact match

         Inexact Match (DFS by pure C)

A. Third Edition
This is a small test program as suggested by Mr. Li. It is a kind of simplified inexact match. The idea is a kind
of straight forward and it is like a kind of scanner. This edition only solves the problem exposed by first edition,
the ambiguous grammar. (Sometimes I really feel it is a great waste for my precious web space to allow almost 
exactly same project to be stored there. There is little modification between two editions!) Mr. Li wishes it to be 
implemented by pure C so that the embedded environment can exploit it.
B.The problem
The problem is like following and pay attention to the last line of explanation. They want to solve this trivial
problem and it is one of the simplest DFS.
In China, many people use their mobile phone as their address book. They may store contacts with UNICODE or ASCII
code. So, the searching problem is a bit complicated. 
We assume following conditions:
1. The index for ASCII code is using its own ASCII string.
2. The UNICODE string is using their "pinyin" string and especially in embedded system the storage resource is
very tight and the developer may not want to store index as "pinyin" along with UNICODE string. Since there are
only limited "pinyin" combinations, they are willing to consider that conversion from UNICODE to "pinyin" is fast.
3. The order between a ASCII code string and a UNICODE string with same letter is that ASCII goes before UNICODE.
i.e. The "pinyin" for my name of UNICODE is "huang qing zhe". If I also input an ASCII code string "huang qing zhe",
I will expect to find ASCII code before UNICODE. And the UNICODE is before ASCII string of "goo" since ASCII has
a letter "g" which is after "pinyin" of "h".
4. An inexact match is performed by accepting user's input string and should return all matched string. 
i.e. UNICODE string with "pinyin" of "huang qing zhe" will be considered to be match with "hqz", "huqizh", 
"huangqz", or "hqzhe". And the following is NOT matched: "haqz", "hugqingzhe" etc.
5. However, this kind of grammar has one ambiguity: i.e. Suppose we have an ASCII string "shang hai shi". And the
searching string is "shs". There can be two explanation: one is that it should be matched since "shs" are all 
starting letter for all words. The other is that it is not matched since "sh" is matched with first word "shang".
And letter "s" will not be matched in second word "hai" any more. 
This kind of ambiguity is usually solved by pre-defined priority of searching pattern.
C.The idea of program

Originally Mr. Li suggested me to draw the possible DFA. But I feel it is a kind of more difficult than coding.

The dictionary is input by a file named "nick.txt" and it should be the same directory along with executable file.

The input file format is like following:

[number of words in name]  [flag for UNICODE, 1==UNICODE] [string1, string2 ...]

In this second edition, I want to express the algorithms in some pseudo-code for explanation purpose. Therefore you will find some useless pseudo-code at top of code.

D.The major functions
The findFirst function is returning the insertion point for new-added name.
The function "unico2pinyin" is only for simplification and you should take advantage of some library in your 
platform.
E.Further improvement
The searching for matching can be improved if you make use of the property that the address book is already sorted.
กก
F.File listing
1. addressbook.cpp
file name: addressbook.h
/*
This is a so-called pure C implementation of fuzzy match.
1. Assume you have an input file such that each line is a name, either in English or 
in pinyin. The first integer indicates if it is pinyin or not. 
2. Assume that you have an pre-defined function that can translate chinese into its pinyin 
format. It is called "unicode2pinyin".

3. I only use "struct Name" to store each name.
4. The matching logic is same as before: using DFS.



*/


//#include <iostream>
#include <stdlib.h>
#include <stdio.h>
//#include <fstream>
#include <string.h>

//using namespace std;

const int MaxWordLength=10;
const int MaxWordNumber=10;
const int MaxNameNumber=100;
const int BufSize=MaxWordNumber*MaxWordLength;

char* defaultFileName="nick.txt";

char* unicode2pinyin(char* str);

struct Name
{
	int pinyin;//1 is  pinyin, 0 not
	int wordCount;
	char* names[MaxWordNumber];
} ;

struct Name myNames[MaxNameNumber];
int nameCount=0;

void readFromFile(char* fileName);
void displayName(Name* theName);
void display();
void findMatch(char* target);

int canMatch(char* target, Name* theName);
int doMatch(Name* theName, char* target, int targetIndex, int wordIndex, int letterIndex); 


//you should already have a such conversion function which 
//translate chinese into pinyin
char* unico2pinyin(char* chinese);


int main()
{
	char input[BufSize];
	readFromFile(defaultFileName);
	display();
	printf("please input your search string>");
	scanf("%s", input);
	while (strcmp(input, "exit")!=0)
	{
		findMatch(input);
		printf("please input your search string>");

		scanf("%s", input);
	}

	return 0;
}


void displayName(Name* theName)
{
	if (theName->pinyin==1)
	{
		printf("The pinyin is:");
	}
	for (int i=0; i<theName->wordCount; i++)
	{
		printf("%s ", theName->names[i]);
	}
}

void findMatch(char* target)
{	
	//char* ptr;
	for (int i=0; i<nameCount; i++)
	{
		if (canMatch(target, &myNames[i])>0)
		{
			printf("find one match:\n");
			displayName(&myNames[i]);
			printf("\n");
		}	
	}
}

int canMatch(char* target, Name* theName)
{
	//int wordIndex=0, letterIndex=0, targetIndex=0;
	return doMatch(theName, target, 0, 0, 0);
}


int doMatch(Name* theName, char* target, int targetIndex, int wordIndex, int letterIndex)
{
	char* ptr;
	if (target[targetIndex]==NULL)
	{
		return 1;//match
	}
	if (wordIndex>=theName->wordCount)
	{
		return 0;//not match
	}
	if (theName->pinyin==1)
	{
		ptr=unicode2pinyin(theName->names[wordIndex]);
	}
	else
	{
		ptr=theName->names[wordIndex];
	}
	if (ptr[letterIndex]==NULL)
	{
		return doMatch(theName, target, targetIndex, wordIndex+1, 0);
	}


	if (target[targetIndex]==ptr[letterIndex])
	{
		return doMatch(theName, target, targetIndex+1, wordIndex, letterIndex+1)+
			doMatch(theName, target, targetIndex, wordIndex+1, 0);
	}
	else
	{
		if (letterIndex!=0)
		{
			return doMatch(theName, target, targetIndex, wordIndex+1, 0);
		}
		else
		{
			return 0;
		}
	}
}




//in my case the conversion is just return input
char* unicode2pinyin(char* chinese)
{
	return chinese;
}
		

void readFromFile(char* fileName)
{
	FILE* stream;
	int i;
	char* ptr;
	char buffer[BufSize];
	if ((stream=fopen(fileName, "r"))==NULL)
	{
		printf("open file failed\n");
		exit(1);
	}
	nameCount=0;
	while (!feof(stream))
	{
		fgets(buffer, BufSize, stream);
		i=0;
		ptr=strtok(buffer, " \0\n");
		if (strcmp(ptr, "1")==0)
		{
			myNames[nameCount].pinyin=1;
		}
		else
		{
			myNames[nameCount].pinyin=0;
		}
		ptr=strtok(NULL, " \0\n");
		while (ptr!=NULL)
		{
			myNames[nameCount].names[i]=(char*)malloc(strlen(ptr)*sizeof(char)+1);
			strcpy(myNames[nameCount].names[i], ptr);
			i++;
			ptr=strtok(NULL, " \0\n");
		}
		myNames[nameCount].wordCount=i;
		//myNames[nameCount].names[i]=NULL;
		nameCount++;
	}
}

void display()
{
	for (int i=0; i<nameCount; i++)
	{
		displayName(&myNames[i]);
	}
}


running result: (This is the running results of 10 seconds. The position of each quantum is

just approximation since I am trying to use integer to map double.)

The input file is named "nick.txt" and it is like this:

3 1 huang qing zhe
2 0 xiang dong
3 1 zheng shu zhao
2 1 yong jun
4 0 xiao xiao yu er
5 1 this is a unicode test
1 0 ascii
2 1 many test
3 1 yang zi wen
3 0 hong yang yang
3 0 tian an men
2 0 zhang qing
3 1 li yi hong
3 1 xiao yong jun
3 0 xiao jian dong
3 1 li qing qing
2 0 xu jin
2 1 huang yong
3 1 huang chang zheng
3 1 zhu chun ming
3 0 zhu chun min
3 0 zhu jian min
3 1 ni xiang li
3 0 wu hui min
3 1 li teng long
3 0 chen zhi xiong
3 1 hong xiang yang
3 0 su cong wei
3 1 su dong po
2 1 yang yang
2 1 yang tao
3 1 yang jian zhong
3 1 huang hong dao
3 1 wu qi wei
3 1 jiang jie shi
3 0 mao ze dong
2 1 zhu de
3 1 ren bi shi
3 0 du yu min
2 0 xiao sun
2 0 zhang yan
3 0 li ke xin
3 0 wo shi shui
3 0 huang shi ren
3 0 xu jia hui
3 0 bei jin shi
3 0 shang hai shi
3 1 yang zhen ning
3 1 li zheng dao
3 1 wu de w
4 0 a pei a wang
2 0 cheng long
3 1 ma jun ren
2 0 liang zhan
3 1 li deng hui
3 0 su jin qiang
3 1 huang xi jin
3 0 huang tian ci
2 0 tian ci
3 0 zhang yun lei
3 0 ma jia jun

The following is the running result:


a,pei,a,wang,
ascii,
bei,jin,shi,
chen,zhi,xiong,
cheng,long,
du,yu,min,
hong,yang,yang,
huang,tian,ci,
huang,shi,ren,
unicode and corresponding pinyin is:hong,xiang,yang,
unicode and corresponding pinyin is:huang,xi,jin,
unicode and corresponding pinyin is:huang,hong,dao,
unicode and corresponding pinyin is:huang,chang,zheng,
unicode and corresponding pinyin is:huang,yong,
unicode and corresponding pinyin is:huang,qing,zhe,
unicode and corresponding pinyin is:jiang,jie,shi,
li,ke,xin,
liang,zhan,
unicode and corresponding pinyin is:li,deng,hui,
unicode and corresponding pinyin is:li,zheng,dao,
unicode and corresponding pinyin is:li,teng,long,
unicode and corresponding pinyin is:li,qing,qing,
unicode and corresponding pinyin is:li,yi,hong,
ma,jia,jun,
mao,ze,dong,
unicode and corresponding pinyin is:ma,jun,ren,
unicode and corresponding pinyin is:many,test,
unicode and corresponding pinyin is:ni,xiang,li,
unicode and corresponding pinyin is:ren,bi,shi,
shang,hai,shi,
su,jin,qiang,
su,cong,wei,
unicode and corresponding pinyin is:su,dong,po,
tian,ci,
unicode and corresponding pinyin is:this,is,a,unicode,test,
wo,shi,shui,
wu,hui,min,
unicode and corresponding pinyin is:wu,de,w,
unicode and corresponding pinyin is:wu,qi,wei,
xiang,dong,
xiao,sun,
xiao,jian,dong,
xiao,xiao,yu,er,
xu,jia,hui,
xu,jin,
unicode and corresponding pinyin is:xiao,yong,jun,
unicode and corresponding pinyin is:yang,zhen,ning,
unicode and corresponding pinyin is:yang,jian,zhong,
unicode and corresponding pinyin is:yang,tao,
unicode and corresponding pinyin is:yang,yang,
unicode and corresponding pinyin is:yang,zi,wen,
unicode and corresponding pinyin is:yong,jun,
zhang,yun,lei,
zhang,yan,
zhang,qing,
zhu,jian,min,
zhu,chun,min,
unicode and corresponding pinyin is:zheng,shu,zhao,
unicode and corresponding pinyin is:zhu,de,
unicode and corresponding pinyin is:zhu,chun,ming,
now let's find the match
find a match:shang,hai,shi,
input your searching string>shs
find a match:shang,hai,shi,
input your searching string>zcm
find a match:zhu,chun,min,
find a match:unicode and corresponding pinyin is:zhu,chun,ming,
input your searching string>zhch
find a match:zhu,chun,min,
find a match:unicode and corresponding pinyin is:zhu,chun,ming,
input your searching string>







                                 back.gif (341 bytes)       up.gif (335 bytes)         next.gif (337 bytes)