String Matching Algorithms

Joseph Mariadassou
theSundayProgrammer@gmail.com
Norwest Computing, Sydney, Australia

Motivation

String matching is one of the most studied issues in Computer Science. There are three well known algorithms by

It is very unlikely that a developer in this day will be asked to write one let develop a new one. However the reason for a refresher is that some problems may be modeled as a string matching problem. Besides, this is the actual reason for the refresher, they are interesting.

Classic String Match

The classic problem is described as: Given a string Y and a search string x is there a substring of Y that matches exactly with x. In C syntax a brute force algorithm is shown below

        int  strSearch(const char *Y, int N, const char *x, int n)
        // N is the length of string Y
        // n is the length of string x
        {
            for(int i=0; i+n <= N; ++i)
            {
                bool found=true;
                for(j=0; found && j < n; ++j)
                {
                found = (Y[i+j] == x[j]);
                }
                if (found)
                return i;
            }
        return -1;
        }

Permutaion

An interesting variation is to find a substring in Y that is a permutation of the string x. Let’s try test driven development: Given a solution let us write code to verify it. In other words for some k such that 0 ≤ k < N-n, verify that the substring Y[k, k+n) is a a permutation of X. There are n! (! denotes factorial) permutations of the substring Y[k,k+n). Comparing every permutation is not viable. One solution is to reduce both strings to a canonical form that can then be compared. One such form is to sort the two strings which makes comparison possible. Thus to find a permuted substring, just slide a window of size n across Y and check if it is a permuted version of x.

int  strPermuteSearch(const char *Y, int N, const char *x, int n)
// N is the length of string Y
// n is the length of string x
{
    char y[n]; //C syntax not C++
    sort(x,x+n);
    for(int i=0;  i+n <= N; ++i)
    {
        strncpy(y, Y+i, n);
        sort(y,y+n);
        if (strncmp(x,y,n)==0)
        return i;
    }
    return -1;
}

The time complexity of this algorithm is O(N*n*Log(n)). The n*Log(n) is because of the sorting withing the loop. This can reduced to O(N*n) by using insertion sort. Since y is sorted all that is needed when i is incremented is to remove Y[i-1] from y and insert Y[i+n-1] which can be done in O(n) while keeping y sorted.

Combination

This leads to another problem. Is there a substring of Y such that some combination of n characters of the substring is a permutation of x. Again consider a verification algorithm. Given a string Z of length M where M>=n is there some collection n characters which can be permuted to match x. There are C(M,n) combinations. Trying all the combinations is not viable. Consider Z[0]. If Z[0] is present in x, our solution space is now reduced to C(M-1,n-1). If Z[0] is not present then our solution space is reduced to C(M-1,n). The algorithm terminates when n==0 which means that a result has been found or n>M which means that Z does not contain a substring which can be permuted to x.

bool Verify(const char* Z, int M, char *x, int n)
// M is the length of string Z
// n is the length of string x
{
    if (n==0) return true;
    if (n > M) return false;
    auto i = find(x,x+n,*Z);
    if ( i != x+n)    {
        swap(*i,*x);
        return Verify(Z+1, M-1, x+1, n-1);
    } 
    else
        return Verify(Z+1,M-1,x,n);
}

If M = n, the function verifies that Z is a permutation of x. Notice that the linear search in Lines 5 and 6 push the time complexity to O(n*M). We could reduce it to O(M), if the two strings were sorted, as shown below

bool VerifySorted(const char* Z, int M, char *x, int n) {
    // M is the length of string Z
    // n is the length of string x
    if (n==0) 
        return true;
    else if (n > M) 
        return false;
    else if (*Z < *x)
        return VerifySorted(Z+1, M-1,x,n);
    else if (*Z == *x)
        return VerifySorted(Z+1,M-1,x+1,n-1);
    else // *Z > *x
        return false;
    }

This solution can be adapted to checking if B is a subset of A, if A and B are sequences. The C++ algorithm includes does just that. Now let us restate the problem adding a simple restriction: Find the smallest substring of Y such that there exists n characters in it, which can be permuted to match x. This can be done by sliding a window of size M where n<=M<=N and calling Verify with that window as shown below

struct position
{
    int start;
    int length;
};
position CombinationSubset(const char* Y, int N, char *x, int n){
    // N is the length of string Y
    // n is the length of string x
    position result = {-1,-1};
    char Z[N]; //C syntax not C++
    sort(x,x+n);
    for(M=n; M <= N; ++M)  {
        for(i=0; i+M <= N; ++i)    {
            strncpy(Z,Y+i,M);
            sort(Z,Z+M);
            if (VerifySorted(Z+i, M, x,n))     {
                result.start=i;
                result.length = M;
                return result;
            }
        }
    }
    return result;
}

Distinct Characters

Problem: How many duplicate characters are then in a string

int Duplicates(const char* Y, int N){
  // N is the length of string Y
  sort(Y,Y+N);
  auto yn = unique(Y,Y+N);
  return (Y+N) - yn;
}

Problem: Are all the characters in a given string unique?

Hint When duplicates = 0 all the characters in the string are unique