Strings

Kesa...大约 4 分钟algorithmdata structurestrings

1. What is String

We use the term string to refer to a variable-length array of characters, defined by a starting point and by a string-termination character marking the end.

Strings are valuable as low-level data structures for two basic reasons:

  • Many computing applications involve processing textual data, which can be represented directly with strings.
  • Many computer systems provide direct and efficient access to bytes of memory, which correspond directly to characters in strings. In many situations, the string abstraction matches needs of the application to the capabilities of the machine.

The abstract notion of sequence of characters ending with a string-termination character could be implemented in many ways. For example, we can use array, linked list and other implementations.

The difference between a string and an array of characters revolves around length. Both represent contiguous areas of memory, but the length of an array is set at the time that the array is created, whereas the length of a string may change during the execution of a program.

We need to reserve memory for a string, either at compile time, by declaring a fixed-length array of characters, or at execution time, by calling malloc.

Once the array is allocated, we can fill it with characters, starting at the beginning, and ending with the string-termination character:

  • Without a string-termination character, a string is no more or no less than an array of characters
  • With the string-termination character, we can work at a higher level of abstraction and consider only the portion of the array from the beginning to the string-termination character to contain meaningful information. (In C, it is \0)

2. Elementary string-processing operations

This table gives implementation of basic string-processing operations, using two different C primitives.

  • The indexed-array approach is a more natural way to express the algorithms and leads to code easier to understand
  • The pointer approach leads to more compact code

2.1 Indexed array versions

OperationCode
Compute string length (strlen(a))for(i=0; a[i]!=0;i++);return i;
Copy (strcpy(a, b))for (i = 0; (a[i] = b[i])!= 0; i++);
Compare (strcmp(a, b))for(i = 0; a[i] == b[i]; i++) if(a[i]==0) return 0;
return a[i] - b[i];
Compare(prefix) (strcmp(a, b, strlen(a)))for (i=0;a[i]==b[i];i++) if(a[i]==0) return 0;
if(a[i]==0) return 0;
return a[i]-b[i];
Append(strcat(a, b))strcpy(a+strlen(a), b)

2.2 Equivalent Pointer versions

OperationCode
Compute string length (strlen(a))b=a; while(*b++); return b-a-1;
Copy (strcpy(a, b))while(*a++ = *b++);
Compare (strcmp(a, b))while(*a++ == *b++) if(*(a-1)==0) return 0;
return *(a-1) - *(b-1);
Compare(prefix) strcmp(a, b , strlen(a))while (*a++ == *b++) if (*(a-1)==0) return 0;
if (*(a-1) == 0) return 0;
return *(a-1) - *(b-1);
Append(strcat(a, b))strcpy(a+strlen(a), b)

One of the most important operations that we perform on strings is the compare operation, which tell us which of two strings would appear first in the dictionary. This ordering is called lexicographic order (字典序).

3. String search (program-3.15)

// program_3.15 String Search
#include <stdio.h>
#define N 10000

int main(int argc, char* argv[]) {
  if (argc < 2) {
    printf("Parameter list cannot be empty");
    return 1;
  }

  char* p = argv[1];
  char a[N];

  printf("String: %s\n", p);

  // read entered string
  char t;
  for (int i = 0; i < N; i++) {
    t = getchar();
    if (t == EOF) {
      a[i] = 0;
      break;
    }
    a[i] = t;
  }
  // search string
  for (int i = 0; a[i] != 0; i++) {
    int j;
    for (j = 0; p[j] != 0; j++) {
      if (a[i + j] != p[j]) {
        break;
      }
    }
    if (p[j] == 0) {
      printf("Pos: %d, ", i);
    }
  }
  printf("\n");
}

This program discovers all occurrences of a word form the stdin in a large text string.

  • a : we declare the text string as a fixed-size character array (could also use malloc).
  • argv[1]: the command line argument string pointer, its memory is allocated by the system before the program is invoked.
  • For each starting position in a, we try matching the substring starting at that position in p, testing for equality character one by one.
  • Whenever we reach the end of p successfully, we print out the starting position i.

4. Performance Bug

String processing provides a convincing example of the need to be knowledgeable about the performance of library functions. The problem is that a library function might take more time than we expect.

For example, determining the length of a string takes time proportional to the length of the string. Ignoring this fact can lead to severe performance problems.

for (i = 0; i < strlen(a); i++) {
    if (strcmp(&a[i], p, strlen(p) == 0)) {
        printf("%d",i);
    }
}

The code fragment above takes time proportional to at least the square of the length of a, no matter what code is in the body of the loop, it goes all the way through a to determine its length each time through the loop.

Problem such as this are difficult to detect because the program might work fine for small strings, but then slow down or even never finish when it goes into production.

This kind of error is called a performance bug, the code can be verified to be correct, but it does not perform as efficiently as we expect.

Library functions, all to often, cannot guarantee to provide the best performance for all applications. Even if the performance is well documented, we have no assurance that some future implementation might not involve performance changes that will have adverse effects on our programs.

5. Memory allocation for strings

Memory allocation for strings is more difficult than linked list because strings vary in size. We tend to assume that each string sits in memory of indeterminate allocation, just big enough for the string and its termination character.

So we must be very careful to ensure adequate allocation when we are performing operations that build or lengthen strings.

Exercises

  • 3.56

Reference

  1. Algorithms in Copen in new window
  2. How to enter the value of EOF in the terminalopen in new window
上次编辑于:
评论
  • 按正序
  • 按倒序
  • 按热度
Powered by Waline v2.15.2