CS 3 (Spring 2023) Project 00: Debugging and C Strings

This project focuses on drawing comparisons between Java and C to introduce you to the wild world of C programming. It also introduces asan, a useful tool for debugging your C code.

Setup

If you haven’t already, please do the Setup for the course. Otherwise, just register for the project using grinch: https://grinch.caltech.edu/register and clone the repository as explained in the setup instructions. You MUST partner with someone who is in the same group as you! Please use the SSH link found on the registration page and in the email sent to you, it should look something like this git@gitlab.caltech.edu:cs3-23sp/project00-blank.git.

Intro to C

Welcome to the wild world of C programming! Throughout this project, we will explore various pieces and features of the C programming language.

Includes and the Standard Library

Java has import statements to reference other classes within a class. In C, we use #include to get a similar effect. Libraries are made up of two types of files: source files (extension .c) and header files (extension .h). Conceptually, header files are a lot like interfaces in Java.

Throughout this project, we will use stdlib.h, stdio.h, and string.h which are some of the most commonly used C libraries.

Hello World and Command-Line Arguments

Consider the following simple C program hello.c:

#include <stdio.h>
int main(int argc, char *argv[]) {
    printf("Hello!\n");
    for (int i = 0; i < argc; i++) {
        printf("%d: %s\n", i, argv[i]);
    }
    return 0;
}

This program demonstrates several very important things about C. In particular, (1) it has an include statement, (2) it uses argc and argv which together make up the command-line arguments, and (3) it uses printf which is C’s version of System.out.print.

If you compile and run the program, you’ll get an output something like the following:

Notice how the first argument (argv[0]) is always the name of the program. It is customary to print a usage message indicating how the program is intended to be used if the number of arguments is incorrect.

Pointers vs. References

In Java, we might write a line of code like File f = new File(filename);. As we discussed in CS 2, f is a reference to a File. In C, we have to be explicit about references by using a star. So, the analogous line of C code might look like FILE *f = fopen(filename, "r");, where fopen does approximately the same thing as the File constructor.

C Strings

We’ll talk about C Strings a lot later, but, for now, what you need to know is that char * means “C string”. Conceptually, if we refer to the start of a block of chars in memory, that block can be considered a String (but how do we know when it ends?).

A New Debugging Tool: asan

clang is a C compiler which has a bunch of built-in libraries which can be useful when debugging.
One of the most useful is called asan. asan will provide a stack trace when your program crashes instead of just spitting out Segmentation Fault.

Buggy Program #1

Consider the following (obviously) buggy program (which is dumb1.c in your repository):

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    char *dumb = NULL;
    dumb[0] = 'x';
}

What will happen when dumb1.c is compiled and run?

Try it! You can open a terminal in VSCode by pressing the Ctrl and ` keys together, or by manually clicking Terminal->New Terminal in the toolbar.

Run the command:

clang -g dumb1.c -o dumb1.o
./dumb1.o

Note: The -g flag adds debugging information to the compiled file. The -o flag indicates what the name of the executable (.o) file created after compilation should be.

Welcome to the wonderful world of C debugging! We got absolutely no useful information about what went wrong; however, to “debug” it, we need to start somewhere. This is where asan comes in.

Add the flag -fsanitize=address (which means “turn on asan”) to the compilation line. Then, run the program.

This time, instead of the unhelpful Segmentation Fault message, asan tells us we have an error related to a “WRITE memory access” and that “address points to the zero page”. In other words, we’re assigning (writing) to a NULL pointer, which is exactly what’s happening. It may seem obvious in this case, but in more complicated programs, it can sometimes be totally non-obvious why or where the segmentation fault occurs.

Allocating Memory and Garbage Collection

In Java, we simply write Object o = new Object() to allocate memory for an Object. In C, we need to call the malloc function and provide the exact size of the allocation we require. This usually looks like Object *o = malloc(sizeof(Object));. There’s a couple of notable pieces of syntax in this line:

In Java, when we use the new keyword, memory is allocated for us. Additionally, the garbage collector reclaims the memory when it’s safe to do so. In C, we have to manage our own memory which means we need to explicitly decide when to allocate and deallocate the memory we need. In practice, this equates to having to use the standard library functions malloc and free. malloc takes in the number of bytes the user wants to allocate and gives back a pointer to memory. At that point, the user “owns” that chunk of memory and is responsible for cleaning it up when they are done–that’s where the free function comes in. free takes a pointer that was previously returned by malloc and reclaims ownership of that memory.

A typical example might look like the following:

char *str = malloc(2 * sizeof(char));
str[0] = 'h';
str[1] = 'i';
printf("%c%c", str[0], str[1]);
free(str);

The general pattern is to ask malloc for n * sizeof(type) bytes of memory. sizeof will determine how many bytes are necessary for the type provided. We say that memory that is malloced but not freed is “leaked”.

Buggy Program #2

Now consider this (slightly) less dumb program (which is dumb2.c in your repository):

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    char *dumb = malloc(1);
    dumb[0] = 'x';
    dumb[1] = 'y';
}

In C, memory is represented by an array of bytes. Each memory address refers to a byte in memory. When we allocate space for an array of characters, we allocate a contiguous block of memory with length n, where n is the desired length of the array. Each block will have a size of 1 byte, since characters are 1 byte each. In the code above, we allocate 1 byte of memory, but we try to assign two bytes of memory to character values. Let’s see what happens when we try to compile this program.

What will happen when dumb2.c is compiled and run with normal clang? What is the bug in the program?

This one is slightly more subtle, because it may have executed without error. This is called “undefined behavior”, and we should NEVER rely on it. What has happened here is that we’re modifying someone else’s memory–which will sometimes result in a segmentation fault, but sometimes will just silently corrupt some random other memory or variable, or do something else unpredictable.

What will happen when dumb2.c is compiled and run with -fsanitize=address? What is the bug in the program?

asan spits out a rather long error message which seems useless at first, but let’s look at the two top blocks:

WRITE of size 1 at 0x6020000000d1 thread TO
...

and

0x6020000000d1 is located 0 bytes to the right of 1-byte region (0x6020000000d0, 0x6020000000d1)
allocated by thread T0 here:
...

asan is telling us that we are illegally trying to assign a single byte in main, and that it’s located to the right of the memory also allocated in main.

Intro to Strings

Strings in C

In C, strings are extremely primitive. They’re literally an array of bytes. Pointers and arrays in C are roughly equivalent. So, the type for a string in C is char *. Notably, pointers have no concept of “length”. So, strings are “null-terminated” meaning after the last character, there is a “null character” (written \0) to indicate the end of the string. A very common mistake in C is not setting the null-terminator or not allocating space for it.

Using asan to Debug Our First String Function

Let’s explore some more differences between Java and C, namely with strings. First, we will use what we learned above to debug our first program, string_print.c.

void string_print(char *s)

The method string_print prints the contents of s, one character per line prefixed with the character’s index (out to two digits) and a colon.


For example, the string “CS3Rules!” would appear as:

00: C
01: S
02: 3
03: R
04: u
05: l
06: e
07: s
08: !

Run the file with the following commands:

clang -g string_print.c -o string_print.o -fsanitize=address
./string_print.o

Uh oh! You should see some color-coded output which says “ERROR: AddressSanitizer: heap-buffer-overflow…” This indicates that the program is trying to access memory which has not been made available to the user.

To fix these errors, you’ll need to learn about the strcpy and strcat functions. (Both function names link to our documentation on these functions!)

Memory Allocation and C Strings

The next function to write is string_reverse. This one involves creating your own string which means we need to learn how to allocate memory in C.

Writing string_reverse

Here’s the specification for the function we want to write. For this one, we’ll walk you through it step by step.

char *string_reverse(char *s)

Returns a new string with the characters in s in the opposite order. Does not edit the original string.


For example, the input string “Adam” would return “madA”.

Arrays of Strings!

Next, we’ll implement string_blanks which requires allocating both an array and the elements inside it.

char **string_blanks(char *s)

Returns a new array of strings where each string has a _ in place of one of the characters. Does not edit the original string.


For example, the input string “Adam” would return {"_dam", "A_am", "Ad_m", "Ada_"}.

Recall that, in Java, if I allocate an array like so:

String[] strings = new String[10];

The only thing it does is create 10 spots for Strings. If I want to actually allocate the Strings, I need to separately make and assign them. C is no different.

To create an array of strings, we use the type char ** which means “an array of char *’s. We’ll need to allocate the container first, and then allocate each of the individual strings in the array individually as well. Your code for string_blanks should have two lines that make calls to malloc. Again, remember you can use array notation with pointers. So, array[1][2] would refer to the third character of the second array element.

Advanced Strings and Structs

In CS 2, you probably used the split method on Strings a lot. In this project, you will write a simplified version of this method for C Strings.

Structs in C

In Java, we use classes to define an object and its attributes. In C, users can define their own data types, which we call structs, which allows us to store data composed of any combination of data types into a single entity. We use the struct keyword to declare a new struct; for instance,

typedef struct Dog {
    char *name;
    int age;
    char *breed;
} dog_t

The keyword typedef is used to assign our data type, struct Dog, an alternate name, in this case dog_t. This allows us to refer to the data type using the name dog_t rather than having to type struct Dog each time. We can create a reference to an instance of this data type and assign values to its data fields like so:

dog_t *cs3_head_ta = malloc(sizeof(dog_t));

cs3_head_ta->name = "Hopper";
cs3_head_ta->age = 8;
cs3_head_ta->breed = "Labradoodle";

We use sizeof(dog_t) so the compiler knows to allocate enough space for each of the fields within the dog_t struct.

When you are done, push your code to GitLab using the following commands:

git add *.c
git commit -am "Implemented string_print, string_blanks, string_reverse, and string_split"
git push origin master

Check to make sure all the tests pass. Great work!