Setup
If you haven’t already, please do the Setup for the course. Otherwise, just register for the project using grinch
: https://grinch.caltech.edu/register and
clone the repository as explained in the setup instructions. You MUST partner with someone who is in the same group as you! Please use the SSH link found on the registration page and in the email sent to you, it should look something like this git@gitlab.caltech.edu:cs3-23sp/project00-blank.git
.
Intro to C
Welcome to the wild world of C programming! Throughout this project, we will explore various pieces and features of the C programming language.
Includes and the Standard Library
Java has import
statements to reference other classes within a class. In C, we use #include
to get a similar
effect. Libraries are made up of two types of files: source files (extension .c
) and header files (extension .h
).
Conceptually, header files are a lot like interfaces in Java.
Throughout this project, we will use stdlib.h
, stdio.h
, and string.h
which are some of the most commonly used C libraries.
Hello World and Command-Line Arguments
Consider the following simple C program hello.c
:
#include <stdio.h>
int main(int argc, char *argv[]) {
printf("Hello!\n");
for (int i = 0; i < argc; i++) {
printf("%d: %s\n", i, argv[i]);
}
return 0;
}
This program demonstrates several very important things about C. In particular, (1) it has an include statement, (2) it uses argc
and argv
which
together make up the command-line arguments, and (3) it uses printf
which is C’s version of System.out.print
.
If you compile and run the program, you’ll get an output something like the following:
Notice how the first argument (argv[0]
) is always the name of the program. It is customary to print a usage message indicating how the program is intended to be used if the number of arguments is incorrect.
Pointers vs. References
In Java, we might write a line of code like File f = new File(filename);
. As we discussed in CS 2, f
is a reference to a File
. In C,
we have to be explicit about references by using a star. So, the analogous line of C code might look like FILE *f = fopen(filename, "r");
, where fopen
does approximately the same thing as the File
constructor.
C Strings
We’ll talk about C Strings a lot later, but, for now, what you need to know is that char *
means “C string”. Conceptually, if we refer to the start of a block of char
s in memory,
that block can be considered a String (but how do we know when it ends?).
A New Debugging Tool: asan
clang
is a C compiler which has a bunch of built-in libraries which can be useful when debugging.
One of the most useful is called asan
. asan
will provide a stack trace when your program crashes instead of just spitting out Segmentation Fault
.
Buggy Program #1
Consider the following (obviously) buggy program (which is dumb1.c
in your repository):
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
char *dumb = NULL;
dumb[0] = 'x';
}
What will happen when dumb1.c
is compiled and run?
Try it! You can open a terminal in VSCode by pressing the Ctrl and ` keys together, or by manually clicking Terminal->New Terminal in the toolbar.
Run the command:
clang -g dumb1.c -o dumb1.o
./dumb1.o
Note: The -g
flag adds debugging information to the compiled file. The -o
flag indicates what the name of the executable (.o) file created after compilation should be.
Welcome to the wonderful world of C debugging! We got absolutely no useful information about what went wrong; however, to “debug” it, we need to start somewhere.
This is where asan
comes in.
Add the flag -fsanitize=address
(which means “turn on asan
”) to the compilation line.
Then, run the program.
This time, instead of the unhelpful Segmentation Fault
message, asan
tells us we have an error related to a “WRITE memory access” and that “address points to the zero page”.
In other words, we’re assigning (writing) to a NULL
pointer, which is exactly what’s happening. It may seem obvious in this case, but in more complicated
programs, it can sometimes be totally non-obvious why or where the segmentation fault occurs.
Allocating Memory and Garbage Collection
In Java, we simply write Object o = new Object()
to allocate memory for an Object
. In C, we need to call the malloc
function and
provide the exact size of the allocation we require. This usually looks like Object *o = malloc(sizeof(Object));
. There’s a couple of notable pieces of
syntax in this line:
- There’s a
*
on the left! This indicates that we’re defining a pointer to an Object. In Java, object types are references by default; in C, we need to tell the language we want a reference explicitly with the*
. - The
malloc
function!malloc
is the way we ask our system for memory. It’s roughly equivalent to the first part of thenew
keyword that provides the memory. Be careful though! Unlike Java’snew
,malloc
DOES NOT INITIALIZE THE MEMORY. This means that, when we get it, it’s total garbage. If we want all zeroes, we have to do that manually. - We’re using
sizeof(Object)
.sizeof
is a “function” that gets replaced at compile-time with the number of bytes necessary to hold the type specified in the parentheses. Generally, when we’re allocating a pointer with \(n\) *’s, the argument tosizeof
should have \(n-1\) *’s.
In Java, when we use the new
keyword, memory is allocated for us. Additionally, the garbage collector reclaims the memory when it’s safe to do so.
In C, we have to manage our own memory which means we need to explicitly decide when to allocate and deallocate the memory we need. In practice, this equates to having to use
the standard library functions malloc
and free
. malloc
takes in the number of bytes the user wants to allocate and gives back a pointer to memory. At that point, the
user “owns” that chunk of memory and is responsible for cleaning it up when they are done–that’s where the free
function comes in. free
takes a pointer that was previously returned by malloc
and reclaims ownership of that memory.
A typical example might look like the following:
char *str = malloc(2 * sizeof(char));
str[0] = 'h';
str[1] = 'i';
printf("%c%c", str[0], str[1]);
free(str);
The general pattern is to ask malloc
for n * sizeof(type)
bytes of memory. sizeof
will determine how many bytes are necessary for the type provided.
We say that memory that is malloc
ed but not free
d is “leaked”.
Buggy Program #2
Now consider this (slightly) less dumb program (which is dumb2.c
in your repository):
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
char *dumb = malloc(1);
dumb[0] = 'x';
dumb[1] = 'y';
}
In C, memory is represented by an array of bytes. Each memory address refers to a byte in memory. When we allocate space for an array of characters, we allocate a contiguous block of memory with length n, where n is the desired length of the array. Each block will have a size of 1 byte, since characters are 1 byte each. In the code above, we allocate 1 byte of memory, but we try to assign two bytes of memory to character values. Let’s see what happens when we try to compile this program.
What will happen when dumb2.c
is compiled and run with normal clang
? What is the bug in the program?
This one is slightly more subtle, because it may have executed without error. This is called “undefined behavior”, and we should NEVER rely on it. What has happened here is that we’re modifying someone else’s memory–which will sometimes result in a segmentation fault, but sometimes will just silently corrupt some random other memory or variable, or do something else unpredictable.
What will happen when dumb2.c
is compiled and run with -fsanitize=address
? What is the bug in the program?
asan
spits out a rather long error message which seems useless at first, but let’s look at the two top blocks:
WRITE of size 1 at 0x6020000000d1 thread TO
...
and
0x6020000000d1 is located 0 bytes to the right of 1-byte region (0x6020000000d0, 0x6020000000d1)
allocated by thread T0 here:
...
asan
is telling us that we are illegally trying to assign a single byte in main
, and that it’s located to the right of the memory also allocated in main
.
Intro to Strings
Strings in C
In C, strings are extremely primitive. They’re literally an array of bytes. Pointers and arrays in C are roughly equivalent. So, the type for a string in C is char *
.
Notably, pointers have no concept of “length”. So, strings are “null-terminated” meaning after the last character, there is a “null character” (written \0
) to indicate the end of the
string. A very common mistake in C is not setting the null-terminator or not allocating space for it.
Using asan to Debug Our First String Function
Let’s explore some more differences between Java and C, namely with strings. First, we will use what we learned above to debug our first program, string_print.c
.
void string_print(char *s)
The method string_print
prints the contents of s
, one character per line prefixed with the character’s index (out to two digits) and a colon.
For example, the string “CS3Rules!” would appear as:
00: C
01: S
02: 3
03: R
04: u
05: l
06: e
07: s
08: !
Run the file with the following commands:
clang -g string_print.c -o string_print.o -fsanitize=address
./string_print.o
Uh oh! You should see some color-coded output which says “ERROR: AddressSanitizer: heap-buffer-overflow…” This indicates that the program is trying to access memory which has not been made available to the user.
To fix these errors, you’ll need to learn about the strcpy
and strcat
functions. (Both function names link to our documentation on these functions!)
Task 0.
Fix the errors in string_print.c
. (Hint: read the stack trace to find which lines and methods are causing errors.)
To test that your fixes are correct, you should compile and run your code. Then, check that the output matches the example above.
To do this, run the following commands in the labradoodle terminal (the -g is not strictly necessary, unless you are attempting to debug):
clang -g string_print.c -o string_print.o -fsanitize=address
./string_print.o
Memory Allocation and C Strings
The next function to write is string_reverse
. This one involves creating your own string which means we need to learn how to allocate memory in C.
Writing string_reverse
Here’s the specification for the function we want to write. For this one, we’ll walk you through it step by step.
char *string_reverse(char *s)
Returns a new string with the characters in s
in the opposite order. Does not edit the original string.
For example, the input string “Adam” would return “madA”.
Task 1a.
Given the argument char *s
, we need to allocate a new string of the right length to return. Since we’re returning a reverse of the string, it should have the same
number of characters. Using the strlen
function and the malloc
example in the above section about memory allocation, allocate enough memory for the new string. Note that strlen
DOES NOT
return a value including the null terminator; so, you’ll have to add 1 to the result before passing it to malloc
.
Task 1b. Now, write a for loop to reverse the string and return the reversed string. Note that strings can literally be treated as arrays; so, you can use the same brackets syntax that you would in Java.
Remember, malloc
does not cleanly initialize the contents of memory for us - be sure to explicitly set the null-terminator
in the string by setting the correct index to '\0'
. Make sure you use single quotes around the null terminator, since it represents a single-byte character, rather than double quotes.
You can use python3 testers/test-string-methods.py reverse
to test your code.
Arrays of Strings!
Next, we’ll implement string_blanks
which requires allocating both an array and the elements inside it.
char **string_blanks(char *s)
Returns a new array of strings where each string has a _
in place of one of the characters. Does not edit the original string.
For example, the input string “Adam” would return {"_dam", "A_am", "Ad_m", "Ada_"}
.
Recall that, in Java, if I allocate an array like so:
String[] strings = new String[10];
The only thing it does is create 10 spots for String
s. If I want to actually allocate the String
s, I need to separately make and assign them. C is no different.
To create an array of strings, we use the type char **
which means “an array of char *
’s. We’ll need to allocate the container first, and then allocate each of the individual
strings in the array individually as well. Your code for string_blanks
should have two lines that make calls to malloc
. Again, remember you can use array notation with pointers.
So, array[1][2]
would refer to the third character of the second array
element.
Task 2.
Write the string_blanks
function. Here is the general strategy for writing string_blanks
:
- Allocate space for a list of strings. Recall that a list of characters, in other words a single string, has type
char*
in C. - Use
strcpy
to copy the contents of the original string into a new string and set the correct character/index to a blank, or_
. - Add this string to your list of strings.
You can use python3 testers/test-string-methods.py blanks
to test your code.
Advanced Strings and Structs
In CS 2, you probably used the split
method on String
s a lot. In this project, you will write a simplified version of this method for C Strings.
Structs in C
In Java, we use classes to define an object and its attributes. In C, users can define their own data types, which we call structs, which allows us to store data composed of any combination of data types into a single entity. We use the struct
keyword to declare a new struct; for instance,
typedef struct Dog {
char *name;
int age;
char *breed;
} dog_t
The keyword typedef
is used to assign our data type, struct Dog
, an alternate name, in this case dog_t
. This allows us to refer to the data type using the name dog_t
rather than having to type struct Dog
each time. We can create a reference to an instance of this data type and assign values to its data fields like so:
dog_t *cs3_head_ta = malloc(sizeof(dog_t));
cs3_head_ta->name = "Hopper";
cs3_head_ta->age = 8;
cs3_head_ta->breed = "Labradoodle";
We use sizeof(dog_t)
so the compiler knows to allocate enough space for each of the fields within the dog_t
struct.
Task 3.
Write a function strsplit
in mystring.c
, which takes in a C String and returns a strarray_t
struct (where strarray_t
is defined in mystring.h
as a struct with a length
, or number of words, and a data
array) of each of the “words” in the input string. If there are multiple spaces between two words (or leading
or trailing spaces), they should be collapsed (e.g., " a b "
would return a strarray_t
with two elements).
Use the following algorithm to write strsplit
:
- Count the total number of words you expect to extract from the input string. (Hint: the null-terminator
\0
denotes the end of a string) - Allocate memory for a strarray object using malloc. Be sure to also allocate memory for any internal data stored within the strarray struct, if necessary.
- Find the first non-space character in the input string.
- While you have processed fewer words than the total number of words counted in step 1: a. If you reach the end of a word, add the word to the strarray data. Skip any whitespace between this word and the next. b. Else, accumulate the letters in some temporary variable and move to the next character.
- Free any malloc’ed space that is NOT the expected return result of the function.
- Return the resulting strarray object.
We strongly recommend you use asan
to debug your code. To pass the tests, asan
must not report any errors.
Don’t forget to free any memory allocated by strsplit
in the tester (mystringtest.c
)! Note that you are also not allowed to use the built-in strtok() function. The goal of this project is to implement it from scratch.
You can use python3 testers/test-string-methods.py split
to test your code.
When you are done, push your code to GitLab using the following commands:
git add *.c
git commit -am "Implemented string_print, string_blanks, string_reverse, and string_split"
git push origin master
Check to make sure all the tests pass. Great work!