This project assumes that you have already completed lab01. We will not repeat C syntax and semantics you should have learned in lab01 in this write-up.

Writing the `wc` Utility

Our first program will be a simple utility that reads in a file and outputs how many lines it has.

An equivalent Java program might look like the following:

import java.util.Scanner;
public class Wc {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner s = new Scanner(new File(args[1]));
        int i = 0;
        while (s.hasNextLine()) {
            i++;
            s.nextLine();
        }
        System.out.println(i + " " + args[0]);
    }
}

As we endeavor to write this program, we will explore various pieces and features of the C programming language.

Includes and the Standard Library

Java has import statements to reference other classes within a class. In C, we use #include to get a similar effect. Libraries are made up of two types of files: source files (extension .c) and header files (extension .h). Conceptually, header files are a lot like interfaces in Java.

Throughout this lab, we will use stdlib.h, stdio.h, and string.h which are some of the most commonly used C libraries.

Hello World and Command-Line Arguments

Consider the following simple C program hello2.c:

#include <stdio.h>
int main(int argc, char *argv[]) {
    printf("Hello!\n");
    for (int i = 0; i < argc; i++) {
        printf("%d: %s\n", i, argv[i]);
    }
    return 0;
}

This program demonstrates several very important things about C. In particular, (1) it has an include statement, (2) it uses argc and argv which together make up the command-line arguments, and (3) it uses printf which is C’s version of System.out.print.

If you compile and run the program, you’ll get an output something like the following:

Notice how the first argument (argv[0]) is always the name of the program. In our wc program, we would like to take in exactly one argument (other than the name of the program) which should be the name of the file. It is customary to print a usage message indicating how the program is intended to be used if the number of arguments is incorrect.

Just to give you another code sample to get used to what C looks like, here is the beginning of a wc program which checks for correct usage:

int main(int argc, char *argv[]) {
    if (argc != 2) {
        printf("Usage: %s <filename>\n", argv[0]);
        return 1;
    }
    printf("The name of the file to count the lines of is: %s\n", argv[1]);
    return 0;
}

Unfortunately, there’s a warning when we compile it (at least on OS X and Linux):

blank@labradoodle:~$ clang wc.c

wc.c:3:9: warning: implicitly declaring library function 'printf' with type 'int (const char *, ...)' [-Wimplicit-function-declaration]

printf("Usage: %s <filename>\n", argv[0]);

wc.c:3:9: note: include the header <stdio.h> or explicitly provide a declaration for 'printf'

1 warning generated.

Task 0. Fix the initial skeleton of the wc.c program to remove all compilation warnings. Then, run the program on several inputs to make sure that the usage message shows up at the right times.

Next up is to actually read the file. To simplify things slightly, we will read the file one character at a time. The pattern looks very similar to Java’s Scanner pattern, but we’ll need to make a few digressions first to understand the syntax.

Pointers vs. References

In Java, we might write a line of code like File f = new File(filename);. As we discussed in CS 2, f is a reference to a File. In C, we have to be explicit about references by using a star. So, the analogous line of C code might look like FILE *f = fopen(filename, "r");, where fopen does approximately the same thing as the File constructor.

Before continuing, see if you can solve the following problem:

C Strings

We’ll talk about C Strings a lot later, but, for now, what you need to know is that char * means “C string”. Conceptually, if we refer to the start of a block of chars in memory, that block can be considered a String (but how do we know when it ends?).

Files in C

In C, the equivalent of Java’s File type is a FILE *. We will work with three functions in the standard library in this project:

FILE *fopen(char *filename, char *mode) – opens the file pointed to by filename; for now, mode will always be "r" which means “read mode”.
int fread(char *buf, int size, int number_to_read, FILE *file) – reads number_to_read chars into buf from file. For this lab, size will always be 1. Note that we’ve slightly lied about the signature here to avoid C features we do not need to talk about yet.
int fclose(FILE *file) – returns 0 on successful closing of the file and non-zero otherwise.

Allocating Memory

In Java, when we use the new keyword, memory is allocated for us. Additionally, the garbage collector reclaims the memory when it’s safe to do so. In C, we have to manage our own memory which means we need to explicitly decide when to allocate and deallocate the memory we need. In practice, this equates to having to use the standard library functions malloc and free. malloc takes in the number of bytes the user wants to allocate and gives back a pointer to memory. At that point, the user “owns” that chunk of memory and is responsible for cleaning it up when they are done–that’s where the free function comes in. free takes a pointer that was previously returned by malloc and reclaims ownership of that memory.

A typical example might look like the following:

char *str = malloc(2 * sizeof(char));
str[0] = 'h';
str[1] = 'i';
printf("%c%c", str[0], str[1]);
free(str);

The general pattern is to ask malloc for n * sizeof(type) bytes of memory. sizeof will determine how many bytes are necessary for the type provided. We say that memory that is malloced but not freed is “leaked”.

`wc`, finally

Now, we finally have all the tools we need to create the wc program! In pseudocode, it looks like the following:

main() {
    check_usage();
    f = open_file(filename);
    
    num_lines = 0;
    result = allocate_memory();
    assert(result != NULL);
    while (f has more characters) {
        read a character into result
        if (the character we read is a newline) {
            increment num_lines
        }
    }

    print(num_lines + " " + filename)
    free(result);
    close_file(f);
}

In CS 3, we expect you to pay careful attention to the quality of your code (not just if it “works”). Each project will highlight a few code quality features that we will be specifically on the lookout for that week. This week, we will be specifically looking for procedural decomposition in your programs (that is, your code is split into helper functions, each of which has a single purpose rather than one monolithic main function). We will also be looking for good variable and function names.

Task 1. Translate the pseudocode into an actual C program called wc.c. Be careful to not take the pseudocode literally. You will have to use C syntax and idioms to write the program. You should start from the wc.c skeleton you modified earlier. Make sure to check if the file is NULL before using it, and output the error message "wc: <filename>: No such file or directory\n" in the case that it is.

Finding the Longest Word in a File

Your next task is to find the longest word in a provided file. This code will be (intentionally) very similar to the wc code; so, you should begin by running the following command in the terminal: cp wc.c longest.c. (Is there a way to use procedural decomposition to capture the repeated code??) Before describing your task, we need to explain a little more about C strings:

C Strings and the String Library

We’ve briefly talked about C strings, but, now it’s time to delve into them in more depth. Recall that the type char * is considered a “C string”. In a C string, the end is marked by the character \0 (called a “null terminator”) as the end of the string. So, in memory, you might see the string “adam” represented as five characters.

From a practical standpoint, it is critical that you remember the string needs a null terminator, because (1) you have to allocate memory for it, and (2) you need to add it to the end of any string you create manually. Consider our malloc example from before with a slight modification:

char *str = malloc(2 * sizeof(char));
str[0] = 'h';
str[1] = 'i';
printf("%s", str);
free(str);

This program is broken! It will not deterministically do what we expect. The %s format string expects a valid C string which means it will keep on printing characters until it finds a null terminator!! We don’t know what is in the memory after the piece we malloced–and it’s not ours to access. In Java, this would cause an ArrayIndexOutOfBounds exception, but in C, it will sometimes crash your program (and sometimes output the wrong thing!). The fix is pretty straightforward:

char *str = malloc(3 * sizeof(char));
str[0] = 'h';
str[1] = 'i';
str[2] = '\0';
printf("%s", str);
free(str);

The two changes we made are (1) increased the amount of memory we malloced, and (2) added the \0 to the end of the string. Now, our program will do what we expect.

When dealing with C strings, the functions defined by string.h can be very useful. In particular, we think strlen and strcpy are worth paying attention to:

int strlen(char *str) – returns the length of the string str NOT including the null terminator.
char *strcpy(char *dest, char *src) – copies bytes up to and including a null terminator from src to dest.

Task 2. Write a program that finds and prints the longest “word” in a file where \n and ` ` are defined as the word separators. In other words, we consider a character to start a new word if the previous character is a space or newline. For now, you may assume that the longest word is not longer than 100 characters (which is useful when you’re allocating memory to store it). If the file has at least one word, you should print the longest word found followed by a newline character. If the file has no words, you should print nothing.

Designing the C Equivalent of a Class

The last part of this project concerns how you create the C equivalent of a Java class; that is, how we define data types and functions associated with them. As an example, we will use 2D vectors, because they will continue to be relevant throughout the course. (Remember, we’re implementing a physics engine!) Before we can implement the functions and data definitions, there are a few more C features we need to explore.

Structs and Typedefs

In Java, the basic data unit is the class. In C, we do not have classes. Instead, all we have are structs which are roughly like classes without methods. structs are used to group related data together and define a single unit that we can then apply functions to. For example, consider the Vector type we used earlier. We would define it in C as:

typedef struct vector_t {
    double x;
    double y;
} vector_t;

We don’t technically need the typedef, but it makes the code we write with vector_ts easier to read. So, for now, we will always include typedefs. Other than the syntax, this definition should feel relatively familiar from Java.

Accessing Fields of Structs

Now that we have a vector_t definition, we need to discuss how to use it. Consider the following block of code:

vector_t *v = malloc(sizeof(vector_t));
v->x = 10.0;
v->y = 20.0;
printf("(%f, %f)\n", v->x, v->y);
free(v);

For now, we will stick with pointers to structs rather than declaring them directly. This keeps the semantics closer to Java which is a useful simplification for now. Notice that we’re using -> where we would have previously used . in Java. There’s a good reason for this difference, and we’ll discuss it soon.

Header Files

We use “header files” (files with the “.h” extension) to store function prototypes (think the declarations in Java interfaces) and types definitions which do not include the data. For now, we will write the header files for you, but you will eventually have to write your own. You may notice a weird #define at the begining and end of our header files. These are called “include guards”; don’t worry about them for now. One major difference is that we actually include our own headers in our “.c” files to make sure the definitions match up. Unlike the includes we’ve seen previously, our own headers are included using quotes like #include "vector.h".

Writing Some `vector` functions

Task 3. In vector.h, we’ve provided three function prototypes which you should implement in vector.c. See the provided documentation in vector.h for what the functionality should be. Your vector.c file should (1) include the vector.h header, (2) define the vector_t struct using the notation described earlier, and (3) define the three functions specified in vector.h.

Testing Your Code

The first way to test your code is compile and run it! You’re writing actual programs. Do not rely entirely on the testers we provide. In fact, as we get further into the course, we will be asking you to write your own tests as part of the projects.

We have also provided testers for the first few programs you wrote. You can run them by running the commands python3 ./testers/test-wc and python3 ./testers/test-longest.
We have not provided you a main or tests for the vector program. You are responsible for writing that for yourself. That is, we expect you to write a main function in another file that tests your vector_t code at least a little bit. Once you are relatively confident that your code is working, you should push it to gitlab using the commands:

git add *.c
git commit -am "Implemented wc, longest, and vector_t"
git push origin master

Gitlab will run our tests on your code and respond with either a check mark or an X indicating success and failure, respectively.

Grading in CS 3

In CS 3, each project will be graded on three dimensions: functionality, design, and testing. The weights of these components will vary (as will the ways we grade them), but we expect you to be thinking about writing good code–not just correct code.

Every week, on the day that a project is due (i.e., on Fridays), we will ask your entire team to participate in a mandatory 20 minute code review. Either Adam or a TA will sit down with you for 20 minutes and discuss aspects of your code including but not limited to: functionality, testing, code design, code readability, and code extensibility. Your design and testing scores for each project will factor in both your code and your responses during the code review. The code reviews are an opportunity for you to collaboratively discuss your code with course staff; it’s entirely possible that you will have a good justification for a decision which we would not have thought of.

For project00, functionality is 60% and design is 40%. Functionality will be further split into three equally weighted pieces: wc, longest, and vector. If you pass the tests on gitlab for wc and/or longest, you will receive those points; for vector, we will manually inspect the functionality at the code review. We do not expect you to know how to write tests yet; so, there is no testing section on project00.

To sign up for a code review for your team, one member of your team should use the following link: https://grinch.caltech.edu/cs3/code-review

You should also submit the open ended questions as part of the project by clicking the button below.

Writing the wc Utility