This project assumes that you have already completed lab01. We will not repeat C syntax and semantics you should have learned in lab01 in this write-up.
Writing the wc
Utility
Our first program will be a simple utility that reads in a file and outputs how many lines it has.
An equivalent Java program might look like the following:
import java.util.Scanner;
public class Wc {
public static void main(String[] args) throws FileNotFoundException {
Scanner s = new Scanner(new File(args[1]));
int i = 0;
while (s.hasNextLine()) {
i++;
s.nextLine();
}
System.out.println(i + " " + args[0]);
}
}
As we endeavor to write this program, we will explore various pieces and features of the C programming language.
Includes and the Standard Library
Java has import
statements to reference other classes within a class. In C, we use #include
to get a similar
effect. Libraries are made up of two types of files: source files (extension .c
) and header files (extension .h
).
Conceptually, header files are a lot like interfaces in Java.
Throughout this lab, we will use stdlib.h
, stdio.h
, and string.h
which are some of the most commonly used C
libraries.
Hello World and Command-Line Arguments
Consider the following simple C program hello2.c
:
#include <stdio.h>
int main(int argc, char *argv[]) {
printf("Hello!\n");
for (int i = 0; i < argc; i++) {
printf("%d: %s\n", i, argv[i]);
}
return 0;
}
This program demonstrates several very important things about C. In particular, (1) it has an include statement, (2) it uses argc
and argv
which
together make up the command-line arguments, and (3) it uses printf
which is C’s version of System.out.print
.
If you compile and run the program, you’ll get an output something like the following:
Notice how the first argument (argv[0]
) is always the name of the program. In our wc
program, we would like to take in exactly one
argument (other than the name of the program) which should be the name of the file. It is customary to print a usage message indicating
how the program is intended to be used if the number of arguments is incorrect.
Just to give you another code sample to get used to what C looks like, here is the beginning of a wc
program which checks for correct usage:
int main(int argc, char *argv[]) {
if (argc != 2) {
printf("Usage: %s <filename>\n", argv[0]);
return 1;
}
printf("The name of the file to count the lines of is: %s\n", argv[1]);
return 0;
}
Unfortunately, there’s a warning when we compile it (at least on OS X and Linux):
Task 0.
Fix the initial skeleton of the wc.c
program to remove all compilation warnings.
Then, run the program on several inputs to make sure that the usage message shows up at the right times.
Next up is to actually read the file. To simplify things slightly, we will read the file one character at a time. The pattern looks very
similar to Java’s Scanner
pattern, but we’ll need to make a few digressions first to understand the syntax.
Pointers vs. References
In Java, we might write a line of code like File f = new File(filename);
. As we discussed in CS 2, f
is a reference to a File
. In C,
we have to be explicit about references by using a star. So, the analogous line of C code might look like FILE *f = fopen(filename, "r");
, where fopen
does approximately the same thing as the File
constructor.
C Strings
We’ll talk about C Strings a lot later, but, for now, what you need to know is that char *
means “C string”. Conceptually, if we refer to the start of a block of char
s in memory,
that block can be considered a String (but how do we know when it ends?).
Files in C
In C, the equivalent of Java’s File
type is a FILE *
. We will work with three functions in the standard library in this project:
-
FILE *fopen(char *filename, char *mode)
– opens the file pointed to byfilename
; for now,mode
will always be"r"
which means “read mode”. -
int fread(char *buf, int size, int number_to_read, FILE *file)
– readsnumber_to_read
char
s intobuf
fromfile
. For this lab,size
will always be1
. Note that we’ve slightly lied about the signature here to avoid C features we do not need to talk about yet. -
int fclose(FILE *file)
– returns0
on successful closing of the file and non-zero otherwise.
Allocating Memory
In Java, when we use the new
keyword, memory is allocated for us. Additionally, the garbage collector reclaims the memory when it’s safe to do so.
In C, we have to manage our own memory which means we need to explicitly decide when to allocate and deallocate the memory we need. In practice, this equates to having to use
the standard library functions malloc
and free
. malloc
takes in the number of bytes the user wants to allocate and gives back a pointer to memory. At that point, the
user “owns” that chunk of memory and is responsible for cleaning it up when they are done–that’s where the free
function comes in. free
takes a pointer that was previously returned by malloc
and reclaims ownership of that memory.
A typical example might look like the following:
char *str = malloc(2 * sizeof(char));
str[0] = 'h';
str[1] = 'i';
printf("%c%c", str[0], str[1]);
free(str);
The general pattern is to ask malloc
for n * sizeof(type)
bytes of memory. sizeof
will determine how many bytes are necessary for the type provided.
We say that memory that is malloc
ed but not free
d is “leaked”.
wc
, finally
Now, we finally have all the tools we need to create the wc
program! In pseudocode, it looks like the following:
main() {
check_usage();
f = open_file(filename);
num_lines = 0;
result = allocate_memory();
assert(result != NULL);
while (f has more characters) {
read a character into result
if (the character we read is a newline) {
increment num_lines
}
}
print(num_lines + " " + filename)
free(result);
close_file(f);
}
In CS 3, we expect you to pay careful attention to the quality of your code (not just if it “works”). Each project will highlight a few code quality
features that we will be specifically on the lookout for that week. This week, we will be specifically looking for procedural decomposition in your programs
(that is, your code is split into helper functions, each of which has a single purpose rather than one monolithic main
function). We will also be looking for
good variable and function names.
Task 1.
Translate the pseudocode into an actual C program called wc.c
. Be careful to not take the pseudocode literally. You will have to use C syntax and idioms to write the program.
You should start from the wc.c
skeleton you modified earlier. Make sure to check if the file is NULL
before using it, and output the error message
"wc: <filename>: No such file or directory\n"
in the case that it is.
Finding the Longest Word in a File
Your next task is to find the longest word in a provided file. This code will be (intentionally) very similar to the wc
code; so, you should begin by running the following command in the terminal:
cp wc.c longest.c
. (Is there a way to use procedural decomposition to capture the repeated code??) Before describing your task, we need to explain a little more about C strings:
C Strings and the String Library
We’ve briefly talked about C strings, but, now it’s time to delve into them in more depth. Recall that the type char *
is considered a “C string”. In a C string, the end is marked by
the character \0
(called a “null terminator”) as the end of the string. So, in memory, you might see the string “adam” represented as five characters.
From a practical standpoint, it is critical that you remember the string needs a null terminator, because (1) you have to allocate memory for it, and (2) you need to add it to the end
of any string you create manually. Consider our malloc
example from before with a slight modification:
char *str = malloc(2 * sizeof(char));
str[0] = 'h';
str[1] = 'i';
printf("%s", str);
free(str);
This program is broken! It will not deterministically do what we expect. The %s
format string expects a valid C string which means it will keep on printing characters until it finds a null
terminator!! We don’t know what is in the memory after the piece we malloc
ed–and it’s not ours to access. In Java, this would cause an ArrayIndexOutOfBounds
exception, but in C, it will
sometimes crash your program (and sometimes output the wrong thing!). The fix is pretty straightforward:
char *str = malloc(3 * sizeof(char));
str[0] = 'h';
str[1] = 'i';
str[2] = '\0';
printf("%s", str);
free(str);
The two changes we made are (1) increased the amount of memory we malloc
ed, and (2) added the \0
to the end of the string. Now, our program will do what we expect.
When dealing with C strings, the functions defined by string.h
can be very useful. In particular, we think strlen
and strcpy
are worth paying attention to:
-
int strlen(char *str)
– returns the length of the stringstr
NOT including the null terminator. -
char *strcpy(char *dest, char *src)
– copies bytes up to and including a null terminator fromsrc
todest
.
Task 2.
Write a program that finds and prints the longest “word” in a file where \n
and ' '
are defined as the word separators. In other words, we consider a character to start a new word if the
previous character is a space or newline. For now, you may assume that the longest word is not longer than 100 characters (which is useful when you’re allocating memory to store it). If the file
has at least one word, you should print the longest word found followed by a newline character. If the file has no words, you should print nothing.
Designing the C Equivalent of a Class
The last part of this project concerns how you create the C equivalent of a Java class; that is, how we define data types and functions associated with them. As an example, we will use 2D vectors, because they will continue to be relevant throughout the course. (Remember, we’re implementing a physics engine!) Before we can implement the functions and data definitions, there are a few more C features we need to explore.
Structs and Typedefs
In Java, the basic data unit is the class. In C, we do not have classes. Instead, all we have are struct
s which are roughly like classes without methods. struct
s are used
to group related data together and define a single unit that we can then apply functions to. For example, consider the Vector
type we used earlier. We would define it in
C as:
typedef struct vector_t {
double x;
double y;
} vector_t;
We don’t technically need the typedef
, but it makes the code we write with vector_t
s easier to read. So, for now, we will always include typedef
s. Other than the syntax,
this definition should feel relatively familiar from Java.
Accessing Fields of Structs
Now that we have a vector_t
definition, we need to discuss how to use it. Consider the following block of code:
vector_t *v = malloc(sizeof(vector_t));
v->x = 10.0;
v->y = 20.0;
printf("(%f, %f)\n", v->x, v->y);
free(v);
For now, we will stick with pointers to structs rather than declaring them directly. This keeps the semantics closer to Java which is a useful simplification for now.
Notice that we’re using ->
where we would have previously used .
in Java. There’s a good reason for this difference, and we’ll discuss it soon.
Header Files
We use “header files” (files with the “.h” extension) to store function prototypes (think the declarations in Java interfaces) and types definitions which do not include the data.
For now, we will write the header files for you, but you will eventually have to write your own. You may notice a weird #define
at the begining and end of our header files. These are
called “include guards”; don’t worry about them for now. One major difference is that we actually include our own headers in our “.c” files to make sure the definitions match up. Unlike
the includes we’ve seen previously, our own headers are included using quotes like #include "vector.h"
.
Writing Some vector
functions
Task 3.
In vector.h
, we’ve provided three function prototypes which you should implement in vector.c
. See the provided documentation in vector.h
for what the functionality should be.
Your vector.c
file should (1) include the vector.h
header, (2) define the vector_t
struct using the notation described earlier, and (3) define the three functions specified in vector.h
.
Testing Your Code
The first way to test your code is compile and run it! You’re writing actual programs. Do not rely entirely on the testers we provide. In fact, as we get further into the course, we will be asking you to write your own tests as part of the projects.
We have also provided testers for the first few programs you wrote. You can run them by running the commands python3 ./testers/test-wc.py
and python3 ./testers/test-longest.py
.
We have not provided you a main or tests for the vector program. You are responsible for writing that for yourself. That is, we expect you to write a main
function in another file
that tests your vector_t
code at least a little bit. Once you are relatively confident that your code is working, you should push it to gitlab using the commands:
git add *.c
git commit -am "Implemented wc, longest, and vector_t"
git push origin master
Gitlab will run our tests on your code and respond with either a check mark or an X indicating success and failure, respectively.
Grading in CS 3
In CS 3, each project will be graded on three dimensions: functionality, design, and testing. The weights of these components will vary (as will the ways we grade them), but we expect you to be thinking about writing good code–not just correct code.
Every week, on the day that a project is due (i.e., on Fridays), we will ask either your partnership (in project00 and 01) or your entire team (in project02 onward) to participate in a mandatory 15 minute code review. Either Adam or a TA will sit down with you for 15 minutes and discuss aspects of your code including but not limited to: functionality, testing, code design, code readability, and code extensibility. Your design and testing scores for each project will factor in both your code and your responses during the code review. The code reviews are an opportunity for you to collaboratively discuss your code with course staff; it’s entirely possible that you will have a good justification for a decision which we would not have thought of.
For project00, functionality is 60% and design is 40%. Functionality will be further split into three equally weighted pieces: wc
, longest
, and vector
.
If you pass the tests on gitlab
for wc
and/or longest
, you will receive those points; for vector, we will manually inspect the functionality at the code review.
We do not expect you to know how to write tests yet; so, there is no testing section on project00.
To sign up for a code review for your team, one member of your team/partnership should use the following link: https://grinch.caltech.edu/cs3/code-review