Setup

Register for the lab using grinch: https://grinch.caltech.edu/register and clone the repository as explained in the setup instructions.

Iterative Development

When desiging large programs (or even small ones!), it’s very important to have a short development cycle. The typical pattern people first use to write code is: (1) write all the code, (2) run the code on a large test case, (3) attempt to debug error \(n\), (4) cry, (5) go to 1.

While this is indeed a “cycle”, it’s not a particularly effective one. The biggest issues that arises is that the developer has implemented so many features at once that they are unable to isolate where the issue in their code actually is.

In this lab, we will force you to use an iterative development cycle (with a little bit of test-driven development thrown in) which looks more like

Isolate what the individual features to implement are (what exactly a “feature” is is described below)
Write a few tests for the \(n\)th feature to make sure you understand the problem
Write feature \(n\)
Run tests on code so far
Debug feature \(n\)
Increment \(n\) and go to 1

This version of a development cycle is centered around the idea of a “feature”. Usually, a large program can be split into modules and those modules can be further sub-divided into individual “features” (or “tasks” if you prefer) which can each be implemented semi-independently. To give you a sense of what this looks like, we will first describe a programming problem, then split it into features, then you will implement the features one by one using this strategy.

Background and Problem Definition

A regular expression (or regexp) is a pattern that can be used to match against text strings. The UNIX utility grep can search through files for lines that contain the pattern. CS 21 covers the mathematical theory behind regular expressions quite extensively, but we will only implement a small subset of the full language today. In particular, for our purposes, a regular expression will contain the following features:

Letters (a-z, A-Z) which match the literal letter they represent
A dot (‘.’) which matches any single character
Two regular expressions contactenated together (‘ab’) which represents first matching the first regex and then matching the second one.
A letter or dot followed by a star (‘a*’) which represents zero or more occurrences of that letter

(Notably, we are missing union among other things like character classes.)

More or less, this means if we have a string it matches a regular expression if all the letters match (where * means the previous letter can be omitted or repeated).

The whole string must match the regexp!!! There cannot be any characters left over in a match!!!

The features above are basically the things to implement; we split it up a little more like follows:

Feature 1: Implement string literals (multiple literal characters in a row).

Feature 2: Implement dot (‘.’).

Feature 3: Implement star when it follows a character literal (‘a*’)

Feature 4: Implement star when it follows a dot (‘.*’)

Part 1: Implement String Literals

In the file test.sh, you will find a tester for your program with a single test and a comment that describes how to add more tests. The first thing you should do is write at least five tests that exercise edge cases of feature 1. The tester uses the UNIX utility grep (which is known to work) as a reference implementation to test your code.

Task 0. Implement at least five tests for feature 1 in the test.sh file.

We’ve already seen that pointers allow us to do a lot of weird things. Up next: you can do arithmetic on pointers. (Whatttttttt?)

If we have a char * pointer p, p + 1 is a reference to the second character of the string. That is, the following equalities hold:

p[0] == *p          // We already knew this
p[1] == *(p + 1)    // This one is new

In fact, in C, the brackets are literally syntactic sugar for pointer arithmetic and dereference.

You may find very simple (p + 1, p + 2, …) pointer arithmetic useful in implementing your regular expression matcher.

Task 1. Implement feature 1 in the match.c file as the match function.

You may choose to implement it recursively or iteratively, but it is far easier to finish the later features if you write it recursively. We recommend that you consume a single character on each recursive call and use the empty string as your base case. Do not do anything fancy (DFAs are great, but we’re looking for a simple implementation here.)

After you implement feature 1, you should test feature 1 by running make test. DO NOT continue until it is working.

Part 2: Implement Dot

Task 2. Implement at least five tests for feature 2 (dot) in the test.sh file.

Task 3. Implement feature 2 in the match.c file as the match function.

This should be a very small change from the previous iteration. It might be useful to extract matching a single character into its own function (that is, use procedural decomposition as you implement features).

After you implement feature 2, you should test feature 2. DO NOT continue until it is working.

Part 3: Implement Star for Character Literals

Task 4. Implement at least five tests for feature 3 (star for character literals) in the test.sh file.

Task 5. Implement feature 3 in the match.c file as the match function.

This is the hardest of all the features. In fact, if you’ve used good procedural decomposition, this is the last feature, because you’ll get the fourth one for free. When thinking about the problem recursively, there are two cases if we are attempting to match a*: either it matches the first character in the string or not. Think about how this can be turned into code.

After you implement feature 3, you should test feature 3. DO NOT continue until it is working.

Part 4: Implement Star for Dot

Task 6. Implement at least five tests for feature 4 (star for dot) in the test.sh file.

Task 7. Implement feature 4 in the match.c file as the match function if it isn’t already working on your tests.

Part 5: Fuzzing your Regexp Engine

A “fuzzer” is an automated tool that tries random inputs on your program in an attempt to make it fail. These inputs tend to start out completely random, but they narrow down as the fuzzer determines features of your code. For example, in this lab, your fuzzer will quickly determine that * is an interesting character and attempt to insert stars in many places in the random string.

In this section, we will be using clang’s built-in fuzzer to attempt to find strings that make a reference implementation of grep output different results than your implementation of match. Additionally, it will try to find inputs that cause asan to fail. This is very powerful, because it is extremely unlikely that the fuzzer will miss an entire case of outputs.

Run make fuzzer to compile the fuzzer to a binary called fuzzer.

Task 8. Run the fuzzer using the command ./fuzzer -only_ascii=1. If it crashes, the last string it outputs resulted in a failure. Let it run for a little while before stopping it!

Getting Checked Off

Labs will be due the following Tuesday from when they are assigned, at 11:30 PST. You will need to get manually checked off for most labs, as we will not be using totally automated tests. To get checked off, make a ticket saying you are done, and we will check you off as soon as possible. Please include the link to your repo in the ticket.