This portion of practicum can be completed in the match
function in the provided regex.c
file. You are free to refactor your code to not have a match
method, or to have more than one method, as long as you keep the function header, arguments, and return value of matches
intact (our regex tests call matches
). The match
method is merely provided as a starting point, and may be easier to work with than matches
for this task.
Iterative Development
When desiging large programs (or even small ones!), it’s very important to have a short development cycle. The typical pattern people first use to write code is: (1) write all the code, (2) run the code on a large test case, (3) attempt to debug error \(n\), (4) cry, (5) go to 1.
While this is indeed a “cycle”, it’s not a particularly effective one. The biggest issues that arises is that the developer has implemented so many features at once that they are unable to isolate where the issue in their code actually is.
In this lab, we will force you to use an iterative development cycle (with a little bit of test-driven development thrown in) which looks more like
- Isolate what the individual features to implement are (what exactly a “feature” is is described below)
- Write a few tests for the \(n\)th feature to make sure you understand the problem
- Write feature \(n\)
- Run tests on code so far
- Debug feature \(n\)
- Increment \(n\) and go to 1
This version of a development cycle is centered around the idea of a “feature”. Usually, a large program can be split into modules and those modules can be further sub-divided into individual “features” (or “tasks” if you prefer) which can each be implemented semi-independently. To give you a sense of what this looks like, we will first describe a programming problem, then split it into features, then you will implement the features one by one using this strategy.
Background and Problem Definition
A regular expression (or regexp) is a pattern that can be used to match against text strings. The UNIX utility grep can search through files for lines that contain the pattern. CS 21 covers the mathematical theory behind regular expressions quite extensively, but we will only implement a small subset of the full language. In particular, for our purposes, a regular expression will contain the following features:
- Letters (a-z, A-Z) and numbers (0-9) which match the literal letter or number they represent
- A dot (‘.’) which matches any single character
- Two regular expressions contactenated together (‘ab’) which represents first matching the first regex and then matching the second one.
- A letter, number, or dot followed by a star (‘a*’) which represents zero or more occurrences of that letter
(Notably, we are missing union among other things like character classes.) More or less, this means if we have a string it matches a regular expression if all the letters match (where * means the previous letter can be omitted or repeated). The whole string must match the regexp!!! There cannot be any characters left over in a match!!!
The features above are basically the things to implement; we split it up a little more like follows:
- Feature 1: Implement string literals (multiple literal characters in a row).
- Feature 2: Implement dot (‘.’).
- Feature 3: Implement star when it follows a character literal (‘a*’)
- Feature 4: Implement star when it follows a dot (‘.*’)
As it turns out, we have already implemented the first feature in the regex.c file we have provided you.
Part 1 - Implement dot
When implementing this feature, you should maintain the existing functionality of matching string literals, while also allowing a dot in the pattern to match any character in the text.
You may choose to implement it recursively or iteratively, but it is far easier to finish the later features if you write it recursively. We recommend that you consume a single character on each recursive call and use the empty string as your base case. Do not do anything fancy (DFAs are great, but we’re looking for a simple implementation here.)
We’ve already seen that pointers allow us to do a lot of weird things. Up next: you can do arithmetic on pointers. (Whatttttttt?)
If we have a char *
pointer p
, p + 1
is a reference to the second character of the string. That is, the following equalities hold:
p[0] == *p // We already knew this
p[1] == *(p + 1) // This one is new
In fact, in C, the brackets are literally syntactic sugar for pointer arithmetic and dereference.
You may find very simple (p + 1, p + 2, …) pointer arithmetic useful in implementing your regular expression matcher.
This feature is due by Monday, April 24, 11:30pm. Completion will be judged by your stage 1 tests passing on gitlab.
Part 2 - Implement Star for Character Literals
This is the hardest of all the features. In fact, if you’ve used good procedural decomposition, this is the last feature, because you’ll get the fourth one for free. When thinking about the problem recursively, there are two cases if we are attempting to match a*: either it matches the first character in the string or not. Think about how this can be turned into code.
This feature is due by Monday, May 29, 11:30pm. Completion will be judged by successfully passing both the stage 1 and stage 2 tests on gitlab.
Part 3 - Implement Star for Dot
Finally, add “.” to be able to parse successfully with “*”. This feature is due by Friday, June 16, 11:30pm. Completion will be judged by successfully passing all regex tests on gitlab.
Testing Your Code
The testing for this portion of practicum will be done with the regex
test on gitlab. Locally, you will find the test.sh
file helpful here in terms of adding test cases. To execute the test cases you write, run the make test
command.
An important note - print statements in your code will cause many of our tests to break, both locally and on git. If you want to print and test simultaneously, we suggest you use fprintf - e.g. fprintf(stderr, "statement")