CS 3 (Spring 2024) Modules and Encapsulation

Modules are the basic unit of encapsulation. They exist to hide away implementation details and present an interface that can be used without worrying about those implementation details.

This makes it easier to write code: when you’re writing code that uses the module, you don’t need to think about how it’s implemented, and when you’re writing the module, you don’t need to think about how it’s used.

It also means that if you come up with a better algorithm later, you can just swap it in. And having fewer things to think about means you’re less likely to write bugs.

Another key aspect of encapsulation is that it allows you to ensure that “invariants” (properties of your data that the rest of your code relies on always being true) are maintained.

Encapsulation in Java

In Java, classes serve the role of both types and modules. Encapsulation is done by marking implementation details as private and the presented interface as public. Things outside the class can then only access public things and not private ones.

Consider the class:

import java.util.Arrays;

public class Foo {
    public int[] sorted;

    public void sort(int[] vals) {
        // [sorts vals]
    }

    public Foo(int[] vals) {
        sort(vals);
        this.sorted = vals;
    }

    /**
     * Returns the smallest stored integer
     */
    public int smallest() {
        return sorted[0];
    }

    public void print() {
        // Arrays.toString prints the elements, not the array's identity
        System.out.println(Arrays.toString(sorted));
    }
}

This class has a problem. Consider this code:

public class Main {
    public static void main(String[] args) {
        Foo foo = new Foo(new int[]{2, 5, 3, 4});
        // After sorting, the array is {2, 3, 4, 5}; this overwrites the 5.
        foo.sorted[3] = 1;
        System.out.println(foo.smallest());
    }
}

Then Main.main would print 2 even though the smallest element in the array is now 1!

For the smallest method to be correct, it must be able to assume that, no matter what happens elsewhere in the code, sorted stays sorted. That is, an invariant of this class is that sorted is sorted.

In order to avoid the invariant being violated, we would instead declare sorted as private:

class Foo {
    private int[] sorted;
    
    // ...
}

Additionally, while sort is a useful helper function here, a general sort function has nothing to do with the interface of Foo and we don’t want to worry about changing it or removing it because some other piece of code is relying on it. So we would mark that private too.

class Foo {
    // ...
    private static void sort(int[] vals) {
        // ...
    }
    // ...
}

So how do we do this in C?

Header files

In C, unlike in Java, types and modules are distinct. We can define multiple types in the same module, and we can have modules that define no types of their own at all. For example, string.h, which you can access by writing #include <string.h> at the start of a C file, is essentially just a collection of functions for working with char *. Apart from the general-purpose size_t, it introduces no types.

A module in C is a pair of files, module.h (the “header file”) and module.c (the “source file”). The header file contains “declarations” of all the functions and types (which may be opaque or concrete, more on that shortly) that the module publicly exposes or exports. The source file contains the implementations of the functions and the definitions of opaque types.

So how would the code above look in C?

Well, let’s start with writing our header file, foo.h (while types and modules are distinct, it is common for them to share names if the module defines a single primary type and functions to work with it):

// foo.h
typedef struct foo foo_t;

foo_t *foo_init(int *bar, size_t len);

int foo_smallest(foo_t *foo);

void foo_print(foo_t *foo);

void foo_free(foo_t *foo);

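Let’s note some important details:

The line typedef struct foo foo_t; says that a struct called foo exists without saying what is inside it. This makes foo_t an opaque type: code that includes foo.h can pass a foo_t * around but cannot touch its fields.

C has no methods, so every function takes the foo_t * it operates on as an explicit argument, and every exported name starts with foo_ so that it won’t collide with names exported by other modules.

C also has no garbage collector, so the module exports foo_free to clean up what foo_init creates.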

Let’s write the source file now:

// foo.c
#include <stdlib.h>
#include <stdio.h>
#include "foo.h"

struct foo {
    int *sorted;
    size_t len;
};

// static makes sort visible only inside foo.c
static void sort(int *vals, size_t len) {
    // [sort vals]
}

foo_t *foo_init(int *vals, size_t len) {
    foo_t *new = malloc(sizeof(foo_t));
    sort(vals, len);
    new->sorted = vals;
    new->len = len;
    return new;
}

int foo_smallest(foo_t *foo) {
    return foo->sorted[0];
}

void foo_print(foo_t *foo) {
    printf("%d", foo->sorted[0]);
    for (size_t i = 1; i < foo->len; i++) {
        printf(", %d", foo->sorted[i]);
    }
    printf("\n");
}

void foo_free(foo_t *foo) {
    free(foo->sorted);
    free(foo);
}

Here we defined the type, whose fields are all private because the definition is in the source file and not the header file, and then implemented all the functions. Note that sort doesn’t have the foo_ prefix, unlike everything else. This is because sort is meant to be used only inside the module, so we’re not worried about namespace collisions. (For that to be strictly true, a helper like this should be declared static, which keeps its name from being visible to other source files at link time.)

Note that foo.c has #include "foo.h" at the top. Source files should always include their own headers.

So, how do you actually use a module?

So you’ve created foo.h and foo.c and now you want to actually use the module you’ve made. Let’s write a main.c file which does. main isn’t a module here, it’s the root of our application, which means it doesn’t need a header file, since nobody is going to be importing main.

So we just include "foo.h" like we include <stdio.h> for printf and such, right?

// main.c
#include <stdio.h>
#include <stdlib.h>
#include "foo.h"

int main() {
    size_t len = 4;
    int *vals = malloc(sizeof(int) * len);
    vals[0] = 2;
    vals[1] = 5;
    vals[2] = 3;
    vals[3] = 4;
    foo_t *foo = foo_init(vals, len);
    printf("%d\n", foo_smallest(foo));
    foo_free(foo);
}

VSCode doesn’t make any red squiggles, so everything’s fine, right?

Let’s compile it. We run:

clang main.c -o main

Uh oh.

yshaluno@labradoodle:~/cs3/module_example$ clang main.c -o main
/usr/bin/ld: /tmp/main-0e13ca.o: in function `main':
main.c:(.text+0x58): undefined reference to `foo_init'
/usr/bin/ld: main.c:(.text+0x65): undefined reference to `foo_smallest'
/usr/bin/ld: main.c:(.text+0x84): undefined reference to `foo_free'
clang: error: linker command failed with exit code 1 (use -v to see invocation)

What gives? Understanding exactly what this error means is beyond the scope of this reading. However, the key thing to note is the “linker command failed” line at the bottom. This tells you that this is not a normal compiler error like the ones you’re used to from missing semicolons.

Instead, it basically means that the compiler was able to find the declarations of those functions (foo_init, foo_smallest, and foo_free) but the linker, which runs afterwards to stitch the compiled code together, wasn’t able to find their implementations. All #include actually does is paste in the included file: before the compiler does anything complicated, a much simpler “preprocessor” runs and copy-pastes the contents of the included file in place of the #include line.

Thus, our main.c is equivalent to

// main.c
#include <stdio.h>
#include <stdlib.h>
typedef struct foo foo_t;

foo_t *foo_init(int *bar, size_t len);

int foo_smallest(foo_t *foo);

void foo_print(foo_t *foo);

void foo_free(foo_t *foo);

int main() {
    size_t len = 4;
    int *vals = malloc(sizeof(int) * len);
    vals[0] = 2;
    vals[1] = 5;
    vals[2] = 3;
    vals[3] = 4;
    foo_t *foo = foo_init(vals, len);
    printf("%d\n", foo_smallest(foo));
    foo_free(foo);
}

It makes sense why compiling this doesn’t work. We’ve declared all the functions we want, but we haven’t implemented them anywhere. We only told the compiler about main.c and that doesn’t mention foo.c anywhere. It turns out that fixing this is quite simple: we just need to tell the compiler about foo.c.

yshaluno@labradoodle:~/cs3/module_example$ clang main.c foo.c -o do_thing
yshaluno@labradoodle:~/cs3/module_example$ ./do_thing
2

There are better ways to do this, but this is good enough for now. Our provided Makefiles will handle the fancy things for you.

fatal error: 'foo.h' file not found

You may notice that our provided repositories are organized into more directories so that they’re easier to navigate.

In particular, they place all the header files in a directory called include/. Let’s do that.

yshaluno@labradoodle:~/cs3/module_example$ mkdir include
yshaluno@labradoodle:~/cs3/module_example$ mv foo.h include
yshaluno@labradoodle:~/cs3/module_example$ clang main.c foo.c -o do_thing
main.c:3:10: fatal error: 'foo.h' file not found
#include "foo.h"
         ^~~~~~~
1 error generated.

Uh oh. What gives? Well, #include "foo.h" tells the compiler to look for a file called foo.h, but that doesn’t mean it magically knows where you put it. By default, the compiler will look in the directory of the file containing the #include and then fall back to a list of locations where it expects header files to be (things like /usr/include; try running ls /usr/include on Labradoodle and see what happens). Since we moved foo.h to include/foo.h, the compiler can’t find it.

To fix this, we simply tell the compiler where to look for it with the -I[path] flag:

yshaluno@labradoodle:~/cs3/module_example$ clang main.c foo.c -o do_thing -Iinclude
yshaluno@labradoodle:~/cs3/module_example$ ./do_thing
2

What’s the deal with #ifndef __FOO_H?

If you’ve looked at the header files we’ve written, you’ll notice that they’re all wrapped in this pattern of:

#ifndef __MODULE_H
#define __MODULE_H

// actual contents go here

#endif

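The exact identifier used in place of __MODULE_H varies from project to project. It is just a name, and the only requirement is that it doesn’t collide with any other identifier. (Strictly speaking, C reserves names beginning with two underscores for the compiler and standard library, which is why many projects write MODULE_H instead.)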

This pattern is called a “header guard”, and one should be placed on every header. Why do we have this? There are a couple of reasons. First, suppose foo.h also defined a constant:

const int FOO_MAX = INT_MAX;

Then, say we have:

// bar.h
#include "foo.h"

int bar_bar(foo_t *foo);

// baz.h
#include "foo.h"

int baz_baz(foo_t *foo);

And then we have a main.c which uses both of them:

// main.c
#include "bar.h"
#include "baz.h"

// ...

If you tried to compile this, you would get an error: redefinition of 'FOO_MAX'. A constant may only be defined once in a source file, but we pasted in foo.h twice, once through bar.h and once through baz.h.

Header guards solve this problem by ensuring that each header’s contents appear at most once in any given source file. #ifndef __FOO_H says “keep the code up to the matching #endif only if the preprocessor macro __FOO_H is not defined”, and the following line immediately defines it.

Suppose we wrote:

// foo.h
#ifndef __FOO_H
#define __FOO_H
#include <limits.h> // for INT_MAX
const int FOO_MAX = INT_MAX;
// ...
#endif

If we then include this twice, the first time the #ifndef passes, since __FOO_H hasn’t been defined yet. But the second time, __FOO_H is already defined and so #ifndef ensures the constant (and the rest of the header file) doesn’t appear a second time.

Header guards should still be used when no constants are defined. Anything else that may only be defined once, such as the body of a struct, causes the same problem, and guards also speed up compilation by letting the compiler skip work it has already done.