I typically back my files up from my laptop to a USB drive every month or so. The naive approach I tend to use is just to copy the entirety of my working directory to the USB, since I don’t know exactly which files I’ve updated since the last backup. There’s a pretty simple approach here: I could just keep track of a timestamp of the last backup, and only copy over the folders modified since then. But that’s too easy.

I want to make my life a little more difficult, and to make this task a bit more fun, so I’m opting for an overkill approach involving hashes, JSON and some nightmare-inducing use of the Windows API.

Outline of Approach

Taking inspiration from git, I wanted to build a changelog based approach, where I keep track of the state of all files in the directory and only copy over files when there has been a change in state. The way these states are tracked is via a SHA-256 hash of the file contents. The flow goes something like this:

Load previous state Read .usbdiff.json, which maps each file path to its SHA-256 hash.
Scan current directory Recursively compute SHA-256 hashes of all current files.
Compare states Detect files that are added, removed, or modified by comparing old and new hashes.
Update cache Write the current file state back into .usbdiff.json.

So the key components of the project are:

File hash map: A hash map data structure to match file names to file hashes (a hash map for hashes!)
Hash cache: A file (.usbdiff.json) to store the previous state for change detection
JSON I/O: Utilities to create/parse JSON files for the hash cache
SHA-256 hashing: The core of the application, streaming file data from disk and computing the hash.

File Hash Map

The main property we need for our internal data structure is fast lookup, so we can check if a file in the current directory exists in the previous state (and if it has changed). A hash map is perfect for this! I built a simple chained hash table:

struct fhash_entry  {
    char *filename;
    char *filehash;
    struct fhash_entry *next; // Chaining
};

typedef struct
{
    fhashentry_t *farray[HMAP_MAX_ELEMS];

} fhashmap_t;

Here, HMAP_MAX_ELEMS is set to 4096, setting the maximum number of independent hashes. If there’s a collision, the item is placed into the bucket by chaining in the form of a linked list. In the worst case, the lookup complexity is \(\mathcal{O}(m)\) where \(m\) is the length of the linked list.

We go through all the files in the directory recursively and add them to the hash map, with the filename as the key and the file hash as the value:

char* key = _strdup(full_path);
char* value = _strdup(hash);

if (key && value) {
    fhashmap_add(map, key, value);
}

here map is a pointer to the global variable curr_fhashmap, which is an fhashmap_t.

JSON I/O All things JSON are handled via the cJSON library, for which I’ve just copied the header and source into my project directory. It’s quite a well-written, fun repository to use, and I’ve written a very simple wrapper around it for my purposes. The core responsibility of this wrapper code is to convert between my internal fhashmap_t and the cJSON object type. To that end, there are two key functions:

// Create cJSON object from existing fhashmap_t map
cJSON* create_json(fhashmap_t *map);

// Parse JSON from infile into an internal fhashmap_t map
void parse_json(fhashmap_t *map, const char *infile);

which do exactly that.

Internally, the cJSON object is a tree data structure, so while it would be convenient to just compare them directly, lookup performance would be hampered compared to the hashmap.

The create_json function is quite simple, involving only traversal of the hashmap, building the cJSON object via the API and writing to a file. However, the parser is a bit more complex, as it involves reading the entire .usbdiff.json file and constructing a cJSON object from it. The cJSON API does not support streaming from a file into an object, so the entire file contents must be read into a buffer at once, giving rise to some pretty ugly code:

// For now, reserve 1MB for JSON contents
size_t rdbuf_size = 1 << 20;
char *rdbuf = malloc(rdbuf_size);
if(!rdbuf)  {
    fprintf(stderr, "Failed to allocate space for JSON parsing buffer\n");
}

is then converted to an fhashmap_t:

cJSON *elem = NULL;
cJSON_ArrayForEach(elem, object)    {
    if(!cJSON_IsString(elem)) continue;
    fhashmap_add(map, elem->string, elem->valuestring);
}

We are also able to print the cJSON object to the .usbdiff.json file by a trivial wrapper around the cJSON_Print function.

char *string = cJSON_Print(object);
fprintf(out, "%s\n", string);

Improving the Parser

Naturally, this is unideal and we want to be able to stream the JSON file contents so we don’t need to reserve a 1MB (or, potentially, larger) buffer.

The idea is to load the JSON data in chunks of, say, 4KB and try to form complete JSON objects from the loaded data. Naturally, there will be incompleteobjects at the end of the chunk, so we move any remaining characters to the start of the buffer and load the next chunk appended to it.

Any JSON objects found are iteratively merged to produce the final, whole, object. Merging involves simply duplicating the object and adding it to the merged object.

#Hashing

Detecting Changes

To find changes between the two directory structures, we simply do pointwise comparisons of the two hashmaps. We need to do two passes:

One that looks for files in the cached directory state based on the current state (Looking for newly created files)
One that does the opposite (Looking for modified and deleted files)

We then simply add each diff to a filediff_t object that looks like this:

typedef struct {
    char filename[MAX_PATH];
    enum { MODIFIED, DELETED } status;
} filediff_t;

The status field allows us to keep track of the type of change to each file, so we can print the diff out in a nice, clear and coloured format:

    HANDLE hConsole = GetStdHandle(STD_OUTPUT_HANDLE);
    printf("Diffs: \n");
    for(int i = 0; i < diff_count; i++) {

        CONSOLE_SCREEN_BUFFER_INFO info;
        if(!GetConsoleScreenBufferInfo(hConsole, &info)) return NULL;

        if(diff[i].status == MODIFIED) { SetConsoleTextAttribute(hConsole, FOREGROUND_GREEN); printf("+\t"); }
        if(diff[i].status == DELETED)  { SetConsoleTextAttribute(hConsole, FOREGROUND_RED); printf("-\t"); }

        printf("%s\n", diff[i].filename);

        SetConsoleTextAttribute(hConsole, info.wAttributes);
    }

The Windows API is a bit weird here. We essentially read the screen buffer info into the info variable in order to save the original console buffer state. We then set the text colour to red or green, based on whether the file was deleted or modified, using the SetConsoleTextAttribute API. This uses an opaque handle to the console object, so its unclear exactly how this function works, but we can assume Windows is casting it to some other struct and modifying a wAttributes field.

Refactoring + UNIX support

At this point, some refactoring of the code is in order. The main thing I wanted to change here was the load_files function, which is single-handedly traversing the directory structure, hashing file contents and adding to the hash map. I think it makes more sense to separate this into two functions: a collect_files_list function that recursively goes through the directory and adds files to an array, and a load_files function which then does the hard work of computing hashes and storing them into the fhashmap_t.

To that end, we define a filelist_t object, which is really just a wrapper around a static array, and add files to it.

void collect_files_list(const char *dir, filelist_t *list) {
#ifdef _WIN32
    char search_path[MAX_PATH];
    snprintf(search_path, MAX_PATH, "%s\\*", dir);

    WIN32_FIND_DATAA findData;
    HANDLE hFind = FindFirstFileA(search_path, &findData);
    if (hFind == INVALID_HANDLE_VALUE) return;

    do {
        if (strcmp(findData.cFileName, ".") == 0 || strcmp(findData.cFileName, "..") == 0)
            continue;

        char full_path[MAX_PATH];
        snprintf(full_path, sizeof(full_path), "%s\\%s", dir, findData.cFileName);

        if (findData.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
            collect_files_list(full_path, list);
        } else {
            if(list_add(list, _strdup(full_path))) return;
        }
    } while (FindNextFileA(hFind, &findData));

    FindClose(hFind);

#else // Linux/macOS
    DIR *dp = opendir(dir);
    if (!dp) return;

    struct dirent *entry;
    while ((entry = readdir(dp)) != NULL) {
        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
            continue;

        char full_path[PATH_MAX];
        snprintf(full_path, sizeof(full_path), "%s/%s", dir, entry->d_name);

        struct stat path_stat;
        if (stat(full_path, &path_stat) == -1) continue;

        if (S_ISDIR(path_stat.st_mode)) {
            collect_files_list(full_path, list);
        } else {
            list_add(list, _strdup(full_path));
        }
    }

    closedir(dp);
#endif
}

Another notable change is UNIX support, which is visible in the else branch of the ifdef. As a side effect, this was useful later on when doing performance profiling!

This all now works, and is organised better, but we can still aim to optimise the runtime. On a large directory, with a diverse set of files (PDFs, images, code, text, etc.), usbdiff takes minutes, which doesn’t seem good enough.

Profiling

Profiling the code with perf record ./usbdiff <directory> followed by perf report, we find that the runtime is dominated by two functions in the sha-256 hashing library: consume_chunk and right_rot. While these functions don’t have a large runtime on their own, they are called a significant number of times, which bottlenecks the total runtime of the program. Thus, the performance is compute-bound. This is good news, as there are many ways to improve on this, one of which being multi-threading. For a start, I am turning to my good friend OpenMP.

OpenMP defines the use of several #pragma macros that parallelise blocks of code, and it takes care of thread management behind the scenes. Very convenient.

int load_files(const filelist_t *const list, fhashmap_t *map)
{   
    if(!list || !map) return 1;
    
    #pragma omp parallel for schedule(dynamic, 8)
    for(int i = 0; i < list->len; i++)  {
        const char *filename = list->filenames[i];
        char *hash = compute_sha256(filename);
        if(!hash) {
            fprintf(stderr, "Couldn't hash %s, skipping\n", filename);
            continue;
        }
        
        #pragma omp critical
        fhashmap_add(map, filename, _strdup(hash));

        free(hash);
    }

    return 0;
}

Metadata Is Good

In all this, we’re doing something pretty stupid. Let’s say we have a working directory where we modify a few files, maybe add and delete some, but the majority of the directory remains unchanged. In that case, we will unnecessarily be computing hashes for every file in the directory every time we run the program. Most of the files haven’t even been touched! But this raises kind of a cyclical issue: we want to prevent computing the hash of a file if its is unchanged, but how can we know if a file has changed without computing its hash? The answer: metadata.

All internal data structures are updated to store two additional values:

typedef struct  {
    char filename[MAX_PATH];
    long long file_size;
    long long mtime;
} file_t;

typedef struct {
    file_t files[MAX_FILES];
    size_t len;
} filelist_t;

struct fhash_entry  {
    char *filename;
    char *filehash;
    long long file_size;
    long long mtime;
    struct fhash_entry *next; // Chaining
};

Namely, the file_size and mtime fields. Thus, along with the filename, we keep track of the file’s size and last modified time. This data is read from the file when it is found, and kept stored in the JSON file. When it comes time to traverse the directory again, the JSON is parsed and the previous metadata fields compared to the current values. If they match, we can rest assured the file has definitely not been modified, and we don’t need to recompute its hash! This method covers the following scenarios:

A file has not been modified, so its metadata fields and hash are unchanged. In this case, we will simply see unchanged metadata fields, and reuse the previous hash.
A file has been modified, but its file size has not changed. In this case, the mtime will have changed, and we will recompute the hash.
A file has been modified, and its file size has changed. In this case, the file must have changed, and we will have to recompute the hash.

This covers all possible scenarios, since it is impossible for a file to be modified and its mtime field to remain unchanged. The file_size field provides a quick check which, if failed, necessitates checking the mtime. This gives rise to the following section in load_files:

int reuse_hash = 0;

fhashentry_t *entry = fhashmap_lookup(prev_map, filename);

// If the file size does not match, this section is short-circuited since the file must have changed.
if(entry && file_size == entry->file_size)  {
    if (mtime == entry->mtime) {
        #pragma omp critical
        fhashmap_add(curr_map, filename, strdup(entry->filehash), entry->file_size, entry->mtime);
        reuse_hash = 1;
    }
}

if (reuse_hash) continue;

The performance gains speak for themselves:

–copy-to

As a final addition, the --copy-to flag was added to let a user copy filesfrom the diff into another directory. The original purpose of this project was to allow exactly this - a quick USB backup of only the files that were updated.

The files marked with status MODIFIED in the diff are copied over to the chosen directory. This marks the completion of the project to the functional level I aimed for at the start. While there are some potential further avenues to explore such as:

Tracking directory status history
Making a UI
Manual threading with pthreads

but I’m happy with the way it’s turned out, so let’s leave it at that. Maybe its fitting to reiterate the tagline:

Never copy your whole folder again - usbdiff tells you exactly what’s changed