Migrating a Legacy Codebase to RequireJS, Part 3

This is the third in a three-post series about migrating our large legacy codebase to use modern JavaScript dependency management:
1. Decoupling JavaScript and Django
2. Migrating pages to RequireJS
3. r.js optimization and build process changes
RequireJS has an associated optimizer, r.js, which handles concatenation and minification and is highly configurable. Since our main repository is a Django application, we’ve traditionally used Django Compressor for both of these tasks. As part of the switch to RequireJS, we’re also switching over to r.js for JavaScript optimization. In the interest of not changing everything at once, we’ll continue to use Django Compressor for CSS compression, at least for now.
This post explains how the new build steps fit into our existing process and then walks through the most complex parts: bundling files, integrating caching, and supporting source maps.
This post assumes familiarity with RequireJS. It's also helpful to have some familiarity with Python and Django.
Updating our build process
Since our main product is a Django application, our build process is Python-based. We added a Python script, build_requirejs, to handle the new RequireJS-related build steps. This script joined the set of build operations that deal with static files, which also handle tasks like running collectstatic and updating bower and npm.
The new script has three major steps:
1. Organizing files into logical bundles and writing out this configuration using the optimizer's modules option in combination with RequireJS's bundles option
2. Running the optimizer, r.js, to concatenate and minify files
3. Managing cache-busting for files that have changed since the last build, using RequireJS's paths option
Much of the script’s work is generating and writing out JavaScript configuration. The script manipulates two types of JavaScript configuration:
- The build config which is fed as input to r.js
- The config options for RequireJS itself. These are organized into a couple of different files that each call requirejs.config and are included as script tags in our master base template
Our base build config is stored in a YAML file that contains a handful of settings and the definitions for any custom modules. The Python build script reads this YAML file into a dictionary, adds configuration for bundling files (described below), and then writes out a build.js file.
Once that build.js is ready, the script runs the optimizer:
```python
call(["node", "bower_components/r.js/dist/r.js", "-o", "staticfiles/build.js"])
```
The optimizer generates the concatenated, minified bundles that will be served in production. Once those bundles exist, the build script runs logic to handle cache-busting, ensuring that the correct version of each bundle – and its associated source map – will be served.
The new build process is rather intricate, but it has saved us from having to add any steps to the development process, a major goal of the RequireJS initiative.
Bundles
In order to reduce the number of HTTP requests a page makes, we've historically grouped files using Django Compressor, which combines all scripts within a compress block into a single minified file:
```html
{% compress js %}
  <script src="{% static 'moment/moment.js' %}"></script>
  <script src="{% static 'bootstrap-daterangepicker/daterangepicker.js' %}"></script>
  <script src="{% static 'hqwebapp/js/daterangepicker.config.js' %}"></script>
{% endcompress %}
```
With RequireJS, the modules configuration supports a centralized definition of which groups of files to concatenate. Rather than the old ad hoc approach, the new build script walks through our JavaScript directory structure and creates one module for each directory. Because our JavaScript files typically live underneath a Django app, the major organization of JavaScript bundles matches the organization of Python code into apps.
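The directory walk can be sketched roughly like this (group_into_bundles and the module names are illustrative, not our actual build code):

```python
import os

def group_into_bundles(js_paths):
    """Map each directory to the list of RequireJS module names it contains.

    Module names are slash-separated paths without the .js extension,
    as RequireJS expects.
    """
    bundles = {}
    for path in js_paths:
        module = path[:-len(".js")] if path.endswith(".js") else path
        directory = os.path.dirname(module)
        bundles.setdefault(directory, []).append(module)
    return bundles

bundles = group_into_bundles([
    "reports/js/charts.js",
    "reports/js/filters.js",
    "app_manager/js/app_manager_utils.js",
])
# One bundle per directory, mirroring the app-based organization
```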
This app-based approach is a reasonable default, but it’s also useful to have the ability to override. As an example, we have an app that manages large sets of user-uploaded tabular data. This area uses some JavaScript from our reporting app, which also deals with tables. It makes sense for the table-management app to include the reporting files it needs but not the entire reporting bundle, which is quite large.
Our base build config contains the custom module definitions. The build script reads this base config, walks through static files directories with the help of STATICFILES_FINDERS, adds a module for any directory that doesn’t already have a custom module defined, and then writes the final build.js to be consumed by the optimizer:
```python
# Write build.js file to feed to r.js
with open(os.path.join(self.root_dir, 'staticfiles', 'hqwebapp', 'yaml', 'requirejs.yaml'), 'r') as f:
    config = yaml.load(f)

# bundles will be a dictionary with an entry for every directory:
#   the key is the directory path
#   the value is a list of all js paths in that directory
# all_modules will be a flat list of all js files
bundles = {}
all_modules = []
...
customized = {re.sub(r'/[^/]*$', '', m['name']): True for m in config['modules']}
for directory, inclusions in six.iteritems(bundles):
    if directory not in customized:
        # Add this module's config to build config
        config['modules'].append({
            'name': os.path.join(directory, 'bundle'),
            'include': inclusions,
            'exclude': ['hqwebapp/js/common'],
            'excludeShallow': [name for name in all_modules if name not in inclusions],
        })

with open(os.path.join(self.root_dir, 'staticfiles', 'build.js'), 'w') as fout:
    fout.write("({});".format(json.dumps(config, indent=4)))
```
For many projects, a major strength of RequireJS is its ability to walk through a tree of dependencies and build up a module that contains all of them – for single page apps, this may be all they need, one god-like module. But we’re managing a huge app with many pages and prefer to have more control over exactly what goes into each module.
This desire for control is reflected in the configuration of individual modules: each module specifies exactly which other modules it should contain with include, and then uses exclude to avoid including the major common modules (jQuery, knockout, etc.) and excludeShallow to avoid including any file that’s going to be included in a different module. This makes for a massive config file, but it’s automatically generated and we know exactly what’s in every bundle.
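For illustration, a single generated module entry might look like this (the module names are hypothetical):

```python
all_modules = ["reports/js/charts", "reports/js/filters", "users/js/roles"]
inclusions = ["reports/js/charts", "reports/js/filters"]

module_config = {
    "name": "reports/js/bundle",
    # exactly the modules this bundle should contain
    "include": inclusions,
    # shared libraries live in the common module, not in every bundle
    "exclude": ["hqwebapp/js/common"],
    # leave out every file destined for a different bundle
    "excludeShallow": [m for m in all_modules if m not in inclusions],
}
```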
Since we’re using RequireJS in a somewhat unusual way, there’s another snag to deal with. The documentation for modules states that
In the modules array, specify the module names that you want to optimize, in the example, “main”. “main” will be mapped to appdirectory/scripts/main.js in your project. The build system will then trace the dependencies for main.js and inject them into the appdirectory-build/scripts/main.js file.
Implied in this is the assumption that each module already exists – the optimizer will add dependencies to an existing file and then minify the whole thing. But each of our modules is just an abstract container for the files in a directory, not an actual file. To work with this assumption, for each specified module, we write out a (nearly empty) JavaScript file:
```python
# Write .js files to staticfiles
for module in config['modules']:
    with open(os.path.join(self.root_dir, 'staticfiles', module['name'] + ".js"), 'w') as fout:
        fout.write("define([], function() {});")
```
Each directory gets one of these files, named bundle.js. When r.js runs, each file's empty define is overwritten with all of the files specified in the corresponding entry in the modules build config, concatenated and minified.
However, none of our development code references these bundle.js files; all of our dependency lists reference real modules. Besides, developers shouldn’t need to know that the foo/bar/file module is eventually going to be part of foo/bar/bundle.js. Making bundles invisible during development saves developers from needing to think about bundle configuration when they’re working on features. In addition, it allows us to adjust the bundle configuration in the future without changing product code.
To make this transparency possible, RequireJS's bundles option works in tandem with the build config's modules option: each entry in bundles maps a bundle file to the list of module names found inside it. Consider this sample file:
```javascript
hqDefine("myApp/js/utils", [
    "otherApp/js/utils",
    ...
], function(otherUtils, ...) {
    ...
});
```
In development, the otherUtils module comes from the otherApp/js/utils.js file. But in production, it's part of otherApp/js/bundle.js. We can specify this using the following config in production:
```javascript
requirejs.config({
    bundles: {
        "otherApp/js/bundle": ["otherApp/js/utils", ...],
    },
});
```
This looks fairly similar to the modules setup that we’re already writing to the build config – and happily, the build config has a bundlesConfigOutFile option that accepts a file path and writes out bundles to correspond with the given modules. By setting bundlesConfigOutFile to the path to our main RequireJS config, we end up with a production config that specifies all of the appropriate bundles. This setup keeps all of this bundle-related logic in the build code only, with bundles out of sight during day-to-day JavaScript development.
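The hookup itself is small. In terms of the build script's config dictionary, it amounts to something like this (the paths and module names shown are illustrative):

```python
import json

config = {
    "modules": [
        {"name": "reports/js/bundle", "include": ["reports/js/charts"]},
    ],
    # r.js writes a requirejs.config({ bundles: ... }) call to this file,
    # mirroring the modules listed above
    "bundlesConfigOutFile": "staticfiles/hqwebapp/js/requirejs_config.js",
}

# The build config handed to r.js, wrapped in parentheses as r.js expects
build_js = "({});".format(json.dumps(config, indent=4))
```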
Caching and the CDN
To improve performance in production, each script tag’s src has a version appended as a GET parameter. The version’s purpose is to bust caching when a file changes: when the browser gets a request for a version it hasn’t seen before, it fetches the most recent version of the file from us rather than using its cached version.
The exact value of the version doesn’t matter, so long as it changes when the file changes. One option would be to use the file’s last modified timestamp. Our approach is more direct: we use a SHA-1 hash of the file’s actual contents. Part of the pre-existing build process is to determine the hash for every static file and store them all in resource_versions, which is a dictionary with filenames for keys and hash codes for values.
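As a sketch of this approach (get_hash here is a simplified stand-in for the build script's helper, which reads files from disk), a version can be a short prefix of the SHA-1 digest of the file's bytes:

```python
import hashlib

def get_hash(contents):
    """Return a short content-based version: a prefix of the SHA-1 digest."""
    return hashlib.sha1(contents).hexdigest()[:7]

version_a = get_hash(b"define([], function() {});")
version_b = get_hash(b"define(['jquery'], function($) {});")
```

Any change to a file's contents yields a different version, while rebuilding an unchanged file yields the same one, so caches are busted exactly when necessary.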
In addition to taking advantage of browser caching, we use Amazon CloudFront as a CDN. In development, our script tag src parameters use relative paths, but in production, they point to the CDN. The CDN has the same semantics as the browser, serving files from its cache except when it sees a new URL, and again, the version parameter makes URLs look “new” when a file has been updated.
The RequireJS build script needed to integrate with this versioning scheme and also with the CDN.
Pre-RequireJS: the static tag
Our links to static content use a custom template tag called static that takes the relative URL as a parameter:
```html
<script src="{% static 'app_manager/js/app_manager_utils.js' %}"></script>
```
The static tag prepends the CDN’s address and also adds a version as a GET param:
```python
@register.filter
@register.simple_tag
def static(url):
    resource_url = url
    version = resource_versions.get(resource_url)
    url = settings.STATIC_CDN + settings.STATIC_URL + url
    if version:
        url += "?version={}".format(version)
    return url
```
This transforms the tag above into something like this:
```html
<script src="https://d2f60qxn5rwjxl.cloudfront.net/app_manager/js/app_manager_utils.js?version=c7b3eab"></script>
```
RequireJS: the paths config
As described above, the new build script writes out JavaScript files: the many bundle.js files and also a few files that contain RequireJS config. These files also need to have their versions tracked. To handle this, the build script generates versions for the newly-created bundle.js files, which it adds to resource_versions:
```python
# Overwrite each bundle in resource_versions with the sha from the optimized version in staticfiles
for module in config['modules']:
    filename = os.path.join(self.root_dir, 'staticfiles', module['name'] + ".js")
    file_hash = self.get_hash(filename)
    ...
    resource_versions[module['name'] + ".js"] = file_hash
```
Similarly, other parts of the script determine the versions for the generated RequireJS config files and add them to resource_versions.
But if the existing build process already manages resource_versions, what's different about these RequireJS-related files? It's all in the timing. There are a lot of static files – over 20,000 at the time of this writing – so it takes a while to generate hashes for all of them. For this reason, the existing versioning task is one of the first build steps kicked off, running in parallel with other time-consuming setup. The RequireJS build script logically belongs with static file handling, which runs much later in the build process. By the time the build script runs, most of the versioning is complete, so the script itself has to add versions for the files it has created or modified.
In addition to generating the version hashes, the script needs to ensure that each request for a JavaScript file actually references the correct version. This used to be handled by the static tag described above, appending versions to URLs in script tags. But RequireJS pages don’t use script tags for individual files; files are requested as needed based on the dependency map. This means that RequireJS needs to know how to fetch the properly-versioned bundle files from the CDN. The RequireJS paths option supports this mapping from module name to URL.
Our RequireJS paths configuration is stored in its own file, which is included in a script tag in the master base template. The development version of this config file is skeletal:
```javascript
requirejs.config({
    paths: {
        // This file gets auto-generated during deploy.
        // It will add the CDN-based paths for each JavaScript module, as in the following example.
        //'notifications/js/notifications_service': 'https://d2f60qxn5rwjxl.cloudfront.net/static/notifications/js/notifications_service.js?version=e7ee173',
    },
});
```
In production, the build process overwrites this file with the full CDN-based URLs for all JavaScript files in resource_versions:
```python
# Write out resource_versions.js for all js files in resource_versions
if settings.STATIC_CDN:
    filename = os.path.join(self.root_dir, 'staticfiles', 'hqwebapp', 'js', 'resource_versions.js')
    with open(filename, 'w') as fout:
        fout.write("requirejs.config({ paths: %s });" % json.dumps({
            file[:-3]: "{}{}{}{}".format(
                settings.STATIC_CDN,
                settings.STATIC_URL,
                file[:-3],
                ".js?version=%s" % version if version else ""
            )
            for file, version in six.iteritems(resource_versions)
            if file.endswith(".js")
        }, indent=2))
    resource_versions["hqwebapp/js/resource_versions.js"] = self.get_hash(filename)
```
(Notice six? We’re also in the middle of migrating from Python 2 to 3, which could be a whole separate blog series…)
By convention, we create one module per file, and that module’s name matches the filename. Because of this, in development, files are just fetched from the local filesystem, with module names used as paths relative to the baseUrl. Then in production, the paths config exists and files are fetched from the CDN.
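A toy sketch of the resolution behavior we rely on (real RequireJS resolution handles far more, e.g. map and shim config; resolve here is purely illustrative):

```python
def resolve(module, base_url, paths):
    """Resolve a module name to a URL: a paths entry wins when present;
    otherwise the name is joined to the base URL with a .js suffix."""
    if module in paths:
        return paths[module]
    return "{}/{}.js".format(base_url, module)

# Development: no CDN paths config, so the module name maps to a local file
dev_url = resolve("notifications/js/notifications_service", "/static", {})

# Production: the generated paths config points at the versioned CDN URL
prod_paths = {
    "notifications/js/notifications_service":
        "https://d2f60qxn5rwjxl.cloudfront.net/static/notifications/js/notifications_service.js?version=e7ee173",
}
prod_url = resolve("notifications/js/notifications_service", "/static", prod_paths)
```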
The paths config is additive: you can make multiple calls to requirejs.config that contain paths, and the final configuration uses the union of all of them. Since we also use paths to set up shortcuts for common libraries, this is convenient, allowing us to put the common library paths in our main RequireJS config and then use an additional file for the CDN-based paths. We can then completely overwrite this file, rather than needing to do more intricate – and fragile – parsing.
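A toy illustration of that additive behavior (requirejs_config is a Python stand-in for requirejs.config; the paths and version shown are made up):

```python
merged_paths = {}

def requirejs_config(options):
    """Stand-in for requirejs.config: successive calls merge their paths."""
    merged_paths.update(options.get("paths", {}))

# Main config: shortcuts for common libraries
requirejs_config({"paths": {"jquery": "jquery/dist/jquery.min"}})

# Separate, fully-overwritable file: CDN-based versioned paths
requirejs_config({"paths": {
    "reports/js/charts":
        "https://d2f60qxn5rwjxl.cloudfront.net/static/reports/js/charts.js?version=abc1234",
}})
# merged_paths now contains entries from both calls
```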
Source maps
Source maps, a lifesaver for debugging production issues, are integrated into r.js via the generateSourceMaps flag. This generates a map file for each bundle and adds the map URL to the end of the bundle code. For a file named bundle.js, this looks like:
```javascript
//# sourceMappingURL=bundle.js.map
```
The built-in source map functionality works out of the box, but we need to integrate it with the caching scheme described above. Without the correct version appended, an outdated map file would be served even as the bundle’s code changes – a debugging nightmare.
To deal with this, the build rewrites each bundle file, appending the version to the source map line:
```python
for module in config['modules']:
    filename = os.path.join(self.root_dir, 'staticfiles', module['name'] + ".js")
    file_hash = self.get_hash(filename)

    # Overwrite source map reference to add version hash
    with open(filename, 'r') as fin:
        lines = fin.readlines()
    with open(filename, 'w') as fout:
        for line in lines:
            if re.search(r'sourceMappingURL=bundle.js.map', line):
                line = re.sub(r'bundle.js.map', 'bundle.js.map?version=' + file_hash, line)
            fout.write(line)

    resource_versions[module['name'] + ".js"] = file_hash
```
This step means that the file’s version is no longer a true hash of its contents, since the overwrite itself changed the contents. This is counterintuitive, but it’s fine that the hash isn’t “correct.” All that matters is that the hash changes whenever the code changes.
What’s next?
The optimization and build process support for RequireJS needed to be complete in order to migrate any pages, but now that it’s complete, it will hopefully take little to no maintenance as we continue to migrate pages. That migration will likely take a while to complete, but it’s a relatively mechanical process. We can start thinking about other improvements while continuing to plug along migrating pages.
There’s always tool maintenance: checking out new tools, keeping our current tools up to date, and deprecating tools we’ve experimented with in the past but decided not to use broadly. We’ve also discussed improvements to our testing infrastructure and error monitoring.
It’s also possible that as we get used to modern dependency management, our preferences will become more refined and we’ll find a reason to switch off of RequireJS. Our original selection of RequireJS specifically was somewhat arbitrary; it was one of several tools that looked sufficient. We may decide to migrate to webpack or SystemJS or something altogether different.
The JavaScript ecosystem changes constantly, even from the start of this migration to the end, whenever that long-awaited day arrives. Even if we move on from RequireJS, going through this migration makes huge strides in encapsulation, dependency management, and consistency, simplifying nasty dependency-related debugging and overall making our JavaScript a less brittle, better place to do development.