The source stands alone

People keep weird stuff in their source repositories. Over the years, I have observed mobile development frameworks stored in Mercurial, SQL Server installation media stored in Perforce, network captures stored in SVN, large test datasets stored in pretty much every source repository type, and most certainly a lot of 3rd-party source stored alongside the project source.

The common belief is that a source repository should contain everything needed to build, package and test project binaries, and that these processes should not rely on external sources. This sentiment does make sense on its own, but keeping binaries and test data in source repositories simply turns these repositories into oversized junkyards, storing everything but the kitchen sink.

Project Ecosystem

Project development lives in an ecosystem where the project can be built, packaged and tested. Such an ecosystem can be as small as a couple of computers on the local network or as big as a bunch of cloud servers building, testing and deploying after every commit.

A typical project ecosystem will have subsystems that can resolve build dependencies, build binaries, run unit tests, package build artifacts, deploy packaged binaries, set up databases, run integration tests and publish packages in a way that allows them to be promoted between QA, staging and production environments.

All of these subsystems store sets of items that are identified by references recorded in every commit of a project source repository. Examples of such references are CI pipeline names, release versions, build image names, build dependency names and versions, test dataset names, test database names and, perhaps, a few other things, often expressed as variables in various scripts.

Project Ecosystem References

This post focuses on maintaining build and test dependencies within the project ecosystem storage, which may be implemented in a variety of ways, such as using cloud storage, network shares, generic package repositories and even source repositories repurposed just for storage.

Build Dependencies

Build dependencies can be maintained in different ways for various development languages and operating systems.

For example, C++ dependencies on Linux may be installed system-wide via a package manager, such as dnf or apt, may be built from source and installed under /usr/local/, or may be built along with the project source. Node.js dependencies, on the other hand, are most commonly installed via npm configured with a set of repositories and, in specific cases, from packages stored as local files.
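
For the Node.js case, the difference between the two sources might look like this minimal sketch (the registry URL and package file are hypothetical):

# resolve packages via a configured (e.g. private) repository
npm config set registry https://<private-registry>/npm/
npm install express
# or install a package stored as a local file
npm install ./packages/internal-lib-1.2.0.tgz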

Package Repositories

Most public package repositories do not allow deleting published packages and can be referenced safely in project build scripts. In contrast, referencing public source repositories, such as those on GitHub, in project build scripts is a bad idea because they can be deleted by their owners at any time.

A private package repository may be maintained within the project ecosystem to make privately built packages transparently available to package retrieval tools in CI builds and to avoid CI pipeline downtime on the odd chance that a public package repository becomes unavailable.

An important distinction should be made between package repositories that merely cache 3rd-party packages and repositories that can be used to manage build maturity for project artifacts.

The former is the simpler case: a number of products and services offer package repositories with configurable upstream repositories, such as Azure Artifacts, CodeArtifact and Artifactory, as well as a few specialized repositories, such as Sinopia for npm.

Managing build maturity is a whole different game. It requires a package repository that accepts CI artifacts with the same version but different build metadata and exposes the package with specific build metadata as the one and only package visible for that version, which is then considered the promotion candidate for its stage.

At this point, I have found only Artifactory to provide this functionality in a way that maintains a build history. A simplified approach is possible with some repositories that allow deleting package versions.

For example, npm and CodeArtifact can be forced to delete a package version and publish a new build with the same version, which may be used as improvised build maturity management. Other services, like Azure Artifacts, will not allow a deleted package version to be published again, so they are utterly useless for this.
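
As a sketch, replacing an existing npm package version in CodeArtifact might look like this (the domain, repository and package names are hypothetical, and npm is assumed to be already logged into the repository via aws codeartifact login):

# delete the existing version from the repository
aws codeartifact delete-package-versions \
    --domain acme --repository builds --format npm \
    --package my-lib --versions 1.4.0
# re-publish the new build of 1.4.0 from the package directory
npm publish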

For package types that are more flexible with package sources, such as Nuget, a network share may be set up to collect, test and promote build artifacts, but in general, a managed package repository that works with the standard tooling for the package type is a more robust solution.

Local Linux Builds

On Linux systems, one can build 3rd-party libraries and tools from source and install them under /usr/local, which works well with Docker images used for CI pipeline builds.

In order to create such an image, a Dockerfile is created for each supported Linux flavor, which installs all necessary tools via a standard package manager, pulls 3rd-party source archives from the project ecosystem storage and builds them within a Docker container. The resulting Docker image is then stored in the ecosystem container registry and is wired into the CI pipeline, so all build dependencies are in place when the project source is mounted in the Docker container for that image.
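
A minimal sketch of such a Dockerfile, assuming an Ubuntu base image and the hypothetical storage account, container layout and fmt archive used in later examples (angle-bracket placeholders stand for real values):

FROM ubuntu:22.04

# build tools installed via the standard package manager
RUN apt-get update && apt-get install -y --no-install-recommends \
        g++ make cmake curl ca-certificates unzip

# pull a 3rd-party source archive from the ecosystem storage, build and
# install it under /usr/local, then remove the source and build trees
# (the release archive is assumed to extract into fmt-9.1.0/)
RUN curl -fsSL "https://<account>.blob.core.windows.net/devops/3rd-party/a/fmt-9.1.0.zip?<sas-token>" -o fmt.zip && \
        unzip -q fmt.zip && rm fmt.zip && \
        cmake -S fmt-9.1.0 -B fmt-build -DFMT_TEST=OFF && \
        cmake --build fmt-build && \
        cmake --install fmt-build && \
        rm -rf fmt-9.1.0 fmt-build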

For 3rd-party source distributions that require project-specific changes, which happens more often than one would expect, a set of patches can be maintained in the project source repository and applied against a copy of each 3rd-party source tree before a local build is made. These patches need to be updated every time a new version of the 3rd-party source is integrated into the project.

Windows Build Dependencies

Some build tools attempt to replicate the Linux local-build approach on Windows, for example by installing 3rd-party source under %PROGRAMFILES% or %LOCALAPPDATA%. Windows, however, provides no system support for this: these directories are not automatically included in C/C++ search paths for header files and libraries, and in many companies they are locked down and cannot be used to install unsanctioned content. Using any of these directories is just not a good way to implement build dependencies on Windows.

vcpkg does a good job of emulating local Linux builds. It pulls the requested 3rd-party source from its repository, builds it within the vcpkg directory tree and integrates the build artifacts into Visual Studio projects via CMake includes and specialized Nuget packages, so projects can reference the installed header files and libraries.
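
For example, a typical vcpkg flow might look like this sketch (fmt is just an example package):

vcpkg install fmt:x64-windows     # pull, build and install fmt for the x64-windows triplet
vcpkg integrate install           # machine-wide MSBuild integration
vcpkg integrate project           # or generate a per-project Nuget integration package instead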

However, I find that Nuget packages work out much better than any other option for C/C++ projects on Windows. They can be configured per-project, support multiple projects with different dependencies within the same solution and provide a clear reference for which 3rd-party dependencies any given project requires.

Nuget packages are easy to make, and building them in-house is often a better option for larger projects than relying on existing packages, because most packages on nuget.org are not published by the original 3rd-party source maintainers and are quite inconsistent in their build configurations.

Building Nuget packages can be mostly automated, with some exceptions described below, so all build dependencies on Windows can be moved to Nuget packages.

Binary compatibility should be observed with Nuget packages, and having either the Visual Studio version or the build toolset version in the package name will go a long way as the project evolves over time. See this page for additional binary compatibility information for Visual Studio build artifacts.

https://learn.microsoft.com/en-us/cpp/porting/binary-compat-2015-2017
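
As a sketch, packaging and publishing a locally built library with the toolset version in the package name might look like this (the .nuspec name, feed URL and API key are hypothetical):

nuget pack fmt-v143.nuspec -Version 9.1.0 -OutputDirectory out
nuget push out\fmt-v143.9.1.0.nupkg -Source https://<feed-url> -ApiKey <key>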

One problem with automating Nuget package builds for 3rd-party source is the lack of a tool in Nuget to inject the content of .props and .targets files from Nuget packages into .vcxproj files, which is required for C/C++ projects using packages.config and packages.{project-name}.config to work. Without such a tool, Visual Studio projects need to be edited manually to add Nuget references for 3rd-party projects whose own 3rd-party dependencies are being moved to Nuget packages. See this issue for more details.

https://developercommunity.visualstudio.com/t/Impossible-to-maintain-CC-Nuget-packa/10195731

Maintaining Patches

There are several ways to maintain 3rd-party source patches for a project. Over the years I have tried a few of them and found the approach described in this section to be the most usable.

I maintain a patch repository for each 3rd-party project I want to patch and use these repositories to generate patches, which are stored in the project source repository. A patch repository can be a clone of the upstream repository, if one is available, or it could just be a series of source package drops for the designated version.

For the second option, the upstream source from a source archive goes into the main branch and is tagged with the released version. For subsequent source drops, the previous source is completely erased, so there is no need to merge anything. Depending on the source directory structure, .gitignore and README.md files may need to be added and maintained on the main branch.

A patch branch is created from a release tag and all changes are maintained on this patch branch. This branch is never merged anywhere; instead, commits from the previous patch branch may be cherry-picked onto the new patch branch when new upstream source is imported.

Dedicated Patch Repository

For example, on the diagram above, the 1.0.0 commits A and B are cherry-picked into the 1.5.1 patches, commit C is dropped, and commits D and E are introduced in this version.

A cloned repository would work in a similar way, except the existing tag is used to create a patch branch for each release.

A combined patch or a series of patches can then be generated from a patch branch and committed into the project source repository. For example, commits A and B can be combined into a single patch (e.g. C++17 compatibility changes).
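
A minimal sketch of this workflow for the source-drop option, with hypothetical archive, branch and patch names:

# import a new upstream drop onto main and tag it with the released version
git checkout main
git rm -rq .                                # erase the previous drop; re-add .gitignore/README.md if needed
tar -xzf ../upstream-1.5.1.tar.gz --strip-components=1
git add -A && git commit -m "Import upstream 1.5.1" && git tag 1.5.1

# start a patch branch for this version and carry over still-relevant patches
git checkout -b patches/1.5.1 1.5.1
git cherry-pick <commit-A> <commit-B>       # commit C is dropped; D and E are added as new commits

# generate a combined patch to be committed into the project source repository
git diff 1.5.1 patches/1.5.1 > x1.patch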

Using standalone 3rd-party source repositories allows patches to be documented in commit messages in patch repositories, while keeping generated patches concise and grouped in a way that better describes what is being patched in the build dependencies consumed by the project.

Cloud Storage

Cloud storage is the most attractive option nowadays for keeping project dependencies because it can be used from just about anywhere and is always available.

I have mostly used Azure storage and will use it in the examples below, but other cloud storage platforms provide similar functionality with different implementation details.

Azure storage can be configured as an organization-wide or a project-wide storage account with a blob container for 3rd-party source and additional subfolders for different projects. For example, for a smaller team, a storage account can be created for the DevOps team and folders for projects A and B can be used to maintain their 3rd-party dependencies in this structure:

  • devops
    • 3rd-party
      • a
      • b

3rd-party source files should include the released version in their names. Actual released packages should be favored over tagged source from the 3rd-party source repository. The released source package may have additional content packaged during its build process or may have unneeded content removed. Checking out a tag from the 3rd-party public source repository will miss these additional bits and may require extra steps before the source can be used within the project.

  • fmt-9.1.0.zip
  • exiv2-0.27.5-Source.tar.gz
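
For example, uploading one of these archives into the structure above might look like this (assuming devops is the blob container and using a write-enabled SAS token, not the read-only one discussed below):

azcopy copy fmt-9.1.0.zip "https://<account>.blob.core.windows.net/devops/3rd-party/a/fmt-9.1.0.zip?<write-sas-token>"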

Some useful 3rd-party packages don't have versions, so a full or a partial commit hash from the repository will work out just as well. The commit date may also be added to make the file name more descriptive and provide a time frame for the source inside.

  • meta-12345678.tar.gz
  • meta-20190102-12345678.tar.gz

Archived packages should never be updated or removed from the ecosystem storage, so project source in past commits can be built with the appropriate dependencies.

If any of the 3rd-party source packages require additional changes, patches from the project source repository may be applied after these packages have been downloaded from the ecosystem storage. These patches must be generated for the exact version of the 3rd-party source that is being downloaded from the ecosystem storage.

A typical sequence of steps for a local Linux build would look like this (angle-bracket placeholders stand for project-specific values):

# download the source archive from the ecosystem storage
azcopy copy "https://<account>.blob.core.windows.net/<container>/<package-name>?<sas-token>" .
# extract the archive (tar, unzip, etc.) into src and remove the archive file
mkdir src && tar -xzf <package-name> -C src --strip-components=1 && rm <package-name>
# apply project patches generated for this exact source version
patch -p1 --unified --directory src --input devops/patches/x1.patch
patch -p1 --unified --directory src --input devops/patches/x2.patch
# configure, build and install under /usr/local (exact build steps vary per library)
cd src && ./configure && make
sudo make install
# remove the build directory
cd .. && rm -rf src

The SAS token for azcopy controls who can read the specified files in the cloud storage. In this case, given that all source packages came from public source repositories, it may be generated with a longer expiry time using a storage policy and checked into the source repository along with the scripts using it. The SAS token should be generated with only the Read permission, so that no one can list the contents of the blob container.
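
A sketch of generating such a token with the Azure CLI, using a stored access policy and read-only permissions (account, container and policy names are hypothetical, and authentication options are omitted):

# create a stored access policy with a long expiry and read-only access
az storage container policy create --account-name <account> \
    --container-name devops --name 3rd-party-read --permissions r --expiry 2030-01-01
# generate a SAS token tied to that policy
az storage container generate-sas --account-name <account> \
    --name devops --policy-name 3rd-party-read -o tsv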

AWS provides similar functionality to copy files, except that pre-signed URLs are very short-lived, so one would need to set up a policy granting specific groups or users the s3:GetObject permission, so that these accounts can run the aws s3 cp command to copy files. Otherwise, the logic is the same as described above.

Generic Package Repositories

Some package managers allow generic package types, which are, effectively, just archives with some metadata. Artifactory maintains such files in generic repositories, which can be downloaded via the jfrog tool, and Azure Artifacts supports universal packages, which can be managed via az artifacts or CI pipeline tasks.
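
For example, downloading a universal package from Azure Artifacts might look like this (requires the Azure DevOps CLI extension; the organization, feed and package names are hypothetical):

az artifacts universal download \
    --organization https://dev.azure.com/<org> --feed <feed> \
    --name fmt-source --version 9.1.0 --path ./downloads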

Note, however, that Azure Artifacts treats universal packages the same way as any other package type and will not allow replacing a published version within a package repository. This may complicate things if a package needs to be updated, which should not happen with 3rd-party source, but it is still something to keep in mind.

Storage Source Repositories

Smaller teams often do not have access to cloud storage and cannot provide network shares for cloud build VMs, such as GitHub workflow runners. One way to avoid non-source content in project source repositories is to set up a standalone source repository configured just to store source packages and other binary content.

Such repositories would keep archived source packages and serve as surrogate cloud storage. They would never have any branches and would need LFS enabled for large archive files.

The directory structure in a storage source repository would be the same as in the cloud storage implementation.
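
The one-time LFS setup for such a repository might look like this (the tracked patterns are just examples):

git lfs install
git lfs track "*.zip"
git lfs track "*.tar.gz"
git add .gitattributes
git commit -m "Track source archives with LFS"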

Network Shares

Network shares work only for local networks and not for all package types, which limits their usefulness.

For example, one can configure Visual Studio to find Nuget packages on a network share, but npm will fail to understand \\server\share notation and one would need to map a network drive in order to use a network share with npm.

For C/C++ projects, only packages should be placed on a network share, not loose header files and libraries, which would inflate build times significantly for large projects. I once worked for a company that stored header files and libraries on a network share for a project with a couple of million lines of code; it took about 4 hours to build the project in the default configuration and only 2 hours when the same dependencies were copied locally.

Building Project Along with 3rd-party Source

Building 3rd-party source along with the project source may look like an attractive option at first, especially if one neatly isolates the 3rd-party source with Git submodules, but it usually results in longer builds, possibly less robust artifacts (e.g. untested build options may be used), mixed-up common dependencies (e.g. different versions of Boost included), unclear 3rd-party source references (e.g. it is hard to tell which version of the 3rd-party source is used) and more required build tools (e.g. 3rd-party source may require Python, Ragel, etc.).

Consuming 3rd-party source in packages or via local builds is a better option, any way you look at it.

Docker Images

Using Docker images in builds is well documented in a lot of places, except for one thing: using the latest tag in any automated scripts and pipelines is a terrible idea, because when past revisions of the project source are built, build scripts will reference the latest Docker image and will either fail to build or produce unreliable results.

Always use a specific tag for any script referencing Docker images. I find that a date works best in most cases, but anything specific can be used as well.

For example, a CI pipeline may reference these Docker images:

  • abc-1.2.3:20221227
  • abc-1.2.3-features-prj456:20220517
  • abc-1.2.3/features/prj456:20220517

The last one will work only in some container registries that allow slashes in image names. For example, Docker Hub will not allow slashes, but Azure Container Registries will.
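
Building and pushing a date-tagged image for the first example above might look like this (the registry name is a placeholder):

docker build -t <registry>/abc-1.2.3:20221227 .
docker push <registry>/abc-1.2.3:20221227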

Test Dependencies

Integration testing often requires large datasets, such as massive product lists, network captures, generated request sequences and other similar data.

Storing test data in the project ecosystem keeps the source repository clean and makes it clearer which dataset is referenced by any given script. Conceptually, the difference can be illustrated with these diagrams.

When a test dataset is stored in the source repository, the script that uses it references the dataset as x.csv. When the CSV file is updated, the script remains unchanged.

Test Dataset in a Source Repository

When test datasets are stored in the project ecosystem storage, a new dataset file is created every time the logical dataset is updated, so the script referencing the test dataset will use the file named x-20231.csv initially and then will be changed to use the new file x-20232.csv later.

Before the test script runs, the appropriate dataset is downloaded with tools such as azcopy or aws s3 cp into its intended folder, where the test script can find it.

Test Dataset in a Cloud Storage

In practice, unlike 3rd-party source packages, which always come in the form of a single file, test data usually consists of multiple data files, and using a date-versioned directory under a logical test dataset name is a better approach. For example, a JMeter script for a load test that creates orders might use a folder structure similar to this:

  • devops
    • JMeter
      • order-load-test
        • 2022-11-05
          • users.csv
          • products.csv
        • 2023-01-10
          • acct-users.csv
          • anon-users.csv
          • products.csv

The JMeter script referencing these files would use the __P(csv-path,) prefix for all CSV references; the csv-path property can be passed to JMeter with the -J option to point to any directory on the machine running JMeter. This allows these scripts to run in all CI pipelines and on development machines without modifying the JMeter script.
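
A sketch of running such a load test against the 2023-01-10 dataset, with a hypothetical JMeter script name and storage account:

# download the dataset into a local folder and point JMeter at it
azcopy copy "https://<account>.blob.core.windows.net/devops/JMeter/order-load-test/2023-01-10?<sas-token>" ./data --recursive
jmeter -n -t order-load-test.jmx -Jcsv-path=./data/2023-01-10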

Also, unlike with 3rd-party source packages, test datasets may contain proprietary information and will likely need more elaborate authentication than a SAS token that can be committed into the project source repository.

Note that test datasets must never be modified in existing directories. Any modifications must be done against a copy of that data in a new directory, which is then used in scripts that expect the new data layout.

Some cloud storage providers implement blob versioning, which may be used instead of creating date-versioned directories. I found Azure blob versioning to be less convenient for my purposes, but I can see it being useful for projects with large amounts of test data and many changes.

Conclusion

Keeping source repositories clean seems like a common-sense practice to implement, but many follow the school of thought that a source repository must be a one-stop shop for all project-related content and fill their repositories with content that makes working with such source repositories less productive.

Keeping all non-source project files in the project ecosystem forces one to parameterize project scripts, which is a self-documenting practice on its own, and it also makes it much easier to experiment with these scripts by using the existing mechanisms to supply different arguments and locally-modified data files.

A source repository containing only project source is easier to maintain and analyze. It does take extra effort to set up a project with an ecosystem, but maintaining it is straightforward and has many benefits: explicit references to all project dependencies, and source repositories left to serve their one and only purpose, which is to keep track of source changes in a project.
