CI/CD: Standardizing the Interface

Meet our CI/CD platform, Coach and learn how we increased consistent adoption of Continuous Integration (CI) across our engineering organization. And why that's important.

This is the second part of a series of posts about our new CI/CD platform, Coach. Part I explores several design choices we made in building out our notifications pipeline and describes how those choices are emblematic of our overarching engineering principles here at Betterment. Today I’d like to talk about how we increased consistent adoption of Continuous Integration (CI) across our engineering organization, and why.

Our Principles in Action: Standardizing the Interface

At Betterment, we want to empower our engineers to do their best work. CI plays an important role in all of our teams’ workflows. Over time, a handful of these teams formed deviating opinions on what kind of acceptance criteria they had for CI. While we love the concern that our engineers show toward solving these problems, these deviations became problematic for applications of the same runtime that should abide by the same set of rules; for example, all Ruby apps should run RSpec and Rubocop, not just some of them.

In building a platform as a service (PaaS), we realized that in order to mitigate the problem of nurturing pets vs herding cattle we would need to identify a firm set of acceptance criteria for different runtimes. In the first post of this series we mention one of our principles, Standardize the Pipeline. In this post, we’ll explore that principle and dive into how we committed 5000 line configuration files to our repositories with confidence by standardizing CI for different runtimes, automating configuration generation in code, and testing the process that generates that configuration.

What’s so good about making everything the same?

Our goals in standardizing the CI interface were to:

  • Make it easier to distribute new CI features more quickly across the organization.
  • Onboard new applications more quickly.
  • Ensure the same set of acceptance criteria is in place for all codebases in the org. For example, by assuming that any Java library will run the PMDlinter and unit tests in a certain way we can bootstrap a new repository with very little effort.
  • Allow folks outside of the SRE team to contribute to CI.

In general, our CI platform categorizes projects into applications and libraries and divides those up further by language runtime. Combined together we call this a project_type. When we make improvements to one project type’s base configuration, we can flip a switch and turn it on for everyone in the org at once. This lets us distribute changes across the org quickly. How we managed to actually execute on this will become clearer in the next section, but for the sake of hand-wavy-expediency, we have a way to run a few commands and distribute CI changes to every project in a matter of minutes.

How did we do it?

Because we use CircleCI for our CI pipelines, we knew we would have to define our workflows using their DSL inside a .circleci/config.yml file at the root of a project’s repository. With this blank slate in front of us we were able to iterate quickly by manually adding different jobs and steps to that file. We would receive immediate feedback in the CircleCI interface when those jobs ran, and this feedback loop helped us iterate even faster. Soon we were solving for our acceptance criteria requirements left and right — that Java app needs the PMD linter! This Ruby app needs to run integration tests! And then we reached the point where manual changes were hindering our productivity. The .circleci/config.yml file was getting longer than a thousand lines fast, partly because we didn’t want to use any YAML shortcuts to hide away what was being run, and partly because there were no higher-level mechanisms available at the time for re-use when writing YAML (e.g. CircleCI’s orbs).

Defining the system

Our solution to this problem was to build a system, a Coach CLI for our Coach app, designed according to CLI 12-factor conventions. This system’s primary goal is to create .circleci/config.yml files for repositories to encapsulate the necessary configuration for a project’s CI pipeline. The CLI reads a small project-level configuration definition file (coach.yml) located in a project’s directory and extrapolates information to create the much larger repo-level CircleCI specific configuration file (.circleci/config.yml), which we were previously editing ourselves.

To clarify the hierarchy of how we thought about CI, here are the high level terms and components of our Coach CLI system:

  • There are projects. Each project needs a configuration definition file (coach.yml) that declares its project_type. We support wordpress_app, java_library, java_app, ruby_gem, ruby_app, and javascript_libraryfor now.
  • There are repos, each repo has one or more projects of any type.
  • There needs to be a way to set up a new project.
  • There needs to be a way to idempotently generate the CircleCI configuration (.circleci/config.yml) for all the projects in a repo at once.
  • Each project needs to be built, tested, and linted.

We realized that the dependency graph of repository → projects → project jobs was complicated enough that we would need to recreate the entire .circleci/config.yml file whenever we needed to update it, instead of just modifying the YAML file in place. This was one reason for automating the process, but the downsides of human-managed software were another. Manual updates to this file allow the configuration for infrequently-modified projects to drift. And leaving it up to engineers to own their own configuration lets folks modify the file in an unsupported way which could break their CI process. And then we’re back to square one.

We decided to create that large file by ostensibly concatenating smaller components together. Each of those smaller components would be the output of specific functions, and each of those functions would be written in code and be tested. The end result was a lot of small files that look a little like this:

 

https://gist.github.com/agirlnamedsophia/4b4a11acbe5a78022ecba62cb99aa85a

 

Every time we make a change to the Coach CLI codebase we are confident that the thousands of lines of YAML that are idempotently generated as a result of the coach update ci command will work as expected because they’re already tested in isolation, in unit tests. We also have a few heftier integration tests to confirm our expectations. And no one needs to manually edit the .circleci/config.yml file again.

Defining the Interface

In order to generate the .circleci/config.yml that details which jobs to run and what code to execute we first needed to determine what our acceptance criteria was. For each project type we knew we would need to support:

  • Static code analysis
  • Unit tests
  • Integration tests
  • Build steps
  • Test reports

We define the specific jobs a project will run during CI by looking at the projecttype value inside a project’s coach.yml. If the value for projecttype is ruby_app then the .circleci/config.yml generator will follow certain conventions for Ruby programs, like including a job to run tests with RSpec or including a job to run static analysis commands like Rubocopand Brakeman. For Java apps and libraries we run integration and unit tests by default as well as PMD as part of our static code analysis.

Here’s an example configuration section for a single job, the linter job for our Coach repository:

 

https://gist.github.com/agirlnamedsophia/4b4a11acbe5a78022ecba62cb99aa85a

 

And here’s an example of the Ruby code that helps generate that result:

 

https://gist.github.com/agirlnamedsophia/a96f3a79239988298207b7ec72e2ed04

 

For each job that is defined in the .circleci/config.yml file, according to the project type’s list of acceptance criteria, we include additional steps to handle notifications and test reporting. By knowing that the Coach app is a ruby_appwe know how many jobs will need to be run and when. By writing that YAML inside of Ruby classes we can grow and expand our pipeline as needed, trusting that our tests confirm the YAML looks how we expect it to look. If our acceptance criteria change, because everything is written in code, adding a new job involves a simple code change and a few tests, and that’s it. We’ll go into contributing to our platform in more detail below.

Onboarding a new project

One of the main reasons for standardizing the interface and automating the configuration generation was to onboard new applications more quickly. To set up a new app all you need to do is be in the directory for your project and then run coach create project --type $project_type.

-> % coach create project --type ruby_app
'coach.yml' configuration file added -- update it based on your project's needs

When you run that, the CLI creates the small coach.yml configuration definition file discussed earlier. Here’s what an example Ruby app’s coach.yml looks like:

 

https://gist.github.com/agirlnamedsophia/2f966ab69ba1c7895ce312aec511aa6b

 

The CLI will refer back to a project’s coach.yml to decide what kind of CircleCI DSL needs to be written to the .circleci/config.yml file to wire up the right jobs to run at the right time. Though our contract with projects of different types is standardized, we permit some level of customization. The coach.yml file allows our users to define certain characteristics of their CI flow that vary and require more domain knowledge about a specific project: like the level of test parallelism their application test suite requires, or the list of databases required for tests to run, or an attribute composed of a matrix of Ruby versions and Gemfiles to run the whole test suite against. Using this declarative configuration is more extensible and more user friendly and doesn’t break the contract we’ve put in place for projects that use our CI platform.

Contributing to CI

Before, if you wanted to add an additional linter or CI tool to our pipeline, it would require adding a few lines of untested bash code to an existing Jenkins job, or adding a new job to a precarious graph of jobs, and crossing your fingers that it would “just work.” The addition couldn’t be tested and it was often only available to one project or one repository at a time. It couldn’t scale out to the rest of the org with ease.

Now, updating CI requires opening a PR to make the change. We encourage all engineers who want to add to their own CI pipeline to make changes on a branch from our Coach repository, where all the configuration generation magic happens, verify its effectiveness for their use-case, and open a pull request. If it’s a reasonable addition to CI, our thought is that everyone should benefit.

By having these changes in version control, each addition to the CI pipeline goes through code review and requires tests be written. We therefore have the added benefit of knowing that updates to CI have been tested and are deemed valid and working before they’re distributed, and we can prevent folks from removing a feature without considering the impact it may have. When a PR is merged, our team takes care of redistributing the new version of the library so engineers can update their configuration. CI is now a mechanism for instantly sharing the benefits of discovery made in isolated exploration, with everyone.

Putting it all together

Our configuration generator is doing a lot more than just taping together jobs in a workflow — we evaluate dependency graphs and only run certain jobs that have upstream changes or are triggered themselves. We built our Coach CLI into the Docker images we use in CircleCI and so those Coach CLI commands are available to us from inside the .circleci/config.yml file. The CLI handles notifications, artifact generation, and deployment triggers. As we stated in our requirements for Coach in the first post, we believe there should be one way to test code, and one way to deploy it. To get there we had to make all of our Java apps respond to the same set of commands, and all of our Ruby apps to do the same. Our CLI and the accompanying conventions make that possible.

When before it could take weeks of both product engineering and SRE time to set up CI for an application or service within a complex ecosystem of bash scripts and Jenkins jobs and application configuration, now it takes minutes. When before it could take days or weeks to add a new step to a CI pipeline, now it takes hours of simple code review. We think engineers should focus on what they care about the most, shipping great features quickly and reliably. And we think we made it a little easier for them (and us) to do just that.

What’s Next?

Now that we’ve wrangled our CI process and encoded the best practices into a tool, we’re ready to tackle our Continuous Deployment pipeline. We’re excited to see how the model of projects and project types that we built for CI will evolve to help us templatize our Kubernetes deployments. Stay tuned.