Solving Problems
A Functional Approach to Penny-Precise Allocation
How we solved the problem of allocating a sum of money proportionally across multiple buckets by leaning on functional programming.

An easy trap to fall into as an object-oriented developer is to get too caught up in the idea that everything has to be an object. I work in Ruby, for example, where the first thing you learn is that everything is an object. Some problems, however, are better solved by taking a functional approach. For instance, at Betterment, we faced the challenge of allocating a sum of money proportionally across multiple buckets. In this post, I'll share how we solved the problem by leaning on functional programming to allocate money precisely across proportional buckets.

The Problem

Proportional allocation comes up often throughout our codebase, but it's easiest to explain using a fictional example. Suppose your paychecks are $1000 each, and you always allocate them to your different savings accounts as follows:

College savings fund: $310
Buy a car fund: $350
Buy a house fund: $200
Emergency fund: $140

Now suppose you're an awesome employee and received a bonus of $1234.56. You want to allocate your bonus proportionally in the same way you allocate your regular paychecks. How much money do you put in each account?

You may be thinking, isn't this a simple math problem? Let's say it is. To get each amount, take the ratio of the contribution from your normal paycheck to the total of your normal paycheck, and multiply that by your bonus. So, your college savings fund would get:

(310/1000) * 1234.56 = 382.7136

We can do the same for your other three accounts, but you may have noticed a problem. We can't split a penny into fractions, so we can't give your college savings fund the exact proportional amount. More generally, how do we take an inflow of money and allocate it to weighted buckets in a fair, penny-precise way?

The Mathematical Solution: Integer Allocation

We chose to tackle the problem by working with integers instead of decimal numbers in order to avoid rounding. This is easy to do with money: we can just work in cents instead of dollars. Next, we settled on an algorithm which pays out buckets fairly and guarantees that the total payments exactly sum to the desired payout. This algorithm is called the Largest Remainder Method:

1. Multiply the inflow (the payout in the example above) by each weight (where the weights are the integer amounts of the buckets, i.e. the paycheck contributions in our example above), and divide each of these products by the sum of the weights, finding the integer quotient and integer remainder.
2. Find the number of pennies left over to allocate by taking the inflow minus the total of the integer quotients.
3. Sort the remainders in descending order and allocate any leftover pennies to the buckets in this order.

The idea here is that the quotients represent the amounts we should give each bucket aside from the leftover pennies. Then we figure out which buckets deserve the leftover pennies.

Let's walk through this process for our example. Remember that we're working in cents, so our inflow is 123456 and we need to allocate it across bucket weights of [31000, 35000, 20000, 14000]. We find each integer quotient and remainder by multiplying the inflow by the weight and dividing by the total weight.
We took advantage of the divmod method in Ruby to grab the integer quotient and remainder in one shot, like so:

    buckets.map do |bucket|
      (inflow * bucket).divmod(total_bucket_weight)
    end

This gives us 123456*31000/100000, 123456*35000/100000, 123456*20000/100000, and 123456*14000/100000. The integer quotients with their respective remainders are [38271, 36000], [43209, 60000], [24691, 20000], and [17283, 84000].

Next, we find the leftover pennies by taking the inflow minus the total of the integer quotients, which is 123456 - (38271 + 43209 + 24691 + 17283) = 2.

Finally, we sort our buckets in descending remainder order (because the buckets with the highest remainders are most deserving of extra pennies) and allocate the leftover pennies we have in this order. It's worth noting that in our case we're using Ruby's sort_by method, which gives us a nondeterministic order when remainders are equal. In this example, our fourth and second buckets are the most deserving. Our final allocations are therefore [38271, 43210, 24691, 17284]. This means that your college savings fund gets $382.71, your car fund gets $432.10, your house fund gets $246.91, and your emergency fund gets $172.84.

The Code Solution: Make It Functional

Given that we have to manage penny allocations between a person's goals often throughout our codebase, the last thing we'd want is to have to bake penny-pushing logic throughout our domain logic. Therefore, we decided to extract our allocation code into a module function. Then we took it even further: our allocation code doesn't need to care that we're looking to allocate money, just that we're looking to allocate integers. What we ended up with was a black-box "Allocator" module, with a public module function to which you could pass two arguments: an inflow and an array of weightings (see the sketch after the takeaway below for a rough idea of what this looks like).

The takeaway

The biggest lesson to learn from this experience is that, as an engineer, you should not be afraid to take a functional approach when it makes sense. In this case, we were able to extract a solution to a complicated problem and keep our OO domain-specific logic clean.
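Here is that sketch. It is an illustration only, not Betterment's actual implementation; the module name and method signature are assumptions based on the interface described above (an inflow and an array of weightings):

    # A minimal sketch of a largest-remainder integer allocator.
    # Module and method names are illustrative, not Betterment's actual API.
    module Allocator
      # inflow  - an Integer (e.g. cents) to be fully allocated
      # weights - an Array of non-negative Integers, at least one positive
      # Returns an Array of Integers that sums exactly to inflow.
      def self.allocate(inflow, weights)
        total_weight = weights.sum

        # Integer quotient and remainder for each bucket in one shot.
        quotients_and_remainders = weights.map do |weight|
          (inflow * weight).divmod(total_weight)
        end

        quotients = quotients_and_remainders.map(&:first)
        leftover_pennies = inflow - quotients.sum

        # Hand out the leftover units to the buckets with the largest remainders.
        indexes_by_remainder = quotients_and_remainders
          .each_with_index
          .sort_by { |(_quotient, remainder), _index| -remainder }
          .map { |_pair, index| index }

        indexes_by_remainder.first(leftover_pennies).each { |i| quotients[i] += 1 }
        quotients
      end
    end

    Allocator.allocate(123_456, [31_000, 35_000, 20_000, 14_000])
    # => [38271, 43210, 24691, 17284]

The usage line at the bottom reproduces the worked example from this post: the fourth and second buckets, which have the largest remainders, each receive one of the two leftover pennies.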
CI/CD: Standardizing the Interface
Meet our CI/CD platform, Coach, and learn how we increased consistent adoption of Continuous Integration (CI) across our engineering organization, and why that's important.

This is the second part of a series of posts about our new CI/CD platform, Coach. Part I explores several design choices we made in building out our notifications pipeline and describes how those choices are emblematic of our overarching engineering principles here at Betterment. Today I'd like to talk about how we increased consistent adoption of Continuous Integration (CI) across our engineering organization, and why.

Our Principles in Action: Standardizing the Interface

At Betterment, we want to empower our engineers to do their best work. CI plays an important role in all of our teams' workflows. Over time, a handful of these teams formed divergent opinions on what kind of acceptance criteria they had for CI. While we love the concern that our engineers show toward solving these problems, these deviations became problematic for applications of the same runtime that should abide by the same set of rules; for example, all Ruby apps should run RSpec and Rubocop, not just some of them. In building a platform as a service (PaaS), we realized that in order to mitigate the problem of nurturing pets vs. herding cattle we would need to identify a firm set of acceptance criteria for different runtimes.

In the first post of this series we mention one of our principles, Standardize the Pipeline. In this post, we'll explore that principle and dive into how we committed 5,000-line configuration files to our repositories with confidence by standardizing CI for different runtimes, automating configuration generation in code, and testing the process that generates that configuration.

What's so good about making everything the same?

Our goals in standardizing the CI interface were to:

- Make it easier to distribute new CI features more quickly across the organization.
- Onboard new applications more quickly.
- Ensure the same set of acceptance criteria is in place for all codebases in the org. For example, by assuming that any Java library will run the PMD linter and unit tests in a certain way, we can bootstrap a new repository with very little effort.
- Allow folks outside of the SRE team to contribute to CI.

In general, our CI platform categorizes projects into applications and libraries and divides those up further by language runtime. Combined, we call this a project_type. When we make improvements to one project type's base configuration, we can flip a switch and turn it on for everyone in the org at once. This lets us distribute changes across the org quickly. How we managed to actually execute on this will become clearer in the next section, but for the sake of hand-wavy expediency: we have a way to run a few commands and distribute CI changes to every project in a matter of minutes.

How did we do it?

Because we use CircleCI for our CI pipelines, we knew we would have to define our workflows using their DSL inside a .circleci/config.yml file at the root of a project's repository. With this blank slate in front of us we were able to iterate quickly by manually adding different jobs and steps to that file. We would receive immediate feedback in the CircleCI interface when those jobs ran, and this feedback loop helped us iterate even faster. Soon we were solving for our acceptance criteria requirements left and right — that Java app needs the PMD linter! This Ruby app needs to run integration tests!
And then we reached the point where manual changes were hindering our productivity. The .circleci/config.yml file was getting longer than a thousand lines fast, partly because we didn't want to use any YAML shortcuts to hide away what was being run, and partly because there were no higher-level mechanisms available at the time for re-use when writing YAML (e.g. CircleCI's orbs).

Defining the system

Our solution to this problem was to build a system, a Coach CLI for our Coach app, designed according to 12-factor CLI conventions. This system's primary goal is to create .circleci/config.yml files for repositories, encapsulating the necessary configuration for a project's CI pipeline. The CLI reads a small project-level configuration definition file (coach.yml) located in a project's directory and extrapolates information to create the much larger repo-level CircleCI-specific configuration file (.circleci/config.yml), which we were previously editing ourselves.

To clarify the hierarchy of how we thought about CI, here are the high-level terms and components of our Coach CLI system:

- There are projects. Each project needs a configuration definition file (coach.yml) that declares its project_type. We support wordpress_app, java_library, java_app, ruby_gem, ruby_app, and javascript_library for now.
- There are repos; each repo has one or more projects of any type.
- There needs to be a way to set up a new project.
- There needs to be a way to idempotently generate the CircleCI configuration (.circleci/config.yml) for all the projects in a repo at once.
- Each project needs to be built, tested, and linted.

We realized that the dependency graph of repository → projects → project jobs was complicated enough that we would need to recreate the entire .circleci/config.yml file whenever we needed to update it, instead of just modifying the YAML file in place. This was one reason for automating the process, but the downsides of human-managed software were another. Manual updates to this file allow the configuration for infrequently modified projects to drift, and leaving it up to engineers to own their own configuration lets folks modify the file in unsupported ways that could break their CI process. And then we're back to square one.

We decided to create that large file by essentially concatenating smaller components together. Each of those smaller components would be the output of specific functions, and each of those functions would be written in code and tested. The end result was a lot of small files that look a little like this: https://gist.github.com/agirlnamedsophia/4b4a11acbe5a78022ecba62cb99aa85a

Every time we make a change to the Coach CLI codebase, we are confident that the thousands of lines of YAML that are idempotently generated as a result of the coach update ci command will work as expected, because they're already tested in isolation, in unit tests. We also have a few heftier integration tests to confirm our expectations. And no one needs to manually edit the .circleci/config.yml file again.

Defining the Interface

In order to generate the .circleci/config.yml that details which jobs to run and what code to execute, we first needed to determine what our acceptance criteria were. For each project type we knew we would need to support:

- Static code analysis
- Unit tests
- Integration tests
- Build steps
- Test reports

We define the specific jobs a project will run during CI by looking at the project_type value inside a project's coach.yml.
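To make that concrete, here is a rough sketch of the kind of mapping this implies. The constant, helper, and job names below are illustrative assumptions, not Coach's actual internals:

    # Hypothetical sketch of how project_type might map to CI jobs.
    JOBS_BY_PROJECT_TYPE = {
      "ruby_app"     => %w[rspec rubocop brakeman],
      "java_app"     => %w[unit_tests integration_tests pmd],
      "java_library" => %w[unit_tests integration_tests pmd],
      # ...and so on for ruby_gem, javascript_library, wordpress_app
    }.freeze

    def jobs_for(coach_config)
      JOBS_BY_PROJECT_TYPE.fetch(coach_config.fetch("project_type"))
    end

    jobs_for("project_type" => "ruby_app") # => ["rspec", "rubocop", "brakeman"]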
If the value for project_type is ruby_app, then the .circleci/config.yml generator will follow certain conventions for Ruby programs, like including a job to run tests with RSpec or including a job to run static analysis commands like Rubocop and Brakeman. For Java apps and libraries we run integration and unit tests by default, as well as PMD as part of our static code analysis.

Here's an example configuration section for a single job, the linter job for our Coach repository: https://gist.github.com/agirlnamedsophia/4b4a11acbe5a78022ecba62cb99aa85a

And here's an example of the Ruby code that helps generate that result: https://gist.github.com/agirlnamedsophia/a96f3a79239988298207b7ec72e2ed04

For each job that is defined in the .circleci/config.yml file, according to the project type's list of acceptance criteria, we include additional steps to handle notifications and test reporting. By knowing that the Coach app is a ruby_app, we know how many jobs will need to be run and when. By writing that YAML inside of Ruby classes we can grow and expand our pipeline as needed, trusting that our tests confirm the YAML looks how we expect it to look. If our acceptance criteria change, because everything is written in code, adding a new job involves a simple code change and a few tests, and that's it. We'll go into contributing to our platform in more detail below.

Onboarding a new project

One of the main reasons for standardizing the interface and automating the configuration generation was to onboard new applications more quickly. To set up a new app, all you need to do is be in the directory for your project and then run coach create project --type $project_type.

    -> % coach create project --type ruby_app
    'coach.yml' configuration file added -- update it based on your project's needs

When you run that, the CLI creates the small coach.yml configuration definition file discussed earlier. Here's what an example Ruby app's coach.yml looks like: https://gist.github.com/agirlnamedsophia/2f966ab69ba1c7895ce312aec511aa6b

The CLI will refer back to a project's coach.yml to decide what kind of CircleCI DSL needs to be written to the .circleci/config.yml file to wire up the right jobs to run at the right time. Though our contract with projects of different types is standardized, we permit some level of customization. The coach.yml file allows our users to define certain characteristics of their CI flow that vary and require more domain knowledge about a specific project: the level of test parallelism their application test suite requires, the list of databases required for tests to run, or an attribute composed of a matrix of Ruby versions and Gemfiles to run the whole test suite against. Using this declarative configuration is more extensible and more user-friendly, and it doesn't break the contract we've put in place for projects that use our CI platform.

Contributing to CI

Before, if you wanted to add an additional linter or CI tool to our pipeline, it would require adding a few lines of untested bash code to an existing Jenkins job, or adding a new job to a precarious graph of jobs, and crossing your fingers that it would "just work." The addition couldn't be tested, and it was often only available to one project or one repository at a time. It couldn't scale out to the rest of the org with ease. Now, updating CI requires opening a PR to make the change.
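The gists above aren't reproduced here, but as a rough, hypothetical sketch of the pattern they illustrate (a job's YAML emitted from a small, unit-testable Ruby object; the class name, image, and commands are assumptions, not Coach's actual code):

    # Hypothetical sketch: a job definition whose output is serialized
    # into .circleci/config.yml.
    require "yaml"

    class LintJob
      def initialize(project_name:)
        @project_name = project_name
      end

      # Returning a plain hash keeps the object easy to assert on in unit tests.
      def to_h
        {
          "#{@project_name}_lint" => {
            "docker" => [{ "image" => "ruby:2.6" }],
            "steps"  => ["checkout", { "run" => "bundle exec rubocop" }],
          },
        }
      end
    end

    puts YAML.dump(LintJob.new(project_name: "coach_cli").to_h)

A unit test can then assert on the generated hash (or the dumped YAML) directly, which is what lets changes to the generator ship with confidence.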
We encourage all engineers who want to add to their own CI pipeline to make changes on a branch from our Coach repository, where all the configuration generation magic happens, verify its effectiveness for their use case, and open a pull request. If it's a reasonable addition to CI, our thought is that everyone should benefit. By having these changes in version control, each addition to the CI pipeline goes through code review and requires that tests be written. We therefore have the added benefit of knowing that updates to CI have been tested and are deemed valid and working before they're distributed, and we can prevent folks from removing a feature without considering the impact it may have. When a PR is merged, our team takes care of redistributing the new version of the library so engineers can update their configuration. CI is now a mechanism for instantly sharing the benefits of discoveries made in isolated exploration with everyone.

Putting it all together

Our configuration generator is doing a lot more than just taping together jobs in a workflow — we evaluate dependency graphs and only run certain jobs that have upstream changes or are triggered themselves. We built our Coach CLI into the Docker images we use in CircleCI, so those Coach CLI commands are available to us from inside the .circleci/config.yml file. The CLI handles notifications, artifact generation, and deployment triggers. As we stated in our requirements for Coach in the first post, we believe there should be one way to test code and one way to deploy it. To get there we had to make all of our Java apps respond to the same set of commands, and all of our Ruby apps do the same. Our CLI and the accompanying conventions make that possible.

Where before it could take weeks of both product engineering and SRE time to set up CI for an application or service within a complex ecosystem of bash scripts, Jenkins jobs, and application configuration, now it takes minutes. Where before it could take days or weeks to add a new step to a CI pipeline, now it takes hours of simple code review. We think engineers should focus on what they care about the most: shipping great features quickly and reliably. And we think we made it a little easier for them (and us) to do just that.

What's Next?

Now that we've wrangled our CI process and encoded the best practices into a tool, we're ready to tackle our Continuous Deployment pipeline. We're excited to see how the model of projects and project types that we built for CI will evolve to help us templatize our Kubernetes deployments. Stay tuned.
CI/CD: Shortening the Feedback Loop
As we improve and scale our CD platform, shortening the feedback loop with notifications was a small, effective, and important piece.

Continuous Delivery (CD) at scale is hard to get right. At Betterment, we define CD as the process of making every small change to our system shippable as soon as it's been built and tested. It's part of the CI/CD (continuous integration and continuous delivery) process. We've been doing CD at Betterment for a long time, but it had grown to be quite a cumbersome process over the last few years because our infrastructure and tools hadn't evolved to meet the needs of our growing engineering team.

We reinvented our Site Reliability Engineering (SRE) team last fall with our sights set on building software to help developers move faster, be happier, and feel empowered. The focus of our work has been on delivering a platform as a service to make sense of the complex process of CD. Coach is the beginning of that platform. Think of something like Heroku, but for engineers here at Betterment. We wanted to build a thoughtfully composed platform based on the tried and true principles of 12-factor apps. In order to build this, we needed to do two overhauls: 1) build a new CI pipeline and 2) build a new CD pipeline.

Continuous Integration — Our Principles

For years, we used Jenkins, an open-source tool for automation, and a mess of scripts to provide CI/CD to our engineers. Jenkins is a powerful tool and well used in the industry, but we decided to cut it because the way that we were using it was wrong, we weren't pleased with its feature set, and there was too much technical debt to overcome. Tests were flakey and we didn't know if it was our Jenkins setup, the tests themselves, or both. Dozens of engineers contribute to our biggest repository every day, and as the code base and engineering team have grown, the complexity of our CI story has increased and our existing pipeline couldn't keep up. There were task forces cobbled together to drive up reliability of the test suite, to stamp out flakes, to rewrite, and to refactor. This put a band-aid on the problem for a short while. It wasn't enough.

We decided to start fresh with CircleCI, an alternative to Jenkins that comes with a lot more opinions, far fewer rough edges, and a lot more stability built in. We built a tool (Coach) to make the way that we build and test code conventional across all of our apps, regardless of language, application owner, or business unit. As an added bonus, since our CI process itself was defined in code, if we ever need to switch platforms again, it would be much easier.

Coach was designed and built with these principles:

- Standardize the pipeline — there should be one way to test code, and one way to deploy it.
- Test code often — code should be tested as often as it's committed.
- Build artifacts often — code should be built as often as it's tested so that it can be deployed at any time.
- Be environment agnostic — artifacts should be built in an environment-agnostic way with maximum portability.
- Give consistent feedback — the CI output should be consistent no matter the language runtime.
- Shorten the feedback loop — engineers should receive actionable feedback as soon as possible.

Standardizing CI was critical to our growth as an organization for a number of reasons.
It ensures that new features can be shipped more quickly, it allows new services to adopt our standardized CI strategy with ease, and it lets us recover faster in the face of disaster — say, a hurricane causing a power outage at one of our data centers.

Our goal was to replace the old way of building and testing our applications (what we called the "Old World") and start fresh with these principles in mind (what we deemed the "New World"). Using our new platform to build and test code would allow our engineers to receive automated feedback sooner so they could iterate faster. One of our primary aims in building this platform was to increase developer velocity, so we needed to eliminate any friction from commit to deploy. Friction here refers to the ambiguity of CI results and the uncertainty of knowing where your code is in the CI/CD process. Shortening the feedback loop was one of the first steps we took in building out our new platform, and we're excited to share the story of how we designed that solution.

Our Principles in Action: Shortening the Feedback Loop

The feedback loop in the Old World run by Jenkins was one of the biggest hurdles to overcome. Engineers never really knew where their code was in the pipeline. We use Slack, like a lot of other companies, so that part of the messaging story wouldn't change, but there were bugs we needed to fix and design flaws we needed to address. How much feedback should we give? When do we want to give feedback? How detailed should our messages be? These were some of the questions we asked ourselves during this part of the design phase.

What our Engineers Needed

For pull requests, developers would commit code and push it up to GitHub, and then eventually they would receive a Slack message that said "BAD" for every test suite that failed, or "GOOD" if everything passed, or nothing at all in the case of a Jenkins agent getting stuck and hanging forever. The notifications were slightly more nuanced than good/bad, but you get the idea. We valued sending Slack messages to our engineers, as that's how the company communicates most effectively, but we didn't like the rate of communication or the content of those messages. We knew both of those would need to change.

As for merges into master, the way we sent Slack messages to communicate with engineering teams (as opposed to just individuals) was limited because of how our CI/CD process was constructed. The entire CI and CD process happened as a series of interwoven Jenkins freestyle jobs. We never got the logic quite right around determining whose code was being deployed — the deploy logic was contingent on a pretty rough shell script inside a Jenkins job. The best we had was a Slack message sent roughly five minutes before a deploy began, tagging a good estimation of contributors but often missing someone if their GitHub email address was different from their Slack email address. More critically, the one-off script solution wasn't stored in source control, and therefore it wasn't tested. We had no idea when it failed or missed tagging some contributors. We liked notifying engineers when a deploy began, but we needed to be more accurate about who we were notifying.

What our SRE Team Needed

Our design and UX were informed by what the engineers using our platform needed, but Coach was built based on our needs. What did we need? Well-tested code stored in version control that could easily be changed and developed.
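As a rough illustration of that idea, here is a simplified sketch based on the notification rules described later in this post; the class, method names, and injected slack_client are assumptions, not Coach's actual implementation:

    require "set"

    # Hypothetical sketch: centralized notification logic that can be
    # unit-tested in isolation; slack_client is an injected collaborator.
    class BuildNotifier
      def initialize(slack_client:)
        @slack_client = slack_client
        @notified_jobs = Set.new
      end

      # Tell the author as soon as a job goes red, but only once per job,
      # even if many parallel containers within that job fail.
      def record_failure(job_name:, author:)
        return if @notified_jobs.include?(job_name)

        @notified_jobs << job_name
        @slack_client.notify(author, "#{job_name} failed")
      end

      # Say something exactly once when the whole workflow is green.
      def record_success(workflow_name:, author:)
        @slack_client.notify(author, "#{workflow_name} passed")
      end
    end

Because the decision logic lives in a plain object like this, it can be exercised in unit tests without involving Slack or CI at all.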
All of the code that handles changesets and messaging logic in the New World is written in one central location, and it's tested in isolation. Our CI/CD process invokes this code when it needs to, and it works great. We can be confident that the right people are notified at the right time because we wrote code that does that, and we tested it. It's no longer just a script that sometimes works and sometimes doesn't. Because it's in source control and it runs through its own CI process, we can also easily roll out changes to notifications without breaking things.

We wanted to build our platform around what our engineers would need to know, when they need to know it, and how often. And so one of the first components we built out was this new communication pipeline. Next we'll explore in more detail some of our design choices regarding the content of our messages and the rate at which we send them.

Make sure our engineers don't mute their Slack notifications

In leaving the Old World of inconsistent and contextually sparse communication, we looked at our blank canvas and initially thought "every time the tests pass, send a notification! That will reduce friction!" So we tried that. If we merged code into a tracked branch — a branch that multiple engineers contribute to, like master — for one of our biggest repos, which contained 20 apps and 20 test suites, we would be notified at every transition: every Rubocop failure, every flakey occurrence of a feature test. We quickly realized it was too much.

We sat back and thought really hard about what we would want, considering we were dogfooding our own pipeline. How often did we want to be notified by the notification system when the tests that tested the code that built the notification system succeeded? Sheesh, that's a mouthful. Our Slack bot could barely keep up! We decided it was necessary to be told only once when everything ran successfully. However, for failures, we didn't want to sit around for five minutes crossing our fingers hoping that everything was successful, only to be told that we could have known three minutes earlier that we'd forgotten a newline at the end of one of our files. Additionally, in CircleCI, where we can easily parallelize our test suites, we realized we wouldn't want to notify someone for every chunk of the test suite that failed, just the first time a failure happened for the suite.

We came up with a few rules to design this part of the system:

- Let the author know as soon as possible when something is red, but don't overdo it for redundant failures within the same job (e.g. if unit tests ran on 20 containers and 18 of them saw failures, only notify once).
- Only notify once about all the green things.
- Give as much context as possible without being overwhelming: be concise but clear.

Next we'll explore the changes we made in content.

What to say when things fail

This is what engineers would see in the Old World when tests failed for an open pull request:

Among other deficiencies, there's only one link, and it takes us to a Jenkins job. There's no context to orient us quickly to what the notification is for. After considering what we were currently sending our engineers, we realized that 1) context and 2) status were the most important things to communicate, and those were the aspects of our old messaging that were suffering the most. Here's what we came up with:

Thanks, Coach bot! Right away we know what's happened. A PR build failed.
It failed for a specific GitHub branch ("what-to-say-when-things-fail-branch"), in a specific repo ("Betterment/coach"), for a specific PR (#430), for a specific job in the test suite ("coach_cli — lint (Gemfile)"). We can click on any of these links and know exactly where they go based on the logo of the service. Messages about failures are now actionable and full of context, prompting the engineer to participate in CI, to go directly to their failures or to their PR. And this bounty of information helps a lot if the engineer has multiple PRs open and needs to quickly switch context.

The messaging for failures when you merged a pull request into master was a little different in that it included mentions for the relevant contributors (maybe all of them, if we were lucky!):

The New World is cleaner, easier to grok, and more immediately helpful: the link title to GitHub is the commit diff itself, and it takes you to the compare URL for that changeset. The CircleCI info includes the title of the job that failed ("coach_cli — lint (Gemfile)"), the build number ("#11389") to reference for context in case there are multiple occurrences of the failure in multiple workflows, a link to the top-level "Workflow", and @s for each contributor.

What to say when things succeed

We didn't change the frequency of messaging for success — we got that right the first time around. You got one notification message when everything succeeded, and you still do. But in the Old World there wasn't enough context to make the message immediately useful. Another disappointment we had with the old messaging was that it didn't make us feel very good when our tests passed. It was just a moment in time that came and went:

In the New World we wanted to proclaim loudly (or as loudly as you can proclaim in a Slack message) that the pull request was successful in CI:

Tada! We did it! We wanted to maintain the same format as the new failure messages for consistency and ease of reading. The links to the various services we use are in the same order as in our new failure messages, but the link to CircleCI only goes to the workflow that shows the graph of all the tests and jobs that ran. It's delightful and easy to parse and has just the right amount of information.

What's next?

We have big dreams for the future of this platform, with more and more engineers using our product. Shortening the feedback loop with notifications is only one small, but rather important, part of our CD platform. In the next post of this series on CD, we'll explore how we committed 5,000-line configuration files to our repositories with confidence by standardizing CI for different runtimes, automating config generation in code, and testing that code generation. We believe in a world where shipping code, even in really large codebases with lots of contributors, should be done dozens of times a day; where engineers can experience feedback about their code with delight and simplicity. We're building that at Betterment.