Have you ever made a change to your system without knowing if your users would like it? This is one of the reasons why A/B testing has become a popular practice for optimizing digital products.
Our team implemented it using Split, a platform that provides many tools to make the process easier and helped us keep control of our code while we experimented with the user experience.
In this post we'll talk about A/B testing, our experience implementing it with Split, and some things to keep in mind for your own implementation.
What Is A/B Testing?
Let's start from the beginning. A/B testing is a technique used to compare two versions of an element to determine which produces better results based on one or more selected metrics. Imagine a page with a red button (Version A), and our team wants to test how changing the color to blue (Version B) impacts users. To do this, we split the users into two groups: one group sees Version A, while the other sees Version B. Then we evaluate which version performs better based on a defined metric collected from both groups; for this example, it could be "click-through rate".
When a reasonable number of users have participated in the experiment, we end the test, and the version with the highest click-through rate is declared the winner and remains on the page.
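Under the hood, the split itself just needs to be deterministic so that the same user always sees the same version. Here is a minimal sketch of how that assignment might be done by hand, assuming each user has a stable ID; the hashing scheme and variant names are illustrative and not how Split actually buckets users:

```typescript
// Deterministic 50/50 assignment: the same user always lands in the same group.
// Simplified illustration only, not Split's actual bucketing algorithm.
function hashUserId(userId: string): number {
  let hash = 0;
  for (const char of userId) {
    hash = (hash * 31 + char.charCodeAt(0)) >>> 0; // unsigned 32-bit rolling hash
  }
  return hash;
}

type Variant = "A" | "B";

function assignVariant(userId: string): Variant {
  // Buckets 0-49 see the red button (A); buckets 50-99 see the blue one (B).
  return hashUserId(userId) % 100 < 50 ? "A" : "B";
}

// Example: pick the button color for a given user.
const buttonColor = assignVariant("user-123") === "A" ? "red" : "blue";
```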
Split's Advantage: Often we need to test more than just two versions at once. Say we want to check how making each option in a long list the default value affects our metrics. If we were to compare them one by one, the process could take forever. That's why tools that let you compare more than two versions at a time, like Split, speed up experiments and deliver results faster.
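With Split, the per-user assignment is handled by the SDK: the code asks for the treatment of a flag and renders accordingly, and a single flag can return more than two treatments. A rough sketch using Split's JavaScript SDK, where the flag name `default_option_experiment` and its treatment values are hypothetical:

```typescript
import { SplitFactory } from "@splitsoftware/splitio";

const factory = SplitFactory({
  core: {
    authorizationKey: "YOUR_SDK_KEY", // placeholder
    key: "user-123",                  // the user being evaluated
  },
});
const client = factory.client();

client.on(client.Event.SDK_READY, () => {
  // One flag, several treatments: one per variant under test.
  const treatment = client.getTreatment("default_option_experiment");

  switch (treatment) {
    case "option_a":
      // set option A as the default value
      break;
    case "option_b":
      // set option B as the default value
      break;
    case "option_c":
      // set option C as the default value
      break;
    default:
      // "control" or unknown: fall back to the current behavior
      break;
  }
});
```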
Benefits Beyond Immediate Results
While A/B testing is known for enabling data-driven decisions, it has even more advantages that may not be visible at first glance.
Safely Deploy Incomplete Features
Thanks to feature flags, you can deploy changes to production while keeping them hidden from users, testing them only with a controlled group. This avoids the usual "we can't deploy it because it is not ready" or "this needs to go through QA first" and makes it possible to continue with other releases without waiting for a feature to be 100% ready, reducing risk, accelerating development cycles, and eliminating the stress of last-minute or overnight releases.
Whether you're working on a new UI, modifying backend architecture, or implementing critical improvements, you can roll the changes out only to a small subset of users or even just your internal team. This way you can identify and fix potential bugs before they affect your customer base, without delaying the release of other features.
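In practice, this kind of gate can be a single treatment check around the new code path, with the targeting (internal team, a small percentage of customers, everyone) configured on the Split side rather than in the code. A simplified sketch, where `new_checkout_ui` is a hypothetical flag name and the render functions are placeholders:

```typescript
// Assumes the Split `client` from the earlier sketch is already created and ready.
// Targeting rules live in the Split console, not in this code.
function renderCheckout(): void {
  const treatment = client.getTreatment("new_checkout_ui");

  if (treatment === "on") {
    renderNewCheckout();     // unfinished feature, visible only to targeted users
  } else {
    renderCurrentCheckout(); // "off" or "control": the stable existing experience
  }
}

function renderNewCheckout(): void { /* new UI goes here */ }
function renderCurrentCheckout(): void { /* existing UI stays here */ }
```

Because the targeting lives in Split, widening the rollout or pulling it back doesn't require touching this code.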
Instant Rollback for Failed Experiments
If an error is found, or the team decides an experiment was not successful, another great benefit of the tool is the ability to roll back immediately without requiring a new deployment.
Deploying with feature flags isn't just useful for tracking metrics; it also enhances system safety and reliability. If a feature fails or an experiment doesn't produce the expected results, reverting it is as straightforward and immediate as turning off the flag.
Efficient Management with Split: A software development lifecycle aligned with these two concepts worked well for us. We built each new version of a feature behind a flag, deployed the new code with the flag enabled only for our team, tested the changes internally and, if they worked, enabled the new version for our customers. Even with this quick testing approach, there were several times when bugs were discovered only through real user interaction, but resolving them required nothing more than turning off the flag and fixing the bug the next day.
The Dark Side of A/B Testing: Code Complexity
It’s not all perfect. These benefits also come with technical complexity, especially when A/B testing is implemented on a large scale.
During my time working at PODS, a platform where users configure the rental and transport of containers step by step, we launched multiple experiments to test whether reordering steps, removing options, or adding new ones had an impact on sales.
It seemed simple: add some flags to show the different versions and measure conversions. But over time there were so many flags that they became uncontrollable. Each change introduced new possible combinations, and what started as an effort to optimize our user experience became an unmanageable labyrinth of product versions.
Uncontrolled Feature Flags
One of the main code smells is having too many feature flags affecting the same system flow.
In our case, there were flags to:
- Modify the order of steps
- Change the design of certain screens
- Add or remove optional questions
Each flag affected the user experience differently, meaning there were multiple possible paths within the same flow. This led to every deployment requiring exhaustive testing since any change could interfere with another ongoing experiment. Some might argue that our automated testing strategy was inadequate, but automated tests were simply not enough.
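To make the problem concrete, here is roughly what the flow logic ends up looking like once several independent flags touch the same screen (flag names are illustrative, and the Split `client` from the earlier sketches is assumed). With three boolean flags there are already 2^3 = 8 paths to test, and each new flag doubles that number:

```typescript
// Three independent flags on the same checkout flow:
const reorderSteps  = client.getTreatment("reorder_steps") === "on";
const newDesign     = client.getTreatment("new_screen_design") === "on";
const extraQuestion = client.getTreatment("optional_questions") === "on";

// Every release has to work for all 2^3 = 8 combinations of these branches,
// and each additional flag added to the flow doubles that number.
if (reorderSteps)  { /* reordered step sequence */ } else { /* original sequence */ }
if (newDesign)     { /* redesigned screens */ }      else { /* current screens */ }
if (extraQuestion) { /* show the optional questions */ }
```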
With Split we could see which experiments were active and remove those whose analysis had already been completed, but because cleanup was slower than adding new experiments, the number of possible combinations kept growing. As a result, the team ended up spending more time fixing bugs and running smoke tests across all the variants than analyzing results or building new functionality.
In the end, the project became unmanageable, and many experiments were abandoned because maintaining them cost more than the expected benefit.
How Do You Identify an Excess of Experiments?
The easiest way to avoid this chaos is to use feature flags only when they are truly needed. But how do you know when they are truly needed? When is it better to just implement a change without a flag?
Some of the signs I consider important:
- Too many flags in the same flow → If each step in a process has multiple active versions, complexity begins to grow exponentially.
- Experiments that never end → If an experiment starts to take longer than expected, it is time to rethink it or make a decision and move forward.
In the case of PODS, the solution was to simplify experiments:
- Reduce the number of simultaneously active flags
- Eliminate tests that do not seem to generate significant changes
- Implement a cleanup policy to prevent flags from accumulating in the code unnecessarily (sketched below)
As a result, we got a smoother workflow, fewer production errors, and a more effective experimentation strategy.
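Part of that cleanup policy is mechanical: once an experiment has a winner, the flag check is deleted and the winning branch becomes the only code path. A before/after sketch, with a hypothetical flag name and the Split `client` from the earlier sketches assumed:

```typescript
// Before: the experiment is still running, so both branches live in the code.
function getButtonColor(): string {
  return client.getTreatment("blue_button_experiment") === "on" ? "blue" : "red";
}

// After: the experiment concluded, blue won, and the flag and losing branch are gone.
function getButtonColorAfterCleanup(): string {
  return "blue";
}
```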
Split as an Ally: Split offers centralized experiment management, allowing teams to easily see which experiments are still active and which have already finished, so they can remove unnecessary flags from the code.
Conclusion
A/B testing is an effective way to make data-driven decisions and improve digital products. However, its implementation can become complex and difficult to maintain. If experiments accumulate without control, the code becomes hard to manage, development times increase, and the expected benefits disappear.
Feature flags are an excellent tool, but they must be used strategically because more experiments don’t always produce better results. It’s crucial to understand when it’s worth testing and when it’s better to simply make the change. Balancing experimentation with simplicity is the key to making the most of A/B testing without falling into chaos.