If you haven’t already and want to learn about the context behind this project, check out Part One: A Unified Experience Blocked By A Fractured Ecosystem.
Here in part two, we’ll discuss what to do when a team wants to join the Supergraph party but doesn’t use Spring. You tell them to pound sand: this is an invite-only party and they’re not cool enough.
Of course, I wouldn’t actually recommend taking this approach, but it did present a challenge for us as we championed the virtues of our new data platform around the organization. We’ll touch on that later as we break down the technical approach we took to implement our new “data mesh”, including the frameworks, libraries, and processes that made it scalable and maintainable.
Core Architecture: Netflix's DGS Framework and the Apollo Router
At the core of our federated architecture were Netflix's Domain Graph Service (DGS), an open source framework designed for building GraphQL services with Spring, and the Apollo Router, a lightning-fast GraphQL router written in Rust and designed to run a federated supergraph.
Most teams that were already serving data had built their services with Spring, so extending those services to expose their data through the new architecture was easy with DGS: define a schema, add a few new classes to fetch and return data, and register the new API with the router. The router would then compose all of the registered subgraphs and expose a single point of access for all the data served by those subgraphs. This composed gateway is known as the Supergraph.

For more background, Netflix’s posts on their own federation journey are worth a read:
How Netflix Scales its API with GraphQL Federation (Part 1)
How Netflix Scales its API with GraphQL Federation (Part 2)
Building a Subgraph with DGS
DGS is nice. So nice, in fact, that it has been integrated with Spring for GraphQL. Let’s walk through a brief example to illustrate just how easy it can be to get GraphQL services spun up using DGS.
Schema-First Design
DGS promotes a schema-first approach to developing GraphQL services. Your schema defines the data types exposed by your service and the operations that can be performed on them. This becomes the blueprint for your implementation. Using our example from Part 1, the two service schemas might look something like this:
Vehicle Service Schema
type Query {
  vehicle(vin: String!): Vehicle
}

type Vehicle @key(fields: "vin") {
  id: ID!
  vin: String!
  ...
}
Vehicle Registration Service Schema
extend type Vehicle @key(fields: "vin") {
  vin: String! @external
  registration: Registration
  ...
}

type Registration {
  expiration: DateTime
  ...
}
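With both schemas registered, a client can ask for data that spans the two subgraphs in a single request, and the router takes care of the fan-out. A query like this (the VIN is made up) resolves `vin` from the vehicle service and `registration` from the registration service:

```graphql
query {
  vehicle(vin: "1HGCM82633A004352") {
    vin
    registration {
      expiration
    }
  }
}
```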
Pro Tip: DGS codegen reduces boilerplate and gets you started quickly
Once the schema has been defined, DGS’s codegen can generate all of the data classes, input types, enums, and interfaces needed to support the schema your service provides. It will even generate example data fetchers, which are great when you’re getting started.
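Wiring up codegen is a one-time build change. A minimal Gradle configuration might look like the following sketch; the package name, schema path, and plugin version are illustrative, not prescriptive:

```groovy
// build.gradle — DGS codegen plugin (version shown is illustrative)
plugins {
    id "com.netflix.dgs.codegen" version "6.2.1"
}

generateJava {
    // Where the .graphqls schema files live
    schemaPaths = ["${projectDir}/src/main/resources/schema"]
    // Package for the generated data classes
    packageName = "com.demo.generated"
    // Map custom scalars to Java types
    typeMapping = ["DateTime": "java.time.OffsetDateTime"]
}
```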
Data Fetchers
In DGS, data fetchers are the functions responsible for retrieving data. They are the entry points into your subgraph and can be thought of as analogous to controllers in an MVC paradigm. They define how the data exposed by an individual subgraph is resolved. Teams implemented data fetchers backed by their existing databases, allowing them to expose the data they knew best without depending on other teams or needing to adjust their architecture to fit into the Supergraph. Not only could they use their existing databases; in many cases they could reuse any or all of the business logic that surrounded their data access, because the data fetchers could coexist with their previously implemented entry points and therefore leverage existing service layers. In most cases, when it comes to software, the most effective change comes from a scalpel, not a sledgehammer.
Continuing with our example, a simplified version of the vehicle service’s data fetcher might look something like the following, assuming we have an existing repository to fetch data from our database. We introduce new functions to resolve the data for the queries defined in our schema and map between our existing data class and the DGS data class generated from our schema.
@DgsComponent
public class VehiclesDatafetcher {
  private final VehicleRepository vehicleRepository;

  public VehiclesDatafetcher(VehicleRepository vehicleRepository) {
    this.vehicleRepository = vehicleRepository;
  }

  public static Vehicle toGraphVehicle(com.demo.model.Vehicle vehicle) {
    if (vehicle == null) return null;
    return Vehicle.newBuilder()
        .id(vehicle.getId().toString())
        .vin(vehicle.getVin())
        .build();
  }

  @DgsQuery
  public Vehicle vehicle(@InputArgument String vin) {
    return toGraphVehicle(vehicleRepository.findByVin(vin));
  }
}
Pro Tip: Use separate data fetchers to fetch fields that require a table join
Data fetcher boundaries can be confusing at first, but our rule of thumb was: if a given fieldset required a table join in the database, it was a good idea to create a separate data fetcher. This shifted the task of joining the data together to the router, reducing the need for joins in the database and improving overall performance.
Data Loaders: Addressing the N+1 Query Problem
Say you wanted to query for a list of vehicles and all of their registration expirations. The router would need to fetch the list of vehicles from the vehicle service, length N, and then query the registration service N times to fetch each vehicle’s registration expiration. This is known as the N+1 problem. DGS helpfully provides a convenient way to address this problem using what it calls data loaders. Data loaders allow for batching those registration requests and loading them all in a single call. The implementation for the registration data loader might look something like the following:
@DgsDataLoader(name = "registrations")
public class RegistrationDataLoader implements MappedBatchLoader<String, Registration> {
  @Autowired
  RegistrationService registrationService;

  @Override
  public CompletionStage<Map<String, Registration>> load(Set<String> keys) {
    return CompletableFuture.supplyAsync(() -> registrationService.loadRegistrations(keys));
  }
}
This data loader allows us to batch database calls to load registrations for vehicles by the given keys, in this case VINs, but we still need our entrypoint - the data fetcher. The data fetcher needs to use this data loader to load the data it is fetching. That might look something like this:
@DgsComponent
public class RegistrationDataFetcher {
  @DgsData(parentType = "Vehicle", field = "registration")
  public CompletableFuture<Registration> registration(DataFetchingEnvironment dfe) {
    DataLoader<String, Registration> dataLoader = dfe.getDataLoader("registrations");
    // The VIN comes from the parent Vehicle object, not from a field argument
    Vehicle vehicle = dfe.getSource();
    return dataLoader.load(vehicle.getVin());
  }
}
At this point the router will have enough information to load many vehicles, along with their registrations (in parallel), and map them back together into a list of Vehicle objects with their registration fields populated to return to the user.
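A query for many vehicles is what makes the batching pay off. Assuming the vehicle service also exposed a list query (a hypothetical `vehicles` field, not shown in the schemas above), a request like this would trigger a single batched registration load rather than N separate calls:

```graphql
query {
  vehicles {
    vin
    registration {
      expiration
    }
  }
}
```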
Pro Tip: Use MappedBatchLoaders for loading nullable fields
DGS has two main types of data loaders: the BatchLoader and the MappedBatchLoader. MappedBatchLoaders, as their name implies, return a map of key-value pairs for a given set of keys and are recommended when you aren’t sure every key will have a value (we pretty much always used these).
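To see why the map shape matters for nullable fields, here is a plain-Java sketch; the data and method name are made up, standing in for a real registration lookup. A BatchLoader must return a list in exactly the same size and order as its keys, so a missing value has to be represented positionally, while a MappedBatchLoader can simply omit the key and the corresponding field resolves to null:

```java
import java.util.*;

public class MappedLoaderSketch {
  // Hypothetical batch lookup: only VINs present in the backing data
  // appear in the result map; absent keys resolve to null downstream.
  public static Map<String, String> loadRegistrations(Set<String> vins) {
    // Stand-in for the registration database: one VIN has a registration
    Map<String, String> db = Map.of("VIN-1", "2026-01-01");
    Map<String, String> result = new HashMap<>();
    for (String vin : vins) {
      if (db.containsKey(vin)) {
        result.put(vin, db.get(vin));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> loaded = loadRegistrations(Set.of("VIN-1", "VIN-2"));
    System.out.println(loaded.get("VIN-1")); // 2026-01-01
    System.out.println(loaded.get("VIN-2")); // null
  }
}
```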
Composing the Supergraph with the Apollo Router
While DGS handled schema and resolver logic within each service, the Apollo Router stitched everything together. Apollo offers a managed version of the router in GraphOS, but we chose to self-host their pre-built router image.
Self-hosting the core entry point for a distributed data platform at a huge organization might sound fraught with pain, but it wasn’t so bad: we were able to automate most of the management, and the router itself ran like a dream.
Continuous subgraph schema integration was critical to keeping the Supergraph up-to-date and reliable. The router needed to be updated every time a new subgraph popped up or if any existing subgraphs changed their schema. We solved this by introducing some new tasks into the CI/CD pipelines already being used by the various teams at the organization.
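When self-hosting, the set of subgraphs can be declared in a composition config consumed by Apollo’s rover CLI. Ours was maintained by automation, but a hand-written sketch (service names, URLs, and the federation version are illustrative) looks something like this:

```yaml
# supergraph.yaml — input to `rover supergraph compose` (names are illustrative)
federation_version: =2.7.0
subgraphs:
  vehicles:
    routing_url: http://vehicle-service.internal/graphql
    schema:
      file: ./schemas/vehicles.graphqls
  registrations:
    routing_url: http://registration-service.internal/graphql
    schema:
      file: ./schemas/registrations.graphqls
```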
Automation Pipelines
- Schema Validation in CI: Each team’s build pipeline validated their schema changes against the composite schema for their target environment. This helped catch compatibility issues early and prevented a breaking change from being deployed.
- Schema Merging in CD: Once changes passed validation in CI, they were merged into the composite schema automatically during deployment. This kept the federated graph synchronized with service updates.
- Graph Publishing: Changes to the graph were published immediately, ensuring consumers had access to the latest version without risk of breaking changes.
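As a sketch, the validation step in a team’s CI can boil down to a single composition check with Apollo’s rover CLI; the pipeline syntax and paths below are illustrative. If the candidate subgraph schema can’t compose with the rest of the graph, the build fails before anything ships:

```yaml
# Illustrative CI step (GitHub Actions syntax; paths are made up)
- name: Validate subgraph schema against the supergraph
  run: |
    # Compose the candidate supergraph from all subgraph schemas;
    # a composition error fails the build before deployment
    rover supergraph compose --config ./supergraph.yaml > composed-supergraph.graphql
```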
Scaling Adoption: Becoming a Platform Team
As more teams began to drink the graph-aid, it became clear that simply providing the infrastructure wasn’t enough. Teams needed standardized tooling and libraries to reduce friction in adoption and onboarding, and to make following best practices as easy as possible. This shifted our focus toward functioning as a platform team.
Internal Libraries for Developer Experience and Security
To streamline development and improve security, we created and maintained internal libraries for developers to use in their onboarding journey:
- Auto-Generated Client Library: We distributed a client library generated from the composite schema that was automatically updated anytime that schema changed. This ensured that client developers had type-safe access to the data without having to generate their own clients or worry about implementing their own data classes.
- Search Service Template and Helper Library: We created a service template and a helper library for teams who wanted to make their data searchable in the graph. Provided they registered their data for indexing with our indexing service, these two utilities allowed teams to expose an entry point to their now searchable data very easily. More on searchability in the graph in Part 3 of this blog series.
- Security Library: To address security requirements, we developed a library to standardize authentication and authorization patterns and address many of the common security concerns that come with GraphQL APIs. This made it far easier for teams to comply with enterprise security policies and pass the often dreaded, but important, security review council meeting needed to get the green light for production deployment.
Common GraphQL Security Concerns Addressed:
- Excessive Data Exposure: Without careful schema design, clients may query and successfully retrieve more data than intended. We enforced strict field-level authorization and provided clear guidelines for how individual teams could control this access using the existing RBAC model at the organization.
- Denial of Service (DoS) via Complex Queries: GraphQL's flexibility makes it prone to overly nested or expensive queries that can tie up resources in the system. We implemented default query depth and complexity limits to mitigate this.
- Injection Attacks: We sanitized and validated all input arguments at the data fetcher layer to prevent malicious payloads.
- Authentication and Authorization: Our libraries enforced consistent JWT validation and role-based access control within the existing model used by the organization, which allowed for granular security configurations but didn’t require new permissions to call the graph itself - if you already had access to the data exposed by the graph, you still had access.
- Introspection Exposure: We implemented a request interceptor that rejected any introspection queries not sent by the router itself to prevent schema leakage to unauthorized users.
- Request Timeouts: We wrapped the DgsRestController so that it returned a Callable, allowing us to apply default request timeouts, further mitigating DoS attacks.
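Of the guards above, the introspection filter is the easiest to sketch in isolation. The class and method below are hypothetical stand-ins for our interceptor logic; establishing that the caller really is the router is deployment-specific and omitted here:

```java
// Hypothetical stand-in for the introspection check in a request interceptor.
// A real interceptor would read the query from the HTTP request body and
// also verify that the caller is the router itself (omitted here).
public class IntrospectionGuard {
  public static boolean isIntrospectionQuery(String query) {
    if (query == null) return false;
    // __schema and __type are the entry points for schema introspection
    return query.contains("__schema") || query.contains("__type");
  }
}
```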
Framework Agnosticism
Remember when I said we’d tell teams that weren’t using Spring to pound sand? Well, as tempting as it was to do, we decided it wouldn't be a great way to get people excited about our new data platform. Mind you, this wasn’t a problem for the router at all. The only thing it cares about is that it can reach GraphQL services serving the schemas that are registered with it. It does not care how they are implemented. As it shouldn’t. Nor should we. The bigger challenge was that these teams could not use all of the great tooling we’d been building.
In one case, early on, a team had been developing a new service using Ktor, a Kotlin-based framework designed for building asynchronous applications. This service provided some core data for the org, so we knew it was important to get them into the Supergraph. They, of course, didn’t want to redo all the work they had done up to that point, so we lent them an engineer from our team to build context on their service and help determine the best path forward, letting them continue using their chosen framework while still getting their data into the graph. The solution ended up being another conveniently available federated GraphQL framework, graphql-kotlin, open sourced by Expedia Group (thanks, Expedia Group!). This framework worked a little differently than DGS:
- A code-first approach instead of schema-first
- A different approach to security
- A different framework paradigm with Ktor
Ultimately, though, all that mattered was getting the data exposed as a GraphQL endpoint from the service, the router could take care of the rest. I won’t go into detail on the implementation with Expedia Group’s framework, but we were able to get the service integrated with the Supergraph after playing around with it a bit. Our official recommendation remained Spring + DGS for implementing subgraphs because our tooling made it that much easier to get things off the ground, but this early exception gave us a model to build off of for other teams using Ktor and helped us demonstrate to folks that they didn’t have to rewrite their services to join the Supergraph party.
Dynamic Schema Generation
Another service had the job of serving hierarchical relationship data between arbitrary entities. Users are a part of companies which are a part of organizations, that kind of thing. Their service had a generic data model that provided the IDs and parent-child relationships of these entities. Keeping this service type-agnostic was non-negotiable for this team, but this went against the grain of the schema-first design model of DGS.
So, instead of using the schema-first approach we had used with other services, we needed to dynamically generate the schema for the service at runtime based on the data stored in their database. Netflix really did think DGS through: we leveraged its dynamic schema functionality to build the schema at runtime and expose it to the router. The router just needed to know where to fetch data from when it started up; it would then introspect each service to build the supergraph schema. So as long as this service could build its schema dynamically and then expose it for introspection to the router, everything was hunky-dory. Another win in showing the flexibility of our new platform.
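In DGS, dynamic schemas hang off a hook that returns a parsed type registry (the @DgsTypeDefinitionRegistry annotation), so the interesting part is producing the SDL string from stored metadata. Here is a stdlib-only sketch of that step; the metadata map is a made-up stand-in for the team’s entity tables:

```java
import java.util.*;

public class DynamicSchemaSketch {
  // Build an SDL string from entity metadata. In the real service, the
  // field definitions came from the database; here they are hardcoded.
  public static String buildSdl(Map<String, List<String>> fieldsByType) {
    StringBuilder sdl = new StringBuilder();
    for (Map.Entry<String, List<String>> entry : fieldsByType.entrySet()) {
      sdl.append("type ").append(entry.getKey()).append(" {\n");
      for (String field : entry.getValue()) {
        sdl.append("  ").append(field).append("\n");
      }
      sdl.append("}\n");
    }
    return sdl.toString();
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps the type definitions in insertion order
    Map<String, List<String>> meta = new LinkedHashMap<>();
    meta.put("Query", List.of("entity(id: ID!): Entity"));
    meta.put("Entity", List.of("id: ID!", "parentId: ID"));
    System.out.print(buildSdl(meta));
  }
}
```

The resulting SDL would then be parsed into a type registry and handed to DGS, which exposes it for introspection by the router.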
Observability
A federated graph architecture is highly distributed in nature, and the move to this new model meant we needed to be able to determine exactly where in the chain an issue arose when something went wrong. Strong observability practices were essential.
Distributed Tracing with OpenTelemetry (OTel)
We instrumented our libraries and tooling with OpenTelemetry (OTel) and encouraged teams to extend that instrumentation within their services to enable full request tracing. This made it easier to pinpoint errors and, crucially, helped teams understand when an issue originated from their service rather than in the router. Good tracing allowed us to better educate folks on how a request flowed through the system and how to identify which service was the source of a problem. This reduced the need for us to respond to every incident ourselves when teams inevitably came to us saying that “there’s a problem with the graph”. Being able to quickly identify root causes in a highly distributed system is critical to minimizing incident response time and keeping things running smoothly.
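On the router side, trace export is a configuration concern rather than a code change. The exact shape of this config varies by router version, but it looks roughly like the following sketch (the collector endpoint is made up):

```yaml
# Apollo Router config sketch (shape varies by version; endpoint is illustrative)
telemetry:
  exporters:
    tracing:
      otlp:
        enabled: true
        endpoint: http://otel-collector.internal:4317
```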
Router Performance
The Apollo Router supports OTel and emits a lot of useful metrics out of the box. One thing I want to call out about the router, discovered as we built out the observability of the platform, is that its processing-time metric consistently reported times on the order of tens of microseconds. In a huge organization with heavy load and many layers of infrastructure, every millisecond counts, so this was truly exceptional performance for a platform team introducing yet another network hop into the overall request path. Thanks for writing your router in Rust, and doing a great job of it to boot, Apollo!
What’s Next
In Part Three, we’ll focus on how we leveraged Elasticsearch to make the data in the graph searchable across subgraphs. Surprise: we have more props to give Netflix, who once again paved the way for the approach we took.
Stay tuned.