TLDR: Use GraphQL for client-server communication and gRPC for server-to-server. See the Verdict section for exceptions to this rule.
I've read a lot of comparisons of these two protocols and wanted to write one that is comprehensive and impartial. (Well, as impartial as I and my reviewers can make it 😄.) I was inspired by the release of connect-web (a TypeScript gRPC client that can be used in the browser) and a popular HN post entitled GraphQL kinda sucks. My personal history of communication protocols built on top of layer 7:
- REST (Rails and Express)
- ➡️ DDP (Meteor's WebSocket protocol)
- ➡️ GraphQL (which I wrote a book about)
- ➡️ gRPC (which I use at Temporal)
Background
gRPC was released in 2016 by Google as an efficient and developer-friendly method of server-to-server communication. GraphQL was released in 2015 by Meta as an efficient and developer-friendly method of client-server communication. They both have significant advantages over REST and have a lot in common. We’ll spend most of the article comparing their traits, and then we’ll summarize each protocol’s strengths and weaknesses. At the end, we’ll know why each is such a good fit for its intended domain and when we might want to use one in the other’s domain.
Comparing gRPC and GraphQL features
Interface design
Both gRPC and GraphQL are Interface Description Languages (IDLs) that describe how two computers can talk to each other. They work across different programming languages, and we can use codegen tools to generate typed interfaces in a number of languages. IDLs abstract away the transport layer; GraphQL is transport-agnostic but generally used over HTTP, while gRPC uses HTTP/2. We don’t need to know about transport-level details like the method, path, query parameters, and body format as we do with REST. We just need to know a single endpoint that we use our higher-level client library to communicate with.
Message format
Message size matters because smaller messages generally take less time to send over the network. gRPC uses protocol buffers (a.k.a. protobufs), a binary format that just includes values, while GraphQL uses JSON, which is text-based and includes field names in addition to values. The binary format combined with less information sent generally results in gRPC messages being smaller than GraphQL messages. (While an efficient binary format is feasible in GraphQL, it’s rarely used and isn’t supported by most of the libraries and tooling.)
Another aspect that affects message size is overfetching: whether we can request only specific fields or will always receive all fields (“overfetching” fields we don’t need). GraphQL always specifies in the request which fields are desired, and in gRPC, we can use FieldMasks as reusable filters for requests.
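To sketch the difference (the `GetUserRequest` message and `user` query below are hypothetical examples, not from a real API): in gRPC, the request carries a FieldMask listing the paths to return, while in GraphQL the selection set is part of every request.
import "google/protobuf/field_mask.proto";

message GetUserRequest {
  string user_id = 1;
  // Only the fields named in read_mask.paths (e.g. "name", "email") are returned.
  google.protobuf.FieldMask read_mask = 2;
}
The GraphQL equivalent is just the query’s selection set:
query {
  user(id: "1") {
    name
    email
  }
}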
Another benefit of gRPC’s binary format is faster serialization and parsing of messages compared to GraphQL’s text messages. The downside is that it’s harder to view and debug than the human-readable JSON. We at Temporal use protobuf’s JSON format by default for the visibility benefit to developer experience. (That loses the efficiency that came with the binary format, but users who value the efficiency more can switch to binary.)
Defaults
gRPC also doesn’t include default values in messages (GraphQL can declare defaults for arguments, but not for request fields or response types). This is another factor in gRPC messages’ smaller size. It also affects the DX of consuming a gRPC API: there’s no distinction between leaving an input field unset and setting it to the default value, and the default value is determined by the field’s type. All booleans default to false, and all numbers and enums default to 0. We can’t make an arbitrary value like `BEHAVIOR_FOO = 2` the default for a `behavior` enum input field: we either have to put the desired default first (`BEHAVIOR_FOO = 0`), which means it will always be the default from then on, or follow the recommended practice of having a `BEHAVIOR_UNSPECIFIED = 0` enum value:
enum Behavior {
  BEHAVIOR_UNSPECIFIED = 0;
  BEHAVIOR_FOO = 1;
  BEHAVIOR_BAR = 2;
}
The API provider needs to communicate what UNSPECIFIED means (by documenting “unspecified will use the default behavior, which is currently FOO”), and the consumer needs to think about whether the server’s default behavior may change in the future (if the server saves the provided UNSPECIFIED / 0 value on some business entity the consumer is creating, and the server later changes the default behavior, the entity will start behaving differently) and whether that would be desired. If it wouldn’t, the client needs to explicitly set the value to the current default. Here’s an example scenario:
service ExampleGrpcService {
  rpc CreateEntity (CreateEntityRequest) returns (CreateEntityResponse) {}
}

message CreateEntityRequest {
  string name = 1;
  Behavior behavior = 2;
}
If we do:
const request = new CreateEntityRequest({ name: "my entity" })
service.CreateEntity(request)
we’ll be sending BEHAVIOR_UNSPECIFIED, which, depending on the server implementation and future changes, might mean BEHAVIOR_FOO now and BEHAVIOR_BAR later. Or we can do:
const request = new CreateEntityRequest({ name: "my entity", behavior: Behavior.BEHAVIOR_FOO })
service.CreateEntity(request)
to be certain the behavior is stored as FOO and will remain FOO.
The equivalent GraphQL schema would be:
type Mutation {
  createEntity(name: String, behavior: Behavior = FOO): Entity
}

enum Behavior {
  FOO
  BAR
}
When we don’t include behavior in the request, the server code will receive and store FOO as the value, matching the = FOO default in the schema above.
graphqlClient.request(`
  mutation {
    createEntity(name: "my entity")
  }
`)
Compared with the gRPC version, it’s simpler to know what will happen when the field isn’t provided, and we don’t need to consider whether to pass the default value ourselves.
Other types’ defaults have their own quirks. For numbers, sometimes the default 0 is a valid value, and sometimes it stands in for a different default value. For booleans, the default false results in negatively named fields. When we’re naming a boolean variable while coding, we use the positive name: for instance, we’d usually declare `let retryable = true` rather than `let nonRetryable = false`. People generally find the former more readable, as the latter takes an extra step to untangle the double negative (“nonRetryable is false, so it’s retryable”). But if we have a gRPC API in which we want the default state to be retryable, then we have to name the field `nonRetryable`, because the default of a `retryable` field would be false, like all booleans in gRPC.
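For example, here’s a minimal sketch of that naming workaround (the `RetryPolicy` message is hypothetical):
message RetryPolicy {
  // Named "non_retryable" rather than "retryable" so that the proto3 default
  // (false) means the policy is retryable.
  bool non_retryable = 1;
}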
Request format
In gRPC, we call methods one at a time. If we need more data than a single method provides, we need to call multiple methods. And if we need response data from the first method in order to know which method to call next, then we’re doing multiple round trips in a row. Unless we’re in the same data center as the server, that causes a significant delay. This issue is called underfetching.
This is one of the issues GraphQL was designed to solve. It’s particularly important over high-latency mobile phone connections to be able to get all the data you need in a single request. In GraphQL, we send a string (called a document) with our request that includes all the methods (called queries and mutations) we want to call and all the nested data we need based on the first-level results. Some of the nested data may require subsequent requests from the server to the database, but they’re usually located in the same data center, which should have sub-millisecond network latency.
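For instance, a single document can fetch a user along with nested data that would otherwise take several round trips (the `currentUser` and `orders` fields are hypothetical, and `graphqlClient` is the same generic client used in the earlier example):
const { currentUser } = await graphqlClient.request(`
  query {
    currentUser {
      name
      orders {
        id
        items {
          name
          price
        }
      }
    }
  }
`)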
GraphQL’s request flexibility lets front-end and back-end teams become less coupled. Instead of the front-end developers waiting for the back-end developers to add more data to a method’s response (so the client can receive the data in a single request), the front-end developers can add more queries or nested result fields to their request. When there’s a GraphQL API that covers the organization’s entire data graph, the front-end team gets blocked waiting for backend changes much less frequently.
The fact that the GraphQL request specifies all desired data fields means that the client can use declarative data fetching: instead of imperatively fetching data (like calling `grpcClient.callMethod()`), we declare the data we need next to our view component, and the GraphQL client library combines those pieces into a single request and provides the data to the components when the response arrives and later when the data changes. The parallel for view libraries in web development is using React instead of jQuery: declaring how our components should look and having them automatically update when data changes instead of imperatively manipulating the DOM with jQuery.
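Here’s a minimal sketch of declarative fetching with Apollo Client’s `useQuery` hook in a React component (the `user` query is hypothetical):
import { gql, useQuery } from '@apollo/client'

const USER_QUERY = gql`
  query GetUser($id: ID!) {
    user(id: $id) {
      name
      email
    }
  }
`

function UserCard({ id }: { id: string }) {
  // Declare the data this component needs; the client library fetches it,
  // caches it, and re-renders the component when it changes.
  const { data, loading } = useQuery(USER_QUERY, { variables: { id } })
  if (loading) return <p>Loading…</p>
  return <p>{data.user.name} ({data.user.email})</p>
}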
Another effect GraphQL’s request format has is increased visibility: the server sees each field that’s requested. We can track field usage and see when clients have stopped using deprecated fields, so that we know when we can remove them as opposed to forever supporting something that we said we’d get rid of. Tracking is built into common tools like Apollo GraphOS and Stellate.
Forward compatibility
Both gRPC and GraphQL have good forward compatibility; that is, it’s easy to update the server in a way that doesn’t break existing clients. This is particularly important for mobile apps that may be out of date, but also necessary in order for SPAs loaded in users’ browser tabs to continue working after a server update.
In gRPC, you can maintain forward compatibility by giving fields stable numbers, adding fields with new numbers, and not changing the types or numbers of existing fields. In GraphQL, you can add fields, deprecate old fields with the `@deprecated` directive (and leave them functioning), and avoid changing optional arguments to be required.
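As a sketch, extending the hypothetical `CreateEntityRequest` from earlier: in gRPC we add a field with a new number, and in GraphQL we add fields and deprecate old ones.
message CreateEntityRequest {
  string name = 1;
  Behavior behavior = 2;
  // Added later with a new field number; old clients simply never set it.
  string description = 3;
}
And in GraphQL (the `Entity` fields are hypothetical):
type Entity {
  name: String
  behavior: Behavior
  oldField: String @deprecated(reason: "Use newField instead")
  newField: String
}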
Transport
Both gRPC and GraphQL support the server streaming data to the client: gRPC has server streaming and GraphQL has Subscriptions and the directives @defer, @stream, and @live. gRPC’s HTTP/2 also supports client and bidirectional streaming (although we can’t do bidirectional when one side is a browser). HTTP/2 also has improved performance through multiplexing.
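For example, adding a streaming method to the example service from earlier (the `WatchEntities` RPC and `entityUpdated` subscription are hypothetical):
service ExampleGrpcService {
  // Server streaming: one request, a stream of responses.
  rpc WatchEntities (WatchEntitiesRequest) returns (stream Entity) {}
}
The GraphQL counterpart is a subscription:
type Subscription {
  entityUpdated(id: ID!): Entity
}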
gRPC has built-in retries on network failure, whereas in GraphQL, retries depend on your particular client library (for example, Apollo Client’s RetryLink). gRPC also has built-in deadlines.
There are also some limitations of the transports. gRPC is unable to use most API proxies like Apigee Edge that operate on HTTP headers, and when the client is a browser, we need to use a gRPC-Web proxy or Connect (while modern browsers do support HTTP/2, there aren’t browser APIs that allow enough control over the requests). By default, GraphQL doesn’t work with GET caching: much of HTTP caching only works on GET requests, and most GraphQL libraries default to using POST. GraphQL has a number of options for using GET, including putting the operation in a query parameter (viable when the operation string isn’t too long), build-time persisted queries (usually just used with private APIs), and automatic persisted queries. Cache directives can be provided at the field level (the shortest value in the whole response is used for the Cache-Control header’s `max-age`).
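Here’s a minimal sketch of the query-parameter option, assuming the server accepts GET requests per the GraphQL-over-HTTP convention (the `currentUser` field is hypothetical):
const query = '{ currentUser { name } }'
// A GET request like this can be cached by standard HTTP infrastructure.
const response = await fetch(`/graphql?query=${encodeURIComponent(query)}`)
const { data } = await response.json()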
Schema and types
GraphQL has a schema that the server publishes for client devs and uses to process requests. It defines all the possible queries and mutations and all the data types and their relations to each other (the graph). The schema makes it easy to combine data from multiple services. GraphQL has the concepts of schema stitching (imperatively combining multiple GraphQL APIs into a single API that proxies parts of the schema) and federation (each downstream API declares how to associate shared types, and the gateway automatically resolves a request by making requests to downstream APIs and combining the results) for creating a supergraph (a graph of all our data that combines smaller subgraphs / partial schemas). There are also libraries that proxy other protocols to GraphQL, including gRPC.
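In Apollo Federation, for instance, each subgraph declares how its types can be joined using the `@key` directive (the `Product` type is hypothetical):
type Product @key(fields: "id") {
  id: ID!
  name: String
  price: Int
}
The gateway can then resolve a query that spans subgraphs by looking up the `Product` by `id` in whichever subgraph owns each requested field.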
Along with GraphQL’s schema comes further developed introspection: the ability to query the server in a standard way to determine what its capabilities are. All GraphQL server libraries have introspection, and there are advanced tools based on introspection like GraphiQL, request linting with graphql-eslint, and Apollo Studio, which includes a query IDE with field autocompletion, linting, autogenerated docs, and search. gRPC has reflection, but it’s not as widespread, and there’s less tooling that uses it.
The GraphQL schema enables a reactive normalized client cache: because each (nested) object has a type field, types are shared between different queries, and we can tell the client which field to use as an ID for each type, the client can store data objects normalized. This enables advanced client features, such as a query result or optimistic update triggering updates to view components that depend on different queries that include the same object.
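With Apollo Client, for example, we can tell the cache which field identifies each type (it defaults to `id`); the hypothetical `Entity` type below is then stored once and shared across every query result that contains it:
import { ApolloClient, InMemoryCache } from '@apollo/client'

const client = new ApolloClient({
  uri: '/graphql',
  cache: new InMemoryCache({
    typePolicies: {
      // Entity objects are normalized: stored once, keyed by id, and referenced
      // from every query result that includes them.
      Entity: { keyFields: ['id'] },
    },
  }),
})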
There are a few differences between gRPC and GraphQL types:
- Protocol buffers version 3 (proto3, the latest as of writing) does not have required fields: instead, every field has a default value. In GraphQL, the server can differentiate between a value being present and absent (null), and the schema can indicate that an argument must be present or that a response field will always be present.
- In gRPC, there is no standard way to know whether a method will mutate state (vs GraphQL, which separates queries and mutations).
- Maps are supported in gRPC but not in GraphQL: if you have a data type like `{[key: string]: T}`, you need to use a JSON string type for the whole thing (see the sketch after this list).
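Here’s a sketch of that last difference with a hypothetical `labels` field:
message Entity {
  map<string, string> labels = 1;
}
versus the GraphQL workaround:
type Entity {
  # JSON-encoded map, e.g. '{"env": "prod"}'
  labels: String
}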
A downside of GraphQL’s schema and flexible queries is that rate limiting is more complex for public APIs (for private APIs, we can allowlist our persisted queries). Since we can include as many queries as we’d like in a single request, and those queries can ask for arbitrarily nested data, we can’t just limit the number of requests from a client or assign cost to different methods. We need to implement cost analysis rate limiting on the whole operation, for example by using the graphql-cost-analysis library to sum individual field costs and pass them to a leaky bucket algorithm.
Summary
Here’s a summary of the topics we’ve covered:
Similarities between gRPC and GraphQL
- Typed interfaces with codegen
- Abstract away the network layer
- Can have JSON responses
- Server streaming
- Good forward compatibility
- Can avoid overfetching
gRPC
Strengths
- Binary format:
  - Faster transfer over network
  - Faster serializing, parsing, and validation
  - However, harder to view and debug than JSON
- HTTP/2:
  - Multiplexing
  - Client and bidirectional streaming
- Built-in retries and deadlines
Weaknesses
- Need proxy or Connect to use from the browser
- Unable to use most API proxies
- No standard way to know whether a method will mutate state
GraphQL
Strengths
- Client determines which data fields it wants returned. Results in:
  - No underfetching
  - Team decoupling
  - Increased visibility
- Easier to combine data from multiple services
- Further developed introspection and tooling
- Declarative data fetching
- Reactive normalized client cache
Weaknesses
- If we already have gRPC services that can be exposed to the public, it takes more backend work to add a GraphQL server.
- HTTP GET caching doesn’t work by default.
- Rate limiting is more complex for public APIs.
- Maps aren’t supported.
- Inefficient text-based transport.
Verdict
Server-to-server
In server-to-server communication, where low latency is often important, and more types of streaming are sometimes necessary, gRPC is the clear standard. However, there are cases in which we may find some of the benefits of GraphQL more important:
- We’re using GraphQL federation or schema stitching to create a supergraph of all our business data and decide to have GraphQL subgraphs published by each service. We create two supergraph endpoints: one external to be called by clients and one internal to be called by services. In this case, it may not be worth it for services to also expose a gRPC API, because they can all be conveniently reached through the supergraph.
- We know our services’ data fields are going to be changing and want field-level visibility on usage so that we can remove old deprecated fields (and aren’t stuck with maintaining them forever).
There’s also the question of whether we should be doing server-to-server communication ourselves at all. For data fetching (GraphQL’s queries), it’s the fastest way to get a response, but for modifying data (mutations), things like Martin Fowler’s “synchronous calls considered harmful” (see sidebar here) have led to using async, event-driven architecture with either choreography or orchestration between services. Microservices Patterns recommends using the latter in most cases, and to maintain DX and development speed, we need a code-based orchestrator instead of a DSL-based one. And once we’re working in a code-based orchestrator like Temporal, we no longer make network requests ourselves—the platform reliably handles it for us. In my opinion, that’s the future.
Client-server
In client-server communication, latency is high. We want to be able to get all the data we need in a single round trip, have flexibility in what data we fetch for different views, and have powerful caching, so GraphQL is the clear winner. However, there are cases in which we may choose to use gRPC instead:
- We already have a gRPC API that can be used, and the cost of adding a GraphQL server in front of that isn’t worth the benefits.
- JSON is not a good fit for the data (e.g. we’re sending a significant amount of binary data).
I hope this article aided your understanding of the protocols and when to use them! If you’d like to learn more about GraphQL, check out their site or my book, The GraphQL Guide. For more about gRPC, here’s their site and documentation.
Thanks to Marc-André Giroux, Uri Goldshtein, Sashko Stubailo, Morgan Kestner, Andrew Ingram, Lenny Burdette, Martin Bonnin, James Watkins-Harvey, Josh Wise, Patrick Rachford, and Jay Miller for reading drafts of this.