February 12, 2021

What could better cross-(micro)service messaging look like?

For anyone who knows me at all, this is something I’ve tinkered with at great length in my pet project around message queuing / pubsub / HTTP proxying, singyeong. Lately, I’ve been thinking about the interface to it, and wondering if I’ve been approaching this from the wrong direction.

As a quick recap, singyeong is a message queue + pubsub + HTTP proxy that allows routing payloads via client metadata. To show what this means in an example, suppose you were running some large multiplayer game server. You’re most likely splitting up players across a bunch of servers, and you need to be able to send a message to the server holding a specific player, for any number of reasons. You might do this with:

pubsub + discarding if the player isn’t on that server, but then you have to pay the price of a fanout to every game server instance, which could add up at scale.
consistent hashing, which is a pretty optimal solution to this problem.
maintaining your own internal “database” of what players are where, and separately publishing messages to servers based on that.

singyeong chooses the third approach listed. All clients update the server with some metadata about themselves, which is stored internally in an indexed sorta-key-value store (Mnesia!). When sending a message, the message provides a routing query describing where it wants to go, and the server figures out how to get it there. For example, you might say “send this message to the game-server instance where 1234 in current_player_ids.” How that is actually routed is opaque to the calling client, as it’s dynamically queried on the server and resolved where possible.
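As a rough illustration of that idea (not singyeong's actual implementation or API), here's a tiny sketch that resolves the "1234 in current_player_ids" query against an in-memory list of clients. The `RoutingSketch` module, the shape of the client maps, and the `resolve/4` function are all assumptions made up for this example; the real thing stores metadata in Mnesia and has a much richer query language.

```elixir
defmodule RoutingSketch do
  @moduledoc """
  Hypothetical sketch: resolve a routing query against client metadata.
  This is not singyeong's real implementation or API.
  """

  # Returns every client of `app` whose metadata list at `key` contains `value`.
  def resolve(clients, app, key, value) do
    Enum.filter(clients, fn client ->
      client.app == app and value in Map.get(client.metadata, key, [])
    end)
  end
end

# Made-up clients, each reporting its own metadata to the "server".
clients = [
  %{app: "game-server", id: "gs-1", metadata: %{"current_player_ids" => [1, 2, 3]}},
  %{app: "game-server", id: "gs-2", metadata: %{"current_player_ids" => [1234, 5678]}},
  %{app: "matchmaker", id: "mm-1", metadata: %{"current_player_ids" => [1234]}}
]

# The sender only describes where the message should go; it never names gs-2.
targets = RoutingSketch.resolve(clients, "game-server", "current_player_ids", 1234)
IO.inspect(Enum.map(targets, & &1.id))
```

The sender never learns which instance holds the player; it only states the condition, and the lookup happens wherever the metadata lives.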

While in this particular example, a consistent hashing approach works nicely, singyeong allows for some pretty complex stuff to route messages:

“Send this player to the server that: isn’t full, is in region us-east, running version 1.2.3 for v1.2.x backwards-compat, selected by the minimum player count.”
“Spawn this container on a machine in the staging-api cluster, where the candidate machines have enough CPU and RAM unreserved, and specifically choose the machine that has the smallest number of containers.”
“Push this to the Discord bot worker node that handles guild id 12345.”
…
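The first of those queries can be sketched as "filter the candidates, then pick one by a selector". This is only a toy model: `SelectorSketch`, the metadata keys (`"players"`, `"max_players"`, and so on), and the reading of "v1.2.x backwards-compat" as "any 1.2.x version" are assumptions for the example, not how singyeong expresses queries.

```elixir
defmodule SelectorSketch do
  # Hypothetical: keep the clients whose metadata passes every filter,
  # then choose the one with the smallest value at `min_key`.
  # Returns nil when nothing matches.
  def pick(clients, filters, min_key) do
    clients
    |> Enum.filter(fn client ->
      Enum.all?(filters, fn filter -> filter.(client.metadata) end)
    end)
    |> Enum.min_by(fn client -> Map.fetch!(client.metadata, min_key) end, fn -> nil end)
  end
end

# Made-up game servers and their self-reported metadata.
servers = [
  %{id: "gs-1", metadata: %{"region" => "us-east", "version" => "1.2.3", "players" => 40, "max_players" => 64}},
  %{id: "gs-2", metadata: %{"region" => "us-east", "version" => "1.2.5", "players" => 12, "max_players" => 64}},
  %{id: "gs-3", metadata: %{"region" => "us-east", "version" => "1.3.0", "players" => 2, "max_players" => 64}},
  %{id: "gs-4", metadata: %{"region" => "eu-west", "version" => "1.2.3", "players" => 1, "max_players" => 64}}
]

filters = [
  # isn't full
  fn m -> m["players"] < m["max_players"] end,
  # is in region us-east
  fn m -> m["region"] == "us-east" end,
  # assumed meaning of "v1.2.x backwards-compat": any 1.2.x release
  fn m -> Version.match?(m["version"], "~> 1.2.0") end
]

chosen = SelectorSketch.pick(servers, filters, "players")
IO.inspect(chosen.id)
```

Filtering and selecting stay separate steps here, which is what lets the same shape of query cover the container-scheduling example too: swap the filters for CPU and RAM checks and select by container count.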

However, this approach still requires you to always specify how a message is sent to a client. Specifically, is it:

Pub/sub?
Send-to-single-client?
Dropped into a message queue?
Sent as a proxied HTTP request?

Beyond that, you also have to be confident about a bunch of things:

Is this the right message to send to the target service?
Will it send the message I expect back?
What if it disappears in the middle of processing my request?
What about failover and load-balancing for sending messages?
…

singyeong takes care of failover and load-balancing for you, transparently, by moving the routing layer into itself and out of your services. That way, you bypass the potential issues that DNS brings, and you can be confident that if it’s possible to route your message, it will be routed. singyeong also handles services disappearing in the middle of request processing, via message queues and HTTP proxying. The former has dead-letter queues (DLQs) for all queues, and the latter will just return an error. As long as you handle idempotency correctly, and use the right messaging primitives, singyeong can do its best to guarantee at-least-once delivery, assuming you work with it to get there in some cases.
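"Handling idempotency correctly" is the part that stays on your side of the fence. With at-least-once delivery, the same message can show up twice, so the consumer has to make the second copy a no-op. A minimal sketch, assuming every message carries a unique `id` (the `IdempotentConsumer` module and the message shape are made up for illustration):

```elixir
defmodule IdempotentConsumer do
  # Hypothetical consumer state: the set of message ids already processed,
  # plus a running total that must not be double-counted.
  def new, do: %{seen: MapSet.new(), total: 0}

  # Applies a message once; redelivered copies leave the state untouched.
  def handle(state, %{id: id, amount: amount}) do
    if MapSet.member?(state.seen, id) do
      state
    else
      %{state | seen: MapSet.put(state.seen, id), total: state.total + amount}
    end
  end
end

# "m1" is delivered twice, as can happen under at-least-once delivery.
messages = [
  %{id: "m1", amount: 5},
  %{id: "m2", amount: 7},
  %{id: "m1", amount: 5}
]

final =
  Enum.reduce(messages, IdempotentConsumer.new(), fn message, state ->
    IdempotentConsumer.handle(state, message)
  end)

IO.inspect(final.total)
```

In a real service the set of seen ids would need to live somewhere durable and be pruned eventually; the in-memory `MapSet` only shows the shape of the check.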

However, this doesn’t solve the first two problems mentioned. While exciting new tools like protobufs, gRPC, GraphQL, and many more all solve this problem in their own way, they still have drawbacks. Specifically, when using these tools, you’ll always end up with either an ad-hoc implementation of all the relevant bits and pieces for each service, or you end up with an extra shared code library, or… And, to single out protobufs for a moment, you can see plenty of articles claiming that protobufs got it wrong, or that they don’t make sense in some relatively-common cases, and, of course, the many opinions that HN has about things. Tho y'know, if you’re basing this decision off of what HN thinks, I have a lot of questions…

Anyway, the point that this is getting at is that messaging is hard. And if after all this time it’s still not solved? Well, that’s just a sign that maybe we can do better (:

You might notice that singyeong doesn’t solve these problems directly either. While it does provide building blocks therefor, I think these problems are at a slightly-higher layer than the messaging layer.

So what might this look like?

Rather than try to finagle a bunch of words together, I think it’ll be easier to just show some code:

# Elixir!
Magic.Singyeong.send_message %Message{
  # Expect a response of this type, and marshal the response data
  # into this.
  # If this isn't provided or is `nil`, no response will be expected.
  # By inspecting this value, it can be determined whether to send
  # an HTTP request, or something else.
  expect_response: MyApp.Payload.SomeOtherSpecificPayload,

  # If we're not expecting a response, we can choose to fanout
  # messages to things matching the query instead.
  # Someday `fanout` will be compatible with `expect_response`
  fanout: false,

  # If neither of those are happening, then we can choose whether
  # it's being pushed to a client, or if the client has to pull
  # instead.
  queue: "queue-name", # or `false`/`nil` to not queue

  # And based on just these three fields, it can be decided
  # whether to send an HTTP request, push to a client, push
  # to a queue, or fanout-push to many clients.

  # Routing query
  # This ensures that the target selected takes the given input
  # payload and returns the target output payload.
  target: %Query{
    app: "backend-api",
    ops: [
      %Op{
        # Input objects contains
        path: "/docs/in/names",
        op: "$contains",
        to: %{value: "MyApp.Payload.SomeSpecificPayload"},
      },
      %Op{
        # Output objects contains
        path: "/docs/out/names",
        op: "$contains",
        to: %{value: "MyApp.Payload.SomeOtherSpecificPayload"},
      },
    ],
  },
  # Actual payload
  payload: whatever,
}

While this seems like it’s just weasel-wording the difference between all those things, there’s a meaningful difference: The existing form that’s currently in use requires you to think about the transport used to move them around, whereas the hypothetical form above pushes you to think about the semantics of the message. You don’t think about “this message is HTTP” / “this message is pubsub / fanout” / etc., but you think about “this message needs a reply” / “this message is pushed out to lots of clients directly” / etc.
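To make that a bit more concrete, here's a rough sketch of how a client library might derive the transport from those three semantic fields. Everything in it is hypothetical: `DeliverySketch`, its return values, and the use of plain maps instead of a `%Message{}` struct are stand-ins for illustration, and none of this exists in singyeong today.

```elixir
defmodule DeliverySketch do
  # Hypothetical: pick a delivery mode purely from what the message means.
  # Clause order mirrors the comments in the example message: a reply wins,
  # then fanout, then queueing, and a plain push is the fallback.

  # "This message needs a reply" -> proxied HTTP request.
  def mode(%{expect_response: type}) when not is_nil(type), do: :http_request

  # "This message goes out to lots of clients directly" -> fanout push.
  def mode(%{fanout: true}), do: :fanout_push

  # "The client pulls this when it's ready" -> named queue.
  def mode(%{queue: queue}) when is_binary(queue), do: {:queue, queue}

  # Nothing special requested -> push to a single client.
  def mode(_message), do: :push
end

needs_reply = %{expect_response: MyApp.Payload.SomeOtherSpecificPayload, fanout: false, queue: nil}
broadcast = %{expect_response: nil, fanout: true, queue: nil}
queued = %{expect_response: nil, fanout: false, queue: "queue-name"}
direct = %{expect_response: nil, fanout: false, queue: false}

IO.inspect(Enum.map([needs_reply, broadcast, queued, direct], &DeliverySketch.mode/1))
```

The caller never writes "HTTP" or "pubsub" anywhere; the transport falls out of the answers to "does this need a reply?", "does this go to many clients?", and "does the client pull this?".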

However, this code is still too verbose – after all, why should I have to specify all of this? Heck, I named my example module for this Magic for a reason: It should just fucking work and not make me care about this. But the singyeong layer is really convenient as-is, and I’m not sure I want to shove all of this stuff into it. So I could, say, squish some of this into an app-layer client library, and then write a smol singyeong plugin that can handle these sorts of messages appropriately, perhaps by peeking at the relevant services’ metadata.

So does this like, exist?

Not yet!

I ran into this as a part of creating mahou, my prototype microservices infrastructure layer. I’m slowly realising more and more that the semantics matter more than anything else. Ideally, this would just be straight-up transparent to the calling client, and it can just say “move this message from here to there” and all of this stuff is figured out automagically.

You should check out mahou on GitHub if you’re interested in this sort of stuff.

Thanks for reading ^^
