DNS and the December 2024 OpenAI Outage

In December 2024, OpenAI faced a significant outage lasting approximately four hours. This incident highlighted a critical challenge in container orchestration: maintaining reliable service discovery when the control plane fails. An action item in their post-mortem hints that this is not an issue isolated to them, but a broader vulnerability in Kubernetes' DNS configuration around the separation of the control plane and data plane:

Decouple the Kubernetes data plane and control plane - Our dependency on Kubernetes DNS for service discovery creates a link between our Kubernetes data plane and control plane.

This blog post will investigate the root cause of this issue and explore how Kubernetes could be made more robust to this type of outage.

One big caveat to this investigation is that the OpenAI post-mortem has some ambiguity, particularly regarding DNS cache records expiring after 20 minutes, a behavior that diverges from Kubernetes' default settings. More than that, it does not even specify the DNS service in use. So, given that CoreDNS is the de facto standard for self-hosted Kubernetes DNS, let’s examine how its default configuration manages availability in the face of the control plane dependency.

Could a Default Setup Have Had This Outage?

CoreDNS’s default TTL (Time-to-Live) settings are relatively low, ensuring that pod changes propagate quickly to clients. There is some confusion online about whether the default TTL is 5 seconds or 30 seconds. Diving in, CoreDNS itself defaults to 5 seconds, but the default Kubernetes configuration overrides it to 30 seconds - here is that config in a 1.32 cluster installed from https://pkgs.k8s.io:

kubernetes cluster.local in-addr.arpa ip6.arpa {
    pods insecure
    fallthrough in-addr.arpa ip6.arpa
    ttl 30
}
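
If you want to check what your own cluster is using, the Corefile normally lives in the coredns ConfigMap in kube-system (the ConfigMap name can vary between distributions), so something like this should show the ttl setting:

# Print the live Corefile (kubeadm-style cluster; ConfigMap name may differ):
$ kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}'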

30 seconds is clearly a very short time to tolerate an API Server brownout. To mitigate such a failure, the CoreDNS Kubernetes plugin proactively caches resources using an Indexer Informer. You can find the code that initializes it here. This design allows CoreDNS to have the lookup data in memory regardless of an API Server outage, theoretically decoupling the control plane from the data plane. 

But clearly that was not enough here, so let's look deeper into the code. Maybe CoreDNS has logic to crash out and restart when it can’t contact the API Server any more? That does not appear to be the case. Here you can see the client sits in an infinite retry loop on failure, without dumping the cache or exiting the process.

Could it be the local kubelet deciding to kill CoreDNS based on failed health checks, say because of stale data? It does not seem so. We can see in the source that the health check itself is just a shallow check that the process is still up (the readiness probe is a deeper check, but a failing readiness probe does not cause the kubelet to terminate the process).
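
For reference, the probes in a stock kubeadm CoreDNS Deployment look roughly like the abridged snippet below (ports and paths can differ between distributions). Only the liveness probe against the health plugin can trigger a restart, and in a default setup it does not depend on the API Server being reachable:

# Abridged probe configuration from a typical CoreDNS Deployment:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 5
readinessProbe:
  httpGet:
    path: /ready
    port: 8181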

To test all this I created a Kubernetes cluster, deployed a test pod, and caused an API server outage:

# On the api-server node:
$ mv /etc/kubernetes/manifests/kube-apiserver.yaml .
$ kubectl get pods
The connection to the server 172.31.11.81:6443 was refused - did you specify the right host or port?

# On a node with a test pod:
$ crictl exec -it aea55c844157c sh -c 'while true; do echo "$(date '+%H:%M:%S') $(dig +short kubernetes.default.svc.cluster.local)"; sleep 300; done'
20:20:31 10.96.0.1
20:25:31 10.96.0.1
20:30:31 10.96.0.1
20:35:31 10.96.0.1
20:40:31 10.96.0.1
20:45:31 10.96.0.1
20:50:31 10.96.0.1
20:55:31 10.96.0.1
21:00:31 10.96.0.1
21:05:31 10.96.0.1
21:10:31 10.96.0.1
21:15:31 10.96.0.1

This shows that cached DNS records remained available for over an hour, verifying that after a control plane outage of the API Server, a default CoreDNS setup will continue to serve data plane DNS requests.
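
To bring the control plane back afterwards, the static pod manifest just needs to be moved back into place so the kubelet restarts the API Server:

# On the api-server node:
$ mv ./kube-apiserver.yaml /etc/kubernetes/manifests/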

So what actually happened?

We know from the post-mortem that OpenAI later had a thundering-herd problem with DNS queries hitting their API Servers, as they also had node-level caches that were expiring. But if those node-level caches were simply refreshing against the CoreDNS pods, they would still have been served out of CoreDNS’s in-memory cache.
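
For context, a node-level cache is typically a small CoreDNS (or NodeLocal DNSCache) instance on every node, with a Corefile along the lines of the sketch below. The TTL and upstream address here are illustrative defaults, not OpenAI's actual configuration - the point is simply that once these short-lived entries expire, every node goes back upstream at once:

cluster.local:53 {
    # Serve cached answers for up to 30 seconds, then refresh from the cluster DNS service
    cache 30
    forward . 10.96.0.10
}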

You can see from blog posts like this one that OpenAI was doing plenty of in-house customization because of their scale. So one of two other things also had to be true:

  1. The caches were empty because something restarted the CoreDNS pods.
  2. OpenAI was not using CoreDNS (or not using it in its default configuration).

I admit leaving it here is unsatisfying, but that's all we can do without knowing more about their setup. The main takeaway is that with the default Kubernetes CoreDNS setup, the API Server being down is not enough on its own to take down the data plane; it takes the API Server being down and then the CoreDNS pods being restarted by some second agent.
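
If you want to reproduce that second ingredient yourself, you can force a restart while the API Server is still down by stopping the CoreDNS container directly on its node. The kubelet will restart it, but it comes back up unable to refill its cache from the unreachable API Server. I have not included the output here, as the exact failure mode varies between versions:

# On the node running CoreDNS, while the API Server is still down:
$ crictl ps --name coredns            # find the CoreDNS container ID
$ crictl stop <coredns-container-id>  # kubelet restarts it with an empty cache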

Could the CoreDNS Architecture Be Improved?

Since the DNS data plane serves up pod IPs, which change via the control plane, there is always going to be some coupling between the two. In the default architecture, you can see how the control and data planes are tied together in this diagram, with all potential data plane dependencies marked in red:

The action item in the post-mortem suggests that a better separation of control and data plane can lead to better data plane availability. However, for the default case, we have seen that Kubernetes has already put a good amount of effort into breaking this dependency by using eager caching. But it does still have an issue if the CoreDNS pods are restarted. That raises the question: is there a better architecture?

I believe the answer is yes, with a big caveat I will come to, which means it isn’t something I would recommend except to those at high scale. During my time on the EC2 Networking team at AWS, repeatedly tackling issues like this as we grew, we learned the importance of an architectural principle called "static stability." As AWS describes it:

“...systems operate in a static state and continue to operate as normal without the need to make changes during the failure or unavailability of dependencies. One way we do this is by preventing circular dependencies in our services that could stop one of those services from successfully recovering. Another way we do this is by maintaining existing state.” 

There’s a good argument that, by eagerly fetching from the API Server, Kubernetes does a pretty good job of the “operating as normal during the failure or unavailability of dependencies” part of this. What it does not do, though, is attempt to “maintain existing state” across CoreDNS pod restarts. I admit this is a slightly controversial stance; anyone who remembers the early days of Kubernetes will remember that statelessness was pushed as the ultimate virtue of all applications. But I think it is clear now that there are large classes of applications which need persistence of data.

Because its availability is so critical, data plane lookup is one such place where persistence is needed, with the rationale captured in the final sentence of the AWS post:

“Thus, eliminating dependencies on control planes (the APIs that implement changes to resources) in your recovery path helps produce more resilient workloads.” Saying it again: static stability is about recovering from unknown unknowns, and so a pure in-memory cache is not sufficient.

To address this architectural issue, then, we need not just to pre-emptively pull lookup data from the API Server, but also to persist it in a highly available manner. So the architecture changes to:

Having proposed this, there is still the big caveat - this improved static stability of the data plane is not a free lunch. It increases the amount of infrastructure, which costs money, and, worse, it is clearly more work to operate. In fact, at small scale, the operational load of getting persistence right is more likely to cause an outage than the problem we are trying to mitigate. That is an inherent tension - Kubernetes has to straddle the line between being cheap and simple to set up at small scale, while also being able to be highly robust at scale. I think its success shows it has done a pretty good job of balancing those concerns. But it is a balance.
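
To make the "maintain existing state" idea a little more concrete despite that caveat, here is one very rough sketch of the kind of thing I mean: periodically snapshotting the cluster's service records to durable storage that survives both an API Server outage and a DNS pod restart. The path, schedule, and record format here are all hypothetical, and a real implementation would also need to handle headless services, pod records, and EndpointSlices:

# Hypothetical snapshot job: dump ClusterIP service records to a zone-style file
# on durable storage (e.g. a PersistentVolume the DNS pods can read at startup).
$ kubectl get services -A -o json \
    | jq -r '.items[]
        | select(.spec.clusterIP != null and .spec.clusterIP != "None")
        | "\(.metadata.name).\(.metadata.namespace).svc.cluster.local. 30 IN A \(.spec.clusterIP)"' \
    > /mnt/dns-snapshot/cluster.db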

You shouldn’t need to run your own service discovery architecture

Back in my circa 2006 early days at Amazon.com, before I joined AWS and well before AWS was in any way capable of meeting Amazon’s reliability needs, DNS was at a level of infrastructure called “tier 0” - it was more or less unthinkable that it could go down, particularly across availability zones. Part of the “tier 0” data plane mindset was the infrastructure team keeping things as simple as possible, and if that meant it took 2 hours to provision a server for a new application, that was actually a good thing, as it limited the risk that came with flexibility and elasticity.

Now, in the 19 years since, AWS has shown that, with rigor around architecture design and excellence in operations, it is possible to build “tier 0” systems that do give that flexibility and elasticity at incredible scale - hence, of course, its success. So when companies decide to run their own flexible and elastic tier 0 systems based on widely used open source, they are taking a long-term risk as they scale up, because those open source projects already face a difficult compromise between architecting for small and for large scale deployments.

So just use vendor SaaS then? Well, container orchestration presents a particular challenge for networking: if you can stay on one cloud, you can take advantage of innovations like EKS Auto Mode or VPC Lattice that in-house all the hard tradeoffs, so you benefit from the vendor's expertise and scale. However, every vendor's version of Kubernetes networking is slightly different, particularly as they want to integrate it into their own networking and identity ecosystems. This means that if you have multi-cloud or on-premise requirements, you're left operating your own tier 0 open source data plane.

That is why at Junction we are solving the multi-cloud and on-prem service discovery challenge: building a reliable data and control plane architecture, and the culture around operating it to match, that works at any scale. Junction lets you focus on building your applications, rather than operating open source and incompatible vendor data plane infrastructure.

Thanks to Laurent Bernaille from Datadog for his feedback in writing this blog post.
