OpenAI blames a “new telemetry service” for the massive ChatGPT outage

OpenAI says a failed "new telemetry service" caused one of the longest outages in its history.

On Wednesday, OpenAI’s AI-powered chatbot platform, ChatGPT; its video generator, Sora; and its developer-facing API experienced significant disruptions beginning at approximately 3:00 p.m. Pacific Time. OpenAI acknowledged the problem soon after and began working on a fix, but it took the company about three hours to restore all services.

In a postmortem published late Thursday, OpenAI wrote that the outage wasn’t caused by a security incident or a recent product launch, but by a telemetry service the company deployed on Wednesday to collect Kubernetes metrics. Kubernetes is an open source program that helps manage containers — packages of apps and associated files used to run software in isolated environments.

“Telemetry services have a very long reach, so configuring this new service inadvertently caused… resource-intensive Kubernetes API operations,” OpenAI wrote in the postmortem. “(Our) Kubernetes API servers were overwhelmed and brought down the Kubernetes control plane in most of our large (Kubernetes) clusters.”

That’s a lot of jargon, but in essence, the new telemetry service affected OpenAI’s Kubernetes operations, including a component that many of the company’s services rely on for DNS resolution. DNS resolution converts domain names into IP addresses; it’s the reason you can type “Google.com” instead of “142.250.191.78”.
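To make the DNS step concrete, here is a minimal sketch in Python showing the lookup a program performs before connecting to a server (using the standard library’s `socket` module; “localhost” is used here only because it resolves without a network round trip):

```python
import socket

def resolve(hostname: str) -> str:
    """Return the first IPv4 address for a hostname, as a DNS lookup would."""
    # getaddrinfo performs the same name-to-address translation a browser
    # does before it can open a connection to a site.
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return infos[0][4][0]

# "localhost" resolves locally, typically to 127.0.0.1
print(resolve("localhost"))
```

When DNS resolution breaks cluster-wide, services can no longer turn the names they depend on into addresses, which is why a single shared component failing can take down so many products at once.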

OpenAI’s use of DNS caching, which retains records of previously looked-up domain names (e.g. website addresses) and their corresponding IP addresses, complicated matters by “delay[ing] visibility,” OpenAI wrote, “allowing the rollout (of the telemetry service) to continue before the full extent of the problem was understood.”
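The delayed-visibility effect of caching can be illustrated with a toy TTL cache — a hypothetical sketch, not OpenAI’s implementation, with made-up hostnames: cached answers keep being served after the underlying resolver has already failed, so the breakage only surfaces as entries expire.

```python
import time

class DNSCache:
    """Minimal TTL cache: answers are served from memory until the TTL expires."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.records = {}  # hostname -> (ip, time stored)

    def put(self, hostname: str, ip: str) -> None:
        self.records[hostname] = (ip, time.monotonic())

    def get(self, hostname: str):
        entry = self.records.get(hostname)
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]  # still fresh: no real DNS lookup happens
        return None  # expired: a fresh lookup is needed, exposing any outage

cache = DNSCache(ttl_seconds=0.05)
cache.put("api.example.internal", "10.0.0.7")  # hypothetical record
print(cache.get("api.example.internal"))  # served from cache
time.sleep(0.06)
print(cache.get("api.example.internal"))  # None: TTL expired
```

In the window before entries expire, everything appears healthy even though new lookups would fail — which is how a bad rollout can spread further before its full impact is visible.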

OpenAI says it was able to detect the issue “a few minutes” before customers actually noticed the impact, but that it was unable to quickly implement a fix because it had to work around overloaded Kubernetes servers.

“This was a confluence of multiple systems and processes failing simultaneously and interacting in unexpected ways,” the company wrote. “Our testing did not capture the impact of the change on the Kubernetes control plane (and) remediation was very slow because of the lockout effect.”

OpenAI says it will adopt several measures to prevent similar incidents in the future, including improved phased rollouts with better monitoring of infrastructure changes, and new mechanisms to ensure OpenAI engineers can access the company’s Kubernetes API servers under all circumstances.

“We apologize for the impact this incident has had on all of our customers – from ChatGPT users to developers to companies that rely on OpenAI products,” OpenAI wrote. “We fell short of our own expectations.”

TechCrunch has an AI-focused newsletter! Sign up here to get it in your inbox every Wednesday.
