Microsoft-owned GitHub, which provides a code hosting platform for version control and collaboration, faced three disruptions in its services last week, following 13 such incidents in the past three months.
“Last week, GitHub experienced several availability incidents, both long-running and shorter duration. We have since mitigated these incidents and all systems are now operating normally,” Mike Hanley, chief security officer at GitHub, said in a blog post.
“The root causes for these incidents were unrelated but in aggregate, they negatively impacted the services that organisations and developers trust GitHub to deliver. This is not acceptable nor the standard we hold ourselves to,” Hanley added.
The three incidents, which occurred on May 9, May 10, and May 11, affected a majority of the critical services that GitHub provides, the company said.
Incidents take out critical GitHub services
The May 9 incident disrupted GitHub’s databases because of a configuration change, according to the company.
“On May 9, we had an incident that caused 8 of the 10 services on the status portal to be impacted by a major (status red) outage. The majority of downtime lasted just over an hour,” Hanley said in the blog post.
During the outage, many services could not read newly written Git data, causing widespread failures, Hanley explained, adding that recovery of some pull request and push data took an extended period after the outage ended.
The outage, according to Hanley, was triggered by a configuration change to the internal service serving Git data.
“The change was intended to prevent connection saturation and had been previously introduced successfully elsewhere in the Git backend. Shortly after the rollout began, the cluster experienced a failover. We reverted the config change and attempted a rollback within a few minutes, but the rollback failed due to an internal infrastructure error,” Hanley said.
The May 10 incident, caused by degraded issuance of GitHub App authentication tokens, affected six out of ten critical GitHub services.
“On May 10, the database cluster serving GitHub App auth tokens saw a 7x increase in write latency for GitHub App permissions (status yellow). The failure rate of these auth token requests was 8-15% for the majority of this incident, but did peak at 76% percent for a short time,” Hanley said in the blog post.
The issue with token issuance was a result of “inefficient implementation” of an API for managing GitHub App permissions, the chief security officer explained, adding that the company was updating the API to check for the shift in installation state.
GitHub’s database was hit again on May 11 due to a loss of read replicas, the company said.
“In the Git database incidents, Git reads and writes are at the core of many GitHub scenarios, so increased latency and failures resulted in GitHub Actions workflows unable to pull data or pull requests not updating,” Hanley said in the blog post.
GitHub working on avoiding similar incidents in the future
To avoid similar incidents in the future, the company is carefully reviewing its internal processes and making adjustments to ensure that changes are deployed more safely, Hanley said.
“In addition to the standard post-incident analysis and review, we are analysing the breadth of impact these incidents had across services to identify where we can reduce the impact of future similar failures,” Hanley said, adding that GitHub was working to improve the observability of high-cost, low-volume query patterns and its general ability to diagnose and mitigate this class of issue quickly.
Other measures include addressing the database failover issues so that failover always recovers fully without intervention, and understanding the causes of the multiple Git database crash incidents.
Although the company says it is working on addressing outages, GitHub has continued to face disruptions over the last four months, with four incidents in April, six in March, and three in February.