A few days ago there was a big screwup that crashed Google's main container engine.
On Friday afternoon, users started getting error messages saying they could not connect to the metadata server.
The issue continued for quite a while. A string of updates over Friday night and Saturday morning (Pacific Time) then explained how Google identified the problem.
The company crafted a fix, implemented it and then declared that the issue “should have been resolved for all affected clusters” as of 10:45 Pacific Time on October 25.
Note the “should have” because at 21:30 on the same day Google had to admit that “We have identified a small number of additional Google Container Engine clusters that were not fixed by the previous round of repair.”
To reconstruct the timeline: on Friday afternoon, Google had a problem. By the next day, it declared it had restored normal service. But it hadn't, and some people lost a large chunk of their Saturday night cleaning up a corner of the Google cloud that someone didn't fix properly the first time.
Sound familiar? Google's last cloud glitch was about five weeks before this new incident. That's a more-than-decent record, even if the fix for this failure didn't work the first time.
Source: Google.