An Appreciation of Google's Failure Cases: All Those Years of Microservices Pitfalls

Hi everyone. Today I'm sharing some lessons learned from failures. Talking about this category is much more fun than recounting success stories, and it's encouraging that the industry keeps improving: we can learn from past mistakes and proactively avoid them in our future plans.

Background

Before I start, a little about my time at Google. I joined Google straight out of college in 2003; before that I was a camp counselor at a music camp, and before that I worked at an ice cream store. On my first day at Google, the technical lead on my first project was Andrew Fikes, who is now something like a Distinguished Engineer there, and I remember telling him I had to go talk to someone because I really didn't know what I was doing, which is still funny to think about today. At Google I soaked up technology and other information like a sponge. Some of what I'm talking about today actually predates my time at Google, going back to around 2000 and 2001. Let's start with microservices, or rather Google's version of microservices.

At the time, Google's business was still betting on the GSA (Google Search Appliance) product, which didn't end up going as well as expected. Of course, that's true of plenty of other things as well; after all, it's hard for anything to compare with a virtual monopoly, a multi-billion-dollar mega-business like advertising. Still, Google started out with search and focused on solving that class of technical problem.

The original driver for much of what follows comes from this slide. Before the economic crisis, a lot of organizations were building their infrastructure on Sun Microsystems hardware running Solaris. If you ignored cost, that combination was better than anything else available, and a lot of people bought a lot of Sun boxes for exactly that reason. But Sun boxes were really expensive, and for an organization with a huge data center that needed to fill the whole thing with those boxes to support its business, the cost could threaten its pipeline and even its ability to stay alive.

Google was in exactly that situation. The natural response at the time was: "Linux isn't perfect, but it's functional enough, and the hardware it runs on is cheap, so on balance we can go with Linux as the alternative." To some extent I agree with that telling of the history: people back then were very cost-conscious, so they spared no effort working around RAM failures, chip failures, and every other kind of failure on those Linux boxes in order to keep costs down. The consequence was that Linux, especially on bargain-bin hardware, was genuinely unreliable and had serious problems. I think Google benefited greatly from the Compaq-DEC merger, which was responsible for the death of some really incredible research labs in the 90s. A lot of people like Jeff Dean and Sanjay Ghemawat came from that world, and they are all outstanding engineers today. They were powerfully interested in the problem of how to build software on top of incredibly unreliable hardware, and what came out of that is a lot of what I'm going to share next.

However, in 2001 there were no alternatives, so you had to do it yourself. Another problem was the very unusual scaling requirements. Google was trying to do something very bold at the time: index every word of every web page. Some kept and indexed every word of every page; others just indexed it and then discarded the raw data, which limited what those competitors could do afterward. It was a daunting task that required software that simply didn't exist at the time.

So, because of the unreliable Linux boxes, the software had to scale horizontally and had to tolerate frequent, routine failures in any component of the stack. There was a great article a while back arguing that machines should be "cattle, not pets", and I think Google got this one right. These machines didn't have cool names from Star Trek; they just had labels like "ab1257". The rest of the system has very little dependence on any individual machine, and whether it dies or keeps running barely affects anything else. That constraint got people thinking about how to build more resilient systems.
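
To make the "cattle, not pets" idea a bit more concrete, here is a minimal sketch (my own toy example, not anything from Google's actual stack, and the replica labels are made up): the caller only knows a pool of interchangeable replicas, so the routine death of any single machine is handled by simply moving on to the next one.

```python
# A minimal sketch of "cattle, not pets": no individual machine matters,
# the caller just retries against the next interchangeable replica.
import random

REPLICAS = ["ab1257", "ab1258", "ab1259", "ab1260"]  # hypothetical machine labels


def call_replica(host: str, query: str) -> str:
    """Simulated RPC: any individual machine fails routinely."""
    if random.random() < 0.3:  # pretend ~30% of calls hit a dead box
        raise ConnectionError(f"{host} is down")
    return f"results for {query!r} from {host}"


def resilient_search(query: str) -> str:
    """Try replicas in random order; a single machine failure is unremarkable."""
    for host in random.sample(REPLICAS, k=len(REPLICAS)):
        try:
            return call_replica(host, query)
        except ConnectionError:
            continue  # the machine is cattle: skip it and move on
    raise RuntimeError("all replicas failed")


if __name__ == "__main__":
    print(resilient_search("microservices"))
```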

That's my version of the story. Many people at Google have PhDs; I didn't when I interviewed. In fact, I only talked to one interviewer who didn't have a PhD, and at the end of the interview he said, "Don't worry, they're starting to hire people without PhDs now." There were a lot of people there who were much smarter than I was and really wanted to apply their knowledge of CS systems research, and applying that kind of experience and knowledge to real-world problems turned out to be a very interesting thing to do.

I think the only good reason to build microservices is organizational structure, and that should be the only reason most organizations build them. However, that's not why Google built microservices. Google built microservices for computer-science reasons, and I'm not going to argue that there is no benefit to building microservices from that perspective, but there were certainly a lot of pain points driving it.

If you start building microservices simply assuming it will go smoothly, without researching the possible failure scenarios beforehand, it definitely will not go smoothly and may lead to a lot of regrettable results. I have talked with many organizations that gave up on migrating to microservices because the migration was just too painful. So make sure you understand your motivation for building microservices beforehand. A lot of people imitate Google's large infrastructure projects, and sometimes I think they are building architecture they don't actually need. A sensible way to invest follows the principle: if you don't need it, don't build it, or it will only make things harder.

The main reason for doing this is to minimize the cost of communication between people across teams. A team of more than 10 or 12 people can't collaborate successfully on a single engineering project; it has a lot to do with human communication structures and the delegation of work. Mapping project teams onto microservices therefore reduces the overhead of communication between people and increases development speed. That is a valid reason for choosing microservices, but it is still not the reason we built them at Google.

I think observability consists of two things: one is the detection of critical signals, the SLI part, which needs to be very accurate; the other is refining the search space. With each additional microservice, the number of possible failure modes grows geometrically with the number of services, and I don't think machine learning or AI will magically solve this problem. We need ways to reduce the number of hypotheses a human brain has to hold, and that narrowing can only happen with technology beyond the giant dashboard. Giant dashboards work well in monolithic environments, but I've seen people carry that idea straight into microservices as their approach to observability. Dashboards are necessary, but they are certainly not sufficient. The SRE team I interviewed was building huge dashboards at the time, and we were significantly less efficient than the team that made theirs more compact by design and then used other tools to narrow the search space. So don't confuse visualizing the search space with refining and optimizing it. The entire search space is far too large to visualize, and humans simply cannot process that much information.
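
As a rough, back-of-the-envelope illustration of that growth (my own numbers, not Google's): even if you only count pairwise interactions between services as candidate hypotheses, the search space grows with the square of the service count, before you start multiplying in per-service failure modes or combinations of more than two services.

```python
# Rough illustration of how the debugging search space explodes with service count:
# just the "maybe the problem is between service A and service B" hypotheses.
from math import comb

for n_services in (5, 20, 100, 500):
    pairwise = comb(n_services, 2)  # possible service-to-service interactions
    print(f"{n_services:>4} services -> {pairwise:>7} pairwise interactions to suspect")
    # The real space is far larger once per-service failure modes and
    # multi-service combinations are included.
```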

At LightStep, we see customers struggling with these kinds of issues all the time. I don't know if anyone here has experienced the same thing, but I think it's a failure mode, and Google certainly understood it. There was a large Google service, named something like Family Type, that had to use a code generator to produce its alerting configuration, which ended up being 35,000 lines of code or more. I don't remember all the reasons for that. But then they had to start maintaining those 35,000 lines by hand, and those configurations were written in an internal, completely obscure Google DSL, so the pain of manual maintenance was beyond compare. The root of it was that they were confusing alerts on SLIs with alerts on things that might be root causes. Monitoring should not alert on root causes; those belong to the refinement process. It should alert on SLIs, of which any given system has few enough to actually handle.
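
To make that distinction concrete, here is a hedged sketch with made-up names and thresholds (nothing here reflects Google's internal DSL): the page fires only when a symptom-level SLI such as the error ratio violates its objective, while internal root-cause signals stay out of the alerting path and are saved for the refinement step.

```python
# Sketch: alert on an SLI (what users feel), not on individual root causes.
from dataclasses import dataclass


@dataclass
class WindowStats:
    requests: int
    errors: int


SLO_ERROR_RATIO = 0.001  # hypothetical objective: 99.9% of requests succeed


def sli_error_ratio(stats: WindowStats) -> float:
    """Service Level Indicator: fraction of failed requests in the window."""
    return stats.errors / stats.requests if stats.requests else 0.0


def should_page(stats: WindowStats) -> bool:
    """Page a human only when the SLI violates the objective, never on
    internal root-cause signals (queue depth, GC pauses, one bad replica...)."""
    return sli_error_ratio(stats) > SLO_ERROR_RATIO


if __name__ == "__main__":
    print(should_page(WindowStats(requests=100_000, errors=200)))  # True: users are hurting
    print(should_page(WindowStats(requests=100_000, errors=50)))   # False: within the SLO
```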
