Ben J. Christensen

Fault Tolerance in a High Volume, Distributed System

Originally posted on the Netflix Tech Blog:


In an earlier post by Ben Schmaus, we shared the principles behind our circuit-breaker implementation. In that post, Ben discusses how the Netflix API interacts with dozens of systems in our service-oriented architecture, which makes the API inherently more vulnerable to any system failures or latencies underneath it in the stack. The rest of this post provides a more technical deep-dive into how our API and other systems isolate failure, shed load and remain resilient to failures.

Fault Tolerance is a Requirement, Not a Feature

The Netflix API receives more than 1 billion incoming calls per day which in turn fans out to several billion outgoing calls (averaging a ratio of 1:6) to dozens of underlying subsystems with peaks of over 100k dependency requests per second.

This all occurs in the cloud across thousands of EC2 instances.

Intermittent failure is guaranteed with this many variables, even if every dependency itself has excellent availability and uptime.

Without taking steps to ensure fault tolerance, 30 dependencies each with 99.99% uptime would result in 2+ hours downtime/month (99.99%30 = 99.7% uptime = 2+ hours in a month).

When a single API dependency fails at high volume with increased latency (causing blocked request threads) it can rapidly (seconds or sub-second) saturate all available Tomcat (or other container such as Jetty) request threads and take down the entire API.

Thus, it is a requirement of high volume, high availability applications to build fault tolerance into their architecture and not expect infrastructure to solve it for them.

Netflix DependencyCommand Implementation

The service-oriented architecture at Netflix allows each team freedom to choose the best transport protocols and formats (XML, JSON, Thrift, Protocol Buffers, etc) for their needs so these approaches may vary across services.

In most cases the team providing a service also distributes a Java client library.

Because of this, applications such as API in effect treat the underlying dependencies as 3rd party client libraries whose implementations are “black boxes”. This in turn affects how fault tolerance is achieved.

In light of the above architectural considerations we chose to implement a solution that uses a combination of fault tolerance approaches:

  • network timeouts and retries
  • separate threads on per-dependency thread pools
  • semaphores (via a tryAcquire, not a blocking call)
  • circuit breakers

Each of these approaches to fault-tolerance has pros and cons but when combined together provide a comprehensive protective barrier between user requests and underlying dependencies.

The Netflix DependencyCommand implementation wraps a network-bound dependency call with a preference towards executing in a separate thread and defines fallback logic which gets executed (step 8 in flow chart below) for any failure or rejection (steps 3, 4, 5a, 6b below) regardless of which type of fault tolerance (network or thread timeout, thread pool or semaphore rejection, circuit breaker) triggered it.

We decided that the benefits of isolating dependency calls into separate threads outweighs the drawbacks (in most cases). Also, since the API is progressively moving towards increased concurrency it was a win-win to achieve both fault tolerance and performance gains through concurrency with the same solution. In other words, the overhead of separate threads is being turned into a positive in many use cases by leveraging the concurrency to execute calls in parallel and speed up delivery of the Netflix experience to users.

Thus, most dependency calls now route through a separate thread-pool as the following diagram illustrates:

If a dependency becomes latent (the worst-case type of failure for a subsystem) it can saturate all of the threads in its own thread pool, but Tomcat request threads will timeout or be rejected immediately rather than blocking.

In addition to the isolation benefits and concurrent execution of dependency calls we have also leveraged the separate threads to enable request collapsing (automatic batching) to increase overall efficiency and reduce user request latencies.

Semaphores are used instead of threads for dependency executions known to not perform network calls (such as those only doing in-memory cache lookups) since the overhead of a separate thread is too high for these types of operations.

We also use semaphores to protect against non-trusted fallbacks. Each DependencyCommand is able to define a fallback function (discussed more below) which is performed on the calling user thread and should not perform network calls. Instead of trusting that all implementations will correctly abide to this contract, it too is protected by a semaphore so that if an implementation is done that involves a network call and becomes latent, the fallback itself won’t be able to take down the entire app as it will be limited in how many threads it will be able to block.

Despite the use of separate threads with timeouts, we continue to aggressively set timeouts and retries at the network level (through interaction with client library owners, monitoring, audits etc).

The timeouts at the DependencyCommand threading level are the first line of defense regardless of how the underlying dependency client is configured or behaving but the network timeouts are still important otherwise highly latent network calls could fill the dependency thread-pool indefinitely.

The tripping of circuits kicks in when a DependencyCommand has passed a certain threshold of error (such as 50% error rate in a 10 second period) and will then reject all requests until health checks succeed.

This is used primarily to release the pressure on underlying systems (i.e. shed load) when they are having issues and reduce the user request latency by failing fast (or returning a fallback) when we know it is likely to fail instead of making every user request wait for the timeout to occur.

How do we respond to a user request when failure occurs?

In each of the options described above a timeout, thread-pool or semaphore rejection, or short-circuit will result in a request not retrieving the optimal response for our customers.

An immediate failure (“fail fast”) throws an exception which causes the app to shed load until the dependency returns to health. This is preferable to requests “piling up” as it keeps Tomcat request threads available to serve requests from healthy dependencies and enables rapid recovery once failed dependencies recover.

However, there are often several preferable options for providing responses in a “fallback mode” to reduce impact of failure on users. Regardless of what causes a failure and how it is intercepted (timeout, rejection, short-circuited etc) the request will always pass through the fallback logic (step 8 in flow chart above) before returning to the user to give a DependencyCommand the opportunity to do something other than “fail fast”.

Some approaches to fallbacks we use are, in order of their impact on the user experience:

  • Cache: Retrieve data from local or remote caches if the realtime dependency is unavailable, even if the data ends up being stale
  • Eventual Consistency: Queue writes (such as in SQS) to be persisted once the dependency is available again
  • Stubbed Data: Revert to default values when personalized options can’t be retrieved
  • Empty Response (“Fail Silent”): Return a null or empty list which UIs can then ignore

All of this work is to maintain maximum uptime for our users while maintaining the maximum number of features for them to enjoy the richest Netflix experience possible. As a result, our goal is to have the fallbacks deliver responses as close to what the actual dependency would deliver.

Example Use Case

Following is an example of how threads, network timeouts and retries combine:

The above diagram shows an example configuration where the dependency has no reason to hit the 99.5th percentile and thus cuts it short at the network timeout layer and immediately retries with the expectation to get median latency most of the time, and accomplish this all within the 300ms thread timeout.

If the dependency has legitimate reasons to sometimes hit the 99.5th percentile (i.e. cache miss with lazy generation) then the network timeout will be set higher than it, such as at 325ms with 0 or 1 retries and the thread timeout set higher (350ms+).

The threadpool is sized at 10 to handle a burst of 99th percentile requests, but when everything is healthy this threadpool will typically only have 1 or 2 threads active at any given time to serve mostly 40ms median calls.

When configured correctly a timeout at the DependencyCommand layer should be rare, but the protection is there in case something other than network latency affects the time, or the combination of connect+read+retry+connect+read in a worst case scenario still exceeds the configured overall timeout.

The aggressiveness of configurations and tradeoffs in each direction are different for each dependency.

Configurations can be changed in realtime as needed as performance characteristics change or when problems are found all without risking the taking down of the entire app if problems or misconfigurations occur.

Conclusion

The approaches discussed in this post have had a dramatic effect on our ability to tolerate and be resilient to system, infrastructure and application level failures without impacting (or limiting impact to) user experience.

Despite the success of this new DependencyCommand resiliency system over the past 8 months, there is still a lot for us to do in improving our fault tolerance strategies and performance, especially as we continue to add functionality, devices, customers and international markets.

Filed under: Architecture, Code, Infrastructure, Performance, Production, Production Problems, , , , ,

Technical Debt Quadrant

Martin Fowler wrote a blog entry on technical debt this week that communicates the concepts of “technical debt” and classifies them very well.

techDebtQuadrant

Some favorite portions:

“A mess is a reckless debt which results in crippling interest payments or a long period of paying down the principal.”

“The prudent debt to reach a release may not be worth paying down if the interest payments are sufficiently small – such as if it were in a rarely touched part of the code-base.”

“Not just is there a difference between prudent and reckless debt, there’s also a difference between deliberate and inadvertent debt. The prudent debt example is deliberate because the team knows they are taking on a debt, and thus puts some thought as to whether the payoff for an earlier release is greater than the costs of paying it off. A team ignorant of design practices is taking on its reckless debt without even realizing how much hock it’s getting into.”

“while you’re programming, you are learning. It’s often the case that it can take a year of programming on a project before you understand what the best design approach should have been. Perhaps one should plan projects to spend a year building a system that you throw away and rebuild, but that’s a tricky plan to sell. Instead what you find is that the moment you realize what the design should have been, you also realize that you have an inadvertent debt.”

Filed under: Architecture, Code, Management & Leadership

Detail Oriented

I’m working on a fairly significant document that covers the requirements and high-level architecture for a new system based on SOA principles to replace an aging application.

The document is approaching 150 pages and beginning to approach something that properly describes the vision and needs so that business and technical folks can have a solid understanding of scope, requirements, priorities and how it will work (plus general direction on how it will be built).

It takes a lot of effort and time to sit down and not only read, but understand the scope of this system. It’s not a simple thing. As much as we try to keep it conceptually simple, there’s a lot going on and a lot of details.

In a few weeks I’m to sit down with a partner team to review and collaborate. My plan was to send over the detailed document for review so we could have a productive meeting to truly discuss missing requirements, strategy, execution planning etc.

However, I’ve been told they don’t “want” to read the detailed version — just a summary — bullet points with perhaps 10-15 pages.

These are senior developers and architects. Not high level business folks.

Attention to detail is in my opinion an absolute requirement for technical people. Decision making without details is dangerous and fairly useless.

My opinion is not high for people who do not care or have the attention span to be detail oriented.

An executive who’s not directly involved in the operations of something – a summary makes sense.

A team or person who is supposed to directly impact the design and delivery of something and its ongoing operations – if the details aren’t part of their focus, they don’t deserve to be involved.

If they don’t have the time to be detail oriented, then either they shouldn’t be working on the project, their time needs to be re-prioritized, or the project isn’t worth doing.

Filed under: Architecture, Management & Leadership, Skills

Speed of Thought

I’ve focused on performance for several years in my server-side and web application development – as much as I’ve been able to fit into the timelines. It has involved digging into minute details of Java and JVM tuning that rarely get explored by most java developers (from what I can tell anyways) and focusing on tuning the CSS, images, caching, GZIP and other settings of the front-end. It has generally paid off. Today my team operates servers processing millions of complex, dynamic, uncacheable web service transactions completing on average in around 250ms each (server side, not including network transport to client). I believe with further investment we could improve that even more.

I have read comments from companies such as Google and Amazon how the performance of an application can dramatically affect how much people use it. I agree. The slightest friction in searching makes me search less, or shop less, etc.

This past week I’ve been using the new iPhone 3GS which is at least 2x faster than the previous iPhone 2G I had. In some cases it’s 4x and 6x faster.

I already used the iPhone a lot. The increase in speed has further reduced the “friction” of use to the point that if I even have a thought of quickly looking something up or performing some other action, I am much more likely to do it.

On my last iPhone, I consciously chose to not bother at certain times because of the time it would take. Yes, I’m talking in seconds and even milliseconds here — but when it’s a “thought”, if the tool doesn’t work at the same speed, then it’s friction. Same goes for another application I use which involves looking up reference materials and documents. Before I kind of had to avoid “flipping around’ while someone was referring to things. It was actually faster to use the paper documents. Now, I can keep up or be faster with my iPhone than the paper version ‘users’. Therefore it encourages use.

The new user experience of using the iPhone 3GS, so significantly improved just by the performance improvement, has reminded me as a developer and architect how critical it is to design, plan for and develop to achieve high performance. Functionality isn’t enough — we should be aiming for the “speed of thought”.

Interestingly, Google has just launched a new site just for “speeding up the web“.

The following video shows “the experts” talking about how the human mind perceives changes of 100ms (one tenth of a second).

It’s my belief that this isn’t just a “nice to have” feature. If a product, service or application wants to be adopted and deemed “necessary” by its users, its performance must reduce friction as much as technically feasible to the point where it approaches or achieves “speed of thought”.

Filed under: Architecture, Performance, Production, User Interface

Websphere Multi-JVM jsessionid

Works just like it should :-)

IBM Support Link

Two JVMs with different contexts but the same domain now use the same jsessionid so they can talk back and forth in the same browser without jsessionid schizophrenia.

Filed under: Architecture, Code, Production

Elements of a Roadmap

I found a good list of elements for a roadmap by NickMalik

  • Purpose of this Document – why write the doc
  • Stakeholder Chart – whose needs are being met with the solution
  • Signoff Chart – timestamp and id of person who signs off.
  • High Level Platform Capabilities – the solution capabilities demanded by the business
  • Business Drivers for the Capabilities – the list of business programs or business teams that will use the system
  • Capability Demand – which business are asking for which capabilities. An interesting grid.
  • Technical Interdependencies – what other systems rely on this solution. What other systems does this solution rely on?
  • Enterprise Architecture Concerns – what agreements have been made with EA to get alignment of the solution to EA standards.
  • Architectural Context – one or more diagrams showing how the solution platform will evolve over time.
  • Methodology – how this document and concensus was created. What meetings were held. Who was in the room. (it matters)
  • Roadmap schedule – what timelines everything thinks are reasonable for delivering the needs to different business customers
  • System Quality Attributes – what quality attributes will be stressed in each of the iterations.
  • Alternatives considered – what alternatives to this roadmap were considered and why this one was chosen.
  • Roadmap Risks – what could go wrong and who is assigned to watch for it
  • Platform Historical Narrative – previous decisions and narrative so that we can always answer the question: how did we ever get in a bind like this?

Filed under: Architecture, Management & Leadership

Content Management Systems — JSR 170 and Magnolia

In the past week I’ve done a bunch of research on content management systems, and in particular the Java standard JSR 170.

I’ve been playing with Magnolia and so far am fairly impressed with how it works — though not so impressed with the backend repository which in this case is JackRabbit, the open source reference implementation.

It appears though that for a fairly reasonable price I can get the Enterprise version of Magnolia that uses the commercial JSR-170 compliant repository from Day (http://www.day.com/site/en/index.html). It sounds like it should be MUCH better … and is very likely what all the big name companies use that Magnolia lists on their site. I’ve requested them to contact me but haven’t heard back yet.

JSR-170 definitely is sounding nice … though it seems like it still has a way to go before it fully matures as it’s quite new.

Also, the opensource community obviously has not had enough time to make a decent enterprise implementation, as Jackrabbit doesn’t cut it. For example … I have the system running with JackRabbit … it only support filesystem persistence right now, so I try to connect another client to it and …. it doesn’t work ( the files are locked) …. so that means it won’t work so well except for small things.

Also, I see no nice way of connecting remotely to the repository using the open source version of Magnolia.

I’ll continue researching this in the coming week.

Filed under: Architecture, Code

Twitter Updates

View Ben Christensen's profile on LinkedIn
Follow

Get every new post delivered to your Inbox.