Ben J. Christensen

Fault Tolerance in a High Volume, Distributed System

Originally posted on the Netflix Tech Blog:


In an earlier post by Ben Schmaus, we shared the principles behind our circuit-breaker implementation. In that post, Ben discusses how the Netflix API interacts with dozens of systems in our service-oriented architecture, which makes the API inherently more vulnerable to any system failures or latencies underneath it in the stack. The rest of this post provides a more technical deep-dive into how our API and other systems isolate failure, shed load and remain resilient to failures.

Fault Tolerance is a Requirement, Not a Feature

The Netflix API receives more than 1 billion incoming calls per day which in turn fans out to several billion outgoing calls (averaging a ratio of 1:6) to dozens of underlying subsystems with peaks of over 100k dependency requests per second.

This all occurs in the cloud across thousands of EC2 instances.

Intermittent failure is guaranteed with this many variables, even if every dependency itself has excellent availability and uptime.

Without taking steps to ensure fault tolerance, 30 dependencies each with 99.99% uptime would result in 2+ hours downtime/month (99.99%30 = 99.7% uptime = 2+ hours in a month).

When a single API dependency fails at high volume with increased latency (causing blocked request threads) it can rapidly (seconds or sub-second) saturate all available Tomcat (or other container such as Jetty) request threads and take down the entire API.

Thus, it is a requirement of high volume, high availability applications to build fault tolerance into their architecture and not expect infrastructure to solve it for them.

Netflix DependencyCommand Implementation

The service-oriented architecture at Netflix allows each team freedom to choose the best transport protocols and formats (XML, JSON, Thrift, Protocol Buffers, etc) for their needs so these approaches may vary across services.

In most cases the team providing a service also distributes a Java client library.

Because of this, applications such as API in effect treat the underlying dependencies as 3rd party client libraries whose implementations are “black boxes”. This in turn affects how fault tolerance is achieved.

In light of the above architectural considerations we chose to implement a solution that uses a combination of fault tolerance approaches:

  • network timeouts and retries
  • separate threads on per-dependency thread pools
  • semaphores (via a tryAcquire, not a blocking call)
  • circuit breakers

Each of these approaches to fault-tolerance has pros and cons but when combined together provide a comprehensive protective barrier between user requests and underlying dependencies.

The Netflix DependencyCommand implementation wraps a network-bound dependency call with a preference towards executing in a separate thread and defines fallback logic which gets executed (step 8 in flow chart below) for any failure or rejection (steps 3, 4, 5a, 6b below) regardless of which type of fault tolerance (network or thread timeout, thread pool or semaphore rejection, circuit breaker) triggered it.

We decided that the benefits of isolating dependency calls into separate threads outweighs the drawbacks (in most cases). Also, since the API is progressively moving towards increased concurrency it was a win-win to achieve both fault tolerance and performance gains through concurrency with the same solution. In other words, the overhead of separate threads is being turned into a positive in many use cases by leveraging the concurrency to execute calls in parallel and speed up delivery of the Netflix experience to users.

Thus, most dependency calls now route through a separate thread-pool as the following diagram illustrates:

If a dependency becomes latent (the worst-case type of failure for a subsystem) it can saturate all of the threads in its own thread pool, but Tomcat request threads will timeout or be rejected immediately rather than blocking.

In addition to the isolation benefits and concurrent execution of dependency calls we have also leveraged the separate threads to enable request collapsing (automatic batching) to increase overall efficiency and reduce user request latencies.

Semaphores are used instead of threads for dependency executions known to not perform network calls (such as those only doing in-memory cache lookups) since the overhead of a separate thread is too high for these types of operations.

We also use semaphores to protect against non-trusted fallbacks. Each DependencyCommand is able to define a fallback function (discussed more below) which is performed on the calling user thread and should not perform network calls. Instead of trusting that all implementations will correctly abide to this contract, it too is protected by a semaphore so that if an implementation is done that involves a network call and becomes latent, the fallback itself won’t be able to take down the entire app as it will be limited in how many threads it will be able to block.

Despite the use of separate threads with timeouts, we continue to aggressively set timeouts and retries at the network level (through interaction with client library owners, monitoring, audits etc).

The timeouts at the DependencyCommand threading level are the first line of defense regardless of how the underlying dependency client is configured or behaving but the network timeouts are still important otherwise highly latent network calls could fill the dependency thread-pool indefinitely.

The tripping of circuits kicks in when a DependencyCommand has passed a certain threshold of error (such as 50% error rate in a 10 second period) and will then reject all requests until health checks succeed.

This is used primarily to release the pressure on underlying systems (i.e. shed load) when they are having issues and reduce the user request latency by failing fast (or returning a fallback) when we know it is likely to fail instead of making every user request wait for the timeout to occur.

How do we respond to a user request when failure occurs?

In each of the options described above a timeout, thread-pool or semaphore rejection, or short-circuit will result in a request not retrieving the optimal response for our customers.

An immediate failure (“fail fast”) throws an exception which causes the app to shed load until the dependency returns to health. This is preferable to requests “piling up” as it keeps Tomcat request threads available to serve requests from healthy dependencies and enables rapid recovery once failed dependencies recover.

However, there are often several preferable options for providing responses in a “fallback mode” to reduce impact of failure on users. Regardless of what causes a failure and how it is intercepted (timeout, rejection, short-circuited etc) the request will always pass through the fallback logic (step 8 in flow chart above) before returning to the user to give a DependencyCommand the opportunity to do something other than “fail fast”.

Some approaches to fallbacks we use are, in order of their impact on the user experience:

  • Cache: Retrieve data from local or remote caches if the realtime dependency is unavailable, even if the data ends up being stale
  • Eventual Consistency: Queue writes (such as in SQS) to be persisted once the dependency is available again
  • Stubbed Data: Revert to default values when personalized options can’t be retrieved
  • Empty Response (“Fail Silent”): Return a null or empty list which UIs can then ignore

All of this work is to maintain maximum uptime for our users while maintaining the maximum number of features for them to enjoy the richest Netflix experience possible. As a result, our goal is to have the fallbacks deliver responses as close to what the actual dependency would deliver.

Example Use Case

Following is an example of how threads, network timeouts and retries combine:

The above diagram shows an example configuration where the dependency has no reason to hit the 99.5th percentile and thus cuts it short at the network timeout layer and immediately retries with the expectation to get median latency most of the time, and accomplish this all within the 300ms thread timeout.

If the dependency has legitimate reasons to sometimes hit the 99.5th percentile (i.e. cache miss with lazy generation) then the network timeout will be set higher than it, such as at 325ms with 0 or 1 retries and the thread timeout set higher (350ms+).

The threadpool is sized at 10 to handle a burst of 99th percentile requests, but when everything is healthy this threadpool will typically only have 1 or 2 threads active at any given time to serve mostly 40ms median calls.

When configured correctly a timeout at the DependencyCommand layer should be rare, but the protection is there in case something other than network latency affects the time, or the combination of connect+read+retry+connect+read in a worst case scenario still exceeds the configured overall timeout.

The aggressiveness of configurations and tradeoffs in each direction are different for each dependency.

Configurations can be changed in realtime as needed as performance characteristics change or when problems are found all without risking the taking down of the entire app if problems or misconfigurations occur.

Conclusion

The approaches discussed in this post have had a dramatic effect on our ability to tolerate and be resilient to system, infrastructure and application level failures without impacting (or limiting impact to) user experience.

Despite the success of this new DependencyCommand resiliency system over the past 8 months, there is still a lot for us to do in improving our fault tolerance strategies and performance, especially as we continue to add functionality, devices, customers and international markets.

Filed under: Architecture, Code, Infrastructure, Performance, Production, Production Problems, , , , ,

WebService Performance on EC2 Instances

As part of a cost/performance analysis of Amazon EC2 instances I wrote a very simple Java webapp to allow simple comparison benchmarking.

The webapp runs in Tomcat and has a servlet which returns a JSON response after looping to generate the JSON to cause CPU computation time as if the service were doing real work. It results in a lot of iteration and string creation activity – things web services do a lot of.

ApacheBenchmark (ab) was used as the test client to simulate traffic and capture statistics.

The instance types tested were m2.2xlargem2.4xlargecc1.4xlarge, and cc2.8xlarge.

The tests determined that the cc2.8xlarge instance type – though the most expensive by unit – is potentially the most cost effective type to use when serving large volume traffic involving a cluster of servers.

The cost to serve 1000 requests/second at a response time of approximately 18ms for each instance type is:

  • m2.2xlarge: $1.98
  • m2.4xlarge: $2.48
  • cc1.4xlarge: $1.36
  • cc2.8xlarge: $1.19

Of course, this test is very simplistic and ignores a wide variety of variables, but it demonstrates that the cc1 and cc2 boxes are well worth the time to evaluate for production application usage for achieving cost/performance efficiency.

The following graphs show the test results that were used to calculate the above costs.

Results per Second with Increasing Concurrency

This demonstrates how the 32 CPU cores on the cc2.8xlarge enable significantly higher throughput.

Time per Request (ms) with Increasing Concurrency

This shows how the response time degrades with increasing thread counts and how the larger boxes (particularly the cc2) scale better.

Spreadsheet

This screenshot of the spreadsheet shows the cost calculations at each level of concurrency.

The hi-lighted green is considered the optimal “ceiling” beyond which performance degrades and throughput plateaus. This point is considered the high-water-mark that can be used to determine the number of machines in a cluster needed to serve a given amount of traffic and thus allow comparison of different machines.

Price per 1000 Requests per Second with Increasing Concurrency

This scatter chart shows the cost to serve 1000 requests per second at increasing levels of concurrency on each instance type.

Overall, the performance of the cc1 and cc2 boxes are impressive and their cost/performance appears to be far better than the m2 instances.

The raw results of the tests can be found on Github.

Filed under: Infrastructure, Performance, ,

AtomicCircularArray Wins My Concurrency Throughput Test

I recently made several implementations of a RollingNumber (allowing a sum to be calculated over a 10 second period with 1 second increments with continual updates) to determine what would perform best under high-concurrency.

In my case, concurrency means 4-8000 writes/second from hundreds of threads on an 8-core machine and only a handful of reads per second.

In the end, a circular-array (using AtomicReferenceArray internally) won my throughput tests and also achieved my side goal of being non-blocking.

The code is on GitHub and the winning implementation is RollingNumberViaTryLockWithAtomicCircularArray.

Note that this code is NOT what I would deploy in production, since I forced each implementation to comply to the same interface so they could all be run by my test-harness.

However, a modified version of RollingNumberViaTryLockWithAtomicCircularArray is being run in production processing billions of writes per day.

Another aspect of the test is the use of ReentrantLock.tryLock() instead of a synchronized block. The tryLock() works well in this use-case, allowing only 1 thread to get the lock and routing all others around it instead of blocking them.

Test Results

Here are charts (higher is better) showing the differences in performance between implementations on two different machines:

MacBook Pro 2.2GHz Intel Core i7, 8GB Memory, SSD Drive, OSX Lion 10.7.1

$ uname -a
Darwin 11.1.0 Darwin Kernel Version 11.1.0: Tue Jul 26 16:07:11 PDT 2011; root:xnu-1699.22.81~1/RELEASE_X86_64 x86_64

$ java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03-383-11A511)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02-383, mixed mode)

The winner is the blue bar which is the highest.

Amazon EC2 m2.4xlarge
High-Memory Quadruple Extra Large Instance 68.4 GB of memory, 26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each), 1690 GB of local instance storage, 64-bit platform

$ uname -a
Linux 2.6.21.7-2.ec2.v1.3.fc8xen #1 SMP Sat Sep 25 01:16:50 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

$ /usr/java/latest/bin/java -version
java version "1.6.0_25"
Java(TM) SE Runtime Environment (build 1.6.0_25-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.0-b11, mixed mode)

The winner is the orange bar which is the highest.

Running the Test

You can run the test using an executable JAR with the command:

java -Xmx1g -jar RollingNumberThroughputTest.jar

The test also validates that the code does what it should and is able to find non-thread-safe implementations by looking for incorrect math caused by concurrency bugs.

Conclusion

I found it very interesting how differently the results were on the 2 machines (operating systems and CPUs are very different, the JVM implementations are also different). In both cases though the AtomicCircularArray performed better, far better on the EC2 instance running the Sun JVM.

These tests provided me with the data to choose AtomicCircularArray and as mentioned above it is similar to what I’m now using in production.

Filed under: Code, Performance

Statistics for Why Web Performance Matters

Following are some good posts with further evidence of the impact of webpage/webapp responsiveness to the user experience, amount of usage and ultimately revenue.

Good introductory quote to the external links and images shown below:

There’s no longer any debate. There’s reliable, reproducible evidence that web page latency is directly tied to the bottom line. At Velocity, Microsoft, Google and Shopzilla made this abundantly clear in a series of awesome presentations: detailed, controlled testing proves that slower pages hurt the bottom line. In Google’s case, adding delay reduces the average number of searches a visitor does each day even after the delay is removed.

Regarding “smaller scale” sites more typical than Google and Bing:

The results of their analysis show how significant a reduction in page latency can be. In addition to reducing bounce rates, and increasing pages per visit & time on site, they found a 16.07% increase in conversion rates and a 5.50% increase in average order value.http://radar.oreilly.com/2009/10/watching-websites.html

External Links:

http://radar.oreilly.com/2009/10/watching-websites.html
http://radar.oreilly.com/2009/07/velocity-making-your-site-fast.html
http://www.watchingwebsites.com/archives/proof-that-speeding-up-websites-improves-online-business

Good summary PDF showing metrics of performance impact.

conversion-rate-and-order-value

bing-delayimpact

Filed under: Performance, Production, User Interface

Initial Impressions on Ruby Performance

I’ve spent the past day playing with Ruby and decided to test some basic performance – iterating and string parsing – to get an idea of what the performance is really like.

Ruby is not doing so well in my tests.

Note to Ruby Experts: If anyone can demonstrate what I’m doing wrong in my code or testing, I would love to be corrected.

My approach was to write the exact same code in Java and Ruby that loads up a file, reads each lines, tokenizes it into words using whitespace as the delimiter and counts up the tokens.

This avoids network or database IO and other external resources – except the filesystem which I don’t consider a significant variable in the test.

Further testing I’m planning on doing will test Rails, multi-threading and other common things I do in my apps.

Picture 3

On my laptop, a MacBook Pro 2.53Ghz Core 2 Duo with 4GB memory, the average times are:

  • Ruby 1.8.6 app: 8022ms
  • Java 5 app: 2986ms
  • Java 6 app: 1443ms

Source Code

After downloading, strip the .doc ending off of the files.

Program Output

macbook-pro:src benjc$ /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/bin/java FileReadParse
Starting to read file…
The number of tokens is: 7764115
It took 1502 ms
macbook-pro:performance benjc$ ruby file_read_parse.rb
Starting to read file …
The number of tokens is: 7764115
It took 7999.955 ms

macbook-pro:src benjc$ /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/bin/java FileReadParse

Starting to read file…

The number of tokens is: 7764115

It took 1502 ms

macbook-pro:performance benjc$ ruby file_read_parse.rb

Starting to read file …

The number of tokens is: 7764115

It took 7999.955 ms


Filed under: Code, Performance

Speed of Thought

I’ve focused on performance for several years in my server-side and web application development – as much as I’ve been able to fit into the timelines. It has involved digging into minute details of Java and JVM tuning that rarely get explored by most java developers (from what I can tell anyways) and focusing on tuning the CSS, images, caching, GZIP and other settings of the front-end. It has generally paid off. Today my team operates servers processing millions of complex, dynamic, uncacheable web service transactions completing on average in around 250ms each (server side, not including network transport to client). I believe with further investment we could improve that even more.

I have read comments from companies such as Google and Amazon how the performance of an application can dramatically affect how much people use it. I agree. The slightest friction in searching makes me search less, or shop less, etc.

This past week I’ve been using the new iPhone 3GS which is at least 2x faster than the previous iPhone 2G I had. In some cases it’s 4x and 6x faster.

I already used the iPhone a lot. The increase in speed has further reduced the “friction” of use to the point that if I even have a thought of quickly looking something up or performing some other action, I am much more likely to do it.

On my last iPhone, I consciously chose to not bother at certain times because of the time it would take. Yes, I’m talking in seconds and even milliseconds here — but when it’s a “thought”, if the tool doesn’t work at the same speed, then it’s friction. Same goes for another application I use which involves looking up reference materials and documents. Before I kind of had to avoid “flipping around’ while someone was referring to things. It was actually faster to use the paper documents. Now, I can keep up or be faster with my iPhone than the paper version ‘users’. Therefore it encourages use.

The new user experience of using the iPhone 3GS, so significantly improved just by the performance improvement, has reminded me as a developer and architect how critical it is to design, plan for and develop to achieve high performance. Functionality isn’t enough — we should be aiming for the “speed of thought”.

Interestingly, Google has just launched a new site just for “speeding up the web“.

The following video shows “the experts” talking about how the human mind perceives changes of 100ms (one tenth of a second).

It’s my belief that this isn’t just a “nice to have” feature. If a product, service or application wants to be adopted and deemed “necessary” by its users, its performance must reduce friction as much as technically feasible to the point where it approaches or achieves “speed of thought”.

Filed under: Architecture, Performance, Production, User Interface

Java JDK 1.5 vs 1.6 Performance

In a benchmarking of our server software I found about a 34% speed improvement (if my math is correct) on using JDK 1.6.0_12 over JDK 1.5.0_17.

picture-8

Filed under: Performance

ConcurrentHashMap vs HashMap

A simple test of performance for ConcurrentHashMap.

The insert is only a little slower, but the retrieval is 3x slower in this quick and dirty single-threaded test.

 

TimeUtil timer = new TimeUtil();

        timer.start();

        HashMap map = new HashMap();

        for (int i = 0; i < 1000000; i++) {

            map.put(i, i);

        }

        timer.printCurrent();

        

        timer.start();

        ConcurrentHashMap cmap = new ConcurrentHashMap();

        for (int i = 0; i < 1000000; i++) {

            cmap.put(i, i);

        }

        timer.printCurrent();

        

 

        

        timer = new TimeUtil();

        timer.start();

        map = new HashMap();

        for (int i = 0; i < 1000000; i++) {

            map.get(i);

        }

        timer.printCurrent();

        

        timer.start();

        cmap = new ConcurrentHashMap();

        for (int i = 0; i < 1000000; i++) {

            cmap.get(i);

        }

        timer.printCurrent();

 

Action completed in: 458 ms 0.458 seconds

Action completed in: 574 ms 0.574 seconds

Action completed in: 30 ms 0.03 seconds

Action completed in: 113 ms 0.113 seconds

Filed under: Code, Performance

MySQL JDBC Memory Usage on Large ResultSet

I recently came across the problem of large resultsets being pulled into a java app via MySQL JDBC. I had dealt with this years ago but forgotten about it.

The test case below shows how the entire ResultSet is buffered in memory by default — which can be a “very bad thing” when dealing with hundreds or thousands of megabytes of data when it’s intended to be processed row by row.

Using mysql-connector-java-3.1.12-bin.jar and a JDK 5 with 32MB heap:

ET-COMMONS INFO: JVM MEMORY MONITOR => Total: 33  Used: 1  Free: 32

Retrieving data …

Ran out of memory at row: 0

java.lang.OutOfMemoryError: Java heap space

at com.mysql.jdbc.ByteArrayBuffer.getBytes(ByteArrayBuffer.java:128)

at com.mysql.jdbc.ByteArrayBuffer.readLenByteArray(ByteArrayBuffer.java:248)

at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1304)

at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2272)

at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:423)

at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:1962)

at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1385)

at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:1728)

at com.mysql.jdbc.Connection.execSQL(Connection.java:2988)

at com.mysql.jdbc.Connection.execSQL(Connection.java:2917)

at com.mysql.jdbc.Statement.executeQuery(Statement.java:824)

at JDBCTest.main(JDBCTest.java:26)

 

Using mysql-connector-java-5.1.6-bin.jar and the same JDK 5 with 32MB heap:

ET-COMMONS INFO: JVM MEMORY MONITOR => Total: 33  Used: 1  Free: 32

Retrieving data …

Ran out of memory at row: 0

java.lang.OutOfMemoryError: Java heap space

at com.mysql.jdbc.ByteArrayBuffer.getBytes(ByteArrayBuffer.java:128)

at com.mysql.jdbc.ByteArrayBuffer.readLenByteArray(ByteArrayBuffer.java:248)

at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1304)

at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2272)

at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:423)

at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:1962)

at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1385)

at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:1728)

at com.mysql.jdbc.Connection.execSQL(Connection.java:2988)

at com.mysql.jdbc.Connection.execSQL(Connection.java:2917)

at com.mysql.jdbc.Statement.executeQuery(Statement.java:824)

at JDBCTest.main(JDBCTest.java:26)

 

Thus we see that both the old and new versions of the MySQL JDBC driver by default attempt to load the entire resultset into memory.

 

I now increase the heap to 1GB to allow it to grow and find that the test query uses up > 500MB of heap before it even starts the rs.next() loop.

ET-COMMONS INFO: JVM MEMORY MONITOR => Total: 33  Used: 1  Free: 32

Retrieving data …

ET-COMMONS INFO: JVM MEMORY MONITOR => Total: 298  Used: 183  Free: 115

ET-COMMONS INFO: JVM MEMORY MONITOR => Total: 527  Used: 381  Free: 146

Starting to retrieve data. Memory Used: 517

Done retrieving data => 2318284   Memory Used: 551

 

Here is the code for this:

 

            ResultSet rs = conn.createStatement().executeQuery(“<sql query that returns lots of data>”);

            System.out.println(“Starting to retrieve data. Memory Used: “ + getUsedMemory());

            while (rs.next()) {

                rs.getString(1);

                rowsReturned++;

            }

            System.out.println(“Done retrieving data => “ + rowsReturned + ”   Memory Used: “  

+ getUsedMemory());

 

Thus you can see that the “executeQuery()” method loads up 500MB of data before it passes on the “rs.next()” loop. The full ResultSet is being buffered in memory.

 

Solution

To make the JDBC driver stream the results instead of buffer them all first we do the following:

            stmt.setFetchSize(Integer.MIN_VALUE);

Then we get this result instead:

ET-COMMONS INFO: JVM MEMORY MONITOR => Total: 33  Used: 1  Free: 32

Retrieving data …

Starting to retrieve data. Memory Used: 2

ET-COMMONS INFO: JVM MEMORY MONITOR => Total: 33  Used: 1  Free: 32

ET-COMMONS INFO: JVM MEMORY MONITOR => Total: 33  Used: 2  Free: 31

Done retrieving data => 2318284   Memory Used: 2

 

Now it behaves like we expect it to … only 2MB used instead of > 500MB.

 

There are some caveats:

  • http://javaquirks.blogspot.com/2007/12/mysql-streaming-result-set.html
  • http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html
In the second link of official documentation we read (emphasis in red added by myself):
———————————————————————-

ResultSet

By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate, and due to the design of the MySQL network protocol is easier to implement. If you are working with ResultSets that have a large number of rows or large values, and can not allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.

To enable this functionality, you need to create a Statement instance in the following manner:

stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
              java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

The combination of a forward-only, read-only result set, with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row-by-row. After this any result sets created with the statement will be retrieved row-by-row.

There are some caveats with this approach. You will have to read all of the rows in the result set (or close it) before you can issue any other queries on the connection, or an exception will be thrown.

The earliest the locks these statements hold can be released (whether they be MyISAM table-level locks or row-level locks in some other storage engine such as InnoDB) is when the statement completes.

If the statement is within scope of a transaction, then locks are released when the transaction completes (which implies that the statement needs to complete first). As with most other databases, statements are not complete until all the results pending on the statement are read or the active result set for the statement is closed.

Therefore, if using streaming results, you should process them as quickly as possible if you want to maintain concurrent access to the tables referenced by the statement producing the result set.

 

 

Filed under: Code, Performance

Java Memory Usage – Ints

As I work on another set of indexes with large numbers of ints I’ve revisited how different storage mechanisms behave.

I’ve set the JVM to just over 64MB (70MB to be precise).

An int being 4 bytes means that 64MB of ints is as follows:

       int sixtyFourMB = 64 * 1024 * 1024 / 4;

That is 16,777,216 ints.

An int array can successfully store 64MB in the JVM meaning it is properly assigning each int to 4 bytes without overhead.

      int[] vals = new int[sixtyFourMB];

However, if you use ArrayList<Integer> you run out of memory after 3,392,918 ints being added.

That is 1/5 of what can be stored in the int[]!

Now, assigning the space to ArrayList is fine:

                ArrayList<Integer> ints = new ArrayList<Integer>(sixtyFourMB);

It’s when you add the Integer objects (int is cast to Integer behind the scenes in JDK 5) that it dies 1/5 in.

So the overhead is obviously in the Integer object.

I next try the Trove library and use the TIntArrayList.

If I leave it to assign its memory space by default, it runs out of memory at 10,485,760 (62%).

If however I tell it how many ints to expect it sizes itself properly and does not try to grow too large and fits the entire 64MB.

     TIntArrayList ints = new TIntArrayList(sixtyFourMB);

Thus, when backed by an int[] the memory can be efficiently stored, but when using an Object[] memory is very inefficient when attempting to store primitive ints.

Out of curiosity for what an Object() really does with memory I did the following:

            ArrayList<Object> ints = new ArrayList<Object>(sixtyFourMB);

            for (i = 0; i < sixtyFourMB; i++) {

                ints.add(null);

            }

That works, it all fits in memory.

If however I change the “null” to “new Object()” such as:

                ints.add(new Object());

It fails at 702065 … which means that the object[] has already taken up the space and there is no room for the objects on the heap.

Thus, for storage of simple data primitive arrays is MUCH more efficient.

If the flexibility of the Collections API is needed for dynamic manipulation of those arrays then use GNU Trove … but carefully or the dynamic sizing of the arrays in the background can cause a lot of waste.

 

An old but interesting article specifies memory usage by normal objects and classes:

http://www.roseindia.net/javatutorials/determining_memory_usage_in_java.shtml

Here are two points from that article of note:

  1. The class takes up at least 8 bytes. So, if you say new Object(); you will allocate 8 bytes on the heap.
  2. Each data member takes up 4 bytes, except for long and double which take up 8 bytes. Even if the data member is a byte, it will still take up 4 bytes! In addition, the amount of memory used is increased in 8 byte blocks. So, if you have a class that contains one byte it will take up 8 bytes for the class and 8 bytes for the data, totalling 16 bytes (groan!).

Filed under: Code, Performance

Twitter Updates

View Ben Christensen's profile on LinkedIn
Follow

Get every new post delivered to your Inbox.