Ben J. Christensen

Software Development and Other Random Stuff

WebService Performance on EC2 Instances

As part of a cost/performance analysis of Amazon EC2 instances I wrote a very simple Java webapp to allow simple comparison benchmarking.

The webapp runs in Tomcat and has a servlet which returns a JSON response after looping to generate the JSON to cause CPU computation time as if the service were doing real work. It results in a lot of iteration and string creation activity – things web services do a lot of.

ApacheBenchmark (ab) was used as the test client to simulate traffic and capture statistics.

The instance types tested were m2.2xlargem2.4xlargecc1.4xlarge, and cc2.8xlarge.

The tests determined that the cc2.8xlarge instance type – though the most expensive by unit – is potentially the most cost effective type to use when serving large volume traffic involving a cluster of servers.

The cost to serve 1000 requests/second at a response time of approximately 18ms for each instance type is:

  • m2.2xlarge: $1.98
  • m2.4xlarge: $2.48
  • cc1.4xlarge: $1.36
  • cc2.8xlarge: $1.19

Of course, this test is very simplistic and ignores a wide variety of variables, but it demonstrates that the cc1 and cc2 boxes are well worth the time to evaluate for production application usage for achieving cost/performance efficiency.

The following graphs show the test results that were used to calculate the above costs.

Results per Second with Increasing Concurrency

This demonstrates how the 32 CPU cores on the cc2.8xlarge enable significantly higher throughput.

Time per Request (ms) with Increasing Concurrency

This shows how the response time degrades with increasing thread counts and how the larger boxes (particularly the cc2) scale better.

Spreadsheet

This screenshot of the spreadsheet shows the cost calculations at each level of concurrency.

The hi-lighted green is considered the optimal “ceiling” beyond which performance degrades and throughput plateaus. This point is considered the high-water-mark that can be used to determine the number of machines in a cluster needed to serve a given amount of traffic and thus allow comparison of different machines.

Price per 1000 Requests per Second with Increasing Concurrency

This scatter chart shows the cost to serve 1000 requests per second at increasing levels of concurrency on each instance type.

Overall, the performance of the cc1 and cc2 boxes are impressive and their cost/performance appears to be far better than the m2 instances.

The raw results of the tests can be found on Github.

Filed under: Infrastructure, Performance, ,

AtomicCircularArray Wins My Concurrency Throughput Test

I recently made several implementations of a RollingNumber (allowing a sum to be calculated over a 10 second period with 1 second increments with continual updates) to determine what would perform best under high-concurrency.

In my case, concurrency means 4-8000 writes/second from hundreds of threads on an 8-core machine and only a handful of reads per second.

In the end, a circular-array (using AtomicReferenceArray internally) won my throughput tests and also achieved my side goal of being non-blocking.

The code is on GitHub and the winning implementation is RollingNumberViaTryLockWithAtomicCircularArray.

Note that this code is NOT what I would deploy in production, since I forced each implementation to comply to the same interface so they could all be run by my test-harness.

However, a modified version of RollingNumberViaTryLockWithAtomicCircularArray is being run in production processing billions of writes per day.

Another aspect of the test is the use of ReentrantLock.tryLock() instead of a synchronized block. The tryLock() works well in this use-case, allowing only 1 thread to get the lock and routing all others around it instead of blocking them.

Test Results

Here are charts (higher is better) showing the differences in performance between implementations on two different machines:

MacBook Pro 2.2GHz Intel Core i7, 8GB Memory, SSD Drive, OSX Lion 10.7.1

$ uname -a
Darwin 11.1.0 Darwin Kernel Version 11.1.0: Tue Jul 26 16:07:11 PDT 2011; root:xnu-1699.22.81~1/RELEASE_X86_64 x86_64

$ java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03-383-11A511)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02-383, mixed mode)

The winner is the blue bar which is the highest.

Amazon EC2 m2.4xlarge
High-Memory Quadruple Extra Large Instance 68.4 GB of memory, 26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each), 1690 GB of local instance storage, 64-bit platform

$ uname -a
Linux 2.6.21.7-2.ec2.v1.3.fc8xen #1 SMP Sat Sep 25 01:16:50 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

$ /usr/java/latest/bin/java -version
java version "1.6.0_25"
Java(TM) SE Runtime Environment (build 1.6.0_25-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.0-b11, mixed mode)

The winner is the orange bar which is the highest.

Running the Test

You can run the test using an executable JAR with the command:

java -Xmx1g -jar RollingNumberThroughputTest.jar

The test also validates that the code does what it should and is able to find non-thread-safe implementations by looking for incorrect math caused by concurrency bugs.

Conclusion

I found it very interesting how differently the results were on the 2 machines (operating systems and CPUs are very different, the JVM implementations are also different). In both cases though the AtomicCircularArray performed better, far better on the EC2 instance running the Sun JVM.

These tests provided me with the data to choose AtomicCircularArray and as mentioned above it is similar to what I’m now using in production.

Filed under: Code, Performance

Statistics for Why Web Performance Matters

Following are some good posts with further evidence of the impact of webpage/webapp responsiveness to the user experience, amount of usage and ultimately revenue.

Good introductory quote to the external links and images shown below:

There’s no longer any debate. There’s reliable, reproducible evidence that web page latency is directly tied to the bottom line. At Velocity, Microsoft, Google and Shopzilla made this abundantly clear in a series of awesome presentations: detailed, controlled testing proves that slower pages hurt the bottom line. In Google’s case, adding delay reduces the average number of searches a visitor does each day even after the delay is removed.

Regarding “smaller scale” sites more typical than Google and Bing:

The results of their analysis show how significant a reduction in page latency can be. In addition to reducing bounce rates, and increasing pages per visit & time on site, they found a 16.07% increase in conversion rates and a 5.50% increase in average order value.http://radar.oreilly.com/2009/10/watching-websites.html

External Links:

http://radar.oreilly.com/2009/10/watching-websites.html
http://radar.oreilly.com/2009/07/velocity-making-your-site-fast.html
http://www.watchingwebsites.com/archives/proof-that-speeding-up-websites-improves-online-business

Good summary PDF showing metrics of performance impact.

conversion-rate-and-order-value

bing-delayimpact

Filed under: Performance, Production, User Interface

Initial Impressions on Ruby Performance

I’ve spent the past day playing with Ruby and decided to test some basic performance – iterating and string parsing – to get an idea of what the performance is really like.

Ruby is not doing so well in my tests.

Note to Ruby Experts: If anyone can demonstrate what I’m doing wrong in my code or testing, I would love to be corrected.

My approach was to write the exact same code in Java and Ruby that loads up a file, reads each lines, tokenizes it into words using whitespace as the delimiter and counts up the tokens.

This avoids network or database IO and other external resources – except the filesystem which I don’t consider a significant variable in the test.

Further testing I’m planning on doing will test Rails, multi-threading and other common things I do in my apps.

Picture 3

On my laptop, a MacBook Pro 2.53Ghz Core 2 Duo with 4GB memory, the average times are:

  • Ruby 1.8.6 app: 8022ms
  • Java 5 app: 2986ms
  • Java 6 app: 1443ms

Source Code

After downloading, strip the .doc ending off of the files.

Program Output

macbook-pro:src benjc$ /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/bin/java FileReadParse
Starting to read file…
The number of tokens is: 7764115
It took 1502 ms
macbook-pro:performance benjc$ ruby file_read_parse.rb
Starting to read file …
The number of tokens is: 7764115
It took 7999.955 ms

macbook-pro:src benjc$ /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/bin/java FileReadParse

Starting to read file…

The number of tokens is: 7764115

It took 1502 ms

macbook-pro:performance benjc$ ruby file_read_parse.rb

Starting to read file …

The number of tokens is: 7764115

It took 7999.955 ms


Filed under: Code, Performance

Speed of Thought

I’ve focused on performance for several years in my server-side and web application development – as much as I’ve been able to fit into the timelines. It has involved digging into minute details of Java and JVM tuning that rarely get explored by most java developers (from what I can tell anyways) and focusing on tuning the CSS, images, caching, GZIP and other settings of the front-end. It has generally paid off. Today my team operates servers processing millions of complex, dynamic, uncacheable web service transactions completing on average in around 250ms each (server side, not including network transport to client). I believe with further investment we could improve that even more.

I have read comments from companies such as Google and Amazon how the performance of an application can dramatically affect how much people use it. I agree. The slightest friction in searching makes me search less, or shop less, etc.

This past week I’ve been using the new iPhone 3GS which is at least 2x faster than the previous iPhone 2G I had. In some cases it’s 4x and 6x faster.

I already used the iPhone a lot. The increase in speed has further reduced the “friction” of use to the point that if I even have a thought of quickly looking something up or performing some other action, I am much more likely to do it.

On my last iPhone, I consciously chose to not bother at certain times because of the time it would take. Yes, I’m talking in seconds and even milliseconds here — but when it’s a “thought”, if the tool doesn’t work at the same speed, then it’s friction. Same goes for another application I use which involves looking up reference materials and documents. Before I kind of had to avoid “flipping around’ while someone was referring to things. It was actually faster to use the paper documents. Now, I can keep up or be faster with my iPhone than the paper version ‘users’. Therefore it encourages use.

The new user experience of using the iPhone 3GS, so significantly improved just by the performance improvement, has reminded me as a developer and architect how critical it is to design, plan for and develop to achieve high performance. Functionality isn’t enough — we should be aiming for the “speed of thought”.

Interestingly, Google has just launched a new site just for “speeding up the web“.

The following video shows “the experts” talking about how the human mind perceives changes of 100ms (one tenth of a second).

It’s my belief that this isn’t just a “nice to have” feature. If a product, service or application wants to be adopted and deemed “necessary” by its users, its performance must reduce friction as much as technically feasible to the point where it approaches or achieves “speed of thought”.

Filed under: Architecture, Performance, Production, User Interface

Java JDK 1.5 vs 1.6 Performance

In a benchmarking of our server software I found about a 34% speed improvement (if my math is correct) on using JDK 1.6.0_12 over JDK 1.5.0_17.

picture-8

Filed under: Performance

ConcurrentHashMap vs HashMap

A simple test of performance for ConcurrentHashMap.

The insert is only a little slower, but the retrieval is 3x slower in this quick and dirty single-threaded test.

 

TimeUtil timer = new TimeUtil();

        timer.start();

        HashMap map = new HashMap();

        for (int i = 0; i < 1000000; i++) {

            map.put(i, i);

        }

        timer.printCurrent();

        

        timer.start();

        ConcurrentHashMap cmap = new ConcurrentHashMap();

        for (int i = 0; i < 1000000; i++) {

            cmap.put(i, i);

        }

        timer.printCurrent();

        

 

        

        timer = new TimeUtil();

        timer.start();

        map = new HashMap();

        for (int i = 0; i < 1000000; i++) {

            map.get(i);

        }

        timer.printCurrent();

        

        timer.start();

        cmap = new ConcurrentHashMap();

        for (int i = 0; i < 1000000; i++) {

            cmap.get(i);

        }

        timer.printCurrent();

 

Action completed in: 458 ms 0.458 seconds

Action completed in: 574 ms 0.574 seconds

Action completed in: 30 ms 0.03 seconds

Action completed in: 113 ms 0.113 seconds

Filed under: Code, Performance

MySQL JDBC Memory Usage on Large ResultSet

I recently came across the problem of large resultsets being pulled into a java app via MySQL JDBC. I had dealt with this years ago but forgotten about it.

The test case below shows how the entire ResultSet is buffered in memory by default — which can be a “very bad thing” when dealing with hundreds or thousands of megabytes of data when it’s intended to be processed row by row.

Using mysql-connector-java-3.1.12-bin.jar and a JDK 5 with 32MB heap:

ET-COMMONS INFO: JVM MEMORY MONITOR => Total: 33  Used: 1  Free: 32

Retrieving data …

Ran out of memory at row: 0

java.lang.OutOfMemoryError: Java heap space

at com.mysql.jdbc.ByteArrayBuffer.getBytes(ByteArrayBuffer.java:128)

at com.mysql.jdbc.ByteArrayBuffer.readLenByteArray(ByteArrayBuffer.java:248)

at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1304)

at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2272)

at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:423)

at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:1962)

at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1385)

at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:1728)

at com.mysql.jdbc.Connection.execSQL(Connection.java:2988)

at com.mysql.jdbc.Connection.execSQL(Connection.java:2917)

at com.mysql.jdbc.Statement.executeQuery(Statement.java:824)

at JDBCTest.main(JDBCTest.java:26)

 

Using mysql-connector-java-5.1.6-bin.jar and the same JDK 5 with 32MB heap:

ET-COMMONS INFO: JVM MEMORY MONITOR => Total: 33  Used: 1  Free: 32

Retrieving data …

Ran out of memory at row: 0

java.lang.OutOfMemoryError: Java heap space

at com.mysql.jdbc.ByteArrayBuffer.getBytes(ByteArrayBuffer.java:128)

at com.mysql.jdbc.ByteArrayBuffer.readLenByteArray(ByteArrayBuffer.java:248)

at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1304)

at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2272)

at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:423)

at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:1962)

at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1385)

at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:1728)

at com.mysql.jdbc.Connection.execSQL(Connection.java:2988)

at com.mysql.jdbc.Connection.execSQL(Connection.java:2917)

at com.mysql.jdbc.Statement.executeQuery(Statement.java:824)

at JDBCTest.main(JDBCTest.java:26)

 

Thus we see that both the old and new versions of the MySQL JDBC driver by default attempt to load the entire resultset into memory.

 

I now increase the heap to 1GB to allow it to grow and find that the test query uses up > 500MB of heap before it even starts the rs.next() loop.

ET-COMMONS INFO: JVM MEMORY MONITOR => Total: 33  Used: 1  Free: 32

Retrieving data …

ET-COMMONS INFO: JVM MEMORY MONITOR => Total: 298  Used: 183  Free: 115

ET-COMMONS INFO: JVM MEMORY MONITOR => Total: 527  Used: 381  Free: 146

Starting to retrieve data. Memory Used: 517

Done retrieving data => 2318284   Memory Used: 551

 

Here is the code for this:

 

            ResultSet rs = conn.createStatement().executeQuery(“<sql query that returns lots of data>”);

            System.out.println(“Starting to retrieve data. Memory Used: “ + getUsedMemory());

            while (rs.next()) {

                rs.getString(1);

                rowsReturned++;

            }

            System.out.println(“Done retrieving data => “ + rowsReturned + ”   Memory Used: “  

+ getUsedMemory());

 

Thus you can see that the “executeQuery()” method loads up 500MB of data before it passes on the “rs.next()” loop. The full ResultSet is being buffered in memory.

 

Solution

To make the JDBC driver stream the results instead of buffer them all first we do the following:

            stmt.setFetchSize(Integer.MIN_VALUE);

Then we get this result instead:

ET-COMMONS INFO: JVM MEMORY MONITOR => Total: 33  Used: 1  Free: 32

Retrieving data …

Starting to retrieve data. Memory Used: 2

ET-COMMONS INFO: JVM MEMORY MONITOR => Total: 33  Used: 1  Free: 32

ET-COMMONS INFO: JVM MEMORY MONITOR => Total: 33  Used: 2  Free: 31

Done retrieving data => 2318284   Memory Used: 2

 

Now it behaves like we expect it to … only 2MB used instead of > 500MB.

 

There are some caveats:

  • http://javaquirks.blogspot.com/2007/12/mysql-streaming-result-set.html
  • http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html
In the second link of official documentation we read (emphasis in red added by myself):
———————————————————————-

ResultSet

By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate, and due to the design of the MySQL network protocol is easier to implement. If you are working with ResultSets that have a large number of rows or large values, and can not allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.

To enable this functionality, you need to create a Statement instance in the following manner:

stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
              java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

The combination of a forward-only, read-only result set, with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row-by-row. After this any result sets created with the statement will be retrieved row-by-row.

There are some caveats with this approach. You will have to read all of the rows in the result set (or close it) before you can issue any other queries on the connection, or an exception will be thrown.

The earliest the locks these statements hold can be released (whether they be MyISAM table-level locks or row-level locks in some other storage engine such as InnoDB) is when the statement completes.

If the statement is within scope of a transaction, then locks are released when the transaction completes (which implies that the statement needs to complete first). As with most other databases, statements are not complete until all the results pending on the statement are read or the active result set for the statement is closed.

Therefore, if using streaming results, you should process them as quickly as possible if you want to maintain concurrent access to the tables referenced by the statement producing the result set.

 

 

Filed under: Code, Performance

Java Memory Usage – Ints

As I work on another set of indexes with large numbers of ints I’ve revisited how different storage mechanisms behave.

I’ve set the JVM to just over 64MB (70MB to be precise).

An int being 4 bytes means that 64MB of ints is as follows:

       int sixtyFourMB = 64 * 1024 * 1024 / 4;

That is 16,777,216 ints.

An int array can successfully store 64MB in the JVM meaning it is properly assigning each int to 4 bytes without overhead.

      int[] vals = new int[sixtyFourMB];

However, if you use ArrayList<Integer> you run out of memory after 3,392,918 ints being added.

That is 1/5 of what can be stored in the int[]!

Now, assigning the space to ArrayList is fine:

                ArrayList<Integer> ints = new ArrayList<Integer>(sixtyFourMB);

It’s when you add the Integer objects (int is cast to Integer behind the scenes in JDK 5) that it dies 1/5 in.

So the overhead is obviously in the Integer object.

I next try the Trove library and use the TIntArrayList.

If I leave it to assign its memory space by default, it runs out of memory at 10,485,760 (62%).

If however I tell it how many ints to expect it sizes itself properly and does not try to grow too large and fits the entire 64MB.

     TIntArrayList ints = new TIntArrayList(sixtyFourMB);

Thus, when backed by an int[] the memory can be efficiently stored, but when using an Object[] memory is very inefficient when attempting to store primitive ints.

Out of curiosity for what an Object() really does with memory I did the following:

            ArrayList<Object> ints = new ArrayList<Object>(sixtyFourMB);

            for (i = 0; i < sixtyFourMB; i++) {

                ints.add(null);

            }

That works, it all fits in memory.

If however I change the “null” to “new Object()” such as:

                ints.add(new Object());

It fails at 702065 … which means that the object[] has already taken up the space and there is no room for the objects on the heap.

Thus, for storage of simple data primitive arrays is MUCH more efficient.

If the flexibility of the Collections API is needed for dynamic manipulation of those arrays then use GNU Trove … but carefully or the dynamic sizing of the arrays in the background can cause a lot of waste.

 

An old but interesting article specifies memory usage by normal objects and classes:

http://www.roseindia.net/javatutorials/determining_memory_usage_in_java.shtml

Here are two points from that article of note:

  1. The class takes up at least 8 bytes. So, if you say new Object(); you will allocate 8 bytes on the heap.
  2. Each data member takes up 4 bytes, except for long and double which take up 8 bytes. Even if the data member is a byte, it will still take up 4 bytes! In addition, the amount of memory used is increased in 8 byte blocks. So, if you have a class that contains one byte it will take up 8 bytes for the class and 8 bytes for the data, totalling 16 bytes (groan!).

Filed under: Code, Performance

Sort Performance

We did some simple testing on various sort algorithms we found and compared them with the Java Arrays.sort QuickSort implementation.

This is rudimentary testing since it’s not attempting to test all the variety of distributions of numbers that could exist and should be tested.

That being said, we are creating a collection of 50k, 500k, or 5million numbers randomly generated, and then cloning that collection to be sorted by each of the algorithms.

This first test had 1/2 the numbers as 4 digits, the other half as 5 digits meaning that in the larger collects, there should be a lot of duplicates.

The second test below had 1/2 the numbers as 4 digits, the other half as 8 digits.

This change in number of digits makes dramatic changes for FlashSort and completely breaks it.

Radix is just bad all the time … I have yet to figure out where it’s strength lies.

This is not thorough enough to make any conclusion, but I found it interesting nonetheless.

***Test 1 with 4 & 5 digit numbers***

—————————–
Run with 50k
—————————–
Arrays.sort(int) => 15ms
FlashSort.sort(int) => 11ms Valid: true
fastqs.sort(int) => 10ms Valid: true
eqsort.sort(int) => 16ms Valid: true
radix.sort(int) => 175ms Valid: true
radix.sort(int, 5) => 130ms Valid: true
—————————–
Run with 500k
—————————–
Arrays.sort(int) => 101ms
FlashSort.sort(int) => 56ms Valid: true
fastqs.sort(int) => 72ms Valid: true
eqsort.sort(int) => 96ms Valid: true
radix.sort(int) => 2164ms Valid: true
radix.sort(int, 5) => 1203ms Valid: true
—————————–
Run with 5million
—————————–
Arrays.sort(int) => 748ms
FlashSort.sort(int) => 717ms Valid: true
fastqs.sort(int) => 780ms Valid: true
eqsort.sort(int) => 1625ms Valid: true
radix.sort(int) => 82218ms Valid: true
radix.sort(int, 5) => 11387ms Valid: true
—————————–

***Test 1 with 4 & 8 digit numbers***

—————————–
Run with 50k
—————————–
Arrays.sort(int) => 14ms
FlashSort.sort(int) => 662ms Valid: true
fastqs.sort(int) => 10ms Valid: true
eqsort.sort(int) => 11ms Valid: true
radix.sort(int) => 294ms Valid: true
radix.sort(int, 8 ) => 225ms Valid: true
—————————–
Run with 500k
—————————–
Arrays.sort(int) => 91ms
FlashSort.sort(int) => 17832ms Valid: true
fastqs.sort(int) => 72ms Valid: true
eqsort.sort(int) => 95ms Valid: true
radix.sort(int) => 2066ms Valid: true
radix.sort(int, 8 ) => 2003ms Valid: true
—————————–
Run with 5million
—————————–
Arrays.sort(int) => 909ms
FlashSort.sort(int) => 197796ms Valid: true
fastqs.sort(int) => 825ms Valid: true
eqsort.sort(int) => 1494ms Valid: true
radix.sort(int) => 22957ms Valid: true
radix.sort(int, 8 ) => 20041ms Valid: true
—————————–

Filed under: Code, Performance

Twitter Updates

View Ben Christensen's profile on LinkedIn
Follow

Get every new post delivered to your Inbox.