Ben J. Christensen

Software Development and Other Random Stuff

Sort Performance

We did some simple testing on various sort algorithms we found and compared them with the Java Arrays.sort QuickSort implementation.

This is rudimentary testing since it’s not attempting to test all the variety of distributions of numbers that could exist and should be tested.

That being said, we are creating a collection of 50k, 500k, or 5million numbers randomly generated, and then cloning that collection to be sorted by each of the algorithms.

This first test had 1/2 the numbers as 4 digits, the other half as 5 digits meaning that in the larger collects, there should be a lot of duplicates.

The second test below had 1/2 the numbers as 4 digits, the other half as 8 digits.

This change in number of digits makes dramatic changes for FlashSort and completely breaks it.

Radix is just bad all the time … I have yet to figure out where it’s strength lies.

This is not thorough enough to make any conclusion, but I found it interesting nonetheless.

***Test 1 with 4 & 5 digit numbers***

—————————–
Run with 50k
—————————–
Arrays.sort(int) => 15ms
FlashSort.sort(int) => 11ms Valid: true
fastqs.sort(int) => 10ms Valid: true
eqsort.sort(int) => 16ms Valid: true
radix.sort(int) => 175ms Valid: true
radix.sort(int, 5) => 130ms Valid: true
—————————–
Run with 500k
—————————–
Arrays.sort(int) => 101ms
FlashSort.sort(int) => 56ms Valid: true
fastqs.sort(int) => 72ms Valid: true
eqsort.sort(int) => 96ms Valid: true
radix.sort(int) => 2164ms Valid: true
radix.sort(int, 5) => 1203ms Valid: true
—————————–
Run with 5million
—————————–
Arrays.sort(int) => 748ms
FlashSort.sort(int) => 717ms Valid: true
fastqs.sort(int) => 780ms Valid: true
eqsort.sort(int) => 1625ms Valid: true
radix.sort(int) => 82218ms Valid: true
radix.sort(int, 5) => 11387ms Valid: true
—————————–

***Test 1 with 4 & 8 digit numbers***

—————————–
Run with 50k
—————————–
Arrays.sort(int) => 14ms
FlashSort.sort(int) => 662ms Valid: true
fastqs.sort(int) => 10ms Valid: true
eqsort.sort(int) => 11ms Valid: true
radix.sort(int) => 294ms Valid: true
radix.sort(int, 8 ) => 225ms Valid: true
—————————–
Run with 500k
—————————–
Arrays.sort(int) => 91ms
FlashSort.sort(int) => 17832ms Valid: true
fastqs.sort(int) => 72ms Valid: true
eqsort.sort(int) => 95ms Valid: true
radix.sort(int) => 2066ms Valid: true
radix.sort(int, 8 ) => 2003ms Valid: true
—————————–
Run with 5million
—————————–
Arrays.sort(int) => 909ms
FlashSort.sort(int) => 197796ms Valid: true
fastqs.sort(int) => 825ms Valid: true
eqsort.sort(int) => 1494ms Valid: true
radix.sort(int) => 22957ms Valid: true
radix.sort(int, 8 ) => 20041ms Valid: true
—————————–

Filed under: Code, Performance

ARL Radix Sort

Very cool whitepaper on ARL Radix Sort showing how it’s better than Quicksort and normal Radix sort.

http://www.nik.no/2002/Maus.pdf

… (April 2nd) … never was able to find a working implementation to try out though …

Filed under: Code, Performance

High Performance Java Collections

I’ve been using Gnu Trove collections for years very happily, but someone just sent me a link to something new …

http://fastutil.dsi.unimi.it/

It looks quite interesting to go try out.

Another related project by the same guys is: http://mg4j.dsi.unimi.it/

Filed under: Code, Performance

Double Check Locking

Recently re-visited the double-check-locking pattern (or anti-pattern).

It doesn’t work in JDK 1.4 and earlier … however, in JDK 5, it’s updated memory model allows it to be used safely.

Wikipedia – Double Check Locking

Sun Blog on JDK 5 and Double Check Locking

Another possible solution that has the benefit of enforcing compilation ONLY with JDK 1.5 and later is a use of something called AtomicReference.

I haven’t played with them yet, but some information can be found at:

Introduction to Non-Blocking Algorithms

Filed under: Code

32-bit versus 64-bit JDK Memory Usage

Summary of analysis performed September 2006.

Testing Strategy

To ensure the tests performed were reliable and behaved the same every time with as few variables as possible, a simple program (a single java class) was written which loops indefinitely, adding objects to a collection until it runs out of memory.

This test was performed with both Integer and String objects, and then executed with 3 different JVMs with varying heap settings.

The results and details of these tests are documented in later chapters.

JVMs tested:
- JDK 1.4.2 32-bit
- JDK 1.5 32-bit
- JDK 1.5 64-bit

Operating Systems Tested
- Solaris 10 on Opterons
- Suse Linux 10 on Opterons

The tests are far from exhaustive, but were enough to derive numbers and patterns whereby the JVMs can be sized accurately and recommendations made.

Findings

- Tests performed consistently on both Linux and Solaris (except for maximum heap sizes)
- JDK 5 64-bit takes between 40% – 50% more memory on average than either 32-bit JVM
- JDK 1.4 32-bit is slightly more memory efficient than JDK 5 32-bit
o It is slightly better than JDK 5 32-bit in Integer tests
o It is exactly the same as JDK 5 32-bit in String tests

- Solaris 10 can run a 32-bit JVM up to 3.5GB
- Suse Linux can run a 32-bit JVM up to 2GB
- A 64-bit JVM was successfully run at 20GB on Suse Linux
- A 2GB 32-bit JVM is approximately equivalent to a 3GB 64-bit JVM in number of objects stored
o ie. On Suse Linux, if more than 2GB heap is needed, then one must jump to a 3GB 64-bit JVM to achieve the same amount of storage
o ie. On Solaris, if more than 3.5GB heap is needed, then one must jump to a 5GB 64-bit JVM to achieve the same amount of storage

Recommendations

Based upon these findings, it is recommended that JDK 5 32-bit be used as the default JVM, and 64-bit used if more memory is needed than what can be allocated to a 32-bit JVM with the understanding of the ratio one must increase to account for the change to a 64-bit data model.

The reasons for choosing JDK 5 over JDK 1.4 are:

- JDK 5 is the current recommended production JVM from Sun Microsystems
o It has been out for more than 2 years and is at its 8th maintenance release (JDK 1.5_08)

- General performance improvements from JDK 1.4 to JDK 5
o http://java.sun.com/j2se/1.5.0/docs/guide/performance/speed.html
o http://java.sun.com/j2se/1.5.0/docs/guide/vm/index.html

- Improved garbage collection algorithms, particularly for multi-processor machines
o http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html

- 64-bit JVM available on AMD Opterons when larger heap sizes required (not available for JDK 1.4)
o http://java.sun.com/j2se/1.5.0/docs/relnotes/features.html#platform_proc64

Summary

The tests performed confirmed that the 64-bit JVM does indeed use more memory than the 32-bit JVMs, but that it behaves consistently and can be calculated, and therefore planned for.
It is recommended to use JDK 5 32-bit as the default JVM, and the 64-bit variant when larger heap sizes are required.

Java Source File: MemoryTest.java

Filed under: Code, Performance

Keyword Search – Spelling Suggestions

Here are some valuable links related to how to provide spelling suggestions in keyword search.

Here is a commercial option in Java if Jazzy isn’t good enough …

And some interesting links on N-Grams that Lucene and Google use:

Filed under: Code

C3P0 Validation

We were getting random broken connections … the following properties got rid of them.

c3p0.testConnectionOnCheckout=true
c3p0.preferredTestQuery=select 1

See Adnan’s site for more information on C3P0.

Filed under: Code

Example of Manually Doing a GZIP HTTP Request

GET /inQuireWeb/pages/home.jsp HTTP/1.1
Host: inquire.etilize.com
Connection: close
Accept-Encoding: gzip
Accept: text/xml,application/xml,application/xhtml+xml,text/html;
Accept-Language: en-us,en;q=0.5
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7

Filed under: Code

Databases in appserver environments …

Jisql is the only way to survive in a server environment — on a server with no client access to the db, except for the app you’re trying to debug …

Use Jisql from the command line :-)

Filed under: Code

Add All New Files with SVN

Add this to .bash_profile for a quick command line for adding all new files in a working directory.

svn_add_all(){
svn status | grep “^?” | awk ‘{print $2}’ | xargs svn add
}

Filed under: Code

Twitter Updates

View Ben Christensen's profile on LinkedIn
Follow

Get every new post delivered to your Inbox.