Sort statistics – parallelism

By | November 7, 2013

This week I’ve been working on two specific algortihms in sorting, probably the two most famous ones: merge sort and bubble sort. Statistical analysis of the code is quite hard in some programming languages. We know for sure that Java (the language that I’ve been told to use) requires a lot of time for calling methods, creating variables and thus the overhead can be quite substantial for small datasets where the actual computation takes place almost instantly. Normally you can neglect all these overheads as you don’t care about the time it takes to compute something (usually because it’s so quick).

Now in my case it was necessary to have running times for a number of different sized datasets. The measurement is nanoseconds, something that I think a computer cannot actually mesure. But let’s believe that the Java library for it works well enough (and nevertheless most of my measurements are in the 10^-5s range, so plus 4 digits is mostly just junk. Merge sort is a recursive algorithm. This means that actually on small datasets this will be extremely slow – about 10 times slower than bubble sort. But this doesn’t matter really, it’s not what I’m talking about. It’s the overheads in general. I ran the tests and watched how my server breaks up into pieces as the bubble sort hits it (merge sort takes less than a second even for million number sets).

I noticed something very impressive. First, that java will never try by itself to run on multiple cores even if it has multiple things to do (like with mergesort). It always uses 100% of one core, but never anything on the other. Then I noticed something very strange. When I started looking at my processor’s usage in htop, I found that there were actually about 15-20 threads running the same java command! All of them ran on the same core though. Now this makes me angry. I wanted to write a sequential program, running on one core only so that my measurements are most accurate. And java, all by itself, created a number of zombie threads, all waiting for something. This makes the results of my tests slightly unreliable. Although it seems that the times are actually consistent enough to draw a conclusion.

Still, why does the JVM decide to run multiple threads when I haven’t written any code that would require that? I’m starting to dislike that computers today try to be smarter than the programmer. I’m not claiming that assembly would be better, I like highly abstracted programming languages such as Python and Java, but I feel that the ability to restrict the program to one thread should be there. Terrible post, I know.

Category: IT

Leave a Reply