New Parallelism APIs in Java 8: Behind The Glitz and Glamour

By Tal Weiss —  April 3, 2014 — 5 Comments


I’m a great multi-tasker. Even as I’m writing this post, I can still find room to feel awkward about a remark I made yesterday at a party that had everyone looking at me strangely. Well, the good news is I’m not alone – Java 8 is also pretty good at multi-tasking. Let’s see how.

One of the key new features introduced in Java 8 is parallel array operations. This includes things like the ability to sort, filter and group items using Lambda expressions that automatically leverage multi-core architectures. The promise here is to get an immediate performance boost with minimal effort from our end as Java developers. Pretty cool.

So the question becomes – how fast is this thing, and when should I use it? Well, the quick answer, sadly, is – it depends. Wanna know on what? Read on.

The new APIs

The new Java 8 parallel operation APIs are pretty slick. Let’s look at some of the ones we’ll be testing.

1. To sort an array using multiple cores, all you have to do is -


Arrays.parallelSort(numbers);

2. To group a collection into different groups based on a specific criterion (e.g. prime and non-prime numbers) -


Map<Boolean, List<Integer>> groupByPrimary = numbers
    .parallelStream().collect(Collectors.groupingBy(s -> Utility.isPrime(s)));

3. To filter out values, all you have to do is -


Integer[] primes = numbers.parallelStream().filter(s -> Utility.isPrime(s))
    .toArray(Integer[]::new);
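
For completeness, here’s roughly how these snippets fit together into a runnable example. The Utility.isPrime helper used above isn’t shown in the post, so a naive stand-in is included here:


import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.stream.Collectors;

public class ParallelDemo {

    // Naive stand-in for the Utility.isPrime helper assumed above
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int i = 2; i * i <= n; i++)
            if (n % i == 0) return false;
        return true;
    }

    public static void main(String[] args) {
        // 100K random integers between zero and a million
        int[] raw = new Random().ints(100_000, 0, 1_000_000).toArray();
        List<Integer> numbers = Arrays.stream(raw).boxed()
            .collect(Collectors.toList());

        // 1. Parallel sort
        Arrays.parallelSort(raw);

        // 2. Parallel grouping into primes and non-primes
        Map<Boolean, List<Integer>> groupByPrimary = numbers
            .parallelStream().collect(Collectors.groupingBy(s -> isPrime(s)));

        // 3. Parallel filtering
        Integer[] primes = numbers.parallelStream().filter(s -> isPrime(s))
            .toArray(Integer[]::new);

        System.out.println(groupByPrimary.get(true).size()
            + " primes grouped, " + primes.length + " in the filtered array");
    }
}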

Compare this with writing multi-threaded implementations yourself – quite the productivity boost! The thing I personally liked about this new architecture is the concept of Spliterators, which split a target collection into chunks that can be processed in parallel and then stitched back together. Just like their older brothers, iterators, which are used to go over a collection of items, this is a flexible architecture that lets you write custom behaviour for traversing and splitting collections, which you can plug in directly.
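
To make the Spliterator idea concrete, here’s a small sketch (not taken from the benchmarks below) of how trySplit carves a collection into chunks that separate threads could consume:


import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;

public class SpliteratorDemo {
    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);

        // Covers the whole list initially
        Spliterator<Integer> back = numbers.spliterator();

        // trySplit() hands a prefix of the elements to a new
        // Spliterator; each half could now be processed by a
        // different thread (trySplit may return null if the
        // source can't be split any further)
        Spliterator<Integer> front = back.trySplit();

        front.forEachRemaining(n -> System.out.println("front: " + n));
        back.forEachRemaining(n -> System.out.println("back: " + n));
    }
}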

So how does it perform?

To test this out, I examined how the parallel operations work under two scenarios – low and high contention. The reason is that running a multi-core algorithm by itself will usually yield pretty nice results. The kicker comes when it runs in a real-world server environment, where a large number of pooled threads are constantly vying for precious CPU cycles to process messages or user requests. That’s where things start slowing down.

For this I set up the following test: I randomized arrays of 100K integers with values ranging between zero and a million, then ran sort, group, and filter operations on them using both a traditional sequential approach and the new Java 8 parallelism APIs. The results were not surprising.

  • Quicksort is now 4.7x faster.
  • Grouping is now 5x faster.
  • Filtering is now 5.5x faster.

A happy ending? Unfortunately not.

[Table: sequential vs. parallel running times for the sort, group and filter operations]

* The results were consistent across an additional test that ran 100 times.
* The test machine was a quad-core i7 MacBook Pro.
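
For the curious, each measurement followed roughly the pattern below – a simplified sketch rather than the actual harness, which isn’t reproduced in this post (a real benchmark would also add JVM warm-up rounds and average many runs):


import java.util.Arrays;
import java.util.Random;

public class SortBenchmark {
    public static void main(String[] args) {
        int[] source = new Random().ints(100_000, 0, 1_000_000).toArray();

        int[] copy = source.clone();
        long start = System.nanoTime();
        Arrays.sort(copy);                  // traditional sequential sort
        long sequential = System.nanoTime() - start;

        copy = source.clone();
        start = System.nanoTime();
        Arrays.parallelSort(copy);          // new multi-core sort
        long parallel = System.nanoTime() - start;

        System.out.printf("sequential: %,dus  parallel: %,dus%n",
            sequential / 1_000, parallel / 1_000);
    }
}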

So what happens under load?

So far things have been quite peachy, the reason being that there’s little contention between threads for CPU cycles. That’s an ideal situation, but unfortunately one that doesn’t happen a lot in real life. To simulate a scenario more in line with what you’d normally see in a real-world environment, I set up a second test. It runs the same set of algorithms, but this time executes them on ten concurrent threads, to simulate a server handling ten concurrent requests when it’s under pressure (sing it Kermit!). Each of those requests is then handled either sequentially using a traditional approach, or in parallel using the new Java 8 APIs.
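
A sketch of how such a load test can be structured (an approximation, not the actual test code) – ten pooled threads, each running its own parallel sort and competing for the same cores:


import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ContentionTest {
    public static void main(String[] args) throws InterruptedException {
        int[] source = new Random().ints(100_000, 0, 1_000_000).toArray();

        // Ten concurrent "requests"; every parallelSort call also
        // fans out to the shared ForkJoinPool, so the workers all
        // end up fighting over the same handful of cores
        ExecutorService server = Executors.newFixedThreadPool(10);
        long start = System.nanoTime();
        for (int i = 0; i < 10; i++) {
            server.submit(() -> Arrays.parallelSort(source.clone()));
        }
        server.shutdown();
        server.awaitTermination(1, TimeUnit.MINUTES);
        System.out.printf("10 concurrent parallel sorts took %,dms%n",
            (System.nanoTime() - start) / 1_000_000);
    }
}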

The results

  • Sorting is now only 20% faster – a 23x decline in the speedup.
  • Filtering is now only 20% faster – a 25x decline.
  • Grouping is now 15% slower.

Higher scale and contention levels will most likely drive these numbers down even further. The reason is that adding threads to an already multi-threaded environment doesn’t help you: we’re only as good as the number of CPUs we have – not the number of threads.
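
You can check that ceiling yourself – parallel streams run on the common ForkJoinPool, which defaults to one worker per core, minus one, since the submitting thread pitches in:


import java.util.concurrent.ForkJoinPool;

public class CoreCount {
    public static void main(String[] args) {
        // The hard limit on useful CPU-bound parallelism
        System.out.println("cores: "
            + Runtime.getRuntime().availableProcessors());

        // Worker count of the pool backing parallel streams
        System.out.println("common pool parallelism: "
            + ForkJoinPool.commonPool().getParallelism());
    }
}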


Conclusions

While these are very strong and easy-to-use APIs, they’re not a silver bullet. We still need to apply judgment as to when to employ them. If you know in advance that you’ll be running multiple processing operations in parallel, it might be a good idea to use a queuing architecture to match the number of concurrent operations to the actual number of processors available to you. The hard part here is that run-time performance depends on the actual hardware architecture and levels of stress, and your code will most likely only see those during load testing or in production – making this a classic case of “easy to code, hard to debug”.
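
One simple way to implement that queuing idea, as a sketch (the permit count here is a knob to tune per machine, not a recommendation):


import java.util.Arrays;
import java.util.concurrent.Semaphore;

public class ThrottledSort {
    // Gate CPU-heavy parallel operations so excess requests wait
    // in line instead of fighting over cores; the permit count
    // below is purely illustrative
    private static final Semaphore PERMITS = new Semaphore(2);

    public static void sort(int[] data) throws InterruptedException {
        PERMITS.acquire();
        try {
            Arrays.parallelSort(data);
        } finally {
            PERMITS.release();
        }
    }
}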

Questions, comments? I would love to hear them in the comments section.



Tal Weiss


Tal is the CEO of Takipi. He has been designing scalable, real-time Java and C++ applications for the past 15 years, and still enjoys analyzing a good bug and instrumenting code. In his free time Tal plays jazz drums.
  • Jake Toronto

    This blog discusses another reason why parallelism might not be as effective:

    http://server.dzone.com/articles/think-twice-using-java-8

    It basically says that there is a fundamental weakness in Oracle’s design of threading that can prevent parallelism from being an improvement.

    Here’s another one:

    http://coopsoft.com/ar/Calamity2Article.html

    • Tal Weiss

      Hey Jake,

      Thanks for sharing these. Good stuff!

    • Ross Judson

      The IntArraySum code provided at the coopsoft.com site doesn’t do much more than demonstrate how well the JVM can optimize simple int summation. If you rework the code to sum the square roots of the array elements (instead of the array elements themselves), an entirely different picture emerges. I’ve altered some other parameters as well, but here’s a sampling of stable (over hundreds of runs) timings for the four different methods, in seconds. In this case, ThreadedSum is given the same number of threads as the default parallel threads (12 in my case).

      SequentialSum : 1.057732947
      ThreadedSum : 0.292155605
      StreamSum : 1.078417131
      ParallelStreamSum: 0.195790588

      You need to actually do some work in the computation.
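
      A rough sketch of that kind of comparison (not the actual code from the coopsoft.com site):

      import java.util.Random;
      import java.util.stream.IntStream;

      public class SqrtSum {
          public static void main(String[] args) {
              int[] data = new Random().ints(10_000_000, 0, 1_000).toArray();

              long start = System.nanoTime();
              double seq = IntStream.of(data).mapToDouble(Math::sqrt).sum();
              System.out.printf("sequential: %.3fs (%f)%n",
                  (System.nanoTime() - start) / 1e9, seq);

              start = System.nanoTime();
              double par = IntStream.of(data).parallel()
                  .mapToDouble(Math::sqrt).sum();
              System.out.printf("parallel: %.3fs (%f)%n",
                  (System.nanoTime() - start) / 1e9, par);
          }
      }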

  • Sébastien Tardif

    The system used has only 8 hardware threads, so the test is validating which approach behaves best in a bad configuration – one where we let more jobs enter the system than it was designed to support. That’s an invalid use case. If the test were only processing 8 jobs concurrently, it would be valid. In real life, as in a web environment, the contention is not on CPU; what the business cares about is not reaching the highest number of jobs that can be processed concurrently, but getting the fastest response time, and in all those cases the parallel JDK is a benefit. Limiting the number of “jobs”/“requests” allowed into a system is a basic concept that is very often neglected – as with the Obamacare web site.

    • Tal Weiss

      Hi Sébastien,

      Thanks for the comment. The danger I was pointing at is these APIs being used naively in a multi-threaded environment (such as a servlet container) where the number of active threads is high (the assumption containers make is that most of the time spent in a servlet is waiting on network / DB IO, in which case it makes sense to open up a lot of threads). In that case you run the risk of over-parallelizing compared to the number of cores available (these lambda operations and sorts are CPU-hungry), in which case, as we’ve seen, performance may actually be degraded. In such a case, as you rightly pointed out, it’s up to the developer to build a queuing scheme that throttles the number of concurrent requests, to maximize performance and avoid redundant context switches.
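
      One way to do that throttling, as a sketch: run the parallel stream inside a dedicated ForkJoinPool. This leans on the fact that parallel-stream work executes in the pool it was invoked from – an implementation detail of the Java 8 stream machinery rather than a documented guarantee:

      import java.util.concurrent.ForkJoinPool;
      import java.util.stream.IntStream;

      public class BoundedParallelism {
          public static void main(String[] args) throws Exception {
              // A private pool capped at 4 workers; the parallel
              // stream below runs inside it instead of draining
              // the shared common pool
              ForkJoinPool pool = new ForkJoinPool(4);
              long count = pool.submit(() ->
                  IntStream.range(0, 1_000_000).parallel()
                      .filter(n -> n % 3 == 0).count()
              ).get();
              System.out.println(count);
          }
      }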