Class Experiment11_PrimeThreads

java.lang.Object
net.kolotyluk.loom.Experiment11_PrimeThreads

public class Experiment11_PrimeThreads extends Object
TODO - Experiment with Java Flow

Experiment 3 - Prime Numbers

Calculating Prime Numbers is always fun, and it's most fun when we try to compute prime numbers as fast as possible, or as many at once as we can. As we will see, Project Loom offers us no benefits in this use case, but it's interesting to experiment and see why.

As I have mentioned before, with Concurrent Programming, higher levels of abstraction are generally better and safer because they have been designed and implemented by experts, and it's better to stand on the shoulders of giants. Generally, for CPU Bound use cases, the Stream Interface is a good place to start, as it has been specifically designed for such use cases.

Streams

The Streams Architecture has three important phases

Source
Something that generates a stream of objects, such as a range of numbers, in our case, numbers that could be prime
Pipeline
A sequence of Functions applied to each object in the stream; in our case, testing whether a number is prime, which leverages the Stream.filter(Predicate) function.
Sink
Ultimately, we need somewhere to Collect the Stream of objects. This is also known as *Terminating* the Stream.
Note: Unlike Akka Streams, Java Streams are not infinite; they must terminate, and indeed the Pipeline does not start until it is actually terminated with some collection function. Java Flow is more akin to Akka Streams, and Akka Streams actually uses the term Flow. Hopefully later we can see the effects of running Java Flows with Project Loom...
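
As a concrete illustration of Source, Pipeline, and Sink, here is a minimal sketch of a prime-finding pipeline. The class name and the trial-division isPrime(long) method are mine, standing in for the Primes.isPrime code these experiments actually use.

 import java.util.List;
 import java.util.stream.Collectors;
 import java.util.stream.LongStream;

 public class PrimeStreamSketch {

     // Simple trial-division primality test, a stand-in for the real Primes.isPrime
     static boolean isPrime(long candidate) {
         if (candidate < 2) return false;
         if (candidate % 2 == 0) return candidate == 2;
         for (long divisor = 3; divisor * divisor <= candidate; divisor += 2)
             if (candidate % divisor == 0) return false;
         return true;
     }

     public static void main(String[] args) {
         List<Long> primes = LongStream.rangeClosed(2, 1_000)  // Source: the candidate numbers
             .filter(PrimeStreamSketch::isPrime)               // Pipeline: keep only the primes
             .boxed()
             .collect(Collectors.toList());                    // Sink: terminate and collect
         System.out.println(primes.size() + " primes found");  // 168, since 2 is included here
     }
 }

Switching this pipeline from serial to parallel is a one-line change: insert .parallel() after rangeClosed(2, 1_000).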
There are only two hard things in Computer Science: cache invalidation and naming things.

— Phil Karlton

Threads

For the rest of the experiments, we can conclude that attempts to use Threads to do better than basic BaseStream.parallel() generally fail. We can also see that the code is substantially more complicated...

However, where threads really shine is when we are dealing with concurrency where there are a lot of blocking operations, such as transaction processing, network access, etc. Every blocking operation is an opportunity for some other task to run and progress.

For these experiments we wrap our primes experiments into a pseudo networking application, where we simulate farming isPrime() out to HTTP Endpoints. This simulation wraps the Primes.isPrime(long, long, long) calculation with network latency via Thread.sleep(long): one sleep to simulate Request Latency, another to simulate Response Latency.

The spirit of the experiments here is that we might have some application that is computationally expensive and takes a long time. While BaseStream.parallel() works just fine on limited scale, we might imagine a magic cloud that has some /isPrime endpoints where the work can be done faster, more efficiently, etc. Indeed, this is a classic MapReduce model. To be sure, if we really wanted to generate prime numbers in a High-Performance Computing style, we might use something like Apache Spark, but here we are trying to keep things simple, and make a point, not a perfect application.
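
Purely as an illustration of that simulation, here is a sketch of how the prime test might be wrapped with simulated latency; the class and method names are mine, and the actual Primes.isPrime(long, long, long) may differ in detail.

 import java.util.concurrent.ThreadLocalRandom;

 public class SimulatedNetworkSketch {

     // Trial-division primality test, standing in for the real computation
     static boolean isPrime(long candidate) {
         if (candidate < 2) return false;
         if (candidate % 2 == 0) return candidate == 2;
         for (long divisor = 3; divisor * divisor <= candidate; divisor += 2)
             if (candidate % divisor == 0) return false;
         return true;
     }

     // Hypothetical wrapper: sleep before the test to simulate HTTP Request latency,
     // and sleep after the test to simulate HTTP Response latency
     static boolean isPrimeOverSimulatedNetwork(long candidate, long minimumLag, long maximumLag)
             throws InterruptedException {
         var random = ThreadLocalRandom.current();
         Thread.sleep(random.nextLong(minimumLag, maximumLag + 1)); // simulated request lag
         var result = isPrime(candidate);                           // the actual computation
         Thread.sleep(random.nextLong(minimumLag, maximumLag + 1)); // simulated response lag
         return result;
     }

     public static void main(String[] args) throws InterruptedException {
         System.out.println(isPrimeOverSimulatedNetwork(101, 10, 30)); // true, after 20 to 60 ms of lag
     }
 }

Each Thread.sleep(long) is a blocking call, which is exactly the kind of opportunity a Thread scheduler can use to let other tasks run.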

Benchmarking

When trying to compare the performance of different Concurrent and Parallel approaches, it's important to conduct benchmarks, and one tool to use is the Java Microbenchmark Harness (JMH). However, it can take quite a long time for JMH to run benchmarks: by default, JMH runs 25 warmup iterations and 25 measurement iterations, then averages the results. For quicker results, the experiments in this code take a rather casual approach to get a feel for things, leaving the heavy lifting to JMH.
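
For reference, a JMH benchmark for this kind of comparison might look roughly like the sketch below; the class and method names are illustrative, not the actual benchmarks/PrimeNumbers code.

 import java.util.concurrent.TimeUnit;
 import java.util.stream.LongStream;
 import org.openjdk.jmh.annotations.*;

 // Average-time benchmarks comparing a serial and a parallel prime-counting stream
 @BenchmarkMode(Mode.AverageTime)
 @OutputTimeUnit(TimeUnit.MICROSECONDS)
 @State(Scope.Benchmark)
 public class PrimeNumbersSketch {

     static boolean isPrime(long candidate) {
         if (candidate < 2) return false;
         if (candidate % 2 == 0) return candidate == 2;
         for (long divisor = 3; divisor * divisor <= candidate; divisor += 2)
             if (candidate % divisor == 0) return false;
         return true;
     }

     @Benchmark
     public long serialPrimesTo_10_000() {
         return LongStream.rangeClosed(2, 10_000)
                 .filter(PrimeNumbersSketch::isPrime)
                 .count();
     }

     @Benchmark
     public long parallelPrimesTo_10_000() {
         return LongStream.rangeClosed(2, 10_000)
                 .parallel()
                 .filter(PrimeNumbersSketch::isPrime)
                 .count();
     }
 }

Returning the count from each benchmark method lets JMH consume the result, so the JIT compiler cannot optimize the computation away.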

Streams

See benchmarks/PrimeNumbers for the code that runs these JMH Benchmarks.

 Benchmark                                  Mode  Cnt        Score        Error
 PrimeNumbers.parallelPrimesTo_1000         avgt   25       46.620 ±      1.549
 PrimeNumbers.parallelPrimesTo_10_000       avgt   25      379.036 ±      7.604
 PrimeNumbers.parallelPrimesTo_10_000_000   avgt   25  1360362.823 ±  14723.084
 PrimeNumbers.serialPrimesTo_1000           avgt   25       32.998 ±      0.315
 PrimeNumbers.serialPrimesTo_10_000         avgt   25      644.467 ±     16.160
 PrimeNumbers.serialPrimesTo_10_000_000     avgt   25  8199848.272 ±  84514.067
 

where the Score is the average time in microseconds (μs) to complete the run. Given we only test odd numbers, we can derive the following throughput from these results, where throughput = tests per μs and the ratio compares each run's throughput to the corresponding run in the other mode (parallel vs. serial):

 Benchmark                                     tested  throughput  ratio
 PrimeNumbers.parallelPrimesTo_1000               500      10.725  0.708
 PrimeNumbers.parallelPrimesTo_10_000           5,000      13.191  1.700
 PrimeNumbers.parallelPrimesTo_10_000_000   5,000,000       3.675  6.025
 PrimeNumbers.serialPrimesTo_1000                 500      15.152  1.413
 PrimeNumbers.serialPrimesTo_10_000             5,000       7.758  0.588
 PrimeNumbers.serialPrimesTo_10_000_000     5,000,000       0.610  0.166
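
To make the arithmetic explicit, taking parallelPrimesTo_10_000 as an example: throughput = 5,000 tests / 379.036 μs ≈ 13.191 tests per μs, and ratio = 13.191 / 7.758 ≈ 1.700, comparing it against the corresponding serialPrimesTo_10_000 run.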
 

From this we can conclude

  • For smaller numbers of computations, O(1,000), the Serial Stream has more throughput than the Parallel Stream. This is because Parallel Streams require Concurrency, and Concurrency has overhead.
  • For larger numbers of computations, O(10,000), the Parallel Stream has more throughput than the Serial Stream. This is because the power of parallelism overcomes the overhead of concurrency.
  • For really large numbers of computations, O(10,000,000), the Parallel Stream really shines over the Serial Stream. In this use case, it is computationally more expensive to test large numbers for primality, so the dominant factor is raw CPU, and here the computation has been spread across 12 CPUs.
  • For O(10,000,000) computations we can see the throughput of the Parallel Stream is much lower than at O(10,000), but still much better than the O(10,000,000) Serial Stream, so the throughput sweet spot for the Parallel Stream in this use case is somewhere between O(10,000) and O(10,000,000). This is because it takes longer to test larger numbers for primality, so throughput decreases as the candidates grow. This is why it's important to benchmark real applications, and not just synthetic experiments. As we will see later, however, synthetic experiments do have their place.
  • As an aside, computing primes up to (excluding 2 itself, since only odd numbers are tested)

    • 1,000, there are 167 primes
    • 10,000, there are 1,228 primes
    • 10,000,000, there are 664,578 primes

Project Loom

Pure Computation

When I first started playing with these experiments I naïvely thought that using Project Loom could improve upon the previous benchmarks. Even though I had already watched Ron Pressler - Loom: Bringing Lightweight Threads and Delimited Continuations to the JVM, I had not really appreciated or internalized that knowledge. But by naïvely concocting my own experiments, I had empirical evidence that supported what Ron Pressler had already stated.

Without using JMH, playing around here I discovered

  • Running BaseStream.parallel() within a VirtualThread ExecutorService context did not improve throughput. In most cases it was a little worse.
  • Replacing Stream.parallel() with a VirtualThread ExecutorService that calls ExecutorService.invokeAll(Collection) has about 1/3 the throughput of Stream.parallel() for testing up to 10,000,000 primes.
  • Replacing Stream.parallel() with a VirtualThread ExecutorService that calls ExecutorService.submit(Callable) has about 1/2 the throughput of Stream.parallel(), which is better than ExecutorService.invokeAll(Collection).
  • Comparing Parallel Streams to Virtual Threads is sort of like comparing oranges to grapefruit, but in a sense they are both citrus. However, comparing ExecutorService.invokeAll(Collection) to ExecutorService.submit(Callable) is more like comparing navel oranges to mandarin oranges. While ExecutorService.invokeAll(Collection) looks more attractive, it might not perform as well as we would expect. This is still a naïve test, so it should be investigated further with more rigour and analysis. A rough sketch of this approach follows this list.
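
The following sketch shows the Virtual Thread ExecutorService approach with invokeAll(), in the spirit of primeThreads(long, ExecutorService); the class name and details are mine, not the actual experiment code.

 import java.util.List;
 import java.util.concurrent.Callable;
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;
 import java.util.stream.LongStream;

 public class VirtualThreadPrimesSketch {

     static boolean isPrime(long candidate) {
         if (candidate < 2) return false;
         if (candidate % 2 == 0) return candidate == 2;
         for (long divisor = 3; divisor * divisor <= candidate; divisor += 2)
             if (candidate % divisor == 0) return false;
         return true;
     }

     public static void main(String[] args) throws Exception {
         // A Virtual Thread per task; the ExecutorService is AutoCloseable in Loom builds
         try (ExecutorService executorService = Executors.newVirtualThreadPerTaskExecutor()) {
             // One Callable per odd candidate, all submitted at once via invokeAll()
             List<Callable<Long>> tasks = LongStream.rangeClosed(3, 10_000)
                     .filter(candidate -> candidate % 2 != 0)
                     .mapToObj(candidate ->
                         (Callable<Long>) () -> isPrime(candidate) ? Long.valueOf(candidate) : null)
                     .toList();
             long primeCount = executorService.invokeAll(tasks).stream()
                     .filter(future -> {
                         try { return future.get() != null; }
                         catch (Exception e) { return false; }
                     })
                     .count();
             System.out.println(primeCount + " primes found");
         }
     }
 }

For pure computation this spawns far more tasks than there are CPU cores, which is part of why it cannot beat Stream.parallel(); the same structure starts to pay off once each task blocks on simulated network latency.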
Simulated Networking

The bottom line is that, for raw computation, we should stick with the Java Streams API and Parallel Streams. However, how does Project Loom compare when simulating a Network Application, where we might be farming prime computations out to some HTTP Endpoint?

Using the same Primes.isPrime(long, long, long) code as the previous benchmarks, we call isPrime(candidate, minimumLag, maximumLag) with minimumLag = 10 ms and maximumLag = 30 ms, times 2, for a total of 20 ms minimum and 60 ms maximum, where the actual lag is random. We simulate network blocking overhead by using Thread.sleep(long) before the prime test and after the prime test, to simulate the HTTP Request overhead and the HTTP Response overhead. More importantly, these two sleep() calls give the Thread schedulers a chance to let other tasks run. Implicitly, there is also the actual lag of the isPrime(candidate) computation itself, which leads me to believe that this is actually a pretty good simulation.

 Benchmark                                 Mode  Cnt        Score        Error
 PrimeThreads.platformPrimesTo_1000        avgt   25      122.004 ±      4.888
 PrimeThreads.platformPrimesTo_10_000      avgt   25      986.577 ±     55.197
 PrimeThreads.platformPrimesTo_10_000_000  avgt   25  1043651.917 ± 139003.151
 PrimeThreads.virtualPrimesTo_1000         avgt   25       33.420 ±      1.142
 PrimeThreads.virtualPrimesTo_10_000       avgt   25       53.476 ±      0.610
 PrimeThreads.virtualPrimesTo_10_000_000   avgt   25    33058.254 ±   1171.181

 Benchmark                                     tested  throughput   ratio
 PrimeThreads.platformPrimesTo_1000               500       4.098   0.274
 PrimeThreads.platformPrimesTo_10_000           5,000       5.068   0.054
 PrimeThreads.platformPrimesTo_10_000_000   5,000,000       4.791   0.032
 PrimeThreads.virtualPrimesTo_1000                500      14.961   3.651
 PrimeThreads.virtualPrimesTo_10_000            5,000      93.500  18.449
 PrimeThreads.virtualPrimesTo_10_000_000    5,000,000     151.248  31.569


Based on the JMH results (after a time of 14:56:54), where the Score here is in milliseconds rather than microseconds:

  • I am really surprised I was able to have 5,000,000 Platform Threads running, but it took an incredibly long time
  • On a scale of 5,000,000 threads, Virtual Threads have 32 times the throughput of Platform Threads
  • Given how long my system was running with fans at full, I think Project Loom deserves bragging rights that it can also save energy. It would be interesting to run this benchmark again with my power meter to see how many Joules were consumed by each benchmark.
  • Many thanks to Ron Pressler, who responded to my email regarding benchmarking Project Loom, and patiently clarified my thinking on how to benchmark, and more importantly how to interpret and explain benchmarks.


    Author:
    eric@kolotyluk.net
    See Also:
    • Constructor Summary

      Constructors
      Constructor
      Description
       
    • Method Summary

      Modifier and Type
      Method
      Description
      static void
      main(String[] args)
       
      static List<Future<Long>>
      primeThreads(long limit, ExecutorService executorService)
       
      static List<Future<Long>>
      primeThreadsOld(long limit, ExecutorService executorService)
       
      static void
      suite3(long limit)
       
      static void
      suite3Old(long limit, ThreadFactory threadFactory)

      Methods inherited from class java.lang.Object

      clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Details

      • Experiment11_PrimeThreads

        public Experiment11_PrimeThreads()
    • Method Details

      • main

        public static void main(String[] args)
      • suite3

        public static void suite3(long limit)
      • suite3Old

        public static void suite3Old(long limit, ThreadFactory threadFactory)
         Hello Prime Threads
         PID = 31560
         CPU Cores = 12
         Heap Size = 17179869184
         ______________________________________________________________________________
        
         primes to 1,000
         primeThreads: threadMaximum = 12
         primeThreads: threadMaximum = 3
         primeThreads: threadMaximum = 1
         primeThreads: threadMaximum = 3
         virtualCachedThreadPool          = 72, 67, 5
         virtualFixedThreadPool           = 6, 5, 1
         virtualSingleThreadTaskExecutor  = 4, 4, 0
         virtualThreadPerTaskExecutor     = 18, 18, 0
        
         primes to 10,000
         primeThreads: threadMaximum = 3
         primeThreads: threadMaximum = 3
         primeThreads: threadMaximum = 1
         primeThreads: threadMaximum = 3
         virtualCachedThreadPool         = 78, 76, 2
         virtualFixedThreadPool          = 15, 14, 1
         virtualSingleThreadTaskExecutor = 9, 8, 1
         virtualThreadPerTaskExecutor    = 36, 35, 1
        
         primes to 10,000,000
         primeThreads: threadMaximum = 11
         primeThreads: threadMaximum = 11
         primeThreads: threadMaximum = 1
         primeThreads: threadMaximum = 12
         virtualCachedThreadPool         = 14338, 14167, 171
         virtualFixedThreadPool          = 3162, 3071, 91
         virtualSingleThreadTaskExecutor = 9349, 9297, 52
         virtualThreadPerTaskExecutor    = 4124, 3939, 185
        
         primes to 50,000,000
         primeThreads: threadMaximum = 12
         primeThreads: threadMaximum = 12
         primeThreads: threadMaximum = 1
         primeThreads: threadMaximum = 12
         virtualCachedThreadPool         = 96528, 95790, 738
         virtualFixedThreadPool          = 18029, 17619, 410
         virtualSingleThreadTaskExecutor = 81046, 80751, 295
         virtualThreadPerTaskExecutor    = 40608, 39787, 821
         
        Parameters:
        limit -
        threadFactory -
      • primeThreads

        public static List<Future<Long>> primeThreads(long limit, ExecutorService executorService)
      • primeThreadsOld

        public static List<Future<Long>> primeThreadsOld(long limit, ExecutorService executorService)