Class Experiment02_Throughput
Experiment Suite 02 - Throughput
Consider these results, obtained via the Java Microbenchmark Harness (JMH) in BasicThroughput, which we analyze below.
Benchmark                                          Runs/second     Error
baselineStream1To_10                               7687469.693 ± 224814.056
baselineStream1To_100                              1104471.967 ± 27214.069
baselineStream1To_1000                             114529.797 ± 3105.523
baselineStream1To_10000                            11468.640 ± 65.506
baselineStream2To_10                               581039.445 ± 2984.074
baselineStream2To_100                              48674.207 ± 119.580
baselineStream2To_1000                             5359.420 ± 17.450
baselineStream2To_10000                            534.121 ± 1.441
parallelStream1To_10                               70921.902 ± 364.130
parallelStream1To_100                              43634.949 ± 308.947
parallelStream1To_1000                             40419.203 ± 149.770
parallelStream1To_10000                            14114.597 ± 152.243
parallelStream2To_10                               69896.785 ± 656.540
parallelStream2To_100                              30738.958 ± 162.498
parallelStream2To_1000                             13686.162 ± 105.587
parallelStream2To_10000                            2799.335 ± 21.269
structuredPlatformThreads1To_10                    522.304 ± 142.136
structuredPlatformThreads1To_100                   56.890 ± 15.478
structuredPlatformThreads1To_1000                  5.908 ± 1.781
structuredPlatformThreads1To_10000                 0.584 ± 0.141
structuredPlatformThreads2To_10                    522.890 ± 118.930
structuredPlatformThreads2To_100                   57.186 ± 18.233
structuredPlatformThreads2To_1000                  5.994 ± 1.816
structuredPlatformThreads2To_10000                 0.589 ± 0.175
structuredVirtualThread2sTo_1000                   1755.882 ± 86.524
structuredVirtualThreads1To_10                     5595.565 ± 221.392
structuredVirtualThreads1To_100                    4019.994 ± 104.544
structuredVirtualThreads1To_1000                   1725.087 ± 47.855
structuredVirtualThreads1To_10000                  231.667 ± 4.741
structuredVirtualThreads2To_10                     5738.564 ± 303.156
structuredVirtualThreads2To_100                    4099.271 ± 121.266
structuredVirtualThreads2To_10000                  230.352 ± 5.745
transactionalBaselineStream1To_10                  47.043 ± 1.342
transactionalBaselineStream1To_100                 4.718 ± 0.112
transactionalBaselineStream1To_1000                0.461 ± 0.014
transactionalBaselineStream1To_10000               0.048 ± 0.006
transactionalBaselineStream2To_10                  46.969 ± 0.969
transactionalBaselineStream2To_100                 4.731 ± 0.773
transactionalBaselineStream2To_1000                0.462 ± 0.015
transactionalBaselineStream2To_10000               0.077 ± 0.007
transactionalParallelStream1To_10                  452.607 ± 47.752
transactionalParallelStream1To_100                 50.794 ± 2.796
transactionalParallelStream1To_1000                5.299 ± 0.192
transactionalParallelStream1To_10000               0.521 ± 0.004
transactionalParallelStream2To_10                  362.114 ± 7.574
transactionalParallelStream2To_100                 50.556 ± 1.149
transactionalParallelStream2To_1000                5.246 ± 0.079
transactionalParallelStream2To_10000               0.521 ± 0.011
transactionalStructuredPlatformThreads1To_10       228.233 ± 59.875
transactionalStructuredPlatformThreads1To_100      48.290 ± 11.494
transactionalStructuredPlatformThreads1To_1000     5.391 ± 1.313
transactionalStructuredPlatformThreads1To_10000    0.564 ± 0.133
transactionalStructuredPlatformThreads2To_10       235.796 ± 68.204
transactionalStructuredPlatformThreads2To_100      50.826 ± 14.387
transactionalStructuredPlatformThreads2To_1000     5.640 ± 1.552
transactionalStructuredPlatformThreads2To_10000    0.561 ± 0.113
transactionalStructuredVirtualThreads1To_10        62.805 ± 0.372
transactionalStructuredVirtualThreads1To_100       62.546 ± 2.281
transactionalStructuredVirtualThreads1To_1000      62.873 ± 0.582
transactionalStructuredVirtualThreads1To_10000     47.135 ± 16.391
transactionalStructuredVirtualThreads2To_10        62.818 ± 0.159
transactionalStructuredVirtualThreads2To_100       61.564 ± 7.955
transactionalStructuredVirtualThreads2To_1000      59.581 ± 1.140
transactionalStructuredVirtualThreads2To_10000     51.455 ± 13.161
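The benchmark names appear to encode the task type (1 for the cheaper doubling task, 2 for the more expensive primality task) and the upper limit of the range being processed (To_10 through To_10000). As a rough, hypothetical sketch, assuming BasicThroughput is a JMH class that measures throughput by delegating to accessible static members of this class (whether they are static and public is an assumption here), one benchmark method might look like this:

    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;

    // Hypothetical sketch only; the real benchmarks live in BasicThroughput.
    @BenchmarkMode(Mode.Throughput)        // report runs/second, as in the table above
    @OutputTimeUnit(TimeUnit.SECONDS)
    public class BasicThroughputSketch {

        @Benchmark
        public void baselineStream1To_1000() {
            // Task 1 (doubling) over a stream limited to 1,000 elements.
            Experiment02_Throughput.baselineStream(Experiment02_Throughput.doubleIt, 1_000);
        }
    }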
For all of these benchmarks, identical tasks were run; there were two types of task.
- A simple computation of doubling a value.
- A complex computation of testing if a number is prime.
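These tasks are exposed as the LongFunction<Long> fields listed in the Field Summary below (doubleIt and isPrime, plus their transactional variants). Their real implementations are not shown on this page, so the following is only a minimal sketch of what the two computational tasks might look like:

    import java.util.function.LongFunction;

    // Hypothetical sketches; the real doubleIt and isPrime fields may differ in detail.
    public class TaskSketch {

        // Task 1: a simple computation, doubling a value.
        public static final LongFunction<Long> doubleIt = value -> value * 2;

        // Task 2: a more expensive computation, testing whether a number is prime
        // by trial division (1 = prime, 0 = not prime).
        public static final LongFunction<Long> isPrime = value -> {
            if (value < 2) return 0L;
            for (long divisor = 2; divisor * divisor <= value; divisor++) {
                if (value % divisor == 0) return 0L;
            }
            return 1L;
        };
    }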
Parallel vs Sequential
The first thing we should notice is baselineStream1... vs baselineStream2..., where the Throughput λ of baselineStream1... is always higher because the Wait W of each task is lower. The same is true of parallelStream1... vs parallelStream2..., for the same reason.
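Assuming λ and W here are the Little's Law terms from the lexicon (throughput and the wait, or service time, per task), the relationship is:

    L = λ × W,  so  λ = L / W

That is, for a fixed level of concurrency L, a lower per-task wait W yields a proportionally higher throughput λ, which is the pattern we see when the cheap doubling task (1) replaces the expensive primality task (2).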
The other impression we should have is that Parallel Streams are capable of better throughput than Sequential Streams, but there is a substantial overhead in running Parallel Streams, so thoughtful design backed by performance testing, such as these benchmarks, is useful in understanding where the benefits of Parallelism kick in (see the sketch following this list).
- When running a low W task, baselineStream1To_1000 has better throughput than parallelStream1To_1000, but parallelStream1To_10000 has better throughput than baselineStream1To_10000, so somewhere between a limit of 1,000 and 10,000 Parallel Streams start to get better throughput than Sequential Streams.
- When running a higher W task, parallelStream2To_1000 already has better throughput than baselineStream2To_1000, so when W is higher, parallelism helps us sooner because the overhead of concurrency is less dominant than the cost of each task.
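As a minimal, hypothetical sketch of the difference being measured here (the real baselineStream and parallelStream methods of this class take the same LongFunction<Long> task and long limit parameters, but their bodies may differ from this):

    import java.util.List;
    import java.util.function.LongFunction;
    import java.util.stream.LongStream;

    // Hypothetical sketches only.
    public class StreamSketch {

        // Sequential baseline: a single thread walks the whole range.
        public static List<Long> baselineStream(LongFunction<Long> task, long limit) {
            return LongStream.range(0, limit)
                    .mapToObj(task::apply)
                    .toList();
        }

        // Parallel variant: the only change is parallel(), which splits the work across
        // the common ForkJoinPool at the cost of extra coordination overhead.
        public static List<Long> parallelStream(LongFunction<Long> task, long limit) {
            return LongStream.range(0, limit)
                    .parallel()
                    .mapToObj(task::apply)
                    .toList();
        }
    }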
Parallelism is strictly an optimization
— Brian Goetz
Computational vs Transactional
The Stream API is an excellent framework for computational tasks. In particular, BaseStream.parallel() is a simple way to optimize purely computational tasks that are independent. By purely computational I mean tasks that do not call any blocking/transactional APIs. By transactional I mean any API, such as Thread.sleep(Duration), that blocks the task so that it cannot be scheduled for execution until the transaction has completed.
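The transactional benchmarks below simulate such a transaction with a 1 ms Thread.sleep per task. A hypothetical sketch, assuming the transactional tasks simply wrap the computational ones:

    import java.time.Duration;
    import java.util.function.LongFunction;

    // Hypothetical sketch: a transactional task is a computational task that also blocks
    // for 1 ms, simulating something like a database call or network request.
    public class TransactionalTaskSketch {

        public static LongFunction<Long> transactionally(LongFunction<Long> task) {
            return value -> {
                try {
                    Thread.sleep(Duration.ofMillis(1));   // the "transaction"
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                return task.apply(value);
            };
        }

        // e.g. doubleItTransactionally could be transactionally(doubleIt)
    }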
Half of the benchmarks are purely computational, while the other half are transactional. The first impression we should have is that Streams are good for purely computational tasks but not for transactional tasks, while the opposite is largely true for Platform Threads and Virtual Threads.
- In the case of pure computation, at no time in these benchmarks did Virtual Threads perform better than Parallel Streams.
- In the case of transactions (a 1 ms Thread.sleep per task), Streams and Parallel Streams only perform well when the number of tasks is low, as do Platform Threads. It is interesting to note that, in these experiments, Platform Threads seem to perform similarly to Parallel Streams. 🤔
Platform Threads vs Virtual Threads
The most important thing we should note is how much better Virtual Threads generally perform than Platform Threads. The other striking thing in the transactional results is that Virtual Threads are much more balanced and predictable in their throughput than Platform Threads under different levels of concurrency. When we are designing and implementing concurrent applications, balance and predictability are qualities we want to embrace.
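The structuredThreads benchmarks drive both cases through one method, structuredThreads(LongFunction<Long> task, long limit, ThreadFactory threadFactory), so the only difference between the Platform Thread and Virtual Thread runs is the ThreadFactory passed in. A hypothetical sketch of such a method, assuming the JDK 21 preview StructuredTaskScope API (this API changed between JDK releases, and the real implementation may differ):

    import java.util.List;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.StructuredTaskScope;
    import java.util.concurrent.ThreadFactory;
    import java.util.function.LongFunction;
    import java.util.stream.LongStream;

    public class StructuredSketch {

        public static List<Long> structuredThreads(LongFunction<Long> task, long limit,
                                                   ThreadFactory threadFactory)
                throws InterruptedException, ExecutionException {
            try (var scope = new StructuredTaskScope.ShutdownOnFailure("sketch", threadFactory)) {
                // Fork one subtask per value; the ThreadFactory decides whether each
                // subtask runs on a Platform Thread or a Virtual Thread.
                var subtasks = LongStream.range(0, limit)
                        .mapToObj(i -> scope.fork(() -> task.apply(i)))
                        .toList();
                scope.join().throwIfFailed();   // wait for all subtasks, propagate failures
                return subtasks.stream()
                        .map(StructuredTaskScope.Subtask::get)
                        .toList();
            }
        }

        // Usage: only the factory changes between the two families of benchmarks.
        //   structuredThreads(task, 1_000, Thread.ofPlatform().factory());  // structuredPlatformThreads...
        //   structuredThreads(task, 1_000, Thread.ofVirtual().factory());   // structuredVirtualThreads...
    }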
Also
For clarification on the distinction between Concurrency and Parallelism, see Project Loom Brings Structured Concurrency - Inside Java Newscast #17. For better context overall, check out the lexicon and loom advantages.
It is worth noting that these benchmarks were run on an Intel Xeon W3680 @ 3.33 GHz, with 6 Cores, 12 Hardware Threads, and 24 GB RAM. Indeed, on benchmarks where the demand was less than 12, the Hardware Threads would be underutilized.
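For reference, the parallelism available to the common ForkJoinPool, which Parallel Streams use, can be checked as follows; by default it is one less than the number of available processors:

    import java.util.concurrent.ForkJoinPool;

    public class ParallelismCheck {
        public static void main(String[] args) {
            // Hardware threads visible to the JVM (12 on the machine described above).
            System.out.println("availableProcessors    = " + Runtime.getRuntime().availableProcessors());
            // Default worker count of the common pool used by parallel streams.
            System.out.println("commonPool parallelism = " + ForkJoinPool.commonPool().getParallelism());
        }
    }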
- Author:
- eric@kolotyluk.net
Nested Class Summary

Field Summary
Modifier and Type           Field                      Description
static LongFunction<Long>   doubleIt
static LongFunction<Long>   doubleItTransactionally
static LongFunction<Long>   isPrime
static LongFunction<Long>   isPrimeTransactionally

Constructor Summary
Constructor
Experiment02_Throughput()

Method Summary
Modifier and Type           Method                                                          Description
                            baselineStream(LongFunction<Long> task, long limit)
static void                 main
                            parallelStream(LongFunction<Long> task, long limit)
static List<Long>           structuredThreads(LongFunction<Long> task, long limit, ThreadFactory threadFactory)
Field Details
- doubleIt
- doubleItTransactionally
- isPrime
- isPrimeTransactionally
Constructor Details
- Experiment02_Throughput
  public Experiment02_Throughput()
Method Details
- main
- baselineStream
- parallelStream
- structuredThreads
  public static List<Long> structuredThreads(LongFunction<Long> task, long limit, ThreadFactory threadFactory)