Microbenchmarking and Performance Tests in Scala: JMH and ScalaMeter

In many contexts, micro-benchmarking and performance are not the main concern, and other considerations should drive a developer's implementation choices. That's not always the case, though. This post shares and illustrates a sort of Scala project template that I put together and found quite useful in a few situations where micro-benchmarking and performance might be a concern (or simply a legitimate curiosity). Broadly speaking, in Scala there are two viable approaches: JMH (Java Microbenchmark Harness) and ScalaMeter; if you are interested, a discussion about their pros and cons and their place in the Scala ecosystem is available here. Note that JMH is part of the OpenJDK project, but that doesn't mean benchmarks have to run on OpenJDK: as a matter of fact, all the examples presented in this post were executed on an Oracle JDK. NOTE: all the code is available here.

Settings and dependencies

For JMH we can use sbt-jmh, an sbt plugin, so I added the following line to project/plugins.sbt:

addSbtPlugin("pl.project13.scala" % "sbt-jmh" % "0.3.4")

And then we set the dependencies in build.sbt:

import sbt.Keys._

lazy val root = (project in file(".")).
  enablePlugins(JmhPlugin).
  settings(
    name := "scala-jmh-performance-testing",
    organization := "LansaloLtd",
    version := "1.0",
    scalaVersion := "2.12.3",
    libraryDependencies ++= Seq(
      "org.openjdk.jmh" % "jmh-generator-annprocess" % "1.21",
      "commons-codec" % "commons-codec" % "1.9",
      "com.storm-enroute" %% "scalameter" % "0.8.2",
      "org.scalatest" %% "scalatest" % "3.0.1" % Test
    ),
    testFrameworks += new TestFramework("org.scalameter.ScalaMeterFramework"),
    parallelExecution in Test := false
  )

Personally, I like to use both ScalaMeter and JMH (which is why both are present as dependencies above).

This is far from an exhaustive list of pros and cons, but a few points are worth mentioning. ScalaMeter makes it really easy to create graphs and follows the same approach as ScalaCheck, with generators that can be composed via for-comprehension; on the other hand, it is not actively maintained. JMH seems to be the preferred choice for Scala, and for some measurements it is really straightforward. At the present time, the only documentation available for JMH is via examples. Note also that we have introduced a dependency on OpenJDK (org.openjdk.jmh), but that is independent of the runtime JDK.

Targets

Purely for demonstration purposes, we need two candidates (or targets) for our benchmark. For no particular reason, I chose:

// candidate 1
def mapOnFunctionsAndList(input: String, slice: Int, funcs: List[String => Int]): List[Int] = {
    funcs.map(f => slide(input, slice).map(f).min)
}

and:

// candidate 2
def foldOnFunctionsList(input: String, slice: Int, funcs: List[String => Int]): List[Int] = {
    funcs.foldLeft(List.empty[Int])((acc, func) => {
      slide(input, slice).map(str => func(str)).min :: acc
    })
}

Basically, we start from a certain number n of functions (String => Int); then we slide over a given String, obtaining z substrings from it. To each function, 1 to n, we pass all the substrings, obtaining z Int values, of which we keep only the minimum. Therefore, if we start with a list of n functions, we end up with a list of n Int. The two implementations are equivalent and return the same results given the same inputs, but are there differences when it comes to performance?
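Both candidates rely on a slide helper whose implementation lives in the linked repository; a minimal sketch of what it plausibly looks like, assuming it just collects all contiguous substrings of length slice:

// Hypothetical sketch: the actual `slide` is in the linked repository.
// It produces all contiguous substrings of length `slice`.
def slide(input: String, slice: Int): List[String] =
  input.sliding(slice).toList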

JMH

With JMH we can pretty easily work out the throughput of each of our two candidates. First we need to define the targets of our benchmark and a State for them:

package com.lansalo.jmh

import com.lansalo.Target._
import org.openjdk.jmh.annotations._

class TargetPerformance {

  import Scopes._

  def testMapOnFunctionsAndList(state: BenchmarkState): Unit = {
    mapOnFunctionsAndList(state.title, state.slice, state.funcs)
  }

  def testFoldOnFunctionsList(state: BenchmarkState): Unit = {
    foldOnFunctionsList(state.title, state.slice, state.funcs)
  }

}

object Scopes {

  import com.lansalo.HashingFunctions.hashingFunctions

  val novelTitle = "The Persecution and Assassination of Jean-Paul Marat as Performed by the Inmates of the Asylum of Charenton Under the Direction of the Marquis de Sade"

  @State(Scope.Benchmark)
  class BenchmarkState {
    val funcs: List[String => Int] = hashingFunctions(200)
    val title: String = novelTitle
    val slice: Int = 4
  }
}
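Similarly, hashingFunctions comes from the repository; purely for illustration, a hypothetical sketch building n distinct String => Int functions might look like:

object HashingFunctions {
  // Hypothetical sketch: n distinct functions obtained by salting the
  // input with a seed before hashing. The real implementation may differ.
  def hashingFunctions(n: Int): List[String => Int] =
    (1 to n).toList.map(seed => (s: String) => (s + seed).hashCode)
}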

And then the proper benchmark (that can also be executed within an IDE like IntelliJ):

package com.lansalo.jmh.benchmark

import java.util.concurrent.TimeUnit

import com.lansalo.jmh.{Scopes, TargetPerformance}
import org.openjdk.jmh.annotations._
import org.openjdk.jmh.results.format.ResultFormatType
import org.openjdk.jmh.runner.Runner
import org.openjdk.jmh.runner.options.{Options, OptionsBuilder}

object BenchmarkRunner_ThroughputBenchmark {
  // run sbt clean jmh:compile from terminal first.
  def main(args: Array[String]): Unit = {
    val opt: Options = new OptionsBuilder().include(classOf[ThroughputBenchmark].getSimpleName)
      .resultFormat(ResultFormatType.TEXT).build
    new Runner(opt).run
  }
}

@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 30)
@Measurement(iterations = 30)
@State(Scope.Benchmark)
class ThroughputBenchmark extends TargetPerformance { // JMH requires benchmark classes to be public

  @Benchmark
  @BenchmarkMode(Array(Mode.Throughput, Mode.AverageTime))
  @Fork(value = 1)
  override def testFoldOnFunctionsList(state: Scopes.BenchmarkState): Unit = super.testFoldOnFunctionsList(state)

  @Benchmark
  @BenchmarkMode(Array(Mode.Throughput, Mode.AverageTime))
  @Fork(value = 1)
  override def testMapOnFunctionsAndList(state: Scopes.BenchmarkState): Unit = super.testMapOnFunctionsAndList(state)

}
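As an alternative to the main method above, the sbt-jmh plugin can also launch benchmarks straight from the sbt shell; something along these lines should be equivalent (the flags mirror the annotations: 30 warmup iterations, 30 measurement iterations, a single fork):

sbt "jmh:run -i 30 -wi 30 -f 1 .*ThroughputBenchmark.*"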

In this example, we measure the throughput and average time:

@BenchmarkMode(Array(Mode.Throughput, Mode.AverageTime))

in a separate JVM:

@Fork(value = 1)

We also run 30 warmup iterations (ignored for the final results) and 30 proper iterations where the actual measurement is taken:

@Warmup(iterations = 30)
@Measurement(iterations = 30)

Warmup is needed mainly because of JVM features such as just-in-time compilation, but that would take us into all the background behind microbenchmarking, which would be an entire chapter on its own. I can recommend a few readings though: Nanotrusting the Nanotime, Avoiding Benchmarking Pitfalls on the JVM, Java JVM Warmup.1
We are almost ready: run sbt clean jmh:compile from the terminal, and then we can run ThroughputBenchmark straight from IntelliJ and obtain something like:

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                                       Mode  Cnt   Score    Error  Units
ThroughputBenchmark.testFoldOnFunctionsList    thrpt   30  95.529 ±  0.940  ops/s
ThroughputBenchmark.testMapOnFunctionsAndList  thrpt   30  71.549 ±  0.734  ops/s
ThroughputBenchmark.testFoldOnFunctionsList     avgt   30   0.010 ±  0.001   s/op
ThroughputBenchmark.testMapOnFunctionsAndList   avgt   30   0.011 ±  0.001   s/op

We are not really interested in the numbers themselves (which also depend on the hardware) but more in the sort of intelligence we can gather this way. ScalaMeter offers different possibilities and customizations as well. I quite like the graph bits, and with the following code:

package com.lansalo.scalameter.benchmark

import com.lansalo.HashingFunctions.hashingFunctions
import com.lansalo.Target._

import java.io.File

import org.scalameter.{Bench, Gen}
import org.scalameter.persistence.SerializationPersistor

import scala.util.Random

object PerformanceBenchmark extends Bench.OfflineReport {

  override lazy val persistor = SerializationPersistor(new File("target/scalameter/performance/results"))

  val functionsGen: Gen[Int] = Gen.range("functionsNumber")(100, 1000, 100)
  val inputGen: Gen[Int] = Gen.range("inputLength")(20, 200, 20)

  val inputs: Gen[(String, List[String => Int])] = for {
    functionsNumber <- functionsGen
    inputLength <- inputGen
  } yield (Random.alphanumeric.take(inputLength).mkString, hashingFunctions(functionsNumber))

  performance of "map vs fold" in {
    measure method "mapOnFunctionsAndList" in {
      using(inputs) in { param =>
        mapOnFunctionsAndList(param._1, 4, param._2)
      }
    }
    measure method "foldOnFunctionsList" in {
      using(inputs) in { param =>
        foldOnFunctionsList(param._1, 4, param._2)
      }
    }
  }

}
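Since ScalaMeterFramework is registered as a test framework in build.sbt, this benchmark can be launched through sbt's test interface (and, as Bench provides a main method, it should also run directly from an IDE). For instance:

sbt "testOnly com.lansalo.scalameter.benchmark.PerformanceBenchmark"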

we can obtain the graph below:

[Image: map-fold]

Where orange is for fold and blue for map.

We can also investigate the two implementations from a memory-usage point of view. JMH provides a series of profilers that can be quite handy in these cases (in the following example we choose a HotspotMemoryProfiler):

object BenchmarkRunner_MemoryFootprint {
  // run sbt clean jmh:compile from terminal first.
  def main(args: Array[String]): Unit = {
    val opt: Options = new OptionsBuilder().include(classOf[MemoryFootprint].getSimpleName).addProfiler(classOf[HotspotMemoryProfiler])
      .resultFormat(ResultFormatType.TEXT).build
    new Runner(opt).run
  }
}
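The MemoryFootprint benchmark class referenced by the runner is not shown above; a minimal sketch, assuming it simply mirrors ThroughputBenchmark (the full version is in the linked repository):

@Warmup(iterations = 30)
@Measurement(iterations = 30)
@State(Scope.Benchmark)
class MemoryFootprint extends TargetPerformance {

  @Benchmark
  @BenchmarkMode(Array(Mode.Throughput))
  @Fork(value = 1)
  override def testFoldOnFunctionsList(state: Scopes.BenchmarkState): Unit = super.testFoldOnFunctionsList(state)

  @Benchmark
  @BenchmarkMode(Array(Mode.Throughput))
  @Fork(value = 1)
  override def testMapOnFunctionsAndList(state: Scopes.BenchmarkState): Unit = super.testMapOnFunctionsAndList(state)

}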

Running it results in a long list of counters (below is a partial output):

Benchmark                                                                                             Mode  Cnt            Score   Error  Units
MemoryFootprint.testFoldOnFunctionsList                                                              thrpt                71.335          ops/s
MemoryFootprint.testFoldOnFunctionsList:·sun.gc.collector.0.invocations                              thrpt               108.000              ?
MemoryFootprint.testFoldOnFunctionsList:·sun.gc.collector.0.lastEntryTime                            thrpt       10179095151.000              ?
MemoryFootprint.testFoldOnFunctionsList:·sun.gc.collector.0.lastExitTime                             thrpt       10178954989.000              ?
MemoryFootprint.testFoldOnFunctionsList:·sun.gc.collector.0.time                                     thrpt          75456013.000              ?
MemoryFootprint.testFoldOnFunctionsList:·sun.gc.collector.1.invocations                              thrpt                   ≈ 0              ?
MemoryFootprint.testFoldOnFunctionsList:·sun.gc.collector.1.lastEntryTime                            thrpt                   ≈ 0              ?
MemoryFootprint.testFoldOnFunctionsList:·sun.gc.collector.1.lastExitTime                             thrpt                   ≈ 0              ?
MemoryFootprint.testFoldOnFunctionsList:·sun.gc.collector.1.time                                     thrpt                   ≈ 0              ?

You can take a look here to see how to make sense of them. The following ScalaMeter example instead provides a more basic picture of heap usage:

object MemoryBenchmark extends Bench.OfflineReport {

  override lazy val persistor = SerializationPersistor(new File("target/scalameter/memoryusage/results"))

  override def measurer = new Measurer.MemoryFootprint

// rest of the code is the same as above

}

For the two fold/map implementations the resulting graph wouldn't be very interesting (as they overlap), but you can take a look at another post I wrote about value classes to get an idea.

Finally, if running the benchmark on the Oracle JDK, we can use JMH to exploit the Flight Recorder:

object BenchmarkRunner_FlightRecorder {
  // run sbt clean jmh:compile from terminal first.
  def main(args: Array[String]): Unit = {
    val opt: Options = new OptionsBuilder().include(classOf[FlightRecorderDump].getSimpleName)
      .resultFormat(ResultFormatType.TEXT).build
    new Runner(opt).run
  }
}

@Warmup(iterations = 30)
@Measurement(iterations = 30)
@State(Scope.Benchmark)
class FlightRecorderDump extends TargetPerformance { // JMH requires benchmark classes to be public

  @Benchmark
  @Fork(value = 1, jvmArgsAppend = Array("-XX:+UnlockCommercialFeatures",
      "-XX:+FlightRecorder", "-XX:+UnlockDiagnosticVMOptions", "-XX:+DebugNonSafepoints",
      "-XX:FlightRecorderOptions=defaultrecording=true,dumponexit=true,dumponexitpath=/tmp/foldOnFunctionsList.jfr"))
  override def testFoldOnFunctionsList(state: Scopes.BenchmarkState): Unit = super.testFoldOnFunctionsList(state)

  @Benchmark
  @Fork(value = 1, jvmArgsAppend = Array("-XX:+UnlockCommercialFeatures",
    "-XX:+FlightRecorder", "-XX:+UnlockDiagnosticVMOptions", "-XX:+DebugNonSafepoints",
    "-XX:FlightRecorderOptions=defaultrecording=true,dumponexit=true,dumponexitpath=/tmp/mapOnFunctionsAndList.jfr"))
  override def testMapOnFunctionsAndList(state: Scopes.BenchmarkState): Unit = super.testMapOnFunctionsAndList(state)

}

And this will generate two separate reports: /tmp/foldOnFunctionsList.jfr for the fold version and /tmp/mapOnFunctionsAndList.jfr for the map version. With JMC (Java Mission Control) we can compare a rich collection of parameters and different aspects of the two implementations. Below, just as an example, is a general overview of heap usage for the map implementation:

[Image: map-perf]

—————————————-

1. [In this post, a certain familiarity with basic concepts of benchmarking is somewhat taken for granted, as the main focus is to illustrate how JMH and ScalaMeter can be used to produce some results. Still, the interpretation of those results requires an understanding of the mentioned background.]
