<h1>Embedding Truffle Languages</h1>
<p>Kevin Menard, 2022-05-09</p>
<h2 id="introduction">Introduction</h2>
<p>The past several years of my career have been spent predominately working on <a href="https://www.graalvm.org/ruby/">TruffleRuby</a>, an implementation of the Ruby programming language that can achieve <a href="https://eregon.me/blog/2022/01/06/benchmarking-cruby-mjit-yjit-jruby-truffleruby.html">impressive execution speed</a> thanks to the <a href="https://www.graalvm.org/22.1/graalvm-as-a-platform/language-implementation-framework/">Truffle</a> language implementation framework and the <a href="https://github.com/oracle/graal/tree/master/compiler">Graal</a> JIT compiler.
Taken together, these three technologies form part of the GraalVM distribution.
The full distribution includes implementations of other languages (JavaScript, Python, R, and Java), an interpreter for LLVM bitcode (Sulong), tooling such as a profiler and debugger for both host and guest code (VisualVM), tooling to visualize decisions made by the JIT compiler (IGV), and the ability to generate native binaries of Java applications (including any of the listed language interpreters) via Native Image.
There’s more to GraalVM as well, which makes defining it and discovering all of its capabilities difficult.
In this article, I’d like to focus on two pieces of GraalVM functionality: 1) loading a Truffle interpreter into a Java application to call guest language code (e.g, Ruby) directly from Java; and 2) using Native Image to turn that Java code into a native shared library, allowing Truffle languages to be loaded and called just like any other exposed C function.</p>
<h2 id="native-image-overview">Native Image Overview</h2>
<p>GraalVM’s <a href="https://www.graalvm.org/22.1/reference-manual/native-image/">Native Image</a> tool can build native executables and native shared libraries from Java code.
By default, these binaries will have a dependency on your system’s libc and zlib implementations, but you can instruct Native Image to statically link libc and zlib if you have static versions of those libraries, leaving you with a binary that has no external dependencies.
In effect, you can use Java just as you would any other ahead-of-time (AOT) compiled language.
In contrast to C/C++, Rust, or other similar systems languages, you still have access to the Java VM facilities such as Java’s IO abstraction and garbage collection (GC).
However, the VM facilities are not provided by HotSpot, but rather a new VM written specifically for Native Image binaries called SubstrateVM.</p>
<p>As with most technology decisions, there’s a trade-off: Native Image binaries start considerably faster than running an application in the JVM, but they forego the ability to JIT the application<a href="#footnote_1"><sup>1</sup></a>.
Additionally, the SubstrateVM garbage collector that ships with GraalVM Community Edition is not quite as polished as the HotSpot one (GraalVM Enterprise Edition supports the G1 garbage collector).
The lack of a JIT does not mean there is no optimization at all, though.
The Native Image compilation process will run AOT optimization passes as it builds the image.
The enterprise version of GraalVM also supports profile-guided optimization (PGO) to help Native Image make compilation decisions that are favorable to the profiled application.
Additionally, Native Image binaries make distribution easier since you don’t need to have a JVM available in your target environment.</p>
<p>While Native Image binaries may not be the best option for long-running server applications, they open up the ability to run Java applications in environments the language was previously ill-suited for, such as Functions as a Service (FaaS), where processes must start quickly and are ephemeral.
TruffleRuby ships as a native image so it can start quickly for scripting applications, offer fast REPL start-up, and execute test suites considerably faster than the JVM-based version could.</p>
<p>In order to build a native binary while still supporting broad use of the JDK, Native Image performs an extensive closed-world analysis to figure out exactly which classes and methods your application uses and compiles only those into the binary.
Just to reiterate, the binary generated by Native Image does not include a JVM, so it can’t support functionality like dynamically loading classes from a JAR.
Your application can make use of some of the dynamic features the JVM provides, but such usages must be constrained to something that can be decided and included in the binary.
The Native Image compiler is able to <a href="https://www.graalvm.org/22.1/reference-manual/native-image/Reflection/#automatic-detection">detect and resolve some usages of reflection</a>, such as <code class="language-plaintext highlighter-rouge">Class.forName</code> and <code class="language-plaintext highlighter-rouge">Class.getDeclaredField</code> when the arguments can be reduced to a constant during static analysis (e.g., a field name supplied as a static string).
If your reflection usage is more dynamic or otherwise can’t be statically determined, you must provide a configuration file declaring what classes, fields, and methods must be available along with any necessary class/JAR files on the classpath so Native Image can build support for them in the binary.
With these two mechanisms, Native Image can handle many use cases that call for reflection or JNI access.
However, if your application allows a user to supply their own class files at runtime (e.g., a plugin-based application), please be aware that this cannot and will not work in a Native Image binary<a href="#footnote_2"><sup>2</sup></a>.</p>
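<p>To make the distinction concrete, here is a plain-Java sketch (standalone, not taken from the playground) contrasting a reflection call whose argument can be reduced to a constant during static analysis with one whose argument is only known at runtime:</p>

```java
public class ReflectionExample {
    public static void main(String[] args) throws Exception {
        // Constant argument: Native Image's static analysis can reduce this
        // to a concrete class and include it in the binary automatically.
        Class<?> known = Class.forName("java.util.ArrayList");
        System.out.println(known.getName());

        // Dynamic argument: the class name is only known at run time, so a
        // Native Image build would need a reflection configuration entry
        // covering every class this might resolve to.
        String name = args.length > 0 ? args[0] : "java.util.HashMap";
        Class<?> unknown = Class.forName(name);
        System.out.println(unknown.getName());
    }
}
```

On the JVM both calls behave identically; the difference only matters when Native Image must decide at build time what to keep.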
<h2 id="embedding-truffle-languages">Embedding Truffle Languages</h2>
<p>At their core, Truffle language interpreters are just Java applications.
Certainly, the language implementations also use non-Java code (e.g., a substantial portion of TruffleRuby is written in Ruby and some parts in C), but they all can be loaded and invoked in Java using the <a href="https://www.graalvm.org/22.0/reference-manual/embed-languages/">GraalVM Polyglot API</a>.
By using <em>libjvm</em> and the Java Native Interface (JNI) Invocation API, we can load a copy of the JVM up into a non-Java application and execute code in a Truffle language via the GraalVM Polyglot API.
But, loading an entire copy of the JVM up is rather slow and memory intensive.</p>
<p>Since a Truffle interpreter is just a Java application, it can be compiled via Native Image<a href="#footnote_3"><sup>3</sup></a>.
Moreover, Native Image can link the entirety of a Truffle interpreter into the resulting binary (executable or shared library).
Following this approach, we can generate a library to run our Ruby code that starts quickly, uses less RAM than <em>libjvm</em>, requires less disk space than a JVM distribution, <em>and</em> has an integrated JIT to optimize the code running in the interpreter.</p>
<h2 id="native-image-playground">Native Image Playground</h2>
<p>The GraalVM distribution ships with a dizzying amount of functionality.
Most of it is very well documented, but some of the documentation is either lacking or assumes background knowledge this post will not.
To help illustrate some of the techniques described here, I’ve pulled together a project called the <a href="https://github.com/nirvdrum/native-image-playground">Native Image Playground</a>, which contains several examples of using Native Image to build standalone executables and shared libraries, of loading a Truffle interpreter into another process, and of executing multiple Truffle languages (e.g., Ruby and JavaScript) within the same process.
I will refer to examples from Native Image Playground in this article.
If you wish to run the code on your own, please follow the steps outlined in the project’s <a href="https://github.com/nirvdrum/native-image-playground/blob/main/README.md">README</a> to ensure you have all the necessary prerequisites.</p>
<p>Many of the examples in the Native Image Playground compute the <a href="https://en.wikipedia.org/wiki/Haversine_formula">Haversine distance</a>: a way to measure geographic distance on the surface of a sphere.
This algorithm was chosen because it was the same one used by Chris Seaton in his <a href="https://chrisseaton.com/truffleruby/tenthings/">Top 10 Things To Do With Graal</a> post, which is a spiritual predecessor to this piece.
The algorithm is implemented with simple arithmetic and trigonometry, so we can easily evaluate the generated machine code for the operation.
As a caveat though, the algorithm implementation was taken from the <a href="https://sis.apache.org/">Apache SIS</a> project and was found to be incorrect.
Since the purpose of this post isn’t to be a reference for geospatial operations, I’ve pushed ahead with the incorrect algorithm in order to retain parity with Chris’s earlier post and because the correct implementation is more involved, complicating our performance analysis.</p>
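<p>For reference, here is a minimal plain-Java sketch of the haversine computation (the textbook spherical formula, assuming an Earth radius of 6371 km; it is not the exact Apache SIS code the playground uses):</p>

```java
public class Haversine {
    private static final double EARTH_RADIUS_KM = 6371.0;

    public static double distance(double aLat, double aLong,
                                  double bLat, double bLong) {
        double dLat = Math.toRadians(bLat - aLat);
        double dLong = Math.toRadians(bLong - aLong);

        // Haversine formula: a is the square of half the chord length
        // between the points; c is the angular distance in radians.
        double a = Math.pow(Math.sin(dLat / 2), 2)
                 + Math.cos(Math.toRadians(aLat)) * Math.cos(Math.toRadians(bLat))
                 * Math.pow(Math.sin(dLong / 2), 2);
        double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));

        return EARTH_RADIUS_KM * c;
    }

    public static void main(String[] args) {
        // London to New York City, the coordinates used throughout the playground.
        System.out.println(distance(51.507222, -0.1275, 40.7127, -74.0059));
    }
}
```

The operation is nothing but simple arithmetic and trigonometry, which is exactly why it makes a convenient subject for inspecting generated machine code.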
<h2 id="calling-methods-from-a-native-image-shared-library">Calling Methods From a Native Image Shared Library</h2>
<p>As of GraalVM 22.1.0, there are two primary mechanisms for calling a method embedded in a Native Image shared library: the <a href="https://www.graalvm.org/22.1/reference-manual/native-image/C-API/">Native Image C API</a> and the <a href="https://docs.oracle.com/en/java/javase/17/docs/specs/jni/invocation.html">JNI Invocation API</a>.
The Native Image C API is somewhat begrudgingly supported and likely to be removed in the not too distant future.
It’s an extra API specific to Native Image that the GraalVM team would like to remove in favor of the more standard JNI Invocation API.
In a Native Image binary, JNI is retargeted to work with <a href="https://medium.com/graalvm/isolates-and-compressed-references-more-flexible-and-efficient-memory-management-for-graalvm-a044cc50b67e">GraalVM Isolates</a>, the mechanism by which GraalVM supports multiple, isolated execution environments within the same process.
However, JNI performance within a Native Image is limited pending the merge of <a href="https://openjdk.java.net/projects/panama/">Project Panama</a> to the JDK.
As a result, we have two methods for calling natively compiled Java methods from a library where neither can be fully endorsed at the moment.</p>
<h3 id="native-image-c-api">Native Image C API</h3>
<p>Don’t be put off by the name “Native Image C API”.
While GraalVM makes it easy to use C to call into Native Image shared libraries, the name is more of an indication as to how the functions will be exported from the library.
You can use this API in any language with the ability to call foreign functions (e.g., using the FFI or Fiddle libraries in Ruby).</p>
<p>By default, nothing is exported from your shared library other than a function named <code class="language-plaintext highlighter-rouge">main</code> should you have a <code class="language-plaintext highlighter-rouge">public static void main</code> method somewhere in your Java code.
Otherwise, to export a Java method you must do the following:</p>
<ul>
<li>Declare the method as <code class="language-plaintext highlighter-rouge">static</code></li>
<li>Make the first argument an <code class="language-plaintext highlighter-rouge">org.graalvm.nativeimage.IsolateThread</code></li>
<li>Restrict your parameter and return types to primitive types or a type from the <code class="language-plaintext highlighter-rouge">org.graalvm.nativeimage.c.type</code> package</li>
<li>Annotate the method with the <code class="language-plaintext highlighter-rouge">org.graalvm.nativeimage.c.function.CEntryPoint</code> annotation</li>
</ul>
<p>If you look at the various <code class="language-plaintext highlighter-rouge">org.graalvm.nativeimage</code> sub-packages, you’ll find support for handling additional cases that we won’t need here, such as mapping Java interfaces to C structs.
For the Haversine distance calculations, all parameters will be doubles and the return value will be a double as well, so we won’t need any of the additional functionality that Native Image makes available.</p>
<p>Taking the NativeLibrary example from the Native Image Playground project, we have the following:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@CEntryPoint</span><span class="o">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">"distance"</span><span class="o">)</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">double</span> <span class="nf">distance</span><span class="o">(</span><span class="nc">IsolateThread</span> <span class="n">thread</span><span class="o">,</span>
<span class="kt">double</span> <span class="n">a_lat</span><span class="o">,</span> <span class="kt">double</span> <span class="n">a_long</span><span class="o">,</span>
<span class="kt">double</span> <span class="n">b_lat</span><span class="o">,</span> <span class="kt">double</span> <span class="n">b_long</span><span class="o">)</span> <span class="o">{</span>
<span class="k">return</span> <span class="nc">DistanceUtils</span><span class="o">.</span><span class="na">getHaversineDistance</span><span class="o">(</span><span class="n">a_lat</span><span class="o">,</span> <span class="n">a_long</span><span class="o">,</span> <span class="n">b_lat</span><span class="o">,</span> <span class="n">b_long</span><span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>
<div class="caption"><caption>Example 1: Haversine distance in Java exposed as C function in Native Image shared library.</caption></div>
<p>The <code class="language-plaintext highlighter-rouge">name</code> attribute in the <code class="language-plaintext highlighter-rouge">@CEntryPoint</code> annotation may be omitted, but the default name is constructed from the class and method names along with a randomly generated number to ensure uniqueness.
Naturally, since the methods are being exposed in a shared library, they must have unique names.
If you give two exposed methods the same name, the Native Image compiler will fail with a message such as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>duplicate symbol '_distance' in:
libnative-library-runner.o
ld: 1 duplicate symbol for architecture x86_64
</code></pre></div></div>
<div class="caption"><caption>Example 2: Error message building Native Image shared library with duplicate exposed function names.</caption></div>
<p>When you build the binary, Native Image will also generate some C header files for you.
If working with C or C++, you can use these header files directly.
For other languages, you can use the function declarations in the headers to set up your foreign call bindings.
The code found in Example 1 will result in the following function declaration:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">double</span> <span class="nf">distance</span><span class="p">(</span><span class="n">graal_isolatethread_t</span><span class="o">*</span><span class="p">,</span> <span class="kt">double</span><span class="p">,</span> <span class="kt">double</span><span class="p">,</span> <span class="kt">double</span><span class="p">,</span> <span class="kt">double</span><span class="p">);</span>
</code></pre></div></div>
<div class="caption"><caption>Example 3: Function declaration for the Haversine distance method exposed in the Native Image shared library.</caption></div>
<p>As you can see, the Java <code class="language-plaintext highlighter-rouge">double</code> type is mapped to the C <code class="language-plaintext highlighter-rouge">double</code> type.
The Java <code class="language-plaintext highlighter-rouge">IsolateThread</code> type is mapped to a <code class="language-plaintext highlighter-rouge">graal_isolatethread_t*</code> in C.</p>
<h4 id="working-with-isolates">Working with Isolates</h4>
<p>Every function you would like to expose in a Native Image shared library using <code class="language-plaintext highlighter-rouge">@CEntryPoint</code> must have an <code class="language-plaintext highlighter-rouge">IsolateThread</code> as its first parameter and every call to that method through the shared library must supply a Graal Isolate pointer as its first argument.
Looking at the code in Example 1, the <code class="language-plaintext highlighter-rouge">distance</code> method doesn’t do anything with the Isolate parameter.
The actual usage of the Isolate handle is managed by Native Image in the generated binary.</p>
<p>Along with the header file generated with all of the function declarations for exposed methods in the shared library, Native Image also generates a <em>graal_isolate.h</em> file with type definitions and function declarations for working with the Native Image C API.</p>
<p>The naming here might be a bit confusing.
There are Graal Isolates and Graal Isolate Threads.
When calling a function exposed in a Native Image shared library, you must supply a pointer to an Isolate Thread, and every Isolate Thread must be attached to an Isolate.
Creating an Isolate will implicitly create a primary Isolate Thread and that is what the sample projects in Native Image Playground use (i.e., none of the sample projects dig into multi-threading).
All Graal Isolates and Isolate Threads must be torn down when you’re done with them; tearing down an Isolate will also tear down its primary Isolate Thread.</p>
<p>Another way of working with Isolates is to expose your own functions in the shared library by using <code class="language-plaintext highlighter-rouge">@CEntryPoint</code> built-ins.
The Native Image Playground samples do not make extensive use of this form of resource management, but some do for completeness.
To expose these methods, you would use something like the following:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@CEntryPoint</span><span class="o">(</span><span class="n">builtin</span> <span class="o">=</span> <span class="nc">CEntryPoint</span><span class="o">.</span><span class="na">Builtin</span><span class="o">.</span><span class="na">CREATE_ISOLATE</span><span class="o">,</span> <span class="n">name</span> <span class="o">=</span> <span class="s">"create_isolate"</span><span class="o">)</span>
<span class="kd">static</span> <span class="kd">native</span> <span class="nc">IsolateThread</span> <span class="nf">createIsolate</span><span class="o">();</span>
<span class="nd">@CEntryPoint</span><span class="o">(</span><span class="n">builtin</span> <span class="o">=</span> <span class="nc">CEntryPoint</span><span class="o">.</span><span class="na">Builtin</span><span class="o">.</span><span class="na">TEAR_DOWN_ISOLATE</span><span class="o">,</span> <span class="n">name</span> <span class="o">=</span> <span class="s">"tear_down_isolate"</span><span class="o">)</span>
<span class="kd">static</span> <span class="kd">native</span> <span class="kt">int</span> <span class="nf">tearDownIsolate</span><span class="o">(</span><span class="nc">IsolateThread</span> <span class="n">thread</span><span class="o">);</span>
</code></pre></div></div>
<div class="caption"><caption>Example 4: Using @CEntryPoint built-ins to expose Graal Isolate resource management methods in the Native Image shared library with custom names.</caption></div>
<h3 id="java-native-interface-jni-invocation-api">Java Native Interface (JNI) Invocation API</h3>
<p>The preferred mechanism for invoking code in a Native Image shared library is to use the Java Native Interface (JNI) Invocation API — a standard JDK API for starting and programmatically controlling a JVM from another process.
Usage of JNI Invocation API might seem a bit odd, given a defining feature of Native Image binaries is that they do not include the JVM.
Native Image binaries do, though, include a VM to handle things like GC and thread scheduling.
This alternative VM, called the <a href="https://www.graalvm.org/22.1/reference-manual/native-image/">Substrate VM</a>, reimplements the JNI Invocation API to create Graal Isolates and Isolate Threads and adjusts the rest of the API so that JNI calls bind to the appropriate Isolate Thread <em>(see the earlier discussion on <a href="#working-with-isolates">Graal Isolates</a> if you’re unsure what that means)</em>.</p>
<p>By using the JNI Invocation API, you don’t need to learn a new Native Image-specific way to write code that drives a Java process.
However, much of JNI is essentially runtime reflection and Native Image does not allow arbitrary reflection.
In order to use JNI with a Native Image binary, you need to supply a <a href="https://www.graalvm.org/22.1/reference-manual/native-image/JNI/#reflection">JNI configuration file</a> to the <code class="language-plaintext highlighter-rouge">native-image</code> command when building your image.
Manually creating that file is tedious and error-prone.
To simplify the process, I recommend using a <a href="https://www.graalvm.org/22.1/reference-manual/native-image/Agent/">tracing agent</a> provided by GraalVM, which will record all JNI calls made at runtime and dump them out to a file.
To do so, you’ll need to temporarily swap your application over to using <em>libjvm</em>, which will allow general JNI calls.
I found it easiest to set the <code class="language-plaintext highlighter-rouge">JAVA_TOOL_OPTIONS</code> environment variable, that way I wouldn’t have to customize the <code class="language-plaintext highlighter-rouge">java</code> command in Maven.
Using the <em>jni-libjvm-polyglot</em> example from the Native Image Playground, we have:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>mvn <span class="nt">-P</span> jni-libjvm-polyglot <span class="nt">-D</span> <span class="nv">skipTests</span><span class="o">=</span><span class="nb">true </span>clean package
<span class="nv">$ </span><span class="nb">export </span><span class="nv">JAVA_TOOL_OPTIONS</span><span class="o">=</span><span class="s2">"-agentlib:native-image-agent=config-output-dir=</span><span class="nv">$PWD</span><span class="s2">/target-jni-libjvm/config-output-dir-{pid}-{datetime}/"</span>
<span class="nv">$ </span>./target-jni-libjvm/jni-runner js 51.507222 <span class="nt">-0</span>.1275 40.7127 <span class="nt">-74</span>.0059
</code></pre></div></div>
<div class="caption"><caption>Example 5: Enable the Native Image tracing agent to record JNI calls.</caption></div>
<p>In this example, we didn’t really need to embed the PID or timestamp into the generated directory name, but doing so is generally useful if you have multiple Java processes running, since they’ll all share the environment variable and would otherwise all dump their output to the same directory.
If we take a look at that directory, we’ll see the agent generated several files for us:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">ls </span>target-jni-libjvm/config-output-dir-40562-20220329T191144Z/
jni-config.json proxy-config.json resource-config.json
predefined-classes-config.json reflect-config.json serialization-config.json
</code></pre></div></div>
<div class="caption"><caption>Example 6: Configuration files generated by the Native Image tracing agent.</caption></div>
<p>The <em>jni-config.json</em> file is the one of interest.
We can pass that file to the <code class="language-plaintext highlighter-rouge">native-image</code> command using the <code class="language-plaintext highlighter-rouge">-H:JNIConfigurationFiles</code> option.
The <em>jni-native</em> profile from the Native Image Playground does precisely that.
Both the <em>jni-libjvm-polyglot</em> and <em>jni-native</em> Maven profiles from the Native Image Playground use the exact same C++ launcher application to calculate the Haversine distance using a Truffle language through its Java polyglot API.
That’s the primary draw of using the JNI Invocation API with Native Image; you don’t need to learn a new non-standard API and your code will work without modification as you switch between <em>libjvm</em> and the Native Image shared library.</p>
<h3 id="benchmarks">Benchmarks</h3>
<p>When starting this project, I was only aware of the Native Image C API, so that’s what I started with.
Between documentation, GitHub issues, and discussions with others on the <a href="https://graalvm.slack.com/join/shared_invite">GraalVM Slack</a>, I learned about the JNI support in Native Image.
But, I was also told that JNI calls would have higher overhead than using the Native Image C API until <a href="https://openjdk.java.net/projects/panama/">Project Panama</a> is finished.
This presented a conflict because ultimately I’m investigating ways to embed languages like TruffleRuby into other applications.
The choice between fast & deprecated (Native Image C API) and slower but API-stable (JNI Invocation API) is not the sort of trade-off I really wanted to make.
I haven’t been actively tracking Project Panama, but it’s not in Java 17 and GraalVM only uses Java LTS releases.
The next planned LTS release will be Java 21 and that’s targeted for Sept. 2023 — too far out to wait for this application.</p>
<p>While I’ve spoken with people who experienced significant slowdowns when migrating from the Native Image C API to the JNI Invocation API, I couldn’t find any numbers supporting their claims.
Thus, the final aspect of the Native Image Playground is to benchmark the different options for executing code in Truffle languages embedded in a process.
Whether using the Native Image C API or the JNI Invocation API, there are several different ways to call into a Truffle language, so the benchmarks include multiple approaches with each of the Native Image shared library APIs.</p>
<p>I want to reiterate that the focus of these benchmarks is on Truffle language performance.
While Truffle interpreters are written in Java and compile the same as any other Java method would, Native Image does some extra work to record Graal graphs for Truffle interpreters so those interpreters can be JIT compiled.
In contrast, a trade-off when using Native Image is that there is no JIT for arbitrary Java methods.
The GraalVM team is working on a Java interpreter called <a href="https://www.graalvm.org/22.1/reference-manual/java-on-truffle/interoperability/">Espresso</a> that will allow Java methods to JIT in Native Image binaries by running the bytecode through the Espresso interpreter, but I did not consider it for any part of the Native Image Playground.
The reason I’m calling this out specifically is because I’m not measuring the call overhead of Java methods being run in a Native Image binary.
Certainly, I need to make some Java calls to use the GraalVM Polyglot API, but what I’m really concerned with is the performance of executing guest code in a Truffle interpreter.</p>
<h4 id="methodology">Methodology</h4>
<p>For benchmarking, I’m using Google’s <a href="https://github.com/google/benchmark">benchmark</a> library in a launcher written in C++.
That is, the benchmark harness itself is not a Native Image binary.
The benchmarks were run on a Ryzen 3700X system with 64 GB ECC RAM running Ubuntu 22.04 (kernel 5.15.0-27-generic) and with CPU frequency scaling disabled.
Each benchmark was run for ~30s to allow adequate warm-up of any code that could be JIT compiled.
Since Truffle optimizes and deoptimizes based on values profiled at run-time, each benchmark was run in its own Graal Isolate to avoid any cross-benchmark JIT issues.
While it’s nearly impossible to eliminate system effects (e.g., cache line pollution), each benchmark was run three times in order to help minimize such effects.
Additionally, to help avoid differences related to benchmark execution order, benchmark results were collected using the <code class="language-plaintext highlighter-rouge">--benchmark_enable_random_interleaving=true</code> option from the Google Benchmark library.</p>
<p>For each benchmark, I present two values: 1) the mean of three execution times; and 2) the standard deviation for the three executions.
Deciding which value to present is an on-going debate in the world of computer science.
One theory holds that the minimum value represents the ideal case where system effects have not adversely impacted performance and so that should be used.
Another is that you can never run with ideal state, so average values like the mean and median represent more realistic cases.
In this situation, I picked the mean mostly because I also want to present error data and the standard deviation is a simple value to use.
Even that should be taken with a grain of salt, however, because there’s no guarantee the performance follows a normal distribution.
If you’d like to see all of the raw measurements, along with the median, mean, standard deviation, and coefficient of variation, you can <a href="https://gist.github.com/nirvdrum/73317cee72d14610dbfbb6b311ea8e31">download the results</a>.</p>
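<p>For illustration, the summary statistics discussed above can be reproduced in a few lines of plain Java (the timings here are hypothetical, not the actual benchmark data):</p>

```java
import java.util.Arrays;

public class SummaryStats {
    public static void main(String[] args) {
        double[] runs = {51.1, 51.1, 51.0}; // hypothetical timings in ns

        double mean = Arrays.stream(runs).average().getAsDouble();

        // Sample standard deviation (n - 1 denominator), as Google Benchmark reports it.
        double sumSq = Arrays.stream(runs)
                             .map(x -> (x - mean) * (x - mean))
                             .sum();
        double stddev = Math.sqrt(sumSq / (runs.length - 1));

        // Coefficient of variation: the stddev expressed as a fraction of the mean.
        double cv = stddev / mean;

        System.out.printf("mean=%.3f stddev=%.3f cv=%.4f%n", mean, stddev, cv);
    }
}
```

The coefficient of variation is what makes runs of different magnitudes comparable; a small cv (well under 1%) is a good sign the three runs are stable.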
<p>Much like the examples used to demonstrate the various ways to embed Truffle languages, the benchmarks all compute the Haversine distance.
There is a Haversine implementation in C++ intended to be something of a control value.
Likewise, there’s a Haversine implementation in Java to establish the baseline for methods compiled by Native Image.
From there, all of the other benchmarks call into a Truffle language to calculate the Haversine distance.</p>
<p>Software versions:</p>
<ul>
<li>C++ Compiler: clang++ (Ubuntu clang version 14.0.0-1ubuntu1)</li>
<li>GraalVM: 22.1.0 (based on Java 17.0.3) - Community Edition</li>
<li><a href="https://github.com/nirvdrum/native-image-playground/">Native Image Playground</a>: ceff9b6e21c6a3d55d426c7c0c2a2cf3c8f7fcbb</li>
<li><a href="https://github.com/google/benchmark">Google Benchmark</a>: 705202d22a10154ebc8bf0975e94e1a93f08ce98</li>
</ul>
<h4 id="classifications">Classifications</h4>
<p>The benchmark results are presented in four phases:</p>
<ol>
<li>Baseline performance</li>
<li>Native Image C API</li>
<li>Java Native Interface (JNI) Invocation API</li>
<li>Native Image C API vs JNI Invocation API</li>
</ol>
<p>Since I’m using Native Image to build a native binary using Java, I wanted to establish a reasonable upper-bound on performance of the generated code.
In Phase 1, I present the results of a C-based implementation of the Haversine distance.
This is a straightforward implementation compiled with <code class="language-plaintext highlighter-rouge">-O3</code> optimizations, but it does not make use of profile-guided optimization (PGO), compiler intrinsics, inline ASM, or any other manually-driven optimization.</p>
<p>Having established what Native Image performance would be in an ideal case, Phases 2 & 3 explore performance of the two primary APIs for invoking exposed functions in a Native Image shared library: the Native Image C API and the JNI Invocation API.
For each API, I benchmark a Java-based implementation of the Haversine distance, establishing a new reasonable upper-bound on performance for that API.
From there, the various ways to execute code within a Truffle interpreter are investigated.
Phases 2 & 3 explore these different approaches and the best option (* not necessarily the fastest) are used for the head-to-head comparison in Phase 4.</p>
<h5 id="baseline">Baseline</h5>
<p>The benchmark runner includes an implementation of the Haversine distance written in C++ using trigonometric functions from cmath/math.h from the C/C++ standard library as shipped with LLVM.
The Haversine distance implementation is a direct port of the Apache SIS implementation used in Java.
While I don’t doubt the algorithm could be tweaked more manually, the implementation is quite straightforward and compact.
A large component of these benchmarks is to see how well a compiler is able to optimize code.
Since the Native Image builder will perform optimizations when building its binary, the benchmark runner, including the C++ Haversine implementation, is compiled with the <code class="language-plaintext highlighter-rouge">-O3</code> optimization flag.</p>
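For reference, the formula both implementations compute can be sketched compactly in Java. This is the standard haversine formula rather than the actual Apache SIS or benchmark code; the names and structure here are illustrative:

```java
final class Haversine {
    private static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance between two (latitude, longitude) pairs, in kilometers.
    static double distance(double aLat, double aLong, double bLat, double bLong) {
        double dLat = Math.toRadians(bLat - aLat);
        double dLong = Math.toRadians(bLong - aLong);

        double h = Math.pow(Math.sin(dLat / 2), 2)
                + Math.cos(Math.toRadians(aLat)) * Math.cos(Math.toRadians(bLat))
                * Math.pow(Math.sin(dLong / 2), 2);

        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
    }
}
```

The computation is branch-free and dominated by trigonometric calls, which is part of why it makes a clean target for comparing how well different compilers optimize the same workload.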
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>---------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------
C++ 51.1 ns 51.0 ns 822699656
C++ 51.1 ns 51.1 ns 822699656
C++ 51.0 ns 51.0 ns 822699656
C++_mean 51.0 ns 51.0 ns 3
C++_median 51.1 ns 51.0 ns 3
C++_stddev 0.053 ns 0.054 ns 3
C++_cv 0.10 % 0.11 % 3
</code></pre></div></div>
<div class="caption"><caption>Figure 1: Benchmark baseline number.</caption></div>
<p>The tabular output in Fig. 1 is generated by the Google Benchmark framework.
The execution time is very stable between each of the benchmark runs, with a mean execution time of 51.0 ns.</p>
<h5 id="native-image-c-api-1">Native Image C API</h5>
<p>There are three types of benchmarks run with the Native Image C API:</p>
<ol>
<li>A pure Java implementation of the Haversine distance</li>
<li>A hard-coded Ruby implementation of Haversine distance</li>
<li>A general executor of Truffle language code, which is supplied with implementations of the Haversine distance in Ruby and JavaScript.</li>
</ol>
<p>I chose these three to help establish where overhead may lie.
I expect the pure Java version to optimize the best during AOT compilation; however, it will not JIT compile.</p>
<p>The hard-coded Ruby implementation uses the GraalVM Polyglot API to execute a predetermined code fragment with TruffleRuby.
The code fragment can be parsed ahead-of-time and the resulting function object stored directly into the Native Image shared library, avoiding the need to parse the code at runtime.
Since the requirements of that code fragment are known ahead of time, the implementation can make the exact Truffle polyglot calls needed to execute the Ruby Haversine code.
While somewhat limited, this example is illustrative of how you might embed a language like Ruby within a process to run specific workloads.</p>
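In GraalVM Polyglot API terms, the hard-coded embedding boils down to something like the following sketch. This is illustrative rather than the playground’s actual code; it requires a GraalVM runtime with TruffleRuby installed, and the Ruby Haversine body is elided:

```java
import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;

class RubyHaversineEmbedding {
    public static void main(String[] args) {
        try (Context context = Context.create("ruby")) {
            // Parse once; in the benchmark, the parsed function object is
            // snapshotted into the Native Image binary at build time.
            Value haversine = context.eval("ruby",
                "->(a_lat, a_long, b_lat, b_long) { 0.0 }"); // Haversine body elided
            // Repeated calls reuse the parsed function object.
            double distance = haversine.execute(51.5074, -0.1278, 40.7128, -74.0060).asDouble();
            System.out.println(distance);
        }
    }
}
```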
<p>The final benchmark also makes use of the GraalVM Polyglot API, but rather than hard-coding the guest code fragments into the image, the benchmark supplies them as string arguments at runtime.
As a matter of practicality, however, the exposed library function only works with code fragments that take four <code class="language-plaintext highlighter-rouge">double</code> arguments and return a <code class="language-plaintext highlighter-rouge">double</code> value.
The API call that evaluates the code fragment is unaware of that restriction, though, so the interpreter must still discover the shape of the data through runtime profiling.</p>
<p>Ideally, everything about the call would be flexible, but there’s a lot of ceremony involved in harmonizing the C and Java type systems using the Native Image C API (JNI does not have this problem).
Principally, all return values are typed as <code class="language-plaintext highlighter-rouge">org.graalvm.polyglot.Value</code> in the GraalVM Polyglot API, but the Native Image C API cannot work directly with these objects.
As a result, the return value needs to be coerced into a native type (in this case, <code class="language-plaintext highlighter-rouge">double</code>).
That’s fairly straightforward to do when the caller knows the return value should be of a specific type, but it becomes much more complicated when the caller needs to allow any return type.
Likewise, a truly general API would need to pack the four <code class="language-plaintext highlighter-rouge">double</code> coordinates into a <code class="language-plaintext highlighter-rouge">java.lang.Object[]</code> constructed in C/C++.
While it’s all doable, the effort required to make this approach truly general is so involved that I can’t believe anyone would do it in practice<a href="#footnote_4"><sup>4</sup></a>.</p>
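To make the coercion problem concrete, here is roughly the unwrapping a fully general entry point has to perform, sketched with a plain <code>Object</code> standing in for <code class="language-plaintext highlighter-rouge">org.graalvm.polyglot.Value</code> (the names here are illustrative, not the playground’s code):

```java
final class ReturnCoercion {
    // A general entry point can't know the guest return type statically, so it
    // must inspect the value at runtime and map it onto a C-compatible type.
    static double coerceToDouble(Object guestResult) {
        if (guestResult instanceof Number) {
            return ((Number) guestResult).doubleValue();
        }
        if (guestResult instanceof String) {
            return Double.parseDouble((String) guestResult);
        }
        throw new IllegalArgumentException(
            "Cannot coerce " + guestResult.getClass() + " to double");
    }
}
```

A truly general API would need a branch like this (or a handle-based scheme) for every type the guest language could return, which is where much of the ceremony comes from.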
<h6 id="results">Results</h6>
<p>The results of the two simple <code class="language-plaintext highlighter-rouge">@CEntryPoint</code> methods — <em>Benchmarks 1 & 2</em> from the previous section — are available in Fig. 2.</p>
<div id="figure_2_report" class="figure" style="width: 100%"></div>
<script>{
const report_data = {
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "Performance relative to C++",
"data": {
"values": [
{
"group": null,
"simple_name": null,
"variation": null,
"name": "C++",
"value": 51.04088737376758,
"rounded_value": 51.0,
"time_unit": "ns",
"error": 0.054249524184081375,
"rounded_error": 0.05
},
{
"group": "@CEntryPoint",
"simple_name": "Java",
"variation": null,
"name": "@CEntryPoint: Java",
"value": 114.56076473060568,
"rounded_value": 114.6,
"time_unit": "ns",
"error": 1.556988203857994,
"rounded_error": 1.56
},
{
"group": "@CEntryPoint",
"simple_name": "Ruby",
"variation": null,
"name": "@CEntryPoint: Ruby",
"value": 117.17125988361313,
"rounded_value": 117.2,
"time_unit": "ns",
"error": 1.1523951658615907,
"rounded_error": 1.15
}
]
},
"transform": [
{
"calculate": "datum.value / 51.04088737376758",
"as": "relativePerformance"
}
],
"encoding": {
"x": {
"field": "name",
"type": "nominal",
"title": "Haversine Implementation",
"sort": null,
"axis": {
"labelAngle": -45,
"labelFontSize": 11,
"labelLimit": 300,
"titlePadding": 10,
"titleFontSize": 13
}
},
"y": {
"field": "relativePerformance",
"type": "quantitative",
"title": [
"Performance relative to C++",
"(Lower is Better)"
],
"axis": {
"labelFontSize": 11,
"titlePadding": 10,
"titleFontSize": 13
}
}
},
"layer": [
{
"mark": {
"type": "bar",
"cornerRadiusTopLeft": 4,
"cornerRadiusTopRight": 4,
"color": {
"expr": "datum.name == 'C++' ? 'coral' : indexof(datum.name, 'Ruby') > 0 ? 'darkred' : 'steelblue'"
},
"tooltip": {
"expr": "datum.rounded_value + \" ± \" + datum.rounded_error + \" \" + datum.time_unit"
},
"width": 50
}
},
{
"mark": {
"type": "text",
"align": "center",
"dy": -5
},
"encoding": {
"text": {
"field": "relativePerformance",
"type": "nominal",
"format": ".3",
"title": "Execution Time (ns)"
}
}
}
],
"width": "container"
};
vegaEmbed('#figure_2_report', report_data);
}
</script>
<div class="caption"><caption>Figure 2: Benchmark results for methods exposed via @CEntryPoint (Native Image C API).</caption></div>
<p>The Java Haversine implementation executes in 115 ns, compared to the 51 ns of the C++ implementation, taking 2.25x as long to execute.
Interpreting that result requires contextualizing your application.
On the one hand, if performance is your ultimate goal, the C++ implementation is more than twice as fast as the Java one compiled with Native Image.</p>
On the other hand, the Native Image implementation includes a fully functional virtual machine with memory safety, garbage collection, platform API abstraction, and so on.
If the overall performance is within your application’s performance target, Native Image can be a compelling option for generating native binaries while benefiting from the Java ecosystem.
Either way, this example is not telling the whole story and results should not be extrapolated.
There isn’t much going on in terms of memory allocation, I/O, multi-threading, or even functionality like virtual calls and templating/generics.
I’d encourage you to run your own benchmarks flexing the functionality your application would require.</p>
<p>From here on out I’m going to use the Java implementation of the Haversine algorithm as the baseline.
I think this is a more realistic performance target for the Truffle languages.
Additionally, using Java as the baseline isolates the overhead added by the Truffle layer itself, a detail that would be muddied if the ensuing analysis compared the Truffle languages against C++.</p>
<p>Before taking a look at the performance of the various Haversine implementations invoked via the Native Image C API, we need to sort out which approach to take for making arbitrary GraalVM Polyglot API calls (see previous section for description).
Fig. 3 shows the results of executing a guest language code fragment under a variety of Truffle resource management strategies.
Results are given for both TruffleRuby and Graal.js so that the findings don’t overfit to a particular guest language.
The different strategies measured are:</p>
<ol type="a">
<li>Polyglot context reused, but code fragments always parsed at runtime <small>(No Parse Cache)</small></li>
<li>Polyglot context reused and parsed code fragments cached <small>(Thread-Safe Parse Cache)</small></li>
<li>Polyglot context reused and parsed code fragments cached <small>(Thread-Unsafe Parse Cache)</small></li>
<li>Polyglot context recreated for each call <small>(Not graphed)</small></li>
</ol>
<div id="figure_3_report" class="figure" style="width: 100%"></div>
<script>{
const report_data = {
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "Performance relative to Java",
"data": {
"values": [
{
"group": "@CEntryPoint",
"simple_name": "Polyglot (JS)",
"variation": "No Parse Cache",
"name": "@CEntryPoint: Polyglot (JS) - No Parse Cache",
"value": 3024.617444035886,
"rounded_value": 3024.6,
"time_unit": "ns",
"error": 37.96810652676648,
"rounded_error": 37.97
},
{
"group": "@CEntryPoint",
"simple_name": "Polyglot (Ruby)",
"variation": "No Parse Cache",
"name": "@CEntryPoint: Polyglot (Ruby) - No Parse Cache",
"value": 2288.7013939661983,
"rounded_value": 2288.7,
"time_unit": "ns",
"error": 16.093167409764067,
"rounded_error": 16.09
},
{
"group": "@CEntryPoint",
"simple_name": "Polyglot (JS)",
"variation": "Safe Parse Cache",
"name": "@CEntryPoint: Polyglot (JS) - Safe Parse Cache",
"value": 1613.384503270058,
"rounded_value": 1613.4,
"time_unit": "ns",
"error": 135.00741551454834,
"rounded_error": 135.01
},
{
"group": "@CEntryPoint",
"simple_name": "Polyglot (Ruby)",
"variation": "Safe Parse Cache",
"name": "@CEntryPoint: Polyglot (Ruby) - Safe Parse Cache",
"value": 1367.2829181917243,
"rounded_value": 1367.3,
"time_unit": "ns",
"error": 21.29638300062416,
"rounded_error": 21.3
},
{
"group": "@CEntryPoint",
"simple_name": "Polyglot (JS)",
"variation": "Unsafe Parse Cache",
"name": "@CEntryPoint: Polyglot (JS) - Unsafe Parse Cache",
"value": 941.472393728363,
"rounded_value": 941.5,
"time_unit": "ns",
"error": 4.082922372161627,
"rounded_error": 4.08
},
{
"group": "@CEntryPoint",
"simple_name": "Polyglot (Ruby)",
"variation": "Unsafe Parse Cache",
"name": "@CEntryPoint: Polyglot (Ruby) - Unsafe Parse Cache",
"value": 857.6401033824295,
"rounded_value": 857.6,
"time_unit": "ns",
"error": 22.996652484327118,
"rounded_error": 23.0
}
]
},
"encoding": {
"x": {
"field": "simple_name",
"type": "nominal",
"title": "Haversine Implementation",
"sort": null,
"axis": {
"labelAngle": -45,
"labelFontSize": 11,
"labelLimit": 300,
"titlePadding": 10,
"titleFontSize": 13
}
},
"xOffset": {
"field": "variation"
},
"y": {
"field": "value",
"type": "quantitative",
"title": [
"Mean Execution Time (ns)",
"(Lower is Better)"
],
"axis": {
"labelFontSize": 11,
"titlePadding": 10,
"titleFontSize": 13
}
},
"color": {
"field": "variation"
}
},
"layer": [
{
"mark": {
"type": "bar",
"cornerRadiusTopLeft": 4,
"cornerRadiusTopRight": 4,
"tooltip": {
"expr": "datum.rounded_value + \" ± \" + datum.rounded_error + \" \" + datum.time_unit"
}
}
},
{
"mark": {
"type": "text",
"align": "center",
"dy": -5
},
"encoding": {
"text": {
"field": "rounded_value",
"type": "nominal",
"title": "Execution Time (ns)"
}
}
}
],
"width": "container"
};
vegaEmbed('#figure_3_report', report_data);
}
</script>
<div class="caption"><caption>Figure 3: Benchmark results for Truffle polyglot methods exposed via @CEntryPoint (Native Image C API).</caption></div>
<p>Benchmarks <em>3a - 3c</em> execute an exposed Native Image library function that takes both a Truffle language identifier and a code fragment to execute.
These values are supplied at runtime, so there’s no ability to parse them ahead-of-time and snapshot them into the image as <em>Benchmark 2</em> could.</p>
<p>Each of these benchmarks reuses a GraalVM Polyglot context throughout all of its iterations.
<em>Benchmark 3a</em> parses the Haversine distance code fragment each time the benchmark runs.
<em>Benchmark 3b</em> uses a thread-safe parse cache, parsing the Haversine distance code fragment once and reusing the resulting Truffle function object for subsequent calls.
<em>Benchmark 3c</em> does essentially the same thing as <em>3b</em>, but does away with the overhead of a <code class="language-plaintext highlighter-rouge">ConcurrentHashMap</code>.
<em>Benchmark 3c</em> is a horrible way to use the GraalVM Polyglot API and exists only to give us a sense of the overhead of a protected parse cache.</p>
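The thread-safe variant (<em>3b</em>) boils down to a <code class="language-plaintext highlighter-rouge">ConcurrentHashMap</code> guarding the parsed functions. Here is a minimal sketch, with a generic parser standing in for the GraalVM <code class="language-plaintext highlighter-rouge">Context#parse</code> call and a single string key standing in for the (language ID, code fragment) pair:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class ParseCache<F> {
    private final ConcurrentHashMap<String, F> cache = new ConcurrentHashMap<>();
    private final Function<String, F> parser;

    ParseCache(Function<String, F> parser) {
        this.parser = parser;
    }

    // computeIfAbsent invokes the parser at most once per distinct fragment,
    // even when multiple threads request the same fragment concurrently.
    F get(String languageAndCode) {
        return cache.computeIfAbsent(languageAndCode, parser);
    }
}
```

The atomicity of <code class="language-plaintext highlighter-rouge">computeIfAbsent</code> is exactly what the thread-unsafe variant (<em>3c</em>) gives up in exchange for skipping the synchronization cost.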
<p>Of the three approaches displayed in <em>Benchmark 3</em>, I think the thread-safe parse cache (<em>3b</em>) is the one to go with.
It outperforms executing without a parse cache (<em>3a</em>) without introducing race conditions that would be very difficult to debug (<em>3c</em>).
This is the value that will be used in Phase 4 where the results of the Native Image C API will be compared to the results of the JNI Invocation API.</p>
<p>Having evaluated several GraalVM Polyglot resource management strategies and settling on the thread-safe parse cache with polyglot context reuse, we can now look at the performance of the various Native Image C API calls (see Fig. 4).
I was happy to see that the hard-coded Ruby implementation (<em>Benchmark 2</em>) runs just as fast as the Java implementation (<em>Benchmark 1</em>).
It’s a little difficult to see in the graphs, but when accounting for measurement errors, the differences are virtually non-existent: 114.6 ± 1.56 ns for the Java implementation versus 117.2 ± 1.15 ns for the Ruby one.
If you have a well-defined operation you need to run and that can be baked right into the Native Image binary, you can write your code in a guest language and not have to worry about rewriting parts of it in Java for performance.</p>
<div id="figure_4_report" class="figure" style="width: 100%"></div>
<script> {
const report_data = {
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "Performance relative to Java",
"data": {
"values": [
{
"group": "@CEntryPoint",
"simple_name": "Java",
"variation": null,
"name": "@CEntryPoint: Java",
"value": 114.56076473060568,
"rounded_value": 114.6,
"time_unit": "ns",
"error": 1.556988203857994,
"rounded_error": 1.56
},
{
"group": "@CEntryPoint",
"simple_name": "Ruby",
"variation": null,
"name": "@CEntryPoint: Ruby",
"value": 117.17125988361313,
"rounded_value": 117.2,
"time_unit": "ns",
"error": 1.1523951658615907,
"rounded_error": 1.15
},
{
"group": "@CEntryPoint",
"simple_name": "Polyglot (JS)",
"variation": "Safe Parse Cache",
"name": "@CEntryPoint: Polyglot (JS) - Safe Parse Cache",
"value": 1613.384503270058,
"rounded_value": 1613.4,
"time_unit": "ns",
"error": 135.00741551454834,
"rounded_error": 135.01
},
{
"group": "@CEntryPoint",
"simple_name": "Polyglot (Ruby)",
"variation": "Safe Parse Cache",
"name": "@CEntryPoint: Polyglot (Ruby) - Safe Parse Cache",
"value": 1367.2829181917243,
"rounded_value": 1367.3,
"time_unit": "ns",
"error": 21.29638300062416,
"rounded_error": 21.3
}
]
},
"transform": [
{
"calculate": "datum.value / 114.56076473060568",
"as": "relativePerformance"
}
],
"encoding": {
"x": {
"field": "name",
"type": "nominal",
"title": "Haversine Implementation",
"sort": null,
"axis": {
"labelAngle": -45,
"labelFontSize": 11,
"labelLimit": 300,
"titlePadding": 10,
"titleFontSize": 13
}
},
"y": {
"field": "relativePerformance",
"type": "quantitative",
"title": [
"Performance relative to @CEntryPoint: Java",
"(Lower is Better)"
],
"axis": {
"labelFontSize": 11,
"titlePadding": 10,
"titleFontSize": 13
}
}
},
"layer": [
{
"mark": {
"type": "bar",
"cornerRadiusTopLeft": 4,
"cornerRadiusTopRight": 4,
"color": {
"expr": "datum.name == '@CEntryPoint: Java' ? 'coral' : indexof(datum.name, 'Ruby') > 0 ? 'darkred' : 'steelblue'"
},
"tooltip": {
"expr": "datum.rounded_value + \" ± \" + datum.rounded_error + \" \" + datum.time_unit"
},
"width": 50
}
},
{
"mark": {
"type": "text",
"align": "center",
"dy": -5
},
"encoding": {
"text": {
"field": "relativePerformance",
"type": "nominal",
"format": ".3",
"title": "Execution Time (ns)"
}
}
}
],
"width": "container"
};
vegaEmbed('#figure_4_report', report_data);
}
</script>
<div class="caption"><caption>Figure 4: Benchmark results for Native Image C API calls.</caption></div>
<p>Unfortunately, the general use GraalVM Polyglot API calls are much slower than the Java Haversine implementation.
The polyglot Ruby call takes 12x as long to process as the hard-coded Ruby call.
This isn’t isolated to TruffleRuby, as the Graal.js calls take 14x as long as the Java implementation.
I haven’t spent any time digging into why there’s such a large performance gap, so I have no concrete suggestions on how to close it.</p>
<p>I will say that using the GraalVM Polyglot API by exposing Java methods with <code class="language-plaintext highlighter-rouge">@CEntryPoint</code> is quite awkward and probably not the best way to write polyglot code to begin with.
GraalVM also ships with a library called <em>libpolyglot</em> that exposes a more natural C API for the GraalVM Polyglot API and you can see <a href="https://github.com/nirvdrum/native-image-playground/blob/ceff9b6e21c6a3d55d426c7c0c2a2cf3c8f7fcbb/src/main/c/native-polyglot/native-polyglot.c">an example</a> of that in the Native Image Playground project.
I did not benchmark any examples using <em>libpolyglot</em>.</p>
<p>Notionally, <em>libpolyglot</em> uses the same machinery as the Java GraalVM Polyglot API, so I’d expect performance to be quite similar.
Moreover, it’s a big library that includes every Truffle native image you have installed locally (261 MB with TruffleRuby and Graal.js in GraalVM 21.3.0 on Linux) and must be built manually (i.e., you can’t install it from the GraalVM component catalog).
Due to the effort involved and minimal gain anticipated, I opted to defer a deeper analysis of <em>libpolyglot</em> performance as future work.</p>
<h5 id="jni-invocation-api">JNI Invocation API</h5>
<p>In order to easily compare results between the Native Image C API and the JNI Invocation API, the same workloads were tested with each API.
As a reminder, those benchmarks are:</p>
<ol>
<li>A pure Java implementation of the Haversine distance</li>
<li>A hard-coded Ruby implementation of Haversine distance</li>
<li>A general executor of Truffle language code, which is supplied with implementations of the Haversine distance in Ruby and JavaScript.</li>
</ol>
<p>As with the Native Image C API benchmarks, the hard-coded Ruby implementation uses the GraalVM Polyglot API to execute a predetermined code fragment with TruffleRuby.
That code fragment can be parsed ahead-of-time and the resulting function object stored directly into the Native Image shared library.
Rather than call an exposed library function, as was needed with the Native Image C API, we can call the representative Java method directly with the JNI Invocation API.
In this way, the same exact Ruby implementation can be called using two different foreign access APIs.</p>
<p>With the JNI Invocation API, we have considerably more control over reusing the Graal Isolate, GraalVM Polyglot context, and parsed guest language functions.
Consequently, the JNI benchmarks do not explore different Truffle caching strategies as we did with the Native Image C API benchmarks; we just use the most straightforward implementation, which happens to be the most performant.</p>
<h6 id="results-1">Results</h6>
<p>The results of the two simple JNI methods — Benchmarks 1 & 2 from the previous section — are available in Fig. 5.</p>
<div id="figure_5_report" class="figure" style="width: 100%"></div>
<script>{
const report_data = {
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "Performance relative to Java",
"data": {
"values": [
{
"group": null,
"simple_name": null,
"variation": null,
"name": "C++",
"value": 51.04088737376758,
"rounded_value": 51.0,
"time_unit": "ns",
"error": 0.054249524184081375,
"rounded_error": 0.05
},
{
"group": "JNI",
"simple_name": "Java",
"variation": null,
"name": "JNI: Java",
"value": 130.3617147079535,
"rounded_value": 130.4,
"time_unit": "ns",
"error": 0.3672427242629517,
"rounded_error": 0.37
},
{
"group": "JNI",
"simple_name": "Ruby",
"variation": null,
"name": "JNI: Ruby",
"value": 128.17134813563956,
"rounded_value": 128.2,
"time_unit": "ns",
"error": 0.7904849051992273,
"rounded_error": 0.79
}
]
},
"transform": [
{
"calculate": "datum.value / 51.04088737376758",
"as": "relativePerformance"
}
],
"encoding": {
"x": {
"field": "name",
"type": "nominal",
"title": "Haversine Implementation",
"sort": null,
"axis": {
"labelAngle": -45,
"labelFontSize": 11,
"labelLimit": 300,
"titlePadding": 10,
"titleFontSize": 13
}
},
"y": {
"field": "relativePerformance",
"type": "quantitative",
"title": [
"Performance relative to C++",
"(Lower is Better)"
],
"axis": {
"labelFontSize": 11,
"titlePadding": 10,
"titleFontSize": 13
}
}
},
"layer": [
{
"mark": {
"type": "bar",
"cornerRadiusTopLeft": 4,
"cornerRadiusTopRight": 4,
"color": {
"expr": "datum.name == 'C++' ? 'coral' : indexof(datum.name, 'Ruby') > 0 ? 'darkred' : 'steelblue'"
},
"tooltip": {
"expr": "datum.rounded_value + \" ± \" + datum.rounded_error + \" \" + datum.time_unit"
},
"width": 50
}
},
{
"mark": {
"type": "text",
"align": "center",
"dy": -5
},
"encoding": {
"text": {
"field": "relativePerformance",
"type": "nominal",
"format": ".3",
"title": "Execution Time (ns)"
}
}
}
],
"width": "container"
};
vegaEmbed('#figure_5_report', report_data);
}
</script>
<div class="caption"><caption>Figure 5: Benchmark results for methods invoked via the JNI Invocation API.</caption></div>
<p>As with the Native Image C API results, we start by looking at the performance of the Java implementation of the Haversine distance algorithm.
At a mean value of 130 ns, the Java implementation takes 2.5x as long to execute as the C++ implementation (51 ns).
As noted in the Native Image C API benchmark results, there’s a natural trade-off between executing the C++ version and Java version, as the latter has a supporting virtual machine.
It’s important to know that there is a performance difference and what that difference is, but it should be evaluated in the context of your functional requirements.
The Haversine distance algorithm is a computation-heavy benchmark; you should establish your own representative benchmarks if you’re trying to decide between a systems language and Native Image for a particular task.</p>
<p>Having established the performance difference between the C++ Haversine implementation and the Java implementation compiled with Native Image, I’ll be using the Java implementation as the baseline for the Truffle benchmarks.
While the C++ implementation helps establish a competitive performance target, ultimately I’m interested in embedding Truffle languages into another process.
As such, using Java as the baseline is more useful as it highlights where there’s room for improvement in Truffle interpreters running in Native Image binaries.</p>
<div id="figure_6_report" class="figure" style="width: 100%"></div>
<script>{
const report_data = {
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "Performance relative to Java",
"data": {
"values": [
{
"group": "JNI",
"simple_name": "Java",
"variation": null,
"name": "JNI: Java",
"value": 130.3617147079535,
"rounded_value": 130.4,
"time_unit": "ns",
"error": 0.3672427242629517,
"rounded_error": 0.37
},
{
"group": "JNI",
"simple_name": "Ruby",
"variation": null,
"name": "JNI: Ruby",
"value": 128.17134813563956,
"rounded_value": 128.2,
"time_unit": "ns",
"error": 0.7904849051992273,
"rounded_error": 0.79
},
{
"group": "JNI",
"simple_name": "Polyglot (JS)",
"variation": null,
"name": "JNI: Polyglot (JS)",
"value": 943.8988927162053,
"rounded_value": 943.9,
"time_unit": "ns",
"error": 42.427902635590264,
"rounded_error": 42.43
},
{
"group": "JNI",
"simple_name": "Polyglot (Ruby)",
"variation": null,
"name": "JNI: Polyglot (Ruby)",
"value": 670.0573964137641,
"rounded_value": 670.1,
"time_unit": "ns",
"error": 33.62819017333239,
"rounded_error": 33.63
}
]
},
"transform": [
{
"calculate": "datum.value / 130.3617147079535",
"as": "relativePerformance"
}
],
"encoding": {
"x": {
"field": "name",
"type": "nominal",
"title": "Haversine Implementation",
"sort": null,
"axis": {
"labelAngle": -45,
"labelFontSize": 11,
"labelLimit": 300,
"titlePadding": 10,
"titleFontSize": 13
}
},
"y": {
"field": "relativePerformance",
"type": "quantitative",
"title": [
"Performance relative to JNI: Java",
"(Lower is Better)"
],
"axis": {
"labelFontSize": 11,
"titlePadding": 10,
"titleFontSize": 13
}
}
},
"layer": [
{
"mark": {
"type": "bar",
"cornerRadiusTopLeft": 4,
"cornerRadiusTopRight": 4,
"color": {
"expr": "datum.name == 'JNI: Java' ? 'coral' : indexof(datum.name, 'Ruby') > 0 ? 'darkred' : 'steelblue'"
},
"tooltip": {
"expr": "datum.rounded_value + \" ± \" + datum.rounded_error + \" \" + datum.time_unit"
},
"width": 50
}
},
{
"mark": {
"type": "text",
"align": "center",
"dy": -5
},
"encoding": {
"text": {
"field": "relativePerformance",
"type": "nominal",
"format": ".3",
"title": "Execution Time (ns)"
}
}
}
],
"width": "container"
};
vegaEmbed('#figure_6_report', report_data);
}
</script>
<div class="caption"><caption>Figure 6: Benchmark results for Truffle polyglot methods invoked via JNI Invocation API.</caption></div>
<p>Fig. 6 shows the performance of the JNI polyglot experiments relative to the Java Haversine implementation.
As with the Native Image C API, the case where the Ruby code can be parsed and snapshotted into the Native Image binary (<em>Benchmark 2</em>) is as fast as the Java implementation: 130.4 ± 0.37 ns (Java) versus 128.2 ± 0.79 ns (Ruby).
My conclusion is the same as before: if you have a well-defined operation you need to run and can bake that right into your Native Image binary, you can write your code in a guest language and not have to worry about rewriting parts of it in Java for performance.
That’s an amazing result to me.
I suspect most people would expect the Ruby version to run substantially slower than Java and possibly expect the Java version to run substantially slower than C++.
But, here, with Native Image we can load TruffleRuby into a foreign process and run a math-heavy operation that only takes 2.5x as long as an optimized C++ version.</p>
<p>The GraalVM polyglot calls that take both a Truffle language ID and a code fragment to evaluate at runtime (<em>Benchmark 3</em>) were a fair bit slower.
Calling Ruby in this way took 5x as long to execute as the version where the Ruby code could be compiled right into the shared library.
The Graal.js result (7.2x as long as the Java implementation) helps to establish that this is not a result specific to TruffleRuby.</p>
<p>In an ideal world, there would be no steady state difference between using the GraalVM Polyglot API via JNI and hard-coding the guest language code fragment into the image.
This is an area I’d like to dig into more.
I would’ve expected the hard-coded version to have an advantage in execution speed before things have warmed up just by virtue of not needing to parse the code fragment.
However, once warmed up, both approaches should have generated the same machine code since they’re running the same code fragments with the same inputs.
I don’t know if the difference is due to JNI call overhead, issues with the JIT process, or something else.</p>
<p>I started down the path of looking at how these different invocation techniques compiled into the Native Image binary, but found it rather difficult as the compiler generates label names divorced from the Java method names.
Running a debug build helped map the labels back to their source methods, but it’s still a time-consuming process of following <code class="language-plaintext highlighter-rouge">JMP</code> and <code class="language-plaintext highlighter-rouge">CALL</code> instructions in a debugger and consulting the backtrace to see where I logically was in the application.
There’s almost certainly a better way to dump the machine code for a method compiled by Native Image.</p>
<h5 id="overall">Overall</h5>
<p>The previous benchmark sections looked at the performance of different ways to execute code compiled into a Native Image shared library from an external process, with an emphasis on executing code in a Truffle interpreter.
There are two primary invocation APIs for exposing Java methods and making them accessible via C calling conventions: the Native Image C API and the Java Native Interface (JNI) Invocation API.
Thus far we’ve looked at the relative performance difference of executing methods written in C++, Java, Ruby, and JavaScript in each of these invocation APIs.
In Fig. 7, we can now see how the invocation APIs perform relative to each other.</p>
<div id="figure_7_report" class="figure" style="width: 100%"></div>
<script>{
const report_data = {
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "@CEntryPoint vs JNI",
"data": {
"values": [
{
"group": "Native Image C API",
"simple_name": "Java",
"variation": null,
"name": "@CEntryPoint: Java",
"value": 114.56076473060568,
"rounded_value": 114.6,
"time_unit": "ns",
"error": 1.556988203857994,
"rounded_error": 1.56
},
{
"group": "Native Image C API",
"simple_name": "Ruby",
"variation": null,
"name": "@CEntryPoint: Ruby",
"value": 117.17125988361313,
"rounded_value": 117.2,
"time_unit": "ns",
"error": 1.1523951658615907,
"rounded_error": 1.15
},
{
"group": "Native Image C API",
"simple_name": "Polyglot (JS)",
"variation": "Safe Parse Cache",
"name": "@CEntryPoint: Polyglot (JS) - Safe Parse Cache",
"value": 1613.384503270058,
"rounded_value": 1613.4,
"time_unit": "ns",
"error": 135.00741551454834,
"rounded_error": 135.01
},
{
"group": "Native Image C API",
"simple_name": "Polyglot (Ruby)",
"variation": "Safe Parse Cache",
"name": "@CEntryPoint: Polyglot (Ruby) - Safe Parse Cache",
"value": 1367.2829181917243,
"rounded_value": 1367.3,
"time_unit": "ns",
"error": 21.29638300062416,
"rounded_error": 21.3
},
{
"group": "JNI",
"simple_name": "Java",
"variation": null,
"name": "JNI: Java",
"value": 130.3617147079535,
"rounded_value": 130.4,
"time_unit": "ns",
"error": 0.3672427242629517,
"rounded_error": 0.37
},
{
"group": "JNI",
"simple_name": "Ruby",
"variation": null,
"name": "JNI: Ruby",
"value": 128.17134813563956,
"rounded_value": 128.2,
"time_unit": "ns",
"error": 0.7904849051992273,
"rounded_error": 0.79
},
{
"group": "JNI",
"simple_name": "Polyglot (Ruby)",
"variation": null,
"name": "JNI: Polyglot (Ruby)",
"value": 670.0573964137641,
"rounded_value": 670.1,
"time_unit": "ns",
"error": 33.62819017333239,
"rounded_error": 33.63
},
{
"group": "JNI",
"simple_name": "Polyglot (JS)",
"variation": null,
"name": "JNI: Polyglot (JS)",
"value": 943.8988927162053,
"rounded_value": 943.9,
"time_unit": "ns",
"error": 42.427902635590264,
"rounded_error": 42.43
}
]
},
"encoding": {
"x": {
"field": "simple_name",
"type": "nominal",
"title": "Haversine Implementation",
"sort": null,
"axis": {
"labelAngle": -45,
"labelFontSize": 11,
"labelLimit": 300,
"titlePadding": 10,
"titleFontSize": 13
}
},
"xOffset": {
"field": "group"
},
"y": {
"field": "value",
"type": "quantitative",
"title": [
"Mean Execution Time (ns)",
"(Lower is Better)"
],
"axis": {
"labelFontSize": 11,
"titlePadding": 10,
"titleFontSize": 13
}
},
"color": {
"field": "group"
}
},
"layer": [
{
"mark": {
"type": "bar",
"cornerRadiusTopLeft": 4,
"cornerRadiusTopRight": 4,
"tooltip": {
"expr": "datum.rounded_value + \" ± \" + datum.rounded_error + \" \" + datum.time_unit"
}
}
},
{
"mark": {
"type": "text",
"align": "center",
"dy": -5
},
"encoding": {
"text": {
"field": "rounded_value",
"type": "nominal",
"title": "Execution Time (ns)"
}
}
}
],
"width": "container"
};
vegaEmbed('#figure_7_report', report_data);
}
</script>
<div class="caption"><caption>Figure 7: Benchmark results for Truffle polyglot methods invoked via both the Native Image C API and the JNI Invocation API.</caption></div>
<p>When calling a Java method compiled into binary without caller use of the GraalVM Polyglot API, the Native Image C API does come out ahead, although not by much.
I want to qualify that statement by making it clear I was measuring warmed up benchmarks and did not get into wrapping object handles very much; the only data passed from the benchmark harness to the shared library were <code class="language-plaintext highlighter-rouge">double</code> values and C strings.
The <code class="language-plaintext highlighter-rouge">double</code> values mapped directly to the Java primitive type, but the C strings needed to be decoded to <code class="language-plaintext highlighter-rouge">java.lang.String</code>.
If there was more data coercion or coercion of more sophisticated types, I wouldn’t be surprised to see differences in overhead between the two APIs.</p>
<p>While it was interesting to see how each API handled calling plain old Java methods, my real goal was learning how embedded Truffle interpreters performed.
When it comes to making calls using the GraalVM Polyglot API, the JNI Invocation API comes out way ahead.
Polyglot calls made with JNI were roughly twice as fast as using the Native Image C API.
This was a fortuitous outcome; the promoted invocation API is also the one that performs best for executing guest language code.</p>
<p>I suspect much of the performance difference is attributable to JNI providing a more natural and refined mechanism for managing Truffle objects.
JNI can store and work with Java types right in C++.
The Native Image C API only supports a narrow set of foreign objects and the supporting API is quite difficult to work with.
Accordingly, the JNI benchmarks can parse guest language code and store the resulting Truffle function object right in a local field, which it can then use for each iteration of the benchmark.
Whereas with the Native Image C API, I needed to create a thread-safe map of guest code to Truffle functions (to avoid repeatedly parsing the same code fragment) and that cache needed to be read from on each benchmark iteration.</p>
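<p>The caching pattern itself is independent of the polyglot API. The following is an illustrative sketch in Ruby (the actual benchmark code is Java and parses via the polyglot context; the <code class="language-plaintext highlighter-rouge">ParseCache</code> name and parser block are my own inventions for illustration): a lazily-populated map keyed by language and source, guarded by a lock so each fragment is parsed at most once.</p>

```ruby
# Illustrative sketch of a thread-safe parse cache keyed by [language, code],
# so that each guest code fragment is parsed at most once across calls.
require "monitor"

class ParseCache
  def initialize(&parser)
    @parser = parser # e.g., { |lang, code| context.eval(lang, code) }
    @cache = {}
    @lock = Monitor.new
  end

  # Returns the cached parsed function, invoking the parser on first use only.
  def fetch(language, code)
    @lock.synchronize do
      @cache[[language, code]] ||= @parser.call(language, code)
    end
  end
end
```

<p>Each benchmark iteration then calls <code class="language-plaintext highlighter-rouge">fetch</code> with the same language ID and code fragment, paying the parse cost once and the lock cost every time, which is exactly the overhead the cached benchmark variations measure.</p>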
<h5 id="lessons-learned-graalvm-polyglot-api">Lessons Learned: GraalVM Polyglot API</h5>
<p>I struggled a fair amount deciding where to put this section.
It feels somewhat buried here at the end of the benchmark presentation, but I think the benchmark results help contextualize the notes on the GraalVM Polyglot API interactions.</p>
<p>I found the GraalVM Polyglot API rather awkward to work with from C and C++.
I ran into issues working with it from both the Native Image C API and from the JNI Invocation API.
There are risks with public APIs that are difficult to use.
One is that users simply give up and move on to another project or solution.
For those that persevere, there’s a risk that they’re using the API in dangerous ways and just don’t know it.
Moreover, they can pass this incorrect knowledge off to others, exacerbating the problem.
Yet another risk is that users will look for ways to simplify the API usage and kill performance in the process.</p>
<p>I started down the path of doing the simplest thing first.
It wasn’t just laziness or ineptitude though.
The GraalVM <a href="https://www.graalvm.org/22.1/reference-manual/embed-languages/">Embedding Languages</a> reference manual uses very simplified examples throughout the whole document.
Notably, nearly every GraalVM Polyglot API example uses a <em>try-with-resources</em> statement to create a polyglot context, which is then used to initialize a Truffle language engine and ultimately execute some guest language code.
When the containing Java method exits, the context is freed.
It’s a very tidy way of doing resource management and it looks sensible.
Having not thought too deeply about it, this approach looked <em>right</em> to me.</p>
<p>I started this project working with the Native Image C API.
When you write a <code class="language-plaintext highlighter-rouge">@CEntryPoint</code> method using <em>try-with-resources</em> for managing the polyglot context, as in Example 7, you have a completely self-contained function you can call from C.
Here, too, I thought everything looked nice and tidy.
Java functions exposed with <code class="language-plaintext highlighter-rouge">@CEntryPoint</code> are supposed to be self-contained; they must be <em>static</em> and they only have access to their parameters and other <em>static</em> data.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@CEntryPoint</span><span class="o">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">"distance_polyglot_no_cache"</span><span class="o">)</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">double</span> <span class="nf">distance</span><span class="o">(</span><span class="nc">IsolateThread</span> <span class="n">thread</span><span class="o">,</span>
<span class="nc">CCharPointer</span> <span class="n">cLanguage</span><span class="o">,</span>
<span class="nc">CCharPointer</span> <span class="n">cCode</span><span class="o">,</span>
<span class="kt">double</span> <span class="n">aLat</span><span class="o">,</span> <span class="kt">double</span> <span class="n">aLong</span><span class="o">,</span>
<span class="kt">double</span> <span class="n">bLat</span><span class="o">,</span> <span class="kt">double</span> <span class="n">bLong</span><span class="o">)</span> <span class="o">{</span>
<span class="k">try</span> <span class="o">(</span><span class="nc">Context</span> <span class="n">context</span> <span class="o">=</span> <span class="nc">Context</span><span class="o">.</span><span class="na">newBuilder</span><span class="o">()</span>
<span class="o">.</span><span class="na">allowExperimentalOptions</span><span class="o">(</span><span class="kc">true</span><span class="o">)</span>
<span class="o">.</span><span class="na">option</span><span class="o">(</span><span class="s">"ruby.no-home-provided"</span><span class="o">,</span> <span class="s">"true"</span><span class="o">)</span>
<span class="o">.</span><span class="na">build</span><span class="o">())</span> <span class="o">{</span>
<span class="kd">final</span> <span class="nc">String</span> <span class="n">code</span> <span class="o">=</span> <span class="nc">CTypeConversion</span><span class="o">.</span><span class="na">toJavaString</span><span class="o">(</span><span class="n">cCode</span><span class="o">);</span>
<span class="kd">final</span> <span class="nc">String</span> <span class="n">language</span> <span class="o">=</span> <span class="nc">CTypeConversion</span><span class="o">.</span><span class="na">toJavaString</span><span class="o">(</span><span class="n">cLanguage</span><span class="o">);</span>
<span class="kt">var</span> <span class="n">function</span> <span class="o">=</span> <span class="n">context</span><span class="o">.</span><span class="na">eval</span><span class="o">(</span><span class="n">language</span><span class="o">,</span> <span class="n">code</span><span class="o">);</span>
<span class="k">return</span> <span class="n">function</span><span class="o">.</span><span class="na">execute</span><span class="o">(</span><span class="n">aLat</span><span class="o">,</span> <span class="n">aLong</span><span class="o">,</span> <span class="n">bLat</span><span class="o">,</span> <span class="n">bLong</span><span class="o">).</span><span class="na">asDouble</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Function declaration in header file generated by Native Image.</span>
<span class="kt">double</span> <span class="nf">distance_polyglot_no_cache</span><span class="p">(</span><span class="n">graal_isolatethread_t</span><span class="o">*</span><span class="p">,</span>
<span class="kt">char</span><span class="o">*</span><span class="p">,</span> <span class="kt">char</span><span class="o">*</span><span class="p">,</span> <span class="kt">double</span><span class="p">,</span> <span class="kt">double</span><span class="p">,</span> <span class="kt">double</span><span class="p">,</span> <span class="kt">double</span><span class="p">);</span>
</code></pre></div></div>
<div class="caption"><caption>Example 7: Using try-with-resources for polyglot context management in a @CEntryPoint method.</caption></div>
<p>When I finally had everything come together such that a C application could successfully execute Ruby code by calling a function written in Java that was exposed in a Native Image shared library, I was ecstatic.
The amount of technology that had to come together to make all of this happen is staggering and ten years ago I wouldn’t have thought it was possible.
However, my excitement was tempered by the abysmal execution time.
Each time I ran this function, it took hundreds of milliseconds — sometimes even approaching a full second.</p>
<p>The problem was that every time I called this function, TruffleRuby needed to bootstrap from scratch.
There’s ongoing work to make that bootstrap process faster, particularly in native images.
But, the proximate cause was the polyglot context never lasted more than a single function call.
Even if TruffleRuby bootstrapped instantaneously, my code would never have the ability to optimize in any meaningful way.
Each time the context was closed, any JIT-generated code went along with it.</p>
<p>At face value, the solution seemed simple enough: share the context across multiple calls.
However, I could not find any documentation or code samples on how to do this with the Native Image C API.
The <code class="language-plaintext highlighter-rouge">@CEntryPoint</code> functions that Native Image can expose in a shared library only support a narrow range of types for parameters and return types.
You might think that you could pass arbitrary Java objects around as <code class="language-plaintext highlighter-rouge">void *</code>, treating them as opaque handles.
However, Java is a garbage collected language and that presents problems.
If the GC were to free an object you still have a pointer to, you would have a use-after-free problem if you ever used that pointer again.
An equally bad situation is if the GC moves the object, since there would be no way to update the calling process.
To prevent the GC from processing an object, you can <em>pin</em> it with the Native Image C API.
However, this should be done sparingly and is intended to prevent an object from moving during a very narrow window; long-term pinning of an object is not recommended as it will have an adverse impact on GC.
Moreover, you won’t find documentation on pinning objects with the Native Image C API; you will be in decidedly unsupported territory.</p>
<p>There is a C-based GraalVM Polyglot API partially hidden inside GraalVM via the <em>libpolyglot</em> image (<code class="language-plaintext highlighter-rouge">gu rebuild libpolyglot</code> to install it).
With this API you <em>can</em> create a polyglot context from C, but you forfeit the nice, simple functions exposed via <code class="language-plaintext highlighter-rouge">@CEntryPoint</code>.
For instance, looking back at the C function declaration from Example 7, we can call <code class="language-plaintext highlighter-rouge">distance_polyglot_no_cache</code> with a C string and C <code class="language-plaintext highlighter-rouge">double</code> values.
The Java side of the call takes care of any necessary type coercion from C to Java types and dispatches the appropriate arguments to the polyglot function call.
The GraalVM Polyglot C API, on the other hand, requires using its own API-specific types.
Making a similar call with this API involves converting a C <code class="language-plaintext highlighter-rouge">double</code> to a type called <code class="language-plaintext highlighter-rouge">poly_value</code> (using <a href="https://github.com/oracle/graal/blob/da6a62d607552fc958ccb63f4ca1d81e1817cadc/substratevm/src/org.graalvm.polyglot.nativeapi/src/org/graalvm/polyglot/nativeapi/PolyglotNativeAPI.java#L801-L813="><code class="language-plaintext highlighter-rouge">poly_create_double</code></a>) and then packing all four values into an array for a call to the polyglot function via <a href="https://github.com/oracle/graal/blob/da6a62d607552fc958ccb63f4ca1d81e1817cadc/substratevm/src/org.graalvm.polyglot.nativeapi/src/org/graalvm/polyglot/nativeapi/PolyglotNativeAPI.java#L587-L616="><code class="language-plaintext highlighter-rouge">poly_value_execute</code></a>.</p>
<p>To the best of my knowledge, there’s no documentation for the GraalVM Polyglot C API.
You have to piece it together by realizing it’s a mirror of the Java-based GraalVM Polyglot API, which I did not know at first owing to the lack of documentation.
From there, you have to match up data types and function declarations from the API’s header files with the JavaDoc for the Java-based API.
You can see my foray into <a href="https://github.com/nirvdrum/native-image-playground/blob/ceff9b6e21c6a3d55d426c7c0c2a2cf3c8f7fcbb/src/main/c/native-polyglot/native-polyglot.c">using this API</a> in my Native Image Playground project.
While I now appreciate the API symmetry and understand the design, I still find the GraalVM Polyglot C API a bit obtuse to use and rather error-prone.</p>
<p>For me, at least, it was decidedly easier to try to find a way to share a polyglot context across multiple <code class="language-plaintext highlighter-rouge">@CEntryPoint</code> method calls.
The ugly approach I landed on, and the one explored in the benchmarks for the Native Image C API, was to <a href="https://github.com/nirvdrum/native-image-playground/blob/ceff9b6e21c6a3d55d426c7c0c2a2cf3c8f7fcbb/src/main/java/com/nirvdrum/truffleruby/NativeLibraryPolyglot.java#L15-L18=">build and store the context in a static field</a>.
For the hard-coded Ruby example, I evaluated the Haversine code snippet in a static initializer and <a href="https://github.com/nirvdrum/native-image-playground/blob/ceff9b6e21c6a3d55d426c7c0c2a2cf3c8f7fcbb/src/main/java/com/nirvdrum/truffleruby/NativeLibraryRuby.java#L13=">stored the polyglot function object in a static field</a> so the code would not need to be re-parsed each time the <code class="language-plaintext highlighter-rouge">distance_ruby</code> function was called.
For the polyglot cases, the caller supplies both the Truffle language identifier and the code to evaluate.
Since parsing the code on each call would incur overhead, I set up a parsed code cache, keyed by the language ID and code.
The cache serves another function: with the polyglot context being shared across each <code class="language-plaintext highlighter-rouge">@CEntryPoint</code> method call, evaluating a code fragment repeatedly will fail if the fragment is not idempotent.
The benchmarks explore the performance impact of using such a cache and measure the overhead of ensuring its thread-safety.
To avoid issues re-parsing the same code fragment multiple times when the cache is disabled, the fragments for the Haversine distance were all implemented as anonymous functions.<a href="#footnote_6"><sup>6</sup></a></p>
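<p>As an illustration, a self-contained anonymous-function Haversine fragment in Ruby might look like the sketch below; the exact fragments used by the benchmarks live in the Native Image Playground and may differ in detail.</p>

```ruby
# Illustrative sketch: the Haversine great-circle distance (km) between two
# coordinates, written as an anonymous function. Because nothing is bound to
# a global name, the same source can be parsed repeatedly in a shared
# polyglot context without state conflicts.
haversine = lambda do |a_lat, a_long, b_lat, b_long|
  earth_radius = 6371.0 # mean Earth radius in km
  to_rad = Math::PI / 180.0

  d_lat = (b_lat - a_lat) * to_rad
  d_long = (b_long - a_long) * to_rad

  a = Math.sin(d_lat / 2)**2 +
      Math.cos(a_lat * to_rad) * Math.cos(b_lat * to_rad) * Math.sin(d_long / 2)**2

  earth_radius * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a))
end
```

<p>Evaluating such a fragment with <code class="language-plaintext highlighter-rouge">Context#eval</code> yields a polyglot value that can be invoked with <code class="language-plaintext highlighter-rouge">execute</code>, mirroring the <code class="language-plaintext highlighter-rouge">function.execute(...)</code> call in Example 7.</p>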
<p>I don’t doubt someone more intimately familiar with the Native Image C API and the GraalVM Polyglot API could find a more optimized way of calling guest code than I’ve used in this project.
But, that circles back to the dearth of documentation and examples on how to embed Truffle interpreters in a performance-sensitive manner.
If there’s a better way to share compiled code across multiple <code class="language-plaintext highlighter-rouge">@CEntryPoint</code> method calls, I haven’t found it.</p>
<p>In contrast, the JNI Invocation API makes polyglot resource management much easier.
The design of the API allows storing and passing Java objects to and from C++.
The difficulty is that anyone using JNI to make GraalVM Polyglot API calls is going to need to map that API to a <a href="https://www.graalvm.org/22.1/reference-manual/native-image/JNI/#reflection">JNI configuration file</a> to be used when building the Native Image binary.
The Native Image Playground has <a href="https://github.com/nirvdrum/native-image-playground/blob/ceff9b6e21c6a3d55d426c7c0c2a2cf3c8f7fcbb/src/main/resources/native-jni-config.json">such a configuration file</a>, generated by the GraalVM tracing agent when the application was run using JNI against <em>libjvm</em>.</p>
<p>The file can be hand-crafted, but Java has its own <a href="https://docs.oracle.com/en/java/javase/17/docs/specs/jni/types.html#type-signatures">custom format for representing type signatures</a> and it’s easy to get a mapping wrong or overlook one.
If you get it wrong, you won’t know until you run your application and it’ll likely manifest as a segfault due to the mismapped function returning <code class="language-plaintext highlighter-rouge">nullptr</code>.
There may very well be an accompanying Java exception, but JNI does not print those by default; you must invoke yet another pair of functions to check if an exception object exists and then to print it out if so.
Every call could fail, so a robust application would have extensive error-checking.
But, that’s tedious and makes the business logic much harder to read.
For Native Image Playground I wrote an <a href="https://github.com/nirvdrum/native-image-playground/blob/ceff9b6e21c6a3d55d426c7c0c2a2cf3c8f7fcbb/src/main/cxx/jni-runner/jni-runner.cxx#L10-L19=">exception-checking macro</a> that I sprinkled around the application when I encountered a segfault and then removed after the bug was fixed.</p>
<p>Once you’ve identified what’s missing or incorrect in your JNI configuration file, you need to go and rebuild the Native Image shared library.
It’s a slow and unforgiving process.</p>
<p>When I started this project, I couldn’t find much in the way of documentation or examples for using the JNI Invocation API to call into a Native Image shared library.
That changed with the release of GraalVM 22.1.0.
Now the various Truffle languages are launched by a C++ application that uses JNI to run the interpreter either in a Graal-enabled JVM (via <em>libjvm</em>) or as a native application by calling into a Native Image shared library version of the interpreter.
The launcher doesn’t make use of the GraalVM Polyglot API, but it’s still nice seeing how the JNI Invocation API should be used to call into a Native Image shared library.</p>
<h2 id="conclusion">Conclusion</h2>
<p>This turned out to be a much larger project than I had anticipated and there’s still much left to explore.
Unfortunately, the various mechanisms for exposing Java methods in a Native Image shared library and then calling into those methods were not easy to discover.
I frequently had to dig into the Native Image source code to work things out.
To be fair, some stuff I thought was undocumented turned out in fact to be documented; I simply didn’t know which set of keywords to use to find them.
Maybe there’s even more documentation out there that I’ve yet to discover.
Be that as it may, I hope this blog post and the examples in the Native Image Playground project can help steer others in the right direction and save them some frustration.</p>
<p>I don’t have any data to back it up, but I get the sense that the predominant use case of Native Image is turning JVM-based applications into native applications.
Using Native Image in this way is much easier than using it to build a shared library.
And while there are plenty of benchmarks for Native Image applications, to the best of my knowledge no one has published comprehensive benchmarks on either the Native Image C API or the JNI Invocation API for calling functions within a Native Image shared library.
I hope the experiments and results from this blog post can help developers make an informed decision about how best to expose and call Java methods in a Native Image shared library.</p>
<p>All of the benchmarks live in my <a href="https://github.com/nirvdrum/native-image-playground">Native Image Playground</a> project.
It’s also home to self-contained examples that demonstrate everything discussed in this post.
I intend for the playground to be a testbed for other experiments in embedding Truffle interpreters.
Please see the <a href="#native-image-overview">Native Image Overview</a> section for more details on how to work with the project.</p>
<p>Based on my evaluation, I’ll be using the JNI Invocation API for all of my Truffle embedded work.
It’s the API that the GraalVM team has signaled would be the future for foreign calls into a Native Image shared library and it’s the fastest invocation API for calling into Truffle interpreters.
Unfortunately, working with the GraalVM Polyglot API with JNI is a little difficult (please see the <a href="#lessons-learned-graalvm-polyglot-api">Lessons Learned: GraalVM Polyglot API</a> section for more details).</p>
<p>I think there’s an opportunity here for the GraalVM project to remove some of the ceremony needed to call the GraalVM Polyglot API with the JNI Invocation API in a Native Image shared library.
At the simplest level, it would be a huge quality of life improvement if Native Image could handle registering the GraalVM Polyglot API for JNI usage without user involvement.
There’s really no need to make each user go through the tedious and error-prone process of constructing the JNI configuration file themselves — the polyglot API is going to be the same for everyone.</p>
<p>The next area I’d like to see improved is API ergonomics.
My goal was to execute guest language code in a Truffle interpreter from a process loading my Native Image shared library.
The Native Image C API’s advantage here is that I largely get to determine what that API looks like, and it uses a C calling convention, making it very easy to call into the library from languages with foreign function libraries.
Requiring every consumer of the shared library to learn how to use JNI is a large cognitive overhead.
I also think it’s a leaky abstraction.
Having to map all of JNI for use with a foreign function library is a massive undertaking.
Without doing so, however, there’s no real way to call my Native Image shared library in any language other than C or C++.
In my ideal world the JNI calls are just an implementation detail and instead users work with a higher level API.
I think this would be spiritually similar to the GraalVM Polyglot C API, but considerably simpler to work with.</p>
<p>To recap, in order to use a Truffle interpreter embedded in a Native Image shared library, you need to know:</p>
<ol>
<li>The <a href="https://www.graalvm.org/22.1/reference-manual/polyglot-programming/">GraalVM polyglot API</a></li>
<li>The JNI invocation API, including:
<ol>
<li>How to <a href="https://docs.oracle.com/en/java/javase/17/docs/specs/jni/types.html#type-signatures">represent type signatures</a></li>
<li>How to <a href="https://docs.oracle.com/en/java/javase/17/docs/specs/jni/functions.html#exceptions">handle Java exceptions in JNI</a> and other forms of error-handling</li>
<li>How to map JNI to your executing environment’s foreign function interface</li>
</ol>
</li>
<li>(optional) The <a href="https://www.graalvm.org/22.1/reference-manual/native-image/Agent/">GraalVM tracing agent</a> (while optional, it’s highly recommended for generating the Native Image JNI configuration file)</li>
<li>Building a Native Image binary with <a href="https://www.graalvm.org/22.1/reference-manual/native-image/JNI/#reflection">JNI configuration</a> (not difficult <em>per se</em>, but confusing errors if you skip this step)</li>
<li><a href="https://www.graalvm.org/22.1/reference-manual/embed-languages/">Resource management</a> between Graal Isolate, polyglot contexts, and polyglot engines</li>
<li>Passing “command-line” options to Graal and Truffle, and knowing which option goes with which
<ol>
<li>Graal options need to be <a href="https://github.com/nirvdrum/native-image-playground/blob/ceff9b6e21c6a3d55d426c7c0c2a2cf3c8f7fcbb/src/main/cxx/benchmark-runner/benchmark-runner.cxx#L73-L86=">supplied when creating the Graal Isolate</a></li>
<li>Truffle options need to be <a href="https://github.com/nirvdrum/native-image-playground/blob/ceff9b6e21c6a3d55d426c7c0c2a2cf3c8f7fcbb/src/main/java/com/nirvdrum/truffleruby/NativeLibraryPolyglot.java#L15-L18=">supplied when creating the polyglot context</a></li>
</ol>
</li>
<li>(optional) How to use the Ruby standard library from disk (TruffleRuby-specific)</li>
</ol>
<p>That’s a lot to absorb.
If you manage to master all of that, the technology is amazing.
I was very happy to be able to embed TruffleRuby in an application and have Ruby code execute as fast as Java and only half the speed of C++, with virtually no effort on my part to optimize it.
I look forward to exploring more of this space and seeing what else can be achieved.
I’m hopeful we can improve the developer experience and make this technology more readily accessible to those not steeped in all things GraalVM.</p>
<hr />
<p><a name="footnote_1"></a>
<sup>1</sup>
<small>
Native Image is unable to JIT Java code because there is no Java bytecode to profile in the compiled binary.
However, Truffle-based languages can JIT because Graal compiler graphs corresponding to the language’s Truffle interpreter are compiled into the image.
Getting a bit meta, there’s an implementation of a Java bytecode interpreter in Truffle called <a href="https://www.graalvm.org/22.1/reference-manual/java-on-truffle/">Espresso</a> in development.
Since Espresso uses a Truffle interpreter, it will be able to JIT and it is expected that it will be the way forward for JITting Java applications in a Native Image binary.
</small></p>
<p><a name="footnote_2"></a>
<sup>2</sup>
<small>
Generally, a compiled Native Image cannot dynamically load classes because there is no JVM in the compiled binary to do so.
With everything ahead-of-time compiled, you need all your Java class files available at native image construction time.
However, as with JIT compiling Java code, you can dynamically load classes by using the <a href="https://www.graalvm.org/22.1/reference-manual/java-on-truffle/">Espresso</a> Java bytecode interpreter.
In this case, the Espresso interpreter would be AOT compiled into your native image but your class files would be dynamically loaded and run in the interpreter at runtime much like running on the JVM.
</small></p>
<p><a name="footnote_3"></a>
<sup>3</sup>
<small>
While a Truffle interpreter can be compiled to a native binary with Native Image, an application using a Truffle language cannot as of yet.
E.g., you cannot write a CLI application in Ruby and compile the application into a native executable.
Instead, you’d use a compiled interpreter such as TruffleRuby to load and run your script.
</small></p>
<p><a name="footnote_4"></a>
<sup>4</sup>
<small>
For an approximation of the effort involved, please look at the <a href="https://github.com/nirvdrum/native-image-playground/blob/15042578609535aa37ca68be6c051be23b6c8066/src/main/c/native-polyglot/native-polyglot.c"><em>native-polyglot</em> launcher</a> implementation in the Native Image Playground.
The <em>native-polyglot</em> example uses the Native Image Polyglot API — a wrapper around the Native Image C API used for Truffle polyglot calls in C.
This approach to embedding Truffle languages in another process is not benchmarked because it’s so similar to the Native Image C API.
Additionally, the Native Image Polyglot API is deprecated and should not be used for new projects.
It exists in the Native Image Playground solely for completeness.
</small></p>
<p><a name="footnote_5"></a>
<sup>5</sup>
<small>
Truffle languages can pre-bootstrap an interpreter and snapshot that into the Native Image.
The Truffle languages from the GraalVM team do precisely that to varying degrees.
Whatever can’t be snapshotted during the Native Image building process must be executed at run time.
</small></p>
<p><a name="footnote_6"></a>
<sup>6</sup>
<small>
As a practical matter, anonymous functions allow us to evaluate the same code snippet multiple times without state conflicts.
In a shared context, all executed code fragments share the same global state.
If you were to parse and run the JavaScript snippet <code class="language-plaintext highlighter-rouge">const EARTH_RADIUS = 6371</code> each time you called the <code class="language-plaintext highlighter-rouge">@CEntryPoint</code> function with a shared context, you would get an error about attempting to redefine a constant.
There are ways to work around this, of course.
In Ruby, you can check if a constant is already defined before defining it in your code snippet.
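For instance, a guarded fragment along these lines (an illustrative sketch, not the exact benchmark code) can be evaluated repeatedly in one shared context without complaint:

```ruby
# Illustrative sketch: guard the constant definition so that re-evaluating
# this fragment in a shared context neither raises nor warns.
EARTH_RADIUS = 6371 unless defined?(EARTH_RADIUS)
```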
In our embedded examples, we could make multiple calls to the GraalVM Polyglot API, ensuring function definitions are only executed once, while function calls happen in the benchmark loop.
Using anonymous functions allowed for flexibility in how the code is evaluated; it works fine with or without a parse cache and is wholly contained (i.e., it does not require two separate calls to define and then call a function).
</small></p>
<p>Kevin Menard</p>
<hr />
<h1>Reflections on ReasonML</h1>
<p>2020-08-20 <a href="http://nirvdrum.com/2020/08/20/reflections-on-reasonml">http://nirvdrum.com/2020/08/20/reflections-on-reasonml</a></p>
<h2 id="introduction">Introduction</h2>
<p>For the past year or so I’ve been working on a side project with ReasonML.
When people hear about it, they often ask me what my thoughts are and how it’s working out, so I’ve collected that feedback here.</p>
<h2 id="why-did-i-choose-reasonml">Why Did I Choose ReasonML?</h2>
<p>I’ll start off by saying that I deliberately picked ReasonML for a personal project so I didn’t need to factor in all the reasons that a business may or may not want to adopt a new niche technology.
At the core of it, I loved Standard ML while in university and OCaml, which ReasonML is based on, scratches that itch for me.
JavaScript, on the other hand, really doesn’t appeal to me even with the ES6 additions.
It’s a perfectly fine language that gets the job done, but I just wasn’t excited to work with it on a personal project.</p>
<p>Of course, there are plenty of non-JavaScript languages with all sorts of language semantics.
I had considered Elm, amongst others, but landed on ReasonML because it looked to have excellent support for React and JavaScript interop in general.
Additionally, being backed by Facebook suggested to me that the language may have some longevity by way of a corporate booster.</p>
<p>As a secondary concern, I wanted to get a feel for the productivity trade-offs between ReasonML and TypeScript as a discussion topic for my new company.</p>
<p>ReasonML is an interesting beast in that it layers in a new JS-like syntax for OCaml.
I wasn’t a fan of it at first, but it eventually started to feel natural for writing a web application.
I think that was perhaps due to ReasonReact’s support for JSX.</p>
<h2 id="how-do-i-feel-about-that-decision">How Do I Feel About that Decision?</h2>
<p>If I had to do it over again, I would be hard-pressed to go with ReasonML.
This probably isn’t a shocking conclusion for many: JavaScript has first-class support in web browsers and languages that target JavaScript spring up and wither away with regularity.
When things are going well, ReasonML really shines and it’s a joy to work with.
Unfortunately, I hit several snags during my evaluation of the language and as a consequence my enthusiasm with the project waned.
These things happen and I expected to run into them, but I hadn’t adequately considered how demotivating they could be.</p>
<p>When I originally got to this point in the writing, I had decided I wouldn’t write this post.
I have no real interest in criticizing a project or its community, but I was encouraged to complete my writing anyway in the spirit of all feedback being good feedback.
Please try to read the rest of this in the most charitable way possible.
At best, my feelings on ReasonML are conflicted.
The ReasonML community has been nothing but warm and helpful, even when I was clearly frustrated.</p>
<h2 id="community">Community</h2>
<p>The ReasonML community is perhaps the most welcoming one I’ve participated in.
There is an active Discord server with several focused channels.
I found the discussions there informative and questions are answered fairly quickly.
It’s nice to see a group of enthusiasts willing to donate their time to help newcomers to the language.</p>
<p>Unfortunately, the discussion is now split between a Discord server and a Discourse instance.
I appreciate that Discourse makes it easier for asynchronous communication, but it’s also a siloed community.
With Discord, I can be connected to multiple servers at one time and engage in chat at my leisure, getting notifications in something that isn’t my web browser.
With Discourse, I just can’t keep up with all the various communities expecting me to sign up for yet another account.
This isn’t particular to ReasonML, but I do find it lamentable.
We’ve regressed a long way from multi-community IRC servers and mailing lists.</p>
<h2 id="documentation">Documentation</h2>
<p>ReasonML and BuckleScript both have fairly comprehensive documentation.
ReasonReact, however, has very little documentation.
Consequently, you’re left having to look at the JS docs for React, looking at the type definitions for ReasonReact, and maybe a tutorial or two online.
Things don’t always match up 1:1 and it’s just a very difficult way to get started.</p>
<p>To their credit, the ReasonReact team acknowledges this is a shortcoming, but given constrained resources are <a href="https://github.com/reasonml/reason-react/issues/473">seeking community help</a>.
I’d love to be able to help out, but writing docs for something I barely understand is unlikely to be all that helpful.
Moreover, I was looking to use ReasonML in large part for its purported productivity gains; pausing to write documentation for a big project in that ecosystem, however helpful to others, is still a distraction.</p>
<h2 id="ecosystem">Ecosystem</h2>
<p>The ReasonML ecosystem is frankly rather confusing.
ReasonML is the language, but I never installed it.
Instead, I installed BuckleScript, which packages its own version of ReasonML and it’s generally not clear what that version is.
The only way I found to tell which version of Reason I was using was to run its code formatter with a version flag.</p>
<p>I still don’t know how one goes about installing ReasonML standalone.
There’s a package called <code class="language-plaintext highlighter-rouge">reason-cli</code> that looks like it will do it, but it’s wildly out of date.
Alas, there is documentation floating around telling you to do just that, which means you’ll have a tool that won’t run many code examples and it won’t be obvious why.</p>
<p>Then there’s ReasonReact, which is a dependency you need to add to use, but part of ReasonReact also ships inside BuckleScript.
Between BuckleScript 7.0.1 and 7.1.0, a correctness change was made to ReasonReact code shipping within BuckleScript that broke several major projects in the ReasonML ecosystem.
Just to reiterate, even if you didn’t update the ReasonReact version in your package.json/yarn.lock, suddenly code that worked before stopped working.
It took over a month for this to finally settle down.
In that time, I had to run forks of both direct and transitive dependencies just to get my project working with the newer BuckleScript.
I suppose I could have waited to upgrade, but there was a bug in BuckleScript 7.0.1 whose fix, made in the 7.0.2-dev builds, only shipped in 7.1.0, as 7.0.2 was never released.
For people completely new to ReasonML, things were broken out of the box.</p>
<p>It was an unfortunate sequence of events, but variations of it have played out multiple times in the past year.
When BuckleScript 6.0.0 was released, <em>graphql_ppx</em> was broken and the maintainer of that project had stopped maintaining it.
That necessitated a fork, which in turn required dependent projects to update their dependencies to work with the new fork.
It all worked out, but hitting these issues that are largely out of your control, and with such frequency, is really demoralizing.
As of this writing, <a href="https://github.com/apollographql/reason-apollo/issues/246"><em>reason-apollo</em> wasn’t compatible with ReasonReact 0.8.0</a>.</p>
<p>It might be that ReasonML isn’t a great fit for React and GraphQL applications, in which case I just picked the wrong tool for the job.
There is, however, a lot of promising work going on with the <em>reason-relay</em> bindings and a lot of activity on improving <em>graphql_ppx</em> and <em>reason-apollo-hooks</em>.
I’m not all that interested in switching to Relay, however, so I’m sticking with Apollo for the time being.
I’ve been contemplating just using something like <a href="https://rxdb.info/">RxDB</a> and offloading the GraphQL server interaction to another library.</p>
<p>Setting compatibility issues aside, there just aren’t that many published ReasonML bindings or libraries.
The ReasonML community promotes writing bindings for just the parts of a library that you need, since it has pretty good JS interop.
Sadly, that means there’s a dearth of good bindings to look at as an example and I found the documentation a bit too high-level to be entirely practical.
BuckleScript’s interop facilities are certainly rich, but if you mess something up, the resulting failure can be incredibly difficult to work out.
It’s also evolved a lot over several major releases, so any examples you do find may well be out of date.
I think I have a pretty good handle on it, but it was a lot of effort to get to that point, and I don’t think it would have been possible at all without help from others on Discord.</p>
<p>Moreover, what bindings or libraries do exist often lack a changelog or tagged releases.
That makes it hard to tell what’s changed between releases.
It’s a problem from the top down, as ReasonML <a href="https://github.com/facebook/reason/issues/2461">hasn’t tagged a release since 2017</a>.
It leads to this situation where you need to be “in the know” to figure out what’s changing where and when.
Or, just blindly upgrade, which can lead to the aforementioned compatibility problems.
And if you pick up a new library that doesn’t work with the current ecosystem, good luck trying to find an older version that might work because you’re unlikely to get any more help than the simple version listing on NPM.</p>
<h3 id="type-definitions">Type Definitions</h3>
<p>As I previously mentioned, people are discouraged from releasing packages that are little more than bindings for existing JavaScript/Flow/TypeScript projects.
Consequently, a lot of my time is spent manually converting TypeScript definitions to ReasonML.
While straightforward once you learn how to do it, it’s slow and frustrating.
The reality is, if I just used TypeScript I could get on with writing the application logic.</p>
<p>Since the bindings take a long time to write and easily fall out of date, the community recommendation is to only map what you need.
But, then you don’t get any of the wonderful IDE support that you’d have with TypeScript, such as API discovery and full auto-complete.
You’d also have to keep the TypeScript definitions around so you can consult them every time you want to see the full API.
It’s awkward and hard to view as anything other than a waste of time.</p>
<p>There have been a few aborted attempts at automating the conversion of TypeScript to ReasonML definitions.
For simple type definitions they should map straightforwardly.
Having looked into it a bit myself, I believe one of the biggest problems is you can’t inherit or mix in record definitions in ReasonML, so things like inherited interfaces can’t be mapped easily (or well).
It might be interesting if BuckleScript had a <code class="language-plaintext highlighter-rouge">@mix-in</code> or <code class="language-plaintext highlighter-rouge">@include</code> annotation that could be applied to record fields that are of type record.
Then from ReasonML you could use nested field access like normal, but BuckleScript could then map that back to a flattened property list in JavaScript.</p>
<p>Without a tool to convert TypeScript definitions to BuckleScript, I think ReasonML will always remain a niche technology.
Building up your own types works wonderfully when building up an internal API.
But, modern web apps pull in many modules and having to write bindings for each is overwhelming.</p>
<p>Another community recommendation is to use a hybrid application, where part is written in ReasonML and part written in JavaScript/Flow/TypeScript.
While that would solve the complex type mapping problem, it comes at the cost of a more complicated project structure.
Personally, at that point I’d find it hard to justify using ReasonML if I already need to maintain a parallel TypeScript project.</p>
<h3 id="standards">Standards</h3>
<p>The ReasonML community is interesting in that it’s incredibly small, so a lot is up for grabs.
It actively encourages newcomers to participate in various ways.
However, it also has very strongly held opinions on code structure and formatting, which comes off as gatekeeping to me.
People tend to fall into two camps on this debate, but if it truly doesn’t matter, then my arbitrary choice is just as good as yours.
I’ll provide two such examples.</p>
<p>The first one is the compiled ReasonML file output is placed in the same directory as the source <em>.re</em> files.
I’ve worked with a lot of languages and systems that use code generation and in every other case the generated files are placed somewhere else, oftentimes not committed.
I believe the idea here is for incremental adoption of ReasonML in existing JavaScript projects, so you can directly modify the generated JS files if needed.
I found it just made working with the code harder.
Having both <em>src/App.re</em> and <em>src/App.bs.js</em> in the same tree gets in the way.
Tab-completion gets messed up, an IDE’s UI gets cluttered, and jumping to code doubles the number of candidates.
Changing the location is configurable, but I was discouraged from doing so.
Tools like <a href="https://parceljs.org/">Parcel</a> just silently fail if you use anything other than the defaults.</p>
<p>The second one has to do with <code class="language-plaintext highlighter-rouge">refmt</code> preferring 80 character wide lines.
I can run four terminals side-by-side with 120 characters and still have room to spare, so I generally find 80 characters to be unreasonably narrow.
This problem, however, is compounded by BuckleScript’s interop annotations.
I’ve had cases where they’ll take up ~60 characters themselves, so even relatively short, nicely formatted code is getting split over two lines.
Moreover, if a function call gets split, each argument will be placed on its own line, so a line of 85 characters suddenly turns into four lines.</p>
<p>Fortunately, the character width is controllable, but I was requested not to do that for any open source code in order not to create problems for any hypothetical contributors.
I didn’t quite understand the problem if I just added my own “script” to <em>package.json</em>, but I guess it creates problems with editors.
As a result, I’ve just opted not to open source any of my bindings.
I find the wider lines considerably easier to read and this whole project was supposed to be for fun.
If I need to give that up to participate in the open source community, it’s not really worth it to me.</p>
<h2 id="facebook">Facebook</h2>
<p>Inititally, I thought the backing of a major corporation would be a strength of ReasonML.
Essentially, if Facebook is relying on the technology I expected it would survive where other smaller community projects have died out.
However, over the course of the past year I’ve come to realize Facebook does open source a lot differently than others.
First, I find Facebook doesn’t quite interact with the community like many others.
People give my previous employer (Oracle) a lot of flack, but if you have a question about GraalVM, you can expect a timely response on Slack, GitHub, or Twitter.
Facebook seems to do a lot of work internally, quietly, and maybe eventually releases it.
The other problem I have is that Facebook takes “opinionated” to a level I haven’t really seen elsewhere.
Each of their projects I’ve tried makes design decisions for Facebook’s unique use cases and doesn’t make them configurable, instead passing them off as best practices.
If you work on large polyglot teams focusing on real-time newsfeed-like products, then their decisions make a lot of sense.
If like most of us, you don’t, you just have to learn to adapt to those design decisions.
I believe tools should adapt to the needs of the user, not the other way around.</p>
<p>That’s to say nothing of their contributor license agreement (CLA) requirement.
I don’t have an inherent problem with CLAs.
I’ve signed a few over the years, mostly for open source organizations (Apache Software Foundation and Software Freedom Conservancy, for Selenium).
I’ve signed one with Oracle to contribute to GraalVM.
I can’t say if I’ve just had a change of heart on them or if the <a href="https://code.facebook.com/cla/individual">phrasing of the Facebook one</a> is problematic, but this was the first time I felt the language was dense enough to warrant hiring a lawyer.
I have no interest in paying the fees for a lawyer in order to contribute documentation fixes for a project I’m working on on the side with no commercial value.
So, this is a situation where being open source doesn’t really gain me much.</p>
<h2 id="summary"><a name="summary">Summary</a></h2>
<p>I think ReasonML is a really interesting project with a lot of teething problems.
Just recently, <del>BuckleScript</del> ReScript<a href="#footnote_1"><sup>1</sup></a> unveiled <a href="https://reasonml.org/blog/bucklescript-8-1-new-syntax">a brand new syntax</a> that further complicates the basic question of “what is ReasonML?”.
When ReasonML works, it’s fantastic.
ReasonML’s compilation speed is ridiculously faster than TypeScript’s.
Setting up a React project is considerably less involved than using Create React App.
At the language level, you get a much richer type system than TypeScript’s.
Type-safe GraphQL queries and pattern matching over values make for a very pleasant programming environment.</p>
<p>However, I’ve found myself simply unmotivated to work on my side project.
I poke at it every couple weeks for a few hours and I invariably end up side-tracked dealing with a library compatibility issue.
While I could just stick with the set of libraries I was using six months ago and make progress with that, it’s also a bit disheartening because recent BuckleScript versions have really improved the JavaScript interop and I’d hate to give those improvements up.
Then, even when things work, I find I spend a lot of time manually translating TypeScript types to ReasonML types.</p>
<p>I think the core problem is ReasonML is in a state right now where if you can’t afford to keep up with everything going on in the ecosystem, you’re going to run into confusing problems.
The community is great and will take the time to explain what the situation is, but I shouldn’t have to be active on a Discord server in order to get anything done.
On the other hand, having a taste of what ReasonML provides when it works, I’m also very reticent to jettison the whole project and switch over to TypeScript.
I’m currently using TypeScript, Create React App, and Relay for a project at work and while it mostly works, it’s brought a whole different set of problems I don’t really want to deal with in my free time.</p>
<p>I wish I had more time to contribute to ReasonML.
I’ve been very active with many open source projects over the past two decades, so I don’t mind rolling my sleeves up and helping out.
I just simply don’t have the time to take on this large an effort.
I’m extremely grateful to the community members that have been able to dedicate time to making ReasonML better and I hope my reflections here aren’t taken as a critique of their efforts.
Building up a new language ecosystem and community is a massive undertaking largely handled by a small group of people.</p>
<p>Given Facebook’s internal usage of ReasonML, I naïvely thought it would be at the back-half of the <a href="https://en.wikipedia.org/wiki/Crossing_the_Chasm">early adopter stage</a>, maybe even early majority.
But, it feels a lot more like it’s still in the innovator stage.
That’s okay.
Every project needs to start somewhere.
If you’re comfortable with that, you can have a lot of fun working with ReasonML and helping advance the language.
If, on the other hand, you’re just looking to get something built without fighting the ecosystem, you’ll probably want to use something a bit more refined.</p>
<hr />
<p><a name="footnote_1"></a>
<sup>1</sup>
<small>
I had intended to publish this post in early August, 2020, which was after BuckleScript announced its new syntax but before it announced its <a href="https://reasonml.org/blog/bucklescript-is-rebranding">renaming to ReScript</a>.
Unfortunately, I hit some technical snags that meant this wasn’t published until several days after the rename was announced.
My initial reaction is that the rename is going to make it harder for people searching for information, as what little 3rd party content is out there will be using the old name, BuckleScript.
I don’t mean to be alarmist, but that’s how other renames I’ve seen have gone — there’s always a big thrashing period up front.
I truly hope the intention of reducing complexity comes to fruition because the ReasonML ecosystem sorely needs it.
For the time being I remain cautiously optimistic.
<br /><small><a href="#summary" style="font-style: italic;">Go back</a></small>
</small></p>Kevin Menard

A Systematic Approach to Improving TruffleRuby Performance
2017-05-30T00:00:00+00:00
http://nirvdrum.com/2017/05/30/a-systematic-approach-to-improving-truffleruby-performance
<h2 id="summary">Summary</h2>
<p>We care a lot about performance in the TruffleRuby project.
We run a set of benchmarks on every push in a variety of VM configurations and use those as proxies for system-wide issues.
The problem with benchmarks, however, is unless you’re intimately familiar with the benchmark it’s hard to tell what your reported values should be.
Looking at a graph, it’s easy to see if you’ve regressed or improved.
It’s easy to see how you compare to other implementations.
But it’s not terribly useful once you’ve leveled off.
We really need another way to analyze performance.</p>
<p>This is a bit of a lengthy post that introduces some of the tools available to Truffle language developers for performance analysis.
To help make things concrete, I walk through the end-to-end analysis of a simple Ruby method.
If you’re just curious as to how we make Ruby faster, you can certainly skip over some of the finer details.
I’ve tried to keep the post high-level enough to be interesting to those just curious about technology.
If you’re developing a language in Truffle, you should find enough information here on how to get started in analyzing performance in your own language.</p>
<h2 id="introduction">Introduction</h2>
<p>In talking with Rubyists at various venues I’ve run into a lot of people that are curious about how a runtime is implemented, which is fantastic.
Even if you’re writing in a high-level language day-to-day, it’s good to understand how things work under the covers.
It can help you build better software and provide a better mental framework for debugging issues, performance or otherwise.</p>
<p>Unfortunately, there’s really no clear starting point for learning about how a runtime is built.
In the absence of some golden path, I’m going to walk through the analysis and improvement of a single core method in Ruby.
While I did pick this method out solely for the purpose of writing this blog post, the analysis and results are not contrived and the improved version of this method is now live in TruffleRuby.</p>
<p>To get started I looked at the compiled code resulting from running the Ruby language specs from the <a href="https://github.com/ruby/spec">Ruby Spec Suite</a>.
Generally, this isn’t a use case we pay much attention to because the code is short-lived and the executed paths won’t match more conventional workloads, such as serving web applications.
But, it was a nice simple example of real world code that I thought might yield some interesting insights.
One issue that caught my attention was how we were optimizing <code class="language-plaintext highlighter-rouge">Array#empty?</code>.</p>
<p>A substantial portion of the Ruby core library in TruffleRuby is authored in Ruby.
<code class="language-plaintext highlighter-rouge">Array#empty?</code> is one such method and its implementation is straightforward:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">empty?</span>
<span class="n">size</span> <span class="o">==</span> <span class="mi">0</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Basically we have two method calls (<code class="language-plaintext highlighter-rouge">Array#size</code> and <code class="language-plaintext highlighter-rouge">Fixnum#==</code>) and a <code class="language-plaintext highlighter-rouge">Fixnum</code> literal reference.
Each of these methods is fairly simple, so let’s look at them first.</p>
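<p>To make those two dispatches concrete, here is a minimal stand-alone Ruby sketch; <code>MyArray</code> is a hypothetical stand-in for the core class, shown only because <code>a == b</code> is sugar for <code>a.==(b)</code>:</p>

```ruby
# `size == 0` is itself a method call: `a == b` is sugar for `a.==(b)`.
# MyArray is a hypothetical stand-in used to show the two dispatches.
class MyArray
  def initialize(elements)
    @elements = elements
  end

  def size
    @elements.length
  end

  def empty?
    size.==(0) # dispatch 1: #size on self; dispatch 2: #== on its result
  end
end
```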
<h2 id="tracing-compilation">Tracing Compilation</h2>
<p>We can instruct Truffle to print out trace details about its compilation by setting the Java system property <code class="language-plaintext highlighter-rouge">-Dgraal.TraceTruffleCompilation=true</code>.
We’ve simplified that a bit in TruffleRuby by way of our <a href="https://github.com/graalvm/truffleruby/blob/graal-vm-0.22/doc/contributor/workflow.md">jt.rb tool</a>.
To see the trace output for <code class="language-plaintext highlighter-rouge">Fixnum#==</code> call, I ran:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">GRAALVM_BIN</span><span class="o">=</span>path/to/graalvm-0.22/bin/java tool/jt.rb <span class="nt">--trace</span> <span class="nt">-e</span> <span class="s1">'loop { 3 == 0 }'</span>
</code></pre></div></div>
<p>That yielded the following trace information:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[truffle] opt done Fixnum#== (builtin) <opt> <split-3dce6dd8> |ASTSize 8/ 8 |Time 79( 76+3 )ms |DirectCallNodes I 0/D 0 |GraalNodes 33/ 28 |CodeSize 147 |CodeAddress 0x7f5f287b5c10 |Source (core):1
</code></pre></div></div>
<p>Restructuring this as a table, we have:</p>
<table>
<tbody>
<tr>
<td>AST Nodes</td>
<td>8</td>
</tr>
<tr>
<td>AST + Inlined Call AST Nodes</td>
<td>8</td>
</tr>
<tr>
<td>Inlined Calls</td>
<td>0</td>
</tr>
<tr>
<td>Dispatched Calls</td>
<td>0</td>
</tr>
<tr>
<td>Partial Evaluation Nodes</td>
<td>33</td>
</tr>
<tr>
<td>Graal Lowered Nodes</td>
<td>28</td>
</tr>
<tr>
<td>Code Size (bytes)</td>
<td>147</td>
</tr>
<tr>
<td>Partial Evaluation Time (ms)</td>
<td>76</td>
</tr>
<tr>
<td>Graal Compilation Time (ms)</td>
<td>3</td>
</tr>
<tr>
<td>Total Compilation Time (ms)</td>
<td>79</td>
</tr>
</tbody>
</table>
<p>As a trace statement, some of this info can be daunting at first, but once you break it down it’s really not too bad.
We don’t need to make any extra method calls in <code class="language-plaintext highlighter-rouge">Fixnum#==</code>, so there are no inlined or dispatched calls.
Because we don’t inline any calls, the number of AST nodes in total is the same as the number of AST nodes for the <code class="language-plaintext highlighter-rouge">Fixnum#==</code> method itself.
After partial evaluation (i.e., the Truffle compilation phase), we have 33 compiler nodes.
After Graal is done processing those nodes, we have 28 compiler nodes being handed off to code generation.
We can see compiling the method took 79 ms, with most of that being spent during partial evaluation (76 ms).
The resulting compiled method is 147 bytes.</p>
<p>In general, we want to reduce compilation time and code size.
If we can avoid dispatching method calls, that’s best too.
The various node counts are interesting, but mostly as descriptive statistics.
As we work on optimizing a method, we can compare current and previous counts of each of these metrics to see if we’re trending in a positive direction.</p>
<h2 id="visualizing-the-compiler">Visualizing the Compiler</h2>
<p>To gain more insight into the <code class="language-plaintext highlighter-rouge">Fixnum#==</code> method, we can also visualize the transformations of the Truffle AST to a lowered graph using a tool known as the <a href="http://ssw.jku.at/General/Staff/TW/igv.html">Ideal Graph Visualizer (IGV)</a>.
To start IGV, <a href="https://lafo.ssw.uni-linz.ac.at/pub/idealgraphvisualizer/">download a build</a>, unzip it, cd into the unzipped directory, then run <code class="language-plaintext highlighter-rouge">./bin/idealgraphvisualizer</code>.</p>
<p>Once running, we’ll feed data into IGV over the network.
You can instruct your Truffle language to dump graph data with the <code class="language-plaintext highlighter-rouge">-Dgraal.Dump=Truffle</code> system property & value.
The TruffleRuby <code class="language-plaintext highlighter-rouge">jt</code> tool makes this easier as well, by way of the <code class="language-plaintext highlighter-rouge">--igv</code> and <code class="language-plaintext highlighter-rouge">--full</code> flags.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">GRAALVM_BIN</span><span class="o">=</span>path/to/graalvm-0.22/bin/java tool/jt.rb <span class="nt">--igv</span> <span class="nt">--full</span> <span class="nt">-e</span> <span class="s1">'loop { 3 == 0 }'</span>
</code></pre></div></div>
<p>IGV will show the compiler graph after each phase of compilation.
For our purposes, the phase “After Truffle Tier” is a good starting point.
It shows the state of the compiler after partial evaluation, partial escape analysis, and some other compilation passes Truffle performs, but before Graal begins its compilation phases.
Usually this is high level enough to still resemble source structure while low level enough to indicate what can be optimized.
The compiler graph for <code class="language-plaintext highlighter-rouge">Fixnum#==</code> after Truffle has completed its partial evaluation looks like:</p>
<ul id="lightSlider" class="gallery">
<li data-thumb="/images/a-systematic-approach-to-improving-truffleruby-performance/fixnum-equals-old.png" data-src="/images/a-systematic-approach-to-improving-truffleruby-performance/fixnum-equals-old.png">
<img src="/images/a-systematic-approach-to-improving-truffleruby-performance/fixnum-equals-old.png" alt="Image One" />
</li>
<li data-thumb="/images/a-systematic-approach-to-improving-truffleruby-performance/fixnum-equals-old-args-annotated.png" data-src="/images/a-systematic-approach-to-improving-truffleruby-performance/fixnum-equals-old-args-annotated.png">
<img src="/images/a-systematic-approach-to-improving-truffleruby-performance/fixnum-equals-old-args-annotated.png" alt="Image Two" />
</li>
<li data-thumb="/images/a-systematic-approach-to-improving-truffleruby-performance/fixnum-equals-old-guard-self.png" data-src="/images/a-systematic-approach-to-improving-truffleruby-performance/fixnum-equals-old-guard-self.png">
<img src="/images/a-systematic-approach-to-improving-truffleruby-performance/fixnum-equals-old-guard-self.png" alt="Image Three" />
</li>
<li data-thumb="/images/a-systematic-approach-to-improving-truffleruby-performance/fixnum-equals-old-guard-arg.png" data-src="/images/a-systematic-approach-to-improving-truffleruby-performance/fixnum-equals-old-guard-arg.png">
<img src="/images/a-systematic-approach-to-improving-truffleruby-performance/fixnum-equals-old-guard-arg.png" alt="Image Four" />
</li>
<li data-thumb="/images/a-systematic-approach-to-improving-truffleruby-performance/fixnum-equals-old-return.png" data-src="/images/a-systematic-approach-to-improving-truffleruby-performance/fixnum-equals-old-return.png">
<img src="/images/a-systematic-approach-to-improving-truffleruby-performance/fixnum-equals-old-return.png" alt="Image Five" />
</li>
</ul>
<p>I’ve annotated various aspects of the graph, which you can see by scrolling through the image carousel.
Working from the top down to the bottom, we can trace entry of the <code class="language-plaintext highlighter-rouge">Fixnum#==</code> method through its exit.</p>
<p>First, we encounter several <code class="language-plaintext highlighter-rouge">LoadIndexed</code> nodes which simply correspond to reading an element from an array (one per read).
Every TruffleRuby method has an <a href="https://github.com/graalvm/truffleruby/blob/graal-vm-0.22/truffleruby/src/main/java/org/truffleruby/language/arguments/RubyArguments.java#L26-L32">arguments array</a> associated with it, consisting of both internal state and Ruby-level method arguments.
Consequently, every TruffleRuby method has some boilerplate to read each element out of that array so the method can execute.
The first 7 loads are for fixed fields that TruffleRuby uses to track state for the method.
Any subsequent <code class="language-plaintext highlighter-rouge">LoadIndexed</code> nodes correspond to the Ruby-level arguments for the method.
In this case, we have one argument for the <code class="language-plaintext highlighter-rouge">Fixnum#==</code> method.</p>
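<p>Sketched in Ruby, that unpacking looks roughly like the following; the constant and field names are illustrative, as the actual layout is defined in <code>RubyArguments</code>:</p>

```ruby
# Illustrative only: a TruffleRuby method first loads a fixed number of
# internal fields from its arguments array, then the Ruby-level arguments.
INTERNAL_FIELD_COUNT = 7

def unpack_arguments(frame_args)
  internal  = frame_args[0, INTERNAL_FIELD_COUNT]  # declaration frame, self, block, ...
  user_args = frame_args[INTERNAL_FIELD_COUNT..]   # arguments visible to Ruby code
  [internal, user_args]
end
```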
<p>Having run through our preamble, the remaining nodes correspond to the main functionality of the method.
You might expect a straightforward integer comparison, but it’s a bit trickier than that.
We treat <code class="language-plaintext highlighter-rouge">Fixnum</code>s as boxed <code class="language-plaintext highlighter-rouge">int</code> or <code class="language-plaintext highlighter-rouge">long</code> in TruffleRuby, depending on the range of the value.
When the <code class="language-plaintext highlighter-rouge">self</code> value is read out of the arguments array, it is read as a <code class="language-plaintext highlighter-rouge">java.lang.Object</code>.
Along the way, Truffle notices that this is really a <code class="language-plaintext highlighter-rouge">java.lang.Integer</code> and notes that fact via a <code class="language-plaintext highlighter-rouge">Pi</code> node (node 142 in the graph).
That value is then unboxed to a primitive <code class="language-plaintext highlighter-rouge">int</code> via an <code class="language-plaintext highlighter-rouge">Unbox</code> node (node 346 in the graph).
Finally, the unboxed <code class="language-plaintext highlighter-rouge">self</code> is compared to another int.</p>
<p>While the original snippet was <code class="language-plaintext highlighter-rouge">loop { 3 == 0 }</code>, you’ll note that graph shows <code class="language-plaintext highlighter-rouge">self</code> being compared to <code class="language-plaintext highlighter-rouge">3</code>, not <code class="language-plaintext highlighter-rouge">0</code> (node 350 in the graph with the constant <code class="language-plaintext highlighter-rouge">3</code><a href="#footnote_1"><sup>1</sup></a> as an input).
In TruffleRuby, we use <a href="http://graalvm.github.io/graal/truffle/javadoc/com/oracle/truffle/api/profiles/ValueProfile.html">value profiles</a> for all method arguments, including the receiver (<code class="language-plaintext highlighter-rouge">self</code>).
Value profiles are a way to speculatively optimize around the constantness of a value.
They have three basic states: uninitialized, constant, and generic.
The first time a value is seen, the state moves from uninitialized to constant.
On subsequent calls, the profile compares the passed value to the cached value.
If the values are the same it remains in its constant state, but if a new value is seen, it advances to its generic state, where it’ll remain for the life of the profile object.</p>
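<p>That state machine can be modeled in plain Ruby. The class below is purely illustrative — it is not Truffle&#8217;s actual <code class="language-plaintext highlighter-rouge">ValueProfile</code> API, which is written in Java — but it captures the uninitialized &#8594; constant &#8594; generic transitions just described:</p>

```ruby
# Illustrative model of a value profile's state machine. Not the real
# Truffle API; it only demonstrates the transitions described above.
class SketchValueProfile
  attr_reader :state

  def initialize
    @state = :uninitialized
    @cached_value = nil
  end

  # Profile a value. While in the constant state, a compiler could
  # speculatively replace uses of the value with the cached constant,
  # protected by a guard.
  def profile(value)
    case @state
    when :uninitialized
      @cached_value = value
      @state = :constant
    when :constant
      @state = :generic unless value == @cached_value
    end
    value
  end
end
```

<p>The first observed value moves the profile from uninitialized to constant; any later, different value pushes it to generic, where it remains for the life of the profile.</p>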
<p>So, what’s really happening here is <code class="language-plaintext highlighter-rouge">self</code> is being compared to the value cached in the value profile by way of a <code class="language-plaintext highlighter-rouge">FixedGuard</code>.
If <code class="language-plaintext highlighter-rouge">self</code> is not <code class="language-plaintext highlighter-rouge">3</code>, then the guard fails and deoptimization occurs, discarding the compiled method, updating the value profile to note the receiver is generic, and switching back to the interpreter to execute the rest of the <code class="language-plaintext highlighter-rouge">Fixnum#==</code> method.</p>
<p>The next step is to run through a similar process for the <code class="language-plaintext highlighter-rouge">Fixnum#==</code> argument: unbox to a primitive <code class="language-plaintext highlighter-rouge">int</code> and compare it to its cached profile value of <code class="language-plaintext highlighter-rouge">0</code>.
Again, if the guard fails then deoptimization occurs and the value profile for the argument is updated to note the argument is generic.
Note that this profile is different from the receiver profile, so value constantness is tracked independently amongst the method arguments.</p>
<p>Finally, the method returns a value, as indicated by the <code class="language-plaintext highlighter-rouge">Return</code> node.
You’ll note that <code class="language-plaintext highlighter-rouge">self</code> and the argument are never directly compared.
Instead, we <strong>always</strong> return a false value (represented as the constant integer value <code class="language-plaintext highlighter-rouge">0</code>).
We can do this because of the value profiles on the receiver and the argument.
As long as those two values are constant, the result of the comparison will always be false.
This is important because by returning a constant value, Truffle can speculatively constant-fold the results of this method anywhere it is inlined.
We can safely do this because if either value changes, the compiled method (including any method it is inlined into) is discarded and we drop back to the interpreter, which will perform the direct integer comparison again.</p>
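<p>The interplay between the folded constant and the guards can be modeled in Ruby. The <code class="language-plaintext highlighter-rouge">compile_fixnum_equal</code> helper below is hypothetical and purely illustrative: the comparison is evaluated once against the cached profile values, and every later call merely re-validates the guards.</p>

```ruby
# Sketch of speculative constant folding for Fixnum#== under value
# profiles in the constant state. Real Truffle deoptimizes back to the
# interpreter when a guard fails; here we simply raise instead.
def compile_fixnum_equal(cached_self, cached_arg)
  folded_result = (cached_self == cached_arg) # folded at "compile time"
  lambda do |actual_self, actual_arg|
    # FixedGuard equivalents: the speculation must still hold.
    raise "deoptimize" unless actual_self == cached_self
    raise "deoptimize" unless actual_arg == cached_arg
    folded_result # always the same constant, e.g. false for 3 == 0
  end
end
```

<p>For the snippet in question, the compiled lambda for <code class="language-plaintext highlighter-rouge">3 == 0</code> returns the constant <code class="language-plaintext highlighter-rouge">false</code> until either operand changes, at which point the guard trips.</p>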
<p>The graph will go through many more transformations before ultimately emitting machine code.
Making sense of the graph after each of those transformations requires a deeper understanding of Graal.
When working on a Truffle language, however, we can generally skip examining those phases.
Looking at the graph after Truffle’s partial evaluator has run gives a good idea of what the compilation looks like with method inlining and constant-folding, while retaining enough of the shape of the original language AST to trace things back to their source.
In this case, we have a sensible implementation of <code class="language-plaintext highlighter-rouge">Fixnum#==</code> and the graph has a tight, linear structure to it, which generally optimizes very well.</p>
<h2 id="method-specialization">Method Specialization</h2>
<p>We can run through the entire analysis process for the <code class="language-plaintext highlighter-rouge">Array#size</code> call as well.
Note that TruffleRuby has two implementations of this method: one for empty arrays and one for non-empty arrays.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">GRAALVM_BIN</span><span class="o">=</span>path/to/graalvm-0.22/bin/java tool/jt.rb <span class="nt">--trace</span> <span class="nt">-e</span> <span class="s1">'x = []; loop { x.size }'</span>
<span class="nv">GRAALVM_BIN</span><span class="o">=</span>path/to/graalvm-0.22/bin/java tool/jt.rb <span class="nt">--trace</span> <span class="nt">-e</span> <span class="s1">'x = [1, 2, 3]; loop { x.size }'</span>
</code></pre></div></div>
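<p>Roughly speaking, the two specializations split the method as in the following sketch. This is illustrative Ruby, not the actual Java code selected by Truffle&#8217;s DSL, and modeling the empty case as a <code class="language-plaintext highlighter-rouge">nil</code> store is an assumption for illustration:</p>

```ruby
# Sketch of Array#size split into two specializations: the empty case
# reduces to the constant 0, while the non-empty case must load the
# length from the backing store on every call.
def size_for_empty(_store)
  0 # constant-foldable: no memory access at all
end

def size_for_non_empty(store)
  store.length # a real field/array-length load
end

def array_size(store)
  # The specialization guard: which implementation runs depends on the
  # array's storage, not on the call site.
  store.nil? ? size_for_empty(store) : size_for_non_empty(store)
end
```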
<p>I’ll skip the command-line trace info for these executions and collapse their trace data together into the same table:</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Empty Array</th>
<th>Non-empty Array</th>
</tr>
</thead>
<tbody>
<tr>
<td>AST Nodes</td>
<td>5</td>
<td>5</td>
</tr>
<tr>
<td>AST + Inlined Call AST Nodes</td>
<td>5</td>
<td>5</td>
</tr>
<tr>
<td>Inlined Calls</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Dispatched Calls</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Partial Evaluation Nodes</td>
<td>34</td>
<td>40</td>
</tr>
<tr>
<td>Graal Lowered Nodes</td>
<td>16</td>
<td>82</td>
</tr>
<tr>
<td>Code Size (bytes)</td>
<td>131</td>
<td>289</td>
</tr>
<tr>
<td>Partial Evaluation Time (ms)</td>
<td>49</td>
<td>59</td>
</tr>
<tr>
<td>Graal Compilation Time (ms)</td>
<td>3</td>
<td>5</td>
</tr>
<tr>
<td>Total Compilation Time (ms)</td>
<td>52</td>
<td>64</td>
</tr>
</tbody>
</table>
<p>Comparing the two different implementations, we can see that they have the same number of AST nodes and method calls.
In fact, it’s the same exact AST for both implementations; what’s different is the structure of the receiver (i.e., the array instance).
In the empty array case, we simply always return <code class="language-plaintext highlighter-rouge">0</code>.
For the non-empty array case, we must read the array length from the backing array store.
The difference here is quite large.
The generated code for the empty array case is 45% of the size of the non-empty array case.</p>
<p>The IGV graphs for these two specializations don’t add a whole lot more to the analysis, but for completeness I’ve linked to graphs for both the <a href="/images/a-systematic-approach-to-improving-truffleruby-performance/array-size-empty-old.png" data-toggle="lightbox" data-title="Compiler Graph of Array#size (Empty Array) After Truffle Tier">empty array</a> and <a href="/images/a-systematic-approach-to-improving-truffleruby-performance/array-size-non-empty-old.png" data-toggle="lightbox" data-title="Compiler Graph of Array#size (Non-empty Array) After Truffle Tier">
non-empty array</a> cases.
The big difference between these graphs and the simpler graph for <code class="language-plaintext highlighter-rouge">Fixnum#==</code> is the introduction of Truffle’s <a href="http://chrisseaton.com/rubytruffle/pppj14-om/pppj14-om.pdf">object storage model</a>, which TruffleRuby uses to represent Ruby objects that are more complex than simple boxed primitive values.
Since the values aren’t boxed, we can’t unbox them, so the graphs also introduce the <code class="language-plaintext highlighter-rouge">InstanceOf</code> and <code class="language-plaintext highlighter-rouge">ConditionAnchor</code> compiler nodes for tracking type information.
And since we’re reading fields from objects now, you’ll see usages of the <code class="language-plaintext highlighter-rouge">LoadField</code> and <code class="language-plaintext highlighter-rouge">GuardedUnsafeLoad</code> nodes for loading data from an object.</p>
<h2 id="bringing-it-all-together">Bringing it All Together</h2>
<p>Now that we’ve seen what the <code class="language-plaintext highlighter-rouge">Fixnum#==</code> and <code class="language-plaintext highlighter-rouge">Array#size</code> methods look like on their own, we’re ready to see how they combine into <code class="language-plaintext highlighter-rouge">Array#empty?</code>.
For this analysis, I ran the Ruby language specs:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">GRAALVM_BIN</span><span class="o">=</span>path/to/graalvm-0.22/bin/java tool/jt.rb <span class="nb">test</span> <span class="nt">--graal</span> :language
</code></pre></div></div>
<p>This time, the trace details look a bit more interesting:</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>AST Nodes</td>
<td>19</td>
</tr>
<tr>
<td>AST + Inlined Call AST Nodes</td>
<td>32</td>
</tr>
<tr>
<td>Inlined Calls</td>
<td>2</td>
</tr>
<tr>
<td>Dispatched Calls</td>
<td>0</td>
</tr>
<tr>
<td>Partial Evaluation Nodes</td>
<td>112</td>
</tr>
<tr>
<td>Graal Lowered Nodes</td>
<td>48</td>
</tr>
<tr>
<td>Code Size (bytes)</td>
<td>263</td>
</tr>
<tr>
<td>Partial Evaluation Time (ms)</td>
<td>58</td>
</tr>
<tr>
<td>Graal Compilation Time (ms)</td>
<td>8</td>
</tr>
<tr>
<td>Total Compilation Time (ms)</td>
<td>66</td>
</tr>
</tbody>
</table>
<p>We can see that two methods were inlined into <code class="language-plaintext highlighter-rouge">Array#empty?</code>.
We know the only two methods being called are <code class="language-plaintext highlighter-rouge">Fixnum#==</code> and <code class="language-plaintext highlighter-rouge">Array#size</code>, but we can also figure that out by looking at the total AST node count.
We see that <code class="language-plaintext highlighter-rouge">Array#empty?</code> consists of 19 AST nodes on its own and we know <code class="language-plaintext highlighter-rouge">Fixnum#==</code> has 8 nodes and <code class="language-plaintext highlighter-rouge">Array#size</code> has 5, which added together is 32: the same value reported for the size of the AST and its inlined nodes.
The large number of partial evaluation nodes sticks out a bit, as does the code size.
When looking at the IGV graph for the <code class="language-plaintext highlighter-rouge">Array#empty?</code> method, we see something that looks a fair bit more complicated than expected:</p>
<p><img src="/images/a-systematic-approach-to-improving-truffleruby-performance/array-empty-old-clipped.png" /></p>
<p>Given we know that <code class="language-plaintext highlighter-rouge">Fixnum#==</code> and <code class="language-plaintext highlighter-rouge">Array#size</code> have relatively compact graphs, why is this graph so large?<a href="#footnote_2"><sup>2</sup></a>
When investigating these larger graphs, I usually start by looking for <code class="language-plaintext highlighter-rouge">Phi</code> nodes.
In this case, we can see one near the bottom of the graph, a few levels above the <code class="language-plaintext highlighter-rouge">Return</code> node.</p>
<p>A <code class="language-plaintext highlighter-rouge">Phi</code> node corresponds to a phi function in <a href="https://en.wikipedia.org/wiki/Static_single_assignment_form">static single assignment form</a>.
In a nutshell, they represent areas in the graph where Truffle sees multiple sources for an expression’s value, but can’t determine which one to choose.
Since it can’t determine which value to use, it can no longer constant-fold on that expression.</p>
<p>The <code class="language-plaintext highlighter-rouge">Phi</code> node structure resembles that of a multiplexer, with a control input determining which of the <em>N</em> data inputs will be output.
The first input (going left to right) is the control input and is followed by the data inputs.
In this case, we see the first data input is the constant value <code class="language-plaintext highlighter-rouge">0</code> and the second input is the result of a field load from some object.</p>
<p>The field load for the second data input looks very similar to the graph for the non-empty <code class="language-plaintext highlighter-rouge">Array#size</code>.
And we know that the empty <code class="language-plaintext highlighter-rouge">Array#size</code> can be reduced to the constant <code class="language-plaintext highlighter-rouge">0</code>.
This suggests that the reason for the <code class="language-plaintext highlighter-rouge">Phi</code> node is we’re encountering both empty & non-empty arrays and, consequently, Truffle must compile in both specializations for <code class="language-plaintext highlighter-rouge">Array#size</code>.
We can verify that’s the case by tracing back the control input to a <code class="language-plaintext highlighter-rouge">Merge</code> node, which in this case represents the common path after two branches of an <code class="language-plaintext highlighter-rouge">if</code> statement finishes executing.</p>
<p>In all of the graphs up until now, I’ve removed frame state nodes in IGV since they’re often just clutter.
By enabling them, however, you can get additional context that might help in your analysis.
The following excerpt of the primary graph, this time with frame states enabled, shows that <code class="language-plaintext highlighter-rouge">Array#size</code> has indeed become polymorphic, meaning multiple specializations of the method are active in this compilation unit.</p>
<p><a href="/images/a-systematic-approach-to-improving-truffleruby-performance/array-empty-old-interesting.png" data-toggle="lightbox" data-title="Compiler Graph of Array#empty? After Truffle Tier">
<img src="/images/a-systematic-approach-to-improving-truffleruby-performance/array-empty-old-interesting.png" />
</a></p>
<p>Now that we have a likely culprit, what can we do about it?
Answering that question requires understanding a little more about how Truffle optimizes code.</p>
<h2 id="method-splitting">Method Splitting</h2>
<p>At this point in the analysis, we’re blaming <code class="language-plaintext highlighter-rouge">Array#size</code> being polymorphic for the complexity in the <code class="language-plaintext highlighter-rouge">Array#empty?</code> compiled method.
<code class="language-plaintext highlighter-rouge">Array#size</code> being polymorphic isn’t a bad thing in and of itself, but it looks like the two different <code class="language-plaintext highlighter-rouge">Array#size</code> specializations prevent <code class="language-plaintext highlighter-rouge">Array#empty?</code> from doing an optimal job.</p>
<p>A naive solution is to discard the specialization for empty arrays.
The specialization for non-empty arrays will handle the empty array case just fine, but it will incur additional overhead and can’t reduce to a constant value.
However, eliminating the empty array specialization works against Truffle.
With Truffle, we want to break down methods into their constituent cases.</p>
<p>Typically a call site for a method isn’t going to exercise every branch in that method.
By breaking a method down into its various cases and specializing for them, we can improve the quality of compiled code markedly for common cases.
To help us keep call sites monomorphic, Truffle can “split<a href="#footnote_3"><sup>3</sup></a>” a method.
With method splitting, a clone of the uninitialized graph corresponding to a method is made and inserted at a particular call site.
That way if two different pieces of code are calling <code class="language-plaintext highlighter-rouge">Array#size</code> and one always has an empty array and the other always a non-empty array, they each have their own copy of <code class="language-plaintext highlighter-rouge">Array#size</code> optimized for their respective case.</p>
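<p>Splitting can be modeled with a hypothetical per-call-site clone (again illustrative Ruby, not Truffle&#8217;s actual splitting machinery):</p>

```ruby
# Model of method splitting: each call site owns its own clone of the
# callee's specialization state, so a site that only ever sees empty
# arrays stays monomorphic even if another site sees non-empty arrays.
class SplitCallSite
  def initialize
    @active_specializations = [] # an uninitialized clone of the callee
  end

  def call(array)
    kind = array.empty? ? :empty : :non_empty
    @active_specializations << kind unless @active_specializations.include?(kind)
    array.size
  end

  def polymorphic?
    @active_specializations.size > 1
  end
end
```

<p>Two independent sites each stay monomorphic, while a single shared site that observes both shapes goes polymorphic — the situation we diagnosed in the <code class="language-plaintext highlighter-rouge">Array#empty?</code> graph.</p>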
<p>So, the real problem here isn’t that we have multiple specializations for <code class="language-plaintext highlighter-rouge">Array#size</code>, but rather that the splitting strategy isn’t optimal.
By default, Truffle makes splitting decisions based on a set of internal heuristics and observed execution profiles at runtime.
The Truffle API allows language implementors to exert more control over the process by forcing splitting where appropriate.</p>
<p>TruffleRuby implements the Ruby language both in Java and in Ruby itself.
We currently use a very coarse heuristic: methods implemented in Java are deemed the most critical, and as such we instruct Truffle to always split them.
Both <code class="language-plaintext highlighter-rouge">Array#size</code> and <code class="language-plaintext highlighter-rouge">Fixnum#==</code> are implemented in Java and thus are always split.
However, splitting every method in our runtime would result in a massive explosion in graph size, so we do not force splitting of methods written in Ruby, such as <code class="language-plaintext highlighter-rouge">Array#empty?</code>.</p>
<p>It appears in this situation we haven’t given Truffle enough information to make an optimal splitting decision.
As a result, <code class="language-plaintext highlighter-rouge">Array#empty?</code> isn’t split and all of its callers end up sharing a single copy of the method.
Moreover, even though <code class="language-plaintext highlighter-rouge">Array#size</code> is instructed to split, splitting only occurs at unique call sites.
Since we have a single shared copy of <code class="language-plaintext highlighter-rouge">Array#empty?</code> there’s only one call site for <code class="language-plaintext highlighter-rouge">Array#size</code> and as a result, <code class="language-plaintext highlighter-rouge">Array#size</code> becomes polymorphic as well.</p>
<h2 id="improving-the-method">Improving the Method</h2>
<p>Unfortunately, improving Truffle’s splitting heuristics is a research project unto itself.
But, I wanted to show an improvement in compilation so we could do a before-and-after comparison of the compiler graphs.
As it turns out, we can improve <code class="language-plaintext highlighter-rouge">Array#size</code> by using a single specialization with a value profile on the return value.
At first blush this sounds like I’m backtracking on my previous assertion that we do not want to eliminate the special handling of empty arrays.
However, by adding a value profile on the return value we’re able to achieve our original goal of returning a constant value for empty arrays.</p>
<p>You might recall that we already have value profiles established for each argument to a method.
In the case of <code class="language-plaintext highlighter-rouge">Fixnum#==</code> this is sufficient to get a constant return value because the arguments are unboxed to primitive ints and comparing the exact same two ints always returns the same value.</p>
<p>Ruby arrays are mutable, however, so we can’t cache the method result based on object identity; <code class="language-plaintext highlighter-rouge">Array#size</code> must always read the length field from the underlying store.
And since we can’t know <em>a priori</em> what the length of an arbitrary array is, the return value is variable.
In our initial implementation of <code class="language-plaintext highlighter-rouge">Array#size</code> we specialized on empty arrays by explicitly returning the constant value <code class="language-plaintext highlighter-rouge">0</code>, while non-empty arrays returned the variable result of the loaded field length.</p>
<p>By adding a value profile on the result of the length field load, we can return a constant value for <code class="language-plaintext highlighter-rouge">Array#size</code> of both empty & non-empty arrays, as long as the profile doesn’t become generic.
For monomorphic call sites, calling <code class="language-plaintext highlighter-rouge">Array#size</code> on an empty array will still return a constant value <code class="language-plaintext highlighter-rouge">0</code>.
As an added bonus, non-empty arrays will now also return a constant value, providing the array size doesn’t change.
And we’ve eliminated the need for two specializations to be compiled in for polymorphic call sites.</p>
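<p>Putting the pieces together, the new implementation behaves like the sketch below: a single specialization that always loads the length, combined with a value profile on the loaded value (illustrative Ruby that mirrors, but does not use, Truffle&#8217;s <code class="language-plaintext highlighter-rouge">ValueProfile</code>):</p>

```ruby
# Model of the single-specialization Array#size with a value profile on
# the length load. While the profile remains constant, the return value
# can be treated as a constant for empty AND non-empty arrays alike.
class ProfiledArraySize
  def initialize
    @state = :uninitialized
    @cached_length = nil
  end

  def call(array)
    length = array.length # the one and only specialization: load the field
    case @state
    when :uninitialized
      @cached_length = length
      @state = :constant
    when :constant
      @state = :generic unless length == @cached_length
    end
    length
  end

  # True while a compiler could still fold the result to a constant.
  def foldable?
    @state == :constant
  end
end
```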
<p>With the new implementation the performance profile for the empty & non-empty array cases is now identical <em>and</em> smaller than the original.
Note that compilation times are non-deterministic; despite compiling the exact same graph, we see some variance in how long it takes compilation to complete.</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Empty Array</th>
<th>Non-empty Array</th>
</tr>
</thead>
<tbody>
<tr>
<td>AST Nodes</td>
<td>5</td>
<td>5</td>
</tr>
<tr>
<td>AST + Inlined Call AST Nodes</td>
<td>5</td>
<td>5</td>
</tr>
<tr>
<td>Inlined Calls</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Dispatched Calls</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Partial Evaluation Nodes</td>
<td>34</td>
<td>34</td>
</tr>
<tr>
<td>Graal Lowered Nodes</td>
<td>17</td>
<td>17</td>
</tr>
<tr>
<td>Code Size (bytes)</td>
<td>130</td>
<td>130</td>
</tr>
<tr>
<td>Partial Evaluation Time (ms)</td>
<td>49</td>
<td>63</td>
</tr>
<tr>
<td>Graal Compilation Time (ms)</td>
<td>3</td>
<td>2</td>
</tr>
<tr>
<td>Total Compilation Time (ms)</td>
<td>52</td>
<td>65</td>
</tr>
</tbody>
</table>
<p>More importantly, this new <code class="language-plaintext highlighter-rouge">Array#size</code> implementation allows Truffle to perform more optimizations in <code class="language-plaintext highlighter-rouge">Array#empty?</code>.
Re-running the Ruby language specs, we now see:</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Old Array#empty?</th>
<th>New Array#empty?</th>
</tr>
</thead>
<tbody>
<tr>
<td>AST Nodes</td>
<td>19</td>
<td>19</td>
</tr>
<tr>
<td>AST + Inlined Call AST Nodes</td>
<td>32</td>
<td>32</td>
</tr>
<tr>
<td>Inlined Calls</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>Dispatched Calls</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Partial Evaluation Nodes</td>
<td>112</td>
<td>42</td>
</tr>
<tr>
<td>Graal Lowered Nodes</td>
<td>48</td>
<td>26</td>
</tr>
<tr>
<td>Code Size (bytes)</td>
<td>263</td>
<td>176</td>
</tr>
<tr>
<td>Partial Evaluation Time (ms)</td>
<td>58</td>
<td>57</td>
</tr>
<tr>
<td>Graal Compilation Time (ms)</td>
<td>8</td>
<td>5</td>
</tr>
<tr>
<td>Total Compilation Time (ms)</td>
<td>66</td>
<td>62</td>
</tr>
</tbody>
</table>
<p>The new <code class="language-plaintext highlighter-rouge">Array#empty?</code> (really, the same method but with a better <code class="language-plaintext highlighter-rouge">Array#size</code>) has roughly one-third the number of nodes after partial evaluation and is about two-thirds the code size.
If we look at the compiled method in IGV now, we see a much simpler graph:</p>
<p><img src="/images/a-systematic-approach-to-improving-truffleruby-performance/array-empty-new-annotated.png" /></p>
<p>This graph looks much more like the composition of the <code class="language-plaintext highlighter-rouge">Array#size</code> and <code class="language-plaintext highlighter-rouge">Fixnum#==</code> graphs we would expect to see.
The only thing that should look foreign at this point is the introduction of the inline cache right around the middle of the graph.
This was present in the old <code class="language-plaintext highlighter-rouge">Array#empty?</code> implementation as well, but was buried in a much larger graph.
Since <code class="language-plaintext highlighter-rouge">Array#empty?</code> calls <code class="language-plaintext highlighter-rouge">Array#size</code>, we need to ensure we call the correct <code class="language-plaintext highlighter-rouge">size</code> method.
To avoid expensive dynamic method lookups, TruffleRuby uses a <a href="https://en.wikipedia.org/wiki/Inline_caching">polymorphic inline cache</a>.
What you see here is a guard on the receiver type to ensure the cached method lookup is valid, triggering deoptimization if it is not.</p>
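<p>An inline cache can be sketched as a per-call-site map from receiver class to a cached method lookup, guarded by a class check. This is illustrative Ruby; TruffleRuby builds its cache out of Truffle nodes and deoptimizes rather than falling back as the sketch does:</p>

```ruby
# Minimal model of a polymorphic inline cache: perform the expensive
# method lookup once per receiver class, then guard later calls on the
# cached class. Past a size limit the site goes megamorphic and every
# call pays for a full dynamic lookup.
class InlineCache
  def initialize(method_name, max_entries = 8)
    @method_name = method_name
    @max_entries = max_entries
    @entries = {} # receiver class => cached UnboundMethod
  end

  def dispatch(receiver, *args)
    cached = @entries[receiver.class] # the guard: a receiver class check
    if cached.nil?
      if @entries.size >= @max_entries
        return receiver.public_send(@method_name, *args) # megamorphic
      end
      cached = receiver.class.instance_method(@method_name) # slow path
      @entries[receiver.class] = cached
    end
    cached.bind(receiver).call(*args)
  end
end
```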
<p>Unfortunately, we can’t return a constant value from <code class="language-plaintext highlighter-rouge">Array#empty?</code> without splitting the method.
The method is called in multiple places during TruffleRuby’s bootstrap process and becomes polymorphic before end-user code is even executed.
In practice, this may not be a real problem because the output of the method is a boxed <code class="language-plaintext highlighter-rouge">boolean</code>, which is likely to be fed as the input into another method, which will have value profiling enabled on the arguments.</p>
<h2 id="conclusion">Conclusion</h2>
<p>When it comes to performance analysis we often turn to benchmarks and profilers.
Both are great tools in their own right, but they often lack granularity.
Truffle’s tracing output and IGV provide lower-level metrics that can be used to improve a method’s compilation.</p>
<p>As a Truffle language developer, ensuring our code is amenable to Truffle’s partial evaluator is one of the most important things we can do.
So, while a lot of the preceding discussion looks like micro-optimization, it’s the sort of thing that can have large cascading consequences.
E.g., eliminating <code class="language-plaintext highlighter-rouge">Phi</code> nodes generally allows Truffle to perform more extensive constant folding.</p>
<p>I think perhaps the most important part of all this is you don’t need to be a compiler expert to help the compiler out.
You don’t even need to be able to evaluate the resulting machine code.
If you know the basic rules of how to play nicely with Truffle you can go really far.
Granted, reading the IGV graphs does take some getting used to — there’s still a fair bit I don’t understand!
But you can get pretty far with pattern matching after learning the basics.
You can easily test out a hypothesis in a high level language (Java or Ruby, in this case) and compare the before & after metrics.
To me, that’s incredibly powerful.</p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>Pulling all of this info together took quite a while.
Many thanks to <a href="https://twitter.com/chrisgseaton">Chris Seaton</a>, <a href="https://twitter.com/eregontp">Benoit Daloze</a>, and <a href="https://twitter.com/pitr_ch">Petr Chalupa</a> for reviewing drafts, suggesting improvements, and catching errors.</p>
<hr />
<p><a name="footnote_1"></a>
<sup>1</sup>
<small>
<code class="language-plaintext highlighter-rouge">C(x)</code> is an IGV shorthand for constant integer value <code class="language-plaintext highlighter-rouge">x</code>.
</small></p>
<p><a name="footnote_2"></a>
<sup>2</sup>
<small>
Calling this a “large” graph is a bit misleading.
Certainly it’s larger than any other graph in the blog post, but it’s actually quite small by Truffle standards.
If you’re performing your own analysis of a method in Truffle, please don’t take this graph as a benchmark of typical graph size.
</small></p>
<p><a name="footnote_3"></a>
<sup>3</sup>
<small>
In the original <a href="https://pdfs.semanticscholar.org/8251/bd1f9496b61e7418b96a816f31477de4c75d.pdf">“One VM to Rule Them All”</a> paper, method splitting is referred to as “tree cloning.”
You can consult the paper for a more thorough treatment of the topic.
</small></p>
<p>Kevin Menard</p>
<h1 id="truffleruby-on-the-substrate-vm">TruffleRuby on the Substrate VM</h1>
<p>2017-02-15 (<a href="http://nirvdrum.com/2017/02/15/truffleruby-on-the-substrate-vm">http://nirvdrum.com/2017/02/15/truffleruby-on-the-substrate-vm</a>)</p>
<h2 id="introduction">Introduction</h2>
<p>In the TruffleRuby <a href="http://lists.ruby-lang.org/pipermail/jruby/2017-January/000511.html">status report for 2016</a>, we indicated that we planned to address our startup time problem in the coming year.
We’re well aware that one of the biggest challenges facing alternative Ruby implementations is being competitive with MRI on startup time and memory consumption.
While we have established a peak performance advantage over other implementations in many benchmarks, we haven’t had a very impressive story to tell for short-lived applications such as one-off Ruby scripts and test suites.</p>
<p>I’m extremely pleased to say that we’re making good on that promise.
Starting with GraalVM 0.20, we’re now shipping a new virtual machine and ahead-of-time compiler that can produce static binaries of Truffle-based languages.</p>
<h2 id="glossary">Glossary</h2>
<p>There’s a lot of terminology involved in discussing TruffleRuby.
If you’ve been following our project status you might already be familiar with them, but I assume many people are not.
In order to make the remainder of this post easier to understand, I’ve provided a brief glossary below:</p>
<dl>
<dt>Graal</dt>
<dd>An optimizing compiler for the JVM written in Java with hooks available to interact with it from Java.</dd>
<dt>Truffle</dt>
<dd>A self-optimizing AST interpreter framework for Java. When paired with Graal it is able to perform partial evaluation on the interpreter and user program to produce tight machine code for the running application.</dd>
<dt>TruffleRuby</dt>
<dd>An implementation of Ruby based on Truffle. Since it uses Truffle, the runtime is authored in Java, but much of the core library is written in Ruby.</dd>
<dt>GraalVM</dt>
<dd>A distribution containing a Graal-enabled JVM and runtimes for Truffle-based languages (currently TruffleRuby, Graal.js, and FastR) from Oracle Labs.</dd>
</dl>
<h2 id="performance">Performance</h2>
<p>Isolating startup time from program execution can be a bit tricky.
Rather than getting mired in the details, I’ve measured<a href="#footnote_1"><sup>1</sup></a> an extremely simple program: <code class="language-plaintext highlighter-rouge">ruby -e 'p "Hello, world"'</code>.
If you want to follow along, simply <a href="https://github.com/graalvm/truffleruby/blob/master/doc/user/using-graalvm.md">install GraalVM</a> and <a href="https://github.com/graalvm/truffleruby/blob/master/doc/user/svm.md">build the TruffleRuby binary</a>.</p>
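<p>If you&#8217;d like to take comparable wall-clock measurements yourself, a minimal harness looks something like this (an illustrative sketch; substitute whichever interpreter binaries you have installed):</p>

```ruby
require "rbconfig"

# Time a short-lived subprocess from launch to exit; for a trivial
# program this is dominated by interpreter startup.
def measure_startup(command)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  system(*command, out: File::NULL, err: File::NULL)
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Measure the currently running Ruby; swap in other binaries to compare.
elapsed = measure_startup([RbConfig.ruby, "-e", 'p "Hello, world"'])
puts format("%.3fs", elapsed)
```

<p>Max RSS isn&#8217;t captured by this script; an external tool such as GNU <code class="language-plaintext highlighter-rouge">time</code> is needed for the memory column.</p>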
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: center">Real Time (s)</th>
<th style="text-align: right">Max RSS (MB)</th>
</tr>
</thead>
<tbody>
<tr>
<td>TruffleRuby SVM 0.20</td>
<td style="text-align: center">0.24</td>
<td style="text-align: right">128.6</td>
</tr>
<tr>
<td>TruffleRuby JVM 0.20</td>
<td style="text-align: center">3.43</td>
<td style="text-align: right">439.6</td>
</tr>
<tr>
<td>JRuby 9.1.7.0</td>
<td style="text-align: center">1.55</td>
<td style="text-align: right">200.4</td>
</tr>
<tr>
<td>Rubinius 3.69</td>
<td style="text-align: center">0.27</td>
<td style="text-align: right">71.1</td>
</tr>
<tr>
<td>MRI 2.4.0</td>
<td style="text-align: center">0.05</td>
<td style="text-align: right">8.8</td>
</tr>
</tbody>
</table>
<p>Running TruffleRuby on the Substrate VM is 13 times faster than running on the JVM while only using 30% as much memory.
We still have a ways to go before we catch up to MRI’s startup time, but TruffleRuby on the Substrate VM starts up faster than all other alternative Ruby implementations.
It’s unlikely we’ll ever approach MRI’s memory consumption because we must retain runtime metadata for the Graal compiler, but we should be able to whittle it down further and run well on memory-constrained cloud servers.</p>
<p>Turning our attention to a more real world application, I ran the set of language specs from the <a href="https://github.com/ruby/spec">Ruby Spec Suite</a>.
These specs look and run very similarly to a typical application’s test suite.</p>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: center">Real Time (s)</th>
<th style="text-align: right">Max RSS (MB)</th>
</tr>
</thead>
<tbody>
<tr>
<td>TruffleRuby SVM 0.20</td>
<td style="text-align: center">9.00</td>
<td style="text-align: right">1,364.1</td>
</tr>
<tr>
<td>TruffleRuby JVM<a href="#footnote_2"><sup>2</sup></a> 0.20</td>
<td style="text-align: center">68.38</td>
<td style="text-align: right">560.8</td>
</tr>
<tr>
<td>JRuby<a href="#footnote_3"><sup>3</sup></a> 9.1.7.0</td>
<td style="text-align: center">37.57</td>
<td style="text-align: right">380.6</td>
</tr>
<tr>
<td>Rubinius<a href="#footnote_3"><sup>3</sup></a> 3.69</td>
<td style="text-align: center">7.18</td>
<td style="text-align: right">112.6</td>
</tr>
<tr>
<td>MRI 2.4.0</td>
<td style="text-align: center">1.01</td>
<td style="text-align: right">13.7</td>
</tr>
</tbody>
</table>
<p>Test suites like this are generally hard on optimizing runtimes.
They always start in a cold state, they run for a short period of time, and in the case of JVM-based languages they incur a high degree of overhead when spawning new processes.
For this test suite, which has 2,121 specs and 3,824 assertions, TruffleRuby on the SVM is 6.6 times faster than TruffleRuby on the JVM, shaving almost a full minute off the test suite — a fairly substantial savings.
However, that speedup is smaller in relative terms than what we saw with the simple startup test, suggesting we have additional opportunities to reduce total run time.</p>
<p>At first blush, the increased max RSS value is concerning.
The big difference here is that the SVM has a new generational garbage collector (GC) that’s different from the JVM’s.
The SVM ahead-of-time (AOT) compiler uses a 1 GB young generation size and a 3 GB old generation size by default.
That means the GC won’t even start collecting until TruffleRuby has allocated 1 GB of memory.
Over the course of those specs we generate a lot of garbage.
Fortunately, this isn’t an inherent limitation of either TruffleRuby or the SVM; the SVM can be used for applications other than TruffleRuby and simply defaults to a large heap as a conservative measure.
For future releases we’ll look for better defaults for our needs.</p>
<p>The numbers show we haven’t quite caught up to MRI yet but we’re quickly closing the gap.
As both the SVM and TruffleRuby continue to mature, we expect we’ll be able to approach MRI’s level of responsiveness.
I think these initial results suggest the approach is viable and that our goal is realistic.
We’re currently on a monthly release schedule and will continue to track these metrics.</p>
<p>If you’ve held off using TruffleRuby due to concerns about startup performance, now is a great time to <a href="https://github.com/graalvm/truffleruby">experiment with it</a>.
Just keep in mind that this is an early release and there are certainly bugs.
We do have an open <a href="https://github.com/graalvm/truffleruby/issues">issue tracker</a> and love receiving reports from real workloads.</p>
<h2 id="more-about-the-substrate-vm">More about the Substrate VM</h2>
<p>The Substrate VM is a project under the GraalVM umbrella at Oracle Labs, headed by Christian Wimmer.
The basic idea behind the Substrate VM is to provide a new distribution mechanism for languages authored with Truffle.
The Truffle framework is Java-based, which means languages wishing to make use of Truffle must also be written in Java.
Authoring a language, such as Ruby, in a high level language like Java brings many advantages such as excellent IDEs, refactoring capabilities, performance & analysis tools, and a codebase that’s easy to maintain.
However, it also brings with it disadvantages such as slower startup time, increased memory usage, and distribution difficulties.</p>
<p>The Substrate VM addresses those deficiencies by producing a static binary of a Truffle language runtime<a href="#footnote_4"><sup>4</sup></a>.
It performs an extensive static analysis of the runtime, noting which classes and methods are used and stripping away the ones that are not.
The AOT compiler then performs some up-front optimizations such as trivial method inlining and constructs the metadata necessary for Graal to perform its runtime optimizations.
The final output is a version of the Truffle language interpreter fully compiled to native machine code; i.e., there is no Java bytecode anywhere.
As an added benefit, the binary size is considerably smaller than the JVM’s because all the unused classes and methods are excluded.</p>
<p>This is a powerful new addition to the Truffle toolchain.
In addition to an optimizing JIT, GC, profiler, debugger, and built-in polyglot support, by implementing your language on top of Truffle you now get an ahead-of-time compiler for your interpreter.</p>
<p>As with all technology, there are trade-offs to targeting the Substrate VM.
In order to perform its static analysis, all code to be included in the binary must be statically reachable.
Consequently, reflection and dynamic class loading can’t be used.
Additionally, arbitrary native code cannot be called.
The Substrate VM does ship with built-in support for <a href="https://github.com/jnr/jnr-posix">JNR POSIX</a> to make native POSIX calls easier.
And, as the Substrate VM is still a young project, certain APIs from the JDK might not yet be supported.</p>
<p>For code you have control over, these restrictions might be inconvenient, but are generally manageable.
However, pulling in arbitrary 3rd party code can create headaches since it’s not quite as easy to avoid paths that don’t comply with the SVM’s restrictions.
The SVM does include a mechanism to replace a method implementation at runtime, much like aspect-oriented programming, but this should be used as a last resort.</p>
<h2 id="conclusion">Conclusion</h2>
<p>The Substrate VM provides Truffle-based languages, such as TruffleRuby, with an incredible new way to deliver a language runtime.
If you’ve been following the TruffleRuby project, you likely know we’ve been talking about the SVM for the past couple years as our solution to solving our startup time problem.
I’m excited to say it wasn’t vaporware!
The results for TruffleRuby on the SVM thus far are extremely promising and I think the release of the SVM marks a new evolutionary phase of the TruffleRuby project.</p>
<p>To me, one of the most amazing parts of all this is that the same TruffleRuby codebase is used to target the JVM, the GraalVM, and now the SVM.
You no longer have to choose between fast startup and peak performance — just use the best VM for your given context.</p>
<h2 id="additional-resources">Additional Resources</h2>
<p>If you’re interested in learning more about the history of TruffleRuby or getting a deeper look at its internals, please take a look at the <a href="http://chrisseaton.com/rubytruffle/">collection of resources</a> Chris Seaton has assembled.
Truffle and Graal are active research projects with a rich set of publications <sub>(<a href="https://github.com/graalvm/truffle/blob/master/docs/Publications.md">Truffle</a>, <a href="https://github.com/graalvm/graal-core/blob/master/docs/Publications.md">Graal</a>)</sub>.
If you’d like to learn more about implementing a language with Truffle, I suggest checking out the <a href="https://github.com/graalvm/simplelanguage">SimpleLanguage implementation</a> and watching Christian Wimmer’s <a href="https://www.youtube.com/watch?v=FJY96_6Y3a4">walkthrough of the code</a>.</p>
<hr />
<p><a name="footnote_1"></a>
<sup>1</sup>
<small>
All measurements were taken on an i7-4930K running at 4.1 GHz and 48 GB RAM.
The operating system was Ubuntu 16.04.1 with a 4.4 Linux kernel.
</small></p>
<p><a name="footnote_2"></a>
<sup>2</sup>
<small>
Due to a bug in GraalVM 0.20, the Ruby Spec Suite language specs do not run with runtime compilation enabled.
For this evaluation I ran with a stock JVM, while the startup tests report the JVM with Graal.
</small></p>
<p><a name="footnote_3"></a>
<sup>3</sup>
<small>
Neither JRuby nor Rubinius pass 100% of the language specs from the Ruby Spec Suite.
As a result, they error out on some specs that TruffleRuby and MRI pass.
Since they’re not executing the same code the recorded time and memory values shouldn’t be taken as definitive.
</small></p>
<p><a name="footnote_4"></a>
<sup>4</sup>
<small>
There is no technical reason SVM can’t be used with arbitrary Java applications.
However, its primary use case and the one driving development is Truffle-based language runtimes.
</small></p>Kevin MenardIntroductionOpen Sourcing a Failed Startup2014-11-20T00:00:00+00:002014-11-20T00:00:00+00:00http://nirvdrum.com/2014/11/20/open-sourcing-mogotest<h2 id="background">Background</h2>
<p>In late October, 2014 I announced that I would be shutting down Mogotest. After close to 5 years of operations it was clear I wouldn’t be able to grow the business. I don’t think it was due to lack of business opportunity, but due to some business decisions made early on that became very difficult to course correct. The exact line of reasoning that justified the shutdown is a topic for another day. The purpose of this post is to discuss what to do with the code after the fact.</p>
<h2 id="are-you-going-to-open-source-it">Are you Going to Open Source It?</h2>
<p>Rather predictably, one of the first things I was asked after I announced the shutdown was whether I would be open sourcing it. I was asked by current customers, by friends, by companies that were interested in the tech but never felt the need to support it by giving us business, by random people on Twitter, and so on. I had already gone through some of the thought process a priori, but I was in a different state of mind then. Getting the bombardment of questions after the announcement impacted me in ways I couldn’t predict.</p>
<p>For some additional context, I contribute to a lot of open source projects. I don’t have a “brand name” and I’ve never professionally sold open source software or sold consulting services around it, but I’ve worked with a lot of projects. I use the Apache Software License version 2.0 for just about everything. And I guess I would consider myself more of a pragmatist than an ideologue when it comes to open source software.</p>
<p>With that said, my gut reaction was to not open source it. My analytical reaction was also not to open source it.</p>
<h2 id="why-not">Why Not?</h2>
<p>I’d just like to insert a standard disclaimer at this point that what follows is my own experience and my own potentially irrational thought process. If anything I say comes off as a generality, please note that my pomposity stops short of speaking for others.</p>
<p>First up is the emotional aspect. I had just made the extremely difficult decision to walk away from something I spent the past 5 years of my life dedicated to. During that time, I lost at least two full years of wages, pissed through my savings, and lost ~$40K USD in cash invested into the business. I battled with some form of founder depression. Statistically speaking, this was the most likely outcome, so I’m hardly looking for sympathy. But, having made that gut-wrenching decision to walk away from it all, the prospect of going back to it and investing a non-trivial amount of effort just to give it away is a really tough pill to swallow.</p>
<p>Also on the emotional aspect is just my own human pettiness. I’ve been asked to open source the codebase from people that evidently didn’t think the software was good enough to be worth paying for as a service. I’ve been asked to open source the codebase by other companies in the space that didn’t want to buy the rights when I was shopping the company around. So, while I really want to provide a soft landing for my customers, I really didn’t want to just be giving everything away to those that just wanted to mooch.</p>
<p>Setting all that aside, open sourcing the codebase is not some trivial process. And I’m not talking wanting to clean up stuff I might be embarrassed by. Here’s a non-comprehensive list of issues that need to be addressed:</p>
<ul>
<li>The web site design was a theme bought on WrapBootstrap that I don’t have the rights to sublicense.</li>
<li>The rich UI widgets come from the commercial version of ExtJS. That needs to be excised or the whole project needs to be GPLv3.</li>
<li>Sidekiq Pro needs to be removed.</li>
<li>Every JS lib and every image resource we used must have its license examined and potentially be replaced.</li>
<li>Any customer info that made its way into the code needs to be removed. As an example, we built up an extensive regression suite around customer data that can’t be distributed. This whole process means auditing every file in the codebase.</li>
<li>Ensuring any API keys or passwords aren’t floating around in the source or configuration files (obviously bad, but things happen).</li>
<li>Potentially unobscuring security holes while the service is still running.</li>
<li>Removing all the billing code.</li>
<li>Removing all the drip email campaign code.</li>
<li>Removing any other non-Web Consistency Testing parts from the code.</li>
</ul>
<p>A lot of this is a liability. Going through it all is a ton of work. After all that, I open myself up to all sorts of scrutiny I don’t really care for. Sometimes I swear in code. I hold a somewhat traditionalist view of English and prefer my plurality to match up, so I use gendered pronouns in my personal writings, which will have now just become public. I’m certain there is some colorful commentary about each of the browser vendors buried somewhere in the source. Without a doubt, something in this codebase will offend someone and my personal reputation is at risk when it simply wouldn’t have been by keeping it private.</p>
<p>It’s basically all the work required to clean up during an acquisition, but with the inverse financial outcome.</p>
<p>If I managed to clear that hurdle, the next problem is that I simply don’t find there to be much value in open source code. Open source projects, yes. Open source code, rarely. I won’t have either the time or the wherewithal to spend any additional effort on this project. If I make the code publicly available, people will have questions that I won’t have time to answer. Consequently, I’m just going to constantly feel like an inadequate piece of garbage. On the other hand, if I manage to find time to engage, I don’t have the energy to justify every design decision. Some things do just look silly, but they were the product of the constraints imposed at the time. Contextually, they were sound. In today’s world … probably not so much. Fixing them would certainly be progress, but in my experience these sorts of things aren’t approached tactfully and I’d rather not be called an idiot without having the resources to defend the context.</p>
<h2 id="second-thoughts">Second Thoughts</h2>
<p>At the end of the day, I want Web Consistency Testing to evolve. If making Mogotest open source will help achieve that, I’m willing to overlook some of the other problems. I’ve already released the <a href="https://bitbucket.org/mogotest/">ancillary libraries</a> as ASLv2, and I was going to release the main application under the Affero General Public License (AGPL). After spending 14 hours cleaning things up this past weekend, I’m still not 100% certain I’m not violating IP somewhere or leaking customer info and I’ve had to gut the product so thoroughly that it’s virtually useless. Rewriting all the view code just isn’t something I have the desire to do.</p>
<p>In conjunction with the decision to use the AGPL, I decided to try a <a href="https://www.indiegogo.com/projects/open-source-mogotest/x/2556255">crowd-sourced campaign</a> to help with the open sourcing effort. Precisely zero of the companies that have been begging me to open-source the code have contributed in any capacity. The incredible amount of spam I’ve received via comments on IndieGogo and Twitter has been equally disheartening.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I had my initial emotional reaction, I analyzed the hell out of it, I decided against my better judgment to try opening the code anyway, and I simply can’t do it. I think the tools I have open-sourced will be beneficial to others and I’ve explained how things work fairly extensively in <a href="http://webconsistencytesting.com/">a talk I gave</a> at Google’s Test Automation Conference. A clean-room implementation shouldn’t be too onerous, given I’ve solved a lot of the environmental problems you’re apt to encounter. Unfortunately, this is where I have to get off the train.</p>Kevin MenardBackgroundImproving Sidekiq Performance with JRuby2014-03-20T00:00:00+00:002014-03-20T00:00:00+00:00http://nirvdrum.com/2014/03/20/improving-sidekiq-performance-with-jruby<p>To handle all the work in our <a href="http://webconsistencytesting.com/">Web Consistency Testing</a> process we use a
background job queueing system called <a href="https://github.com/mperham/sidekiq">Sidekiq</a>. Sidekiq, in turn, makes use of a
Ruby actor framework called <a href="https://github.com/celluloid/celluloid">Celluloid</a>. And Celluloid ultimately makes use of
a Ruby 1.9 feature called <a href="http://www.ruby-doc.org/core-1.9.3/Fiber.html">Fibers</a>. Fibers are Ruby’s implementation of
coroutines. Unfortunately, the JVM doesn’t natively support coroutines, so in order to be API compatible with MRI,
JRuby needs to fake it with JVM threads. This has historically proven tricky to implement properly and has resulted in
a variety of issues related to both correctness and performance throughout the JRuby 1.7 release cycle.</p>
<p>Fortunately, we haven’t encountered any Fiber bugs since JRuby 1.7.4. However, while recently profiling our application
in production we found that anywhere from 25% - 35% of total CPU time was spent in Fiber creation. Incidentally,
at least two others running JRuby 1.7.10 discovered the same thing around the same time, making for some fun IRC conversation.</p>
<p>Fibers on MRI are cheap to create so their use case usually entails using a lot of them in a compressed timeframe. JRuby’s
Fiber implementation requires using one thread per Fiber. This can result in extremely high thread churn if they’re
not reused. In the case of Sidekiq & Celluloid, every background job ends up creating a Fiber. With JRuby 1.7.10,
if we have 100,000 jobs to run, we’re creating at least 100,000 threads. While the JVM may be very good at
multi-threading, thread creation itself is still something of a heavy process and unnecessary object allocations are
generally bad for performance.</p>
<p>The fix was straightforward: add a pool of some sort. <a href="https://github.com/cheald">Chris Heald</a> had
come across the same Fiber performance issue and solved it by implementing a
<a href="https://github.com/celluloid/celluloid/pull/371">Fiber pool for Celluloid</a> that fixes the problem for all versions of
JRuby. Unfortunately, in order to get Sidekiq to use it one must
<a href="https://github.com/nirvdrum/jruby_fiber_benchmark/blob/master/noop_worker.rb#L10-L12">monkeypatch Sidekiq</a> after it’s
already started. JRuby 1.7.11 fixes the Fiber problem by using an internal thread pool dedicated to Fibers, substantially
reducing the number of allocated threads.</p>
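Both fixes — Chris Heald’s Celluloid pool and JRuby 1.7.11’s internal pool — rest on the same idea: park idle Fibers and hand them new work instead of creating a fresh Fiber (and, on JRuby, a fresh thread) per job. The following is a minimal sketch of that reuse pattern in plain Ruby; the <code>FiberPool</code> class and its <code>run</code> method are illustrative names, not either project’s actual API.

```ruby
# A minimal sketch of fiber reuse. Instead of one new Fiber per job, a
# finished fiber parks itself in an idle list and waits to be handed more
# work. Illustrative only -- not Celluloid's or JRuby's real implementation.
class FiberPool
  attr_reader :created

  def initialize
    @idle = []
    @created = 0
  end

  # Run the job on an idle fiber if one exists; otherwise create a new one.
  def run(&job)
    fiber = @idle.pop || create_fiber
    fiber.resume(job)
  end

  private

  def create_fiber
    @created += 1
    Fiber.new do |job|
      loop do
        job.call
        @idle << Fiber.current # park ourselves for reuse...
        job = Fiber.yield      # ...and suspend until handed the next job
      end
    end
  end
end

pool = FiberPool.new
results = []
5.times { |i| pool.run { results << i * 2 } }

results      # => [0, 2, 4, 6, 8]
pool.created # => 1 -- five jobs, but only one fiber ever created
```

On MRI the win is modest because Fibers are cheap there; on a JRuby where each Fiber is backed by a thread, skipping four of the five allocations above is exactly the churn reduction described in the post.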
<p>In order to measure the performance profile of each of these options, I pulled together a
<a href="https://github.com/nirvdrum/jruby_fiber_benchmark">very simple Sidekiq job</a>.
This job effectively does nothing more than run through the mechanics of Sidekiq. The test methodology here isn’t
exactly precise, but it’s good enough to get a sense of magnitude. Basically, in one process we run the Sidekiq daemon,
which will spin up 100 workers to process a work queue. In another process we enqueue a configurable number of jobs
and then busy loop until the workers have depleted the queue; when the queue is empty, we report the execution time.
Since that busy loop will sleep for 100ms, results may be off by at least that much. Likewise, an empty queue doesn’t
actually mean all the work has been completed; it just means a worker has taken a job from the queue, so results will be
skewed by however long it takes a Sidekiq worker to no-op. At worst, that is expected to be another couple hundred milliseconds.</p>
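The enqueue-and-wait measurement described above can be sketched in plain Ruby without Redis or Sidekiq. Here a thread-safe <code>Queue</code> stands in for the Redis-backed queue, the 100ms polling interval matches the methodology, and the worker and job counts are arbitrary.

```ruby
# Sketch of the measurement loop: fill a queue, let workers drain it, and
# busy-wait until it's empty. A core Queue stands in for Redis here.
jobs = Queue.new
1_000.times { |i| jobs << i }

# "Workers" drain the queue, standing in for the Sidekiq daemon's workers.
workers = 4.times.map do
  Thread.new do
    begin
      loop { jobs.pop(true) } # non-blocking pop; raises ThreadError when empty
    rescue ThreadError
      nil
    end
  end
end

start = Time.now
sleep 0.1 until jobs.empty? # the busy loop; adds up to ~100ms of error
elapsed = Time.now - start
workers.each(&:join)

puts format("queue drained in %.2fs", elapsed)
```

As in the real benchmark, an empty queue only proves every job has been taken, not that every worker has finished, which is why the reported time carries the slack described above.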
<p>All tests were run against a Linux Mint 16 machine (Ubuntu 13.10 derivative) with 16 GB RAM, SATA III SSD, running the
3.11.0-12-generic kernel on an Intel Core i7-2760QM (2.40 GHz quad core with hyperthreading). I used the 64-bit 1.7.0u51
JVM (Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) and my JRUBY_OPTS are set to
<code class="language-plaintext highlighter-rouge">-J-XX:+TieredCompilation -J-XX:TieredStopAtLevel=1 -J-noverify -J-Xmx2G -Xcompile.invokedynamic=false -Xreify.classes=true</code>.
Redis 2.6.3 was used and local RDB saving and AOF were disabled to remove disk I/O.</p>
<p>I also wanted to get a handle on whether performance scaled with the number of jobs, so I measured each dimension
with 1,000, 10,000, and 100,000 jobs. Since I am talking to Redis over a network socket (but on localhost), I ran each
experiment 3 times and took the average as a way to mitigate against any sort of network jitter. Table 1 summarizes
the results of these experiments.</p>
<table class="table table-bordered">
<caption>Table 1: Job execution time (s) per JRuby configuration.</caption>
<thead>
<tr>
<th>Jobs</th>
<th>JRuby 1.7.10</th>
<th>JRuby 1.7.10 <br /> w/ Fiber Pool</th>
<th>JRuby 1.7.11</th>
<th>JRuby 1.7.11 <br /> w/ Fiber Pool</th>
</tr>
</thead>
<tbody>
<tr>
<th>1,000</th>
<td>1.342</td>
<td>0.931</td>
<td>0.973</td>
<td>0.877</td>
</tr>
<tr>
<th>10,000</th>
<td>11.860</td>
<td>7.130</td>
<td>7.448</td>
<td>7.037</td>
</tr>
<tr>
<th>100,000</th>
<td>116.636</td>
<td>68.917</td>
<td>71.930</td>
<td>68.863</td>
</tr>
</tbody>
</table>
<p>As it turns out my numbers did not vary significantly from what Chris saw in his own
<a href="https://github.com/celluloid/celluloid/pull/371#issuecomment-33721140">benchmark of Celluloid directly</a>. In both
cases, use of a Fiber pool substantially sped things up. Since almost all activity is CPU-bound here, that’s a pretty
massive savings. There’s no need to introduce a caching layer or optimize queries — just reduce costly object
allocation. And simply upgrading to JRuby 1.7.11 will yield a similar savings without any code modifications to your
application.</p>
<p>It’s unfortunate that JRuby ever spent this much time just burning CPU, but it’s also impressive that even with this
deficiency it has routinely beaten out MRI on Sidekiq-based benchmarks. It continues to be great that the JRuby
team is very responsive to reported performance issues. This matter was fixed within days of the issue being filed and
a new release was published a couple weeks later.</p>
<p>We’ve opted to run both JRuby 1.7.11 and the Fiber pool. Our backend jobs run faster and now that we’ve removed a major
bottleneck, we can focus on other areas in profiling.</p>Kevin MenardSidekiq performance in JRuby can be substantially improved by reducing Thread and Fiber allocations. We take a look at how JRuby 1.7.11 improves this by backing Fibers with an internal thread pool and how layering in a Fiber pool with Celluloid can improve things even more.Faster SecureRandom in JRuby 1.7.112014-03-11T00:00:00+00:002014-03-11T00:00:00+00:00http://nirvdrum.com/2014/03/11/faster-securerandom-in-jruby-1.7.11<p>While profiling our Rails app recently, SecureRandom surfaced as a hot spot. We use UUIDs to generate request IDs
so we can correlate different log statements with a logical user request. To isolate the problem, I removed the request IDs
and profiled again, but found that SecureRandom still showed up. As it turns out, in Rails 3.2+ the
<a href="https://github.com/rails/rails/blob/v3.2.17/actionpack/lib/action_dispatch/middleware/request_id.rb">ActionDispatch::RequestId</a>
middleware is enabled by default and does essentially what we were doing — it creates unique request IDs so
disjoint log statements can be grouped together. While we were able to cut out 50% of the SecureRandom calls just by
using the Rails middleware instead of our custom solution, we were still making a slow call on every request. And since
this is a Rails default behavior, it became evident this could be adversely impacting a lot of sites.</p>
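The per-request cost is easy to see from the shape of such a middleware. Below is a hedged sketch of what a request-ID middleware does — one <code>SecureRandom.uuid</code> call on every request, stashed in the Rack env for downstream loggers. The class and env key are illustrative; this is not the actual <code>ActionDispatch::RequestId</code> source.

```ruby
require 'securerandom'

# Minimal Rack-style middleware that tags each request with a UUID.
# Illustrative names only; not Rails' ActionDispatch::RequestId code.
class RequestIdMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    env["request_id"] = SecureRandom.uuid # the per-request hot spot
    @app.call(env)
  end
end

# Exercise it with a minimal Rack-style app (a lambda returning a triple).
app = RequestIdMiddleware.new(->(env) { [200, {}, [env["request_id"]]] })
status, _headers, body = app.call({})
```

Because this sits in front of every single request, even a few hundred microseconds per <code>SecureRandom.uuid</code> call shows up in aggregate profiles, which is how it surfaced for us.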
<p>One of the things I love about JRuby is that when a compelling benchmark is pulled together, it’s easy to convince the
team to focus on an area to improve performance. And if it’s demonstrated that the issue is affecting a lot of people,
the fix is given even higher priority. With new, stable releases coming out almost every three weeks, performance issues
don’t get buried for months or years at a time.</p>
<p>Initially we were concerned with raw execution performance, since that would manifest itself as slow response times.
Figure 1 shows the simple benchmarking code we used, while Figure 2 shows the results from JRuby 1.7.10 and MRI 2.1.0
merged into a single report.</p>
<div class="figure">
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"> <span class="nb">require</span> <span class="s1">'benchmark/ips'</span>
<span class="nb">require</span> <span class="s1">'securerandom'</span>
<span class="no">Benchmark</span><span class="p">.</span><span class="nf">ips</span> <span class="k">do</span> <span class="o">|</span><span class="n">x</span><span class="o">|</span>
<span class="n">x</span><span class="p">.</span><span class="nf">report</span><span class="p">(</span><span class="s1">'SecureRandom'</span><span class="p">)</span> <span class="p">{</span> <span class="no">SecureRandom</span><span class="p">.</span><span class="nf">uuid</span> <span class="p">}</span>
<span class="k">if</span> <span class="k">defined?</span><span class="p">(</span><span class="no">JRUBY_VERSION</span><span class="p">)</span>
<span class="n">x</span><span class="p">.</span><span class="nf">report</span><span class="p">(</span><span class="s1">'Java'</span><span class="p">)</span> <span class="p">{</span> <span class="n">java</span><span class="p">.</span><span class="nf">util</span><span class="o">.</span><span class="no">UUID</span><span class="p">.</span><span class="nf">random_uuid</span><span class="p">.</span><span class="nf">to_s</span> <span class="p">}</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></figure>
Figure 1: SecureRandom serial benchmark code.
</div>
<div class="figure">
<pre>
Calculating -------------------------------------
SecureRandom 3970 i/100ms
Java 7661 i/100ms
SecureRandom (MRI) 14068 i/100ms
-------------------------------------------------
SecureRandom 47757.0 (±6.6%) i/s - 238200 in 5.013000s
Java 104769.4 (±12.8%) i/s - 513287 in 5.032000s
SecureRandom (MRI) 177503.1 (±4.1%) i/s - 900352 in 5.082986s
</pre>
Figure 2: JRuby 1.7.10 SecureRandom serial benchmark results.
</div>
<p>If you’re not familiar with benchmark/ips, it basically measures throughput in a calculated timeframe and reports results
as the number of iterations per 100ms interval. The two instrumented calls show the difference between calling JRuby 1.7.10’s
implementation of SecureRandom for generating a UUID and calling Java’s built-in utility method for creating a UUID from JRuby.
Naturally, MRI can’t run the Java method, so its benchmarked value is the pure Ruby SecureRandom call. As can be seen,
both implementations in JRuby are slower than that of MRI. But what’s immediately interesting is how the JRuby
implementation is roughly twice as slow as the pure Java implementation. In theory, JRuby can achieve the same speed that
Java can. Of course that’s not always the case, but when there’s such a wide performance gap as seen here, there’s almost
certainly something that can be done more efficiently.</p>
<p>The <a href="https://github.com/jruby/jruby/blob/1.7.10/lib/ruby/shared/securerandom.rb">SecureRandom implementation in JRuby 1.7.10</a>
is a thin Ruby wrapper around the native Java SecureRandom implementation. In JRuby 1.7.11 it’s been <a href="https://github.com/jruby/jruby/blob/1.7.11/core/src/main/java/org/jruby/ext/securerandom/SecureRandomLibrary.java">rewritten</a>
to move critical sections into pure Java. This avoids costly coercion between Java and Ruby objects. Moreover,
rather than create a new Java SecureRandom object every time a random value is needed, the new implementation only
creates one for the life of a thread. This is a big performance boost because by default the JVM on Linux will
seed a SecureRandom instance from /dev/random, which locks on a synchronized method and may block indefinitely if /dev/random
determines it needs more environment data for its source of entropy. By reusing an already seeded instance a significant
amount of time can be saved as can be seen with the benchmark code run on JRuby 1.7.11 (Figure 3).</p>
<div class="figure">
<pre>
Calculating -------------------------------------
SecureRandom 33114 i/100ms
Java 7917 i/100ms
SecureRandom (MRI) 14068 i/100ms
-------------------------------------------------
SecureRandom 1019165.3 (±6.9%) i/s - 5066442 in 5.011000s
Java 108395.6 (±13.8%) i/s - 530439 in 5.052000s
SecureRandom (MRI) 177503.1 (±4.1%) i/s - 900352 in 5.082986s
</pre>
Figure 3: JRuby 1.7.11 SecureRandom serial benchmark results.
</div>
<p>While the measurement for the naive pure Java implementation used in the benchmark has a fairly high margin of error,
we can see the new JRuby implementation is roughly 4x faster than calling the Java method. In contrast, with JRuby 1.7.10
it was 2x as slow. And the new JRuby implementation is now approximately 2.3x faster than MRI, whereas with JRuby 1.7.10
it was 3.5x as slow. The same caching trick could be employed with the Java method call, so JRuby isn’t faster than Java,
but all that heavy lifting has been done for us. And while that may seem contrived, the Java method for fetching a UUID
is a static one, so it’s often called directly with no opportunity for the developer to cache the underlying
SecureRandom instance.</p>
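The caching trick itself is simple to express in plain Ruby. In the sketch below, <code>SlowSeededRandom</code> is an illustrative stand-in for Java’s <code>SecureRandom</code>, whose construction and seeding is the expensive part; the thread-local key and construction counter are mine, not JRuby’s code.

```ruby
# Stand-in for an expensive-to-seed generator like Java's SecureRandom.
# The counter lets us observe how many instances are ever constructed.
class SlowSeededRandom
  @created = 0

  class << self
    attr_accessor :created
  end

  def initialize
    self.class.created += 1 # imagine a blocking /dev/random seed read here
    @rng = Random.new
  end

  def hex(n)
    @rng.bytes(n).unpack1("H*")
  end
end

# One generator per thread, created lazily on first use and then reused.
def thread_local_rng
  Thread.current[:rng] ||= SlowSeededRandom.new
end

1_000.times { thread_local_rng.hex(16) }
SlowSeededRandom.created # => 1 -- a thousand calls, a single construction
```

Keeping the instance per thread rather than global also sidesteps contention on the generator itself, which matters once many request threads want random bytes at once.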
<p>The second half of our performance story presented itself while doing load testing against the running application.
Unfortunately, it was much harder to create a reproducible benchmark for that. But with JRuby 1.7.10 we were seeing a
lot of blocked threads as we increased the number of concurrent requests. The changes in JRuby 1.7.11 avoid hitting an
internal lock in the JVM and reduces the chance of blocking on /dev/random, which allowed us to increase the number of
concurrent connections by approximately 50% in a test environment. This number will vary wildly for other apps, however,
because the likelihood of multiple threads hitting the same code path depends heavily on the rest of the response cycle.</p>
<p>If you haven’t upgraded to JRuby 1.7.11 yet, I’d highly recommend it. This is just one of several big performance
improvements made that have real world benefits. Faster response times mean happier customers and increased request
throughput helps us contain infrastructure costs.</p>Kevin MenardWhile profiling our Rails app recently, SecureRandom surfaced as a hot spot. We use UUIDs to generate request IDs so we can correlate different log statements with a logical user request. To isolate the problemHow to Accept Self-Signed SSL Certificates in Selenium 22013-03-05T00:00:00+00:002013-03-05T00:00:00+00:00http://nirvdrum.com/2013/03/05/how-to-accept-self-signed-ssl-certificates-in-selenium2<p>I wrote about <a href="/2010/04/13/how-to-accept-self-signed-ssl-certificates-in-selenium.html">how to accept self-signed SSL certificates for Selenium 1</a> almost 3 years ago. At the time, Selenium 2 hadn’t seen an official release yet so I was sticking with the more stable Selenium 1 (now the Selenium RC protocol in Selenium 2). A lot has changed in the world of Selenium since then and I thought it was time to provide a new post with modern information, based upon the Selenium WebDriver drivers. If you’re looking for info on Selenium RC, please read the older article as it is still relevant.</p>
<h2 id="built-in-driver-support">Built-in Driver Support</h2>
<p>Unlike Selenium RC, the individual browser drivers in Selenium WebDriver are authored in a native language suitable for that browser. This affords the drivers much greater flexibility in how they can influence the browser. As a result, handling self-signed SSL certificates is trivial in the following browsers:</p>
<ul>
<li>Firefox</li>
<li>Chrome</li>
<li>Android</li>
</ul>
<p>Firefox and Chrome handle the issue of self-signed certificates by default, so no additional configuration is required. Android requires a bit more work, however. When starting the driver, you must pass the <code class="language-plaintext highlighter-rouge">acceptSslCerts</code> desired capability with the value <code class="language-plaintext highlighter-rouge">true</code>. After that, everything should work just fine.</p>
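<p>For instance, with the Ruby bindings that might look like the following sketch. The hub URL is illustrative and the helper name is my own, not part of Selenium:</p>

```ruby
# Build a desired-capabilities Hash for an Android session that accepts
# self-signed SSL certificates.
def android_caps_accepting_ssl
  { 'browserName' => 'android', 'acceptSslCerts' => true }
end

# Usage (requires the selenium-webdriver gem and a running Android driver;
# the URL below is a placeholder):
#
#   require 'selenium/webdriver'
#   driver = Selenium::WebDriver.for(:remote,
#     :url => 'http://localhost:8080/wd/hub',
#     :desired_capabilities => android_caps_accepting_ssl)
```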
<h2 id="support-via-proxy">Support Via Proxy</h2>
<p>Not all browser drivers are able to override the browser’s SSL security warnings, however. In these cases you’ll need to use the same technique as with Selenium RC: run a proxy server that intercepts connections with bad SSL certificates and re-encrypts the traffic using a certificate that the browser trusts. The following browsers require this method:</p>
<ul>
<li>Internet Explorer</li>
<li>Opera</li>
<li>Safari</li>
</ul>
<p>While you could use the Selenium RC Server as your SSL proxy, a much better option is to use <a href="http://opensource.webmetrics.com/browsermob-proxy/">BrowserMob Proxy</a>. The Selenium RC Server proxy is deprecated and is maintained just enough to keep it functional. BrowserMob Proxy is a fork of that codebase that has matured with new features and bug fixes.</p>
<p>Note that in order for the browser to handle the re-encrypted connection to BrowserMob Proxy it must trust the CyberVillains certificate. This trust relationship is normally established by installing the certificate as a trusted root in your operating system.</p>
<div class="alert alert-block">
<h4 style="margin-bottom: 10px;">Warning!</h4>
For security reasons, you should never trust the CyberVillains certificate on a machine that is not dedicated to testing as you may open yourself up to man-in-the-middle attacks or phishing schemes.
</div>
<h3 id="installing-the-cybervillains-certificate-on-windows">Installing the CyberVillains Certificate on Windows</h3>
<p>First, you need to install the CyberVillains certificate in order for SSL connections to BrowserMob Proxy to succeed. The certificate is shipped as part of the BrowserMob Proxy distribution and can be found in <code class="language-plaintext highlighter-rouge">$install_directory/ssl-support/</code>. Figures 1 – 7 illustrate the installation process.</p>
<div class="figure">
<img src="/images/static/how-to-accept-self-signed-certificates-in-selenium-2/cybervillains_certificate.png" />
Figure 1: Double-click the CyberVillains certificate in the BrowserMob Proxy distribution.
</div>
<div class="figure">
<img src="/images/static/how-to-accept-self-signed-certificates-in-selenium/install_certificate.png" />
Figure 2: Install the CyberVillains certificate.
</div>
<div class="figure">
<img src="/images/static/how-to-accept-self-signed-certificates-in-selenium/certificate_import_wizard.png" />
Figure 3: Click through the SSL certificate import wizard.
</div>
<div class="figure">
<img src="/images/static/how-to-accept-self-signed-certificates-in-selenium/choosing_the_certificate_store.png" />
Figure 4: Choose the Trusted Root Certification Authorities certificate store.
</div>
<div class="figure">
<img src="/images/static/how-to-accept-self-signed-certificates-in-selenium/finish_wizard.png" />
Figure 5: Complete the import.
</div>
<div class="figure">
<img src="/images/static/how-to-accept-self-signed-certificates-in-selenium/security_warning.png" />
Figure 6: Accept the security warning.
</div>
<div class="figure">
<img src="/images/static/how-to-accept-self-signed-certificates-in-selenium/successful_import.png" />
Figure 7: Wrap everything up.
</div>
<h3 id="using-browsermob-proxy">Using BrowserMob Proxy</h3>
<p>Once you have the CyberVillains certificate installed, you’re ready to run the BrowserMob Proxy server. Its <a href="https://github.com/webmetrics/browsermob-proxy/blob/master/README.md">README</a> has info on how to use the proxy. But in its simplest form, we’ll just start the proxy up:</p>
<div class="figure">
<pre>
$ /opt/browsermob-proxy/bin/browsermob-proxy --port 8080
</pre>
Figure 8: Starting up BrowserMob Proxy.
</div>
<p>Now you need to pass the proxy configuration when starting your driver:</p>
<div class="figure">
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"> <span class="nb">require</span> <span class="s1">'selenium/webdriver'</span>
<span class="nb">require</span> <span class="s1">'browsermob-proxy'</span>
<span class="n">proxy</span> <span class="o">=</span> <span class="no">BrowserMob</span><span class="o">::</span><span class="no">Proxy</span><span class="o">::</span><span class="no">Client</span><span class="p">.</span><span class="nf">from</span> <span class="s2">"http://localhost:8080/"</span>
<span class="n">desired_caps</span> <span class="o">=</span> <span class="no">Selenium</span><span class="o">::</span><span class="no">WebDriver</span><span class="o">::</span><span class="no">Remote</span><span class="o">::</span><span class="no">Capabilities</span><span class="p">.</span><span class="nf">internet_explorer</span><span class="p">(</span>
<span class="ss">:proxy</span> <span class="o">=></span> <span class="n">proxy</span><span class="p">.</span><span class="nf">selenium_proxy</span><span class="p">(</span><span class="ss">:http</span><span class="p">,</span> <span class="ss">:ssl</span><span class="p">))</span>
<span class="n">driver</span> <span class="o">=</span> <span class="no">Selenium</span><span class="o">::</span><span class="no">WebDriver</span><span class="p">.</span><span class="nf">for</span><span class="p">(</span><span class="ss">:remote</span><span class="p">,</span>
<span class="ss">:url</span> <span class="o">=></span> <span class="s2">"http://localhost:4444/wd/hub"</span><span class="p">,</span> <span class="ss">:desired_capabilities</span> <span class="o">=></span> <span class="n">desired_caps</span><span class="p">)</span>
</code></pre></figure>
Figure 9: Creating an IE driver configured to use BrowserMob Proxy as a proxy (Ruby).
</div>
<p>Where things get tricky is that not all drivers provide a way to specify the proxy. As of Selenium 2.31.0, for instance, there’s no way to configure the proxy for Safari. Should you encounter a driver like this, you’ll need to configure the browser manually. By default, most browsers on Windows (including Safari) will use the system proxy settings. Thus, if you can modify the system proxy, your browsers will automatically pick up those changes even if the driver doesn’t yet support proxy configuration.</p>
<p>Figure 10 shows a sample snippet that can be used to set up the Windows system proxy in Ruby.</p>
<div class="figure">
<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"> <span class="nb">require</span> <span class="s1">'win32/registry'</span>
<span class="n">key</span> <span class="o">=</span> <span class="no">Win32</span><span class="o">::</span><span class="no">Registry</span><span class="o">::</span><span class="no">HKEY_CURRENT_USER</span><span class="p">.</span><span class="nf">open</span><span class="p">(</span>
<span class="s1">'Software\Microsoft\Windows\CurrentVersion\Internet Settings'</span><span class="p">,</span> <span class="no">Win32</span><span class="o">::</span><span class="no">Registry</span><span class="o">::</span><span class="no">KEY_WRITE</span><span class="p">)</span>
<span class="k">begin</span>
<span class="n">key</span><span class="p">.</span><span class="nf">delete</span><span class="p">(</span><span class="s1">'AutoConfigURL'</span><span class="p">)</span>
<span class="k">rescue</span> <span class="o">=></span> <span class="n">e</span>
<span class="c1"># Deleting a registry value that doesn't exist raises an error.</span>
<span class="c1"># Reading a nonexistent value raises as well, so rather than check for</span>
<span class="c1"># its presence first, we simply swallow the error.</span>
<span class="k">end</span>
<span class="c1"># Set these values to wherever you're running your BrowserMob Proxy.</span>
<span class="c1"># browsermob_proxy_host = 'localhost'; browsermob_proxy_port = 8020</span>
<span class="n">key</span><span class="p">[</span><span class="s1">'ProxyServer'</span><span class="p">]</span> <span class="o">=</span> <span class="s2">"</span><span class="si">#{</span><span class="n">browsermob_proxy_host</span><span class="si">}</span><span class="s2">:</span><span class="si">#{</span><span class="n">browsermob_proxy_port</span><span class="si">}</span><span class="s2">"</span>
<span class="n">key</span><span class="p">[</span><span class="s1">'ProxyOverride'</span><span class="p">]</span> <span class="o">=</span> <span class="s1">'&lt;local&gt;'</span>
<span class="n">key</span><span class="p">[</span><span class="s1">'MigrateProxy'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">key</span><span class="p">[</span><span class="s1">'ProxyEnable'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
</code></pre></figure>
Figure 10: Configuring the Windows system proxy (Ruby).
</div>
<p>With your driver configured for the proxy, you’re free to navigate to URLs that use self-signed certificates. No further configuration is necessary.</p>
<h2 id="conclusion">Conclusion</h2>
<p>To recap, accepting self-signed certificates in Selenium 2 is dependent on the driver being used. The proxy installation method works with nearly every driver, but is overkill for those drivers that natively support accepting self-signed SSL certificates. Notably absent from the list above is the iOS driver, which can neither accept self-signed certificates nor be configured to use a proxy. In this case, you may want to look into the 3rd party <a href="https://github.com/ios-driver/ios-driver">ios-driver</a> project, which can handle self-signed SSL certificates.</p>
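<p>Because the proxy approach masks SSL problems from the browser, it is worth monitoring critical certificates out-of-band. A minimal sketch of such a check in Ruby, using the openssl standard library (the helper name and the 14-day threshold are my own, arbitrary choices):</p>

```ruby
require 'openssl'

# Returns true if the certificate is currently valid and won't expire
# within `warn_days` days. `cert` is an OpenSSL::X509::Certificate.
def certificate_healthy?(cert, warn_days = 14)
  now = Time.now
  now >= cert.not_before && (cert.not_after - now) > warn_days * 86_400
end

# In practice `cert` would come from a live connection, e.g.:
#   sock = TCPSocket.new(host, 443)
#   ssl  = OpenSSL::SSL::SSLSocket.new(sock)
#   ssl.connect
#   certificate_healthy?(ssl.peer_cert)
```

A cron job that runs a check like this and alerts on failure is enough to catch an expired certificate that your tests would otherwise silently ignore.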
<p>One final note is that these approaches won’t just handle self-signed SSL certificates, but rather any SSL problem, such as an expired certificate. Since you’ll be ignoring these problems, you should ensure you have an external process test for the validity of any critical certificates.</p>Kevin MenardI wrote about how to accept self-signed SSL certificates for Selenium 1 almost 3 years ago. At the time, Selenium 2 hadn't seen an official release yet so I was sticking with the more stable Selenium 1 (now the Selenium RC protocol in Selenium 2). A lot has changed in the world of Selenium since then and I thought it was time to provide a new post with modern information, based upon the Selenium WebDriver drivers.Centralized Selenium Logging with Graylog2013-01-16T00:00:00+00:002013-01-16T00:00:00+00:00http://nirvdrum.com/2013/01/16/centralized-selenium-logging-with-graylog<p><a href="http://seleniumhq.org/docs/07_selenium_grid.jsp">Selenium Grid</a> is a fantastic way to run a cluster of browsers to speed up your testing. We make very heavy use of it to provide our Web Consistency Testing results. But running many distributed nodes and not knowing which one is running a session at any given time can be problematic from an ops standpoint. Sending all of our logs to a centralized location is one way that we manage all this.</p>
<h2 id="graylog">Graylog</h2>
<p>We use <a href="http://www.graylog2.org/">Graylog 2</a> as our integration point (don’t let the web page dissuade you — it’s very solid software). Graylog is built atop <a href="http://www.elasticsearch.org/">elasticsearch</a>, so it has great discovery capabilities and can segment your messages by various facets. You can then save these configurations as “streams” that you can monitor or optionally have them send you alerts when certain conditions are met.</p>
<p>If you click through a stream, you can see all the messages in that stream. Here I’m showing a snippet from our “Grid Nodes” stream. Note how each message notes the host the message was sent from as well as the severity of that message. We can use that information to drill down into an issue we need to debug or as the basis of an alert for a production problem.</p>
<div class="figure">
<img src="/images/static/centralized-selenium-logging/message-list.png" />
Figure 1: Log messages in the "Grid Nodes" stream.
</div>
<h2 id="configuring-seleniums-logger">Configuring Selenium’s Logger</h2>
<p>As it turns out, getting all this is fairly cheap. If you haven’t already, you may want to review the article on <a href="http://wiki.openqa.org/display/SRC/Selenium+RC+Logging+for+Developers">Selenium RC Logging for Developers</a>. The article predates the Selenium 2 merge with WebDriver, but the logging system is the same.</p>
<p>First, you need to create your <code class="language-plaintext highlighter-rouge">logging.properties</code> file. Graylog’s logging format is called GELF, and the GelfHandler serves as the root handler in the logger configuration:</p>
<div class="figure">
<pre>
handlers = org.graylog2.logging.GelfHandler
.level = ALL
org.graylog2.logging.GelfHandler.graylogHost = my_graylog_server_hostname
org.graylog2.logging.GelfHandler.graylogPort = 12201
org.graylog2.logging.GelfHandler.originHost = ie901
org.graylog2.logging.GelfHandler.extractStacktrace = true
org.graylog2.logging.GelfHandler.facility = selenium
</pre>
Figure 2: Sample logging.properties config file.
</div>
<p>Most of these values should be straightforward. The <code class="language-plaintext highlighter-rouge">originHost</code> value allows you to set the hostname to be used for log messages sent from this host. If not set, Java will try to figure it out based on your local network configuration. We’ve found this often leads to long, nondescript names, so we opt to override it on each host. That does imply a host-specific configuration file, but we autogenerate it so it’s not a problem in practice.</p>
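<p>The autogeneration step is nothing fancy. A sketch of it in Ruby might look like this (the function name, default Graylog hostname, and output path are illustrative, not our actual tooling):</p>

```ruby
# Render a host-specific logging.properties so that originHost is always
# explicit. The default graylog_host below is a placeholder.
def gelf_logging_properties(origin_host,
                            graylog_host = 'my_graylog_server_hostname',
                            graylog_port = 12201)
  <<~PROPS
    handlers = org.graylog2.logging.GelfHandler
    .level = ALL
    org.graylog2.logging.GelfHandler.graylogHost = #{graylog_host}
    org.graylog2.logging.GelfHandler.graylogPort = #{graylog_port}
    org.graylog2.logging.GelfHandler.originHost = #{origin_host}
    org.graylog2.logging.GelfHandler.extractStacktrace = true
    org.graylog2.logging.GelfHandler.facility = selenium
  PROPS
end

# One file per node, e.g. for the host we call "ie901":
File.write('logging.properties', gelf_logging_properties('ie901'))
```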
<p>Next up, start the server with the configuration file. Note you’ll need to <a href="https://github.com/t0xa/gelfj/downloads">download the gelfj.jar file</a> and its <a href="http://code.google.com/p/json-simple/downloads/list">json-simple.jar dependency</a> and add them to the classpath when starting Selenium.</p>
<div class="figure">
<pre>
java -cp json-simple.jar:gelfj.jar:selenium-server-standalone.jar \
-Djava.util.logging.config.file=logging.properties org.openqa.grid.selenium.GridLauncher \
-role node
</pre>
Figure 3: Sample command to start up Selenium with logger configuration.
</div>
<p>And with that, you should now see log messages appear in your Graylog installation. N.B.: Since we replace the logging handlers entirely, messages will no longer appear in your console. Graylog will be the canonical source of all logs from this point forward. You may want to consider logging to syslog and having that forward to Graylog instead for redundancy.</p>
<h2 id="bonus-message-filtering">Bonus: Message Filtering</h2>
<p>If you’ve ever looked at your Selenium logs you know they can be quite verbose. E.g., Selenium Grid nodes will log a message for every heartbeat message they send to the grid hub. While arguably Selenium should make better use of log levels, one way to work around this is to set up <a href="https://github.com/Graylog2/graylog2-server/wiki/Message-processing-rewriting">filtering rules in Graylog</a>. We do this to drop all heartbeat messages from appearing in Graylog at all, allowing us to focus on log messages that really matter.</p>
<div class="figure">
<pre class="prettyprint">
import org.graylog2.messagehandlers.gelf.GELFMessage
rule "Drop Selenium Grid heartbeat messages"
when
m : GELFMessage( facility == "selenium" && fullMessage matches ".*\\/status\\)?$" )
then
m.setFilterOut(true);
System.out.println("[Drop all Selenium Grid heartbeat messages] : " + m.toString() );
end
</pre>
Figure 4: Sample Graylog rule for dropping verbose log messages.
</div>Kevin MenardSelenium Grid is a fantastic way to run a cluster of browsers to speed up your testing. We make very heavy use of it to provide our Web Consistency Testing results. But running many distributed nodes and not knowing which one is running a session at any given time can be problematic from an ops standpoint. Sending all of our logs to a centralized location is one way that we manage all this.Speed Up Web Testing with a Caching Proxy2013-01-03T00:00:00+00:002013-01-03T00:00:00+00:00http://nirvdrum.com/2013/01/03/speed-up-web-testing-with-a-caching-proxy<p>If you’ve ever done Web integration testing you know it can be really slow. Starting up browsers is slow. Loading Web pages is slow. Interacting with those pages is slow. Since Mogotest is a service built mostly around these concepts, we’re constantly looking for ways to speed things up. This post is about how to speed up page load times.</p>
<p>More often than not, your test environment is going to be anemic in comparison to your production environment. If you’re running integration tests locally, you’re probably hitting an untuned, simple-to-use server, like WEBrick on Ruby or an embedded Jetty server on Java. Request processing is likely sluggish and the server isn’t configured for concurrent requests, meaning if you run tests in parallel to speed things up, you’re going to hit a wall.</p>
<p>To that end, we’ve long routed all requests through a <a href="http://www.squid-cache.org/">Squid</a> proxy. Using a proxy allows us to:</p>
<ul>
<li>Gate the number of connections to a remote server so we don’t DoS it</li>
<li>Route requests from a known set of IPs (great for filtering or whitelisting)</li>
</ul>
<p>Layering in a cache on top of that proxy would allow us to:</p>
<ul>
<li>Deliver test run results to our customers faster</li>
<li>More efficiently use our browser cluster</li>
</ul>
<p>Unfortunately, we never could get our Squid cache configuration right. Granted, we do have some esoteric requirements that most won’t have. In particular, we need to work around third party sites with broken cache configurations. The easiest way around this is to have a low TTL on the cache; that way we’d have the content cached for that short burst period of testing and then have it evicted before the next test starts.</p>
<p>We recently swapped out our Squid server for <a href="http://trafficserver.apache.org/">Apache Traffic Server</a> (ATS). While Squid has served us well, ATS grants us a very fine-grained control over the cache. Additionally, it has a much simpler config and ships with a utility that makes cache usage monitoring easy.</p>
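<p>To give a flavor of that control, a rule of the sort we mean might look like this in ATS’s cache.config (an illustrative sketch with arbitrary values; consult the Traffic Server documentation for the exact syntax your version supports):</p>

```
# Cap how long any object may live in the cache so that broken third-party
# Cache-Control headers only persist for one short burst of testing.
url_regex=.* ttl-in-cache=5m
```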
<p>Since soft-launching about two weeks ago, we’ve had <strong>169,437</strong> cache hits and <strong>9,864</strong> cache misses. We hit the cache on <strong>72.5%</strong> of all our requests and save <strong>69%</strong> of our incoming bandwidth as a result. The numbers are a bit skewed towards smaller documents, which have a higher hit rate. This is because we start up all browsers at roughly the same time and they can only read from the cache if a resource is fully written out to the cache. As such, larger objects may be fetched several times on the first page load, but things smooth out on subsequent pages.</p>
<p>It’s hard to measure how much this speeds tests up given the variability in our customers’ test targets. For fast servers, the savings are minimal as the tests are dominated by browser rendering time. But on particularly slow servers, like you’d see in a typical testing environment, we’ve seen tests complete 2 – 4x faster.</p>
<p>If you’re integration testing your site using a tool like <a href="http://seleniumhq.org/">Selenium</a>, you may want to try placing ATS in front of your test environment server. The set up is very simple, the overhead is minimal, and you’ll almost certainly see faster test results due to faster page load times.</p>Kevin MenardIf you've ever done Web integration testing you know it can be really slow. Starting up browsers is slow. Loading Web pages is slow. Interacting with those pages is slow. Since Mogotest is a service built mostly around these concepts, we're constantly looking for ways to speed things up. This post is about how to speed up page load times.