Embedding Truffle Languages

Introduction

The past several years of my career have been spent predominantly working on TruffleRuby, an implementation of the Ruby programming language that can achieve impressive execution speed thanks to the Truffle language implementation framework and the Graal JIT compiler. Taken together, these three technologies form part of the GraalVM distribution. The full distribution includes implementations of other languages (JavaScript, Python, R, and Java), an interpreter for LLVM bitcode (Sulong), tooling such as a profiler and debugger for both host and guest code (VisualVM), tooling to visualize decisions made by the JIT compiler (IGV), and the ability to generate native binaries of Java applications (including any of the listed language interpreters) via Native Image. There’s more to GraalVM as well, which makes defining it and discovering all of its capabilities difficult. In this article, I’d like to focus on two pieces of GraalVM functionality: 1) loading a Truffle interpreter into a Java application to call guest language code (e.g., Ruby) directly from Java; and 2) using Native Image to turn that Java code into a native shared library, allowing Truffle languages to be loaded and called just like any other exposed C function.

Native Image Overview

GraalVM’s Native Image tool can build native executables and native shared libraries from Java code. By default, these binaries will have a dependency on your system’s libc and zlib implementations, but you can instruct Native Image to statically link in libc and zlib libraries if you have them, leaving you with a binary that has no external dependencies. In effect, you can use Java just as you would any other ahead-of-time (AOT) compiled language. In contrast to C/C++, Rust, or other similar systems languages, you still have access to the Java VM facilities such as Java’s IO abstraction and garbage collection (GC). However, the VM facilities are not provided by HotSpot, but rather a new VM written specifically for Native Image binaries called SubstrateVM.

As with most technology decisions, there’s a trade-off: Native Image binaries start considerably faster than running an application in the JVM, but they forgo the ability to JIT the application¹. Additionally, the SubstrateVM garbage collector that ships with GraalVM Community Edition is not quite as polished as the HotSpot one (GraalVM Enterprise Edition supports the G1 garbage collector). Not having a JIT doesn’t mean that there is no optimization at all, however. The Native Image compilation process will run AOT optimization passes as it builds the image. The enterprise version of GraalVM also supports profile-guided optimization (PGO) to help Native Image make compilation decisions that are favorable to the profiled application. On the plus side, Native Image binaries make distribution easier since you don’t need to have a JVM available in your target environment.

While Native Image binaries may not be the best option for long-running server applications, they open up the ability to run Java applications in environments that the language was previously ill-suited for, such as Functions as a Service (FaaS), which need to start up quickly and are ephemeral. TruffleRuby ships as a native image so it can load quickly for scripting applications, start its REPL quickly, and execute test suites considerably faster than the JVM-based version could.

In order to build a binary of a native application while still supporting broad use of the JDK, the Native Image tool performs an extensive closed world analysis to figure out exactly what classes and methods your application uses and only compiles those into the application. Just to reiterate, the binary generated by Native Image does not include a JVM, so it can’t support functionality like dynamically loading classes from a JAR. Your application can make use of some of the dynamic features the JVM provides, but such usages must be constrained to something that can be decided and included in the binary. The Native Image compiler is able to detect and resolve some usages of reflection, such as Class.forName and Class.getDeclaredField when the arguments can be reduced to a constant during static analysis (e.g., a field name supplied as a static string). If your reflection usage is more dynamic or otherwise can’t be statically determined, you must provide a configuration file declaring what classes, fields, and methods must be available along with any necessary class/JAR files on the classpath so Native Image can build support for them in the binary. With these two mechanisms, Native Image can handle many use cases that call for reflection or JNI access. However, if your application allows a user to supply their own class files at runtime (e.g., a plugin-based application), please be aware that this cannot and will not work in a Native Image binary².
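A small sketch may make the distinction concrete. The first method below passes string literals to the reflection calls, which is the shape of code the static analysis can usually reduce to constants; the second takes the names as arguments, which is the shape that would normally need an entry in a reflection configuration file. The class and method names are hypothetical, and whether a given call is actually resolved at build time depends on the Native Image version and on what the analysis can prove.

```java
import java.lang.reflect.Field;

// Hypothetical illustration; not taken from the Native Image Playground.
final class ReflectionShapes {

    // Both names are string literals, so a static analysis can see exactly
    // which class and field are required.
    static int readWithConstantNames() {
        try {
            Field field = Class.forName("java.lang.Integer").getDeclaredField("MAX_VALUE");
            return field.getInt(null); // static field, so no receiver object is needed
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // The names are only known at run time, so nothing in this method tells
    // a build-time analysis which classes or fields to keep.
    static int readWithDynamicNames(String className, String fieldName) {
        try {
            Field field = Class.forName(className).getDeclaredField(fieldName);
            return field.getInt(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

On a regular JVM both methods behave identically. In a Native Image binary, only the first would be expected to work without additional configuration.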

At their core, Truffle language interpreters are just Java applications. Certainly, the language implementations also use non-Java code (e.g., a substantial portion of TruffleRuby is written in Ruby and some parts in C), but they all can be loaded and invoked in Java using the GraalVM Polyglot API. By using libjvm and the Java Native Interface (JNI) Invocation API, we can load a copy of the JVM up into a non-Java application and execute code in a Truffle language via the GraalVM Polyglot API. But, loading an entire copy of the JVM up is rather slow and memory intensive.

As a Java application, a Truffle interpreter can be compiled via Native Image³. Moreover, Native Image can link the entirety of a Truffle interpreter into the resulting binary (executable or shared library). Following this approach, we can generate a library to run our Ruby code that starts quickly, uses less RAM than libjvm, requires less disk space than a JVM distribution, and has an integrated JIT to optimize our code running in the interpreter.

Native Image Playground

The GraalVM distribution ships with a dizzying amount of functionality. Most of it is very well documented, but some of it is either lacking or simply assumes the reader has more information than this post will assume. To help illustrate some of the techniques described here, I’ve pulled together a project called the Native Image Playground which has several examples of using Native Image to build standalone executables and shared libraries, for loading a Truffle interpreter into another process, and for executing multiple Truffle languages (e.g., Ruby and JavaScript) within the same process. I will refer to examples from Native Image Playground in this article. If you wish to run the code on your own, please follow the steps outlined in the project’s README to ensure you have all the necessary prerequisites.

Many of the examples in the Native Image Playground compute the Haversine distance: a way to measure geographic distance on the surface of a sphere. This algorithm was chosen because it was the same one used by Chris Seaton in his Top 10 Things To Do With Graal post, which is a spiritual predecessor to this piece. The algorithm is implemented with simple arithmetic and trigonometry, so we can easily evaluate the generated machine code for the operation. As a caveat though, the algorithm implementation was taken from the Apache SIS project and was found to be incorrect. Since the purpose of this post isn’t to be a reference for geospatial operations, I’ve pushed ahead with the incorrect algorithm in order to retain parity with Chris’s earlier post and because the correct implementation is more involved, complicating our performance analysis.
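For readers unfamiliar with the calculation, the textbook haversine formula can be sketched in a few lines of Java. This is the standard great-circle formula, not the Apache SIS variant used in the Native Image Playground; the class name, method name, and Earth radius constant are assumptions chosen for illustration.

```java
// Hypothetical helper showing the textbook formula; not the Apache SIS code
// that the playground benchmarks.
final class HaversineSketch {

    // Mean Earth radius in kilometres (an assumed constant for this sketch).
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance in kilometres between two points given in degrees.
    static double distanceKm(double aLat, double aLong, double bLat, double bLong) {
        double dLat = Math.toRadians(bLat - aLat);
        double dLong = Math.toRadians(bLong - aLong);
        double sinHalfLat = Math.sin(dLat / 2);
        double sinHalfLong = Math.sin(dLong / 2);
        // Haversine of the central angle between the two points.
        double h = sinHalfLat * sinHalfLat
                + Math.cos(Math.toRadians(aLat)) * Math.cos(Math.toRadians(bLat))
                * sinHalfLong * sinHalfLong;
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
    }
}
```

Calling HaversineSketch.distanceKm(51.507222, -0.1275, 40.7127, -74.0059), the London and New York coordinates used later in this article, gives a distance of roughly 5,570 km.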

Calling Methods From a Native Image Shared Library

As of GraalVM 22.1.0, there are two primary mechanisms for calling a method embedded in a Native Image shared library: the Native Image C API and the JNI Invocation API. The Native Image C API is somewhat begrudgingly supported and likely to be removed in the not too distant future. It’s an extra API specific to Native Image that the GraalVM team would like to remove in favor of the more standard JNI Invocation API. In a Native Image binary, JNI is retargeted to work with GraalVM Isolates, the mechanism by which GraalVM supports multiple, isolated execution environments within the same process. However, JNI performance within a Native Image is limited pending the merge of Project Panama to the JDK. As a result, we have two methods for calling natively compiled Java methods from a library where neither can be fully endorsed at the moment.

Native Image C API

Don’t be put off by the name “Native Image C API”. While GraalVM makes it easy to use C to call into Native Image shared libraries, the name is more of an indication as to how the functions will be exported from the library. You can use this API in any language with the ability to call foreign functions (e.g., using the FFI or Fiddle libraries in Ruby).

By default, nothing is exported from your shared library other than a function named main should you have a public static void main method somewhere in your Java code. Otherwise, to export a Java method you must do the following:

Declare the method as static
Make the first argument an org.graalvm.nativeimage.IsolateThread
Restrict your parameter and return types to primitive types or a type from the org.graalvm.nativeimage.c.type package
Annotate the method with the org.graalvm.nativeimage.c.function.CEntryPoint annotation

If you look at the various org.graalvm.nativeimage sub-packages, you’ll find some code for handling additional cases that we are not going to cover here, such as mapping Java interfaces to C structs. For the Haversine distance calculations, all parameters will be doubles and the return value will be a double as well, so we won’t need any of the additional functionality that Native Image makes available.

Taking the NativeLibrary example from the Native Image Playground project, we have the following:

@CEntryPoint(name = "distance")
public static double distance(IsolateThread thread,
        double a_lat, double a_long,
        double b_lat, double b_long) {
    return DistanceUtils.getHaversineDistance(a_lat, a_long, b_lat, b_long);
}

Example 1: Haversine distance in Java exposed as C function in Native Image shared library.

The name attribute in the @CEntryPoint annotation may be omitted, but the default name is constructed from the class and method names along with a randomly generated number to ensure uniqueness. Naturally, since the methods are being exposed in a shared library, they must have unique names. If you give two exposed methods the same name, the Native Image compiler will fail with a message such as:

duplicate symbol '_distance' in:
    libnative-library-runner.o
ld: 1 duplicate symbol for architecture x86_64

Example 2: Error message building Native Image shared library with duplicate exposed function names.

When you build the binary, Native Image will also generate some C header files for you. If working with C or C++, you can use these header files directly. For other languages, you can use the function declarations in the headers to set up your foreign call bindings. The code found in Example 1 will result in the following function declaration:

double distance(graal_isolatethread_t*, double, double, double, double);

Example 3: Function declaration for the Haversine distance method exposed in the Native Image shared library.

As you can see, the Java double type is mapped to the C double type. The Java IsolateThread type is mapped to a graal_isolatethread_t* in C.

Working with Isolates

Every function you would like to expose in a Native Image shared library using @CEntryPoint must have an IsolateThread as its first parameter and every call to that method through the shared library must supply a Graal Isolate pointer as its first argument. Looking at the code in Example 1, the distance method doesn’t do anything with the Isolate parameter. The actual usage of the Isolate handle is managed by Native Image in the generated binary.

Along with the header file generated with all of the function declarations for exposed methods in the shared library, Native Image also generates a graal_isolate.h file with type definitions and function declarations for working with the Native Image C API.

The naming here might be a bit confusing. There are Graal Isolates and Graal Isolate Threads. When calling a function exposed in a Native Image shared library, you must actually supply a pointer to an Isolate Thread and all Isolate Threads must be attached to an Isolate. Creating an Isolate will implicitly create a primary Isolate Thread and that is what the sample projects in Native Image Playground use (i.e., none of the sample projects dig into multi-threading). All Graal Isolates and Isolate Threads must be torn down when you’re done with them; tearing down the Isolate will also tear down the primary Isolate Thread.

Another way of working with Isolates is to expose your own functions in the shared library by using @CEntryPoint built-ins. The Native Image Playground samples do not make extensive use of this form of resource management, but some do for completeness. To expose these methods, you would use something like the following:

@CEntryPoint(builtin = CEntryPoint.Builtin.CREATE_ISOLATE, name = "create_isolate")
static native IsolateThread createIsolate();

@CEntryPoint(builtin = CEntryPoint.Builtin.TEAR_DOWN_ISOLATE, name = "tear_down_isolate")
static native int tearDownIsolate(IsolateThread thread);

Example 4: Using @CEntryPoint built-ins to expose Graal Isolate resource management methods in the Native Image shared library with custom names.

Java Native Interface (JNI) Invocation API

The preferred mechanism for invoking code in a Native Image shared library is to use the Java Native Interface (JNI) Invocation API — a standard JDK API for starting and programmatically controlling a JVM from another process. Usage of JNI Invocation API might seem a bit odd, given a defining feature of Native Image binaries is that they do not include the JVM. Native Image binaries do include a VM though to handle things like GC and thread scheduling. This alternative VM, called the Substrate VM, reimplements the JNI Invocation API to create Graal Isolates and Isolate Threads and adjusts the rest of the API so that JNI calls bind to the appropriate Isolate Thread (see the earlier discussion on Graal Isolates if you’re unsure what that means).

By using the JNI Invocation API, you don’t need to learn a new Native Image-specific way to write code that drives a Java process. However, much of JNI is essentially runtime reflection and Native Image does not allow arbitrary reflection. In order to use JNI with a Native Image binary, you need to supply a JNI configuration file to the native-image command when you build your image. Manually creating that file is tedious and error-prone. To simplify the process, I recommend using a tracing agent provided by GraalVM, which will record all JNI calls made at runtime and dump them out to a file. To do so, you’ll need to temporarily swap your application over to using libjvm, which will allow general JNI calls. I found it easiest to set the JAVA_TOOL_OPTIONS environment variable so that I wouldn’t have to customize the java command in Maven. Using the jni-libjvm-polyglot example from the Native Image Playground, we have:

$ mvn -P jni-libjvm-polyglot -D skipTests=true clean package
$ export JAVA_TOOL_OPTIONS="-agentlib:native-image-agent=config-output-dir=$PWD/target-jni-libjvm/config-output-dir-{pid}-{datetime}/"
$ ./target-jni-libjvm/jni-runner js 51.507222 -0.1275 40.7127 -74.0059

Example 5: Enable the Native Image tracing agent to record JNI calls.

In this example, we really didn’t need to embed the PID or timestamp into the generated directory, but it’s generally useful if you have multiple Java processes running since they’ll all share the environment variable and thus would all dump their output to the same directory. If we take a look at that directory, we’ll see the agent generated several files for us:

$ ls target-jni-libjvm/config-output-dir-40562-20220329T191144Z/

jni-config.json                 proxy-config.json               resource-config.json
predefined-classes-config.json  reflect-config.json             serialization-config.json

Example 6: Configuration files generated by the Native Image tracing agent.

The jni-config.json file is the one of interest. We can pass that file to the native-image command using the -H:JNIConfigurationFiles option. The jni-native profile from the Native Image Playground does precisely that. Both the jni-libjvm-polyglot and jni-native Maven profiles from the Native Image Playground use the exact same C++ launcher application to calculate the Haversine distance using a Truffle language through its Java polyglot API. That’s the primary draw of using the JNI Invocation API with Native Image: you don’t need to learn a new non-standard API and your code will work without modification as you switch between libjvm and the Native Image shared library.

Benchmarks

When starting this project, I was only aware of the Native Image C API, so that’s what I started with. Between documentation, GitHub issues, and discussions with others on the GraalVM Slack, I learned about the JNI support in Native Image. But, I was also told that JNI calls would have higher overhead than using the Native Image C API until Project Panama is finished. This presented a conflict because ultimately I’m investigating ways to embed languages like TruffleRuby into other applications. The choice between fast & deprecated (Native Image C API) and slower but API-stable (JNI Invocation API) is not the sort of trade-off I really wanted to make. I haven’t been actively tracking Project Panama, but it’s not in Java 17 and GraalVM only uses Java LTS releases. The next planned LTS release will be Java 21 and that’s targeted for Sept. 2023 — too far out to wait for this application.

While I’ve spoken with people who experienced significant slowdowns in trying to migrate from the Native Image C API to the JNI Invocation API, I couldn’t find any numbers supporting their claims. Thus, the final aspect of the Native Image Playground is to benchmark different options for executing code in Truffle languages embedded in a process. Whether using the Native Image C API or the JNI Invocation API, there are several different ways to call into a Truffle language, so the benchmarks include multiple approaches with each of the Native Image shared library APIs.

I want to reiterate that the focus of these benchmarks is on Truffle language performance. While Truffle interpreters are written in Java and compile the same as any other Java method would, Native Image does some extra work to record Graal graphs for Truffle interpreters so those interpreters can be JIT compiled. In contrast, a trade-off when using Native Image is that there is no JIT for arbitrary Java methods. The GraalVM team is working on a Java interpreter called Espresso that will allow Java methods to JIT in Native Image binaries by running the bytecode through the Espresso interpreter, but I did not consider it for any part of the Native Image Playground. The reason I’m calling this out specifically is because I’m not measuring the call overhead of Java methods being run in a Native Image binary. Certainly, I need to make some Java calls to use the GraalVM Polyglot API, but what I’m really concerned with is the performance of executing guest code in a Truffle interpreter.

Methodology

For benchmarking, I’m using Google’s benchmark library in a launcher written in C++. I.e., the benchmark harness is not a Native Image binary. The benchmarks were run on a Ryzen 3700X system with 64 GB ECC RAM running Ubuntu 22.04 (kernel 5.15.0-27-generic) and with CPU frequency scaling disabled. Each benchmark was run for ~30s to allow adequate warm-up of any code that could be JIT compiled. Since Truffle optimizes and deoptimizes based on values profiled at run time, each benchmark was run in its own Graal Isolate to avoid any cross-benchmark JIT issues. While it’s nearly impossible to eliminate system effects (e.g., cache line pollution), each benchmark was run three times in order to help minimize such effects. Additionally, to help avoid differences related to benchmark execution order, benchmark results were collected using the --benchmark_enable_random_interleaving=true option from the Google Benchmark library.

For each benchmark, I present two values: 1) the mean of three execution times; and 2) the standard deviation for the three executions. Deciding which value to present is an on-going debate in the world of computer science. One theory holds that the minimum value represents the ideal case where system effects have not adversely impacted performance and so that should be used. Another is that you can never run with ideal state, so average values like the mean and median represent more realistic cases. In this situation, I picked the mean mostly because I also want to present error data and the standard deviation is a simple value to use. Even that should be taken with a grain of salt, however, because there’s no guarantee the performance follows a normal distribution. If you’d like to see all of the raw measurements, along with the median, mean, standard deviation, and coefficient of variation, you can download the results.

Much like the examples used to demonstrate the various ways to embed Truffle languages, the benchmarks all compute the Haversine distance. There is a Haversine implementation in C++ intended to be something of a control value. Likewise, there’s a Haversine implementation in Java to establish the baseline for methods compiled by Native Image. From there, all of the other benchmarks call into a Truffle language to calculate the Haversine distance.

Software versions:

C++ Compiler: clang++ (Ubuntu clang version 14.0.0-1ubuntu1)
GraalVM: 22.1.0 (based on Java 17.0.3) - Community Edition
Native Image Playground: ceff9b6e21c6a3d55d426c7c0c2a2cf3c8f7fcbb
Google Benchmark: 705202d22a10154ebc8bf0975e94e1a93f08ce98

Classifications

The benchmark results are presented in four phases:

Baseline performance
Native Image C API
Java Native Interface (JNI) Invocation API
Native Image C API vs JNI Invocation API

Since I’m using Native Image to build a native binary using Java, I wanted to establish a reasonable upper-bound on performance of the generated code. In Phase 1, I present the results of a C++-based implementation of the Haversine distance. This is a straightforward implementation compiled with -O3 optimizations, but does not make use of profile-guided optimization (PGO), compiler intrinsics, inline ASM, or any other manually-driven optimization.

Having established what Native Image performance would be in an ideal case, Phases 2 & 3 explore performance of the two primary APIs for invoking exposed functions in a Native Image shared library: the Native Image C API and the JNI Invocation API. For each API, I benchmark a Java-based implementation of the Haversine distance, establishing a new reasonable upper-bound on performance for that API. From there, the various ways to execute code within a Truffle interpreter are investigated. Phases 2 & 3 explore these different approaches and the best option (* not necessarily the fastest) is used for the head-to-head comparison in Phase 4.

Baseline

The benchmark runner includes an implementation of the Haversine distance written in C++ using trigonometric functions from cmath/math.h from the C/C++ standard library as shipped with LLVM. The Haversine distance implementation is a direct port of the Apache SIS implementation used in Java. While I don’t doubt the algorithm could be tweaked more manually, the implementation is quite straightforward and compact. A large component of these benchmarks is to see how well a compiler is able to optimize code. Since the Native Image builder will perform optimizations when building its binary, the benchmark runner, including the C++ Haversine implementation, is compiled with the -O3 optimization flag.

---------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations
---------------------------------------------------------------------------------------------
C++                                                      51.1 ns         51.0 ns    822699656
C++                                                      51.1 ns         51.1 ns    822699656
C++                                                      51.0 ns         51.0 ns    822699656
C++_mean                                                 51.0 ns         51.0 ns            3
C++_median                                               51.1 ns         51.0 ns            3
C++_stddev                                              0.053 ns        0.054 ns            3
C++_cv                                                   0.10 %          0.11 %             3

Figure 1: Benchmark baseline number.

The tabular output in Fig. 1 is generated by the Google Benchmark framework. The execution time is very stable between each of the benchmark runs, with a mean execution time of 51.0 ns.
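The summary rows can be approximated by hand. The sketch below computes the mean, sample standard deviation, and coefficient of variation for a set of timings. The class name is hypothetical, and dividing by n - 1 is an assumption about how the benchmark library computes its standard deviation.

```java
// Hypothetical helper for summarizing repeated benchmark runs.
final class RunStatistics {

    // Arithmetic mean of the measurements.
    static double mean(double[] values) {
        double sum = 0.0;
        for (double value : values) {
            sum += value;
        }
        return sum / values.length;
    }

    // Sample standard deviation (divides by n - 1); needs at least two values.
    static double sampleStdDev(double[] values) {
        double mean = mean(values);
        double squaredDeviations = 0.0;
        for (double value : values) {
            squaredDeviations += (value - mean) * (value - mean);
        }
        return Math.sqrt(squaredDeviations / (values.length - 1));
    }

    // Standard deviation relative to the mean, as a fraction (not a percentage).
    static double coefficientOfVariation(double[] values) {
        return sampleStdDev(values) / mean(values);
    }
}
```

Feeding in the three rounded timings from Fig. 1 (51.1, 51.1, and 51.0 ns) gives a standard deviation of roughly 0.058 ns. That is slightly different from the 0.053 ns in the table, which is presumably computed from the unrounded measurements.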

Native Image C API

There are three types of benchmarks run with the Native Image C API:

A pure Java implementation of the Haversine distance
A hard-coded Ruby implementation of Haversine distance
A general executor of Truffle language code, which is supplied with implementations of the Haversine distance in Ruby and JavaScript.

I chose these three to help establish where overhead may lie. I expect the pure Java version to optimize the best during AOT compilation; however, it will not JIT compile.

The hard-coded Ruby implementation uses the GraalVM Polyglot API to execute a predetermined code fragment with TruffleRuby. The code fragment can be parsed ahead-of-time and the resulting function object stored directly into the Native Image shared library, avoiding the need to parse the code at runtime. Since the requirements of that code fragment are known ahead of time, the implementation can make the exact Truffle polyglot calls needed to execute the Ruby Haversine code. While somewhat limited, this example is illustrative of how you might embed a language like Ruby within a process to run specific workloads.

The final benchmark also makes use of the GraalVM Polyglot API, but rather than hard-code the guest code fragments in the image, they are supplied as a string argument by the benchmark. As a matter of practicality, however, the exposed library function only works with code fragments that take four double arguments and return a double value. The API call that deals with evaluating the code fragment is unaware of the restriction, however, so the interpreter still must discover the shape of data by runtime profiling.

Ideally, everything about the call would be flexible, but there’s a lot of ceremony involved in harmonizing the C and Java type systems using the Native Image C API (JNI does not have this problem). Principally, all return values are typed as org.graalvm.polyglot.Value in the GraalVM Polyglot API, but the Native Image C API cannot work directly with these objects. As a result, the return value needs to be coerced into a native type (in this case, double). That’s fairly straightforward to do when the caller knows the return value should be of a specific type, but it becomes much more complicated when the caller needs to allow any return type. Likewise, a truly general API would need to pack the four double coordinates into a java.lang.Object[] constructed in C/C++. While it’s all doable, the effort required to make this approach truly general is so involved that I can’t believe anyone would do it in practice⁴.

Results

The results of the two simple @CEntryPoint methods — Benchmark 1 & 2 from the previous section — are available in Fig. 2.

Figure 2: Benchmark results for methods exposed via @CEntryPoint (Native Image C API).

The Java Haversine implementation executes in 115 ns, compared to the 51 ns of the C++ implementation, taking 2.25x as long to execute. Interpreting that result requires contextualizing your application. On the one hand, if performance is your ultimate goal, the C++ implementation is twice as fast as the Java one compiled with Native Image. On the other hand, the Native Image implementation includes a fully functional virtual machine with memory safety, garbage collection, platform API abstraction, and so on. If the overall performance is within your application’s performance target, Native Image can be a compelling option for generating native binaries while benefiting from the Java ecosystem. Either way, this example is not telling the whole story and results should not be extrapolated. There isn’t much going on in terms of memory allocation, I/O, multi-threading, or even functionality like virtual calls and templating/generics. I’d encourage you to run your own benchmarks flexing the functionality your application would require.

From here on out I’m going to use the Java implementation of the Haversine algorithm as the baseline. I think this is a more realistic performance target for the Truffle languages. Additionally, the differences between Java and Truffle languages are smaller than the differences between C++ and Java and that detail would be difficult to see if the ensuing analysis focused on Truffle languages vs C++.

Before taking a look at the performance of the various Haversine implementations invoked via the Native Image C API, we need to sort out which approach to take for making arbitrary GraalVM Polyglot API calls (see previous section for description). Fig. 3 shows the results of executing a guest language code fragment under a variety of Truffle resource management strategies. Results are provided for both TruffleRuby and Graal.js in order to provide results that don’t overfit to a particular guest language. The different strategies measured are:

Polyglot context reused, but code fragments always parsed at runtime (No Parse Cache)
Polyglot context reused and code fragments parsed once, then cached (Thread-Safe Parse Cache)
Polyglot context reused and code fragments parsed once, then cached (Thread-Unsafe Parse Cache)
Polyglot context recreated for each call (Not graphed)

Figure 3: Benchmark results for Truffle polyglot methods exposed via @CEntryPoint (Native Image C API).

Benchmarks 3a - 3c execute an exposed Native Image library function that takes both a Truffle language identifier and a code fragment to execute. These values are supplied at runtime, so there’s no ability to parse them ahead-of-time and snapshot them into the image as Benchmark 2 could.

Each of these benchmarks reuses a GraalVM Polyglot context throughout all of its iterations. Benchmark 3a parses the Haversine distance code fragment each time the benchmark runs. Benchmark 3b uses a thread-safe parse cache, parsing the Haversine distance code fragment once and reusing the resulting Truffle function object for subsequent calls. Benchmark 3c does essentially the same thing as 3b, but does away with the overhead of a ConcurrentHashMap. Benchmark 3c is a horrible way to use the GraalVM Polyglot API and exists only to give us a sense of the overhead of a protected parse cache.

Of the three approaches displayed in Benchmark 3, I think the thread-safe parse cache (3b) is the one to go with. It outperforms executing without a parse cache (3a) without introducing race conditions that will be very difficult to debug (3c). This is the value that will be used in Phase 4 where the results of the Native Image C API will be compared to the results of the JNI Invocation API.
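The thread-safe strategy from 3b can be sketched with a ConcurrentHashMap keyed by language identifier and source text. This is a minimal sketch with assumed names, not the playground’s code. The parser is passed in as a function so the example stays independent of the GraalVM Polyglot API; in a real embedding it would be whatever evaluates a code fragment in the reused polyglot context and returns the resulting function object.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiFunction;

// Hypothetical parse cache; T stands in for the parsed function object.
final class ParseCache<T> {

    // Keyed by (language id, source text); List.of gives value-based equality.
    private final Map<List<String>, T> cache = new ConcurrentHashMap<>();

    // Assumed to wrap the expensive parse step (languageId, source) -> parsed result.
    private final BiFunction<String, String, T> parser;

    ParseCache(BiFunction<String, String, T> parser) {
        this.parser = parser;
    }

    // Returns the cached result, parsing only when the key has not been seen.
    // ConcurrentHashMap.computeIfAbsent applies the function at most once per key.
    T get(String languageId, String source) {
        return cache.computeIfAbsent(
                List.of(languageId, source),
                key -> parser.apply(key.get(0), key.get(1)));
    }
}
```

Swapping the ConcurrentHashMap for a plain HashMap would give something like the 3c variant: a little less overhead per lookup, but no longer safe if two threads call in at the same time.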

Having evaluated several GraalVM Polyglot resource management strategies and settled on the thread-safe parse cache with polyglot context reuse, we can now look at the performance of the various Native Image C API calls (see Fig. 4). I was happy to see that the hard-coded Ruby implementation (Benchmark 2) runs just as fast as the Java implementation (Benchmark 1). It’s a little difficult to see in the graphs, but when accounting for measurement errors, the differences are virtually non-existent: 114.6 ± 1.56 ns for the Java implementation versus 117.2 ± 1.15 ns for the Ruby one. If you have a well-defined operation you need to run and that can be baked right into the Native Image binary, you can write your code in a guest language and not have to worry about rewriting parts of it in Java for performance.

Figure 4: Benchmark results for Native Image C API calls.

Unfortunately, the general-use GraalVM Polyglot API calls are much slower than the Java Haversine implementation. The polyglot Ruby call takes 12x as long to process as the hard-coded Ruby call. This isn’t isolated to TruffleRuby, as the Graal.js calls take 14x as long. I haven’t spent any time digging into why there’s such a large performance gap, so I have no concrete suggestions on how to fix it.

I will say that using the GraalVM Polyglot API by exposing Java methods with @CEntryPoint is quite awkward and probably not the best way to write polyglot code to begin with. GraalVM also ships with a library called libpolyglot that exposes a more natural C API for the GraalVM Polyglot API and you can see an example of that in the Native Image Playground project. I did not benchmark any libpolyglot examples.

Notionally, libpolyglot uses the same machinery as the Java GraalVM Polyglot API, so I’d expect performance to be quite similar. Moreover, it’s a big library that includes every Truffle native image you have installed locally (261 MB with TruffleRuby and Graal.js in GraalVM 21.3.0 on Linux) and must be built manually (i.e., you can’t install it from the GraalVM component catalog). Due to the effort involved and minimal gain anticipated, I opted to defer a deeper analysis of libpolyglot performance as future work.

JNI Invocation API

In order to easily compare results between the Native Image C API and the JNI Invocation API, the same workloads were tested with each API. As a reminder, those benchmarks are:

1. A pure Java implementation of the Haversine distance
2. A hard-coded Ruby implementation of the Haversine distance
3. A general executor of Truffle language code, which is supplied with implementations of the Haversine distance in Ruby and JavaScript

As with the Native Image C API benchmarks, the hard-coded Ruby implementation uses the GraalVM Polyglot API to execute a predetermined code fragment with TruffleRuby. That code fragment can be parsed ahead-of-time and the resulting function object stored directly into the Native Image shared library. Rather than call an exposed library function, as was needed with the Native Image C API, we can call the representative Java method directly with the JNI Invocation API. In this way, the same exact Ruby implementation can be called using two different foreign access APIs.

With the JNI Invocation API, we have considerably more control over reusing the Graal Isolate, GraalVM Polyglot context, and parsed guest language functions. Consequently, the JNI benchmarks do not explore different Truffle caching strategies as we did with the Native Image C API benchmarks; we just use the most straightforward implementation, which happens to be the most performant.

Results

The results of the two simple JNI methods — Benchmarks 1 & 2 from the previous section — are available in Fig. 5.

Figure 5: Benchmark results for methods invoked via the JNI Invocation API.

As with the Native Image C API results, we start by looking at the performance of the Java implementation of the Haversine distance algorithm. At a mean value of 130 ns, the Java implementation takes 2.5x as long to execute as the C++ implementation (51 ns). As noted in the Native Image C API benchmark results, there’s a natural trade-off between executing the C++ version and Java version, as the latter has a supporting virtual machine. It’s important to know that there is a performance difference and what that is, but that should be evaluated in the context of your functional requirements. The Haversine distance algorithm is a computation-heavy benchmark; you should establish your own representative benchmarks if you’re trying to decide between a systems language and Native Image for a particular task.

Having established the performance difference between the C++ Haversine implementation and the Java implementation compiled with Native Image, I’ll be using the Java implementation as the baseline for the Truffle benchmarks. While the C++ implementation helps establish a competitive performance target, ultimately I’m interested in embedding Truffle languages into another process. As such, using Java as the baseline is more useful as it highlights where there’s room for improvement in Truffle interpreters running in Native Image binaries.

Figure 6: Benchmark results for Truffle polyglot methods invoked via JNI Invocation API.

Fig. 6 shows the performance of the JNI polyglot experiments relative to the Java Haversine implementation. As with the Native Image C API, the case where the Ruby code can be parsed and snapshotted into the Native Image binary (Benchmark 2) is as fast as the Java implementation: 130.4 ± 0.37 ns (Java) versus 128.2 ± 0.79 ns (Ruby). My conclusion is the same as before: if you have a well-defined operation you need to run and can bake that right into your Native Image binary, you can write your code in a guest language and not have to worry about rewriting parts of it in Java for performance. That’s an amazing result to me. I suspect most people would expect the Ruby version to run substantially slower than Java and possibly expect the Java version to run substantially slower than C++. But, here, with Native Image we can load TruffleRuby into a foreign process and run a math-heavy operation that only takes 2.5x as long as an optimized C++ version.

The GraalVM polyglot calls that take both a Truffle language ID and a code fragment to evaluate at runtime (Benchmark 3) were a fair bit slower. Calling Ruby in this way took 5x as long to execute as the version where the Ruby code could be compiled right into the shared library. The Graal.js result (7.2x as long as the Java implementation) helps to establish that this is not a result specific to TruffleRuby.

In an ideal world, there would be no steady state difference between using the GraalVM Polyglot API via JNI and hard-coding the guest language code fragment into the image. This is an area I’d like to dig into more. I would’ve expected the hard-coded version to have an advantage in execution speed before things have warmed up just by virtue of not needing to parse the code fragment. However, once warmed up, both approaches should have generated the same machine code since they’re running the same code fragments with the same inputs. I don’t know if the difference is due to JNI call overhead, issues with the JIT process, or something else.

I started down the path of looking at how these different invocation techniques compiled into the Native Image binary, but found it rather difficult as the compiler generates label names divorced from the Java method names. Running a debug build helped map the labels back to their source methods, but it’s still a time-consuming process of following JMP and CALL instructions in a debugger and consulting the backtrace to see where I logically was in the application. There’s almost certainly a better way to dump the machine code for a method compiled by Native Image.

Overall

The previous benchmark sections looked at the performance of different ways to execute code compiled into a Native Image shared library from an external process, with an emphasis on executing code in a Truffle interpreter. There are two primary invocation APIs for exposing Java methods and making them accessible via C calling conventions: the Native Image C API and the Java Native Interface (JNI) Invocation API. Thus far we’ve looked at the relative performance difference of executing methods written in C++, Java, Ruby, and JavaScript in each of these invocation APIs. In Fig. 7, we can now see how the invocation APIs perform relative to each other.

Figure 7: Benchmark results for Truffle polyglot methods invoked via both the Native Image C API and the JNI Invocation API.

When calling a Java method compiled into the binary without caller use of the GraalVM Polyglot API, the Native Image C API does come out ahead, although not by much. I want to qualify that statement by making it clear I was measuring warmed up benchmarks and did not get into wrapping object handles very much; the only data passed from the benchmark harness to the shared library were double values and C strings. The double values mapped directly to the Java primitive type, but the C strings needed to be decoded to java.lang.String. If there were more data coercion or coercion of more sophisticated types, I wouldn’t be surprised to see differences in overhead between the two APIs.

While it was interesting to see how each API handled calling plain old Java methods, my real goal was learning how embedded Truffle interpreters performed. When it comes to making calls using the GraalVM Polyglot API, the JNI Invocation API comes out way ahead. Polyglot calls made with JNI were roughly twice as fast as using the Native Image C API. This was a fortuitous outcome; the promoted invocation API is also the one that performs best for executing guest language code.

I suspect much of the performance difference is attributable to JNI providing a more natural and refined mechanism for managing Truffle objects. JNI can store and work with Java types right in C++. The Native Image C API only supports a narrow set of foreign objects and the supporting API is quite difficult to work with. Accordingly, the JNI benchmarks can parse guest language code and store the resulting Truffle function object right in a local field, which it can then use for each iteration of the benchmark. Whereas with the Native Image C API, I needed to create a thread-safe map of guest code to Truffle functions (to avoid repeatedly parsing the same code fragment) and that cache needed to be read from on each benchmark iteration.

Lessons Learned: GraalVM Polyglot API

I struggled a fair amount deciding where to put this section. It feels somewhat buried here at the end of the benchmark presentation, but I think the benchmark results help contextualize the notes on the GraalVM Polyglot API interactions.

I found the GraalVM Polyglot API rather awkward to work with from C and C++. I ran into issues working with it from both the Native Image C API and from the JNI Invocation API. There are risks with public APIs that are difficult to use. One is that users simply give up and move on to another project or solution. For those that persevere, there’s a risk that they’re using the API in dangerous ways and just don’t know it. Moreover, they can pass this incorrect knowledge off to others, exacerbating the problem. Yet another risk is that users will look for ways to simplify the API usage and kill performance in the process.

I started down the path of doing the simplest thing first. It wasn’t just laziness or ineptitude though. The GraalVM Embedding Languages reference manual uses very simplified examples throughout the whole document. Notably, nearly every GraalVM Polyglot API example uses a try-with-resources statement to create a polyglot context, which is then used to initialize a Truffle language engine and ultimately execute some guest language code. When the containing Java method exits, the context is freed. It’s a very tidy way of doing resource management and it looks sensible. Having not thought too deeply about it, this approach looked right to me.

I started this project working with the Native Image C API. When you write a @CEntryPoint method using try-with-resources for managing the polyglot context, as in Example 7, you have a completely self-contained function you can call from C. Here, too, I thought everything looked nice and tidy. Java functions exposed with @CEntryPoint are supposed to be self-contained; they must be static and they only have access to their parameters and other static data.

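import org.graalvm.nativeimage.IsolateThread;
import org.graalvm.nativeimage.c.function.CEntryPoint;
import org.graalvm.nativeimage.c.type.CCharPointer;
import org.graalvm.nativeimage.c.type.CTypeConversion;
import org.graalvm.polyglot.Context;

@CEntryPoint(name = "distance_polyglot_no_cache")
public static double distance(IsolateThread thread,
                              CCharPointer cLanguage,
                              CCharPointer cCode,
                              double aLat, double aLong,
                              double bLat, double bLong) {
    // A fresh polyglot context is built on every call and closed
    // automatically when the try block exits.
    try (Context context = Context.newBuilder()
            .allowExperimentalOptions(true)
            .option("ruby.no-home-provided", "true")
            .build()) {
        // Decode the C strings passed by the caller into Java strings.
        final String code = CTypeConversion.toJavaString(cCode);
        final String language = CTypeConversion.toJavaString(cLanguage);

        // Parse the fragment; it evaluates to an executable function object.
        var function = context.eval(language, code);

        // Call the guest function and unbox its result as a Java double.
        return function.execute(aLat, aLong, bLat, bLong).asDouble();
    }
}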

// Function declaration in header file generated by Native Image.
double distance_polyglot_no_cache(graal_isolatethread_t*,
                                  char*, char*, double, double, double, double);

Example 7: Using try-with-resources for polyglot context management in a @CEntryPoint method.

When I finally had everything come together such that a C application could successfully execute Ruby code by calling a function written in Java that was exposed in a Native Image shared library, I was ecstatic. The amount of technology that had to come together to make all of this happen is staggering and ten years ago I wouldn’t have thought it was possible. However, my excitement was tempered by the abysmal execution time. Each time I ran this function, it took hundreds of milliseconds — sometimes even approaching a full second.

The problem was every time I called this function, TruffleRuby needed to bootstrap from scratch. There’s ongoing work to make that bootstrap process faster, particularly in native images. But, the proximate cause was the polyglot context never lasted more than a single function call. Even if TruffleRuby bootstrapped instantaneously, my code would never have the ability to optimize in any meaningful way. Each time the context was closed, any JIT-generated code went along with it.

At face value, the solution seemed simple enough: share the context across multiple calls. However, I could not find any documentation or code samples on how to do this with the Native Image C API. The @CEntryPoint functions that Native Image can expose in a shared library only support a narrow range of types for parameters and return types. You might think that you could pass arbitrary Java objects around as void *, treating them as opaque values to be passed around. However, Java is a garbage collected language and that presents problems. If the GC were to free an object you still have a pointer to, you would have a use-after-free problem if you ever used that pointer again. An equally bad situation is if the GC moves the object, since there would be no way to update the calling process. To prevent the GC from processing an object, you can pin it with the Native Image C API. However, this should be done sparingly and is intended to prevent an object from moving during a very narrow window; long-term pinning of an object is not recommended as it will have an adverse impact on GC. Moreover, you won’t find documentation on pinning objects with the Native Image C API; you will be in decidedly unsupported territory.

There is a C-based GraalVM Polyglot API partially hidden inside GraalVM via the libpolyglot image (gu rebuild libpolyglot to install it). With this API you can create a polyglot context from C, but you forfeit the nice, simple functions exposed via @CEntryPoint. For instance, looking back at the C function declaration from Example 7, we can call distance_polyglot_no_cache with a C string and C double values. The Java side of the call takes care of any necessary type coercion from C to Java types and dispatches the appropriate arguments to the polyglot function call. The GraalVM Polyglot C API, on the other hand, requires using its own API-specific types. Making a similar call with this API involves converting a C double to a type called poly_value (using poly_create_double) and then packing all four values into an array for a call to the polyglot function via poly_value_execute.

To the best of my knowledge, there’s no documentation for the GraalVM Polyglot C API. You have to piece it together by realizing it’s a mirror of the Java-based GraalVM Polyglot API, which I did not know at first owing to the lack of documentation. From there, you have to match up data types and function declarations from the API’s header files with the JavaDoc for the Java-based API. You can see my foray into using this API in my Native Image Playground project. While I now appreciate the API symmetry and understand the design, I still find the GraalVM Polyglot C API a bit obtuse to use and rather error-prone.

For me, at least, it was decidedly easier to try to find a way to share a polyglot context across multiple @CEntryPoint method calls. The ugly approach I landed on, and the one explored in the benchmarks for the Native Image C API, was to build and store the context in a static field. For the hard-coded Ruby example, I evaluated the Haversine code snippet in a static initializer and stored the polyglot function object in a static field so the code would not need to be re-parsed each time the distance_ruby function was called. For the polyglot cases, the caller supplies both the Truffle language identifier and the code to evaluate. Since parsing the code on each call would incur overhead, I set up a parsed code cache, keyed by the language ID and code. The cache serves another function: with the polyglot context being shared across each @CEntryPoint method call, evaluating a code fragment repeatedly will fail if the fragment is not idempotent. The benchmarks explore the performance impact of using such a cache and measure the overhead of ensuring its thread-safety. To avoid issues re-parsing the same code fragment multiple times when the cache is disabled, the fragments for the Haversine distance were all implemented as anonymous functions.⁶
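The lifetime difference between a per-call context and one held in a static field can be illustrated without GraalVM at all. In the sketch below, FakeContext is a hypothetical stand-in for an expensive, closeable resource such as a polyglot context. It only counts how many times it gets created, and none of the names come from the Native Image Playground.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SharedContextSketch {
    // Counts how many times the stand-in resource has been "bootstrapped".
    static final AtomicInteger BOOTSTRAPS = new AtomicInteger();

    // Hypothetical stand-in for an expensive resource such as a polyglot context.
    static final class FakeContext implements AutoCloseable {
        FakeContext() { BOOTSTRAPS.incrementAndGet(); }
        double execute(double a, double b) { return a + b; }
        @Override public void close() { /* would discard warmed-up state */ }
    }

    // Per-call lifetime: tidy, but every call pays the full bootstrap cost.
    static double perCall(double a, double b) {
        try (FakeContext context = new FakeContext()) {
            return context.execute(a, b);
        }
    }

    // Shared lifetime: the resource is created once, on first use, and then
    // reused by every static entry point for the life of the process.
    private static final class Holder {
        static final FakeContext CONTEXT = new FakeContext();
    }

    static double shared(double a, double b) {
        return Holder.CONTEXT.execute(a, b);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) perCall(1.0, 2.0);
        int afterPerCall = BOOTSTRAPS.get();

        for (int i = 0; i < 3; i++) shared(1.0, 2.0);
        int sharedBootstraps = BOOTSTRAPS.get() - afterPerCall;

        System.out.println("per-call bootstraps: " + afterPerCall);   // 3
        System.out.println("shared bootstraps: " + sharedBootstraps); // 1
    }
}
```

The trade-off is that the shared resource is never closed, and any state left behind by one call is visible to the next, which is why the non-idempotent fragments described above become a problem.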

I don’t doubt someone more intimately familiar with the Native Image C API and the GraalVM Polyglot API could find a more optimized way of calling guest code than I’ve used in this project. But, that circles back to the dearth of documentation and examples on how to embed Truffle interpreters in a performance-sensitive manner. If there’s a better way to share compiled code across multiple @CEntryPoint method calls, I haven’t found it.

In contrast, the JNI Invocation API makes polyglot resource management much easier. The design of the API allows storing and passing Java objects to and from C++. The difficulty is that anyone using JNI to make GraalVM Polyglot API calls is going to need to map that API to a JNI configuration file to be used when building the Native Image binary. The Native Image Playground has such a configuration file, generated by the GraalVM tracing agent when the application was run using JNI against libjvm.

The file can be hand-crafted, but Java has its own custom format for representing type signatures and it’s easy to get a mapping wrong or overlook one. If you get it wrong, you won’t know until you run your application and it’ll likely manifest as a segfault due to the mismapped function returning nullptr. There may very well be an accompanying Java exception, but JNI does not print those by default; you must invoke yet another pair of functions to check if an exception object exists and then to print it out if so. Every call could fail, so a robust application would have extensive error-checking. But, that’s tedious and makes the business logic much harder to read. For Native Image Playground I wrote an exception-checking macro that I sprinkled around the application when I encountered a segfault and then removed after the bug was fixed.
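One way to take some of the guesswork out of the type signature format is to let the JDK produce the descriptor strings. The sketch below uses java.lang.invoke.MethodType to print JVM method descriptors, which is the string format that JNI lookup functions such as GetMethodID take as their signature argument. The class name and the example method shapes are made up for illustration.

```java
import java.lang.invoke.MethodType;

// Hypothetical helper that prints JVM method descriptors for given Java types.
final class JniDescriptorSketch {
    // Builds the descriptor string for a method with the given return and parameter types.
    static String descriptor(Class<?> returnType, Class<?>... parameterTypes) {
        return MethodType.methodType(returnType, parameterTypes)
                         .toMethodDescriptorString();
    }

    public static void main(String[] args) {
        // A method shaped like: double distance(double, double, double, double)
        System.out.println(descriptor(double.class,
                double.class, double.class, double.class, double.class));
        // prints (DDDD)D

        // A method shaped like: Object eval(String, String)
        System.out.println(descriptor(Object.class, String.class, String.class));
        // prints (Ljava/lang/String;Ljava/lang/String;)Ljava/lang/Object;

        // A method shaped like: void run(String[])
        System.out.println(descriptor(void.class, String[].class));
        // prints ([Ljava/lang/String;)V
    }
}
```

Primitive types map to single letters, object types are wrapped in L...; with slashes in place of dots, and arrays get a leading bracket. A single misplaced semicolon is enough to make a lookup fail.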

Once you’ve identified what’s missing or incorrect in your JNI configuration file, you need to go and rebuild the Native Image shared library. It’s a slow and unforgiving process.

When I started this project, I couldn’t find much in the way of documentation or examples for using the JNI Invocation API to call into a Native Image shared library. That changed with the release of GraalVM 22.1.0. Now the various Truffle languages are launched by a C++ application that uses JNI to run the interpreter either in a Graal-enabled JVM (via libjvm) or as a native application by calling into a Native Image shared library version of the interpreter. The launcher doesn’t make use of the GraalVM Polyglot API, but it’s still nice seeing how the JNI Invocation API should be used to call into a Native Image shared library.

Conclusion

This turned out to be a much larger project than I had anticipated and there’s still much left to explore. Unfortunately, the various mechanisms for exposing Java methods in a Native Image shared library and then calling into those methods were not easy to discover. I frequently had to dig into the Native Image source code to work things out. To be fair, some stuff I thought was undocumented turned out in fact to be documented; I simply didn’t know which set of keywords to use to find them. Maybe there’s even more documentation out there that I’ve yet to discover. Be that as it may, I hope this blog post and the examples in the Native Image Playground project can help steer others in the right direction and save them some frustration.

I don’t have any data to back it up, but I get the sense that the predominant use case of Native Image is turning JVM-based applications into native applications. Using Native Image in this way is much easier than using it to build a shared library. And while there are plenty of benchmarks for Native Image applications, to the best of my knowledge no one has published comprehensive benchmarks on either the Native Image C API or the JNI Invocation API for calling functions within a Native Image shared library. I hope the experiments and results from this blog post can help developers make an informed decision about how best to expose and call Java methods in a Native Image shared library.

All of the benchmarks live in my Native Image Playground project. It’s also home to self-contained examples that demonstrate everything discussed in this post. I intend for the playground to be a testbed for other experiments in embedding Truffle interpreters. Please see the Native Image Overview section for more details on how to work with the project.

Based on my evaluation, I’ll be using the JNI Invocation API for all of my Truffle embedded work. It’s the API that the GraalVM team has signaled would be the future for foreign calls into a Native Image shared library and it’s the fastest invocation API for calling into Truffle interpreters. Unfortunately, working with the GraalVM Polyglot API with JNI is a little difficult (please see the Lessons Learned: GraalVM Polyglot API section for more details).

I think there’s an opportunity here for the GraalVM project to remove some of the ceremony needed to call the GraalVM Polyglot API with the JNI Invocation API in a Native Image shared library. At the simplest level, it would be a huge quality of life improvement if Native Image could handle registering the GraalVM Polyglot API for JNI usage without user involvement. There’s really no need to make each user go through the tedious and error-prone process of constructing the JNI configuration file themselves — the polyglot API is going to be the same for everyone.

The next area I’d like to see improved is API ergonomics. My goal was to execute guest language code in a Truffle interpreter from a process loading my Native Image shared library. The Native Image C API’s advantage here is that I get to largely determine what that API looks like and that API uses a C calling convention, making it very easy to call into the library from languages with foreign function libraries. Requiring every consumer of the shared library to learn how to use JNI is a large cognitive overhead. I also think it’s a leaky abstraction. Having to map all of JNI for use with a foreign function library is a massive undertaking. Without doing so, however, there’s no real way to call my Native Image shared library in any language other than C or C++. In my ideal world the JNI calls are just an implementation detail and instead users work with a higher level API. I think this would be spiritually similar to the GraalVM Polyglot C API, but considerably simpler to work with.

To recap, in order to use a Truffle interpreter embedded in a Native Image shared library, you need to know:

The GraalVM Polyglot API
The JNI Invocation API, including:
    1. How to represent type signatures
    2. How to handle Java exceptions in JNI and other forms of error-handling
    3. How to map JNI to your executing environment’s foreign function interface
(optional) The GraalVM tracing agent (while optional, it’s highly recommended for generating the Native Image JNI configuration file)
Building a Native Image binary with a JNI configuration (not difficult per se, but the errors are confusing if you skip this step)
Resource management between the Graal Isolate, polyglot contexts, and polyglot engines
Passing “command-line” options to Graal and Truffle and which goes with what:
    1. Graal options need to be supplied when creating the Graal Isolate
    2. Truffle options need to be supplied when creating the polyglot context
(optional) How to use the Ruby standard library from disk (TruffleRuby-specific)

That’s a lot to absorb. If you manage to master all of that, the technology is amazing. I was very happy to be able to embed TruffleRuby in an application and have Ruby code execute as fast as Java and take only about 2.5x as long as C++, with virtually no effort on my part to optimize it. I look forward to exploring more of this space and seeing what else can be achieved. I’m hopeful we can improve the developer experience and make this technology more readily accessible to those not steeped in all things GraalVM.

¹ Native Image is unable to JIT Java code because there is no Java bytecode to profile in the compiled binary. However, Truffle-based languages can JIT because Graal compiler graphs corresponding to the language’s Truffle interpreter are compiled into the image. Getting a bit meta, there’s an implementation of a Java bytecode interpreter in Truffle called Espresso in development. Since Espresso uses a Truffle interpreter, it will be able to JIT and it is expected that will be the way forward for JITting Java applications in a Native Image binary.

² Generally, a compiled Native Image cannot dynamically load classes because there is no JVM in the compiled binary to do so. With everything ahead-of-time compiled, you need all your Java class files available at native image construction time. However, as with JIT compiling Java code, you can dynamically load classes by using the Espresso Java bytecode interpreter. In this case, the Espresso interpreter would be AOT compiled into your native image but your class files would be dynamically loaded and run in the interpreter at runtime much like running on the JVM.

³ While a Truffle interpreter can be compiled to a native binary with Native Image, an application using a Truffle language cannot as of yet. E.g., you cannot write a CLI application in Ruby and compile the application into a native executable. Instead, you’d use a compiled interpreter such as TruffleRuby to load and run your script.

⁴ For an approximation of the effort involved, please look at the native-polyglot launcher implementation in the Native Image Playground. The native-polyglot example uses the Native Image Polyglot API, a wrapper around the Native Image C API used for Truffle polyglot calls in C. This approach to embedding Truffle languages in another process is not benchmarked because it’s so similar to the Native Image C API. Additionally, the Native Image Polyglot API is deprecated and should not be used for new projects. It exists in the Native Image Playground solely for completeness.

⁵ Truffle languages can pre-bootstrap an interpreter and snapshot that into the Native Image. The Truffle languages from the GraalVM team do precisely that to varying degrees. Whatever can’t be snapshotted during the Native Image building process must be executed at run time.

⁶ As a practical matter, anonymous functions allow us to evaluate the same code snippet multiple times without state conflicts. In a shared context, all executed code fragments share the same global state. If you were to parse and run the JavaScript snippet const EARTH_RADIUS = 6371 each time you called the @CEntryPoint function with a shared context, you would get an error about attempting to redefine a constant. There are ways to work around this, of course. In Ruby, you can check if a constant is already defined before defining it in your code snippet. In our embedded examples, we could make multiple calls to the GraalVM Polyglot API, ensuring function definitions are only executed once, while function calls may happen in the benchmark loop. Using anonymous functions allowed for flexibility in how the code is evaluated; it works fine with or without a parse cache and is wholly contained (i.e., it does not require two separate calls to define and then call a function).