<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://jfumero.dev/feed.xml" rel="self" type="application/atom+xml" /><link href="https://jfumero.dev/" rel="alternate" type="text/html" /><updated>2026-04-27T07:22:24+01:00</updated><id>https://jfumero.dev/feed.xml</id><title type="html">Juan Fumero</title><subtitle>personal description</subtitle><author><name>Juan Fumero, PhD</name><email>juan@jfumero.dev</email></author><entry><title type="html">Docker Compose for SonarQube: A Simple YAML Template</title><link href="https://jfumero.dev/posts/2026/02/14/sonar-docker-compose" rel="alternate" type="text/html" title="Docker Compose for SonarQube: A Simple YAML Template" /><published>2026-02-14T00:00:00+00:00</published><updated>2026-02-14T00:00:00+00:00</updated><id>https://jfumero.dev/posts/2026/02/14/sonar-docker-compose</id><content type="html" xml:base="https://jfumero.dev/posts/2026/02/14/sonar-docker-compose"><![CDATA[<h2 id="docker-compose-for-sonarqube-community">Docker Compose for SonarQube Community</h2>

<div class="language-yml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">services</span><span class="pi">:</span>
  <span class="na">sonarqube</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">sonarqube:community</span>
    <span class="na">restart</span><span class="pi">:</span> <span class="s">unless-stopped</span>
    <span class="na">depends_on</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">db</span>
    <span class="na">environment</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">SONAR_JDBC_URL=jdbc:postgresql://db:5432/sonar</span>
      <span class="pi">-</span> <span class="s">SONAR_JDBC_USERNAME=sonar</span>
      <span class="pi">-</span> <span class="s">SONAR_JDBC_PASSWORD=sonar</span>
    <span class="na">volumes</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">sonarqube_data:/opt/sonarqube/data</span>
      <span class="pi">-</span> <span class="s">sonarqube_extensions:/opt/sonarqube/extensions</span>
      <span class="pi">-</span> <span class="s">sonarqube_logs:/opt/sonarqube/logs</span>
    <span class="na">ports</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s2">"</span><span class="s">9000:9000"</span>
    <span class="na">networks</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">sonarnet</span>

  <span class="na">db</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">postgres:15</span>
    <span class="na">restart</span><span class="pi">:</span> <span class="s">unless-stopped</span>
    <span class="na">environment</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">POSTGRES_USER=sonar</span>
      <span class="pi">-</span> <span class="s">POSTGRES_PASSWORD=sonar</span>
      <span class="pi">-</span> <span class="s">POSTGRES_DB=sonar</span>
    <span class="na">volumes</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">postgresql_data:/var/lib/postgresql/data</span>
    <span class="na">networks</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">sonarnet</span>

<span class="na">networks</span><span class="pi">:</span>
  <span class="na">sonarnet</span><span class="pi">:</span>

<span class="na">volumes</span><span class="pi">:</span>
  <span class="na">sonarqube_data</span><span class="pi">:</span>
  <span class="na">sonarqube_extensions</span><span class="pi">:</span>
  <span class="na">sonarqube_logs</span><span class="pi">:</span>
  <span class="na">postgresql_data</span><span class="pi">:</span>
</code></pre></div></div>
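
<p>Before starting the stack, note that SonarQube embeds Elasticsearch, which requires a higher <code class="language-plaintext highlighter-rouge">vm.max_map_count</code> than most kernel defaults. If the container exits during startup with a bootstrap check error, raise it on the host (the file name below is just a suggestion):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo sysctl -w vm.max_map_count=524288
## Persist the setting across reboots
echo "vm.max_map_count=524288" | sudo tee /etc/sysctl.d/99-sonarqube.conf
</code></pre></div></div>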

<p>Then, we pull the images:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker compose pull
</code></pre></div></div>

<p>And run the server:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker compose up <span class="nt">-d</span>
</code></pre></div></div>

<p>To set up SonarQube, we access <code class="language-plaintext highlighter-rouge">&lt;ip&gt;:9000</code> in a browser and update the password. 
The default credentials are:</p>
<ul>
  <li>User: <code class="language-plaintext highlighter-rouge">admin</code></li>
  <li>Password: <code class="language-plaintext highlighter-rouge">admin</code></li>
</ul>
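
<p>Once the server is up and a project token has been created in the UI, we can analyze a project with the official scanner image. This is a minimal sketch; the project key and token below are placeholders:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run --rm \
  -e SONAR_HOST_URL="http://&lt;ip&gt;:9000" \
  -e SONAR_TOKEN="&lt;your-token&gt;" \
  -v "$PWD:/usr/src" \
  sonarsource/sonar-scanner-cli -Dsonar.projectKey=&lt;project-key&gt;
</code></pre></div></div>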

<h2 id="update-the-docker-image-for-sonar">Update the docker image for Sonar</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker compose down
docker compose pull 
docker compose up <span class="nt">-d</span>   
</code></pre></div></div>

<h2 id="links">Links:</h2>

<ul>
  <li><a href="https://docs.sonarsource.com/sonarqube-server/server-installation/from-docker-image/set-up-and-start-container">https://docs.sonarsource.com/sonarqube-server/server-installation/from-docker-image/set-up-and-start-container</a></li>
</ul>]]></content><author><name>Juan Fumero, PhD</name><email>juan@jfumero.dev</email></author><category term="Docker" /><category term="SonarQube" /><summary type="html"><![CDATA[Basic yaml file for Docker Compose to run SonarQube.]]></summary></entry><entry><title type="html">Building The Regression Test Harness for the OpenJDK (jtreg) from Source</title><link href="https://jfumero.dev/posts/2026/02/05/jdk-jtreg-build" rel="alternate" type="text/html" title="Building The Regression Test Harness for the OpenJDK (jtreg) from Source" /><published>2026-02-05T00:00:00+00:00</published><updated>2026-02-05T00:00:00+00:00</updated><id>https://jfumero.dev/posts/2026/02/05/jtreg-build</id><content type="html" xml:base="https://jfumero.dev/posts/2026/02/05/jdk-jtreg-build"><![CDATA[<h2 id="build-jtreg">Build JTREG</h2>

<p>As of February 2026, to build <code class="language-plaintext highlighter-rouge">jtreg</code> from source, we need to use JDK 25.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/openjdk/jtreg/
<span class="nb">cd </span>jtreg


<span class="c">## Use JDK 25 (e.g., Oracle JDK). We can use sdkman to </span>
<span class="c">## obtain the desired JDK version</span>
sdk use java 25.0.1-oracle

<span class="c">## Build jtreg</span>
sh ./make/build.sh

<span class="c">## Check</span>
<span class="nv">$ </span>./build/images/jtreg/bin/jtreg <span class="nt">-version</span> 

jtreg 8.3-dev+0
Installed <span class="k">in</span> /home/juan/bin/jtreg/build/images/jtreg/lib/jtreg.jar
Running on platform version 25.0.1 from /home/juan/.sdkman/candidates/java/25.0.1-oracle.
Built with Java<span class="o">(</span>TM<span class="o">)</span> 2 SDK, Version 25.0.1+8-LTS-27 on February 04, 2026.
JT Harness, version 6.0 ea b24 <span class="o">(</span>January 21, 2026<span class="o">)</span>
Java Assembler Tools, version 9.1 ea 01 <span class="o">(</span>January 21, 2026<span class="o">)</span>
TestNG: testng-7.3.0.jar, guice-5.1.0.jar, jcommander-1.82.jar
JUnit: junit-platform-console-standalone-1.14.2.jar
</code></pre></div></div>

<p>To make <code class="language-plaintext highlighter-rouge">jtreg</code> easily accessible, it is convenient to declare the <code class="language-plaintext highlighter-rouge">JTREG_HOME</code> variable and update your <code class="language-plaintext highlighter-rouge">PATH</code>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#JTREG_HOME</span>
<span class="nb">export </span><span class="nv">JTREG_HOME</span><span class="o">=</span>&lt;path-to&gt;/jtreg/build/images/jtreg
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>&lt;path-to&gt;/jtreg/build/images/jtreg/bin:<span class="nv">$PATH</span>
</code></pre></div></div>
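
<p>As a quick sanity check, we can run a single test from a JDK source checkout. The paths below are only illustrative placeholders:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd &lt;path-to&gt;/jdk
jtreg -verbose:summary -jdk:&lt;path-to-jdk-under-test&gt; test/jdk/&lt;some-test&gt;.java
</code></pre></div></div>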

<h2 id="build-idea-plugin-for-jtreg">Build IDEA Plugin for JTREG</h2>

<p>The <code class="language-plaintext highlighter-rouge">jtreg</code> repository also contains source code for an IntelliJ IDEA plugin.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ./plugins/idea
</code></pre></div></div>

<p>Update the file <code class="language-plaintext highlighter-rouge">gradle.properties</code> with the <code class="language-plaintext highlighter-rouge">jtregHome</code> path pointing to the <code class="language-plaintext highlighter-rouge">JTREG_HOME</code> we just built.</p>

<div class="language-gradle highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">jtregHome</span> <span class="o">=</span> <span class="o">..</span><span class="s">/../</span><span class="n">build</span><span class="s">/images/</span><span class="n">jtreg</span>
</code></pre></div></div>

<p>To build the plugin, we need JDK 21.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sdk use java 21.0.9-oracle

sh gradlew clean build
</code></pre></div></div>

<p>The plugin is located at <code class="language-plaintext highlighter-rouge">plugins/idea/build/distributions/jtreg-plugin-1.19.zip</code>.</p>

<p>To install it in IntelliJ IDEA:</p>

<ol>
  <li>Go to Settings &gt; Plugins.</li>
  <li>Click the Gear Icon ⚙️ and select “Install Plugin from Disk…”.</li>
  <li>Select the generated .zip file.</li>
  <li>Restart your IDE.</li>
</ol>]]></content><author><name>Juan Fumero, PhD</name><email>juan@jfumero.dev</email></author><category term="JDK" /><category term="jtreg" /><category term="testing" /><category term="Build" /><summary type="html"><![CDATA[Build JTREG]]></summary></entry><entry><title type="html">How to Install NVIDIA Drivers and CUDA Toolkit on Oracle Linux 10</title><link href="https://jfumero.dev/posts/2025/08/15/nvidia-drivers-and-toolkit-oracle-linux10" rel="alternate" type="text/html" title="How to Install NVIDIA Drivers and CUDA Toolkit on Oracle Linux 10" /><published>2025-08-15T00:00:00+01:00</published><updated>2025-08-15T00:00:00+01:00</updated><id>https://jfumero.dev/posts/2025/08/15/nvidia-drivers-oracle-linux10</id><content type="html" xml:base="https://jfumero.dev/posts/2025/08/15/nvidia-drivers-and-toolkit-oracle-linux10"><![CDATA[<p>For any developer or power user who’s ever tried to install NVIDIA drivers on Linux, the process can feel less like a straightforward task. While most mainstream distributions have streamlined the process, trying to get NVIDIA drivers working on Oracle Linux for a desktop setup is a unique challenge. The available documentation is often sparse, focusing on server-side GPU acceleration rather than desktop graphics, leaving a trail of broken dependencies and black screens in its wake (<em>and that just happened to me while I was trying to install the drivers as well</em> 😢).</p>

<p>This guide aims to fill that gap. We’ll walk through the process step-by-step, demystifying the installation of NVIDIA drivers on the latest Oracle Linux 10 for desktop setups. Note that for data centers, the <a href="https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/">NVIDIA guidelines</a> cover how to install and configure the NVIDIA driver for servers.</p>

<p>In addition, this post shows how to configure the CUDA 13.0 SDK to compile and run CUDA programs on NVIDIA GPUs.</p>

<h3 id="1-update-the-system">1. Update the system</h3>

<p>First of all, we need to update the system. This installation guide assumes a fresh installation of Oracle Linux 10 with GNOME 47.4.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>dnf update
reboot
</code></pre></div></div>

<p>At the time of writing this post, this is the latest Linux kernel available for Oracle Linux 10:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">uname</span> <span class="nt">-a</span>
Linux oraclelinux 6.12.0-101.33.4.3.el10uek.x86_64 <span class="c">#1 SMP PREEMPT_DYNAMIC Mon Jul 14 18:29:21 PDT 2025 x86_64 GNU/Linux</span>
</code></pre></div></div>

<p>So, keep in mind we are using a <a href="https://docs.oracle.com/en/operating-systems/uek/">UEK</a> (Unbreakable Enterprise Kernel) Linux kernel. 
This is a Linux kernel developed by Oracle for the Oracle Linux distribution that is optimized for Oracle Cloud. <strong>Thus, when we install the kernel prerequisites for the NVIDIA drivers, we also need 
to install the UEK versions.</strong></p>

<h3 id="2-download-the-latest-nvidia-driver">2. Download the latest NVIDIA Driver</h3>

<p>Visit the <a href="https://www.nvidia.com/en-us/drivers/">NVIDIA website</a> to download the latest NVIDIA driver. 
At the time of writing this post, the latest version is <code class="language-plaintext highlighter-rouge">580.76.05</code>.</p>

<p>Download the file and give it execution permissions:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">chmod</span> +x NVIDIA-Linux-x86_64-580.76.05.run
</code></pre></div></div>

<h3 id="3-installing-the-dependencies">3. Installing the dependencies</h3>

<p>Enable the Oracle EPEL repository:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>dnf <span class="nb">install </span>oracle-epel-release-el10-1.0-2.el10.x86_64
</code></pre></div></div>

<p>Install the dependencies:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>dnf <span class="nb">install </span>kernel-uek-devel gcc make acpid libglvnd-glx libglvnd-opengl libglvnd-devel pkgconfig xorg-x11-server-Xwayland libxcb egl-wayland
</code></pre></div></div>

<h3 id="4-disable-nouveau-and-nova-core">4. Disable Nouveau and NOVA Core</h3>

<p>Access as root:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>su -
</code></pre></div></div>

<p>Then blacklist the <code class="language-plaintext highlighter-rouge">nouveau</code> and <code class="language-plaintext highlighter-rouge">nova_core</code> drivers, and set the NVIDIA module options:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s2">"blacklist nouveau"</span> <span class="o">&gt;&gt;</span> /etc/modprobe.d/blacklist.conf
<span class="nb">echo</span> <span class="s2">"blacklist nova_core"</span> <span class="o">&gt;&gt;</span> /etc/modprobe.d/blacklist.conf
<span class="nb">echo</span> <span class="s2">"options nvidia NVreg_PreserveVideoMemoryAllocations=1"</span> <span class="o">&gt;&gt;</span> /etc/modprobe.d/nvidia.conf
<span class="nb">echo</span> <span class="s2">"options nvidia-drm modeset=1 fbdev=0"</span> <span class="o">&gt;&gt;</span> /etc/modprobe.d/nvidia.conf
</code></pre></div></div>

<h3 id="5-update-the-grub2-configuration">5. Update the grub2 configuration</h3>

<p>As described in the excellent <a href="https://www.if-not-true-then-false.com/2015/fedora-nvidia-guide/">if-not-true-then-false</a> blog, we then need to update the <code class="language-plaintext highlighter-rouge">grub2</code> configuration and create a new initramfs image.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>grub2-mkconfig <span class="nt">-o</span> /boot/grub2/grub.cfg
grub2-mkconfig <span class="nt">-o</span> /boot/efi/EFI/redhat/grub.cfg
</code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mv</span> /boot/initramfs-<span class="si">$(</span><span class="nb">uname</span> <span class="nt">-r</span><span class="si">)</span>.img /boot/initramfs-<span class="si">$(</span><span class="nb">uname</span> <span class="nt">-r</span><span class="si">)</span><span class="nt">-nouveau-nova</span>.img
dracut /boot/initramfs-<span class="si">$(</span><span class="nb">uname</span> <span class="nt">-r</span><span class="si">)</span>.img <span class="si">$(</span><span class="nb">uname</span> <span class="nt">-r</span><span class="si">)</span>
</code></pre></div></div>

<h3 id="6-install-the-nvidia-driver">6. Install the NVIDIA Driver</h3>

<p>Disable the graphical interface to install the NVIDIA Driver. 
We will enable it again once the installation is done.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>systemctl set-default multi-user.target
</code></pre></div></div>

<p>Then we can reboot:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>reboot
</code></pre></div></div>
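
<p>After the reboot, and before launching the installer, it is worth confirming that the blacklisted modules are not loaded. Both commands should print nothing:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lsmod | grep nouveau
lsmod | grep nova
</code></pre></div></div>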

<p>To install the NVIDIA driver, just run the following script, and follow the instructions:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo</span> ./NVIDIA-Linux-x86_64-580.76.05.run
</code></pre></div></div>

<p>Once the installation is done, re-enable the graphical interface:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>systemctl set-default graphical.target
reboot
</code></pre></div></div>

<p>Check with the <code class="language-plaintext highlighter-rouge">nvidia-smi</code> command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Fri Aug 15 08:52:12 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.76.05              Driver Version: 580.76.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|<span class="o">=========================================</span>+<span class="o">========================</span>+<span class="o">======================</span>|
|   0  NVIDIA GeForce RTX 2060 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   40C    P8              2W /   65W |       4MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|<span class="o">=========================================================================================</span>|
|    0   N/A  N/A            2744      G   /usr/bin/gnome-shell                      1MiB |
+-----------------------------------------------------------------------------------------+
</code></pre></div></div>

<p>Great, installation is done! Now, let’s compile and run some CUDA programs on the GPU.
To do that, we need to install the CUDA Toolkit.</p>

<h3 id="7-install-cuda-130-toolkit">7. Install CUDA 13.0 Toolkit</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget https://developer.download.nvidia.com/compute/cuda/13.0.0/local_installers/cuda-repo-rhel9-13-0-local-13.0.0_580.65.06-1.x86_64.rpm
<span class="nb">sudo </span>rpm <span class="nt">-i</span> cuda-repo-rhel9-13-0-local-13.0.0_580.65.06-1.x86_64.rpm
<span class="nb">sudo </span>dnf clean all
<span class="nb">sudo </span>dnf <span class="nt">-y</span> <span class="nb">install </span>cuda-toolkit-13-0
</code></pre></div></div>

<p>Now we can run CUDA. Update the <code class="language-plaintext highlighter-rouge">PATH</code>, <code class="language-plaintext highlighter-rouge">LD_LIBRARY_PATH</code>, and <code class="language-plaintext highlighter-rouge">CPLUS_INCLUDE_PATH</code> to include the CUDA libraries and the CUDA compiler. You can add the following lines to your <code class="language-plaintext highlighter-rouge">~/.bashrc</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">CPLUS_INCLUDE_PATH</span><span class="o">=</span>/usr/local/cuda/include
<span class="nb">export </span><span class="nv">LD_LIBRARY_PATH</span><span class="o">=</span>/usr/local/cuda/lib64
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>/usr/local/cuda/bin/:<span class="nv">$PATH</span>
</code></pre></div></div>
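
<p>Reload the shell configuration and verify that the CUDA compiler is picked up:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>source ~/.bashrc
nvcc --version
</code></pre></div></div>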

<p>And done!</p>

<p>Let’s try a few examples:</p>

<h3 id="8-download-cuda-sample-suite">8. Download CUDA Sample Suite</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/NVIDIA/cuda-samples
<span class="nb">cd </span>cuda-samples
<span class="nb">cd </span>Samples/0_Introduction/matrixMul/
<span class="nb">mkdir </span>build
<span class="nb">cd </span>build
cmake ..
make
</code></pre></div></div>

<p>Now we can run the example:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./matrixMul
<span class="o">[</span>Matrix Multiply Using CUDA] - Starting...
GPU Device 0: <span class="s2">"Turing"</span> with compute capability 7.5

MatrixA<span class="o">(</span>320,320<span class="o">)</span>, MatrixB<span class="o">(</span>640,320<span class="o">)</span>
Computing result using CUDA Kernel...
<span class="k">done
</span><span class="nv">Performance</span><span class="o">=</span> 382.48 GFlop/s, <span class="nv">Time</span><span class="o">=</span> 0.343 msec, <span class="nv">Size</span><span class="o">=</span> 131072000 Ops, <span class="nv">WorkgroupSize</span><span class="o">=</span> 1024 threads/block
Checking computed result <span class="k">for </span>correctness: Result <span class="o">=</span> PASS

NOTE: The CUDA Samples are not meant <span class="k">for </span>performance measurements. Results may vary when GPU Boost is enabled.
</code></pre></div></div>

<h3 id="references">References</h3>

<p>[1] <a href="https://www.if-not-true-then-false.com/2015/fedora-nvidia-guide/">https://www.if-not-true-then-false.com/2015/fedora-nvidia-guide/</a></p>

<p>[2] <a href="https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/">https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/</a></p>]]></content><author><name>Juan Fumero, PhD</name><email>juan@jfumero.dev</email></author><category term="CUDA Toolkit" /><category term="Drivers" /><category term="Installation" /><category term="Linux" /><category term="NVIDIA" /><category term="Oracle Linux 10" /><summary type="html"><![CDATA[How to install NVIDIA 580 Drivers and CUDA 13.0 Toolkit on Oracle Linux 10]]></summary></entry><entry><title type="html">How to enable NVIDIA Nsight Compute CLI in Fedora</title><link href="https://jfumero.dev/posts/2025/07/04/nvidia-ncu-enable-fedora" rel="alternate" type="text/html" title="How to enable NVIDIA Nsight Compute CLI in Fedora" /><published>2025-07-04T00:00:00+01:00</published><updated>2025-07-04T00:00:00+01:00</updated><id>https://jfumero.dev/posts/2025/07/04/nvidia-ncu-enable-fedora</id><content type="html" xml:base="https://jfumero.dev/posts/2025/07/04/nvidia-ncu-enable-fedora"><![CDATA[<p>When working with CUDA, <a href="https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html">NVIDIA’s Nsight Compute CLI</a> (<code class="language-plaintext highlighter-rouge">ncu</code>) is an indispensable command-line tool for profiling your CUDA applications. It lets you peek under the hood to see exactly how your code is performing on the GPU.</p>

<p>For instance, you can easily profile a CUDA application with a command like this:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ncu <span class="nt">-o</span> myProfileData <span class="nt">--set</span> full ./cuda_sample
</code></pre></div></div>

<p>The command above generates a file <code class="language-plaintext highlighter-rouge">myProfileData.ncu-rep</code> which we can inspect with NVIDIA Nsight GUI to see all the profiled data.</p>
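
<p>For instance, assuming the Nsight Compute GUI is installed and on the <code class="language-plaintext highlighter-rouge">PATH</code>, the report can be opened with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ncu-ui myProfileData.ncu-rep
</code></pre></div></div>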

<p>However, setting it up on Linux can be a bit tricky.</p>

<h2 id="setting-it-up-on-linux">Setting It Up on Linux</h2>

<p>The first time you use <code class="language-plaintext highlighter-rouge">ncu</code>, you might get the following error:</p>

<blockquote>
  <p>The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM</p>
</blockquote>

<p>Don’t worry, this is a common setup issue and it’s easy to fix! It means you need to grant users permission to access the GPU’s performance counters.</p>

<p>What you need to do is add a new line to <code class="language-plaintext highlighter-rouge">/etc/modprobe.d/nvidia.conf</code> with the following content:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>options nvidia <span class="nv">NVreg_RestrictProfilingToAdminUsers</span><span class="o">=</span>0
</code></pre></div></div>

<p>Then, you need to reboot the machine. Now you should be able to run the <code class="language-plaintext highlighter-rouge">ncu</code> command and profile your CUDA programs.</p>
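
<p>After the reboot, we can verify that the driver picked up the option. On my systems, the setting surfaces in the module parameters as <code class="language-plaintext highlighter-rouge">RmProfilingAdminOnly</code>, which should now read <code class="language-plaintext highlighter-rouge">0</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat /proc/driver/nvidia/params | grep RmProfilingAdminOnly
</code></pre></div></div>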

<h2 id="a-note-for-fedora-and-similar-systems">A Note for Fedora and Similar Systems</h2>

<p>Following the <a href="https://developer.nvidia.com/nvidia-development-tools-solutions-err_nvgpuctrperm-permission-issue-performance-counters">NVIDIA guidelines</a>,
on some Linux distributions like Fedora, you might also need to rebuild the <code class="language-plaintext highlighter-rouge">initrd</code>.
If the reboot alone doesn’t do the trick, rebuild it as follows:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dracut <span class="nt">--regenerate-all</span> <span class="nt">-f</span>
</code></pre></div></div>

<h3 id="links">Links:</h3>

<ul>
  <li>
    <p><a href="https://developer.nvidia.com/nvidia-development-tools-solutions-err_nvgpuctrperm-permission-issue-performance-counters">developer.nvidia</a></p>
  </li>
  <li>
    <p><a href="https://hychiang.info/blog/2024/nsight-compute-permission-error/">https://hychiang.info/blog/2024/nsight-compute-permission-error</a></p>
  </li>
</ul>]]></content><author><name>Juan Fumero, PhD</name><email>juan@jfumero.dev</email></author><category term="Fedora" /><category term="NVIDIA Nsight Compute CLI" /><category term="Linux" /><summary type="html"><![CDATA[How to enable NVIDIA Nsight Compute CLI in Fedora]]></summary></entry><entry><title type="html">How to disable auto-update in Fedora</title><link href="https://jfumero.dev/posts/2025/06/25/fedora-disable-autoupdate" rel="alternate" type="text/html" title="How to disable auto-update in Fedora" /><published>2025-06-25T00:00:00+01:00</published><updated>2025-06-25T00:00:00+01:00</updated><id>https://jfumero.dev/posts/2025/06/25/fedora-disable-autoupdate</id><content type="html" xml:base="https://jfumero.dev/posts/2025/06/25/fedora-disable-autoupdate"><![CDATA[<p>To disable the automatic updates at restart run the following command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gsettings <span class="nb">set </span>org.gnome.software download-updates <span class="nb">false</span>
</code></pre></div></div>

<p>You can still update manually using <code class="language-plaintext highlighter-rouge">dnf</code>; the command above only disables the automatic downloads.
This is important, especially if you have configured custom kernels or third-party modules (e.g., NVIDIA drivers).</p>

<h4 id="disable-kernel-updates">Disable Kernel Updates</h4>

<p>If you have installed a custom kernel or a third-party kernel module, you can also disable kernel updates.</p>

<p>To do so, edit the file <code class="language-plaintext highlighter-rouge">/etc/dnf/dnf.conf</code> and add the following line within the <code class="language-plaintext highlighter-rouge">[main]</code> section:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">exclude</span><span class="o">=</span>kernel<span class="k">*</span>
</code></pre></div></div>]]></content><author><name>Juan Fumero, PhD</name><email>juan@jfumero.dev</email></author><category term="Fedora" /><category term="Linux" /><summary type="html"><![CDATA[Linux command to disable Fedora's automatic updates at restart]]></summary></entry><entry><title type="html">Configuring Unsloth on Linux for LLM Fine Tuning</title><link href="https://jfumero.dev/posts/2025/04/17/unsloth-linux-install" rel="alternate" type="text/html" title="Configuring Unsloth on Linux for LLM Fine Tuning" /><published>2025-04-17T00:00:00+01:00</published><updated>2025-04-17T00:00:00+01:00</updated><id>https://jfumero.dev/posts/2025/04/17/unsloth-linux-install</id><content type="html" xml:base="https://jfumero.dev/posts/2025/04/17/unsloth-linux-install"><![CDATA[<h2 id="what-is-unsloth">What is <code class="language-plaintext highlighter-rouge">unsloth</code>?</h2>

<p><a href="https://unsloth.ai/">Unsloth</a> is a Python framework focused on optimizing the fine-tuning of Large Language Models (LLMs) specifically for NVIDIA GPUs on both Linux and Windows. It leverages existing LLM frameworks for training and fine-tuning, such as the Hugging Face 🤗 Transformers library.</p>

<p>It’s important to understand that <code class="language-plaintext highlighter-rouge">unsloth</code> is not a complete fine-tuning framework itself. Instead, it acts as an optimization layer, providing low-level utilities for quantization and performance enhancements to accelerate the fine-tuning process.</p>

<p>Despite the comprehensive documentation available on the <a href="https://docs.unsloth.ai/get-started/beginner-start-here">Unsloth website</a>, the installation steps weren’t entirely straightforward for me. To help others facing the same issues, this guide details the configuration of Unsloth with an NVIDIA GPU on Fedora 41/42 and Ubuntu WSL systems.</p>

<h2 id="installing-unsloth-locally">Installing Unsloth Locally</h2>

<p>At the time of writing this post, <code class="language-plaintext highlighter-rouge">unsloth</code> requires <code class="language-plaintext highlighter-rouge">Python &gt;= 3.9</code> and <code class="language-plaintext highlighter-rouge">&lt;= 3.13</code>. Systems such as Fedora 41/42 and Ubuntu 24 come with a newer version of Python, so we need to set up an older version.</p>

<h3 id="installing-spack">Installing <code class="language-plaintext highlighter-rouge">spack</code></h3>

<p>Fortunately, this is an easy process with the help of <a href="https://spack.io/"><code class="language-plaintext highlighter-rouge">spack</code></a>, a software package manager for Linux.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone <span class="nt">-c</span> feature.manyFiles<span class="o">=</span><span class="nb">true</span> <span class="nt">--depth</span><span class="o">=</span>2 https://github.com/spack/spack.git

<span class="nb">.</span> spack/share/spack/setup-env.sh
</code></pre></div></div>

<h3 id="installing-python-312x">Installing Python 3.12.X</h3>

<p>Then, we can install Python 3.12.7. Check all versions available with <code class="language-plaintext highlighter-rouge">spack</code>: <a href="https://packages.spack.io/package.html?name=python">https://packages.spack.io/package.html?name=python</a>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>spack <span class="nb">install </span>python@3.12.7
</code></pre></div></div>

<h3 id="configure-a-new-environment-for-python">Configure a new environment for Python</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>spack load python@3.12.7

python <span class="nt">-m</span> venv ~/bin/venv/
</code></pre></div></div>
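
<p>Remember to activate the new environment before installing any packages:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>source ~/bin/venv/bin/activate
</code></pre></div></div>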

<h3 id="installing-pytorch">Installing PyTorch</h3>

<p>Next, we can install PyTorch. Note that, at the time of writing this post (April 2025), <code class="language-plaintext highlighter-rouge">unsloth</code> supports PyTorch 2.5.0 and 2.4.0.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span><span class="nv">torch</span><span class="o">==</span>2.5.0 <span class="nv">torchvision</span><span class="o">==</span>0.20.0 <span class="nv">torchaudio</span><span class="o">==</span>2.5.0
</code></pre></div></div>
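
<p>A quick sanity check that PyTorch was installed correctly and can see the GPU:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
</code></pre></div></div>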

<h3 id="installing-unsloth">Installing <code class="language-plaintext highlighter-rouge">unsloth</code></h3>

<p>Finally, we can install <code class="language-plaintext highlighter-rouge">unsloth</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip3 <span class="nb">install</span> <span class="s2">"unsloth[cu124-torch250] @ git+https://github.com/unslothai/unsloth.git"</span>
</code></pre></div></div>

<h3 id="some-extra-packages">Some extra packages</h3>

<p>We also need to install a few libraries to store llama-based models:</p>

<p>Fedora 41/42:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>dnf <span class="nb">install </span>ccache curl-devel
</code></pre></div></div>

<p>Ubuntu 24 WSL:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt-get <span class="nb">install </span>cmake ccache libcurl4-gnutls-dev 
</code></pre></div></div>

<h2 id="how-to-update-unsloth">How to update <code class="language-plaintext highlighter-rouge">unsloth</code>?</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install</span> <span class="nt">--upgrade</span> unsloth unsloth_zo
</code></pre></div></div>

<p>Now we have all the tools available to create new fine-tuned models on our machines.</p>

<h2 id="whats-next">What’s next?</h2>

<p>You can start developing your own Python programs to build fine-tuned LLMs based on Llama, Mistral, etc. The <code class="language-plaintext highlighter-rouge">unsloth</code> documentation covers a wide range of use cases:</p>

<p><a href="https://docs.unsloth.ai/get-started/all-our-models">https://docs.unsloth.ai/get-started/all-our-models</a></p>]]></content><author><name>Juan Fumero, PhD</name><email>juan@jfumero.dev</email></author><category term="unsloth" /><category term="LLM" /><category term="finetuning" /><summary type="html"><![CDATA[This guide details the configuration of Unsloth to build fine-tuned LLM models on NVIDIA GPUs on Linux systems.]]></summary></entry><entry><title type="html">Accelerating Java programs on RISC-V CPUs with Vector Instructions</title><link href="https://jfumero.dev/posts/2025/05/05/riscv-java-acceleration" rel="alternate" type="text/html" title="Accelerating Java programs on RISC-V CPUs with Vector Instructions" /><published>2025-04-04T00:00:00+01:00</published><updated>2025-04-04T00:00:00+01:00</updated><id>https://jfumero.dev/posts/2025/05/05/riscv-java-acceleration</id><content type="html" xml:base="https://jfumero.dev/posts/2025/05/05/riscv-java-acceleration"><![CDATA[<p>This article provides a high-level overview of the RISC-V instruction set architecture, illustrating its modular design through the examination of specific processor implementations. 
In addition, it discusses the current support for Java on RISC-V. 
Finally, this article explores the acceleration of Java data-parallel applications on RISC-V CPUs utilizing vector instructions. 
Specifically, I describe how frameworks like TornadoVM and the oneAPI Construction Kit enable significant performance gains compared to standard Java execution, 
showcasing the potential of RISC-V for data parallel workloads.</p>

<p>This post is based on a recent paper, <a href="https://pure.manchester.ac.uk/ws/portalfiles/portal/361946410/main.pdf"><em>Leveraging RISC-V Vectorization: Accelerating Java Programs with TornadoVM and OCK at the RISC-V EU Summit 2025</em></a>, 
and it is a collaboration between The University of Manchester and Codeplay Software Ltd. through the <a href="https://aero-project.eu/">AERO European Project</a>.</p>

<p>My goal with this article is to expand, in more detail, on the technique and the technologies involved to achieve hardware acceleration on these niche processors. 
Hopefully, by the end of this article, you will have a better understanding of the RISC-V ecosystem, the status of Java for RISC-V, 
and a possible approach to enable RISC-V CPUs as hardware accelerators for Java programs.</p>

<h2 id="high-level-overview-of-risc-v">High-Level Overview of RISC-V</h2>

<p>RISC-V is an open standard and royalty-free Instruction Set Architecture (ISA) based on the Reduced Instruction Set Computing (RISC) principles. 
It is designed to provide an open alternative to proprietary ISAs, enabling both academia and industry to innovate and craft custom processor designs without licensing fees.</p>

<p>In my view, a key strength of RISC-V (apart from the open designs) is its modularity and extensibility. 
RISC-V is built upon modular designs, enabling CPU architects and developers to tailor processors to their specific needs. 
For instance, a RISC-V 32/64-bit integer base design can be extended with modules for multiply-divide operations, floating-point arithmetic, 
and vector processing, each meticulously defined within the RISC-V specification.</p>

<p>Furthermore, it allows CPU hardware implementers to pick and choose the modules they need. 
For instance, if a design doesn’t necessitate FP64 (double-precision floating-point) computations, the ‘D’ extension can be omitted, 
streamlining the hardware and reducing complexity.</p>

<p>To support the ongoing evolution and maintenance of the RISC-V specification and its ecosystem, RISC-V International (formerly the RISC-V Foundation) was established. 
This organization plays a crucial role in ensuring the standardization and growth of this architecture through various activities and working groups.</p>

<p>In what follows, we will investigate how Java can be used with RISC-V processors, and how it can utilize parallel functional units (such as the vector units) 
to process data much faster with the help of TornadoVM and the oneAPI Construction Kit.</p>

<h2 id="what-options-are-available-for-risc-v-hardware-in-2025">What options are available for RISC-V hardware in 2025?</h2>

<p>RISC-V hardware presents exciting possibilities, but acquiring units was challenging in the past. 
When I began exploring RISC-V in 2018, hardware availability was extremely limited. Obtaining even a single board was difficult. 
At that time, SiFive was one of the key players, showcasing a prototype capable of running <a href="https://web.archive.org/web/20181005225710/https://www.sifive.com/chip-designer#fu540">Linux on a RISC-V64 SoC</a>.</p>

<p>Since then, many companies have emerged supporting and building on RISC-V. 
As of 2025, I see many Single Board Computers (SBCs) appearing in the market, 
such as the <a href="https://wiki.banana-pi.org/Banana_Pi_BPI-F3">Banana PI BPI-F3</a>, and <a href="https://sipeed.com/licheepi3a">Lichee PI 3A</a>. 
Those are the boards I am using for this blog post. These two boards are available on Amazon and AliExpress, 
and they cost around 150-200 euros each, depending on the internal capacity of the eMMC flash storage and RAM size.</p>

<p>These two SBCs have the same CPU, a Spacemit K1 processor that implements a RISCV64 GCVB - RVA22 Profile. Each letter represents an extension, or a group of extensions, over the base RISC-V 64 CPU. Let’s break down what these letters mean:</p>

<ul>
  <li>G represents a group of several extensions. It contains:
    <ul>
      <li>I: Integer base</li>
      <li>M: Integer multiplication and division</li>
      <li>A: Atomic instructions</li>
      <li>F: Single-precision floating point instructions</li>
      <li>D: Double-precision floating point instructions</li>
    </ul>
  </li>
</ul>

<p>In addition, this RISC-V implements CVB:</p>

<ul>
  <li>C: Compressed instructions</li>
  <li>V: Vector instructions. These are the ones we are interested in for the acceleration part with TornadoVM and OCK. More on this later.</li>
  <li>B: Bit manipulation instructions</li>
</ul>

<p>As you can see, RISC-V’s modular design allows processors to be highly customized. 
A processor’s capabilities are determined by the specific extensions it includes. To simplify software development for these varying configurations, 
RISC-V defines standardized profiles. 
These profiles group common extensions together, providing a target platform for general-purpose processors and making it easier for developers to 
create compatible applications. For the Spacemit K1 processor, the profile implemented is RVA22:</p>

<p><a href="https://github.com/riscv/riscv-profiles/blob/main/src/profiles.adoc#rva22-profiles">https://github.com/riscv/riscv-profiles/blob/main/src/profiles.adoc#rva22-profiles</a></p>

<p>In this way, it will be easier for software developers to build and support applications running on these architectures.</p>

<h2 id="what-kind-of-operating-system-can-you-run">What kind of Operating System can you run?</h2>

<p>The Banana PI F3 and Lichee PI3 SBCs support <a href="https://bianbu.spacemit.com/en">Bianbu OS</a>, 
a customized Ubuntu-based distribution for RISC-V developed by <a href="https://www.bit-brick.com/about-us/">Bit-Brick</a>. 
While Bianbu was my primary option, other distributions like Debian, ARMbian, and Fedora can also be used, 
though I haven’t tested them myself.</p>

<p>Installation instructions for Bianbu on the SBC can be found at:</p>

<p><a href="https://wiki.banana-pi.org/Banana_Pi_BPI-F3#System_Image">https://wiki.banana-pi.org/Banana_Pi_BPI-F3#System_Image</a></p>

<p>For better performance, I recommend installing the OS on the internal eMMC memory using the provided <a href="https://docs.banana-pi.org/en/BPI-F3/BananaPi_BPI-F3#_tools">Titan Tools</a>. 
This significantly improves speed compared to running the OS from an SD card.</p>

<p>The difference in read throughput performance is shown in the following examples:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## SD card</span>
<span class="nb">sudo </span>hdparm <span class="nt">-t</span> <span class="nt">--direct</span> /dev/mmcblk0
/dev/mmcblk0:
 Timing O_DIRECT disk reads: 240 MB <span class="k">in  </span>3.01 seconds <span class="o">=</span>  79.69 MB/sec
</code></pre></div></div>

<p>Compare that with the read speeds of the internal eMMC storage (where the OS is installed):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Internal SSD</span>
<span class="nv">$ </span><span class="nb">sudo </span>hdparm <span class="nt">-t</span> <span class="nt">--direct</span> /dev/mmcblk2    
/dev/mmcblk2:  
Timing O_DIRECT disk reads: 580 MB <span class="k">in  </span>3.00 seconds <span class="o">=</span> 193.31 MB/sec 
</code></pre></div></div>

<p>For even faster performance, I also recommend installing an SSD and working with your files in this space.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>hdparm <span class="nt">-t</span> <span class="nt">--direct</span> /dev/nvme0n1    
/dev/nvme0n1:  
Timing O_DIRECT disk reads: 1898 MB <span class="k">in  </span>3.00 seconds <span class="o">=</span> 632.25 MB/sec
</code></pre></div></div>

<p>The following image shows the Banana PI F3 running Bianbu OS 1.0.5. The Banana PI F3 is located on the left-hand side.</p>

<p><img src="https://raw.githubusercontent.com/jjfumero/jjfumero.github.io/refs/heads/master/files/blog/25-04-riscv/setup1.jpeg" alt="Alt text" /></p>

<p>Before discussing the entire software stack to run TornadoVM on these SBCs, let’s briefly discuss the processor and the features of each SBC.</p>

<p>As I mentioned, the CPU present in the BananaPI F3 and the Lichee PI 3 from SiPEED is the same, the Spacemit K1 processor, 
which contains 8 RISC-V cores able to run vector instructions compliant with the RVV 1.0. 
This is important, for me at least, since the software dependencies for TornadoVM to run on this hardware generate RISC-V RVV 1.0 instructions. 
There are other RISC-V boards on the market (e.g., the <a href="https://milkv.io/pioneer">Milk-V Pioneer</a>, which implements RISC-V RVV 0.7 instead).</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>lscpu 
Architecture:          riscv64
  Byte Order:          Little Endian
CPU<span class="o">(</span>s<span class="o">)</span>:                8
  On-line CPU<span class="o">(</span>s<span class="o">)</span> list: 0-7
Model name:            Spacemit<span class="o">(</span>R<span class="o">)</span> X60
  Thread<span class="o">(</span>s<span class="o">)</span> per core:  1
  Core<span class="o">(</span>s<span class="o">)</span> per socket:  8
  Socket<span class="o">(</span>s<span class="o">)</span>:           1
  CPU<span class="o">(</span>s<span class="o">)</span> scaling MHz:  100%
  CPU max MHz:         1600.0000
  CPU min MHz:         614.4000
Caches <span class="o">(</span><span class="nb">sum </span>of all<span class="o">)</span>:   
  L1d:                 256 KiB <span class="o">(</span>8 instances<span class="o">)</span>
  L1i:                 256 KiB <span class="o">(</span>8 instances<span class="o">)</span>
  L2:                  1 MiB <span class="o">(</span>2 instances<span class="o">)</span>
</code></pre></div></div>
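
<p>We can also confirm that the vector extension is exposed to user space by inspecting the ISA string. The exact output varies per kernel version, but the <code class="language-plaintext highlighter-rouge">v</code> in the string indicates RVV support:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ grep -m1 isa /proc/cpuinfo
isa : rv64imafdcv...
</code></pre></div></div>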

<p>The Banana PI F3 that I got has 4GB of RAM, which, as we will see, can be very limiting when it comes to the installation of some of the software dependencies. 
At a later stage of the development for TornadoVM, my lab bought the Lichee PI 3 from SiPEED, which has the same processor but 16GB of RAM and 32GB of eMMC flash storage, 
which makes compilation of LLVM much easier.</p>

<p>So far, we have discussed general aspects of the RISC-V architectures, some real hardware and the OS. Now it is time to run Java.</p>

<h2 id="is-java-available-for-risc-v">Is Java available for RISC-V?</h2>

<p>Since TornadoVM accelerates Java programs, we need to run Java applications on RISC-V. 
But, is Java ready for this new CPU architecture?</p>

<p>The first port for RISC-V is <a href="https://openjdk.org/jeps/422">JEP 422</a>, which supports RISC-V RV64GV (and by now, we know what these letters mean). 
This RISC-V port was originally provided by Huawei, and followed up by Alibaba, Rivos, ISCAS and Syntacore. 
It was merged for JDK 19, and it contains <a href="https://jcp.org/aboutJava/communityprocess/ec-public/materials/2024-04-24/JCP-State_of_OpenJDK_on_RISC-V.pdf">the port for the template interpreter, the C1 and C2 compilers, and all mainline GCs</a>.</p>

<p>Since TornadoVM currently uses JDK 21, <a href="https://devops.com/what-is-risc-v-and-why-has-it-become-important-for-java-2/">the RISC-V port is already included</a>, which is great news!</p>

<p>The one I am currently using is from <a href="https://bell-sw.com/pages/downloads/#jdk-21-lts">BellSoft</a>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>java <span class="nt">--version</span>
openjdk 21.0.6 2025-01-21 LTS
OpenJDK Runtime Environment <span class="o">(</span>build 21.0.6+10-LTS<span class="o">)</span>
OpenJDK 64-Bit Server VM <span class="o">(</span>build 21.0.6+10-LTS, mixed mode<span class="o">)</span>
</code></pre></div></div>

<h2 id="so-can-we-run-tornadovm">So, can we run TornadoVM?</h2>

<p>TornadoVM depends on the implementation of low-level parallel programming models such as OpenCL, Level Zero or CUDA PTX. 
As far as I know, there are no Level Zero or CUDA PTX implementations for the RISC-V architecture. However, we can find some implementations for OpenCL.</p>

<p>The oneAPI Construction Kit (OCK for short) is a framework for implementing open standards on new hardware accelerators. 
OCK includes a runtime for CPUs to run OpenCL C programs as well as dispatch SPIR-V kernels. 
And this is exactly what TornadoVM needs in order to accelerate Java methods on new hardware.</p>

<p>But, not only that, OCK can also auto-vectorize OpenCL and SPIR-V programs to run on RISC-V with RVV 1.0 vector instructions, 
which can potentially increase the performance of our data-parallel Java methods. 
Let’s explore what vectorization means, and how it can be enabled before we start running some experiments on this platform.</p>

<h2 id="vectorization">Vectorization</h2>

<p><a href="https://ieeexplore.ieee.org/document/10812086">Vectorization</a> is a parallel computing technique that performs the same arithmetic operation on multiple data elements at a time. 
The number of elements processed in parallel depends on the processor’s vector unit capabilities, typically handling 2, 4, 8, 16, or more data items at once. 
This technique is widely used to accelerate multimedia and data-parallel applications, including LLMs these days!</p>

<p>For example, modern Intel CPUs implement <a href="https://www.intel.com/content/www/us/en/developer/articles/technical/intel-avx-512-instructions.html">AVX and AVX-512</a> instructions; 
a single 512-bit AVX-512 instruction can operate on 16 FP32 values at a time.</p>

<p>The following Figure shows a high-level representation of vectorization. 
Consider a for loop performing vector addition. 
In a scalar execution, each iteration processes a single element from each array, performing the addition, and storing the result in the corresponding position.</p>

<p><img src="https://raw.githubusercontent.com/jjfumero/jjfumero.github.io/refs/heads/master/files/blog/25-04-riscv/vectorization.png" alt="Alt text" /></p>

<p>Unlike scalar operations, vectorization, as shown in the figure’s right-hand side, enables simultaneous processing of multiple data elements. 
The figure exemplifies this with a four-element operation. For illustrative purposes, we assume a single CPU clock cycle per operation, 
acknowledging that actual cycle counts depend on the operation and CPU architecture. 
However, this simplification effectively highlights the performance benefits of parallel computation. 
This speedup is achieved through replicated functional units in CPUs equipped with vector instructions.</p>

<p><strong>But how do you write vector code?</strong> There are a few approaches: 
a) via libraries, which is known as explicit vectorization; 
b) via constructs in a programming language or a parallel programming model (e.g., the <a href="https://cilkplus.github.io/">CilkPlus programming</a> model using the array notation); 
and c) auto-vectorization, in which compilers can generate vector code from a scalar code.</p>

<p>Each approach has its pros and cons. But the auto-vectorization approach is ideal and probably the hardest to achieve. 
Modern compilers, such as the Java C2 compiler, can auto-vectorize code for x64 and ARM64.
However, explicit use of the vector units via the Java Vector API can yield higher performance, as shown in <a href="https://dl.acm.org/doi/10.1145/3578360.3580265">this paper</a>.</p>

<p>But what about auto-vectorization of OpenCL programs? How does it compare? In the rest of this post, I am going to explore exactly that.</p>

<h2 id="workflow-for-auto-vectorization-in-tornadovm-with-ock">Workflow for auto-vectorization in TornadoVM with OCK</h2>

<p>TornadoVM compiles Java methods from Java bytecode to OpenCL C and SPIR-V binaries. 
Then, the resulting optimized OpenCL/SPIR-V code is dispatched via the OpenCL runtime.</p>

<p>The compilation process is shown in the Figure below. The input application is written using the TornadoVM APIs and it contains three main parts:</p>

<ol>
  <li>Identify the parallel loops (using the <code class="language-plaintext highlighter-rouge">@Parallel</code> annotation), as we can see in the left-hand side of the Figure. Note that the example represents a parallel version of the matrix multiplication, and it operates on scalar types.</li>
  <li>Task-Graph build: then we build a task graph, which contains the definition of the methods to offload, and the data involved (e.g., arrays and matrices we want to send to the accelerator).</li>
  <li>Finally, we create an execution plan from the task-graph, and execute it.</li>
</ol>

<p>This overview provides a high-level description of the TornadoVM programming model. For a more detailed exploration, including application development guidelines, please refer to one of my previous <a href="https://jjfumero.github.io/posts/2024/23/tornadovm-programming-model">posts</a>.</p>

<p>The primary objective of this article is to illustrate the compilation and execution process of TornadoVM on RISC-V, with a specific focus on auto-vectorization.</p>

<p><img src="https://raw.githubusercontent.com/jjfumero/jjfumero.github.io/refs/heads/master/files/blog/25-04-riscv/flow.png" alt="Alt text" /></p>

<p>At runtime, TornadoVM builds a graph (it is actually the Graal IR that represents all methods to offload), and optimizes the code. 
TornadoVM has a pipeline of many compiler optimizations that are interleaved with current Graal compiler optimizations. 
Some examples are loop interchange, <a href="https://www.youtube.com/watch?v=xj8Te517Wtc">data-parallel loop transformations, intrinsics exploration, etc</a>.</p>

<p>Once the code has been optimized, TornadoVM generates the corresponding OpenCL C/SPIR-V codes. 
Note that the generated code is still scalar code. No auto-vectorization is applied, just yet.</p>

<p>After the code generation, TornadoVM builds the generated program via the OpenCL runtime using <code class="language-plaintext highlighter-rouge">clBuildProgram</code>. 
In this step, the generated code is further compiled for the selected target platform; in this case, the RISC-V 64 CPU platform using OCK.</p>

<p>OCK also contains a JIT compiler to optimize the OpenCL C/SPIR-V code for the RISC-V 64 CPU. 
In this step, the code is actually auto-vectorized. 
Thus, from the input Java scalar code, we have reached, hopefully, a vectorized code optimized for RISC-V 64. How cool is this?</p>

<p>Ok, enough talk. Let’s see this in action. 
In the rest of the post, I will explain how to compile LLVM, OCK and TornadoVM to run on RISC-V, and show some performance analysis of the traditional matrix multiplication application running on this CPU.</p>

<h2 id="building-llvm-and-ock-from-source">Building LLVM and OCK from source</h2>

<p>At the time of writing this post (April 2025), there are no prebuilts of OCK for RISC-V 64. Thus, we need to build OCK from source. 
The OCK source code is an open-source project under the <a href="https://uxlfoundation.org/">Unified Acceleration (UXL) Foundation</a>, and it depends on LLVM 19, so we are going to build LLVM from source as well.</p>

<p>Configure the dependencies for LLVM and OCK:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>python3-virtualenv python3-psutil
<span class="nb">sudo </span>apt <span class="nb">install</span> <span class="nt">-y</span> build-essential git cmake libtinfo-dev python3
<span class="nb">sudo </span>apt-get <span class="nt">-y</span> <span class="nb">install </span>gcc-riscv64-linux-gnu
<span class="nb">sudo </span>apt-get <span class="nb">install </span>spirv-tools
</code></pre></div></div>

<p>Clone the repositories and build LLVM:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone <span class="nt">--depth</span> 1 <span class="nt">--branch</span><span class="o">=</span>release/19.x git@github.com:llvm/llvm-project.git llvm 
git clone <span class="nt">--depth</span> 1 git@github.com:uxlfoundation/oneapi-construction-kit.git 
</code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cmake llvm <span class="se">\</span>
<span class="nt">-Bbuild</span> <span class="nt">-GNinja</span> <span class="se">\</span>
<span class="nt">-DLLVM_ENABLE_DIA_SDK</span><span class="o">=</span>OFF <span class="se">\</span>
<span class="nt">-DCMAKE_INSTALL_PREFIX</span><span class="o">=</span>llvm_install <span class="se">\</span>
<span class="nt">-DLLVM_ENABLE_ZLIB</span><span class="o">=</span>FALSE <span class="se">\</span>
<span class="nt">-DLLVM_ENABLE_ZSTD</span><span class="o">=</span>FALSE <span class="se">\</span>
<span class="nt">-DLLVM_ENABLE_Z3_SOLVER</span><span class="o">=</span>FALSE <span class="se">\</span>
<span class="nt">-DLLVM_ENABLE_PROJECTS</span><span class="o">=</span><span class="s2">"clang;lld"</span> <span class="se">\</span>
<span class="nt">-DLLVM_TARGETS_TO_BUILD</span><span class="o">=</span><span class="s2">"RISCV"</span> <span class="se">\</span>
<span class="nt">-DCMAKE_BUILD_TYPE</span><span class="o">=</span>Release <span class="se">\</span>
<span class="nt">-DLLVM_ENABLE_ASSERTIONS</span><span class="o">=</span>ON <span class="se">\ </span><span class="nt">-DCMAKE_TOOLCHAIN_FILE</span><span class="o">=</span>/mnt/data/ock/oneapi-construction-kit/platform/riscv64-linux/riscv64-gcc-toolchain.cmake <span class="se">\</span>
<span class="nt">-DLLVM_HOST_TRIPLE</span><span class="o">=</span>riscv64-unknown-linux-gnu <span class="se">\</span>
<span class="nt">-DLLVM_BUILD_LLVM_DYLIB</span><span class="o">=</span>ON <span class="se">\</span>
<span class="nt">-DLLVM_LINK_LLVM_DYLIB</span><span class="o">=</span>ON
</code></pre></div></div>

<p>Note that the <code class="language-plaintext highlighter-rouge">-DCMAKE_TOOLCHAIN_FILE</code> flag needs to point to the CMake toolchain file provided by OCK:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">-DCMAKE_TOOLCHAIN_FILE</span><span class="o">=</span>/path/to/oneapi-construction-kit/platform/riscv64-linux/riscv64-gcc-toolchain.cmake 
</code></pre></div></div>

<p>Then:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ninja <span class="nt">-C</span> build <span class="nb">install</span>
</code></pre></div></div>

<h2 id="keep-an-eye-on-ram-swapping-and-thermals">Keep an eye on RAM, Swapping and Thermals</h2>

<p>If you compile LLVM on a board with only 4GB of RAM, you might end up swapping quickly. 
That was my case when I first built LLVM on the Banana PI F3 4GB. 
To avoid swapping, you can tell LLVM to build with 1 or 2 threads by adding these two flags to the configure step:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">-DLLVM_PARALLEL_LINK_JOBS</span><span class="o">=</span>1 <span class="nt">-DLLVM_PARALLEL_COMPILE_JOBS</span><span class="o">=</span>2
</code></pre></div></div>

<p>Then:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">CMAKE_BUILD_PARALLEL_LEVEL</span><span class="o">=</span>1
cmake <span class="nt">--build</span> build <span class="nt">--target</span> <span class="nb">install</span>
</code></pre></div></div>

<p>Note that compilation may take some time. In fact, in my case, the back and forth with some parameter tuning took ~4 days. So, be patient!</p>

<p>Another thing to consider when compiling LLVM is temperature.
In my case, the Banana PI F3 did not come with active cooling. With passive cooling and normal use, it is fine. 
However, compiling LLVM is another story, and I ended up using an old laptop fan just for the time it took to compile LLVM:</p>

<p><img src="https://raw.githubusercontent.com/jjfumero/jjfumero.github.io/refs/heads/master/files/blog/25-04-riscv/setup2.jpeg" alt="Alt text" /></p>

<h2 id="compiling-ock-for-risc-v">Compiling OCK for RISC-V</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cmake <span class="nt">-GNinja</span> <span class="nt">-Bbuild-riscv-hw-vector</span> <span class="se">\</span>
<span class="nt">-DCA_ENABLE_DEBUG_SUPPORT</span><span class="o">=</span>ON <span class="se">\</span>
<span class="nt">-DCA_LLVM_INSTALL_DIR</span><span class="o">=</span>/mnt/data/ock/llvm/llvm_install <span class="se">\</span>
<span class="nt">-DCA_ENABLE_HOST_IMAGE_SUPPORT</span><span class="o">=</span>OFF <span class="se">\</span>
<span class="nt">-DCA_ENABLE_API</span><span class="o">=</span>cl <span class="se">\</span>
<span class="nt">-DCA_CL_ENABLE_ICD_LOADER</span><span class="o">=</span>ON <span class="se">\</span>
<span class="nt">-DCMAKE_INSTALL_PREFIX</span><span class="o">=</span><span class="nv">$PWD</span>/build-riscv-hw-vector/install <span class="se">\</span>
<span class="nt">-DCA_HOST_TARGET_RISCV64_FEATURES</span><span class="o">=</span><span class="s2">"+v"</span> 

ninja <span class="nt">-C</span> build-riscv-hw-vector <span class="nb">install</span>
</code></pre></div></div>

<p>Alternatively, to build in a single thread:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">CMAKE_BUILD_PARALLEL_LEVEL</span><span class="o">=</span>1
cmake <span class="nt">--build</span> build-riscv-hw-vector <span class="nt">--target</span> <span class="nb">install</span>
</code></pre></div></div>

<h2 id="final-configuration-for-opencl">Final Configuration for OpenCL</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>apt-get <span class="nb">install </span>clinfo 
<span class="nv">$ </span><span class="nb">cd</span> /usr/lib/riscv64-linux-gnu/
<span class="nv">$ </span><span class="nb">sudo ln</span> <span class="nt">-s</span> libOpenCL.so.1 libOpenCL.so
</code></pre></div></div>

<p>Additionally, create a new file under <code class="language-plaintext highlighter-rouge">/etc/OpenCL/vendors/</code> containing the path to the <code class="language-plaintext highlighter-rouge">libCL.so</code> that OCK generates.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cat</span> /etc/OpenCL/vendors/ock.icd 
/mnt/data/ock/oneapi-construction-kit/build-riscv-hw-vector/install/lib/libCL.so
</code></pre></div></div>

<p>Now we can run OpenCL!</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>clinfo                         
Number of platforms                               1
  Platform Name                                   ComputeAorta
  Platform Vendor                                 Codeplay Software Ltd.
  Platform Version                                OpenCL 3.0 ComputeAorta 4.0.0 Linux riscv64 <span class="o">(</span>Release, 08207aa8<span class="o">)</span>
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_codeplay_kernel_exec_info cl_codeplay_soft_math cl_khr_create_command_queue cl_khr_icd cl_codeplay_extra_build_options
  Platform Extensions with Version                cl_codeplay_kernel_exec_info                                       0x1000 <span class="o">(</span>0.1.0<span class="o">)</span>
                                                  cl_codeplay_soft_math                                              0x1000 <span class="o">(</span>0.1.0<span class="o">)</span>
                                                  cl_khr_create_command_queue                                      0x400000 <span class="o">(</span>1.0.0<span class="o">)</span>
                                                  cl_khr_icd                                                       0x400000 <span class="o">(</span>1.0.0<span class="o">)</span>
                                                  cl_codeplay_extra_build_options                                    0x6000 <span class="o">(</span>0.6.0<span class="o">)</span>
  Platform Numeric Version                        0xc00000 <span class="o">(</span>3.0.0<span class="o">)</span>
  Platform Extensions <span class="k">function </span>suffix             CODEPLAY
  Platform Host timer resolution                  0ns
</code></pre></div></div>

<p>Now we are ready to build TornadoVM for RISC-V.</p>

<h2 id="build-tornadovm-for-risc-v">Build TornadoVM for RISC-V</h2>

<p>Although TornadoVM is just a Java program, it has some dependencies that are not fully ported to RISC-V. 
However, with a small patch, TornadoVM can be installed on RISC-V systems. 
TornadoVM provides a script that automatically downloads and patches the code for RISC-V.</p>

<p>First, create a new Python environment:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python3 <span class="nt">-m</span> venv /mnt/data/python-env 
<span class="nv">$ </span><span class="nb">source</span> /mnt/data/python-env/bin/activate
<span class="nv">$ </span>pip3 <span class="nb">install </span>lit
</code></pre></div></div>

<p>Then, clone TornadoVM and build it with the patch for RISC-V (updated for TornadoVM <code class="language-plaintext highlighter-rouge">v1.1.1-dev</code>).</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Clone TornadoVM Repo</span>
<span class="nv">$ </span>git clone https://github.com/beehive-lab/TornadoVM.git

<span class="c">## Clone TornadoVM patch repo: </span>
<span class="nv">$ </span>git clone https://github.com/beehive-lab/tornadovm-riscv-patch.git

<span class="c">## Build for OpenCL only</span>
<span class="nv">$ </span>bash tornadovm-riscv-patch/apply-riscv-patch-opencl.sh 

<span class="c">## Build for OpenCL and SPIR-V </span>
<span class="nv">$ </span>bash tornadovm-riscv-patch/apply-riscv-patch-spirv.sh 

<span class="nv">$ </span><span class="nb">source </span>setvars.sh 
</code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tornado <span class="nt">--devices</span>

Number of Tornado drivers: 1
Driver: OpenCL1
  Total number of OpenCL devices  : 1
  Tornado <span class="nv">device</span><span class="o">=</span>0:0  <span class="o">(</span>DEFAULT<span class="o">)</span>
        OPENCL <span class="nt">--</span>  <span class="o">[</span>ComputeAorta] <span class="nt">--</span> ComputeAorta riscv64
                Global Memory Size: 3.9 GB
                Local Memory Size: 32.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: <span class="o">[</span>1024]
                Max WorkGroup Configuration: <span class="o">[</span>1024, 1024, 1024]
                Device OpenCL C version: OpenCL C 1.2 Clang 19.1.7
</code></pre></div></div>

<p><strong>Congratulations!</strong> TornadoVM is now running on RISC-V with OCK. Now we can run a few experiments and measure performance.</p>

<h2 id="checking-vector-instructions-for-risc-v">Checking Vector Instructions for RISC-V</h2>

<p>We can dump the assembly that OCK generates from the OpenCL C kernel produced by TornadoVM by setting the following environment variable:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">CA_HOST_DUMP_ASM</span><span class="o">=</span>1
</code></pre></div></div>

<p>Then, we can run any example with TornadoVM, and we will see the generated RISC-V assembly code. For example:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tornado <span class="nt">-m</span> tornado.examples/uk.ac.manchester.tornado.examples.compute.MatrixMultiplication2D 256
</code></pre></div></div>

<p>The generated RISC-V code can be very large, but we can see that vector instructions are used in some parts of the code:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.LBB3_33:
        add     t5, a4, a5
        add     t5, t5, s4
        vsetvli zero, zero, e64, m8, ta, ma
        vadd.vx v24, v8, t5
        vsetvli zero, zero, e32, m4, ta, ma
        vnsrl.wi        v20, v24, 0
        vmslt.vx        v0, v20, s8
        vsetvli zero, zero, e16, m2, ta, ma
        vmv.x.s a1, v0
        slli    a1, a1, 48
        beqz    a1, .LBB3_32
        vsetvli zero, zero, e64, m8, ta, ma
        li      a1, 32
        vsll.vx v24, v24, a1
        vsra.vx v24, v24, a1
        j       .LBB3_36
.LBB3_35:
        vsetvli zero, zero, e64, m8, ta, ma
        vadd.vx v24, v24, s2
        vmslt.vx        v20, v24, s8
        vmand.mm        v0, v20, v0
        vsetvli zero, zero, e16, m2, ta, ma
        vmv.x.s a1, v0
        slli    a1, a1, 48
        add     t5, t5, s2
        beqz    a1, .LBB3_32
</code></pre></div></div>

<p>Let’s see how this can impact performance.</p>

<h2 id="preliminary-results-on-risc-v">Preliminary Results on RISC-V</h2>

<p>Let’s run an experiment to see the performance we get by enabling auto-vectorization with TornadoVM and OCK. 
We are going to run Matrix Multiplication, a common algorithm widely used these days for AI and LLMs.</p>

<p>I ran this benchmark on the RISC-V Banana PI F3 with 4GB of RAM. The OS is Bianbu 1.0.5, and the software stack includes TornadoVM 1.0.10, OCK commit <code class="language-plaintext highlighter-rouge">65036b8</code>, 
LLVM 19.1.5 and GCC 13.2. The OpenJDK used is 21.0.5.</p>

<p>You can obtain the benchmark from the GitHub repository:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git clone https://github.com/beehive-lab/tornadovm-benchmarks
<span class="nv">$ </span><span class="nb">cd </span>tornado-benchmarks
<span class="nv">$ </span>./build.sh
</code></pre></div></div>

<p>To run, you need to copy the <code class="language-plaintext highlighter-rouge">setvars.sh</code> from the TornadoVM installation:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cp</span> /path/to/tornadovm/setvars.sh <span class="nb">.</span> 
<span class="nb">source </span>setvars.sh
./run.sh mxm 
</code></pre></div></div>

<p>The following plot shows the run-time distribution for 100 runs. 
Note that TornadoVM compiles the Java code in the first iteration, and subsequent iterations run directly with the compiled code.</p>
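
<p>As a side note, if you want to exclude that first-iteration compilation cost from your own measurements, the execution plan can be warmed up beforehand. A small sketch, assuming the <code class="language-plaintext highlighter-rouge">withWarmUp()</code> option of the TornadoVM v1.x execution-plan API:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: force JIT compilation before timing, assuming withWarmUp()
// from the TornadoVM v1.x execution-plan API.
executionPlan.withWarmUp();   // compile and run once, outside the measured region
long start = System.nanoTime();
executionPlan.execute();      // measured run uses the already-compiled code
long elapsed = System.nanoTime() - start;
System.out.println("Elapsed (ns): " + elapsed);
</code></pre></div></div>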

<p>The performance plot is read as follows. 
The x-axis shows different data sizes for the matrix multiplication. 
Each size was evaluated with single-threaded Java, Java with parallel streams, and TornadoVM using the OpenCL and SPIR-V backends. The y-axis shows runtime in nanoseconds. 
Thus, the lower, the better.</p>

<p><img src="https://raw.githubusercontent.com/jjfumero/jjfumero.github.io/refs/heads/master/files/blog/25-04-riscv/results.png" alt="Alt text" /></p>

<p>For small matrices, the Java sequential version performs very well. 
The cost of multi-threading and runtime thread-scheduling is not worth it for small data sizes. 
There is no auto-vectorization for the Java code as of April 2025. 
The Java streams version performs up to 8x faster on this 8-core machine.</p>

<p>However, for <strong>TornadoVM, performance is even higher: up to 32x faster than sequential Java, and up to 4x faster than Java streams</strong> on the same CPU. 
This is the effect of the auto-vectorizer combined with multi-threaded execution. 
Another highlight is that, for large matrix sizes (e.g., 512 and 1024), even the first iteration (which includes optimization and compilation) runs faster than the parallel stream execution.</p>

<h2 id="conclusions">Conclusions</h2>

<p>This post has given a general introduction to RISC-V, the modularity of RISC-V processors and a high-level overview of how ready Java is to run on RISC-V. 
Additionally, it has shown how to increase the performance of data-parallel applications written in Java by using TornadoVM and the oneAPI Construction Kit to exploit auto-vectorization on RISC-V processors.</p>

<p>The preliminary results from the Matrix Multiplication benchmark show a substantial speedup with TornadoVM compared to sequential Java and Java streams, highlighting the effectiveness of auto-vectorization and multi-threaded execution on RISC-V. 
While challenges exist, such as the need to build software from source and manage limited resources, the advancements in hardware availability and software support make RISC-V a very appealing platform for many developers, including Java developers.</p>

<h2 id="discussions">Discussions</h2>

<p>If you are interested, let’s keep the discussions active:</p>

<p><a href="https://github.com/jjfumero/jjfumero.github.io/discussions/15">https://github.com/jjfumero/jjfumero.github.io/discussions/15</a></p>

<h2 id="appendix">Appendix</h2>

<p>When using LLVM 19.1.7, I ran into an error caused by a duplicated definition. The error is as follows:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/mnt/data/ock/oneapi-construction-kit/modules/compiler/builtins/source/builtins.cl:10675:27: error: conflicting types <span class="k">for</span> <span class="s1">'printf'</span>
 10675 | int __attribute__<span class="o">((</span>weak<span class="o">))</span> <span class="nb">printf</span><span class="o">(</span>const constant char <span class="k">*</span>const restrict <span class="nb">fmt</span>, ...<span class="o">)</span><span class="p">;</span>
       |                           ^
/mnt/data/ock/oneapi-construction-kit/modules/compiler/builtins/include/builtins/builtins.h:16367:27: note: previous declaration is here
 16367 | int __attribute__<span class="o">((</span>weak<span class="o">))</span> <span class="nb">printf</span><span class="o">(</span>const constant char<span class="k">*</span> const restrict <span class="nb">fmt</span>, ...<span class="o">)</span><span class="p">;</span>
       |                           ^
1 error generated.
<span class="o">[</span>6/4830] Building CXX object modules/compiler/compiler_pipeline/CMakeFiles/compiler-pipeline.dir/source/define_mux_dma_pass.cpp.o^C
</code></pre></div></div>

<p>By removing one of these definitions, you can build OCK:</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh">diff --git a/modules/compiler/builtins/source/builtins.cl b/modules/compiler/builtins/source/builtins.cl
index 1c96f2d1..2ef30343 100644
</span><span class="gd">--- a/modules/compiler/builtins/source/builtins.cl
</span><span class="gi">+++ b/modules/compiler/builtins/source/builtins.cl
</span><span class="p">@@ -10672,7 +10672,7 @@</span> void __CL_BUILTIN_ATTRIBUTES prefetch(const global double16 *pointer,
 
 /*-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-*/
 
<span class="gd">-int __attribute__((weak)) printf(const constant char *const restrict fmt, ...);
</span><span class="gi">+//int __attribute__((weak)) printf(const constant char *const restrict fmt, ...);
</span> 
 /*-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-*/
</code></pre></div></div>

<p>Compile again:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ninja <span class="nt">-C</span> build <span class="nb">install</span>
</code></pre></div></div>]]></content><author><name>Juan Fumero, PhD</name><email>juan@jfumero.dev</email></author><category term="RISCV" /><category term="Java" /><category term="OCK" /><category term="TornadoVM" /><category term="Vectorization" /><category term="Performance" /><summary type="html"><![CDATA[Learn how to accelerate performance on RISC-V CPUs using TornadoVM & vector instructions]]></summary></entry><entry><title type="html">Building JDK with HSDIS on Linux</title><link href="https://jfumero.dev/posts/2025/02/14/jdk-hsdis-build" rel="alternate" type="text/html" title="Building JDK with HSDIS on Linux" /><published>2025-02-14T00:00:00+00:00</published><updated>2025-02-14T00:00:00+00:00</updated><id>https://jfumero.dev/posts/2025/02/14/jdk-hsdis-build</id><content type="html" xml:base="https://jfumero.dev/posts/2025/02/14/jdk-hsdis-build"><![CDATA[<h2 id="intro">Intro</h2>

<p><a href="https://github.com/openjdk/jdk/tree/master/src/utils/hsdis"><code class="language-plaintext highlighter-rouge">hsdis</code></a> is a disassembler plugin for the HotSpot JVM (Java Virtual Machine). 
The <code class="language-plaintext highlighter-rouge">hsdis</code> plugin is very useful for some Java developers that want to see the code generated by the JVM’s 
Just-In-Time (JIT) compiler into human-readable assembly language. 
Unfortunately, this plugin is not included out of the box with JDK (Java Development Kit), so we need to 
build our own JDK and enable the <code class="language-plaintext highlighter-rouge">hsdis</code> plugin manually.</p>

<p>In this post, I’ll walk you through the process of building <code class="language-plaintext highlighter-rouge">hsdis</code> for both the latest JDK 
(JDK 25 at the time of writing) and the latest Long-Term Support (LTS) version, JDK 21.<br />
I’ll focus on a Linux environment, so get your terminal ready!</p>

<h2 id="getting-the-dependencies">Getting the dependencies</h2>

<h3 id="for-fedora-41">For Fedora 41:</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>dnf <span class="nb">install </span>autoconf alsa-lib-devel cups-devel libXtst-devel libXt-devel libXrender-devel libXrandr-devel libXi-devel

<span class="nb">sudo </span>dnf <span class="nb">install </span>gmp gmp-devel mpfr mpfr-devel libmpc libmpc-devel
</code></pre></div></div>

<h3 id="for-ubuntu-2404-lts">For Ubuntu 24.04 LTS:</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt-get <span class="nb">install </span>autoconf libasound2-dev libcups2-dev libfontconfig1-dev libx11-dev libxext-dev libxrender-dev libxrandr-dev libxtst-dev libxt-dev texinfo 
<span class="nb">sudo </span>apt-get <span class="nb">install </span>libmpfr-dev libmpc-dev libgmp-dev
</code></pre></div></div>

<h2 id="get-binutils">Get Binutils</h2>

<p>Clone the <a href="https://www.gnu.org/software/binutils/"><code class="language-plaintext highlighter-rouge">binutils</code></a> project:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin/jdk/binutils/
git clone git://sourceware.org/git/binutils-gdb.git
<span class="nb">export </span><span class="nv">BIN_UTILS_DIR</span><span class="o">=</span><span class="nv">$PWD</span>/binutils-gdb
</code></pre></div></div>

<h2 id="build-a-jdk-with-hsdis-from-the-master-branch-eg-jdk-25">Build a JDK with HSDIS from the <code class="language-plaintext highlighter-rouge">master</code> branch (e.g., JDK 25)</h2>

<p>Clone the JDK repo:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin/jdk
git clone https://github.com/openjdk/jdk.git
<span class="nb">cd </span>jdk
</code></pre></div></div>

<p>And run the <code class="language-plaintext highlighter-rouge">configure</code> script with the following options:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bash configure <span class="nt">--with-hsdis</span><span class="o">=</span>binutils <span class="nt">--with-binutils-src</span><span class="o">=</span><span class="nv">$BIN_UTILS_DIR</span>
</code></pre></div></div>

<p>If the configuration succeeds, we can start the build:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make clean
make images 
make build-hsdis
make install-hsdis 
</code></pre></div></div>

<p>Finally, we load the environment for the new JDK:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">export </span><span class="nv">JAVA_HOME</span><span class="o">=</span><span class="nv">$PWD</span>/build/linux-x86_64-server-release/jdk/
<span class="nv">$ </span><span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span><span class="nv">$JAVA_HOME</span>/bin:<span class="nv">$PATH</span>

<span class="c">## Check </span>
<span class="nv">$ </span>java <span class="nt">--version</span>

openjdk 25-internal 2025-09-16
OpenJDK Runtime Environment <span class="o">(</span>build 25-internal-adhoc.juan.jdk<span class="o">)</span>
OpenJDK 64-Bit Server VM <span class="o">(</span>build 25-internal-adhoc.juan.jdk, mixed mode<span class="o">)</span>
</code></pre></div></div>

<h2 id="build-hsdis-for-jdk-21">Build <code class="language-plaintext highlighter-rouge">hsdis</code> for JDK 21</h2>

<p>The configuration process is almost identical to the upstream version, except that we need a specific version of <code class="language-plaintext highlighter-rouge">binutils</code>, 
namely 2.37, in order to build JDK 21.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd </span>BIN_UTILS_DIR
git checkout binutils-2_37
</code></pre></div></div>

<p>In addition, we can obtain an updated version of JDK 21 by changing the repository:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~/bin/jdk
git clone https://github.com/openjdk/jdk21u-dev.git
<span class="nb">cd </span>jdk21u-dev
</code></pre></div></div>

<p>I am going to use the JDK 21.0.6 update:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git checkout jdk-21.0.6-ga
</code></pre></div></div>

<p>Now, we can use the same command as the one used to build the upstream version:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bash configure <span class="nt">--with-hsdis</span><span class="o">=</span>binutils <span class="nt">--with-binutils-src</span><span class="o">=</span><span class="nv">$BIN_UTILS_DIR</span>
make clean
make images 
make build-hsdis
make install-hsdis 
</code></pre></div></div>

<p>Finally, we load the environment:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">export </span><span class="nv">JAVA_HOME</span><span class="o">=</span><span class="nv">$PWD</span>/build/linux-x86_64-server-release/jdk/
<span class="nv">$ </span><span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span><span class="nv">$JAVA_HOME</span>/bin:<span class="nv">$PATH</span>

<span class="c">## Check </span>
<span class="nv">$ </span>java <span class="nt">--version</span>

openjdk 21.0.6-internal 2025-01-21
OpenJDK Runtime Environment <span class="o">(</span>build 21.0.6-internal-adhoc.juan.jdk21u-dev<span class="o">)</span>
OpenJDK 64-Bit Server VM <span class="o">(</span>build 21.0.6-internal-adhoc.juan.jdk21u-dev, mixed mode<span class="o">)</span>
</code></pre></div></div>

<h2 id="enabling-the-disassembler-an-example">Enabling the Disassembler: An Example</h2>

<p>Let’s write an example and see the disassembler in action:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">SampleCompute</span> <span class="o">{</span>

  <span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="nc">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="o">{</span>
    <span class="nc">SampleCompute</span> <span class="n">compute</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">SampleCompute</span><span class="o">();</span>
    <span class="kt">int</span><span class="o">[]</span> <span class="n">array</span> <span class="o">=</span> <span class="k">new</span> <span class="kt">int</span><span class="o">[</span><span class="mi">100_000</span><span class="o">];</span>
    <span class="k">for</span><span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">array</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="o">{</span>
    	<span class="n">array</span><span class="o">[</span><span class="n">i</span><span class="o">]</span> <span class="o">=</span> <span class="n">compute</span><span class="o">.</span><span class="na">compute</span><span class="o">(</span><span class="n">i</span><span class="o">);</span>
    <span class="o">}</span>
  <span class="o">}</span>

  <span class="kd">private</span> <span class="kt">int</span> <span class="nf">compute</span><span class="o">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">)</span> <span class="o">{</span>
	<span class="k">return</span> <span class="o">(</span><span class="n">i</span> <span class="o">*</span> <span class="n">i</span><span class="o">)</span> <span class="o">+</span> <span class="n">i</span><span class="o">;</span>
  <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>We compile the program with <code class="language-plaintext highlighter-rouge">javac</code> as usual:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>javac SampleCompute.java 
</code></pre></div></div>

<p>To enable the disassembler output, we run <code class="language-plaintext highlighter-rouge">java</code> with the <code class="language-plaintext highlighter-rouge">-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly</code> options:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>java <span class="nt">-XX</span>:+UnlockDiagnosticVMOptions <span class="nt">-XX</span>:+PrintAssembly SampleCompute
</code></pre></div></div>

<p>This is very verbose because it dumps all compiled methods to the standard output. However, it is possible to add filters for specific methods using the <code class="language-plaintext highlighter-rouge">-XX:CompileCommand='print,MyKlass::method'</code> option. For example, the following command only enables the assembly dump for the <code class="language-plaintext highlighter-rouge">compute</code> method.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>java <span class="nt">-XX</span>:+UnlockDiagnosticVMOptions <span class="nt">-XX</span>:CompileCommand<span class="o">=</span><span class="s1">'print,SampleCompute.compute'</span> SampleCompute 
</code></pre></div></div>

<p>Sample output:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CompileCommand: print SampleCompute.compute bool print <span class="o">=</span> <span class="nb">true</span>

<span class="o">=============================</span> C1-compiled nmethod <span class="o">==============================</span>
<span class="nt">-----------------------------------</span> Assembly <span class="nt">-----------------------------------</span>

Compiled method <span class="o">(</span>c1<span class="o">)</span> 73  477       3       SampleCompute::compute <span class="o">(</span>6 bytes<span class="o">)</span>
 total <span class="k">in </span>heap  <span class="o">[</span>0x00007f8eb909e290,0x00007f8eb909e580] <span class="o">=</span> 752
 relocation     <span class="o">[</span>0x00007f8eb909e3e8,0x00007f8eb909e418] <span class="o">=</span> 48
 main code      <span class="o">[</span>0x00007f8eb909e420,0x00007f8eb909e4f8] <span class="o">=</span> 216
 stub code      <span class="o">[</span>0x00007f8eb909e4f8,0x00007f8eb909e528] <span class="o">=</span> 48
 oops           <span class="o">[</span>0x00007f8eb909e528,0x00007f8eb909e530] <span class="o">=</span> 8
 metadata       <span class="o">[</span>0x00007f8eb909e530,0x00007f8eb909e538] <span class="o">=</span> 8
 scopes data    <span class="o">[</span>0x00007f8eb909e538,0x00007f8eb909e548] <span class="o">=</span> 16
 scopes pcs     <span class="o">[</span>0x00007f8eb909e548,0x00007f8eb909e578] <span class="o">=</span> 48
 dependencies   <span class="o">[</span>0x00007f8eb909e578,0x00007f8eb909e580] <span class="o">=</span> 8

<span class="o">[</span>Disassembly]
<span class="nt">--------------------------------------------------------------------------------</span>
<span class="o">[</span>Constant Pool <span class="o">(</span>empty<span class="o">)]</span>

<span class="nt">--------------------------------------------------------------------------------</span>

<span class="o">[</span>Entry Point]
  <span class="c"># {method} {0x00007f8e90700320} 'compute' '(I)I' in 'SampleCompute'</span>
  <span class="c"># this:     rsi:rsi   = 'SampleCompute'</span>
  <span class="c"># parm0:    rdx       = int</span>
  <span class="c">#           [sp+0x30]  (sp of caller)</span>
  0x00007f8eb909e420:   mov    0x8<span class="o">(</span>%rsi<span class="o">)</span>,%r10d
  0x00007f8eb909e424:   shl    <span class="nv">$0x3</span>,%r10
  0x00007f8eb909e428:   cmp    %rax,%r10
  0x00007f8eb909e42b:   jne    0x00007f8ec04ad080           <span class="p">;</span>   <span class="o">{</span>runtime_call ic_miss_stub<span class="o">}</span>
  0x00007f8eb909e431:   data16 data16 nopw 0x0<span class="o">(</span>%rax,%rax,1<span class="o">)</span>
  0x00007f8eb909e43c:   data16 data16 xchg %ax,%ax
<span class="o">[</span>Verified Entry Point]
  0x00007f8eb909e440:   mov    %eax,-0x14000<span class="o">(</span>%rsp<span class="o">)</span>
  0x00007f8eb909e447:   push   %rbp
  0x00007f8eb909e448:   sub    <span class="nv">$0x20</span>,%rsp
  0x00007f8eb909e44c:   cmpl   <span class="nv">$0x1</span>,0x20<span class="o">(</span>%r15<span class="o">)</span>
  0x00007f8eb909e454:   je     0x00007f8eb909e45b
  0x00007f8eb909e456:   call   Stub::nmethod_entry_barrier  <span class="p">;</span>   <span class="o">{</span>runtime_call StubRoutines <span class="o">(</span>final stubs<span class="o">)}</span>
  0x00007f8eb909e45b:   movabs <span class="nv">$0x7f8e907004c0</span>,%rax         <span class="p">;</span>   <span class="o">{</span>metadata<span class="o">(</span>method data <span class="k">for</span> <span class="o">{</span>method<span class="o">}</span> <span class="o">{</span>0x00007f8e90700320<span class="o">}</span> <span class="s1">'compute'</span> <span class="s1">'(I)I'</span> <span class="k">in</span> <span class="s1">'SampleCompute'</span><span class="o">)}</span>
  0x00007f8eb909e465:   mov    0xf4<span class="o">(</span>%rax<span class="o">)</span>,%edi
  0x00007f8eb909e46b:   add    <span class="nv">$0x2</span>,%edi
  0x00007f8eb909e46e:   mov    %edi,0xf4<span class="o">(</span>%rax<span class="o">)</span>
  0x00007f8eb909e474:   and    <span class="nv">$0x7fe</span>,%edi
  0x00007f8eb909e47a:   <span class="nb">test</span>   %edi,%edi
  0x00007f8eb909e47c:   je     0x00007f8eb909e49d
  0x00007f8eb909e482:   mov    %rdx,%rax
  0x00007f8eb909e485:   imul   %edx,%eax
  0x00007f8eb909e488:   add    %edx,%eax
  0x00007f8eb909e48a:   add    <span class="nv">$0x20</span>,%rsp
  0x00007f8eb909e48e:   pop    %rbp
  0x00007f8eb909e48f:   cmp    0x448<span class="o">(</span>%r15<span class="o">)</span>,%rsp             <span class="p">;</span>   <span class="o">{</span>poll_return<span class="o">}</span>
  0x00007f8eb909e496:   ja     0x00007f8eb909e4bb
  0x00007f8eb909e49c:   ret    
  0x00007f8eb909e49d:   movabs <span class="nv">$0x7f8e90700320</span>,%r10         <span class="p">;</span>   <span class="o">{</span>metadata<span class="o">({</span>method<span class="o">}</span> <span class="o">{</span>0x00007f8e90700320<span class="o">}</span> <span class="s1">'compute'</span> <span class="s1">'(I)I'</span> <span class="k">in</span> <span class="s1">'SampleCompute'</span><span class="o">)}</span>
  0x00007f8eb909e4a7:   mov    %r10,0x8<span class="o">(</span>%rsp<span class="o">)</span>
  0x00007f8eb909e4ac:   movq   <span class="nv">$0xffffffffffffffff</span>,<span class="o">(</span>%rsp<span class="o">)</span>
  0x00007f8eb909e4b4:   call   0x00007f8ec0570a00           <span class="p">;</span> ImmutableOopMap <span class="o">{</span><span class="nv">rsi</span><span class="o">=</span>Oop <span class="o">}</span>
                                                            <span class="p">;</span><span class="k">*</span>synchronization entry
                                                            <span class="p">;</span> - SampleCompute::compute@-1 <span class="o">(</span>line 12<span class="o">)</span>
                                                            <span class="p">;</span>   <span class="o">{</span>runtime_call counter_overflow Runtime1 stub<span class="o">}</span>
  0x00007f8eb909e4b9:   jmp    0x00007f8eb909e482
  0x00007f8eb909e4bb:   movabs <span class="nv">$0x7f8eb909e48f</span>,%r10         <span class="p">;</span>   <span class="o">{</span>internal_word<span class="o">}</span>
  0x00007f8eb909e4c5:   mov    %r10,0x460<span class="o">(</span>%r15<span class="o">)</span>
  0x00007f8eb909e4cc:   jmp    0x00007f8ec04b4000           <span class="p">;</span>   <span class="o">{</span>runtime_call SafepointBlob<span class="o">}</span>
  0x00007f8eb909e4d1:   mov    0x4f8<span class="o">(</span>%r15<span class="o">)</span>,%rax
  0x00007f8eb909e4d8:   movq   <span class="nv">$0x0</span>,0x4f8<span class="o">(</span>%r15<span class="o">)</span>
  0x00007f8eb909e4e3:   movq   <span class="nv">$0x0</span>,0x500<span class="o">(</span>%r15<span class="o">)</span>
  0x00007f8eb909e4ee:   add    <span class="nv">$0x20</span>,%rsp
  0x00007f8eb909e4f2:   pop    %rbp
  0x00007f8eb909e4f3:   jmp    0x00007f8ec056ac00           <span class="p">;</span>   <span class="o">{</span>runtime_call unwind_exception Runtime1 stub<span class="o">}</span>
<span class="o">[</span>Exception Handler]
  0x00007f8eb909e4f8:   call   0x00007f8ec056d900           <span class="p">;</span>   <span class="o">{</span>no_reloc<span class="o">}</span>
  0x00007f8eb909e4fd:   movabs <span class="nv">$0x7f8ed0349604</span>,%rdi         <span class="p">;</span>   <span class="o">{</span>external_word<span class="o">}</span>
  0x00007f8eb909e507:   and    <span class="nv">$0xfffffffffffffff0</span>,%rsp
  0x00007f8eb909e50b:   call   0x00007f8ecfb79340           <span class="p">;</span>   <span class="o">{</span>runtime_call MacroAssembler::debug64<span class="o">(</span>char<span class="k">*</span>, long, long<span class="k">*</span><span class="o">)}</span>
  0x00007f8eb909e510:   hlt    
<span class="o">[</span>Deopt Handler Code]
  0x00007f8eb909e511:   movabs <span class="nv">$0x7f8eb909e511</span>,%r10         <span class="p">;</span>   <span class="o">{</span>section_word<span class="o">}</span>
  0x00007f8eb909e51b:   push   %r10
  0x00007f8eb909e51d:   jmp    0x00007f8ec04b32a0           <span class="p">;</span>   <span class="o">{</span>runtime_call DeoptimizationBlob<span class="o">}</span>
  0x00007f8eb909e522:   hlt    
  0x00007f8eb909e523:   hlt    
  0x00007f8eb909e524:   hlt    
  0x00007f8eb909e525:   hlt    
  0x00007f8eb909e526:   hlt    
  0x00007f8eb909e527:   hlt    
<span class="nt">--------------------------------------------------------------------------------</span>
</code></pre></div></div>]]></content><author><name>Juan Fumero, PhD</name><email>juan@jfumero.dev</email></author><category term="hsdis" /><category term="JDK" /><summary type="html"><![CDATA[Learn how to build a JDK with the HotSpot Disassembler (HSDIS) plugin enabled on Linux to inspect the JVM's JIT-compiled assembly code.]]></summary></entry><entry><title type="html">Babylon OpenJDK: A Guide for Beginners and Comparison with TornadoVM</title><link href="https://jfumero.dev/posts/2025/02/07/babylon-and-tornadovm" rel="alternate" type="text/html" title="Babylon OpenJDK: A Guide for Beginners and Comparison with TornadoVM" /><published>2025-02-07T00:00:00+00:00</published><updated>2025-02-07T00:00:00+00:00</updated><id>https://jfumero.dev/posts/2025/02/07/babylon-tornadovm</id><content type="html" xml:base="https://jfumero.dev/posts/2025/02/07/babylon-and-tornadovm"><![CDATA[<h2 id="introduction">Introduction</h2>

<p><a href="https://github.com/openjdk/babylon">Babylon</a> is a new OpenJDK project which aims to enhance code reflection for the Java platform allowing not only to inspect classes and fields, but also to inspect methods and lambdas with the end goal of performing code transformation without using any <a href="https://openjdk.org/projects/babylon/articles/code-models">3rd party libraries</a>.</p>

<p><em>What does this mean in practice?</em> The enhanced code reflection can be used to represent different types of computation, such as automatic differentiation [2], LINQ expressions [3] and even GPU offloading, which is the focus of this article. We are going to walk through how Babylon helps developers define a parallel framework for GPU programming within Java, and how it differs from existing solutions, such as TornadoVM.</p>

<p>But before we dive into the GPU workflow within Babylon, let’s define a key term: the <strong>code model</strong>.
In the context of Babylon, a code model is a representation of program code (e.g., a Java method) 
that is produced by the <code class="language-plaintext highlighter-rouge">javac</code> compiler and stored in the class file. The information stored in the class file includes, for example, the type information and 
the control flow.</p>
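
<p>As a first taste of what this looks like in practice, the following sketch obtains and prints the code model of an annotated method. Treat the API names as assumptions based on the state of the code-reflection branch in early 2025 (<code class="language-plaintext highlighter-rouge">Op.ofMethod</code>, <code class="language-plaintext highlighter-rouge">CoreOp.FuncOp</code> and the <code class="language-plaintext highlighter-rouge">java.lang.reflect.code</code> package); they may change in later builds:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hedged sketch: API names assume the Babylon code-reflection branch (early 2025)
// and may differ in later builds.
import java.lang.reflect.Method;
import java.lang.reflect.code.Op;
import java.lang.reflect.code.op.CoreOp;
import java.lang.runtime.CodeReflection;

public class CodeModelDemo {

    @CodeReflection
    static int square(int x) {
        return x * x;
    }

    public static void main(String[] args) throws Exception {
        Method m = CodeModelDemo.class.getDeclaredMethod("square", int.class);
        // The code model is available because the method is annotated with @CodeReflection
        CoreOp.FuncOp model = Op.ofMethod(m).orElseThrow();
        System.out.println(model.toText()); // print the code model in its textual form
    }
}
</code></pre></div></div>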

<p>Babylon’s enhanced reflection API empowers developers to access and manipulate these code models at runtime, 
enabling metaprogramming directly within Java. 
This capability allows for dynamic generation and manipulation of Java programs, 
including the creation of GPU code tailored for various hardware accelerators like Intel or NVIDIA GPUs. 
In fact, this is the purpose of a subproject from Babylon called HAT (<a href="https://github.com/openjdk/babylon/tree/code-reflection/hat">Heterogeneous Accelerator Toolkit</a>), 
which leverages Babylon to provide a GPU backend for the Java platform.</p>

<p>In this article I am going to explore HAT and how developers can start using it to access GPUs for hardware acceleration. 
We’ll delve into the key API components that enable this functionality, and explain how code is executed. 
Then, I will compare HAT with <a href="https://github.com/beehive-lab/TornadoVM">TornadoVM</a>, a Java parallel programming framework that transparently accelerates 
Java data-parallel workloads on modern hardware, including GPUs.</p>

<p><strong>For full disclosure:</strong> I’m one of the architects and the lead developer of the TornadoVM project. 
However, this exploration of HAT comes from a place of research and genuine curiosity about this emerging technology. 
My goal is to provide an objective comparison between the two projects. While I’ve strived for impartiality, I 
welcome any discussion or feedback if anything seems biased.</p>

<p>With this out of the way, let’s get started!</p>

<h2 id="hat-heterogeneous-accelerator-toolkit">HAT: Heterogeneous Accelerator Toolkit</h2>

<p>This blog post reflects the state of the Babylon project as of February 2025. 
Given the project’s rapid development, some examples may not compile or run correctly in future versions. 
However, the core concepts and fundamental understanding presented here should remain valuable for readers.</p>

<p>The Heterogeneous Accelerator Toolkit offers different interfaces to build applications tailored for GPU execution. The HAT interfaces are grouped into three categories:</p>
<ol>
  <li>An <code class="language-plaintext highlighter-rouge">NDRange</code> Kernel API to help developers express parallel kernels.</li>
  <li>A Java interface to map memory between Java and hardware accelerators, called <code class="language-plaintext highlighter-rouge">iFaceMapper</code>.</li>
  <li>An API for identifying methods to accelerate on GPUs.</li>
</ol>

<p>Let’s briefly look at each of these components.</p>

<h3 id="ndrange-api">NDRange API</h3>

<p>HAT is based on the SIMT (Single Instruction, Multiple Thread) model, and the NDRange API serves as the interface 
for Java developers to create parallel kernels that target this model. 
In a SIMT model, a single instruction operates on multiple threads concurrently, 
where each thread can access different data. 
This SIMT model is also the foundation of other GPU programming interfaces and languages such as CUDA, OpenCL, 
and SYCL.</p>

<p>In HAT, Java developers use the NDRange API to define kernels (methods that will be offloaded to a GPU). 
A kernel encapsulates the work to be done per thread, and the NDRange defines the number of threads to run. 
This programming model scales very well, independently of the number of GPU cores on the actual graphics card.</p>

<p>Let’s write a simple example, a vector addition. In Java, the vector addition can be expressed as follows:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kt">void</span> <span class="nf">vectorAddition</span><span class="o">(</span><span class="kt">float</span><span class="o">[]</span> <span class="n">a</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">b</span><span class="o">,</span> <span class="kt">float</span><span class="o">[]</span> <span class="n">c</span><span class="o">)</span> <span class="o">{</span>
    <span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">a</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="o">{</span>
        <span class="n">c</span><span class="o">[</span><span class="n">i</span><span class="o">]</span> <span class="o">=</span> <span class="n">a</span><span class="o">[</span><span class="n">i</span><span class="o">]</span> <span class="o">+</span> <span class="n">b</span><span class="o">[</span><span class="n">i</span><span class="o">];</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>For clarity, let’s make a couple of simplifying assumptions: none of our vectors will be <code class="language-plaintext highlighter-rouge">null</code>, 
and they’ll all have the same size. This allows us to focus on the core concepts. Here’s the Babylon/HAT code:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@CodeReflection</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">vectorAddition</span><span class="o">(</span><span class="nc">F32Array</span> <span class="n">a</span><span class="o">,</span> <span class="nc">F32Array</span> <span class="n">b</span><span class="o">,</span> <span class="nc">F32Array</span> <span class="n">c</span><span class="o">,</span> <span class="nc">KernelContext</span> <span class="n">context</span><span class="o">)</span> <span class="o">{</span>
      <span class="kt">int</span> <span class="n">idx</span> <span class="o">=</span> <span class="n">context</span><span class="o">.</span><span class="na">x</span><span class="o">;</span>
      <span class="kt">float</span> <span class="n">sum</span> <span class="o">=</span> <span class="n">a</span><span class="o">.</span><span class="na">array</span><span class="o">(</span><span class="n">idx</span><span class="o">)</span> <span class="o">+</span> <span class="n">b</span><span class="o">.</span><span class="na">array</span><span class="o">(</span><span class="n">idx</span><span class="o">);</span>
      <span class="n">c</span><span class="o">.</span><span class="na">array</span><span class="o">(</span><span class="n">idx</span><span class="o">,</span> <span class="n">sum</span> <span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>

<p>This example demonstrates an explicit parallel kernel. Several key changes are worth noting:</p>

<ul>
  <li><strong>Annotation:</strong> A new annotation (<code class="language-plaintext highlighter-rouge">@CodeReflection</code>) is required to instruct the <code class="language-plaintext highlighter-rouge">javac</code> compiler to 
generate a code model that represents the whole method.</li>
  <li><strong>Type Changes:</strong> The parameter types have been modified from <code class="language-plaintext highlighter-rouge">float[]</code> to <code class="language-plaintext highlighter-rouge">F32Array</code>. 
<code class="language-plaintext highlighter-rouge">F32Array</code> is a type provided by HAT, representing data structures compatible with the GPU. 
We’ll dive deeper into HAT’s type system and memory management in the next section.</li>
  <li><strong>Kernel Context:</strong> A new parameter, the kernel context, is introduced. 
This special object provides access to GPU built-in intrinsics, 
including thread IDs and other GPU execution parameters like the maximum number of threads.</li>
  <li><strong>Thread-Based Execution:</strong> The traditional for loop has been replaced. 
Instead, the thread ID, obtained from the kernel context, is used to access data. 
This is a standard GPU programming pattern: the number of threads launched typically corresponds 
to the size of the input arrays (see the bounds-check sketch after this list).</li>
</ul>
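
<p>One common refinement of this pattern is worth showing: when the number of threads launched does not exactly match the array size, the kernel guards its accesses against the range carried by the kernel context. Here is a small sketch, assuming the kernel context exposes the global thread count as <code class="language-plaintext highlighter-rouge">maxX</code> next to the thread ID <code class="language-plaintext highlighter-rouge">x</code>:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: guard out-of-range threads, assuming KernelContext exposes maxX
// (the number of threads dispatched) alongside the thread ID x.
@CodeReflection
public void vectorAdditionGuarded(F32Array a, F32Array b, F32Array c, KernelContext context) {
    int idx = context.x;
    if (idx &lt; context.maxX) {  // skip threads beyond the dispatched range
        c.array(idx, a.array(idx) + b.array(idx));
    }
}
</code></pre></div></div>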

<p>Those familiar with CUDA, OpenCL, or oneAPI will find this code structure very familiar. 
This similarity is a point that I’ll revisit when comparing HAT with TornadoVM.</p>

<h3 id="memory-mapping">Memory Mapping</h3>

<p>This is one of my favourite parts of the HAT project. 
HAT defines an interface called <code class="language-plaintext highlighter-rouge">iFaceMapper</code> to represent data. 
Data is actually stored off-heap by leveraging the Panama Memory Segments API for GPU computing.</p>
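
<p>To make the off-heap part concrete, here is a minimal, self-contained sketch of the mechanism HAT builds on, using the standard <code class="language-plaintext highlighter-rouge">java.lang.foreign</code> (FFM) API. This illustrates off-heap storage itself, not HAT’s internal implementation:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Minimal sketch of off-heap storage with the Java FFM API (java.lang.foreign).
// The GC does not move this memory, so a native device runtime can safely hold a pointer to it.
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class OffHeapExample {
    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            // Allocate 1024 floats off-heap
            MemorySegment segment = arena.allocate(ValueLayout.JAVA_FLOAT, 1024);
            segment.setAtIndex(ValueLayout.JAVA_FLOAT, 0, 42.0f);
            System.out.println(segment.getAtIndex(ValueLayout.JAVA_FLOAT, 0));
        } // the memory is freed when the arena closes
    }
}
</code></pre></div></div>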

<p>From my point of view, data representation presents a significant challenge in GPU programming with 
managed runtime languages like Java, particularly concerning the tradeoffs between performance, 
portability and ease of use. It is also a critical part because, in Java, the Garbage Collector (GC) 
can move objects around if needed.</p>

<p>HAT tackles this issue by defining a base interface capable of handling data access and manipulation 
within Panama Segments. This interface is extensible, enabling developers to create custom data objects 
compatible with GPUs and other hardware accelerators.</p>

<p>This interface offers broad potential benefits, extending beyond Babylon and HAT to projects like TornadoVM. 
While TornadoVM offers a wide range of hardware-accelerator-compatible types, 
it currently lacks user-side customization for data representation. 
This interface could provide a very promising approach for integration, enabling greater flexibility and control, 
and improving TornadoVM further.</p>

<p>For example, to create a custom data object in HAT to store an array that uses a Memory Segment:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">interface</span> <span class="nc">MyCustomArray</span> <span class="kd">extends</span> <span class="nc">Buffer</span> <span class="o">{</span>
   <span class="kt">int</span> <span class="nf">length</span><span class="o">();</span>

   <span class="nd">@BoundBy</span><span class="o">(</span><span class="s">"length"</span><span class="o">)</span>
   <span class="kt">float</span> <span class="nf">data</span><span class="o">(</span><span class="kt">long</span> <span class="n">idx</span><span class="o">);</span>
   <span class="kt">void</span> <span class="nf">data</span><span class="o">(</span><span class="kt">long</span> <span class="n">idx</span><span class="o">,</span> <span class="kt">float</span> <span class="n">f</span><span class="o">);</span>

   <span class="c1">// Define the schema</span>
   <span class="nc">Schema</span><span class="o">&lt;</span><span class="nc">MyCustomArray</span><span class="o">&gt;</span> <span class="n">schema</span> <span class="o">=</span> <span class="nc">Schema</span><span class="o">.</span><span class="na">of</span><span class="o">(</span><span class="nc">MyCustomArray</span><span class="o">.</span><span class="na">class</span><span class="o">,</span>
           <span class="n">array</span> <span class="o">-&gt;</span> <span class="n">array</span>
           <span class="o">.</span><span class="na">arrayLen</span><span class="o">(</span><span class="s">"length"</span><span class="o">)</span>
           <span class="o">.</span><span class="na">array</span><span class="o">(</span><span class="s">"data"</span><span class="o">));</span>

   <span class="kd">static</span> <span class="nc">MyCustomArray</span> <span class="nf">create</span><span class="o">(</span><span class="nc">Accelerator</span> <span class="n">accelerator</span><span class="o">,</span> <span class="kt">int</span> <span class="n">length</span><span class="o">)</span> <span class="o">{</span>
       <span class="k">return</span> <span class="n">schema</span><span class="o">.</span><span class="na">allocate</span><span class="o">(</span><span class="n">accelerator</span><span class="o">,</span> <span class="n">length</span><span class="o">);</span>
   <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
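
<p>Usage would then look something like the following sketch (hypothetical, built only on the <code class="language-plaintext highlighter-rouge">create</code> factory and accessors defined above; the <code class="language-plaintext highlighter-rouge">Accelerator</code> object is covered in the next section):</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical usage of the custom array defined above.
static void example(Accelerator accelerator) {
    MyCustomArray array = MyCustomArray.create(accelerator, 1024);
    array.data(0, 3.14f);          // write element 0
    float value = array.data(0);   // read it back
    int length = array.length();   // 1024
}
</code></pre></div></div>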

<p>Then, the HAT OpenCL compiler generates a C-struct as follows:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">MyCustomArray_s</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">length</span><span class="p">;</span>
    <span class="kt">float</span> <span class="n">data</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="p">}</span> <span class="n">MyCustomArray_t</span><span class="p">;</span>
</code></pre></div></div>

<p>Still, there is a bit of boilerplate code to add, but it can be used to define custom data types compatible with GPUs. 
How cool is this?</p>

<h3 id="accelerator-and-compute-context">Accelerator and Compute Context</h3>

<p>Let’s now look at the final piece of the API: the <code class="language-plaintext highlighter-rouge">Accelerator</code> object and the <code class="language-plaintext highlighter-rouge">ComputeContext</code>. 
These two objects are used to define the backend to use (e.g., OpenCL, CUDA, etc.) and the list of kernels 
we want to offload.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">var</span> <span class="n">accelerator</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Accelerator</span><span class="o">(</span><span class="n">lookup</span><span class="o">,</span> <span class="nc">Backend</span><span class="o">.</span><span class="na">FIRST</span><span class="o">);</span>
<span class="n">accelerator</span><span class="o">.</span><span class="na">compute</span><span class="o">(</span><span class="n">cc</span> <span class="o">-&gt;</span>
       <span class="nc">MyClass</span><span class="o">.</span><span class="na">methodToOffload</span><span class="o">(</span><span class="n">cc</span><span class="o">,</span> <span class="n">matrixA</span><span class="o">,</span> <span class="n">matrixB</span><span class="o">,</span> <span class="n">matrixC</span><span class="o">,</span> <span class="n">size</span><span class="o">)</span>
<span class="o">);</span>
</code></pre></div></div>

<p>Then, the offloaded compute method receives the <code class="language-plaintext highlighter-rouge">ComputeContext</code> and dispatches the kernels:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@CodeReflection</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">methodToOffload</span><span class="o">(</span><span class="nc">ComputeContext</span> <span class="n">cc</span><span class="o">,</span> <span class="nc">MyCustomArray</span> <span class="n">matrixA</span><span class="o">)</span> <span class="o">{</span>
   <span class="n">cc</span><span class="o">.</span><span class="na">dispatchKernel</span><span class="o">(</span><span class="n">size</span><span class="o">,</span> <span class="n">kc</span> <span class="o">-&gt;</span> <span class="n">myGPUKernel</span><span class="o">(</span><span class="n">kc</span><span class="o">,</span> <span class="n">data</span><span class="o">));</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Note that the first parameter of the <code class="language-plaintext highlighter-rouge">dispatchKernel</code> method call (<code class="language-plaintext highlighter-rouge">size</code> in this case) is the number of 
threads to be deployed on the GPU.</p>
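
<p>Before moving on, a minimal kernel under this threading model looks like the following sketch (vector addition). It reuses the <code class="language-plaintext highlighter-rouge">KernelContext</code> fields and <code class="language-plaintext highlighter-rouge">F32Array</code> accessors shown elsewhere in this post; treat it as an illustrative sketch rather than code taken from the HAT repository:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@CodeReflection
public static void vectorAddKernel(KernelContext kc, F32Array a, F32Array b, F32Array c) {
    // kc.x is this thread's global index; kc.maxX is the dispatch size.
    if (kc.x &lt; kc.maxX) {
        c.array(kc.x, a.array(kc.x) + b.array(kc.x));
    }
}
</code></pre></div></div>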

<h2 id="example-expressing-parallel-matrix-multiplication-for-gpus">Example: Expressing Parallel Matrix Multiplication for GPUs</h2>

<p>Let’s put all these concepts into practice and implement Matrix Multiplication in HAT. 
Matrix Multiplication is one of the key routines in modern workloads such as Deep Learning, AI and LLMs. 
Besides, it is a very good candidate for GPU acceleration.</p>

<p>Let’s start with the Java sequential implementation of the Matrix Multiplication:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">private</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">runSequential</span><span class="o">(</span><span class="nc">F32Array</span> <span class="n">matrixA</span><span class="o">,</span> <span class="nc">F32Array</span> <span class="n">matrixB</span><span class="o">,</span> <span class="nc">F32Array</span> <span class="n">matrixC</span><span class="o">,</span> <span class="kd">final</span> <span class="kt">int</span> <span class="n">size</span><span class="o">)</span> <span class="o">{</span>
   <span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">size</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="o">{</span>
       <span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">size</span><span class="o">;</span> <span class="n">j</span><span class="o">++)</span> <span class="o">{</span>
           <span class="kt">float</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
           <span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">k</span> <span class="o">&lt;</span> <span class="n">size</span><span class="o">;</span> <span class="n">k</span><span class="o">++)</span> <span class="o">{</span>
               <span class="kt">float</span> <span class="n">a</span> <span class="o">=</span> <span class="n">matrixA</span><span class="o">.</span><span class="na">array</span><span class="o">((</span><span class="kt">long</span><span class="o">)</span> <span class="n">i</span> <span class="o">*</span> <span class="n">size</span> <span class="o">+</span> <span class="n">k</span><span class="o">);</span>
               <span class="kt">float</span> <span class="n">b</span> <span class="o">=</span> <span class="n">matrixB</span><span class="o">.</span><span class="na">array</span><span class="o">((</span><span class="kt">long</span><span class="o">)</span> <span class="n">k</span> <span class="o">*</span> <span class="n">size</span> <span class="o">+</span> <span class="n">j</span><span class="o">);</span>
               <span class="n">sum</span> <span class="o">+=</span> <span class="n">a</span> <span class="o">*</span> <span class="n">b</span><span class="o">;</span>
           <span class="o">}</span>
           <span class="n">matrixC</span><span class="o">.</span><span class="na">array</span><span class="o">((</span><span class="kt">long</span><span class="o">)</span> <span class="n">i</span> <span class="o">*</span> <span class="n">size</span> <span class="o">+</span> <span class="n">j</span><span class="o">,</span> <span class="n">sum</span><span class="o">);</span>
       <span class="o">}</span>
   <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>This shows the canonical matrix multiply (three nested loops). In Babylon/HAT we can parallelize the outermost loop as follows:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@CodeReflection</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">matrixMultiplyKernel</span><span class="o">(</span><span class="nc">KernelContext</span> <span class="n">kc</span><span class="o">,</span> <span class="nc">F32Array</span> <span class="n">matrixA</span><span class="o">,</span> <span class="nc">F32Array</span> <span class="n">matrixB</span><span class="o">,</span> <span class="nc">F32Array</span> <span class="n">matrixC</span><span class="o">,</span> <span class="kt">int</span> <span class="n">size</span><span class="o">)</span> <span class="o">{</span>
   <span class="k">if</span> <span class="o">(</span><span class="n">kc</span><span class="o">.</span><span class="na">x</span> <span class="o">&lt;</span> <span class="n">kc</span><span class="o">.</span><span class="na">maxX</span><span class="o">)</span> <span class="o">{</span>
       <span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">size</span><span class="o">;</span> <span class="n">j</span><span class="o">++)</span> <span class="o">{</span>
           <span class="kt">float</span> <span class="n">acc</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
           <span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">k</span> <span class="o">&lt;</span> <span class="n">size</span><span class="o">;</span> <span class="n">k</span><span class="o">++)</span> <span class="o">{</span>
               <span class="n">acc</span> <span class="o">+=</span> <span class="o">(</span><span class="n">matrixA</span><span class="o">.</span><span class="na">array</span><span class="o">(</span><span class="n">kc</span><span class="o">.</span><span class="na">x</span> <span class="o">*</span> <span class="n">size</span> <span class="o">+</span> <span class="n">k</span><span class="o">)</span> <span class="o">*</span> <span class="n">matrixB</span><span class="o">.</span><span class="na">array</span><span class="o">(</span><span class="n">k</span> <span class="o">*</span> <span class="n">size</span> <span class="o">+</span> <span class="n">j</span><span class="o">));</span>
           <span class="o">}</span>
           <span class="n">matrixC</span><span class="o">.</span><span class="na">array</span><span class="o">(</span><span class="n">kc</span><span class="o">.</span><span class="na">x</span> <span class="o">*</span> <span class="n">size</span> <span class="o">+</span> <span class="n">j</span><span class="o">,</span> <span class="n">acc</span><span class="o">);</span>
       <span class="o">}</span>
   <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<p>This means that the outermost loop runs in parallel on the target device, deploying as many threads as there are rows 
in the matrices. 
Each thread executes the second and innermost loops: for every column, the innermost loop performs a reduction that sums 
up the partial products.</p>

<p>Next, we need to dispatch the kernel.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@CodeReflection</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">matrixMultiply</span><span class="o">(</span><span class="nc">ComputeContext</span> <span class="n">cc</span><span class="o">,</span> <span class="nc">F32Array</span> <span class="n">matrixA</span><span class="o">,</span> <span class="nc">F32Array</span> <span class="n">matrixB</span><span class="o">,</span> <span class="nc">F32Array</span> <span class="n">matrixC</span><span class="o">,</span> <span class="kt">int</span> <span class="n">size</span><span class="o">)</span> <span class="o">{</span>
   <span class="n">cc</span><span class="o">.</span><span class="na">dispatchKernel</span><span class="o">(</span><span class="n">size</span><span class="o">,</span>
           <span class="n">kc</span> <span class="o">-&gt;</span> <span class="n">matrixMultiplyKernel</span><span class="o">(</span><span class="n">kc</span><span class="o">,</span> <span class="n">matrixA</span><span class="o">,</span> <span class="n">matrixB</span><span class="o">,</span> <span class="n">matrixC</span><span class="o">,</span> <span class="n">size</span><span class="o">)</span>
   <span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Note that this method also carries the <code class="language-plaintext highlighter-rouge">@CodeReflection</code> annotation, even though it will not be executed on 
the device (GPU). 
This is because HAT inspects this method to obtain data and infer types before compiling the code, 
and to obtain the code model of the method to be offloaded. 
Thus, the annotation helps the HAT compiler and the runtime to manipulate data and generate the correct 
OpenCL and CUDA PTX code.</p>

<p>You can see the full example here: <a href="https://github.com/openjdk/babylon/pull/276">https://github.com/openjdk/babylon/pull/276</a>. 
Note that the only method that will be offloaded to the GPU is <code class="language-plaintext highlighter-rouge">matrixMultiplyKernel</code>. 
The rest of the code runs on the host side (under the Java platform). 
But how is the compilation done? Which parts are offloaded, and what does the final code look like? Let’s dive in.</p>

<h2 id="how-does-babylonhat-internally-work-for-gpus">How does Babylon/HAT internally work for GPUs?</h2>

<p>As of February 2025, HAT supports OpenCL and CUDA backends. 
There is also ongoing work on a SPIR-V backend (and fun fact: 
the <a href="https://github.com/beehive-lab/beehive-spirv-toolkit">SPIR-V code generator library</a> is actually the one we (the TornadoVM team) developed for TornadoVM, 
so I was very happy to see such a library being used outside academia).</p>

<p>HAT uses a two-stage compilation process to reach the GPU source code (e.g., OpenCL C or SPIR-V), 
followed by another compilation phase, performed by the corresponding GPU driver, to obtain the final GPU binary.<br />
Let’s discuss the two-stage compilation process first.</p>

<p>The following diagram shows an abstract representation of the workflow of the different compilation stages to 
reach the GPU code in Babylon. 
First, as we saw in the previous example, developers use the <code class="language-plaintext highlighter-rouge">NDRange</code> API and the Accelerator Toolkit to 
annotate and identify the code to be offloaded. 
Since the method is annotated with the <code class="language-plaintext highlighter-rouge">@CodeReflection</code> annotation, the <code class="language-plaintext highlighter-rouge">javac</code> compiler generates a 
code model that is stored in the class file.</p>

<p><img src="https://raw.githubusercontent.com/jjfumero/jjfumero.github.io/refs/heads/master/files/blog/25-02-07-babylon/babylonCompilation.png" alt="Alt text" /></p>

<p>This code model is close to an AST (<a href="https://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax Tree</a>), along with type and control-flow information. 
At this point, HAT performs a lowering phase (it actually invokes a lowering phase from the code-reflection API) 
to transform the original code model into a low-level representation, 
similar to LLVM IR.</p>
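
<p>For the curious, code models can also be inspected programmatically. The following is a minimal sketch based on the code-reflection API described in the Babylon articles [2]; the package and method names (<code class="language-plaintext highlighter-rouge">getCodeModel()</code>, <code class="language-plaintext highlighter-rouge">toText()</code>) reflect the API at the time of writing and have been moving between builds, so treat this as illustrative rather than definitive:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import java.lang.reflect.Method;
import java.lang.reflect.code.op.CoreOp;   // package location as of early Babylon builds
import java.util.Optional;

public class InspectCodeModel {
    public static void main(String[] args) throws NoSuchMethodException {
        // Illustrative sketch: obtain and print the code model of an
        // @CodeReflection method. API names follow the Babylon code-models
        // article [2] and may differ in current builds.
        Method m = MyClass.class.getDeclaredMethod("matrixMultiplyKernel",
                KernelContext.class, F32Array.class, F32Array.class, F32Array.class, int.class);
        Optional&lt;CoreOp.FuncOp&gt; model = m.getCodeModel();
        model.ifPresent(funcOp -&gt; System.out.println(funcOp.toText()));
    }
}
</code></pre></div></div>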

<p>From this code representation, HAT generates the corresponding OpenCL C code (it could also generate CUDA PTX, 
the assembly-level code for CUDA programs, or SPIR-V). 
Once this GPU code is generated, we need another compiler to transform the generated source into a GPU binary. 
This is done by calling the corresponding function from each driver. 
For example, for OpenCL, the function <a href="https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/clBuildProgram.html"><code class="language-plaintext highlighter-rouge">clBuildProgram</code></a> does this.</p>

<p>Note that one could generate GPU code from the code model itself, without lowering. 
Depending on the target, this could be an easier choice. 
However, for SPIR-V or CUDA PTX, I see the lowered form as the more appropriate level from which to generate the code.</p>

<p>For more details: <a href="https://github.com/openjdk/babylon/blob/fec8903d84878a5c2683071db5b58b4c97727932/hat/hat/src/main/java/hat/backend/ffi/C99FFIBackend.java#L98-L102">link</a></p>

<p>Ok, enough talk, let’s see some action!</p>

<h2 id="installation-and-configuration-of-babylon-for-gpus">Installation and Configuration of Babylon for GPUs</h2>

<h3 id="install-prerequisites">Install prerequisites</h3>

<p>For Fedora (Checked on Fedora 41)</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>dnf <span class="nb">install </span>autoconf alsa-lib-devel cups-devel libXtst-devel libXt-devel libXrender-devel libXrandr-devel libXi-devel
</code></pre></div></div>

<p>for Ubuntu (Checked on Ubuntu 22.04.5 LTS):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt-get <span class="nb">install </span>autoconf libasound2-dev libcups2-dev libfontconfig1-dev libx11-dev libxext-dev libxrender-dev libxrandr-dev libxtst-dev libxt-dev
</code></pre></div></div>

<h3 id="installation-of-babylon-code-reflection-with-openjdk-24">Installation of Babylon Code-Reflection with OpenJDK 24</h3>

<p>Babylon and HAT are in continuous development. 
Thus, build instructions may change in the future. 
The following instructions are based on Babylon (commit <a href="https://github.com/openjdk/babylon/commit/ee3da0368addc0439d7d2bee8e18ec975a535d6b">ee3da03</a>).</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># as in February 2025</span>

sdk <span class="nb">install </span>java 23-open
sdk use java 23-open
</code></pre></div></div>

<h3 id="configure-babylon-java-jdk-with-babylon-port">Configure Babylon (Java JDK with Babylon Port)</h3>

<p>First, we are going to configure Babylon by building JVM from the source code. 
Then, we are going to use the resulting JVM to compile and run HAT programs on GPUs.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd </span>workdir 
<span class="nv">ROOT_BABYLON</span><span class="o">=</span><span class="sb">`</span><span class="nb">pwd</span><span class="sb">`</span>
git clone https://github.com/openjdk/babylon.git
<span class="nb">cd </span>babylon
bash configure  <span class="nt">--with-boot-jdk</span><span class="o">=</span><span class="k">${</span><span class="nv">JAVA_HOME</span><span class="k">}</span>
make images
</code></pre></div></div>

<p>The build produces a new OpenJDK image. Let’s point <code class="language-plaintext highlighter-rouge">JAVA_HOME</code> to it:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">JAVA_HOME</span><span class="o">=</span><span class="nv">$ROOT_BABYLON</span>/babylon/build/linux-x86_64-server-release/jdk
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span><span class="nv">$JAVA_HOME</span>/bin:<span class="nv">$PATH</span>
</code></pre></div></div>

<h3 id="configure-hat">Configure HAT</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> <span class="nv">$ROOT_BABYLON</span>/hat 
<span class="nb">source </span>env.bash 
java @bldr/args bld
</code></pre></div></div>

<h3 id="run-examples-on-gpus">Run Examples on GPUs</h3>

<p>E.g., Mandelbrot with the OpenCL backend:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>java @bldr/hatrun ffi-opencl mandel
</code></pre></div></div>

<p>Mandelbrot with the CUDA PTX backend:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>java @bldr/hatrun ffi-ptx mandel
</code></pre></div></div>

<p>Cool, isn’t it? Let’s now run a benchmark and compare it with Java and TornadoVM.</p>

<h2 id="performance-evaluation-of-matrix-multiplication-on-gpus">Performance Evaluation of Matrix Multiplication on GPUs</h2>

<p>In this section, we are going to evaluate the performance of the Matrix Multiplication on GPUs using Babylon, and compare it against TornadoVM. The following table shows the system CPU, GPU and the software used.</p>

<table>
  <thead>
    <tr>
      <th>System</th>
      <th>Version</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>CPU</td>
      <td>13th Gen Intel(R) Core(TM) i9-13900K</td>
    </tr>
    <tr>
      <td>GPU</td>
      <td>RTX 4090</td>
    </tr>
    <tr>
      <td>NVIDIA-DRIVER</td>
      <td>550.107.02</td>
    </tr>
    <tr>
      <td>OS</td>
      <td>Ubuntu 22.04.5 LTS</td>
    </tr>
    <tr>
      <td>Kernel</td>
      <td>Linux 6.8.0-47</td>
    </tr>
    <tr>
      <td>RAM</td>
      <td>64GB</td>
    </tr>
    <tr>
      <td>CUDA</td>
      <td>12.1.r12.1</td>
    </tr>
    <tr>
      <td>GCC</td>
      <td>11.4.0</td>
    </tr>
    <tr>
      <td>TornadoVM</td>
      <td>1.0.10-dev (<a href="https://github.com/beehive-lab/TornadoVM/commit/5da9549d162271b0b0b751607eced5e3a97409e5">5da9549d1</a>)</td>
    </tr>
    <tr>
      <td>JDK for TornadoVM</td>
      <td>OpenJDK “21.0.4” 2024-07-16 LTS</td>
    </tr>
    <tr>
      <td>Babylon</td>
      <td><a href="https://github.com/jjfumero/babylon/commit/cd3c7ce9c8ac2b79fd8342ce2e3603f0762dd3f6">cd3c7ce9c8a</a></td>
    </tr>
    <tr>
      <td>JDK for Babylon</td>
      <td>openjdk 23.0.1</td>
    </tr>
  </tbody>
</table>

<h3 id="examples">Examples:</h3>

<p>Let’s run the Matrix Multiplication explained in the previous section and compare it with TornadoVM. 
The full example in Babylon can be found in the following link:</p>

<p><a href="https://github.com/jjfumero/babylon/tree/dev/examples/hat/examples/matmul">https://github.com/jjfumero/babylon/tree/dev/examples/hat/examples/matmul</a></p>

<p>The TornadoVM version can be found here: <a href="https://github.com/jjfumero/tornadovm-examples">https://github.com/jjfumero/tornadovm-examples</a>.</p>

<p>In this post, I am not going to explain how to program with TornadoVM. 
If you are interested, I recommend a previous article in which I go into detail about how TornadoVM 
is used to accelerate different workloads: 
<a href="https://jjfumero.github.io/posts/2024/23/tornadovm-programming-model">https://jjfumero.github.io/posts/2024/23/tornadovm-programming-model</a>.</p>

<h3 id="backends">Backends:</h3>

<p>Let’s evaluate the OpenCL C and the PTX backends. For OpenCL C, I use the Intel integrated graphics. Although on my system I could have used the RTX 4090 for OpenCL, at the time of writing this post, Babylon does not support multiple devices or device switching. 
Thus, to make a fair comparison, I also chose the integrated GPU for TornadoVM.</p>

<p>Compared with TornadoVM, an interesting feature is that, when multiple GPUs are available, the TornadoVM runtime automatically reorders the devices and selects the best one based on 
compute capability and the number of threads to be deployed. Thus, on my system, the default choice for TornadoVM was the RTX 4090, which, in my opinion, is what we want by default.</p>

<h3 id="how-to-reproduce">How to reproduce?</h3>

<h4 id="babylon-opencl">Babylon (OpenCL):</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>java @bldr/hatrun ffi-opencl matmul
</code></pre></div></div>

<h4 id="babylon-ptx">Babylon (PTX):</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>java @bldr/hatrun ffi-ptx matmul
</code></pre></div></div>

<h4 id="tornadovm">TornadoVM:</h4>

<p>The experiment is taken from the <a href="https://github.com/jjfumero/tornadovm-examples">tornadovm-examples</a> project.</p>

<p>Note that we increase the number of runs to match the Babylon experiment, 
and remove the second (2D) level of parallelization to make the kernel equivalent to the HAT/Babylon example:</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">git diff
</span><span class="gh">diff --git a/src/main/java/io/github/jjfumero/MatrixMultiplication.java b/src/main/java/io/github/jjfumero/MatrixMultiplication.java
index 81bf05c..13c5bb1 100644
</span><span class="gd">--- a/src/main/java/io/github/jjfumero/MatrixMultiplication.java
</span><span class="gi">+++ b/src/main/java/io/github/jjfumero/MatrixMultiplication.java
</span><span class="p">@@ -253,7 +253,7 @@</span> public class MatrixMultiplication {
          */
         private static void mxmTornadoVM(Matrix2DFloat a, Matrix2DFloat b, Matrix2DFloat c, final int size) {
             for (@Parallel int i = 0; i &lt; size; i++) {
<span class="gd">-                for (@Parallel int j = 0; j &lt; size; j++) {
</span><span class="gi">+                for (int j = 0; j &lt; size; j++) {
</span>                     float sum = 0.0f;
                     for (int k = 0; k &lt; size; k++) {
                         sum += a.get(i, k) * b.get(k, j);
<span class="p">@@ -277,7 +277,7 @@</span> public class MatrixMultiplication {
 
         private static TornadoExecutionPlan createTornadoVMPlan(Matrix2DFloat a, Matrix2DFloat b, Matrix2DFloat c) {
             TaskGraph taskGraph = new TaskGraph("mxm");
<span class="gd">-            taskGraph.transferToDevice(DataTransferMode.FIRST_EXECUTION, a, b) //
</span><span class="gi">+            taskGraph.transferToDevice(DataTransferMode.EVERY_EXECUTION, a, b) //
</span>                     .task("mxm", Multiplication::mxmTornadoVM, a, b, c, a.getNumRows()) //
                     .transferToHost(DataTransferMode.EVERY_EXECUTION, c);
             TornadoExecutionPlan executionPlan = new TornadoExecutionPlan(taskGraph.snapshot());
<span class="p">@@ -455,7 +455,7 @@</span> public class MatrixMultiplication {
         matrixA.initRandom();
         matrixB.initRandom();
 
<span class="gd">-        final int RUNS = 10;
</span><span class="gi">+        final int RUNS = 100;
</span> 
         // 6 implementations to compare
         ArrayList&lt;ArrayList&lt;Long&gt;&gt; timers = IntStream.range(0, 6) //
</code></pre></div></div>

<p>To run:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tornado <span class="nt">-cp</span> target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.MatrixMultiplication onlyTornadoVM
</code></pre></div></div>

<p>If we have multiple devices/backends installed with TornadoVM, we can change the device and the runtime by
 using the flag <code class="language-plaintext highlighter-rouge">-Dmxm.mxm.device=X:Y</code>, where X is the backend index and Y the device index. 
 You can check all devices available to TornadoVM with the following command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>tornado <span class="nt">--devices</span>
</code></pre></div></div>

<h3 id="performance-evaluation">Performance Evaluation</h3>

<h4 id="opencl-c-on-intel-integrated-graphics">OpenCL C on Intel Integrated Graphics</h4>

<p>The following performance plot shows the distribution of the run-time across 100 runs for all evaluated versions: 
namely, a) TornadoVM with the OpenCL backend; b) TornadoVM dispatching SPIR-V code via the OpenCL backend, 
and c) TornadoVM dispatching SPIR-V code via the Level Zero API. 
The last bar shows the runtime distribution for Babylon. 
All these versions run on the Intel integrated graphics. 
The y-axis shows the total run-time (end-to-end) in nanoseconds. 
Thus, the lower, the better. The first run of each version includes the JIT compilation time.</p>

<p><img src="https://raw.githubusercontent.com/jjfumero/jjfumero.github.io/refs/heads/master/files/blog/25-02-07-babylon/plotBabylonVSTornadoVM-iGPU-streaming.png" alt="Alt text" /></p>

<p>As we can see, TornadoVM consistently outperforms Babylon, even with JIT compilation. 
TornadoVM’s performance is also more stable, with execution times clustered tightly around the average. 
Babylon’s performance on the same Intel integrated GPU varies more widely, though the total difference
 between its minimum and maximum execution times is only about 93 milliseconds.</p>

<p>Let’s see the big picture now and compare each of these approaches with Java and the Java Vector API running with 
Java Streams (the fastest we can get with Java on CPUs). 
The following performance plot shows the speedup over the sequential Java run at peak performance (after warm-up) 
for a) the parallel Java Vector API on the CPU; b) TornadoVM with OpenCL C on the Intel integrated 
GPU using a 2D kernel; c) TornadoVM with OpenCL C using a 1D kernel; and d) Babylon/HAT.</p>

<p><img src="https://raw.githubusercontent.com/jjfumero/jjfumero.github.io/refs/heads/master/files/blog/25-02-07-babylon/speedupBabylonAndTornadoVM-igpu.png" alt="Alt text" /></p>

<p>We see that, for this application (MxM), running on the integrated GPU does not outperform the parallel Java 
Vector API implementation on the CPU.<br />
Takeaway: do not underestimate CPU power unless you have a powerful accelerator!</p>

<p>If we include the NVIDIA RTX 4090 GPU, TornadoVM achieves speedups of up to 2500x over Java with the OpenCL backend, 
as <a href="https://jjfumero.github.io/posts/2024/12/17/tornadovm-vs-opencl">I detailed in a recent technical article</a>!</p>

<h4 id="cuda-ptx-backend">CUDA PTX Backend</h4>

<p>And what about the PTX backend running on the NVIDIA RTX 4090 GPU? 
The following performance graph shows the run-time distribution (the lower, the better) of 
100 runs for the sequential Java version, the parallel Java Vector API version, 
TornadoVM 1D with the PTX backend, the TornadoVM 2D version, and Babylon.</p>

<p><img src="https://raw.githubusercontent.com/jjfumero/jjfumero.github.io/refs/heads/master/files/blog/25-02-07-babylon/plotPerformancePTX.png" alt="Alt text" /></p>

<p>The dots indicate the first execution, in which TornadoVM and Babylon perform the JIT compilation. As we can see, TornadoVM runs faster than Babylon, even including the first run in which JIT compilation and execution are involved (2.3x faster for TornadoVM 1D and 9.3x faster for the 2D version compared to Babylon).</p>

<p>When we compare Babylon and TornadoVM 1D with the parallel Java Vector API, we see that both run slower than the parallel CPU implementation. When running on discrete GPUs, we must account for <a href="https://link.springer.com/chapter/10.1007/978-1-4842-9691-2_15">the cost of offloading</a>: the data transfers between the host CPU and the GPU, and the number of concurrent/parallel operations we can actually perform on the device. 
For this particular application, MxM, we are under-utilizing the hardware when we run in 1D.</p>
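
<p>To put rough, illustrative numbers on this (an estimate, not a measurement from these experiments): three 1024x1024 float matrices occupy 4 MiB each, so about 12 MiB must cross the PCIe bus per execution. At a nominal ~25 GB/s (PCIe 4.0 x16), the transfers alone cost roughly half a millisecond before the GPU computes anything; for small or under-parallelized kernels, this fixed cost can dominate the end-to-end time.</p>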

<p>If you want a deeper analysis of the Java Vector API vs TornadoVM, I recommend the following article: 
<a href="https://jjfumero.github.io/posts/2024/12/17/tornadovm-vs-opencl">https://jjfumero.github.io/posts/2024/12/17/tornadovm-vs-opencl</a>.</p>

<p>Looking at the speedups for the PTX backend compared to Java:</p>

<p><img src="https://raw.githubusercontent.com/jjfumero/jjfumero.github.io/refs/heads/master/files/blog/25-02-07-babylon/speedupBabylonAndTornadoVM-ptx.png" alt="Alt text" /></p>

<p>As we can see, TornadoVM achieves speedups of up to 1700x over Java; it is 11x faster than the CPU execution and 346x faster than Babylon/HAT on the same GPU.</p>

<p><em>Does this mean TornadoVM is always faster than Babylon/HAT?</em> No, it does not have to be. Some applications might be faster, others might be slower.
As I describe in more detail in the next section, TornadoVM has a JIT compiler and an optimizer, and that can give it an advantage for some applications.</p>

<h2 id="hat-vs-tornadovm-differences-and-limitations">HAT vs TornadoVM: Differences and Limitations</h2>

<p>Let’s talk about the current limitations of both Babylon and TornadoVM. 
Bear in mind that both projects are in active development, and what I describe as limitations today (February 2025) might be solved/overcome in the near future.</p>

<h3 id="current-limitations-of-babylonhat-vs-tornadovm">Current Limitations of Babylon/Hat vs TornadoVM</h3>

<p>Babylon and HAT are clearly focused on offering an interface to facilitate the manipulation and transformation of Java code. 
Thus, the main focus is compilation and the minimum runtime support to run the code 
(e.g., data handling and data representation).</p>

<p>TornadoVM, instead, offers a more complete solution to run on modern hardware accelerators, 
not just GPUs. With that, TornadoVM brings a more complex engineering framework to handle adaptive compiler 
optimizations per architecture, a specialized code optimizer, and an optimizing runtime system for different 
architectures and vendors. Let’s break this down:</p>

<h4 id="runtime-limitations">Runtime Limitations:</h4>
<p>Babylon HAT’s runtime features are currently limited. Compared to TornadoVM, HAT lacks dynamic selection among multiple devices (e.g., multiple GPUs) and dynamic task migration.
 Instead, devices are always statically assigned, reducing adaptability to changing system conditions. Furthermore, it does not support copy operations for data ranges, restricting automatic data management capabilities, for example for automatic batch processing.</p>

<h4 id="hardware-support-and-code-generation">Hardware Support and Code Generation:</h4>
<p>Babylon HAT currently lacks code generation and a runtime orchestrator for 
devices other than GPUs. Compared to TornadoVM, which supports GPUs from multiple vendors (Intel, NVIDIA, and AMD), CPUs, FPGAs, and even RISC-V accelerators, Babylon’s hardware support is considerably narrower. While future expansion is likely, the current limitations restrict its applicability. The absence of a code optimizer could also limit its performance potential on specialized hardware accelerators [4].</p>

<h4 id="compiler-optimizations">Compiler Optimizations:</h4>
<p>Babylon does not include an optimizing compiler, at least for now. 
In contrast, TornadoVM extends the state-of-the-art open-source <a href="https://github.com/oracle/graal/tree/master/compiler">Graal JIT compiler</a> with new compiler 
optimization pipelines targeted at GPUs, FPGAs and multi-core CPUs, tuning loop ordering, automatically using fast intrinsics, automatically exploiting local/shared memory, etc.</p>

<h4 id="parallelism-and-api-complexity">Parallelism and API Complexity:</h4>
<p>Babylon HAT lacks native support for 2D and 3D parallelism (or 2D and 3D ranges). 
While this seems a relatively straightforward feature to implement in the future, its current absence restricts the efficient parallelization of multi-dimensional problems. 
The HAT API, with its Range programming model, requires developers to possess expertise in GPU programming models like CUDA, OpenCL, or oneAPI. While developers with this background can quickly become productive, those without it may face a steep learning curve.</p>

<p>This contrasts with TornadoVM’s dual API approach: a high-level annotation-based system for newcomers and a low-level Kernel API (similar to Babylon’s Range API) for expert developers. I think this dual approach can accommodate a broader range of developer expertise, as the sketch below illustrates.</p>
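
<p>Here is a minimal sketch of the two TornadoVM styles for the same row-parallel loop. It is based on the standard TornadoVM API as used in the diff above; the import paths and the <code class="language-plaintext highlighter-rouge">Matrix2DFloat</code> accessors correspond to TornadoVM 1.x and should be checked against the version you use:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import uk.ac.manchester.tornado.api.KernelContext;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.types.matrix.Matrix2DFloat;

public class MxMStyles {

    // High-level style: annotate the loop and let the TornadoVM JIT parallelize it.
    public static void mxmLoop(Matrix2DFloat a, Matrix2DFloat b, Matrix2DFloat c, final int size) {
        for (@Parallel int i = 0; i &lt; size; i++) {
            for (int j = 0; j &lt; size; j++) {
                float sum = 0.0f;
                for (int k = 0; k &lt; size; k++) {
                    sum += a.get(i, k) * b.get(k, j);
                }
                c.set(i, j, sum);
            }
        }
    }

    // Kernel API style: explicit thread indexing, similar to Babylon/HAT's KernelContext.
    public static void mxmKernel(KernelContext context, Matrix2DFloat a, Matrix2DFloat b, Matrix2DFloat c, final int size) {
        int i = context.globalIdx;   // one thread per row
        for (int j = 0; j &lt; size; j++) {
            float sum = 0.0f;
            for (int k = 0; k &lt; size; k++) {
                sum += a.get(i, k) * b.get(k, j);
            }
            c.set(i, j, sum);
        }
    }
}
</code></pre></div></div>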

<h3 id="current-limitations-in-tornadovm-vs-babylonhat">Current Limitations in TornadoVM vs Babylon/HAT</h3>

<p>TornadoVM is not perfect, by any means. It is also in continuous development and improving with every new version.</p>

<h4 id="support-for-custom-data-types">Support for Custom Data Types:</h4>
<p>The main limitation of TornadoVM is the lack of support for custom user-defined data types that are 
compatible between Java and hardware accelerators. 
HAT’s <code class="language-plaintext highlighter-rouge">iFaceMapper</code> is a promising approach to program and handle efficient data structures compatible between hardware 
accelerators and the Java runtime.</p>

<h4 id="new-apis-and-data-types">New APIs and Data Types:</h4>
<p>This also applies to Babylon/HAT, but since I am more involved in the TornadoVM project, 
I can speak to it here. Offering new APIs and new types, although crucial to achieving performance, comes at the cost of developers having to 
learn them. In my view, if these new interfaces become part of the JDK, it will be easier to adopt these types of technologies.</p>

<h4 id="code-generation-of-structure-programming-languages">Code Generation of Structure Programming Languages:</h4>
<p>Code generation in TornadoVM is tricky, and for the OpenCL C backend, especially tricky. Going to low-level details, TornadoVM generates code from  the Low-Tier in Graal IR, an unstructured flow IR [5].  The challenge here is to generate a structured OpenCL C kernel from an unstructured flow graph.  Thus, it is sometimes difficult to generate correct code. A better target, and an easier target,  for TornadoVM is CUDA PTX, and SPIR-V, instead of OpenCL C. However, not all vendors (NVIDIA GPUs for example), allow to run SPIR-V for OpenCL. Since Babylon generates OpenCL C code from a close-to-an-AST form, it will be easier to generate correct OpenCL C code.</p>

<h4 id="maintenance-support">Maintenance Support:</h4>
<p>The fact that TornadoVM offers more backends and supports more devices also comes 
with a maintenance cost. For a small team like TornadoVM’s, there is always a tradeoff between offering new features and keeping TornadoVM working on all possible devices, architectures and operating systems. This limitation, although not one of design, cannot be overlooked.</p>

<p>I would like this to be an active discussion. Do you know/do you see other limitations? Let me know in the comments.</p>

<h2 id="conclusions-and-final-thoughts">Conclusions and Final Thoughts</h2>

<p>Babylon, through its enhanced reflection API and the HAT subproject, offers a very interesting approach to GPU programming within Java. By enabling direct manipulation of code models at runtime, it facilitates the dynamic generation of GPU code.</p>

<p>This article is a brief introduction to Babylon and GPU programming via the HAT project, 
as well as an overview of its current performance and its similarities and differences compared to TornadoVM. All of this from the perspective of a person directly involved in GPU programming for Java for the past 12+ years (time flies!).</p>

<p>I would like to see HAT become an incubator OpenJDK project in the future for the enhancement of the Java platform, allowing Java developers to run not only on modern GPUs, but also on upcoming accelerators (e.g., new AI accelerators). Babylon/HAT, in my opinion, is a step towards the unification and consolidation of APIs and interfaces that help 
vendors and implementers (like TornadoVM) stay as close as possible to Java while offering high performance.</p>

<p>On that front, I see HAT borrowing ideas from the research done in projects such as TornadoVM, Aparapi and others. For instance, as Gary Frost (main software architect of the HAT project and creator of Aparapi) <a href="https://www.youtube.com/watch?v=lbKBu3lTftc">acknowledged</a>, the HAT Accelerator and Compute-Context API were inspired by the TornadoVM API. Besides, I see ideas borrowed from the Aparapi project.</p>

<p>As I briefly mentioned, TornadoVM has served not only as an example, but also as a technology enabler, allowing HAT developers to write a SPIR-V backend using the Java framework we implemented to enable the SPIR-V backend in TornadoVM.</p>

<h2 id="discussions">Discussions</h2>

<p>If you are interested, let’s keep the discussions active:</p>

<p><a href="https://github.com/jjfumero/jjfumero.github.io/discussions/14">https://github.com/jjfumero/jjfumero.github.io/discussions/14</a></p>

<h2 id="links">Links</h2>

<p>[1] <a href="https://mail.openjdk.org/pipermail/discuss/2023-September/006226.html">https://mail.openjdk.org/pipermail/discuss/2023-September/006226.html</a></p>

<p>[2] <a href="https://openjdk.org/projects/babylon/articles/code-models">https://openjdk.org/projects/babylon/articles/code-models</a></p>

<p>[3] <a href="https://openjdk.org/projects/babylon/articles/linq">https://openjdk.org/projects/babylon/articles/linq</a></p>

<p>[4] <a href="https://jjfumero.github.io/posts/2024/12/17/tornadovm-vs-opencl">https://jjfumero.github.io/posts/2024/12/17/tornadovm-vs-opencl</a></p>

<p>[5] <a href="https://dl.acm.org/doi/pdf/10.1145/2816707.2816715">https://dl.acm.org/doi/pdf/10.1145/2816707.2816715</a></p>]]></content><author><name>Juan Fumero, PhD</name><email>juan@jfumero.dev</email></author><category term="Babylon" /><category term="GPUs" /><category term="HAT" /><category term="OpenJDK" /><category term="TornadoVM" /><category term="Performance" /><summary type="html"><![CDATA[Babylon and Programming for GPUs: introductions and comparisons with TornadoVM]]></summary></entry><entry><title type="html">Fixing libcurl conflicts in Fedora 41</title><link href="https://jfumero.dev/posts/2025/01/20/fedora-libcurl-issue" rel="alternate" type="text/html" title="Fixing libcurl conflicts in Fedora 41" /><published>2025-01-20T00:00:00+00:00</published><updated>2025-01-20T00:00:00+00:00</updated><id>https://jfumero.dev/posts/2025/01/20/fedora_libcurl-issue</id><content type="html" xml:base="https://jfumero.dev/posts/2025/01/20/fedora-libcurl-issue"><![CDATA[<p>Recently, I came across this error in Fedora 41, and I am not sure why the OS installed this library using different versions.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>dnf update
Updating and loading repositories:
Repositories loaded.
Problem: installed package libcurl-minimal-8.9.1-3.fc41.x86_64 conflicts with libcurl<span class="o">(</span>x86-64<span class="o">)</span> provided by libcurl-8.9.1-3.fc41.x86_64 from updates
  - libcurl-8.9.1-3.fc41.i686 from updates has inferior architecture
  - cannot <span class="nb">install </span>the best update candidate <span class="k">for </span>package libcurl-minimal-8.9.1-3.fc41.x86_64
  - cannot <span class="nb">install </span>the best update candidate <span class="k">for </span>package libcurl-8.9.1-2.fc41.i686

Package                                          Arch         Version                                           Repository                     Size
Skipping packages with conflicts:
 libcurl                                         x86_64       8.9.1-3.fc41                                      updates                   809.3 KiB

Nothing to <span class="k">do</span><span class="nb">.</span>
</code></pre></div></div>

<p>Fortunately, there is a solution to this. Based on this <a href="https://discussion.fedoraproject.org/t/how-to-dnf-automatic-in-fedora-41/142733/5">comment from the Fedora forums</a>, we can swap the conflicting package for the full library.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>dnf swap libcurl-minimal libcurl
Updating and loading repositories:
Repositories loaded.
Package <span class="s2">"libcurl-8.9.1-2.fc41.i686"</span> is already installed.

Package                                          Arch         Version                                           Repository                     Size
Removing:
 libcurl-minimal                                 x86_64       8.9.1-3.fc41                                      updates                   641.2 KiB
Downgrading:
 curl                                            x86_64       8.9.1-2.fc41                                      fedora                    796.2 KiB
   replacing curl                                x86_64       8.9.1-3.fc41                                      updates                   793.5 KiB
 libcurl-devel                                   x86_64       8.9.1-2.fc41                                      fedora                      1.3 MiB
   replacing libcurl-devel                       x86_64       8.9.1-3.fc41                                      updates                     1.3 MiB
Installing dependencies:
 libcurl                                         x86_64       8.9.1-2.fc41                                      fedora                    818.1 KiB

Transaction Summary:
 Installing:         1 package
 Replacing:          2 packages
 Removing:           1 package
 Downgrading:        2 packages

Total size of inbound packages is 2 MiB. Need to download 2 MiB.
After this operation, 180 KiB extra will be used <span class="o">(</span><span class="nb">install </span>3 MiB, remove 3 MiB<span class="o">)</span><span class="nb">.</span>
Is this ok <span class="o">[</span>y/N]: y
<span class="o">[</span>1/3] curl-0:8.9.1-2.fc41.x86_64                                                                           100% | 283.1 KiB/s | 315.1 KiB |  00m01s
<span class="o">[</span>2/3] libcurl-0:8.9.1-2.fc41.x86_64                                                                        100% | 239.2 KiB/s | 361.9 KiB |  00m02s
<span class="o">[</span>3/3] libcurl-devel-0:8.9.1-2.fc41.x86_64                                                                  100% | 547.2 KiB/s | 872.8 KiB |  00m02s
<span class="nt">---------------------------------------------------------------------------------------------------------------------------------------------------</span>
<span class="o">[</span>3/3] Total                                                                                                100% | 820.5 KiB/s |   1.5 MiB |  00m02s
Running transaction
<span class="o">[</span>1/8] Verify package files                                                                                 100% | 600.0   B/s |   3.0   B |  00m00s
<span class="o">[</span>2/8] Prepare transaction                                                                                  100% |  23.0   B/s |   6.0   B |  00m00s
<span class="o">[</span>3/8] Installing libcurl-0:8.9.1-2.fc41.x86_64                                                             100% |  57.1 MiB/s | 819.2 KiB |  00m00s
<span class="o">[</span>4/8] Downgrading libcurl-devel-0:8.9.1-2.fc41.x86_64                                                      100% |   7.1 MiB/s |   1.4 MiB |  00m00s
<span class="o">[</span>5/8] Downgrading curl-0:8.9.1-2.fc41.x86_64                                                               100% |  55.7 MiB/s | 798.6 KiB |  00m00s
<span class="o">[</span>6/8] Removing libcurl-devel-0:8.9.1-3.fc41.x86_64                                                         100% | 126.8 KiB/s | 649.0   B |  00m00s
<span class="o">[</span>7/8] Removing curl-0:8.9.1-3.fc41.x86_64                                                                  100% |   8.3 KiB/s |  17.0   B |  00m00s
<span class="o">[</span>8/8] Removing libcurl-minimal-0:8.9.1-3.fc41.x86_64                                                       100% |  22.0   B/s |   7.0   B |  00m00s
Complete!
</code></pre></div></div>

<p>And then, you can update the system as usual:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>dnf update
Updating and loading repositories:
Repositories loaded.
Package                                          Arch         Version                                           Repository                     Size
Upgrading:
 curl                                            x86_64       8.9.1-3.fc41                                      updates                   793.5 KiB
   replacing curl                                x86_64       8.9.1-2.fc41                                      fedora                    796.2 KiB
 libcurl                                         i686         8.9.1-3.fc41                                      updates                   836.9 KiB
   replacing libcurl                             i686         8.9.1-2.fc41                                      fedora                    846.1 KiB
 libcurl                                         x86_64       8.9.1-3.fc41                                      updates                   809.3 KiB
   replacing libcurl                             x86_64       8.9.1-2.fc41                                      fedora                    818.1 KiB
 libcurl-devel                                   x86_64       8.9.1-3.fc41                                      updates                     1.3 MiB
   replacing libcurl-devel                       x86_64       8.9.1-2.fc41                                      fedora                      1.3 MiB

Transaction Summary:
 Upgrading:          4 packages
 Replacing:          4 packages
</code></pre></div></div>]]></content><author><name>Juan Fumero, PhD</name><email>juan@jfumero.dev</email></author><category term="Fedora" /><category term="dnf" /><category term="issue" /><summary type="html"><![CDATA[Fixing libcurl conflicts in Fedora 41]]></summary></entry></feed>