
RAVM: compile once and run anywhere

Revision 16
© 2013-2018 by Zack Smith. All rights reserved.

What is RAVM?

RAVM is my nascent virtual machine project, intended to support write-once, someday-run-anywhere execution of bytecode.

RAVM stands for RISC-Approximating Virtual Machine, meaning it uses a load-store architecture and a simplified instruction format to maximize performance.

The project currently consists of:

  • the virtual machine itself, which I coded mostly in x86 assembly language
  • a simple assembler (RASM) that produces bytecode, which is to say the virtual machine language that RAVM runs.

What is the justification for RAVM?

It's the result of a series of rational deductions:

Q: What causes poor security in Java, Flash and JavaScript?
A: Exploits that target just-in-time (JIT) compilation, which requires converting data pages into executable pages.

Q: If many exploits rely on just-in-time compilation, what is the alternative?
A: Emulation.

Q: Is emulation fast, or is it slow?
A: It is typically slow (QEMU), sometimes unbearably slow (Bochs).

Q: What is the cause of the slowness of emulation?
A: It may be the complexity of certain instruction sets.

Q: Is emulation of x86 insufferably slow?
A: Typically yes.

Q: Is emulation of ARM also slow?
A: It is slow, but less so than x86.

Q: Can we devise or identify an instruction set that is faster to emulate?
A: Yes, surely.

Q: Should a fast-to-emulate instruction set be inspired by an existing instruction set that runs on custom hardware? For example the x86, ARM, or MIPS?
A: Instruction sets that are designed to be optimal in hardware will usually be complex to emulate in software, so I propose that the instruction set be designed primarily to be emulated quickly on real hardware.

Q: What makes an instruction set difficult and time-consuming to emulate?
A: Many things:
  • Variable-length instructions
  • The need to bit-shift and mask to get register numbers
  • Maintenance of flags
  • Support for different sizes of operands e.g. AL, AX, EAX, RAX.
  • Complex addressing modes such as segmentation, indirect with shifted offset, etc.
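
To illustrate the flags point in particular, here is a minimal C sketch of the extra work an emulator must do per ADD when it maintains flags; the state layout and names are hypothetical, not taken from any real VM:

    #include <stdint.h>

    /* Hypothetical emulator state: flags must be recomputed after most ops. */
    struct cpu {
        uint32_t reg[8];
        int zf, sf, cf;   /* zero, sign and carry flags */
    };

    /* ADD with flag maintenance: several extra operations per instruction. */
    static void add_with_flags(struct cpu *c, int dst, int src)
    {
        uint32_t a = c->reg[dst], b = c->reg[src];
        uint32_t r = a + b;
        c->reg[dst] = r;
        c->zf = (r == 0);
        c->sf = (r >> 31) & 1;
        c->cf = (r < a);              /* unsigned carry out */
    }

    /* ADD without flags: a single addition and store. */
    static void add_no_flags(struct cpu *c, int dst, int src)
    {
        c->reg[dst] += c->reg[src];
    }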

Q: Can we someday design a new instruction set that is so fast to emulate in software, that we can forever abandon JIT compilation?
A: That is the question that my experiments, described below, were meant to answer.

Q: But was it indeed proven?
A: I believe so.

Q: Why do you call this a series of rational deductions?
A: Because the thinking is not motivated by the profit motives of companies or individuals (i.e. irrationality), and the development of RAVM, RXVM, etc. proceeded as a series of scientific experiments.

Speed

As far as I know, RAVM implements the fastest useful emulated machine language on x86.

The speed with which a given machine language can be emulated is inversely related to the complexity of the encoding of the instruction set.

After much experimentation, I settled on an encoding that can be emulated faster than ARM or MIPS ever could be, and both of those are already considerably faster to emulate on x86 than x86 itself is.

RAVM has fewer decisions to make to execute an instruction, and fewer x86 instructions are needed overall to execute one RAVM instruction, compared to emulating ARM, MIPS or x86.

It should also use fewer cache lines, allowing for better performance.

Security

RAVM is fast despite the fact that I check for out-of-bounds accesses that could otherwise let malware compromise RAVM itself.

Simplicity

The instruction set encoding is really quite simple, and it would be easy to adapt a compiler such as LLVM to generate RAVM instructions.

Why emulate at all?

  • Emulators are portable and therefore architecture independent.
  • Emulation provides an isolated sandbox environment in which to run software.
  • Emulation avoids the security risks of virtualization e.g. breakout bugs.
  • Emulation avoids the security risks of Just In Time (JIT) compilation i.e. converting data pages to code pages.

Why did I start RAVM?

After hearing for the 100th time that browser-borne Java is a major attack vector, analogous to mosquitoes carrying malaria, I asked myself the question: Could I do better than Java? Rather than assume the affirmative or the negative without evidence, which would be irrational, I set about experimenting in order to answer this question.

In addition, it was my long-standing observation that VMs that emulate some form or another of either machine language or bytecode are always very inefficient. This is why Java bytecode execution was always painfully slow before JIT compilation.

There are two problems to solve:

  1. Achieving security
  2. Achieving fast execution

One without the other limits usefulness.

In the early years, Java bytecode execution was neither fast nor secure.

Even today, Java has fast execution only because of JIT (just-in-time) compilation, which is considered a good hack even though it adds to the complexity of the VM and opens a gaping hole through which exploits can flow.

JIT compilation relies on a major security weakness, namely the ability to convert a region of memory that is flagged as non-executable data into one that is executable code. It is a conversion of bytecode into runnable machine code. This breaks a cardinal rule of computer security: Never execute data.

Modern attackers take advantage of this capability when they deploy ROP and JOP attacks to overcome NX (no-execute) bit protection, and it permits them to perform code injection.

Operating systems should never allow applications to convert data areas into code areas, but they do so because of software like Java and other languages that do JIT compilation, such as JavaScript and Flash.
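
To make the data-to-code conversion concrete, here is a minimal sketch of what a JIT does on a Linux/POSIX system, using mmap and mprotect; the function name is mine, and real JITs are far more elaborate:

    #include <stddef.h>
    #include <string.h>
    #include <sys/mman.h>

    /* A JIT writes generated machine code into a writable data page, then
       asks the OS to make that page executable. This is exactly the
       data-to-code conversion described above. */
    void *jit_emit(const unsigned char *code, size_t len)
    {
        void *page = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED)
            return NULL;
        memcpy(page, code, len);                          /* fill the data page */
        if (mprotect(page, len, PROT_READ | PROT_EXEC))   /* data becomes code */
            return NULL;
        return page;                                      /* now callable */
    }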

The experiments

RAVM is the end result of a series of assembly-language experiments in which I devised and tested various instruction set architectures (ISAs), instruction word sizes, and various implementations of the interpreter code before settling on the current one.

I tried 8-bit, 16-bit and 32-bit instruction sizes.

I tried register sets with 8, 16, 32 and 256 registers.

RXVM: 8-bit instruction size experiment

The maximum performance that I saw, running all experiments on my Intel Core i5 laptop, was from a virtual machine that I called RXVM, which used 8-bit instructions and had only 8 registers, four of which resided within the 32-bit x86 register set. It peaked at about 800 MIPS on my 2.4 GHz Core i5, albeit only for an empty loop. Most arithmetic operations ran at half that speed. RXVM was (sometimes) fast but rather impractical, having only 8 registers. The rule of thumb is that most programs require at least 12 registers.

An 8-bit instruction size also presents a severe limitation on how many instructions can be encoded.

I believe the encoding scheme was like this:

Opcode class   Function
00             move register to register
01             add register to register
10             8 single-register ops e.g. increment, push, pop, load immediate
11             other

RXVM was fastest when using the four registers that resided inside x86 registers, which sped up their operations.

RXVM was fast in part because each 8-bit instruction served as an index into a jump table pointing to hard-coded operations. No register indices were ever shifted and masked. An instruction moving R3 to R7 actually executed different code than one moving R3 to R6.
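
The real RXVM was written in x86 assembly, but a rough C sketch of that dispatch style looks like the following; the handler names and byte values are arbitrary placeholders of mine, and the full table population is omitted:

    #include <stdint.h>

    static uint32_t reg[8];            /* RXVM-like state: only 8 registers */

    typedef void (*handler)(void);

    /* One specialized snippet per instruction byte: the register pair is
       baked into the handler, so nothing is ever shifted or masked. */
    static void mov_r3_to_r7(void) { reg[7] = reg[3]; }
    static void mov_r3_to_r6(void) { reg[6] = reg[3]; }
    static void add_r1_to_r0(void) { reg[0] += reg[1]; }

    static handler dispatch[256] = {   /* placeholder encodings; the real table
                                          covered all 256 instruction bytes */
        [0x1f] = mov_r3_to_r7,
        [0x1e] = mov_r3_to_r6,
        [0x48] = add_r1_to_r0,
    };

    void run(const uint8_t *code, long count)
    {
        for (long i = 0; i < count; i++)
            dispatch[code[i]]();       /* the byte itself is the table index */
    }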

There was a risk of a bloated mass of code snippets implementing every instruction permutation that would be too large to fit into the L1 instruction cache.

I hypothesize that by using x64 assembly language code I could achieve a faster variant of RXVM, because I could put all register data in the Intel registers. Still, only 8 registers and few opcodes would be too limiting.

RYVM: 16-bit instruction size experiment

The 16-bit instruction word size turned out to be not as fast as 8-bit and not faster than the 32-bit size. Why? In a 16-bit word you can have either 4 or 5 bits to indicate which register an operand resides in. Therefore register indices and opcode have to be shifted and AND'd. That takes time.

In addition, with more registers, their data largely has to reside on the stack, which takes twice as long to access as keeping emulated registers' data inside x86 registers.
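
For comparison, a short C sketch of that per-instruction decode cost; the exact field layout here (a 6-bit opcode and two 5-bit register numbers) is my assumption:

    #include <stdint.h>

    /* Every field costs a shift and/or an AND before the operation itself
       can even begin. */
    static void decode16(uint16_t insn, unsigned *op, unsigned *r1, unsigned *r2)
    {
        *op = insn >> 10;             /* top 6 bits: opcode    */
        *r1 = (insn >> 5) & 0x1f;     /* next 5 bits: register */
        *r2 = insn & 0x1f;            /* low 5 bits: register  */
    }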

RZVM: 16-bit instruction size with large jump table

Couldn't the RXVM approach of using a jump table be used for 16-bit instructions? Yes. This would allow for 32 registers: two 5-bit register numbers and a 6-bit opcode fit into 16 bits. I briefly tried this but my tests showed poor performance on my Core i5 CPU. I didn't keep notes on this experiment, but I assume that having far more registers that mostly had to reside on the stack caused the slowdown.

A new experiment?

It might be worthwhile to try a variant on this idea on the x64 architecture in which not all of the 16-bit instruction is used, e.g. just 12 bits:
  • 7 bits encode two register operands of an 11-register set (√128 ≈ 11.3)
  • a 5-bit opcode encodes 32 operations.

This would require a 4096-entry jump table, which on a 64-bit CPU occupies 32 kB -- the entire L1 data cache on my Core i5.
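
A sketch of how such a 12-bit code could be packed into a table index; the particular packing shown here is only my assumption:

    #include <assert.h>

    /* Two register numbers 0..10 fit in 7 bits (11 * 11 = 121 <= 128), and a
       5-bit opcode gives 32 operations, for a 12-bit index: 4096 entries.
       With 8-byte pointers on a 64-bit CPU that is a 32 kB jump table. */
    static unsigned pack12(unsigned opcode, unsigned r1, unsigned r2)
    {
        assert(opcode < 32 && r1 < 11 && r2 < 11);
        return (opcode << 7) | (r1 * 11 + r2);    /* 0 .. 4088 */
    }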

If the total number of instructions could be reduced to free up more data cache space, that could allow for better performance.

If each code snippet implementing each instruction permutation occupied only a few bytes -- less than 8 -- all the snippets might fit in the Core i5's 32kB instruction cache. But 8 bytes leaves enough space for only about one instruction and a relative jump back to the main VM loop, or instead of a jump I could use a RET instruction.

But the good news is that all 11 registers would likely fit inside the x64 register set, allowing for very fast execution.

The implications of using a jump table and 11 registers:

  • 121 register-to-register MOVs.
  • 121 two-register ADDs.
  • 121 two-register SUBs.
  • 121 two-register XORs.
  • Etc.

RAVM: 32-bit instruction size experiment

The 32-bit instruction word size with 256 registers is fastest because the register numbers can be obtained very quickly on an x86: the shift-and-AND work is replaced by a single zero-extending move instruction, MOVZX.

For this reason, RAVM provides 256 registers, which take up 1024 bytes of the L1 data cache, and runs faster than RYVM. On my 2.4 GHz Core i5, RAVM can execute its bytecode at about 380 MIPS for most instructions. That includes full bounds checking at runtime for newly calculated memory addresses. Without checks it would be faster but less secure.

But really, 256 registers? Yes. This many registers, while not strictly necessary for most purposes, permits the fastest decoding of instructions by making use of MOVZX instead of shift and mask. To reduce the number would require AND'ing and that is time-consuming.
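
A rough C sketch of why byte-aligned fields decode so cheaply; the exact field order is my assumption, but the point is that each field is a plain zero-extending byte load (MOVZX on x86), with no shifting or masking:

    #include <stdint.h>

    /* One 32-bit instruction split into four byte-aligned fields. */
    struct insn {
        uint8_t opcode;
        uint8_t dst;       /* destination register, 0..255 */
        uint8_t src;       /* source register, 0..255 */
        uint8_t extra;     /* small immediate, or unused */
    };

    static uint32_t reg[256];          /* 1024 bytes of the L1 data cache */

    static void execute_add(const struct insn *i)
    {
        reg[i->dst] += reg[i->src];    /* each field access is a byte load */
    }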

RAVM

What CPUs was I targeting?

My focus was on current x86 CPUs running in 32-bit mode; however, these days I rarely run any code in 32-bit mode, so the next revision of RAVM should be 64-bit.

Early x86 CPUs were not the focus of my work because most consumers have newer Intel CPUs. Nor were AMD CPUs, as I simply don't have access to them.

The ARM processors are naturally of interest as well.

Achieving security

To protect against buffer overruns, RAVM does bounds-checking for four memory regions, each of which is kept separate in memory (a sketch of such a check follows the list):

  • The program stack
  • The code area
  • The data area
  • The register set
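
A minimal sketch in C of the kind of check meant here, shown for the data area; the struct and function names are mine, not RAVM's:

    #include <stdint.h>
    #include <string.h>

    struct region {        /* one guarded memory region */
        uint8_t *base;
        size_t   size;
    };

    /* Every newly calculated guest address is checked against its region
       before use, so an overrun in guest code cannot reach the VM itself. */
    static int load32(const struct region *data, uint32_t addr, uint32_t *out)
    {
        if ((uint64_t)addr + 4 > data->size)
            return -1;                      /* out of bounds: refuse the access */
        memcpy(out, data->base + addr, 4);
        return 0;
    }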

What is RAVM's architecture?

Registers:

256 32-bit registers.
A stack pointer.
An instruction pointer.

There are no CPU flags.
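
In C terms the entire architectural state is small; the struct below is only my own sketch, not RAVM's actual layout:

    #include <stdint.h>

    /* The complete register state: no flags register at all. */
    struct ravm_state {
        uint32_t reg[256];   /* 256 general-purpose 32-bit registers (1024 bytes) */
        uint32_t sp;         /* stack pointer */
        uint32_t ip;         /* instruction pointer */
    };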

Unlike the Java VM, RAVM does not implement an object-oriented machine language. No instructions take an object pointer. Instead RAVM implements a somewhat ordinary 32-bit processor and this helps to keep it simple, small, fast and secure.

Object-oriented programming is all well and good; however, it is not needed in every case.

The KISS rule applies. Beyond that, the goal is to put a wall of separation between machine code and high-level languages; conflating the two risks adding more security holes.

Instruction set architecture:

  • 32-bit instruction format.
  • 1- and 2-operand instructions.
  • Can support 3 operands in the future.
  • Presently only integer instructions.
  • Presently no vector instructions.

The next iteration of RAVM will provide 64-bit registers.

No flags? For now there are none, but some operations require a carry flag, so I may add one later.

Tools

RASM

My assembler is called rasm. It takes .asm files and outputs RAVM machine code.

What about a compiler?

At this time there is no compiler. I may eventually adapt a C compiler to produce RAVM machine code.

Downloads

x86 RAVM

Downloads available here

ARM RAVM

In mid-July 2013, I began a port of RAVM to the ARM processor, in the form of an iOS app. This effort has somewhat stalled.

Intel64 (x86_64) RAVM

This has not yet been started.

Contributing

There is a need for someone to adapt a C compiler (or a compiler for any other language) to generate RAVM bytecode. Perhaps LLVM could be adapted to this purpose.