QEMU Accelerator Technical Documentation
The QEMU Accelerator (KQEMU) is a driver allowing a user application to run x86 code in a Virtual Machine (VM). The code can be either user or kernel code, in 64-, 32- or 16-bit protected mode. KQEMU is very similar in essence to the Linux vm86 system call, but it adds some new concepts to improve memory handling.
KQEMU has been ported to many host OSes (currently Linux, Windows, FreeBSD and Solaris). It can execute code from many guest OSes (e.g. Linux, Windows 2000/XP) even if the host CPU does not support hardware virtualization.
In this document, we assume that the reader has a good knowledge of the x86 processor and of the problems associated with the virtualization of x86 code.
We describe version 1.3.0 of the Linux implementation. The implementations on other OSes use the same calls, so they can be understood by reading the Linux API specification.
KQEMU manipulates three kinds of addresses:
KQEMU has a physical page table which is used to associate a RAM address or a device I/O address range to a given physical page. It also tells if a given RAM address is visible as read-only memory. The same RAM address can be mapped at several different physical addresses. Only 4 GB of physical address space is supported in the current KQEMU implementation. Hence the bits of order >= 32 of the physical addresses are ignored.
The physical page table has the following structure:
phys_to_ram_map is a pointer to an array of 1024 pointers. If phys_to_ram_map[a] is NULL, then the physical memory range (a << 22) to ((a + 1) << 22) is unassigned. Otherwise, it points to an array of 1024 32-bit RAM addresses. phys_to_ram_map[a][b] describes the mapping of the 4K physical page (a << 22) | (b << 12). The bits from 4 to 12 give the device type. The following devices are defined:
IO_MEM_RAM (0)
IO_MEM_ROM (1)
IO_MEM_UNASSIGNED (2)
All other device types are handled by KQEMU as unassigned memory.
In the current implementation, KQEMU does not support dynamic modification of the physical page table by the client.
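The two-level lookup described above can be sketched as follows. This is only an illustration of the index arithmetic, not KQEMU's own code: the function name phys_page_entry is hypothetical, and the returned entry is left undecoded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: split a physical address into the two table
 * indices and return a pointer to the matching entry, or NULL if the
 * 4 MB range containing the address is unassigned. */
static const uint32_t *phys_page_entry(uint32_t *const *phys_to_ram_map,
                                       uint64_t paddr)
{
    assert(phys_to_ram_map != NULL);

    uint32_t p = (uint32_t)paddr;      /* bits of order >= 32 are ignored */
    uint32_t a = p >> 22;              /* index into the array of 1024 pointers */
    uint32_t b = (p >> 12) & 0x3ff;    /* index into the array of 1024 entries */

    if (phys_to_ram_map[a] == NULL)
        return NULL;                   /* whole (a << 22) range is unassigned */
    return &phys_to_ram_map[a][b];
}
```

The low 12 bits of the address are the offset inside the 4K page and play no part in the lookup.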
It is very important for the VM to be able to tell if a given RAM page has been modified. It can be used to optimize VGA refreshes, to flush a dynamic translator cache (when used with QEMU), to handle live migration or to optimize MMU emulation.
In KQEMU, each RAM page has an associated dirty byte in the array init_params.ram_dirty. The dirty byte is set to 0xff if the corresponding RAM page is modified. That way, at most 8 clients can manage a dirty bit in each page. KQEMU reserves one dirty bit, 0x04, for its internal use.
The client must notify KQEMU if some entries of the array init_params.ram_dirty were modified from 0xff to a different value. The addresses of the corresponding RAM pages are stored by the client in the array init_params.ram_pages_to_update.
The client must also notify KQEMU if a RAM page has been modified independently of the init_params.ram_dirty state. It is done with the init_params.modified_ram_pages array.
Symmetrically, KQEMU notifies the client if a RAM page has been modified with the init_params.modified_ram_pages array. The client can use this information for example to invalidate a dynamic translation cache.
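As an illustration of the dirty byte protocol, here is a minimal sketch of a client clearing its own dirty bit and recording the page when the byte leaves the 0xff state. The function name, the capacity parameter and the assumption that a page's RAM address is its index shifted left by 12 are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT    12     /* 4K pages, as in the text */
#define SKETCH_KQEMU_BIT     0x04   /* bit reserved by KQEMU, per the text */

/* Clear the client's dirty bit(s) for one RAM page. If the byte was
 * fully dirty (0xff) before the change, record the page's RAM address
 * so that KQEMU can be notified of the transition.
 * Returns 0 on success, -1 if the notification array is full. */
static int clear_dirty_bits(uint8_t *ram_dirty, unsigned long page_index,
                            uint8_t client_mask,
                            unsigned long *ram_pages_to_update,
                            unsigned int *nb_to_update,
                            unsigned int capacity)
{
    assert(ram_dirty && ram_pages_to_update && nb_to_update);
    assert((client_mask & SKETCH_KQEMU_BIT) == 0);  /* not ours to clear */

    if (ram_dirty[page_index] == 0xff && client_mask != 0) {
        if (*nb_to_update >= capacity)
            return -1;              /* caller must handle the overflow */
        /* Assumption: RAM address of a page = page index << 12. */
        ram_pages_to_update[(*nb_to_update)++] =
            page_index << SKETCH_PAGE_SHIFT;
    }
    ram_dirty[page_index] &= (uint8_t)~client_mask;
    return 0;
}
```

A page whose byte was already different from 0xff is not recorded again, since only the transition away from 0xff needs to be reported.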
A user client wishing to create a new virtual machine must open the device `/dev/kqemu'. There is no hard limit on the number of virtual machines that can be created and run at the same time, except for the available memory.
KQEMU_GET_VERSION ioctl
It returns the KQEMU API version as an int. The client must use it to determine if it is compatible with the KQEMU driver.
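A minimal sketch of opening the device and checking the API version could look like this. The helper name is hypothetical. The request code and the expected version are passed in by the caller because their values come from the KQEMU header and are not given in this document; it is also an assumption that the version is written through a pointer argument.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: open the KQEMU device and verify the API version.
 * get_version_request stands for the KQEMU_GET_VERSION request code and
 * expected_version for the version the client was built against.
 * Returns an open file descriptor, or -1 on any failure or mismatch. */
static int kqemu_open_checked(const char *path,
                              unsigned long get_version_request,
                              int expected_version)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    int version = 0;
    /* Assumption: the driver stores the version in the pointed-to int. */
    if (ioctl(fd, get_version_request, &version) < 0 ||
        version != expected_version) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Each successful call yields a separate descriptor, and therefore a separate VM to initialize.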
KQEMU_INIT ioctl
Input parameter: struct kqemu_init init_params
It must be called once to initialize the VM. The following structure is used as input parameter:
    struct kqemu_init {
        uint8_t *ram_base;
        unsigned long ram_size;
        uint8_t *ram_dirty;
        uint32_t **phys_to_ram_map;
        unsigned long *pages_to_flush;
        unsigned long *ram_pages_to_update;
        unsigned long *modified_ram_pages;
    };
The pointers ram_base, ram_dirty, phys_to_ram_map, pages_to_flush, ram_pages_to_update and modified_ram_pages must be page-aligned and must point to user-allocated memory.
On Linux, due to a kernel bug related to memory swapping, the corresponding memory must be mmapped from a file. We plan to remove this restriction in a future implementation.
ram_size must be a multiple of 4K and is the quantity of RAM allocated to the VM.
ram_base is a pointer to the VM RAM. It must contain at least ram_size bytes.
ram_dirty is a pointer to a byte array of length ram_size/4096. Each byte indicates if the corresponding VM RAM page has been modified (see section 2.2 RAM page dirtiness).
phys_to_ram_map is a pointer to an array of 1024 pointers. It defines a mapping from the VM physical addresses to the RAM addresses (see section 2.1 RAM, Physical and Virtual addresses).
pages_to_flush is a pointer to an array of KQEMU_MAX_PAGES_TO_FLUSH longs. It is used to indicate which TLBs must be flushed before executing code in the VM.
ram_pages_to_update is a pointer to an array of KQEMU_MAX_RAM_PAGES_TO_UPDATE longs. It is used to notify the VM that some RAM pages have been dirtied.
modified_ram_pages is a pointer to an array of KQEMU_MAX_MODIFIED_RAM_PAGES longs. It is used to notify the VM or the client that RAM pages have been modified.
The value 0 is returned if the ioctl succeeded.
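The sizing and alignment rules for the KQEMU_INIT buffers can be sketched as below. This is an assumption-laden illustration, not the reference client: the helper names are hypothetical, each buffer is mapped from an unlinked temporary file under /tmp to follow the Linux note, and the three array lengths are passed as parameters because the values of the KQEMU_MAX_* constants are not given here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Same layout as the structure documented for KQEMU_INIT. */
struct kqemu_init {
    uint8_t *ram_base;
    unsigned long ram_size;
    uint8_t *ram_dirty;
    uint32_t **phys_to_ram_map;
    unsigned long *pages_to_flush;
    unsigned long *ram_pages_to_update;
    unsigned long *modified_ram_pages;
};

/* Map `size` bytes from an unlinked temporary file. mmap() returns
 * page-aligned, zero-filled memory. Returns NULL on failure. */
static void *map_from_file(size_t size)
{
    char path[] = "/tmp/kqemu-sketch-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return NULL;
    unlink(path);
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}

/* Fill `p` for a VM with `ram_size` bytes of RAM. n_flush, n_update and
 * n_modified stand for KQEMU_MAX_PAGES_TO_FLUSH,
 * KQEMU_MAX_RAM_PAGES_TO_UPDATE and KQEMU_MAX_MODIFIED_RAM_PAGES.
 * Returns 0 on success, -1 on failure (mappings are not released). */
static int kqemu_init_fill(struct kqemu_init *p, unsigned long ram_size,
                           size_t n_flush, size_t n_update, size_t n_modified)
{
    assert(p != NULL);
    if (ram_size == 0 || ram_size % 4096 != 0)
        return -1;                      /* must be a multiple of 4K */

    p->ram_size = ram_size;
    p->ram_base = map_from_file(ram_size);
    p->ram_dirty = map_from_file(ram_size / 4096);   /* one byte per page */
    p->phys_to_ram_map = map_from_file(1024 * sizeof(uint32_t *));
    p->pages_to_flush = map_from_file(n_flush * sizeof(unsigned long));
    p->ram_pages_to_update = map_from_file(n_update * sizeof(unsigned long));
    p->modified_ram_pages = map_from_file(n_modified * sizeof(unsigned long));

    if (!p->ram_base || !p->ram_dirty || !p->phys_to_ram_map ||
        !p->pages_to_flush || !p->ram_pages_to_update ||
        !p->modified_ram_pages)
        return -1;
    return 0;
}
```

Because the mappings start zero-filled, every phys_to_ram_map entry is initially NULL, that is, every physical range is unassigned until the client fills the table in.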
KQEMU_MODIFY_RAM_PAGE ioctl
Input parameter: int nb_pages
Notify the VM that nb_pages RAM pages were modified. The corresponding RAM page addresses are written by the client in the init_params.modified_ram_pages array given with the KQEMU_INIT ioctl.
Note: This ioctl currently does nothing, but clients must use it for later compatibility.
KQEMU_EXEC ioctl
Input/Output parameter: struct kqemu_cpu_state cpu_state
Structure definitions:
    struct kqemu_segment_cache {
        uint32_t selector;
        unsigned long base;
        uint32_t limit;
        uint32_t flags;
    };

    struct kqemu_cpu_state {
    #ifdef __x86_64__
        unsigned long regs[16];
    #else
        unsigned long regs[8];
    #endif
        unsigned long eip;
        unsigned long eflags;

        uint32_t dummy0, dummy1, dumm2, dummy3, dummy4;

        struct kqemu_segment_cache segs[6]; /* selector values */
        struct kqemu_segment_cache ldt;
        struct kqemu_segment_cache tr;
        struct kqemu_segment_cache gdt; /* only base and limit are used */
        struct kqemu_segment_cache idt; /* only base and limit are used */

        unsigned long cr0;
        unsigned long dummy5;
        unsigned long cr2;
        unsigned long cr3;
        unsigned long cr4;
        uint32_t a20_mask;

        /* sysenter registers */
        uint32_t sysenter_cs;
        uint32_t sysenter_esp;
        uint32_t sysenter_eip;
        uint64_t efer;
        uint64_t star;
    #ifdef __x86_64__
        unsigned long lstar;
        unsigned long cstar;
        unsigned long fmask;
        unsigned long kernelgsbase;
    #endif
        uint64_t tsc_offset;

        unsigned long dr0;
        unsigned long dr1;
        unsigned long dr2;
        unsigned long dr3;
        unsigned long dr6;
        unsigned long dr7;

        uint8_t cpl;
        uint8_t user_only;

        uint32_t error_code;
        unsigned long next_eip;
        unsigned int nb_pages_to_flush;
        long retval;
        unsigned int nb_ram_pages_to_update;
        unsigned int nb_modified_ram_pages;
    };
Execute x86 instructions in the VM context. The full x86 CPU state is defined in this structure. It contains in particular the values of the 8 (or 16 for x86_64) general-purpose registers, the contents of the segment caches, the RIP and EFLAGS values, etc.
If cpu_state.user_only is 1, a user-only emulation is done. cpu_state.cpl must be 3 in that case.
KQEMU_EXEC does the following:

1. It takes into account the cpu_state.nb_ram_pages_to_update RAM pages from the array init_params.ram_pages_to_update. If cpu_state.nb_ram_pages_to_update has the value KQEMU_RAM_PAGES_UPDATE_ALL, it means that all the RAM pages may have been dirtied. The array init_params.ram_pages_to_update is ignored in that case.
2. It takes into account that the cpu_state.nb_modified_ram_pages RAM pages from the array init_params.modified_ram_pages were modified by the client.
3. It flushes the virtual CPU TLBs listed in the array init_params.pages_to_flush of length cpu_state.nb_pages_to_flush. If cpu_state.nb_pages_to_flush is KQEMU_FLUSH_ALL, all the TLBs are flushed. The array init_params.pages_to_flush is ignored in that case.
4. It loads the virtual CPU state from cpu_state.
5. It executes code in the VM context.
6. It saves the virtual CPU state to cpu_state.
7. It stores the reason why the execution was stopped in cpu_state.retval.
8. It updates cpu_state.nb_pages_to_flush and init_params.pages_to_flush to notify the client that some virtual CPU TLBs were flushed. The client can use this notification to synchronize its own virtual TLBs with KQEMU.
9. It sets cpu_state.nb_ram_pages_to_update to 1 if some RAM dirty bytes transitioned from dirty (0xff) to a non-dirty value. Otherwise, cpu_state.nb_ram_pages_to_update is set to 0.
10. It updates cpu_state.nb_modified_ram_pages and init_params.modified_ram_pages to notify the client that some RAM pages were modified.
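Before calling KQEMU_EXEC, the client has to fill the notification arrays and their counts. A minimal sketch of such bookkeeping is given below, under the assumption that the "all" markers (KQEMU_FLUSH_ALL, KQEMU_RAM_PAGES_UPDATE_ALL) are larger than the array capacities. The function name queue_page and its parameters are hypothetical; the real constant values come from the KQEMU header.

```c
#include <assert.h>
#include <stddef.h>

/* Append `value` to a notification array holding at most `capacity`
 * entries. When the array is full, the count is replaced by
 * `all_marker`, which stands for KQEMU_FLUSH_ALL or
 * KQEMU_RAM_PAGES_UPDATE_ALL depending on the array; the array
 * contents are then ignored. */
static void queue_page(unsigned long *array, unsigned int *count,
                       unsigned int capacity, unsigned int all_marker,
                       unsigned long value)
{
    assert(array != NULL && count != NULL);
    /* Assumption of this sketch: the marker cannot be a valid count. */
    assert(all_marker > capacity);

    if (*count == all_marker)
        return;                     /* already means "everything" */
    if (*count >= capacity) {
        *count = all_marker;        /* overflow: fall back to "all" */
        return;
    }
    array[(*count)++] = value;
}
```

With this helper the client would queue virtual addresses into init_params.pages_to_flush and copy the final count into cpu_state.nb_pages_to_flush before the call, then reset its count afterwards.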
cpu_state.retval indicates the reason why the execution was stopped:
KQEMU_RET_EXCEPTION | n
cpu_state.error_code contains the exception error code if it is needed. It should be noted that in user-only emulation, KQEMU handles no exceptions by itself.
KQEMU_RET_INT | n
cpu_state.next_eip contains the value of RIP after the instruction raising the interrupt. cpu_state.eip contains the value of RIP at the instruction raising the interrupt.
KQEMU_RET_SOFTMMU
KQEMU_RET_INTR
KQEMU_RET_SYSCALL
cpu_state.next_eip contains the value of RIP after the instruction. cpu_state.eip contains the RIP of the instruction.
KQEMU_RET_ABORT
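A client typically dispatches on cpu_state.retval after each KQEMU_EXEC. The sketch below shows one way to separate the "| n" codes from the others. It rests on assumptions that this document does not confirm: the SKETCH_ values are placeholders rather than the real KQEMU_RET_* constants, and n is assumed to occupy the 8 low-order bits.

```c
#include <assert.h>

/* Placeholder values for illustration only; the real KQEMU_RET_*
 * constants come from the KQEMU header and may differ. */
#define SKETCH_RET_EXCEPTION 0x0000L
#define SKETCH_RET_INT       0x0100L

enum stop_kind { STOP_EXCEPTION, STOP_INT, STOP_OTHER };

struct stop_reason {
    enum stop_kind kind;
    int n;                  /* exception or interrupt number, if any */
};

/* Hypothetical decoder: classify a retval of the form CODE | n. */
static struct stop_reason decode_retval(long retval)
{
    struct stop_reason r = { STOP_OTHER, 0 };

    if (retval < 0)
        return r;           /* negative codes carry no number */

    long code = retval & ~0xffL;   /* assumed: n is in the low 8 bits */
    if (code == SKETCH_RET_EXCEPTION) {
        r.kind = STOP_EXCEPTION;
        r.n = (int)(retval & 0xff);
    } else if (code == SKETCH_RET_INT) {
        r.kind = STOP_INT;
        r.n = (int)(retval & 0xff);
    }
    assert(r.n >= 0 && r.n <= 255);
    return r;
}
```

For an exception the client would then read cpu_state.error_code, and for an interrupt it would read cpu_state.next_eip, as described above.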
The main priorities when implementing KQEMU were simplicity and security. Unlike other virtualization systems, it does not do any dynamic translation or code patching.
Note 1: KQEMU does not currently use the hardware virtualization features of newer x86 CPUs. We expect that the limitations would be different in that case.
Note 2: KQEMU supports both x86 and x86_64 CPUs.
Before entering the VM, the following conditions must be satisfied:
If EFLAGS.IF is set, the following assumptions are made on the executing code:
If EFLAGS.IF is reset, the code is interpreted, so the VM code can be accurately executed. Some instructions trap to the user space emulator because the interpreter does not handle them. A limitation of the interpreter is that segment limits are currently not always tested.
The VM code is always run with CPL = 3 on the host, so the VM code has no more privilege than regular user code.
The MMU is used to protect the memory used by the KQEMU monitor. That way, no segment limit patching is necessary. Moreover, the guest OS is free to use any virtual address, in particular the ones near the start or the end of the virtual address space. The price to pay is that CR3 must be modified at every emulated system call because different page tables are needed for user and kernel modes.
This document was generated on 6 February 2007 using texi2html 1.56k.