Recently, I have been reading the book "QEMU/KVM Source Code Analysis and Application" by Li Qiang to learn about virtualization in the Linux kernel. I am summarizing the important points from the book in a series of notes; this series focuses on CPU virtualization in QEMU/KVM.
KVM is a kernel-based virtual machine monitor with a simple, clear architecture that reuses many facilities of the Linux kernel. This article introduces the initialization process of the KVM module.
KVM Source Code Organization
The code organization of KVM in the Linux kernel tree mainly includes two parts: general code and architecture-specific code.

KVM is essentially a virtualization abstraction solution, and current mainstream processor architectures, including x86, ARM, and RISC-V, each have their own virtualization architecture implementations. KVM, as an abstraction layer, masks the differences between the underlying virtualization implementations and provides a unified interface to user-space programs (mainly QEMU).
Common Code
The main KVM code lives in the kernel tree under the virt/kvm directory. It is the code common to all CPU architectures and corresponds to the source of kvm.ko.
Architecture-Specific Code
The CPU-architecture code is located under the arch/ directory; for example, the x86-specific code is under arch/x86/kvm. Furthermore, the same architecture may have multiple implementations. Under x86 there are CPU implementations from both Intel and AMD, so there are two implementations under the x86 directory:
- The arch/x86/kvm/vmx/ directory mainly contains vmx.c, corresponding to Intel's VT-x scheme; it is ultimately compiled into kvm-intel.ko;
- The arch/x86/kvm/svm/ directory mainly contains svm.c, corresponding to AMD's AMD-V scheme; it is ultimately compiled into kvm-amd.ko.
In addition, the architecture-specific code includes virtualization code for the interrupt controllers (ioapic.c and lapic.c), the performance monitoring unit (pmu.c), CPUID (cpuid.c), and so on.
As a side note, developers familiar with the Linux kernel should immediately recognize that this kind of source code organization is also common in other subsystems of the Linux kernel.
Every virtualization implementation in KVM (Intel and AMD) registers a kvm_x86_ops structure with the KVM module, so some functions in KVM act merely as a shell: they first call a kvm_arch_xxx function, i.e. an architecture-specific function, and if the kvm_arch_xxx function needs implementation-specific code, it invokes the corresponding callback in the kvm_x86_ops structure.
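To make this layering concrete, here is a minimal runnable sketch of the call chain. All names (x86_ops_sketch, vmx_vcpu_run_sketch, and so on) are hypothetical stand-ins for kvm_x86_ops, vmx_x86_ops, and the kvm_arch_xxx functions; this is illustration, not kernel code:

#include <stdio.h>

/* The vendor module (kvm-intel.ko / kvm-amd.ko) supplies these callbacks;
 * this struct stands in for kvm_x86_ops. */
struct x86_ops_sketch {
    void (*vcpu_run)(void);
};

/* Vendor-specific implementation, as vmx.c would provide. */
static void vmx_vcpu_run_sketch(void)
{
    puts("Intel VT-x implementation runs here");
}

/* Filled in at init time, like kvm_ops_update() copying runtime_ops. */
static struct x86_ops_sketch x86_ops = {
    .vcpu_run = vmx_vcpu_run_sketch,
};

/* Architecture-common layer (arch/x86/kvm): a kvm_arch_xxx-style
 * function that dispatches through the registered ops. */
static void kvm_arch_vcpu_run_sketch(void)
{
    x86_ops.vcpu_run();
}

/* Generic layer (virt/kvm): merely a shell around the arch function. */
static void kvm_vcpu_run_sketch(void)
{
    kvm_arch_vcpu_run_sketch();
}

int main(void)
{
    kvm_vcpu_run_sketch();
    return 0;
}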
The common part and the architecture-specific part of the KVM code are compiled into separate kernel modules, so both need to be loaded together. On Intel platforms, this means both the kvm.ko and kvm-intel.ko kernel modules:
- The kvm.ko initialization code does nothing except load the common code into memory;
- kvm-intel.ko is responsible for enabling and disabling KVM.
After KVM initialization is complete, the KVM interfaces (the unified virtualization interface mentioned earlier) are presented to user space. These interfaces are exported by kvm.ko; when a user program calls them, the common code in kvm.ko calls into the architecture code in kvm-intel.ko. The calling relationship is shown below:

KVM Module Initialization
The initialization of the KVM module mainly includes initializing CPU- and architecture-independent data and setting up architecture-specific virtualization support.
On Intel platforms, the VMM can only enter VMX mode when the CPU is in protected mode and paging is enabled. The steps to enable VMX mode can be summarized as follows:
- Use CPUID to check whether the CPU supports VMX;
  - CPUID.1:ECX.VMX[bit 5]=1 indicates that the CPU supports VMX.
- Check the VMX capabilities supported by the CPU by reading the VMX-capability MSRs;
  - the IA32_VMX_BASIC register: basic VMX capability information;
  - the IA32_VMX_PINBASED_CTLS and IA32_VMX_PROCBASED_CTLS registers: indicate the values that can be set in the VM-execution control fields of the VMCS.
- Allocate a 4KB-aligned memory region as the VMXON region;
  - the IA32_VMX_BASIC MSR indicates the size of the VMXON region.
- Initialize the revision identifier of the VMXON region;
- Ensure that the CR0 register of the current CPU operating mode meets the conditions for entering VMX;
  - that is, CR0.PE=1 and CR0.PG=1;
  - other required settings are reported through the IA32_VMX_CR0_FIXED0 and IA32_VMX_CR0_FIXED1 registers.
- Ensure that CR4 meets the required settings, which are reported through IA32_VMX_CR4_FIXED0 and IA32_VMX_CR4_FIXED1 (this includes setting CR4.VMXE=1);
- Ensure that the IA32_FEATURE_CONTROL register is set correctly, with its lock bit (bit 0) set to 1; this is usually programmed by the BIOS.
- Execute the VMXON instruction with the physical address of the VMXON region as its operand. If RFLAGS.CF=0 afterwards, VMXON executed successfully.
After entering VMX mode, executing VMXOFF at CPL=0 in VMX root operation exits VMX mode; if RFLAGS.CF and RFLAGS.ZF are both 0 afterwards, the CPU has successfully left VMX mode.
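As a rough illustration of these steps, the following kernel-style sketch maps them onto MSR reads and the VMXON instruction. This is hypothetical code, not KVM's actual implementation: the SKETCH_ names are made up (the MSR addresses are the architectural ones from the Intel SDM), and the CR0/CR4 fixed-bit checks are omitted:

#include <linux/types.h>
#include <linux/bits.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <asm/msr.h>
#include <asm/page.h>
#include <asm/tlbflush.h>          /* cr4_set_bits() */
#include <asm/processor-flags.h>   /* X86_CR4_VMXE */

#define SKETCH_MSR_IA32_FEATURE_CONTROL 0x3a
#define SKETCH_MSR_IA32_VMX_BASIC       0x480

static int sketch_enter_vmx(void)
{
    u64 feat, basic, phys;
    u32 *vmxon_region;
    u8 fail;

    /* IA32_FEATURE_CONTROL must be locked (bit 0) with VMXON enabled
     * outside SMX (bit 2); normally programmed by the BIOS. */
    rdmsrl(SKETCH_MSR_IA32_FEATURE_CONTROL, feat);
    if (!(feat & BIT(0)) || !(feat & BIT(2)))
        return -EIO;

    /* Allocate a 4KB-aligned VMXON region and write the revision
     * identifier from IA32_VMX_BASIC[30:0] into its first word. */
    vmxon_region = (u32 *)get_zeroed_page(GFP_KERNEL);
    if (!vmxon_region)
        return -ENOMEM;
    rdmsrl(SKETCH_MSR_IA32_VMX_BASIC, basic);
    vmxon_region[0] = (u32)basic & 0x7fffffff;

    /* CR4.VMXE must be 1 before VMXON (the CR0/CR4 checks against
     * IA32_VMX_CR{0,4}_FIXED{0,1} are omitted here). */
    cr4_set_bits(X86_CR4_VMXE);

    /* VMXON takes the physical address of the region as a memory
     * operand; CF=1 (or ZF=1) afterwards signals failure. */
    phys = __pa(vmxon_region);
    asm volatile("vmxon %1\n\t"
                 "setbe %0"
                 : "=qm"(fail) : "m"(phys) : "cc", "memory");
    return fail ? -EIO : 0;
}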
KVM Initialization Process
The KVM initialization process is carried out in vmx_init, the module init function of kvm-intel.ko. Below we analyze the code, using Linux kernel v6.5 as an example.
static int __init vmx_init(void)
{
    int r, cpu;

    if (!kvm_is_vmx_supported())
        return -EOPNOTSUPP;
    ...
    r = kvm_x86_vendor_init(&vmx_init_ops);
    ...
    /*
     * Common KVM initialization _must_ come last, after this, /dev/kvm is
     * exposed to userspace!
     */
    r = kvm_init(sizeof(struct vcpu_vmx), __alignof__(struct vcpu_vmx),
                 THIS_MODULE);
    ...
}
The vmx_init function is roughly divided into three parts:
- kvm_is_vmx_supported: checks whether VMX mode is supported and enabled; otherwise the rest of the initialization is meaningless;
- kvm_x86_vendor_init: performs the architecture-specific initialization; its parameter &vmx_init_ops carries the initialization callbacks specific to Intel VT-x;
- kvm_init: performs the common KVM initialization.
The kvm_x86_vendor_init function, after acquiring the relevant locks, ultimately calls __kvm_x86_vendor_init to do the actual initialization. The main flow of that function, with some comments, unimportant steps, and error-handling paths removed, is as follows:
struct kvm_x86_init_ops {
    int (*hardware_setup)(void);
    unsigned int (*handle_intel_pt_intr)(void);

    struct kvm_x86_ops *runtime_ops;
    struct kvm_pmu_ops *pmu_ops;
};

void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops)
{
    memcpy(&kvm_pmu_ops, pmu_ops, sizeof(kvm_pmu_ops));
    ...
}

static inline void kvm_ops_update(struct kvm_x86_init_ops *ops)
{
    memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
    ...
    kvm_pmu_ops_update(ops->pmu_ops);
}

static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
{
    int r, cpu;
    ...
    r = kvm_mmu_vendor_module_init();
    ...
    kvm_init_pmu_capability(ops->pmu_ops);

    r = ops->hardware_setup();
    ...
    kvm_ops_update(ops);

    for_each_online_cpu(cpu) {
        smp_call_function_single(cpu, kvm_x86_check_cpu_compat, &r, 1);
        if (r < 0)
            goto out_unwind_ops;
    }

    /*
     * Point of no return!  DO NOT add error paths below this point unless
     * absolutely necessary, as most operations from this point forward
     * require unwinding.
     */
    kvm_timer_init();
    ...
    kvm_init_msr_lists();
    ...
}
The key processes include:
- kvm_mmu_vendor_module_init: performs the vendor-related part of MMU initialization for memory virtualization; most MMU initialization is deferred until the vendor module (kvm-intel.ko or kvm-amd.ko) is loaded, because many masks/values are modified by VMX or SVM;
- kvm_init_pmu_capability: initializes the PMU capabilities. If the PMU is enabled (module parameter enable_pmu), struct x86_pmu_capability kvm_pmu_cap is filled in at this step, recording mainly the version information, the number of counters (num_counters_gp and num_counters_fixed), the counter bit widths (bit_width_gp and bit_width_fixed), and the PMU event masks (events_mask and events_mask_len);
- ops->hardware_setup(): creates data structures closely tied to starting KVM and initializes some hardware features. It covers a lot of ground, including MMU and Extended Page Table (EPT) setup, nested-virtualization configuration, and building the list of supported CPU features; for details see hardware_setup() in arch/x86/kvm/vmx/vmx.c;
- kvm_ops_update: copies the runtime ops (runtime_ops) and the PMU ops (pmu_ops) from the initialization structure kvm_x86_init_ops into the unified interface structures kvm_x86_ops and kvm_pmu_ops. After KVM completes initialization, user interface calls are handled through kvm_x86_ops;
- kvm_x86_check_cpu_compat: called on every online CPU to check that all CPUs have consistent features;
- kvm_timer_init: timer/clock initialization;
- kvm_init_msr_lists: initializes the lists of MSRs supported by KVM.
kvm_init completes the common initialization of KVM. Once it finishes, the KVM module exposes /dev/kvm to user space as the interface between user-space programs (QEMU) and the KVM module.
int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
{
    int r;
    int cpu;
    ...
    kvm_vcpu_cache =
        kmem_cache_create_usercopy("kvm_vcpu", vcpu_size, vcpu_align,
                                   SLAB_ACCOUNT,
                                   offsetof(struct kvm_vcpu, arch),
                                   offsetofend(struct kvm_vcpu, stats_id)
                                       - offsetof(struct kvm_vcpu, arch),
                                   NULL);
    ...
    r = kvm_irqfd_init();
    if (r)
        goto err_irqfd;

    r = kvm_async_pf_init();
    if (r)
        goto err_async_pf;

    kvm_chardev_ops.owner = module;
    kvm_preempt_ops.sched_in = kvm_sched_in;
    kvm_preempt_ops.sched_out = kvm_sched_out;
    ...
    r = kvm_vfio_ops_init();
    ...
    /*
     * Registration _must_ be the very last thing done, as this exposes
     * /dev/kvm to userspace, i.e. all infrastructure must be setup!
     */
    r = misc_register(&kvm_dev);
    ...
}
The key processes include:
- kvm_vcpu_cache: creates a slab cache for the vCPU structure and assigns it to kvm_vcpu_cache, allowing faster allocation of vCPU objects;
- kvm_irqfd_init: initializes the irqfd-related data, mainly creating the kvm-irqfd-cleanup workqueue;
- kvm_async_pf_init: initializes the async_pf-related data, mainly creating the async_pf_cache cache;
- kvm_sched_in and kvm_sched_out: set the sched_in and sched_out callbacks of kvm_preempt_ops, which are invoked when the thread hosting a vCPU is preempted or scheduled back in (see the sketch after this list);
- kvm_vfio_ops_init: registers the kvm_vfio_ops interface;
- misc_register(&kvm_dev): calls misc_register to create the kvm_dev misc device, i.e. the /dev/kvm device file.
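The preempt-notifier mechanism behind kvm_preempt_ops deserves a closer look. The following hedged sketch uses the real struct preempt_ops API from linux/preempt.h, but the sketch_ names and callback bodies are hypothetical; KVM embeds a preempt_notifier in struct kvm_vcpu in much the same way:

#include <linux/kernel.h>
#include <linux/preempt.h>
#include <linux/sched.h>

/* Hypothetical per-vCPU object. */
struct sketch_vcpu {
    struct preempt_notifier pn;
};

/* Invoked when the vCPU thread is scheduled back onto a CPU; KVM uses
 * its sched_in hook to reload per-CPU virtualization state. */
static void sketch_sched_in(struct preempt_notifier *pn, int cpu)
{
    struct sketch_vcpu *vcpu = container_of(pn, struct sketch_vcpu, pn);

    pr_debug("vcpu %p scheduled in on cpu %d\n", vcpu, cpu);
}

/* Invoked when the vCPU thread is preempted or blocks. */
static void sketch_sched_out(struct preempt_notifier *pn,
                             struct task_struct *next)
{
    struct sketch_vcpu *vcpu = container_of(pn, struct sketch_vcpu, pn);

    pr_debug("vcpu %p scheduled out, next task %p\n", vcpu, next);
}

static struct preempt_ops sketch_preempt_ops = {
    .sched_in  = sketch_sched_in,
    .sched_out = sketch_sched_out,
};

/* Roughly what KVM does per vCPU thread when the vCPU is loaded. */
static void sketch_vcpu_load(struct sketch_vcpu *vcpu)
{
    preempt_notifier_init(&vcpu->pn, &sketch_preempt_ops);
    preempt_notifier_register(&vcpu->pn);
}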
Important Processes in KVM Initialization
1 hardware_setup
The first important call in the initialization flow is ops->hardware_setup(), invoked from __kvm_x86_vendor_init; it is the hardware_setup member of the implementation-specific vmx_init_ops. The code of this function is as follows; we will only look at the parts related to the VMCS:
static __init int hardware_setup(void)
{
    int r;
    ...
    if (setup_vmcs_config(&vmcs_config, &vmx_capability) < 0)
        return -EIO;
    ...
    r = alloc_kvm_area();
    ...
}

static __init int alloc_kvm_area(void)
{
    int cpu;

    for_each_possible_cpu(cpu) {
        struct vmcs *vmcs;

        vmcs = alloc_vmcs_cpu(false, cpu, GFP_KERNEL);
        if (!vmcs) {
            free_kvm_area();
            return -ENOMEM;
        }
        ...
        per_cpu(vmxarea, cpu) = vmcs;
    }
    return 0;
}
First, setup_vmcs_config is called to fill in the global variable vmcs_config according to the VMX features supported by the CPU (corresponding to step 2 of enabling VMX above). This configuration will later be used to initialize the VMCS when virtual CPUs are created.
Then alloc_kvm_area is called to allocate a vmcs structure for each physical CPU and store it in the vmxarea per-cpu variable (corresponding to steps 3 and 4 of enabling VMX).
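For readers unfamiliar with per-cpu variables, here is a minimal sketch of the pattern used for vmxarea; the sketch_ names are hypothetical:

#include <linux/percpu.h>

struct vmcs;    /* opaque here; its layout comes from the vendor code */

/* One slot per possible CPU, mirroring KVM's vmxarea. */
static DEFINE_PER_CPU(struct vmcs *, sketch_vmxarea);

static void sketch_stash_vmcs(int cpu, struct vmcs *vmcs)
{
    /* per_cpu() selects the named CPU's private copy of the variable. */
    per_cpu(sketch_vmxarea, cpu) = vmcs;
}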
2 kvm_x86_check_cpu_compat
The second important function is kvm_x86_check_cpu_compat, which __kvm_x86_vendor_init runs on every online CPU. It performs the check through kvm_x86_check_processor_compatibility, which ultimately calls vmx_check_processor_compat.
static struct kvm_x86_ops vmx_x86_ops __initdata = {
    ...
    .check_processor_compatibility = vmx_check_processor_compat,
    ...
};

static int kvm_x86_check_processor_compatibility(void)
{
    int cpu = smp_processor_id();
    struct cpuinfo_x86 *c = &cpu_data(cpu);

    /*
     * Compatibility checks are done when loading KVM and when enabling
     * hardware, e.g. during CPU hotplug, to ensure all online CPUs are
     * compatible, i.e. KVM should never perform a compatibility check on
     * an offline CPU.
     */
    WARN_ON(!cpu_online(cpu));

    if (__cr4_reserved_bits(cpu_has, c) !=
        __cr4_reserved_bits(cpu_has, &boot_cpu_data))
        return -EIO;

    return static_call(kvm_x86_check_processor_compatibility)();
}

static void kvm_x86_check_cpu_compat(void *ret)
{
    *(int *)ret = kvm_x86_check_processor_compatibility();
}
__kvm_x86_vendor_init calls kvm_x86_check_cpu_compat once per online CPU; the corresponding vmx_check_processor_compat function is as follows:
static int vmx_check_processor_compat(void)
{
    int cpu = raw_smp_processor_id();
    struct vmcs_config vmcs_conf;
    struct vmx_capability vmx_cap;
    ...
    if (setup_vmcs_config(&vmcs_conf, &vmx_cap) < 0) {
        pr_err("Failed to setup VMCS config on CPU %d\n", cpu);
        return -EIO;
    }
    ...
    if (memcmp(&vmcs_config, &vmcs_conf, sizeof(struct vmcs_config))) {
        pr_err("Inconsistent VMCS config on CPU %d\n", cpu);
        return -EIO;
    }
    return 0;
}
In the hardware_setup function, setup_vmcs_config built the global vmcs_config from the features of the CPU running the initialization code. Here, a vmcs_conf is built on every physical CPU and compared against the global vmcs_config to ensure that all physical CPUs produce identical VMCS configurations; this guarantees that scheduling vCPUs across physical CPUs will not cause errors.
3 misc_register(&kvm_dev)
The last important task of kvm_init is to create the misc device /dev/kvm. The definition of this device and its operations are as follows:
static struct file_operations kvm_chardev_ops = {
    .unlocked_ioctl = kvm_dev_ioctl,
    .llseek         = noop_llseek,
    KVM_COMPAT(kvm_dev_ioctl),
};

static struct miscdevice kvm_dev = {
    KVM_MINOR,
    "kvm",
    &kvm_chardev_ops,
};
As can be seen, this device only implements the ioctl system call; the open and close system calls are handled by the misc device framework.
The kvm_dev_ioctl code is as follows:
static long kvm_dev_ioctl(struct file *filp,
                          unsigned int ioctl, unsigned long arg)
{
    int r = -EINVAL;

    switch (ioctl) {
    case KVM_GET_API_VERSION:
        if (arg)
            goto out;
        r = KVM_API_VERSION;
        break;
    case KVM_CREATE_VM:
        r = kvm_dev_ioctl_create_vm(arg);
        break;
    case KVM_CHECK_EXTENSION:
        r = kvm_vm_ioctl_check_extension_generic(NULL, arg);
        break;
    case KVM_GET_VCPU_MMAP_SIZE:
        if (arg)
            goto out;
        r = PAGE_SIZE;     /* struct kvm_run */
#ifdef CONFIG_X86
        r += PAGE_SIZE;    /* pio data page */
#endif
#ifdef CONFIG_KVM_MMIO
        r += PAGE_SIZE;    /* coalesced mmio ring page */
#endif
        break;
    case KVM_TRACE_ENABLE:
    case KVM_TRACE_PAUSE:
    case KVM_TRACE_DISABLE:
        r = -EOPNOTSUPP;
        break;
    default:
        return kvm_arch_dev_ioctl(filp, ioctl, arg);
    }
out:
    return r;
}
From an architectural perspective, the ioctl interface of the /dev/kvm device is divided into two categories:
- general interfaces, such as KVM_GET_API_VERSION and KVM_CREATE_VM;
- architecture-specific interfaces, whose ioctls are handled by the kvm_arch_dev_ioctl function.
From a content perspective, KVM's ioctls handle requests at the KVM level, for example:
- KVM_GET_API_VERSION returns the KVM API version number;
- KVM_CREATE_VM creates a virtual machine;
- KVM_CHECK_EXTENSION checks whether KVM supports certain general extensions;
- KVM_GET_VCPU_MMAP_SIZE returns the size of the per-vCPU memory region shared between QEMU and KVM.
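As a quick user-space illustration (a minimal sketch, not QEMU code), these general ioctls can be exercised directly against /dev/kvm:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
    int kvm_fd, vm_fd, version, mmap_size;

    kvm_fd = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    if (kvm_fd < 0) {
        perror("open /dev/kvm");
        return EXIT_FAILURE;
    }

    /* General interfaces, handled by the common code in kvm.ko. */
    version = ioctl(kvm_fd, KVM_GET_API_VERSION, 0);
    printf("KVM API version: %d\n", version);

    mmap_size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
    printf("per-vCPU mmap size: %d bytes\n", mmap_size);

    /* Creating the first VM is the point at which KVM finally puts the
     * physical CPUs into VMX mode (see the note at the end). */
    vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
    if (vm_fd < 0) {
        perror("KVM_CREATE_VM");
        return EXIT_FAILURE;
    }
    printf("created VM, fd = %d\n", vm_fd);
    return EXIT_SUCCESS;
}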
This is the main work of kvm_init. As can be seen, the initialization of the KVM module mainly involves hardware checks, allocating caches for commonly used structures, creating the /dev/kvm device, building a vmcs configuration and setting some global variables based on CPU features, and allocating a vmcs structure for each physical CPU.
It is worth noting that at this point the CPU is still not in VMX mode, because during vmx_init neither was 1 written to CR4.VMXE nor was the VMXON instruction executed. This is a lazy strategy: if the KVM module is loaded but no virtual machine is ever created, there is no need for the CPU to enter VMX mode. The actual entry into VMX mode therefore happens when the first virtual machine is created.
- QEMU/KVM Source Code Analysis and Application, Li Qiang