Kernel initialization. Part 5.
Continuation of architecture-specific initialization
In the previous part, we stopped at the initialization of architecture-specific stuff in the setup_arch function, and now we will continue with it. After we have reserved memory for the initrd, the next step is olpc_ofw_detect, which detects One Laptop Per Child support. We will not consider platform-related stuff in this book and will skip functions related to it. So let's go ahead. The next step is the early_trap_init function. This function initializes the debug (#DB, raised when the TF flag of rflags is set) and int3 (#BP) interrupt gates. If you don't know anything about interrupts, you can read about them in Early interrupt and exception handling. In the x86 architecture, INT, INTO and INT3 are special instructions which allow a task to explicitly call an interrupt handler. The INT3 instruction calls the breakpoint (#BP) handler. You may remember that we already saw it in the part about interrupts and exceptions:
----------------------------------------------------------------------------------------------
|Vector|Mnemonic|Description |Type |Error Code|Source |
----------------------------------------------------------------------------------------------
|3 | #BP |Breakpoint |Trap |NO |INT 3 |
----------------------------------------------------------------------------------------------
The debug interrupt #DB is the primary method of invoking debuggers. early_trap_init is defined in arch/x86/kernel/traps.c. This function sets the #DB and #BP handlers and reloads the IDT:
void __init early_trap_init(void)
{
    set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
    set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
    load_idt(&idt_descr);
}
We already saw the implementation of set_intr_gate in the previous part about interrupts. Here are two similar functions, set_intr_gate_ist and set_system_intr_gate_ist. Both of these functions take three parameters:
- number of the interrupt;
- base address of the interrupt/exception handler;
- index of the Interrupt Stack Table entry to use.
The Interrupt Stack Table (IST) is a new mechanism in x86_64 and is part of the TSS. Every active thread in kernel mode has its own kernel stack, which is 16 kilobytes. While a thread is in user space, this kernel stack is empty.
In addition to per-thread stacks, there are a couple of specialized stacks associated with each CPU. You can read all about these stacks in the Linux kernel documentation - Kernel stacks. x86_64 provides a feature which allows switching to a new special stack during events such as a non-maskable interrupt. The name of this feature is Interrupt Stack Table. There can be up to 7 IST entries per CPU, and every entry points to a dedicated stack. In our case this is DEBUG_STACK.
set_intr_gate_ist and set_system_intr_gate_ist work by the same principle as set_intr_gate, with only one difference. Both of these functions check the interrupt number and call _set_gate inside:
BUG_ON((unsigned)n > 0xFF);
_set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS);
just as set_intr_gate does. But set_intr_gate calls _set_gate with dpl 0 and ist 0, whereas set_intr_gate_ist and set_system_intr_gate_ist set ist to DEBUG_STACK, and set_system_intr_gate_ist additionally sets dpl to 0x3, which is the lowest privilege. When an interrupt occurs and the hardware loads such a descriptor, the hardware automatically sets the new stack pointer based on the IST value and then invokes the interrupt handler. All of the special kernel stacks will be set up in the cpu_init function (we will see it later).
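To make the role of the ist and dpl arguments more concrete, here is a minimal user-space sketch of how a 64-bit IDT gate could be packed. The structure and function names (gate64, pack_gate) are hypothetical and only follow the hardware layout of a 16-byte gate descriptor; this is not the kernel's _set_gate implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative layout of a 16-byte x86_64 IDT gate (hypothetical names). */
struct gate64 {
    uint16_t offset_low;    /* handler address bits 0..15 */
    uint16_t segment;       /* code segment selector */
    uint8_t  ist;           /* bits 0..2: IST index, 0 means no stack switch */
    uint8_t  type_attr;     /* P (bit 7), DPL (bits 5..6), type (bits 0..3) */
    uint16_t offset_middle; /* handler address bits 16..31 */
    uint32_t offset_high;   /* handler address bits 32..63 */
    uint32_t reserved;
};

#define GATE_TYPE_INTERRUPT 0xE /* 64-bit interrupt gate type */

/* Pack a present interrupt gate for the given handler, DPL and IST index. */
static struct gate64 pack_gate(uint64_t handler, uint16_t selector,
                               unsigned dpl, unsigned ist)
{
    struct gate64 g;

    g.offset_low    = (uint16_t)(handler & 0xffff);
    g.segment       = selector;
    g.ist           = (uint8_t)(ist & 0x7);
    g.type_attr     = (uint8_t)(0x80 | ((dpl & 0x3) << 5) | GATE_TYPE_INTERRUPT);
    g.offset_middle = (uint16_t)((handler >> 16) & 0xffff);
    g.offset_high   = (uint32_t)(handler >> 32);
    g.reserved      = 0;
    return g;
}
```

With dpl equal to 3 the attribute byte becomes 0xEE, so the gate can be reached with int3 from user space, while dpl equal to 0 gives 0x8E. A non-zero ist value asks the CPU to load the stack pointer from the corresponding IST slot of the TSS.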
As the #DB and #BP gates are now written to the table described by idt_descr, we reload the IDT with load_idt, which just executes the lidt instruction. Now let's look at the interrupt handlers and try to understand how they work. Of course, I can't cover all interrupt handlers in this book, and I do not see the point in doing so. It is very interesting to delve into the Linux kernel source code, so we will see how the debug handler is implemented in this part, and understanding how the other interrupt handlers are implemented will be your task.
DB handler
As you can read above, we passed the address of the #DB handler as &debug to set_intr_gate_ist. lxr.free-electrons.com is a great resource for searching identifiers in the Linux kernel source code, but unfortunately you will not find the debug handler with it. All you can find is the debug declaration in arch/x86/include/asm/traps.h:
asmlinkage void debug(void);
We can see the asmlinkage attribute, which tells us that debug is a function written in assembly. Yeah, again and again assembly :). The implementation of the #DB handler, like that of other handlers, is in arch/x86/entry/entry_64.S and is defined with the idtentry assembly macro:
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
idtentry is a macro which defines an interrupt/exception entry point. As you can see, it takes five arguments:
- name of the interrupt entry point;
- name of the interrupt handler;
- whether the interrupt has an error code or not;
- paranoid - if this parameter is 1, switch to the special stack (read above);
- shift_ist - the stack to switch to during the interrupt.
Now let's look at the idtentry macro implementation. This macro is defined in the same assembly file and defines the debug function with the ENTRY macro. First, the idtentry macro checks that the given parameters are correct in case it needs to switch to the special stack. In the next step it checks whether the given interrupt returns an error code. If the interrupt does not return an error code (in our case #DB does not), it uses INTR_FRAME, and it uses XCPT_FRAME if the interrupt has an error code. Both of these macros, XCPT_FRAME and INTR_FRAME, generate no code and are needed only for building the initial frame state for interrupts. They use CFI directives and are used for debugging. You can find more info in CFI directives. As the comment from arch/x86/kernel/entry_64.S says: "CFI macros are used to generate dwarf2 unwind information for better backtraces. They don't change any code.", so we will ignore them.
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
    /* Sanity check */
    .if \shift_ist != -1 && \paranoid == 0
    .error "using shift_ist requires paranoid=1"
    .endif

    .if \has_error_code
    XCPT_FRAME
    .else
    INTR_FRAME
    .endif
...
...
...
You may remember from the previous part about early interrupt/exception handling that after an interrupt occurs, the current stack will have the following format:
+-----------------------+
| |
+40 | SS |
+32 | RSP |
+24 | RFLAGS |
+16 | CS |
+8 | RIP |
0 | Error Code | <---- rsp
| |
+-----------------------+
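The same layout can be written down as a C structure, with the lowest address first. This is only an illustrative sketch: the name exception_frame is hypothetical, and the first slot holds either the error code pushed by the CPU or the dummy value that the entry code pushes for exceptions without one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the stack frame shown above, lowest address first. */
struct exception_frame {
    uint64_t error_code; /* rsp + 0:  real or dummy error code */
    uint64_t rip;        /* rsp + 8:  interrupted instruction pointer */
    uint64_t cs;         /* rsp + 16: code segment of the interrupted code */
    uint64_t rflags;     /* rsp + 24 */
    uint64_t rsp;        /* rsp + 32: stack pointer before the interrupt */
    uint64_t ss;         /* rsp + 40 */
};

/* Compile-time checks that the offsets match the diagram. */
_Static_assert(offsetof(struct exception_frame, rip) == 8, "RIP at +8");
_Static_assert(offsetof(struct exception_frame, cs) == 16, "CS at +16");
_Static_assert(offsetof(struct exception_frame, ss) == 40, "SS at +40");
```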
The next two macros from the idtentry implementation are:
ASM_CLAC
PARAVIRT_ADJUST_EXCEPTION_FRAME
The first macro, ASM_CLAC, depends on the CONFIG_X86_SMAP configuration option and is needed for security reasons; you can read more about it here. The second macro, PARAVIRT_ADJUST_EXCEPTION_FRAME, is for handling Xen-type exceptions (this chapter is about kernel initialization and we will not consider virtualization stuff here).
The next piece of code checks whether the interrupt has an error code, and if not, pushes $-1 (which is 0xffffffffffffffff on x86_64) onto the stack:
.ifeq \has_error_code
pushq_cfi $-1
.endif
We need to do this to provide a dummy error code, so that the stack layout is consistent for all interrupts. In the next step we subtract $ORIG_RAX-R15 from the stack pointer:
subq $ORIG_RAX-R15, %rsp
where ORIG_RAX, R15 and other macros are defined in arch/x86/include/asm/calling.h, and ORIG_RAX-R15 is 120 bytes. The general purpose registers will occupy these 120 bytes (15 registers of 8 bytes each), because we need to store all registers on the stack during interrupt handling. After we have set up the stack for the general purpose registers, the next step is checking whether the interrupt came from userspace with:
testl $3, CS(%rsp)
jnz 1f
Here we check the first and second bits of CS. You may remember that the CS register contains a segment selector whose first two bits are the RPL. All privilege levels are integers in the range 0–3, where the lowest number corresponds to the highest privilege. So if the interrupt came from kernel mode we call save_paranoid, otherwise we jump to label 1. In save_paranoid we store all general purpose registers on the stack and switch the user gs to the kernel gs if needed:
    movl $1,%ebx
    movl $MSR_GS_BASE,%ecx
    rdmsr
    testl %edx,%edx
    js 1f
    SWAPGS
    xorl %ebx,%ebx
1:  ret
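The two checks described above can be sketched in C. The helper names are hypothetical, and the selector values used in the usage note (0x10 for kernel code, 0x33 for user code) are given only as typical x86_64 examples. The second helper models the testl/js pair: rdmsr leaves the high half of MSR_GS_BASE in edx, and a set sign bit means that the base is already a kernel address, so no SWAPGS is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * The low two bits of a CS selector hold the privilege level:
 * 0 is kernel mode, 3 is user mode. Mirrors "testl $3, CS(%rsp)".
 */
static bool came_from_user(uint16_t cs)
{
    return (cs & 3) != 0;
}

/*
 * Mirrors "testl %edx,%edx; js 1f" after rdmsr of MSR_GS_BASE:
 * bit 63 of the base is set for kernel addresses.
 */
static bool gs_base_is_kernel(uint64_t gs_base)
{
    return ((gs_base >> 63) & 1) != 0;
}
```

For example, came_from_user(0x33) is true and came_from_user(0x10) is false, while gs_base_is_kernel() returns true for an address in the upper half of the address space.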
In the next steps we put the pt_regs pointer into rdi, save the error code in rsi if there is one, and call the interrupt handler, which is do_debug in our case, from arch/x86/kernel/traps.c. do_debug, like other handlers, takes two parameters:
- pt_regs - a structure which represents the set of CPU registers saved in the process' memory region;
- error code - the error code of the interrupt.
After the interrupt handler finishes its work, the macro calls paranoid_exit, which restores the stack, switches back to userspace if the interrupt came from there, and calls iret. That's all. Of course it is not all :), but we will look at it in more depth in the separate chapter about interrupts.
This is a general view of the idtentry macro for the #DB interrupt. All interrupts are similar to this implementation and are defined with idtentry too. After early_trap_init finishes its work, the next function is early_cpu_init. This function is defined in arch/x86/kernel/cpu/common.c and collects information about the CPU and its vendor.
Early ioremap initialization
The next step is the initialization of early ioremap. In general there are two ways to communicate with devices:
- I/O Ports;
- Device memory.
We already saw the first method (outb/inb instructions) in the part about the Linux kernel booting process. The second method is to map I/O physical addresses to virtual addresses. When a physical address is accessed by the CPU, it may refer to a portion of physical RAM, or it may be mapped onto the memory of an I/O device. So ioremap is used to map device memory into the kernel address space.
As I wrote above, the next function is early_ioremap_init, which re-maps I/O memory to the kernel address space so it can access it. We need to initialize early ioremap for early initialization code which needs to temporarily map I/O or memory regions before the normal mapping functions like ioremap are available. The implementation of this function is in arch/x86/mm/ioremap.c. At the start of early_ioremap_init we can see the definition of the pmd pointer with the pmd_t type (which represents a page middle directory entry: typedef struct { pmdval_t pmd; } pmd_t; where pmdval_t is unsigned long) and a check that the fixmap is aligned correctly:
pmd_t *pmd;
BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));
fixmap is a set of fixed virtual address mappings which extends from FIXADDR_START to FIXADDR_TOP. Fixed virtual addresses are needed for subsystems that need to know the virtual address at compile time. After the check, early_ioremap_init calls the early_ioremap_setup function from mm/early_ioremap.c. early_ioremap_setup fills the slot_virt array of unsigned long with the virtual addresses of the slots which together cover the 512 temporary boot-time fix-mappings:
for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
    slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i);
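To see what this loop computes, here is a small user-space sketch. It assumes that a fixmap index is converted to a virtual address by counting pages down from FIXADDR_TOP, and it uses 8 slots of 64 pages each (8 * 64 = 512). The values of FIXADDR_TOP and FIX_BTMAP_BEGIN below are made up for illustration, because the real ones depend on the kernel configuration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT       12
#define NR_FIX_BTMAPS    64  /* pages per slot (assumed) */
#define FIX_BTMAPS_SLOTS 8   /* number of slots (assumed) */

/* Illustrative values only; the real ones depend on the configuration. */
#define FIXADDR_TOP      0xffffffffff5ff000ULL
#define FIX_BTMAP_BEGIN  1024ULL

/* A fixmap index counts pages downwards from FIXADDR_TOP. */
static uint64_t fix_to_virt_sketch(uint64_t idx)
{
    return FIXADDR_TOP - (idx << PAGE_SHIFT);
}

static uint64_t slot_virt[FIX_BTMAPS_SLOTS];

/* Same loop as in early_ioremap_setup, on top of the sketch above. */
static void fill_slots(void)
{
    for (int i = 0; i < FIX_BTMAPS_SLOTS; i++)
        slot_virt[i] = fix_to_virt_sketch(FIX_BTMAP_BEGIN -
                                          (uint64_t)NR_FIX_BTMAPS * i);
}
```

Because the index gets smaller as i grows, every following slot starts NR_FIX_BTMAPS pages (256 kilobytes here) above the previous one.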
After this we get the page middle directory entry for FIX_BTMAP_BEGIN and put it into the pmd variable, fill bm_pte (the boot-time page table) with zeros, and call the pmd_populate_kernel function to set the given page table in the given page middle directory entry:
pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
memset(bm_pte, 0, sizeof(bm_pte));
pmd_populate_kernel(&init_mm, pmd, bm_pte);
That's all for this. If you are feeling puzzled, don't worry. There is a special part about ioremap and fixmaps in the Linux Kernel Memory Management. Part 2 chapter.
Obtaining major and minor numbers for the root device
After early ioremap has been initialized, you can see the following code:
ROOT_DEV = old_decode_dev(boot_params.hdr.root_dev);
This code obtains the major and minor numbers of the root device where initrd will be mounted later in the do_mount_root function. The major number of a device identifies the driver associated with the device. The minor number refers to the specific device controlled by that driver. Note that old_decode_dev takes one parameter from the boot_params structure. As we can read in the x86 Linux kernel boot protocol:
Field name: root_dev
Type: modify (optional)
Offset/size: 0x1fc/2
Protocol: ALL
The default root device device number. The use of this field is
deprecated, use the "root=" option on the command line instead.
Now let's try to understand what old_decode_dev does. Actually it just calls MKDEV, which generates a dev_t from the given major and minor numbers. Its implementation is pretty simple:
static inline dev_t old_decode_dev(u16 val)
{
    return MKDEV((val >> 8) & 255, val & 255);
}
where dev_t is a kernel data type which represents a major/minor number pair. But what's the strange old_ prefix? For historical reasons, there are two ways of managing the major and minor numbers of a device. In the first way, major and minor numbers occupied 2 bytes. You can see it in the previous code: 8 bits for the major number and 8 bits for the minor number. But there is a problem: only 256 major numbers and 256 minor numbers are possible. So the 16-bit integer was replaced by a 32-bit integer, where 12 bits are reserved for the major number and 20 bits for the minor. You can see this in the new_decode_dev implementation:
static inline dev_t new_decode_dev(u32 dev)
{
    unsigned major = (dev & 0xfff00) >> 8;
    unsigned minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);
    return MKDEV(major, minor);
}
If dev is 0xffffffff, the calculation gives 0xfff (12 bits) for major and 0xfffff (20 bits) for minor. So at the end of the execution of old_decode_dev we will get the major and minor numbers of the root device in ROOT_DEV.
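Both decoders can be tried in user space. The sketch below re-creates MKDEV, MAJOR and MINOR with a 20-bit minor field, as described above; the type name dev_sketch_t is hypothetical and stands in for the kernel's dev_t.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t dev_sketch_t; /* stand-in for the kernel's dev_t */

#define MINORBITS 20
#define MINORMASK ((1U << MINORBITS) - 1)

#define MKDEV(ma, mi) ((dev_sketch_t)(((ma) << MINORBITS) | (mi)))
#define MAJOR(dev)    ((unsigned)((dev) >> MINORBITS))
#define MINOR(dev)    ((unsigned)((dev) & MINORMASK))

/* Old 16-bit encoding: high byte is the major, low byte is the minor. */
static dev_sketch_t old_decode_dev(uint16_t val)
{
    unsigned major = (val >> 8) & 255u;
    unsigned minor = val & 255u;

    return MKDEV(major, minor);
}

/* New 32-bit encoding: 12 bits of major, 20 bits of minor (split in two). */
static dev_sketch_t new_decode_dev(uint32_t dev)
{
    unsigned major = (dev & 0xfff00) >> 8;
    unsigned minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);

    return MKDEV(major, minor);
}
```

For example, the old-style value 0x0801 decodes to major 8 and minor 1, and new_decode_dev(0xffffffff) yields major 0xfff and minor 0xfffff.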
Memory map setup
The next point is the setup of the memory map with a call to the setup_memory_map function. But before this we set up different parameters, such as information about the screen (current row and column, video page, etc.; you can read about it in Video mode initialization and transition to protected mode), Extended Display Identification Data, video mode, bootloader_type, etc.:
screen_info = boot_params.screen_info;
edid_info = boot_params.edid_info;
saved_video_mode = boot_params.hdr.vid_mode;
bootloader_type = boot_params.hdr.type_of_loader;
if ((bootloader_type >> 4) == 0xe) {
    bootloader_type &= 0xf;
    bootloader_type |= (boot_params.hdr.ext_loader_type+0x10) << 4;
}
bootloader_version = bootloader_type & 0xf;
bootloader_version |= boot_params.hdr.ext_loader_ver << 4;
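The bootloader type and version arithmetic above is easier to follow with concrete numbers. The following sketch repeats the same steps in a stand-alone function; the structure loader_info and the function decode_loader are hypothetical names, and the three arguments correspond to the type_of_loader, ext_loader_type and ext_loader_ver header fields.

```c
#include <assert.h>
#include <stdint.h>

struct loader_info {
    int type;    /* loader ID in the upper bits, version nibble below */
    int version; /* low nibble of type plus the extended version */
};

/* Repeat the decoding steps from setup_arch on plain integers. */
static struct loader_info decode_loader(uint8_t type_of_loader,
                                        uint8_t ext_loader_type,
                                        uint8_t ext_loader_ver)
{
    struct loader_info li;

    li.type = type_of_loader;
    if ((li.type >> 4) == 0xe) {
        /* ID 0xE means: the real ID is ext_loader_type + 0x10. */
        li.type &= 0xf;
        li.type |= (ext_loader_type + 0x10) << 4;
    }
    li.version  = li.type & 0xf;
    li.version |= ext_loader_ver << 4;
    return li;
}
```

For example, type_of_loader 0x72 stays 0x72 with version 2, while 0xE1 with ext_loader_type 0x01 and ext_loader_ver 0x02 becomes type 0x111 and version 0x21.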
We got all of these parameters during boot time and stored them in the boot_params structure. After this we need to set up the end of the I/O memory. As you know, one of the main purposes of the kernel is resource management, and one of the resources is memory. As we already know, there are two ways to communicate with devices: I/O ports and device memory. All information about registered resources is available through:
- /proc/ioports - provides a list of currently registered port regions used for input or output communication with a device;
- /proc/iomem - provides the current map of the system's memory for each physical device.
At the moment we are interested in /proc/iomem
:
cat /proc/iomem
00000000-00000fff : reserved
00001000-0009d7ff : System RAM
0009d800-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000cffff : Video ROM
000d0000-000d3fff : PCI Bus 0000:00
000d4000-000d7fff : PCI Bus 0000:00
000d8000-000dbfff : PCI Bus 0000:00
000dc000-000dffff : PCI Bus 0000:00
000e0000-000fffff : reserved
  000e0000-000e3fff : PCI Bus 0000:00
  000e4000-000e7fff : PCI Bus 0000:00
  000f0000-000fffff : System ROM
As you can see, each range of addresses is shown in hexadecimal notation together with its owner. The Linux kernel provides an API for managing any resources in a general way. Global resources (for example PICs or I/O ports) can be divided into subsets relating to any hardware bus slot. The main structure, resource:
struct resource {
    resource_size_t start;
    resource_size_t end;
    const char *name;
    unsigned long flags;
    struct resource *parent, *sibling, *child;
};
represents an abstraction for a tree-like subset of system resources. This structure provides the range of addresses from start to end which a resource covers (resource_size_t is phys_addr_t, or u64 on x86_64), the name of a resource (you can see these names in the /proc/iomem output) and the flags of a resource (all resource flags are defined in include/linux/ioport.h). The last three fields are pointers to resource structures. These pointers enable a tree-like structure:
+-------------+ +-------------+
| | | |
| parent |------| sibling |
| | | |
+-------------+ +-------------+
|
|
+-------------+
| |
| child |
| |
+-------------+
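A tiny user-space model shows how the three pointers form such a tree. Everything here is a simplified sketch with hypothetical names (res, res_add_child, res_print, demo_tree); unlike the kernel's resource API, it does no overlap or conflict checking. The example ranges are taken from the /proc/iomem output above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for struct resource. */
struct res {
    uint64_t start, end;
    const char *name;
    struct res *parent, *sibling, *child;
};

/* Append child to the end of parent's list of children. */
static void res_add_child(struct res *parent, struct res *child)
{
    struct res **p = &parent->child;

    while (*p)
        p = &(*p)->sibling;
    *p = child;
    child->parent = parent;
    child->sibling = NULL;
}

/* Print r, its siblings and all descendants; return the number of nodes. */
static int res_print(const struct res *r, int depth)
{
    int n = 0;

    for (; r; r = r->sibling) {
        printf("%*s%08llx-%08llx : %s\n", depth * 2, "",
               (unsigned long long)r->start,
               (unsigned long long)r->end, r->name);
        n += 1 + res_print(r->child, depth + 1);
    }
    return n;
}

/* Build a small tree in the style of /proc/iomem and print it. */
static int demo_tree(void)
{
    struct res root = { .start = 0xe0000, .end = 0xfffff, .name = "reserved" };
    struct res pci0 = { .start = 0xe0000, .end = 0xe3fff, .name = "PCI Bus 0000:00" };
    struct res pci1 = { .start = 0xe4000, .end = 0xe7fff, .name = "PCI Bus 0000:00" };
    struct res rom  = { .start = 0xf0000, .end = 0xfffff, .name = "System ROM" };

    res_add_child(&root, &pci0);
    res_add_child(&root, &pci1);
    res_add_child(&root, &rom);
    return res_print(&root, 0);
}
```

Calling demo_tree() prints the parent range followed by its three children, each child indented by two spaces.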
Every subset of resources has a root resource range. For iomem it is iomem_resource, which is defined as:
struct resource iomem_resource = {
    .name   = "PCI mem",
    .start  = 0,
    .end    = -1,
    .flags  = IORESOURCE_MEM,
};
EXPORT_SYMBOL(iomem_resource);
iomem_resource defines the root address range for I/O memory with the name PCI mem and IORESOURCE_MEM (0x00000200) as flags. As I wrote above, our current goal is to set up the end address of iomem. We will do it with:
iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;
Here we shift 1 left by boot_cpu_data.x86_phys_bits. boot_cpu_data is the cpuinfo_x86 structure which we filled during the execution of early_cpu_init. As you can understand from the name of the x86_phys_bits field, it represents the number of bits in the maximum physical address of the system. Note also that iomem_resource is passed to the EXPORT_SYMBOL macro. This macro exports the given symbol (iomem_resource in our case) for dynamic linking, or in other words it makes the symbol accessible to dynamically loaded modules.
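The end address calculation is plain arithmetic, so we can check it with a couple of example values. The function name iomem_end is hypothetical, and the bit widths in the usage note are only examples; the real value comes from the CPU.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Highest physical address for a CPU with phys_bits physical address
 * bits. phys_bits must be less than 64 for the shift to be defined.
 */
static uint64_t iomem_end(unsigned phys_bits)
{
    return (1ULL << phys_bits) - 1;
}
```

With 36 physical address bits the result is 0xfffffffff (64 gigabytes minus one), and with 46 bits it is 0x3fffffffffff.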
After we have set the end address of the root iomem resource address range, the next step, as I wrote above, is the setup of the memory map. It is produced by a call to the setup_memory_map function:
void __init setup_memory_map(void)
{
    char *who;

    who = x86_init.resources.memory_setup();
    memcpy(&e820_saved, &e820, sizeof(struct e820map));
    printk(KERN_INFO "e820: BIOS-provided physical RAM map:\n");
    e820_print_map(who);
}
First of all, look at the call to x86_init.resources.memory_setup. x86_init is an x86_init_ops structure which represents platform-specific setup functions such as resource initialization, PCI initialization, etc. The initialization of x86_init is in arch/x86/kernel/x86_init.c. I will not give the full description here because it is very long, but only the one part which interests us for now:
struct x86_init_ops x86_init __initdata = {
    .resources = {
        .probe_roms             = probe_roms,
        .reserve_resources      = reserve_standard_io_resources,
        .memory_setup           = default_machine_specific_memory_setup,
    },
    ...
    ...
    ...
}
As we can see here, the memory_setup field is default_machine_specific_memory_setup, where we get the number of e820 entries which we collected at boot time, sanitize the BIOS e820 map and fill the e820map structure with the memory regions. Once all regions are collected, they are printed with printk. You can find this output if you execute the dmesg command; you will see something like this:
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009d7ff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009d800-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000be825fff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000be826000-0x00000000be82cfff] ACPI NVS
[ 0.000000] BIOS-e820: [mem 0x00000000be82d000-0x00000000bf744fff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000bf745000-0x00000000bfff4fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000bfff5000-0x00000000dc041fff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000dc042000-0x00000000dc0d2fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000dc0d3000-0x00000000dc138fff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000dc139000-0x00000000dc27dfff] ACPI NVS
[ 0.000000] BIOS-e820: [mem 0x00000000dc27e000-0x00000000deffefff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000defff000-0x00000000deffffff] usable
...
...
...
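Every line of this output describes one e820 entry: a start address, a size and a type. Here is a hedged user-space sketch that formats an entry in the same style. The names e820entry_sketch, e820_type_name and e820_format are hypothetical, the type numbers follow the usual e820 convention (1 usable, 2 reserved, 3 ACPI data, 4 ACPI NVS), and the format string is a user-space approximation of the kernel output, not the kernel's own code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum { E820_RAM = 1, E820_RESERVED = 2, E820_ACPI = 3, E820_NVS = 4 };

/* Simplified e820 entry: start address, size in bytes and region type. */
struct e820entry_sketch {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

static const char *e820_type_name(uint32_t type)
{
    switch (type) {
    case E820_RAM:      return "usable";
    case E820_RESERVED: return "reserved";
    case E820_ACPI:     return "ACPI data";
    case E820_NVS:      return "ACPI NVS";
    default:            return "unknown";
    }
}

/* Format one entry as "[mem start-end] type"; end is inclusive. */
static int e820_format(const struct e820entry_sketch *e, char *buf, size_t len)
{
    return snprintf(buf, len, "[mem 0x%016llx-0x%016llx] %s",
                    (unsigned long long)e->addr,
                    (unsigned long long)(e->addr + e->size - 1),
                    e820_type_name(e->type));
}
```

Note that the printed end address is inclusive, which is why it is computed as addr + size - 1.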
Copying of the BIOS Enhanced Disk Device information
The next two steps are the parsing of setup_data with the parse_setup_data function and the copying of the BIOS EDD to a safe place. setup_data is a field from the kernel boot header, and as we can read in the x86 boot protocol:
Field name: setup_data
Type: write (special)
Offset/size: 0x250/8
Protocol: 2.09+
The 64-bit physical pointer to NULL terminated single linked list of
struct setup_data. This is used to define a more extensible boot
parameters passing mechanism.
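The singly linked list mentioned in this description can be walked with a simple loop. The sketch below is a user-space model: the node layout (next, type, len, followed by the payload) follows the boot protocol, but the names setup_data_sketch and count_setup_data are hypothetical, and ordinary pointers stand in for physical addresses. In the kernel every node has to be mapped before it can be read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Header of one setup_data node; the payload of len bytes follows it. */
struct setup_data_sketch {
    uint64_t next; /* address of the next node, 0 terminates the list */
    uint32_t type; /* kind of payload, e.g. device tree or EFI data */
    uint32_t len;  /* payload length in bytes */
};

/*
 * Count the nodes of the list starting at head. In this model the
 * "physical" addresses are just user-space pointers cast to integers.
 */
static size_t count_setup_data(uint64_t head)
{
    size_t n = 0;

    while (head) {
        const struct setup_data_sketch *sd =
            (const struct setup_data_sketch *)(uintptr_t)head;

        n++;
        head = sd->next;
    }
    return n;
}
```

A real parser would additionally switch on the type field of every node to decide how to interpret its payload.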
setup_data is used for storing setup information of different types, such as a device tree blob, EFI setup data, etc. In the second step we copy the BIOS EDD information that we collected in arch/x86/boot/edd.c from the boot_params structure to the edd structure:
static inline void __init copy_edd(void)
{
    memcpy(edd.mbr_signature, boot_params.edd_mbr_sig_buffer,
           sizeof(edd.mbr_signature));
    memcpy(edd.edd_info, boot_params.eddbuf, sizeof(edd.edd_info));
    edd.mbr_signature_nr = boot_params.edd_mbr_sig_buf_entries;
    edd.edd_info_nr = boot_params.eddbuf_entries;
}
Memory descriptor initialization
The next step is the initialization of the memory descriptor of the init process. As you may already know, every process has its own address space. This address space is represented by a special data structure called the memory descriptor. In the Linux kernel source code the memory descriptor is represented by the mm_struct structure. mm_struct contains many different fields related to the process address space, such as the start/end addresses of the kernel code/data, the start/end of the brk, the number of memory areas, the list of memory areas, etc. This structure is defined in include/linux/mm_types.h. As every process has its own memory descriptor, the task_struct structure contains it in the mm and active_mm fields. And our first init process has it too. You may remember that we saw part of the initialization of the init task_struct with the INIT_TASK macro in the previous part:
#define INIT_TASK(tsk)  \
{                       \
    ...
    ...
    ...
    .mm = NULL,         \
    .active_mm = &init_mm, \
    ...
}
mm points to the process address space, and active_mm points to the active address space when the process has no address space of its own, as is the case for kernel threads (you can read more about it in the documentation). Now we fill the memory descriptor of the initial process:
init_mm.start_code = (unsigned long) _text;
init_mm.end_code = (unsigned long) _etext;
init_mm.end_data = (unsigned long) _edata;
init_mm.brk = _brk_end;
with the kernel's text, data and brk. init_mm is the memory descriptor of the initial process and is defined as:
struct mm_struct init_mm = {
    .mm_rb          = RB_ROOT,
    .pgd            = swapper_pg_dir,
    .mm_users       = ATOMIC_INIT(2),
    .mm_count       = ATOMIC_INIT(1),
    .mmap_sem       = __RWSEM_INITIALIZER(init_mm.mmap_sem),
    .page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
    .mmlist         = LIST_HEAD_INIT(init_mm.mmlist),
    INIT_MM_CONTEXT(init_mm)
};
where mm_rb is a red-black tree of the virtual memory areas, pgd is a pointer to the page global directory, mm_users is the number of address space users, mm_count is the primary usage counter and mmap_sem is the memory area semaphore. After we have set up the memory descriptor of the initial process, the next step is the initialization of the Intel Memory Protection Extensions with mpx_mm_init. The step after that is the initialization of the code/data/bss resources with:
code_resource.start = __pa_symbol(_text);
code_resource.end = __pa_symbol(_etext)-1;
data_resource.start = __pa_symbol(_etext);
data_resource.end = __pa_symbol(_edata)-1;
bss_resource.start = __pa_symbol(__bss_start);
bss_resource.end = __pa_symbol(__bss_stop)-1;
We already know a little about the resource structure (read above). Here we fill the code/data/bss resources with their physical addresses. You can see them in /proc/iomem:
00100000-be825fff : System RAM
  01000000-015bb392 : Kernel code
  015bb393-01930c3f : Kernel data
  01a11000-01ac3fff : Kernel bss
All of these structures are defined in arch/x86/kernel/setup.c and look like a typical resource initialization:
static struct resource code_resource = {
    .name   = "Kernel code",
    .start  = 0,
    .end    = 0,
    .flags  = IORESOURCE_BUSY | IORESOURCE_MEM
};
The last step which we will cover in this part is the NX configuration. The NX bit, or no-execute bit, is bit 63 of a paging-structure entry, and it controls the ability to execute code from all physical pages mapped by that table entry. This bit can only be used/set when the no-execute page-protection mechanism is enabled by setting EFER.NXE to 1. In the x86_configure_nx function we check that the CPU supports the NX bit and that it is not disabled. After the check we fill __supported_pte_mask depending on the result:
void x86_configure_nx(void)
{
    if (cpu_has_nx && !disable_nx)
        __supported_pte_mask |= _PAGE_NX;
    else
        __supported_pte_mask &= ~_PAGE_NX;
}
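The effect of this mask can be illustrated with a few lines of user-space C. The sketch assumes that the NX flag is bit 63 of an entry, as described above; the names PAGE_NX_SKETCH, configure_nx and filter_pte are hypothetical, and the two boolean arguments stand in for cpu_has_nx and disable_nx.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_NX_SKETCH (1ULL << 63) /* NX is bit 63 of an entry */

/* Return the supported-bits mask after the NX decision. */
static uint64_t configure_nx(uint64_t mask, bool cpu_has_nx, bool disable_nx)
{
    if (cpu_has_nx && !disable_nx)
        mask |= PAGE_NX_SKETCH;
    else
        mask &= ~PAGE_NX_SKETCH;
    return mask;
}

/* Drop every bit of a page table entry that the mask does not allow. */
static uint64_t filter_pte(uint64_t pte, uint64_t supported_mask)
{
    return pte & supported_mask;
}
```

If NX is unsupported or disabled, filtering an entry through the mask silently clears bit 63, so the entry never carries a flag that the CPU would not accept.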
Conclusion
This is the end of the fifth part about the Linux kernel initialization process. In this part we continued to dive into the setup_arch function, which initializes architecture-specific stuff. It was a long part, but we have not finished with it. As I already wrote, setup_arch is a big function, and I am really not sure that we will cover all of it even in the next part. There were some new interesting concepts in this part, like fix-mapped addresses, ioremap, etc. Don't worry if they are unclear to you. There is a special part about these concepts - Linux kernel memory management Part 2. In the next part we will continue with the initialization of architecture-specific stuff and will see the parsing of the early kernel parameters, the early dump of the PCI devices, Desktop Management Interface scanning and many many more.
If you have any questions or suggestions, write me a comment or ping me on twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.