Kernel initialization. Part 10.
End of the linux kernel initialization process
This is the tenth part of the chapter about the linux kernel initialization process. In the previous part we saw the initialization of RCU and stopped at the call of the acpi_early_init
function. This part is the last one in the Kernel initialization process chapter, so let's finish it.
After the call of the acpi_early_init
function from the init/main.c, we can see the following code:
#ifdef CONFIG_X86_ESPFIX64
init_espfix_bsp();
#endif
Here we can see the call of the init_espfix_bsp
function which depends on the CONFIG_X86_ESPFIX64
kernel configuration option. As the function name suggests, it does something with the stack. This function is defined in arch/x86/kernel/espfix_64.c and prevents leaking bits 31:16
of the esp
register when returning to a 16-bit stack. First of all, init_espfix_bsp
installs the espfix
page upper directory into the kernel page global directory:
pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
pgd_populate(&init_mm, pgd_p, (pud_t *)espfix_pud_page);
Where ESPFIX_BASE_ADDR
is:
#define PGDIR_SHIFT 39
#define ESPFIX_PGD_ENTRY _AC(-2, UL)
#define ESPFIX_BASE_ADDR (ESPFIX_PGD_ENTRY << PGDIR_SHIFT)
Also we can find it in the Documentation/x86/x86_64/mm:
... unused hole ...
ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
... unused hole ...
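The address range quoted from the documentation follows directly from the two macros above, and the arithmetic can be checked in user space. This is only a sketch: the helper names espfix_base_addr, espfix_last_addr and pgd_index_of are made up for the example, and it assumes 4-level paging with 512 entries per page global directory.

```c
#include <stdint.h>

#define PGDIR_SHIFT  39
#define PTRS_PER_PGD 512   /* assumed: 4-level paging on x86_64 */

/* (-2UL << PGDIR_SHIFT), written with an explicit 64-bit type */
static uint64_t espfix_base_addr(void)
{
    uint64_t pgd_entry = (uint64_t)-2;

    return pgd_entry << PGDIR_SHIFT;
}

/* Last byte covered by the single PGD entry that holds the espfix area */
static uint64_t espfix_last_addr(void)
{
    return espfix_base_addr() + (UINT64_C(1) << PGDIR_SHIFT) - 1;
}

/* Index of the PGD slot for an address, in the spirit of pgd_index() */
static unsigned int pgd_index_of(uint64_t addr)
{
    return (unsigned int)((addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1));
}
```

espfix_base_addr() evaluates to 0xffffff0000000000 and espfix_last_addr() to 0xffffff7fffffffff, which matches the range in the documentation excerpt. The corresponding PGD slot is 510, the second-to-last entry.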
After we've filled the page global directory with the espfix
pud, the next step is to call the init_espfix_random
and init_espfix_ap
functions. The first function picks random locations for the espfix
page and the second enables espfix
for the current CPU. After init_espfix_bsp
has finished its work, we can see the call of the thread_info_cache_init
function, which is defined in kernel/fork.c and allocates a cache for thread_info
if THREAD_SIZE
is less than PAGE_SIZE
:
# if THREAD_SIZE >= PAGE_SIZE
...
...
...
void thread_info_cache_init(void)
{
thread_info_cache = kmem_cache_create("thread_info", THREAD_SIZE,
THREAD_SIZE, 0, NULL);
BUG_ON(thread_info_cache == NULL);
}
...
...
...
#endif
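Whether the cache is needed is decided purely by comparing two compile-time constants. The sketch below replays that comparison with the x86_64 values (a PAGE_SIZE of 4096 bytes and a THREAD_SIZE of 16384 bytes). PAGE_SHIFT = 12 and THREAD_SIZE_ORDER = 2 are assumptions derived from those two numbers, and needs_thread_info_cache is a hypothetical helper, not a kernel function.

```c
#define PAGE_SHIFT        12    /* assumed: gives 4096-byte pages */
#define PAGE_SIZE         (1UL << PAGE_SHIFT)
#define THREAD_SIZE_ORDER 2     /* assumed: the order implied by 16384 bytes */
#define THREAD_SIZE       (PAGE_SIZE << THREAD_SIZE_ORDER)

/* Mirrors the condition from the text: a dedicated slab cache is only
 * used when a kernel stack is smaller than one page. */
static int needs_thread_info_cache(unsigned long thread_size,
                                   unsigned long page_size)
{
    return thread_size < page_size;
}
```

With these values needs_thread_info_cache(THREAD_SIZE, PAGE_SIZE) is 0, so under the stated assumptions the "thread_info" slab cache is not the path taken on x86_64.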
As we already know the PAGE_SIZE
is (_AC(1,UL) << PAGE_SHIFT)
or 4096
bytes and THREAD_SIZE
is (PAGE_SIZE << THREAD_SIZE_ORDER)
or 16384
bytes for the x86_64
. The next function after the thread_info_cache_init
is the cred_init
from the kernel/cred.c. This function just allocates cache for the credentials (like uid
, gid
, etc.):
void __init cred_init(void)
{
cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred),
0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
}
You can read more about credentials in Documentation/security/credentials.txt. The next step is the fork_init
function from kernel/fork.c. The fork_init
function allocates a cache for the task_struct
. Let's look at the implementation of fork_init
. First of all we can see the definition of the ARCH_MIN_TASKALIGN
macro and the creation of a slab where task_structs will be allocated:
#ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
#ifndef ARCH_MIN_TASKALIGN
#define ARCH_MIN_TASKALIGN L1_CACHE_BYTES
#endif
task_struct_cachep =
kmem_cache_create("task_struct", sizeof(struct task_struct),
ARCH_MIN_TASKALIGN, SLAB_PANIC | SLAB_NOTRACK, NULL);
#endif
As we can see, this code depends on the CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
kernel configuration option. This option indicates that the given architecture provides its own alloc_task_struct
. Note the #ifndef
: the generic slab is created only when the architecture does not bring its own allocator. As x86_64
does not provide its own alloc_task_struct
function, this code is compiled on x86_64
and the task_struct
cache is created here.
Allocating cache for init task
After this we can see the call of the arch_task_cache_init
function in the fork_init
:
void arch_task_cache_init(void)
{
task_xstate_cachep =
kmem_cache_create("task_xstate", xstate_size,
__alignof__(union thread_xstate),
SLAB_PANIC | SLAB_NOTRACK, NULL);
setup_xstate_comp();
}
The arch_task_cache_init
does initialization of the architecture-specific caches. In our case it is x86_64
, so as we can see, the arch_task_cache_init
allocates cache for the task_xstate
which represents FPU state and sets up offsets and sizes of all extended states in xsave area with the call of the setup_xstate_comp
function. After the arch_task_cache_init
we calculate default maximum number of threads with the:
set_max_threads(MAX_THREADS);
where default maximum number of threads is:
#define FUTEX_TID_MASK 0x3fffffff
#define MAX_THREADS FUTEX_TID_MASK
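MAX_THREADS is only the upper bound that is passed in. A rough user-space model of what set_max_threads does with it is sketched below. It assumes that the kernel scales the number of threads with the amount of RAM so that thread stacks can take at most about one eighth of memory, and that there is a small lower bound. The divisor 8, the lower bound of 20 and the name model_max_threads are assumptions of this example and are not shown in the excerpt above.

```c
#include <stdint.h>

#define FUTEX_TID_MASK 0x3fffffff
#define MAX_THREADS    FUTEX_TID_MASK
#define MIN_THREADS    20            /* assumed lower bound */
#define PAGE_SIZE      4096ULL
#define THREAD_SIZE    16384ULL

/* Hypothetical model of set_max_threads(): let thread stacks take at
 * most 1/8 of RAM (assumed ratio), honour the suggested maximum, then
 * clamp to [MIN_THREADS, MAX_THREADS]. Overflow of the multiplication
 * is ignored here, which is fine for realistic page counts. */
static uint64_t model_max_threads(uint64_t totalram_pages, uint64_t suggested)
{
    uint64_t threads = (totalram_pages * PAGE_SIZE) / (THREAD_SIZE * 8);

    if (threads > suggested)
        threads = suggested;
    if (threads < MIN_THREADS)
        threads = MIN_THREADS;
    if (threads > MAX_THREADS)
        threads = MAX_THREADS;
    return threads;
}
```

For a machine with 16 GiB of RAM (4194304 pages of 4096 bytes) this model yields 131072 threads, far below the MAX_THREADS bound.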
At the end of the fork_init
function we set the resource limits which are stored in the signal descriptor of the init task:
init_task.signal->rlim[RLIMIT_NPROC].rlim_cur = max_threads/2;
init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
init_task.signal->rlim[RLIMIT_SIGPENDING] =
init_task.signal->rlim[RLIMIT_NPROC];
As we know, init_task
is an instance of the task_struct
structure, so it contains the signal
field, which represents the signal descriptor of the process. It has the type struct signal_struct
. The first two lines set the current (soft) and maximum (hard) values of the RLIMIT_NPROC
resource limit. Every process has an associated set of resource limits which specify the amount of resources the process can use. Here rlim
is the array of resource limits, and each entry is represented by the:
struct rlimit {
__kernel_ulong_t rlim_cur;
__kernel_ulong_t rlim_max;
};
structure from the include/uapi/linux/resource.h. In our case the resource is the RLIMIT_NPROC
which is the maximum number of processes that user can own and RLIMIT_SIGPENDING
- the maximum number of pending signals. We can see it in the:
cat /proc/self/limits
Limit Soft Limit Hard Limit Units
...
...
...
Max processes 63815 63815 processes
Max pending signals 63815 63815 signals
...
...
...
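The same two limits can also be read programmatically with the getrlimit system call, which fills exactly the rlim_cur and rlim_max pair described above. The following is a minimal user-space sketch. The wrapper names are made up for the example, and RLIMIT_SIGPENDING is Linux-specific.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Reads one resource limit of the calling process.
 * Returns 0 on success and -1 on failure, like getrlimit(2). */
static int read_limit(int resource, struct rlimit *out)
{
    return getrlimit(resource, out);
}

/* Convenience wrappers for the two limits discussed above. */
static int read_nproc_limit(struct rlimit *out)
{
    return read_limit(RLIMIT_NPROC, out);
}

static int read_sigpending_limit(struct rlimit *out)
{
    return read_limit(RLIMIT_SIGPENDING, out);
}

/* Prints the soft and hard values of a limit in one line. */
static void print_limit(const char *name, const struct rlimit *lim)
{
    printf("%-20s soft=%llu hard=%llu\n", name,
           (unsigned long long)lim->rlim_cur,
           (unsigned long long)lim->rlim_max);
}
```

Calling read_nproc_limit and then print_limit("Max processes", ...) should report the same numbers as the "Max processes" row of /proc/self/limits for the calling process.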
Initialization of the caches
The next function after the fork_init
is the proc_caches_init
from the kernel/fork.c. This function allocates caches for the memory descriptors (or mm_struct
structure). At the beginning of the proc_caches_init
we can see allocation of the different SLAB caches with the call of the kmem_cache_create
:
- sighand_cachep - manage information about installed signal handlers;
- signal_cachep - manage information about process signal descriptor;
- files_cachep - manage information about opened files;
- fs_cachep - manage filesystem information.
After this we allocate SLAB
cache for the mm_struct
structures:
mm_cachep = kmem_cache_create("mm_struct",
sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL);
After this we allocate a SLAB
cache for the important vm_area_struct
, which is used by the kernel to manage virtual memory space:
vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC);
Note, that we use KMEM_CACHE
macro here instead of the kmem_cache_create
. This macro is defined in the include/linux/slab.h and just expands to the kmem_cache_create
call:
#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
sizeof(struct __struct), __alignof__(struct __struct),\
(__flags), NULL)
The KMEM_CACHE
macro differs from a direct kmem_cache_create
call in one respect. Take a look at the __alignof__
operator. The KMEM_CACHE
macro aligns the objects of the SLAB
to the alignment requirement of the given structure, whereas a direct kmem_cache_create
call uses whatever alignment value the caller passes. After this we can see the calls of the mmap_init
and nsproxy_cache_init
functions. The first function initializes the virtual memory area SLAB
and the second function initializes the SLAB
for namespaces.
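The two ingredients of KMEM_CACHE, stringification of the type name and __alignof__, can be tried outside the kernel. The sketch below builds a macro of the same shape that only collects the three values which would be passed on. The cache_params structure, the CACHE_PARAMS macro and the demo_area structure are hypothetical and exist only for this illustration. __alignof__ is a GCC/Clang extension, as in the kernel.

```c
#include <stddef.h>

/* Hypothetical stand-in for the arguments KMEM_CACHE hands to
 * kmem_cache_create(): name, object size and alignment. */
struct cache_params {
    const char *name;
    size_t size;
    size_t align;
};

/* Same shape as KMEM_CACHE: #__struct turns the type name into a string,
 * sizeof gives the object size, __alignof__ the required alignment. */
#define CACHE_PARAMS(__struct) ((struct cache_params){ \
    #__struct, sizeof(struct __struct), __alignof__(struct __struct) })

/* Example structure, made up for this sketch */
struct demo_area {
    long start;
    long end;
    char flags[3];
};
```

CACHE_PARAMS(demo_area) yields the name "demo_area", the size of the structure and its alignment. On a typical x86_64 build the structure occupies 24 bytes but only requires 8-byte alignment, which shows that the alignment taken from __alignof__ is generally not the size of the structure.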
The next function after proc_caches_init
is buffer_init
. This function is defined in the fs/buffer.c source code file and allocates a cache for the buffer_head
. The buffer_head
is a special structure which is defined in include/linux/buffer_head.h and used for managing buffers. At the start of the buffer_init
function we allocate a cache for the struct buffer_head
structures with a call to the kmem_cache_create
function, as we did in the previous functions. Then we calculate the maximum number of buffer heads that may be kept in memory with:
nrpages = (nr_free_buffer_pages() * 10) / 100;
max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
which limits the buffer head occupancy to 10%
of the ZONE_NORMAL
(all RAM above 4GB on the x86_64
). The next function after buffer_init
is vfs_caches_init
. This function allocates SLAB
caches and hash tables for different VFS caches. We already saw the vfs_caches_init_early
function in the eighth part of the linux kernel initialization process, which initialized caches for the dcache
(or directory cache) and the inode cache. The vfs_caches_init
function performs the post-early initialization of the dcache
and inode
caches, the private data cache, hash tables for the mount points, etc. More details about VFS will be described in a separate part. After this we can see the signals_init
function. This function is defined in kernel/signal.c and allocates a cache for the sigqueue
structures which represent the queue of real-time signals. The next function is page_writeback_init
. This function initializes the ratio for the dirty pages. Every low-level page table entry contains the dirty
bit which indicates whether a page has been written to after being loaded into memory.
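Returning briefly to the buffer_init calculation above, the two lines of arithmetic are easy to replay with concrete numbers. In the sketch below the number of free buffer pages and the size of one buffer_head (104 bytes) are assumed example inputs, and model_max_buffer_heads is a hypothetical helper. The real values depend on the machine and on the kernel configuration.

```c
#define PAGE_SIZE 4096UL

/* Hypothetical helper: upper bound on the number of buffer heads, given
 * the number of free buffer pages and the size of one buffer_head. */
static unsigned long model_max_buffer_heads(unsigned long free_buffer_pages,
                                            unsigned long bh_size)
{
    /* 10% of the pages may be used for buffer heads */
    unsigned long nrpages = (free_buffer_pages * 10) / 100;

    /* each of those pages holds PAGE_SIZE / bh_size whole buffer heads */
    return nrpages * (PAGE_SIZE / bh_size);
}
```

With 1000000 free buffer pages and an assumed 104-byte buffer_head, 100000 pages are reserved, each page holds 39 buffer heads, and the limit comes out at 3900000.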
Creation of the root for the procfs
After all of these preparations we need to create the root for the proc filesystem. We do it with a call to the proc_root_init
function from fs/proc/root.c. At the start of the proc_root_init
function we allocate the cache for the inodes and register a new filesystem in the system with:
err = register_filesystem(&proc_fs_type);
if (err)
return;
As I wrote above, we will not dive into details about the VFS and different filesystems in this chapter, but will see them in the chapter about the VFS
. After we've registered a new filesystem in our system, we call the proc_self_init
function from fs/proc/self.c, which allocates the inode
number for self
(the /proc/self
directory refers to the process accessing the /proc
filesystem). The next step after proc_self_init
is proc_setup_thread_self
, which sets up the /proc/thread-self
directory that contains information about the current thread. After this we create the /proc/mounts
symlink, which points to self/mounts
and therefore lists the mount points visible to the current process, with the call of the
proc_symlink("mounts", NULL, "self/mounts");
and a couple of directories, some of which depend on kernel configuration options:
#ifdef CONFIG_SYSVIPC
proc_mkdir("sysvipc", NULL);
#endif
proc_mkdir("fs", NULL);
proc_mkdir("driver", NULL);
proc_mkdir("fs/nfsd", NULL);
#if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
proc_mkdir("openprom", NULL);
#endif
proc_mkdir("bus", NULL);
...
...
...
if (!proc_mkdir("tty", NULL))
return;
proc_mkdir("tty/ldisc", NULL);
...
...
...
In the end of the proc_root_init
we call the proc_sys_init
function which creates /proc/sys
directory and initializes the Sysctl.
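The statement that /proc/self refers to the process accessing the /proc filesystem can be observed from user space: /proc/self is a symbolic link whose target is the PID of whoever resolves it. The helper below is a small sketch with a made-up name, and it assumes that procfs is mounted at /proc.

```c
#define _POSIX_C_SOURCE 200809L

#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Resolves the /proc/self symlink and returns the PID it points to,
 * or -1 if procfs is not available. */
static long pid_from_proc_self(void)
{
    char target[32];
    ssize_t len = readlink("/proc/self", target, sizeof(target) - 1);

    if (len < 0)
        return -1;
    target[len] = '\0';          /* readlink() does not terminate the string */
    return strtol(target, NULL, 10);
}
```

When procfs is mounted, the returned value is expected to equal getpid() of the calling process.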
This is the end of the start_kernel
function. I did not describe all the functions which are called in start_kernel
. I skipped them because they are not important for the generic kernel initialization and depend only on particular kernel configuration options. They are: taskstats_init_early
, which exports per-task statistics to user space; delayacct_init
, which initializes per-task delay accounting; key_init
and security_init
, which initialize different security facilities; check_bugs
, which fixes up some architecture-dependent bugs; ftrace_init
, which initializes ftrace; cgroup_init
, which initializes the rest of the cgroup subsystem; etc. Many of these parts and subsystems will be described in other chapters.
That's all. Finally we have passed through the long-long start_kernel
function. But it is not the end of the linux kernel initialization process. We haven't run the first process yet. At the end of start_kernel
we can see the last call, to the rest_init
function. Let's go ahead.
First steps after the start_kernel
The rest_init
function is defined in the same source code file as start_kernel
function, and this file is init/main.c. In the beginning of the rest_init
we can see call of the two following functions:
rcu_scheduler_starting();
smpboot_thread_init();
The first, rcu_scheduler_starting
, makes the RCU scheduler active, and the second, smpboot_thread_init
, registers the smpboot_thread_notifier
CPU notifier (you can read more about it in the CPU hotplug documentation). After this we can see the following calls:
kernel_thread(kernel_init, NULL, CLONE_FS);
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
Here the kernel_thread
function (defined in kernel/fork.c) creates a new kernel thread. As we can see, the kernel_thread
function takes three arguments:
- Function which will be executed in a new thread;
- Parameter for this function (kernel_init in the first call);
- Flags.
We will not dive into details about the kernel_thread
implementation (we will see it in the chapter which describes the scheduler; for now it is enough to say that kernel_thread
invokes clone). Now we only need to know that we create a new kernel thread with the kernel_thread
function, that the parent and child will share filesystem information, and that the new thread will start executing the kernel_init
function. A kernel thread differs from a user thread in that it runs in kernel mode. So with these two kernel_thread
calls we create two new kernel threads: one with PID = 1
for the init
process and one with PID = 2
for kthreadd
. We already know what the init
process is. Let's look at kthreadd
. It is a special kernel thread which manages and helps different parts of the kernel to create other kernel threads. We can see it in the output of the ps
utility:
$ ps -ef | grep kthreadd
root 2 0 0 Jan11 ? 00:00:00 [kthreadd]
Let's postpone kernel_init
and kthreadd
for now and go ahead in the rest_init
. In the next step after we have created two new kernel threads we can see the following code:
rcu_read_lock();
kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
rcu_read_unlock();
The first rcu_read_lock
function marks the beginning of an RCU read-side critical section and the rcu_read_unlock
marks the end of an RCU read-side critical section. We call these functions because we need to protect the find_task_by_pid_ns
. The find_task_by_pid_ns
returns pointer to the task_struct
by the given pid. So, here we are getting the pointer to the task_struct
for PID = 2
(we got it after kthreadd
creation with the kernel_thread
). In the next step we call complete
function
complete(&kthreadd_done);
and pass the address of kthreadd_done
. The kthreadd_done
is defined as
static __initdata DECLARE_COMPLETION(kthreadd_done);
where the DECLARE_COMPLETION
macro is defined as:
#define DECLARE_COMPLETION(work) \
struct completion work = COMPLETION_INITIALIZER(work)
and expands to the definition of the completion
structure. This structure is defined in include/linux/completion.h and implements the completions
concept. Completions are a code synchronization mechanism which provides a race-free solution for threads that must wait for some other activity to reach a specific point or state. Using completions consists of three parts. The first is the definition of the completion
structure, and we did it with DECLARE_COMPLETION
. The second is a call of wait_for_completion
. A thread which calls this function will not continue to execute until another thread calls the complete
function. Note that we call wait_for_completion
with kthreadd_done
at the beginning of kernel_init_freeable
:
wait_for_completion(&kthreadd_done);
And the last step is to call the complete
function, as we saw above. Thus the kernel_init_freeable
function will not proceed until the kthreadd
thread has been set up. After kthreadd
has been set up, we can see the three following functions in rest_init
:
init_idle_bootup_task(current);
schedule_preempt_disabled();
cpu_startup_entry(CPUHP_ONLINE);
The first init_idle_bootup_task
function from the kernel/sched/core.c sets the Scheduling class for the current process (idle
class in our case):
void init_idle_bootup_task(struct task_struct *idle)
{
idle->sched_class = &idle_sched_class;
}
where the idle
class has the lowest priority: its tasks run only when the processor has nothing else to run. The second function, schedule_preempt_disabled
, disables preemption in idle
tasks. And the third function, cpu_startup_entry
, is defined in kernel/sched/idle.c and calls cpu_idle_loop
from the same file. The cpu_idle_loop
function runs as the process with PID = 0
and works in the background. The main purpose of cpu_idle_loop
is to consume idle CPU cycles. When there is no process to run, this process starts to work. We have one process with the idle
scheduling class (we just set the current
task to idle
with the call of the init_idle_bootup_task
function), so the idle
thread does not do useful work but just checks whether there is an active task to switch to:
static void cpu_idle_loop(void)
{
...
...
...
while (1) {
while (!need_resched()) {
...
...
...
}
...
}
}
More about it will be in the chapter about the scheduler. So at this moment start_kernel
has called the rest_init
function, which spawns the init
process (the kernel_init
function) and becomes the idle
process itself. Now it is time to look at kernel_init
. Execution of the kernel_init
function starts with a call to the kernel_init_freeable
function. First of all, the kernel_init_freeable
function waits for the completion of the kthreadd
setup. I already wrote about it above:
wait_for_completion(&kthreadd_done);
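The handshake between rest_init and kernel_init can be modelled in user space. The sketch below is not the kernel implementation. It is an analogue built on a POSIX mutex and condition variable, and it only borrows the names completion, complete and wait_for_completion to make the correspondence obvious.

```c
#include <pthread.h>

/* User-space analogue of struct completion: a counter of pending
 * "done" events protected by a mutex, plus a condition variable. */
struct completion {
    pthread_mutex_t lock;
    pthread_cond_t  wait;
    unsigned int    done;
};

#define COMPLETION_INITIALIZER \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

/* Record that the awaited event has happened and wake one waiter. */
static void complete(struct completion *c)
{
    pthread_mutex_lock(&c->lock);
    c->done++;
    pthread_cond_signal(&c->wait);
    pthread_mutex_unlock(&c->lock);
}

/* Block until complete() has been called, then consume the event. */
static void wait_for_completion(struct completion *c)
{
    pthread_mutex_lock(&c->lock);
    while (c->done == 0)
        pthread_cond_wait(&c->wait, &c->lock);
    c->done--;
    pthread_mutex_unlock(&c->lock);
}
```

Because the event is counted rather than merely signalled, it does not matter whether complete runs before or after wait_for_completion. The waiter is never left sleeping, which is the race-free property that makes completions suitable for ordering kernel_init after the creation of kthreadd.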
After this we set gfp_allowed_mask
to __GFP_BITS_MASK
which means that system is already running, set allowed cpus/mems to all CPUs and NUMA nodes with the set_mems_allowed
function, allow init
process to run on any CPU with the set_cpus_allowed_ptr
, set pid for the cad
or Ctrl-Alt-Delete
, do preparation for booting of the other CPUs with the call of the smp_prepare_cpus
, call early initcalls with the do_pre_smp_initcalls
, initialize SMP
with the smp_init
and initialize lockup_detector with the call of the lockup_detector_init
and initialize scheduler with the sched_init_smp
.
After this we can see the call of the do_basic_setup
function. By the time we call do_basic_setup
, our kernel is already initialized. As the comment says:
Now we can finally start doing some real work..
The do_basic_setup
function reinitializes cpuset to the active CPUs, initializes khelper
(a kernel thread which is used for making calls out to userspace from within the kernel), initializes tmpfs, initializes the drivers
subsystem, enables the user-mode helper workqueue
and makes the post-early call of the initcalls
. After do_basic_setup
we can see the opening of /dev/console
and two duplications of file descriptor 0
, which together provide the file descriptors from 0 to 2
:
if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
pr_err("Warning: unable to open an initial console.\n");
(void) sys_dup(0);
(void) sys_dup(0);
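The same open-then-dup-twice pattern can be tried from user space with the open and dup system call wrappers. Both calls return the lowest-numbered free descriptor, which is why a process that starts with no open files ends up with descriptors 0, 1 and 2 all referring to the console. The helper open_three below is a hypothetical name used only for this sketch.

```c
#include <fcntl.h>
#include <unistd.h>

/* Opens path read-write and duplicates the descriptor twice.
 * On success fds[0..2] all refer to the same open file and 0 is returned;
 * on failure -1 is returned and nothing is left open. */
static int open_three(const char *path, int fds[3])
{
    fds[0] = open(path, O_RDWR);
    if (fds[0] < 0)
        return -1;

    fds[1] = dup(fds[0]);
    fds[2] = dup(fds[0]);
    if (fds[1] < 0 || fds[2] < 0) {
        if (fds[1] >= 0)
            close(fds[1]);
        if (fds[2] >= 0)
            close(fds[2]);
        close(fds[0]);
        return -1;
    }
    return 0;
}
```

In an ordinary program descriptors 0 to 2 are usually already taken, so open_three("/dev/null", fds) will typically return higher numbers, but the three descriptors still share one open file in the same way.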
We are using two system calls here: sys_open
and sys_dup
. In the next chapters we will see the explanation and implementation of different system calls. After we have opened the initial console, we check whether the rdinit=
option was passed on the kernel command line, and otherwise set the default path of the init program in the ramdisk:
if (!ramdisk_execute_command)
ramdisk_execute_command = "/init";
Then we check whether the ramdisk_execute_command
file can be accessed. If it cannot, we reset the command and call the prepare_namespace
function from init/do_mounts.c, which checks and mounts the initrd:
if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
ramdisk_execute_command = NULL;
prepare_namespace();
}
This is the end of the kernel_init_freeable
function, and we need to return to kernel_init
. The next step after kernel_init_freeable
has finished its execution is async_synchronize_full
. This function waits until all asynchronous function calls have completed. After it we call free_initmem
, which releases all memory occupied by the initialization code and data located between __init_begin
and __init_end
. After this we protect .rodata
with mark_rodata_ro
and update the state of the system from SYSTEM_BOOTING
to SYSTEM_RUNNING
:
system_state = SYSTEM_RUNNING;
Then the kernel tries to run the init
process:
if (ramdisk_execute_command) {
ret = run_init_process(ramdisk_execute_command);
if (!ret)
return 0;
pr_err("Failed to execute %s (error %d)\n",
ramdisk_execute_command, ret);
}
First of all it checks the ramdisk_execute_command
which we set in the kernel_init_freeable
function and it will be equal to the value of the rdinit=
kernel command line parameters or /init
by default. The run_init_process
function fills the first element of the argv_init
array:
static const char *argv_init[MAX_INIT_ARGS+2] = { "init", NULL, };
which represents arguments of the init
program and call do_execve
function:
argv_init[0] = init_filename;
return do_execve(getname_kernel(init_filename),
(const char __user *const __user *)argv_init,
(const char __user *const __user *)envp_init);
The do_execve
function is declared in include/linux/sched.h and runs a program with the given file name and arguments. If we did not pass the rdinit=
option on the kernel command line, the kernel checks execute_command
, which is equal to the value of the init=
kernel command line parameter:
if (execute_command) {
ret = run_init_process(execute_command);
if (!ret)
return 0;
panic("Requested init %s failed (error %d).",
execute_command, ret);
}
If we did not pass init=
kernel command line parameter either, kernel tries to run one of the following executable files:
if (!try_to_run_init_process("/sbin/init") ||
!try_to_run_init_process("/etc/init") ||
!try_to_run_init_process("/bin/init") ||
!try_to_run_init_process("/bin/sh"))
return 0;
Otherwise we finish with panic:
panic("No working init found. Try passing init= option to kernel. "
"See Linux Documentation/init.txt for guidance.");
That's all! Linux kernel initialization process is finished!
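The order in which kernel_init looks for an init program can be summarised as a small decision function. This is a user-space model only: choose_init, the can_run_fn callback and the two example predicates are hypothetical names, and the callback stands in for run_init_process succeeding.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: returns nonzero if the program at path could be
 * started. It stands in for run_init_process() succeeding. */
typedef int (*can_run_fn)(const char *path);

/* Returns the path that would become init, or NULL where the kernel
 * would panic. Mirrors the order: rdinit=, then init=, then defaults. */
static const char *choose_init(const char *ramdisk_execute_command,
                               const char *execute_command,
                               can_run_fn can_run)
{
    static const char *const fallbacks[] = {
        "/sbin/init", "/etc/init", "/bin/init", "/bin/sh",
    };
    size_t i;

    if (ramdisk_execute_command && can_run(ramdisk_execute_command))
        return ramdisk_execute_command;

    /* An explicit init= that fails is fatal: no fallback is tried. */
    if (execute_command)
        return can_run(execute_command) ? execute_command : NULL;

    for (i = 0; i < sizeof(fallbacks) / sizeof(fallbacks[0]); i++)
        if (can_run(fallbacks[i]))
            return fallbacks[i];

    return NULL;
}

/* Example predicates for experimenting with the model */
static int only_bin_sh(const char *path)
{
    return strcmp(path, "/bin/sh") == 0;
}

static int nothing_runs(const char *path)
{
    (void)path;
    return 0;
}
```

For example, choose_init(NULL, NULL, only_bin_sh) falls through the default list and returns "/bin/sh", while choose_init(NULL, "/custom/init", only_bin_sh) returns NULL, which corresponds to the "Requested init ... failed" panic.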
Conclusion
This is the end of the tenth part about the linux kernel initialization process. It is not only the tenth
part, but also the last part which describes the initialization of the linux kernel. As I wrote in the first part of this chapter, we would go through all the steps of the kernel initialization, and we did it. We started at the first architecture-independent function, start_kernel
, and finished with the launch of the first init
process in our system. I skipped details about different subsystems of the kernel; for example, I almost did not cover the scheduler, interrupts, exception handling, etc. Starting from the next part we will dive into the different kernel subsystems. I hope it will be interesting.
If you have any questions or suggestions, write me a comment or ping me on twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.