The title of my question assumes that fork creates a new process that is exactly the same as its parent. I am wondering how the operating system really implements a fork.
Consider a heavy process (huge RAM footprint) that forks itself to accomplish a small task (listing the files in a directory). Under that assumption, I would expect the child process to be as big as the parent. However, common sense tells me that this cannot be the case.
How does it work in the real world?
Best How To :
As others mentioned in the comments, a technique called copy-on-write mitigates the heavy cost of copying the entire memory space of the parent process. Copy-on-write means that memory pages are shared read-only between parent and child until either of them writes to a page, at which point the page is copied and each process gets its own private copy. This avoids a huge amount of copying that would often be wasted, because the child will typically call exec() or do something simple and exit.
Here's what happens in detail:
When you call fork(2), the only immediate costs you incur are allocating a new, unique process descriptor and copying the parent's page tables. In Linux, fork(2) is implemented on top of the clone(2) syscall, a more general syscall that lets the caller control which parts of the new process are shared with the parent. When it is called on behalf of fork(2), the flags passed indicate that nothing is to be shared. (You can choose to share memory, file descriptors, etc. This is how threads are implemented: by passing CLONE_VM, which means "share the memory space", along with other flags.)
Conceptually, each memory page of a process carries a copy-on-write flag that indicates whether the page must be copied before it is written to. Each page also maintains a reference count.
So, when a process forks, the kernel sets the copy-on-write flag on every private, writable page of that process and increments the page's reference count by one. (Pages in shared mappings are not copy-on-write; they simply stay shared.) The child process has pointers to these same pages.
Then, every such page is marked read-only so that an attempt to write to it generates a page fault. This is needed to hand control to the kernel so that it has a chance to see what happened and what needs to be done.
When either process writes to a page that is still shared, and thus marked read-only, the kernel takes the page fault and works out why it occurred. Assuming the parent or child is writing to a valid location, the kernel eventually sees that the fault was generated because the page is marked copy-on-write and there is more than one reference to it.
The kernel then allocates memory, copies the page into the new location, and the write can proceed.
What is different across forks
You said that fork(2) creates a new process that is exactly the same as its parent. This is not quite true. There are several differences between the parent and the child:
- The process ID is different
- The parent process ID is different
- The child's resource usage counters (CPU time, etc.) are reset to 0
- Record locks held by the parent (fcntl(2)) are not inherited
- The set of pending signals on the child is cleared
- Pending alarms are cleared on the child
The vfork(2) syscall is very similar to fork(2), but it does absolutely no copying: it doesn't even copy the parent's page tables. Since the introduction of copy-on-write it is no longer as widely used, but historically it was used by processes that would call exec() right after forking.
Naturally, because the child runs in the parent's address space, attempting to write to memory in the child process after a vfork() results in chaos.