Add high-level overview chapter

iluuu1994 · iluuu1994 · commit d8ea4901dee6 · 2024-02-12T19:08:54.000+01:00
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -2,6 +2,12 @@
  php-src docs
 ##############
 
+.. toctree::
+   :caption: Introduction
+   :hidden:
+
+   introduction/high-level-overview
+
 .. toctree::
    :caption: Core
    :hidden:
diff --git a/docs/source/introduction/high-level-overview.rst b/docs/source/introduction/high-level-overview.rst
@@ -0,0 +1,220 @@
+#####################
+ High-level overview
+#####################
+
+PHP is an interpreted language. Interpreted languages differ from compiled ones in that they aren't
+compiled into machine code ahead of time. Instead, the source files are read, processed and
+interpreted when the program is executed. This can be very convenient for developers for rapid
+prototyping, as it skips a lengthy compilation phase. However, it also poses some unique challenges
+to performance, which is one of the primary reasons interpreters can be complex. php-src borrows
+many concepts from compilers and other interpreters.
+
+**********
+ Concepts
+**********
+
+The goal of the interpreter is to read the user's source files from disk and simulate the user's
+intent. This process can be split into distinct phases that are easier to understand and implement.
+
+-  Tokenization - splitting whole source files into words, called tokens.
+-  Parsing - building a tree structure from tokens, called AST (abstract syntax tree).
+-  Compilation - turning the tree structure into a list of operations, called opcodes.
+-  Interpretation - reading and executing opcodes.
+
+php-src as a whole can be seen as a pipeline consisting of these stages.
+
+.. code:: haskell
+
+   source_code
+     |> tokenizer   -- tokens
+     |> parser      -- ast
+     |> compiler    -- opcodes
+     |> interpreter
+
+Let's go into these phases in a bit more detail.
+
+**************
+ Tokenization
+**************
+
+Tokenization, often called "lexing" or "scanning", is the process of taking an entire program file
+and splitting it into a list of words and symbols. A token generally consists of a type (a simple
+integer constant identifying the kind of token) and a lexeme (the literal string from the source).
+
+.. code:: php
+
+   if ($cond) {
+       echo "Cond is true\n";
+   }
+
+.. code:: text
+
+   T_IF                       "if"
+   T_WHITESPACE               " "
+                              "("
+   T_VARIABLE                 "$cond"
+                              ")"
+   T_WHITESPACE               " "
+                              "{"
+   T_WHITESPACE               "\n    "
+   T_ECHO                     "echo"
+   T_WHITESPACE               " "
+   T_CONSTANT_ENCAPSED_STRING '"Cond is true\n"'
+                              ";"
+   T_WHITESPACE               "\n"
+                              "}"
+
+While tokenizers are not difficult to write by hand, PHP uses a tool called ``re2c`` to automate
+this process. It takes a definition file and generates efficient C code to build these tokens from a
+stream of characters. The definition for PHP lives in ``Zend/zend_language_scanner.l``. Check the
+`re2c documentation`_ for details.
+
+.. _re2c documentation: https://re2c.org/
+
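+The token stream can also be inspected from userland. The following is a minimal sketch, assuming
+PHP 8.0 or later with the tokenizer extension enabled. It feeds the snippet from above to
+``PhpToken::tokenize()`` and prints each token's name next to its lexeme. Note that the opening
+``<?php`` tag is added here so that the input is scanned as PHP code, which produces an extra
+``T_OPEN_TAG`` token at the start.
+
+.. code:: php
+
+   <?php
+
+   // The snippet from above, prefixed with an opening tag so that the
+   // tokenizer treats it as PHP code rather than inline HTML.
+   $code = <<<'PHP'
+   <?php if ($cond) {
+       echo "Cond is true\n";
+   }
+   PHP;
+
+   foreach (PhpToken::tokenize($code) as $token) {
+       // getTokenName() returns e.g. "T_IF" for named tokens, and the
+       // character itself for single-character tokens such as "(".
+       printf("%-28s %s\n", $token->getTokenName(), json_encode($token->text));
+   }
+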
+*********
+ Parsing
+*********
+
+Parsing is the process of reading the tokens generated by the tokenizer and building a tree
+structure from them. To humans, nesting seems obvious when looking at source code, given indentation
+through whitespace and the usage of symbols like ``()`` and ``{}``. The tokens are transformed into
+a tree structure to more closely reflect the source code the way humans see it. In PHP, the AST is
+represented by generic AST nodes with a ``kind`` field. There are "normal" nodes with a
+predetermined number of children, lists with an arbitrary number of children, and
+:doc:`../core/data-structures/zval` nodes that store some underlying primitive value, like a string.
+
+Here is a simplified example of what an AST from the tokens above might look like.
+
+.. code:: text
+
+   zend_ast_list {
+       kind: ZEND_AST_IF,
+       children: 1,
+       child: [
+           zend_ast {
+               kind: ZEND_AST_IF_ELEM,
+               child: [
+                   zend_ast {
+                       kind: ZEND_AST_VAR,
+                       child: [
+                           zend_ast_zval {
+                               kind: ZEND_AST_ZVAL,
+                               zval: "cond",
+                           },
+                       ],
+                   },
+                   zend_ast_list {
+                       kind: ZEND_AST_STMT_LIST,
+                       children: 1,
+                       child: [
+                           zend_ast {
+                               kind: ZEND_AST_ECHO,
+                               child: [
+                                   zend_ast_zval {
+                                       kind: ZEND_AST_ZVAL,
+                                       zval: "Cond is true\n",
+                                   },
+                               ],
+                           },
+                       ],
+                   },
+               ],
+           },
+       ],
+   }
+
+The nodes may also store additional flags in the ``attr`` field for various purposes depending on
+the node kind. They also store their original position in the source code in the ``lineno`` field.
+These fields are omitted in the example for brevity.
+
+Like with tokenization, we use a tool called ``Bison`` to generate the parser implementation from a
+grammar specification. The grammar lives in the ``Zend/zend_language_parser.y`` file. Check the
+`Bison documentation`_ for details. Luckily, the syntax is quite approachable.
+
+.. _bison documentation: https://www.gnu.org/software/bison/manual/
+
+*************
+ Compilation
+*************
+
+Computers don't understand human language, or even programming languages. They only understand
+machine code, which is a sequence of simple, mostly atomic instructions that each do one thing. For
+example, they may add two numbers, load some memory from RAM, jump to an instruction under a certain
+condition, etc. It turns out that even complex expressions can be reduced to a number of these
+simple instructions.
+
+PHP is a bit different, in that it does not execute machine code directly. Instead, instructions run
+on a "virtual machine", often abbreviated to VM. This is just a fancy way of saying that there is no
+physical machine that understands these instructions, but that this machine is implemented in
+software. This is our interpreter. This also means that we are free to make up instructions
+ourselves at will. Some of these instructions look very similar to something you'd find in an actual
+CPU instruction set (e.g. adding two numbers), while others are on a much higher level (e.g. load
+property of object by name).
+
+With that little detour out of the way, the job of the compiler is to read the AST and translate it
+into our virtual machine instructions, also called opcodes. This code lives in
+``Zend/zend_compile.c``. The compiler is invoked for each function in your program, and generates a
+list of opcodes.
+
+Here's what the opcodes for the AST above might look like:
+
+.. code:: text
+
+   0000 JMPZ CV0($cond) 0002
+   0001 ECHO string("Cond is true\n")
+   0002 RETURN int(1)
+
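+To make the translation step more concrete, here is a minimal sketch of a compiler for the example
+above, written in PHP rather than C. The array-based AST and opcode layouts are made up for
+illustration and do not match the real ``zend_ast`` and ``zend_op`` structures. It shows one common
+technique: emitting a jump before its target is known, and patching the target once the body has
+been compiled.
+
+.. code:: php
+
+   <?php
+
+   // Hypothetical, heavily simplified AST for:
+   // if ($cond) { echo "Cond is true\n"; }
+   $ast = [
+       'kind' => 'if',
+       'cond' => ['kind' => 'var', 'name' => 'cond'],
+       'body' => [
+           ['kind' => 'echo', 'value' => "Cond is true\n"],
+       ],
+   ];
+
+   function compile_stmt(array $node, array &$opcodes): void
+   {
+       switch ($node['kind']) {
+           case 'if':
+               // Emit the jump first; its target is unknown until the body is compiled.
+               $jmpz = count($opcodes);
+               $opcodes[] = ['JMPZ', '$' . $node['cond']['name'], null];
+               foreach ($node['body'] as $stmt) {
+                   compile_stmt($stmt, $opcodes);
+               }
+               // Patch the jump to point just past the body.
+               $opcodes[$jmpz][2] = count($opcodes);
+               break;
+           case 'echo':
+               $opcodes[] = ['ECHO', $node['value'], null];
+               break;
+           default:
+               throw new LogicException("Unsupported node kind: {$node['kind']}");
+       }
+   }
+
+   $opcodes = [];
+   compile_stmt($ast, $opcodes);
+   // Every function ends with an implicit return.
+   $opcodes[] = ['RETURN', 1, null];
+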
+*************
+ Interpreter
+*************
+
+Finally, the opcodes are read and executed by the interpreter. PHP uses `three-address code`_ for
+instructions. This essentially means that each instruction may have a result value, and at most two
+operands. Many modern CPUs use a similar format. Both result and operands in PHP are :doc:`zvals
+<../core/data-structures/zval>`.
+
+.. _three-address code: https://en.wikipedia.org/wiki/Three-address_code
+
+How exactly each opcode behaves depends on its purpose. You can find a complete list of opcodes in
+the generated ``Zend/zend_vm_opcodes.h`` file. The VM lives mostly in the ``Zend/zend_vm_def.h``
+file, which contains a custom DSL that is expanded by ``Zend/zend_vm_gen.php`` to generate the
+``Zend/zend_vm_execute.h`` file, containing the actual VM code.
+
+Let's step through the opcodes from the example above:
+
+-  We start at the top, i.e. ``JMPZ``. If its first operand contains a "falsy" value, it will
+   jump to the instruction encoded in its second operand. If it is truthy, it will simply
+   fall through to the next instruction.
+
+-  The ``ECHO`` instruction prints its first operand.
+
+-  The ``RETURN`` instruction terminates the current function.
+
+With these simple rules, we can see that the interpreter will ``echo`` only when ``$cond`` is
+truthy, and skip over the ``echo`` otherwise.
+
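+The rules above can be sketched as a small dispatch loop. This is a toy model written in PHP, not
+the code generated into ``Zend/zend_vm_execute.h``. The opcode layout (name, first operand, second
+operand) and the ``execute()`` function are assumptions made for illustration.
+
+.. code:: php
+
+   <?php
+
+   // Hypothetical opcode layout: [name, operand 1, operand 2].
+   $opcodes = [
+       ['JMPZ', 'cond', 2],
+       ['ECHO', "Cond is true\n", null],
+       ['RETURN', 1, null],
+   ];
+
+   function execute(array $opcodes, array $vars): mixed
+   {
+       $ip = 0; // index of the instruction to execute next
+       while (true) {
+           [$name, $op1, $op2] = $opcodes[$ip];
+           switch ($name) {
+               case 'JMPZ':
+                   // Jump if the variable is falsy, otherwise fall through.
+                   $ip = $vars[$op1] ? $ip + 1 : $op2;
+                   break;
+               case 'ECHO':
+                   echo $op1;
+                   $ip++;
+                   break;
+               case 'RETURN':
+                   return $op1;
+               default:
+                   throw new LogicException("Unknown opcode: $name");
+           }
+       }
+   }
+
+   execute($opcodes, ['cond' => true]);  // prints "Cond is true"
+   execute($opcodes, ['cond' => false]); // prints nothing
+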
+That's it! This is how PHP works, fundamentally. Of course, PHP consists of many more opcodes. The
+VM is quite complex, and will be discussed separately in the `virtual machine <todo>`__ chapter.
+
+*********
+ Opcache
+*********
+
+As you may imagine, running this whole pipeline every time PHP serves a request is time-consuming.
+Luckily, it is also not necessary. We can cache the opcodes in memory between requests. When a file
+is included, we can check in the cache whether the file is already there, and verify via timestamp
+that it has not been modified since it was compiled. If it has not, we may reuse the opcodes from
+the cache. This dramatically speeds up the execution of PHP programs. This is precisely what the
+opcache extension does. It lives in the ``ext/opcache`` directory.
+
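+The lookup described above can be sketched as follows. This is only a model of the idea: the real
+opcache stores compiled scripts in shared memory rather than in a PHP array, and its validation
+logic is more involved. The ``load_opcodes()`` and ``compile_file()`` functions are hypothetical
+names, and ``compile_file()`` is a stand-in for the tokenizer, parser and compiler stages.
+
+.. code:: php
+
+   <?php
+
+   // Hypothetical in-memory cache: path => ['mtime' => int, 'opcodes' => array].
+   $cache = [];
+
+   // Stand-in for the tokenizer, parser and compiler stages.
+   function compile_file(string $path): array
+   {
+       return [['RETURN', 1, null]];
+   }
+
+   function load_opcodes(string $path, array &$cache): array
+   {
+       $mtime = filemtime($path);
+       if (isset($cache[$path]) && $cache[$path]['mtime'] === $mtime) {
+           // Cache hit: the file is unchanged, so skip the whole pipeline.
+           return $cache[$path]['opcodes'];
+       }
+       // Cache miss or stale entry: compile and remember the result.
+       $opcodes = compile_file($path);
+       $cache[$path] = ['mtime' => $mtime, 'opcodes' => $opcodes];
+       return $opcodes;
+   }
+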
+Opcache also performs some optimizations on the opcodes before caching them. As cached opcodes are
+expected to be reused many times, it is profitable to spend some additional time simplifying them if
+possible to improve performance during execution. The optimizer lives in ``Zend/Optimizer``.
+
+JIT
+===
+
+The opcache also implements a JIT compiler, which stands for just-in-time compiler. This compiler
+takes the virtual PHP opcodes and turns them into actual machine instructions, with additional
+information gained at runtime. JITs are very complex pieces of software, so this book will likely
+barely scratch the surface of how the JIT works. It lives in ``ext/opcache/jit``.