| 1 | +##################### |
| 2 | + High-level overview |
| 3 | +##################### |
| 4 | + |
| 5 | +PHP is an interpreted language. Interpreted languages differ from compiled ones in that they aren't |
| 6 | +compiled into machine-readable code ahead of time. Instead, the source files are read, processed and |
| 7 | +interpreted when the program is executed. This can be very convenient for developers for rapid |
| 8 | +prototyping, as it skips a lengthy compilation phase. However, it also poses some unique challenges |
| 9 | +to performance, which is one of the primary reasons interpreters can be complex. php-src borrows |
| 10 | +many concepts from compilers and other interpreters. |
| 11 | + |
| 12 | +********** |
| 13 | + Concepts |
| 14 | +********** |
| 15 | + |
| 16 | +The goal of the interpreter is to read the user's source files from disk and to carry out the user's |
| 17 | +intent. This process can be split into distinct phases that are easier to understand and implement. |
| 18 | + |
| 19 | +- Tokenization - splitting whole source files into words, called tokens. |
| 20 | +- Parsing - building a tree structure from tokens, called an AST (abstract syntax tree). |
| 21 | +- Compilation - turning the tree structure into a list of operations, called opcodes. |
| 22 | +- Interpretation - reading and executing opcodes. |
| 23 | + |
| 24 | +php-src as a whole can be seen as a pipeline consisting of these stages. |
| 25 | + |
| 26 | +.. code:: haskell |
| 27 | + |
| 28 | +   source_code |
| 29 | +   |> tokenizer   -- tokens |
| 30 | +   |> parser      -- ast |
| 31 | +   |> compiler    -- opcodes |
| 32 | +   |> interpreter |
| 33 | + |
| 34 | +Let's go into these phases in a bit more detail. |
| 35 | + |
| 36 | +************** |
| 37 | + Tokenization |
| 38 | +************** |
| 39 | + |
| 40 | +Tokenization, often called "lexing" or "scanning", is the process of taking an entire program file |
| 41 | +and splitting it into a list of words and symbols. A token generally consists of a type (a simple |
| 42 | +integer constant identifying the kind of token) and a lexeme (the literal string used in the source code). |
| 43 | + |
| 44 | +.. code:: php |
| 45 | + |
| 46 | +   if ($cond) { |
| 47 | +       echo "Cond is true\n"; |
| 48 | +   } |
| 49 | + |
| 50 | +.. code:: text |
| 51 | + |
| 52 | +   T_IF                       "if" |
| 53 | +   T_WHITESPACE               " " |
| 54 | +                              "(" |
| 55 | +   T_VARIABLE                 "$cond" |
| 56 | +                              ")" |
| 57 | +   T_WHITESPACE               " " |
| 58 | +                              "{" |
| 59 | +   T_WHITESPACE               "\n    " |
| 60 | +   T_ECHO                     "echo" |
| 61 | +   T_WHITESPACE               " " |
| 62 | +   T_CONSTANT_ENCAPSED_STRING '"Cond is true\n"' |
| 63 | +                              ";" |
| 64 | +   T_WHITESPACE               "\n" |
| 65 | +                              "}" |
| 66 | + |
| 67 | +While tokenizers are not difficult to write by hand, PHP uses a tool called ``re2c`` to automate |
| 68 | +this process. It takes a definition file and generates efficient C code to build these tokens from a |
| 69 | +stream of characters. The definition for PHP lives in ``Zend/zend_language_scanner.l``. Check the |
| 70 | +`re2c documentation`_ for details. |
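
To get a feel for the token stream, we can ask PHP itself. The following is a minimal sketch that
uses the ``token_get_all()`` and ``token_name()`` functions of the tokenizer extension. It assumes
the extension is enabled (it usually is), and the column widths are arbitrary. Note that the snippet
needs an opening ``<?php`` tag, so the output starts with an additional ``T_OPEN_TAG`` token.

.. code:: php

   <?php
   // Source to tokenize. The opening tag is required, otherwise
   // the whole string is treated as inline HTML.
   $source = <<<'PHP'
   <?php if ($cond) {
       echo "Cond is true\n";
   }
   PHP;

   foreach (token_get_all($source) as $token) {
       if (is_array($token)) {
           // Array tokens have the shape [token id, lexeme, line number].
           [$id, $lexeme] = $token;
           printf("%-28s %s\n", token_name($id), json_encode($lexeme));
       } else {
           // Single-character tokens such as "(" or ";" are plain strings.
           printf("%-28s %s\n", '', json_encode($token));
       }
   }
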
| 71 | + |
| 72 | +.. _re2c documentation: https://re2c.org/ |
| 73 | + |
| 74 | +********* |
| 75 | + Parsing |
| 76 | +********* |
| 77 | + |
| 78 | +Parsing is the process of reading the tokens generated by the tokenizer and building a tree |
| 79 | +structure from them. To humans, nesting seems obvious when looking at source code, given indentation |
| 80 | +through whitespace and the usage of symbols like ``()`` and ``{}``. The tokens are transformed into |
| 81 | +a tree structure to more closely reflect the source code the way humans see it. In PHP, the AST is |
| 82 | +represented by generic AST nodes with a ``kind`` field. There are "normal" nodes with a |
| 83 | +predetermined number of children, lists with an arbitrary number of children, and |
| 84 | +:doc:`../core/data-structures/zval` nodes that store some underlying primitive value, like a string. |
| 85 | + |
| 86 | +Here is a simplified example of what an AST from the tokens above might look like. |
| 87 | + |
| 88 | +.. code:: text |
| 89 | + |
| 90 | +   zend_ast_list { |
| 91 | +       kind: ZEND_AST_IF, |
| 92 | +       children: 1, |
| 93 | +       child: [ |
| 94 | +           zend_ast { |
| 95 | +               kind: ZEND_AST_IF_ELEM, |
| 96 | +               child: [ |
| 97 | +                   zend_ast { |
| 98 | +                       kind: ZEND_AST_VAR, |
| 99 | +                       child: [ |
| 100 | +                           zend_ast_zval { |
| 101 | +                               kind: ZEND_AST_ZVAL, |
| 102 | +                               zval: "cond", |
| 103 | +                           }, |
| 104 | +                       ], |
| 105 | +                   }, |
| 106 | +                   zend_ast_list { |
| 107 | +                       kind: ZEND_AST_STMT_LIST, |
| 108 | +                       children: 1, |
| 109 | +                       child: [ |
| 110 | +                           zend_ast { |
| 111 | +                               kind: ZEND_AST_ECHO, |
| 112 | +                               child: [ |
| 113 | +                                   zend_ast_zval { |
| 114 | +                                       kind: ZEND_AST_ZVAL, |
| 115 | +                                       zval: "Cond is true\n", |
| 116 | +                                   }, |
| 117 | +                               ], |
| 118 | +                           }, |
| 119 | +                       ], |
| 120 | +                   }, |
| 121 | +               ], |
| 122 | +           }, |
| 123 | +       ], |
| 124 | +   } |
| 125 | + |
| 126 | +The nodes may also store additional flags in the ``attr`` field for various purposes depending on |
| 127 | +the node kind. They also store their original position in the source code in the ``lineno`` field. |
| 128 | +These fields are omitted in the example for brevity. |
| 129 | + |
| 130 | +As with tokenization, we use a tool called ``Bison`` to generate the parser implementation from a |
| 131 | +grammar specification. The grammar lives in the ``Zend/zend_language_parser.y`` file. Check the |
| 132 | +`Bison documentation`_ for details. Luckily, the syntax is quite approachable. |
| 133 | + |
| 134 | +.. _bison documentation: https://www.gnu.org/software/bison/manual/ |
| 135 | + |
| 136 | +************* |
| 137 | + Compilation |
| 138 | +************* |
| 139 | + |
| 140 | +Computers don't understand human language, or even programming languages. They only understand |
| 141 | +machine code, which is a sequence of simple, mostly atomic instructions that each do one thing. For |
| 142 | +example, they may add two numbers, load some memory from RAM, jump to an instruction under a certain |
| 143 | +condition, etc. It turns out that even complex expressions can be reduced to a number of these |
| 144 | +simple instructions. |
| 145 | + |
| 146 | +PHP is a bit different, in that it does not execute machine code directly. Instead, instructions run |
| 147 | +on a "virtual machine", often abbreviated to VM. This is just a fancy way of saying that there is no |
| 148 | +physical machine that understands these instructions, but that this machine is implemented in |
| 149 | +software. This is our interpreter. This also means that we are free to make up instructions |
| 150 | +ourselves at will. Some of these instructions look very similar to something you'd find in an actual |
| 151 | +CPU instruction set (e.g. adding two numbers), while others are on a much higher level (e.g. load |
| 152 | +property of object by name). |
| 153 | + |
| 154 | +With that little detour out of the way, the job of the compiler is to read the AST and translate it |
| 155 | +into our virtual machine instructions, also called opcodes. This code lives in |
| 156 | +``Zend/zend_compile.c``. The compiler is invoked for each function in your program, and generates a |
| 157 | +list of opcodes. |
| 158 | + |
| 159 | +Here's what the opcodes for the AST above might look like: |
| 160 | + |
| 161 | +.. code:: text |
| 162 | + |
| 163 | +   0000 JMPZ CV0($cond) 0002 |
| 164 | +   0001 ECHO string("Cond is true\n") |
| 165 | +   0002 RETURN int(1) |
| 166 | + |
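
The translation from a tree to a flat list can be illustrated with a deliberately tiny sketch. The
nested-array AST, the ``compileNode()`` function and the ``[name, op1, op2]`` opcode layout are all
made up for this example and do not correspond to the actual structures in ``Zend/zend_compile.c``.
The sketch only shows the general idea: emit a conditional jump with a placeholder target, emit the
body, then patch the jump target once the position after the body is known.

.. code:: php

   <?php
   // Hypothetical mini-AST for: if ($cond) { echo "Cond is true\n"; }
   $ast = [
       'kind' => 'if',
       'cond' => ['kind' => 'var', 'name' => 'cond'],
       'body' => [
           ['kind' => 'echo', 'value' => "Cond is true\n"],
       ],
   ];

   // Appends the opcodes for $node to $ops. Opcodes are [name, op1, op2].
   function compileNode(array $node, array &$ops): void
   {
       switch ($node['kind']) {
           case 'echo':
               $ops[] = ['ECHO', $node['value'], null];
               break;
           case 'if':
               // Emit the jump first; its target is not known yet.
               $jmpz = count($ops);
               $ops[] = ['JMPZ', $node['cond']['name'], null];
               foreach ($node['body'] as $stmt) {
                   compileNode($stmt, $ops);
               }
               // Patch the jump so that it skips over the body.
               $ops[$jmpz][2] = count($ops);
               break;
           default:
               throw new InvalidArgumentException("Unknown node kind {$node['kind']}");
       }
   }

   $ops = [];
   compileNode($ast, $ops);
   // Every opcode list ends with a return in this sketch.
   $ops[] = ['RETURN', 1, null];

   foreach ($ops as $i => [$name, $op1, $op2]) {
       printf("%04d %s %s %s\n", $i, $name, json_encode($op1), json_encode($op2));
   }
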
| 167 | +************* |
| 168 | + Interpreter |
| 169 | +************* |
| 170 | + |
| 171 | +Finally, the opcodes are read and executed by the interpreter. PHP uses `three-address code`_ for |
| 172 | +instructions. This essentially means that each instruction may have a result value and at most two |
| 173 | +operands. Many CPU instruction sets also use this format. Both result and operands in PHP are |
| 174 | +:doc:`zvals <../core/data-structures/zval>`. |
| 175 | + |
| 176 | +.. _three-address code: https://en.wikipedia.org/wiki/Three-address_code |
| 177 | + |
| 178 | +How exactly each opcode behaves depends on its purpose. You can find a complete list of opcodes in |
| 179 | +the generated ``Zend/zend_vm_opcodes.h`` file. The VM lives mostly in the ``Zend/zend_vm_def.h`` |
| 180 | +file, which contains a custom DSL that is expanded by ``Zend/zend_vm_gen.php`` to generate the |
| 181 | +``Zend/zend_vm_execute.h`` file, containing the actual VM code. |
| 182 | + |
| 183 | +Let's step through the opcodes from the example above: |
| 184 | + |
| 185 | +- We start at the top, i.e. ``JMPZ``. If its first operand contains a "falsy" value, it will |
| 186 | +  jump to the instruction encoded in its second operand. If it is truthy, it will simply |
| 187 | +  fall through to the next instruction. |
| 188 | + |
| 189 | +- The ``ECHO`` instruction prints its first operand. |
| 190 | + |
| 191 | +- The ``RETURN`` instruction terminates the current function. |
| 192 | + |
| 193 | +With these simple rules, we can see that the interpreter will ``echo`` only when ``$cond`` is |
| 194 | +truthy, and skip over the ``echo`` otherwise. |
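
These rules are simple enough to sketch as a toy interpreter loop. The ``run()`` function and the
``[name, op1, op2]`` opcode layout below are hypothetical and only mirror the three instructions
from the example. The actual VM generated into ``Zend/zend_vm_execute.h`` is structured very
differently. To keep the sketch easy to check, output is collected into a string and returned
instead of being printed directly.

.. code:: php

   <?php
   // Executes a list of [name, op1, op2] opcodes and returns the collected output.
   // $vars maps variable names to their values.
   function run(array $ops, array $vars): string
   {
       $out = '';
       $ip = 0; // Index of the instruction to execute next.
       while (true) {
           [$name, $op1, $op2] = $ops[$ip];
           switch ($name) {
               case 'JMPZ':
                   // Jump to $op2 if the variable is falsy, otherwise fall through.
                   $ip = empty($vars[$op1]) ? $op2 : $ip + 1;
                   break;
               case 'ECHO':
                   $out .= $op1;
                   $ip++;
                   break;
               case 'RETURN':
                   return $out;
               default:
                   throw new RuntimeException("Unknown opcode $name");
           }
       }
   }

   $ops = [
       ['JMPZ', 'cond', 2],
       ['ECHO', "Cond is true\n", null],
       ['RETURN', 1, null],
   ];

   echo run($ops, ['cond' => true]);  // Executes the ECHO instruction.
   echo run($ops, ['cond' => false]); // Jumps straight to RETURN.
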
| 195 | + |
| 196 | +That's it! This is how PHP works, fundamentally. Of course, PHP consists of many more opcodes. The |
| 197 | +VM is quite complex, and will be discussed separately in the `virtual machine <todo>`__ chapter. |
| 198 | + |
| 199 | +********* |
| 200 | + Opcache |
| 201 | +********* |
| 202 | + |
| 203 | +As you may imagine, running this whole pipeline every time PHP serves a request is time-consuming. |
| 204 | +Luckily, it is also not necessary. We can cache the opcodes in memory between requests. When a file |
| 205 | +is included, we can check whether it is already in the cache, and verify via timestamp that it has |
| 206 | +not been modified since it was compiled. If it has not, we may reuse the opcodes from cache. |
| 207 | +This dramatically speeds up the execution of PHP programs. This is precisely what the opcache |
| 208 | +extension does. It lives in the ``ext/opcache`` directory. |
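
The timestamp check can be sketched in a few lines. ``ToyOpcodeCache`` and its ``get()`` method are
hypothetical names invented for this illustration. The cache here is a plain array that lives only
for the duration of the script, whereas the real extension has to keep its cache alive between
requests, so this shows the idea rather than the implementation.

.. code:: php

   <?php
   final class ToyOpcodeCache
   {
       /** @var array<string, array{mtime: int|false, ops: array}> */
       private array $entries = [];

       /** Number of times the compile callback was actually invoked. */
       public int $compilations = 0;

       // Returns the opcodes for $file, compiling only on a miss or after a change.
       public function get(string $file, callable $compile): array
       {
           // filemtime() results are cached per request, so clear them first.
           clearstatcache(true, $file);
           $mtime = filemtime($file);

           $entry = $this->entries[$file] ?? null;
           if ($entry !== null && $entry['mtime'] === $mtime) {
               return $entry['ops']; // Unchanged since compilation: reuse.
           }

           $this->compilations++;
           $ops = $compile(file_get_contents($file));
           $this->entries[$file] = ['mtime' => $mtime, 'ops' => $ops];
           return $ops;
       }
   }

   // Usage: a stand-in "compiler" that wraps the source in a single opcode.
   $compile = fn(string $source): array => [['ECHO', $source, null]];
   $file = tempnam(sys_get_temp_dir(), 'toy');
   file_put_contents($file, 'hello');

   $cache = new ToyOpcodeCache();
   $cache->get($file, $compile); // Miss: compiles.
   $cache->get($file, $compile); // Hit: reuses the cached opcodes.
   printf("compilations: %d\n", $cache->compilations);
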
| 209 | + |
| 210 | +Opcache also performs some optimizations on the opcodes before caching them. As cached opcodes are |
| 211 | +expected to be reused many times, it is profitable to spend some additional time simplifying them if |
| 212 | +possible to improve performance during execution. The optimizer lives in ``Zend/Optimizer``. |
| 213 | + |
| 214 | +JIT |
| 215 | +=== |
| 216 | + |
| 217 | +The opcache also implements a JIT compiler, which stands for just-in-time compiler. This compiler |
| 218 | +takes the virtual PHP opcodes and turns them into actual machine instructions, with additional |
| 219 | +information gained at runtime. JITs are very complex pieces of software, so this book will likely |
| 220 | +barely scratch the surface of how they work. It lives in ``ext/opcache/jit``. |