Commit d8ea490

Add high-level overview chapter
1 parent a173e0c commit d8ea490

2 files changed

+226
-0
lines changed

docs/source/index.rst

Lines changed: 6 additions & 0 deletions
@@ -2,6 +2,12 @@
22
php-src docs
33
##############
44

5+
.. toctree::
6+
:caption: Introduction
7+
:hidden:
8+
9+
introduction/high-level-overview
10+
511
.. toctree::
612
:caption: Core
713
:hidden:
Lines changed: 220 additions & 0 deletions
@@ -0,0 +1,220 @@
1+
#####################
2+
High-level overview
3+
#####################
4+
5+
PHP is an interpreted language. Interpreted languages differ from compiled ones in that they aren't
6+
compiled into machine-readable code ahead of time. Instead, the source files are read, processed and
7+
interpreted when the program is executed. This can be very convenient for developers for rapid
8+
prototyping, as it skips a lengthy compilation phase. However, it also poses some unique challenges
9+
to performance, which is one of the primary reasons interpreters can be complex. php-src borrows
10+
many concepts from compilers and other interpreters.
11+
12+
**********
13+
Concepts
14+
**********
15+
16+
The goal of the interpreter is to read the user's source files from disk, and to simulate the user's
17+
intent. This process can be split into distinct phases that are easier to understand and implement.
18+
19+
- Tokenization - splitting whole source files into words, called tokens.
20+
- Parsing - building a tree structure from tokens, called AST (abstract syntax tree).
21+
- Compilation - turning the tree structure into a list of operations, called opcodes.
22+
- Interpretation - reading and executing opcodes.
23+
24+
php-src as a whole can be seen as a pipeline consisting of these stages.
25+
26+
.. code:: haskell
27+
28+
source_code
29+
|> tokenizer -- tokens
30+
|> parser -- ast
31+
|> compiler -- opcodes
32+
|> interpreter
33+
34+
Let's go into these phases in a bit more detail.
35+
36+
**************
37+
Tokenization
38+
**************
39+
40+
Tokenization, often called "lexing" or "scanning", is the process of taking an entire program file
41+
and splitting it into a list of words and symbols. Tokens generally consist of a type (a simple
42+
integer constant representing the token) and a lexeme (the literal string used in the source code).
43+
44+
.. code:: php
45+
46+
if ($cond) {
47+
echo "Cond is true\n";
48+
}
49+
50+
.. code:: text
51+
52+
T_IF "if"
53+
T_WHITESPACE " "
54+
"("
55+
T_VARIABLE "$cond"
56+
")"
57+
T_WHITESPACE " "
58+
"{"
59+
T_WHITESPACE "\n "
60+
T_ECHO "echo"
61+
T_WHITESPACE " "
62+
T_CONSTANT_ENCAPSED_STRING '"Cond is true\n"'
63+
";"
64+
T_WHITESPACE "\n"
65+
"}"
66+
67+
While tokenizers are not difficult to write by hand, PHP uses a tool called ``re2c`` to automate
68+
this process. It takes a definition file and generates efficient C code to build these tokens from a
69+
stream of characters. The definition for PHP lives in ``Zend/zend_language_scanner.l``. Check the
70+
`re2c documentation`_ for details.
71+
72+
.. _re2c documentation: https://re2c.org/
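The token list shown earlier in this section can be approximated from userland. The following is a minimal sketch, assuming PHP 8.0 or later, where the built-in ``PhpToken`` class is available. The ``$code`` and ``$names`` variables are only illustrative. Unlike the listing above, the snippet must start with an opening tag, so the first token will be ``T_OPEN_TAG``.

```php
<?php
// Source to tokenize. A nowdoc keeps "$cond" and "\n" as literal text.
$code = <<<'CODE'
<?php
if ($cond) {
    echo "Cond is true\n";
}
CODE;

$names = [];
foreach (PhpToken::tokenize($code) as $token) {
    // getTokenName() returns a name such as "T_IF", or the character
    // itself for single-character tokens such as "(" and ";".
    $names[] = $token->getTokenName();
    // json_encode() makes whitespace and newlines in the lexeme visible.
    printf("%-28s %s\n", $token->getTokenName(), json_encode($token->text));
}
```

Each ``PhpToken`` also exposes the integer type as ``$token->id`` and the source line as ``$token->line``, which mirrors the type and lexeme pair described above.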
73+
74+
*********
75+
Parsing
76+
*********
77+
78+
Parsing is the process of reading the tokens generated by the tokenizer and building a tree
79+
structure from them. To humans, nesting seems obvious when looking at source code, given indentation
80+
through whitespace and the usage of symbols like ``()`` and ``{}``. The tokens are transformed into
81+
a tree structure to more closely reflect the source code the way humans see it. In PHP, the AST is
82+
represented by generic AST nodes with a ``kind`` field. There are "normal" nodes with a
83+
predetermined number of children, lists with an arbitrary number of children, and
84+
:doc:`../core/data-structures/zval` nodes that store some underlying primitive value, like a string.
85+
86+
Here is a simplified example of what an AST from the tokens above might look like.
87+
88+
.. code:: text
89+
90+
zend_ast_list {
91+
kind: ZEND_AST_IF,
92+
children: 1,
93+
child: [
94+
zend_ast {
95+
kind: ZEND_AST_IF_ELEM,
96+
child: [
97+
zend_ast {
98+
kind: ZEND_AST_VAR,
99+
child: [
100+
zend_ast_zval {
101+
kind: ZEND_AST_ZVAL,
102+
zval: "cond",
103+
},
104+
],
105+
},
106+
zend_ast_list {
107+
kind: ZEND_AST_STMT_LIST,
108+
children: 1,
109+
child: [
110+
zend_ast {
111+
kind: ZEND_AST_ECHO,
112+
child: [
113+
zend_ast_zval {
114+
kind: ZEND_AST_ZVAL,
115+
zval: "Cond is true\n",
116+
},
117+
],
118+
},
119+
],
120+
},
121+
],
122+
},
123+
],
124+
}
125+
126+
The nodes may also store additional flags in the ``attr`` field for various purposes depending on
127+
the node kind. They also store their original position in the source code in the ``lineno`` field.
128+
These fields are omitted in the example for brevity.
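To make the node layout more tangible, here is a rough userland sketch of the tree above, including the ``attr`` and ``lineno`` fields. It assumes PHP 8.0 or later. The ``node()``, ``zvalNode()`` and ``kinds()`` helpers and the array layout are hypothetical, as are the line numbers. php-src itself uses C structs, and this sketch does not distinguish lists from "normal" nodes.

```php
<?php
// Hypothetical helper: a node with a kind, flags, a line number and children.
function node(string $kind, array $children, int $lineno = 0, int $attr = 0): array {
    return ['kind' => $kind, 'attr' => $attr, 'lineno' => $lineno, 'children' => $children];
}

// Hypothetical helper: a leaf node wrapping a primitive value.
function zvalNode(mixed $value, int $lineno = 0): array {
    return ['kind' => 'ZEND_AST_ZVAL', 'attr' => 0, 'lineno' => $lineno, 'zval' => $value];
}

$ast = node('ZEND_AST_IF', [
    node('ZEND_AST_IF_ELEM', [
        node('ZEND_AST_VAR', [zvalNode('cond', 1)], 1),
        node('ZEND_AST_STMT_LIST', [
            node('ZEND_AST_ECHO', [zvalNode("Cond is true\n", 2)], 2),
        ], 1),
    ], 1),
], 1);

// Depth-first walk that collects node kinds in visiting order.
function kinds(array $node): array {
    $out = [$node['kind']];
    foreach ($node['children'] ?? [] as $child) {
        $out = array_merge($out, kinds($child));
    }
    return $out;
}

echo implode("\n", kinds($ast)), "\n";
```

Walking the tree depth-first like this is, in spirit, what the later compilation stage does when it visits each node.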
129+
130+
As with tokenization, we use a tool called ``Bison`` to generate the parser implementation from a
131+
grammar specification. The grammar lives in the ``Zend/zend_language_parser.y`` file. Check the
132+
`Bison documentation`_ for details. Luckily, the syntax is quite approachable.
133+
134+
.. _bison documentation: https://www.gnu.org/software/bison/manual/
135+
136+
*************
137+
Compilation
138+
*************
139+
140+
Computers don't understand human language, or even programming languages. They only understand
141+
machine code, which is a sequence of simple, mostly atomic instructions that each do one thing. For
142+
example, they may add two numbers, load some memory from RAM, jump to an instruction under a certain
143+
condition, etc. It turns out that even complex expressions can be reduced to a number of these
144+
simple instructions.
145+
146+
PHP is a bit different, in that it does not execute machine code directly. Instead, instructions run
147+
on a "virtual machine", often abbreviated to VM. This is just a fancy way of saying that there is no
148+
physical machine that understands these instructions, but that this machine is implemented in
149+
software. This is our interpreter. This also means that we are free to make up instructions
150+
ourselves at will. Some of these instructions look very similar to something you'd find in an actual
151+
CPU instruction set (e.g. adding two numbers), while others are on a much higher level (e.g. load
152+
property of object by name).
153+
154+
With that little detour out of the way, the job of the compiler is to read the AST and translate it
155+
into our virtual machine instructions, also called opcodes. This code lives in
156+
``Zend/zend_compile.c``. The compiler is invoked for each function in your program, and generates a
157+
list of opcodes.
158+
159+
Here's what the opcodes for the AST above might look like:
160+
161+
.. code:: text
162+
163+
0000 JMPZ CV0($cond) 0002
164+
0001 ECHO string("Cond is true\n")
165+
0002 RETURN int(1)
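The translation from tree to opcodes can be sketched in a few lines. This is a toy compiler, not the logic in ``Zend/zend_compile.c``: the node kinds (``STMT_LIST``, ``IF``, ``ECHO``), the array-based opcode format and the ``compileNode()`` and ``compile()`` functions are all assumptions made for illustration. It does show one real technique, namely emitting a jump before its target is known and patching the target afterwards.

```php
<?php
// Hypothetical, heavily simplified compiler: AST arrays in, opcode arrays out.
function compileNode(array $node, array &$ops): void {
    switch ($node['kind']) {
        case 'STMT_LIST':
            foreach ($node['children'] as $child) {
                compileNode($child, $ops);
            }
            break;
        case 'IF':
            // Emit the conditional jump first; its target is not known yet.
            $jmp = count($ops);
            $ops[] = ['op' => 'JMPZ', 'op1' => $node['cond'], 'op2' => null];
            compileNode($node['body'], $ops);
            // Back-patch: jump to the first instruction after the body.
            $ops[$jmp]['op2'] = count($ops);
            break;
        case 'ECHO':
            $ops[] = ['op' => 'ECHO', 'op1' => $node['value'], 'op2' => null];
            break;
        default:
            throw new LogicException("Unknown node kind: {$node['kind']}");
    }
}

function compile(array $ast): array {
    $ops = [];
    compileNode($ast, $ops);
    // Every compiled unit ends with an implicit return in this sketch.
    $ops[] = ['op' => 'RETURN', 'op1' => ['const' => 1], 'op2' => null];
    return $ops;
}

$ast = ['kind' => 'STMT_LIST', 'children' => [
    ['kind' => 'IF',
     'cond' => ['var' => 'cond'],
     'body' => ['kind' => 'STMT_LIST', 'children' => [
         ['kind' => 'ECHO', 'value' => ['const' => "Cond is true\n"]],
     ]]],
]];

$ops = compile($ast);
foreach ($ops as $i => $op) {
    printf("%04d %s\n", $i, $op['op']);
}
```

The resulting list has the same shape as the listing above: a ``JMPZ`` whose second operand points at index 2, an ``ECHO``, and a final ``RETURN``.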
166+
167+
*************
168+
Interpreter
169+
*************
170+
171+
Finally, the opcodes are read and executed by the interpreter. PHP uses `three-address code`_ for
172+
instructions. This essentially means that each instruction may have a result value, and at most two
173+
operands. Many modern CPUs also use this format. Both result and operands in PHP are :doc:`zvals
174+
<../core/data-structures/zval>`.
175+
176+
.. _three-address code: https://en.wikipedia.org/wiki/Three-address_code
177+
178+
How exactly each opcode behaves depends on its purpose. You can find a complete list of opcodes in
179+
the generated ``Zend/zend_vm_opcodes.h`` file. The VM lives mostly in the ``Zend/zend_vm_def.h``
180+
file, which contains a custom DSL that is expanded by ``Zend/zend_vm_gen.php`` to generate the
181+
``Zend/zend_vm_execute.h`` file, containing the actual VM code.
182+
183+
Let's step through the opcodes from the example above:
184+
185+
- We start at the top, i.e. ``JMPZ``. If its first operand contains a "falsy" value, it will
186+
jump to the instruction encoded in its second operand. If it is truthy, it will simply
187+
fall through to the next instruction.
188+
189+
- The ``ECHO`` instruction prints its first operand.
190+
191+
- The ``RETURN`` instruction terminates the current function.
192+
193+
With these simple rules, we can see that the interpreter will ``echo`` only when ``$cond`` is
194+
truthy, and skip over the ``echo`` otherwise.
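The three rules above fit into a small dispatch loop. The following is only a sketch of the idea, not the generated code in ``Zend/zend_vm_execute.h``. The opcode format (``[name, op1, op2]``), the operand encoding and the ``run()`` function are assumptions. Output is collected into a string instead of being printed directly, which makes the behaviour easy to check.

```php
<?php
// Hypothetical opcode format: [name, op1, op2]. Operands are either
// ['var' => name] (a variable lookup) or ['const' => value].
function run(array $ops, array $vars): string {
    $fetch = fn(array $operand) => array_key_exists('var', $operand)
        ? ($vars[$operand['var']] ?? null)
        : $operand['const'];

    $out = '';
    $ip = 0; // instruction pointer: index of the opcode to execute next
    while (true) {
        [$name, $op1, $op2] = $ops[$ip];
        switch ($name) {
            case 'JMPZ':
                // Jump to op2 when op1 is falsy, otherwise fall through.
                $ip = $fetch($op1) ? $ip + 1 : $op2;
                break;
            case 'ECHO':
                $out .= $fetch($op1);
                $ip++;
                break;
            case 'RETURN':
                return $out;
            default:
                throw new LogicException("Unknown opcode: $name");
        }
    }
}

$ops = [
    ['JMPZ', ['var' => 'cond'], 2],
    ['ECHO', ['const' => "Cond is true\n"], null],
    ['RETURN', ['const' => 1], null],
];

echo run($ops, ['cond' => true]);  // takes the fall-through path
echo run($ops, ['cond' => false]); // jumps straight to RETURN
```

Only the first call produces output, since the second one skips the ``ECHO`` instruction entirely.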
195+
196+
That's it! This is how PHP works, fundamentally. Of course, PHP consists of many more opcodes. The
197+
VM is quite complex, and will be discussed separately in the `virtual machine <todo>`__ chapter.
198+
199+
*********
200+
Opcache
201+
*********
202+
203+
As you may imagine, running this whole pipeline every time PHP serves a request is time-consuming.
204+
Luckily, it is also not necessary. We can cache the opcodes in memory between requests. When a file
205+
is included, we can check in the cache whether the file is already there, and verify via timestamp that it
206+
has not been modified since it was compiled. If it has not, we may reuse the opcodes from cache.
207+
This dramatically speeds up the execution of PHP programs. This is precisely what the opcache
208+
extension does. It lives in the ``ext/opcache`` directory.
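The timestamp check can be illustrated with a toy cache. This is a sketch under simplifying assumptions, not how ``ext/opcache`` is implemented: the ``ToyOpcodeCache`` class is hypothetical, it lives in ordinary per-process memory rather than memory shared between requests, and the ``$compile`` callback merely stands in for the tokenizer, parser and compiler. It assumes PHP 8.0 or later.

```php
<?php
// Hypothetical in-memory cache keyed by file path.
final class ToyOpcodeCache {
    /** @var array<string, array{mtime: int, opcodes: mixed}> */
    private array $entries = [];
    public int $compilations = 0;

    public function __construct(private Closure $compile) {}

    public function get(string $path): mixed {
        clearstatcache(true, $path); // make sure filemtime() is fresh
        $mtime = filemtime($path);
        $entry = $this->entries[$path] ?? null;
        if ($entry !== null && $entry['mtime'] === $mtime) {
            return $entry['opcodes']; // hit: file unchanged since compilation
        }
        // Miss, or the file was modified: compile and remember the result.
        $this->compilations++;
        $opcodes = ($this->compile)($path);
        $this->entries[$path] = ['mtime' => $mtime, 'opcodes' => $opcodes];
        return $opcodes;
    }
}

$path = tempnam(sys_get_temp_dir(), 'toy');
file_put_contents($path, '<?php echo 1;');

// A checksum plays the role of the compiled opcodes in this sketch.
$cache = new ToyOpcodeCache(fn(string $p) => md5_file($p));
$cache->get($path);          // first request: compiles
$cache->get($path);          // second request: served from the cache
touch($path, time() + 10);   // simulate a later modification
$cache->get($path);          // timestamp differs, so it compiles again
echo $cache->compilations, "\n";
unlink($path);
```

Three lookups result in only two compilations, which is the saving the cache is after.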
209+
210+
Opcache also performs some optimizations on the opcodes before caching them. As cached opcodes are
211+
expected to be reused many times, it is profitable to spend some additional time simplifying them if
212+
possible to improve performance during execution. The optimizer lives in ``Zend/Optimizer``.
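As a flavour of what such a simplification might look like, here is a toy pass that resolves conditional jumps whose condition is a constant. It is not taken from ``Zend/Optimizer``. The ``foldConstantJumps()`` function, the ``NOP`` and ``JMP`` names as used here, and the ``[name, op1, op2]`` opcode format are assumptions made for this sketch.

```php
<?php
// Toy pass: a JMPZ on a constant either never jumps or always jumps,
// so the condition does not need to be evaluated at runtime.
function foldConstantJumps(array $ops): array {
    foreach ($ops as $i => [$name, $op1, $op2]) {
        if ($name === 'JMPZ' && array_key_exists('const', $op1)) {
            $ops[$i] = $op1['const']
                ? ['NOP', null, null]   // truthy constant: never jumps
                : ['JMP', $op2, null];  // falsy constant: always jumps
        }
    }
    return $ops;
}

$ops = [
    ['JMPZ', ['const' => true], 2],
    ['ECHO', ['const' => "Cond is true\n"], null],
    ['RETURN', ['const' => 1], null],
];

$optimized = foldConstantJumps($ops);
echo $optimized[0][0], "\n";
```

The work is done once before caching, and every later execution of the cached opcodes benefits from it.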
213+
214+
JIT
215+
===
216+
217+
The opcache also implements a JIT compiler, which stands for just-in-time compiler. This compiler
218+
takes the virtual PHP opcodes and turns them into actual machine instructions, with additional
219+
information gained at runtime. JITs are very complex pieces of software, so this book will likely
220+
barely scratch the surface of how they work. The JIT lives in ``ext/opcache/jit``.
