Update grammar.rst and compiler.rst to describe the PEG parser #601

Merged
merged 2 commits into from
Jul 27, 2020
59 changes: 12 additions & 47 deletions compiler.rst
@@ -10,8 +10,8 @@ Abstract

In CPython, the compilation from source code to bytecode involves several steps:

1. Parse source code into a parse tree (:file:`Parser/pgen.c`)
2. Transform parse tree into an Abstract Syntax Tree (:file:`Python/ast.c`)
1. Tokenize the source code (:file:`Parser/tokenizer.c`)
2. Parse the stream of tokens into an Abstract Syntax Tree (:file:`Parser/parser.c`)
3. Transform AST into a Control Flow Graph (:file:`Python/compile.c`)
4. Emit bytecode based on the Control Flow Graph (:file:`Python/compile.c`)
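
The four steps above can be observed loosely from Python itself. The sketch below is an illustration only: it uses the standard-library ``tokenize``, ``ast`` and ``dis`` modules and the built-in ``compile()`` as stand-ins for the C files named in the list (``tokenize`` mirrors the C tokenizer rather than calling :file:`Parser/tokenizer.c` directly), and the sample source string is made up for the demonstration.

```python
import ast
import dis
import io
import tokenize

source = "x = 1 + 2\n"  # made-up sample input

# Step 1: tokenize the source code.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
# Keep only tokens with visible text (drops NEWLINE and ENDMARKER).
token_strings = [tok.string for tok in tokens if tok.string.strip()]

# Step 2: parse the source into an Abstract Syntax Tree.
tree = ast.parse(source)

# Steps 3 and 4: compile() builds the control flow graph internally
# and emits bytecode; only the resulting code object is visible here.
code = compile(tree, "<example>", "exec")

# Run the bytecode and inspect the result.
namespace = {}
exec(code, namespace)

print(token_strings)
print(type(tree.body[0]).__name__)
dis.dis(code)  # exact bytecode listing varies between Python versions
```

The control flow graph is an internal detail of :file:`Python/compile.c` and has no public Python-level view, which is why steps 3 and 4 collapse into a single ``compile()`` call in this sketch.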

@@ -23,49 +23,18 @@ in terms of how the entire system works. You will most likely need
to read some source to have an exact understanding of all details.


Parse Trees
-----------
Parsing
-------

Python's parser is an LL(1) parser mostly based off of the
implementation laid out in the Dragon Book [Aho86]_.
As of Python 3.9, Python's parser is a PEG parser of a somewhat
unusual design (since its input is a stream of tokens rather than a
stream of characters as is more common with PEG parsers).
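
To make the "stream of tokens" point concrete, here is a minimal sketch of a PEG-style recursive-descent parser whose input is a token list rather than characters. It is not CPython's parser and is unrelated to the code generated from the real grammar: the ``ToyParser`` class, its two rules (``sum <- atom ('+' atom)*`` and ``atom <- NUMBER / NAME``) and the ``significant_tokens`` helper are all hypothetical names invented for this illustration.

```python
import io
import tokenize


def significant_tokens(source):
    """Tokenize source, keeping only NAME, NUMBER and OP tokens."""
    keep = {tokenize.NAME, tokenize.NUMBER, tokenize.OP}
    readline = io.StringIO(source).readline
    return [tok for tok in tokenize.generate_tokens(readline) if tok.type in keep]


class ToyParser:
    """Hypothetical PEG-style parser over a token list.

    Grammar (illustrative only):
        sum  <- atom ('+' atom)*
        atom <- NUMBER / NAME
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0  # index of the next token to consume

    def expect(self, tok_type, string=None):
        """Consume and return the next token if it matches, else None."""
        if self.pos < len(self.tokens):
            tok = self.tokens[self.pos]
            if tok.type == tok_type and (string is None or tok.string == string):
                self.pos += 1
                return tok
        return None

    def atom(self):
        # Ordered choice: try NUMBER first, then fall back to NAME.
        tok = self.expect(tokenize.NUMBER) or self.expect(tokenize.NAME)
        return tok.string if tok else None

    def sum(self):
        left = self.atom()
        if left is None:
            return None
        while True:
            mark = self.pos  # remember the position so we can backtrack
            if self.expect(tokenize.OP, "+") and (right := self.atom()) is not None:
                left = ("+", left, right)
            else:
                self.pos = mark  # the repetition failed: rewind and stop
                return left


tree = ToyParser(significant_tokens("a + 1 + b")).sum()
print(tree)
```

The two PEG traits on display are ordered choice (``atom`` tries alternatives in a fixed order and takes the first that matches) and backtracking by resetting the token position when a partial match fails.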

The grammar file for Python can be found in :file:`Grammar/Grammar` with the
numeric value of grammar rules stored in :file:`Include/graminit.h`. The
list of types of tokens (literal tokens, such as ``:``, numbers, etc.) can
be found in :file:`Grammar/Tokens` with the numeric value stored in
:file:`Include/token.h`. The parse tree is made up
of ``node *`` structs (as defined in :file:`Include/node.h`).

Querying data from the node structs can be done with the following
macros (which are all defined in :file:`Include/node.h`):

``CHILD(node *, int)``
Returns the nth child of the node using zero-offset indexing
``RCHILD(node *, int)``
Returns the nth child of the node from the right side; use
negative numbers!
``NCH(node *)``
Number of children the node has
``STR(node *)``
String representation of the node; e.g., will return ``:`` for a
``COLON`` token
``TYPE(node *)``
The type of node as specified in :file:`Include/graminit.h`
``REQ(node *, TYPE)``
Assert that the node is the type that is expected
``LINENO(node *)``
Retrieve the line number of the source code that led to the
creation of the parse rule; defined in :file:`Python/ast.c`

For example, consider the rule for 'while':

.. productionlist::
while_stmt: "while" `expression` ":" `suite` : ["else" ":" `suite`]

The node representing this will have ``TYPE(node) == while_stmt`` and
the number of children can be 4 or 7 depending on whether there is an
'else' statement. ``REQ(CHILD(node, 2), COLON)`` can be used to access
what should be the first ``:`` and require it be an actual ``:`` token.
The grammar file for Python can be found in
:file:`Grammar/python.gram`. The definitions for literal tokens
(such as ``:``, numbers, etc.) can be found in :file:`Grammar/Tokens`.
Various C files, including :file:`Parser/parser.c`, are generated from
these (see :doc:`grammar`).


Abstract Syntax Trees (AST)
@@ -569,10 +538,6 @@ thanks to having to support both classic and new-style classes.
References
----------

.. [Aho86] Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman.
`Compilers: Principles, Techniques, and Tools`,
https://www.amazon.com/exec/obidos/tg/detail/-/0201100886/104-0162389-6419108

.. [Wang97] Daniel C. Wang, Andrew W. Appel, Jeff L. Korn, and Chris
S. Serra. `The Zephyr Abstract Syntax Description Language.`_
In Proceedings of the Conference on Domain-Specific Languages, pp.
44 changes: 18 additions & 26 deletions grammar.rst
@@ -7,52 +7,44 @@ Abstract
--------

There's more to changing Python's grammar than editing
:file:`Grammar/Grammar`. This document aims to be a
checklist of places that must also be fixed.
:file:`Grammar/python.gram`. Here's a checklist.

It is probably incomplete. If you see omissions, submit a bug or patch.

This document is not intended to be an instruction manual on Python
grammar hacking, for several reasons.


Rationale
---------

People are getting this wrong all the time; it took well over a
year before someone `noticed <https://bugs.python.org/issue676521>`_
that adding the floor division
operator (``//``) broke the :mod:`parser` module.
NOTE: These instructions are for Python 3.9 and beyond. Earlier
versions use a different parser technology. You probably shouldn't
try to change the grammar of earlier Python versions, but if you
really want to, use GitHub to track down the earlier version of this
file in the devguide. (Python 3.9 itself actually supports both
parsers; the old parser can be invoked by passing ``-X oldparser``.)


Checklist
---------

Note: sometimes things mysteriously don't work. Before giving up, try ``make clean``.

* :file:`Grammar/Grammar`: OK, you'd probably worked this one out. :-) After changing
it, run ``make regen-grammar``, to regenerate :file:`Include/graminit.h` and
:file:`Python/graminit.c`. (This runs Python's parser generator, ``Python/pgen``).
* :file:`Grammar/python.gram`: The grammar, with actions that build AST nodes. After changing
it, run ``make regen-pegen`` to regenerate :file:`Parser/parser.c`.
(This runs Python's parser generator, ``Tools/peg_generator``.)

* :file:`Grammar/Tokens` is a place for adding new token types. After
changing it, run ``make regen-token`` to regenerate :file:`Include/token.h`,
:file:`Parser/token.c`, :file:`Lib/token.py` and
:file:`Doc/library/token-list.inc`. If you change both ``Grammar`` and ``Tokens``,
run ``make regen-tokens`` before ``make regen-grammar``.
:file:`Doc/library/token-list.inc`. If you change both ``python.gram`` and ``Tokens``,
run ``make regen-token`` before ``make regen-pegen``.

* :file:`Parser/Python.asdl` may need changes to match the Grammar. Then run ``make
* :file:`Parser/Python.asdl` may need changes to match the grammar. Then run ``make
regen-ast`` to regenerate :file:`Include/Python-ast.h` and :file:`Python/Python-ast.c`.

* :file:`Parser/tokenizer.c` contains the tokenization code. This is where you would
add a new type of comment or string literal, for example.

* :file:`Python/ast.c` will need changes to create the AST objects involved with the
Grammar change.
* :file:`Python/ast.c` will need changes to validate AST objects involved with the
grammar change.

* The :doc:`compiler` has its own page.
* :file:`Python/ast_unparse.c` will need changes to unparse AST objects involved with the
grammar change ("unparsing" is used to turn annotations into strings per :pep:`563`).

* The :mod:`parser` module. Add some of your new syntax to ``test_parser``,
bang on :file:`Modules/parsermodule.c` until it passes.
* The :doc:`compiler` has its own page.

* Add some usage of your new syntax to ``test_grammar.py``.
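
The role of unparsing mentioned in the checklist can be seen from Python. The sketch below assumes only the standard ``__future__`` module and the built-in ``compile()``; the ``scale`` function and its annotations are made up for the demonstration. With the :pep:`563` behaviour enabled, annotations are stored as strings that the compiler produces by unparsing the annotation expressions, so new syntax that can appear inside an annotation has to survive that round trip.

```python
import __future__

# Made-up sample function; only its annotations matter here.
source = (
    "def scale(values: list[int], factor: float = 2.0) -> list[float]:\n"
    "    return [v * factor for v in values]\n"
)

# Compile with the PEP 563 behaviour switched on, equivalent to putting
# "from __future__ import annotations" at the top of the compiled source.
code = compile(
    source,
    "<annotations-demo>",
    "exec",
    flags=__future__.annotations.compiler_flag,
)

namespace = {}
exec(code, namespace)

# Each annotation is a string produced by unparsing the annotation's AST.
annotations = namespace["scale"].__annotations__
print(annotations)
```

Passing the compiler flag explicitly, rather than writing the ``from __future__`` import in the example file, keeps the snippet usable from any position in a script or interactive session.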
