
Increase _PY_NSMALLPOSINTS size #133059

@dg-pb

Description


Feature or enhancement

Proposal:

See faster-cpython/ideas#725 for the initial analyses and backstory if interested, but I will try to summarise it all here:


So the intent is to make int creation faster:

from collections import deque
consume = deque(maxlen=0).extend
INT_RANGE = list(range(100_000))
%timeit consume(range(100_000))    # 2.3 ms
%timeit consume(iter(INT_RANGE))   # 0.6 ms

And pre-storing int objects is a straightforward path to achieve this:

from itertools import chain
chain = chain.from_iterable
# Every value yielded by the repeated range(256) falls inside the cached range,
# so no new int objects need to be allocated.
%timeit consume(range(100_000))                         # 2.1 ms
%timeit consume(chain([range(256)] * (100_000 // 256))) # 0.5 ms

Increasing the value of _PY_NSMALLPOSINTS does exactly that.
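
For context, a minimal illustration of what the existing cache already does (CPython pre-allocates the integers -5 through 256 as singletons). The round-trip through str is only there to stop the compiler from deduplicating constants, so the values are really constructed at runtime:

def fresh(n):
    # Build the value at runtime so the compiler cannot fold/deduplicate it.
    return int(str(n))

fresh(100) is fresh(100)        # True: 100 is served from the small-int cache
fresh(10_000) is fresh(10_000)  # False: 10_000 is allocated anew on every call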


The benefits can be observed in pyperformance benchmarks.
With each incremental increase of this number, new statistically significant improvements surface:

  1. 1024 results in 10% faster regex_v8
  2. 2048 adds 4% faster regex_dna and 8% faster regex_effbot
  3. 8192 adds improvements in the range of 4-11% for genshi_*, regex_compile, scimark_*, spectral_norm, xml_* and a few more
  4. 100K adds 9% faster scimark_monte_carlo, 6% faster scimark_sparse_mat_mult and 12% faster spectral_norm

As can be seen, each range of integers benefits different applications:

  1. <2048 covers most standard everyday use cases. Additionally, it would benefit some iterative algorithms such as string parsing
  2. 8192 benefits a portion of scientific linear-algebra applications with dimensions up to this number, as well as integer operations whose results largely fall into this range
  3. 100K+ has an observable impact on high-dimensionality applications (scientific and otherwise), such as sparse matrices, large graphs and similar

This number can be increased to 1M, 10M, etc., and further observable benefits can be found in specific applications.


That said, more involved scientific applications should aim to benefit from optimized libraries such as numpy, which most likely puts (3) well out of scope for this change.

To try to find the most suitable number, the following two analyses can give some insight:

  1. Determining what sort of integers are used during startup and while importing all standard library packages.

Python is launched with:

import sys
from importlib import import_module

# Import every available stdlib module, skipping platform-specific ones.
names = set(sys.stdlib_module_names)
names -= {'_msi', 'msilib', '_overlapped', '_winapi', '_wmi'}
names -= {'spwd', 'ossaudiodev', 'msvcrt', 'nt', 'winreg', 'winsound'}
names -= {'antigravity', 'this'}
mods = [import_module(name) for name in names]
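
The per-request counts behind the graph below come from instrumenting the interpreter itself; as a rough, purely Python-level proxy, one can count the integer constants embedded in the imported modules' functions (this misses short-lived runtime integers, but shows a similar shape):

from collections import Counter
import types

def int_consts(code):
    # Recursively yield integer constants embedded in a code object.
    for const in code.co_consts:
        if type(const) is int:
            yield const
        elif isinstance(const, types.CodeType):
            yield from int_consts(const)

counts = Counter()
for mod in mods:  # `mods` from the snippet above
    for obj in vars(mod).values():
        if isinstance(obj, types.FunctionType):
            counts.update(int_consts(obj.__code__))

in_cache = sum(c for value, c in counts.items() if -5 <= value <= 256)
print(f"{in_cache / sum(counts.values()):.1%} of integer constants are in the current cache range")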

Cumulative density graph of PyLongObject requests:

[graph: cumulative density of PyLongObject requests during startup and imports]

256 captures ~83% of all integers used during startup and imports.
Beyond 256, there are two visible increments, at 512 and 4096.
4096 would increase coverage to 93%.
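
For illustration, the coverage per candidate cache size can be read straight off such a histogram; the numbers below are made up, the real ones come from the instrumented run behind the graph:

from collections import Counter

# Hypothetical value -> request-count histogram standing in for the real data.
requests = Counter({0: 50_000, 1: 30_000, 100: 8_000, 400: 5_000, 3_000: 4_000, 50_000: 3_000})

total = sum(requests.values())
for cache_size in (256, 512, 1024, 2048, 4096, 8192):
    covered = sum(c for value, c in requests.items() if 0 <= value < cache_size)
    print(f"cache size {cache_size:>4}: {covered / total:.1%} of requests hit the cache")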


  2. GitHub use-case analysis

The above only captures imports, not user code.
For that, a GitHub search for \brange(\d+) can provide some more insight.

[graph: distribution of range() arguments found on GitHub]
Total of 4.3M use cases.

It can be seen that 91% of ranges initialised with a single argument have it set below 260, so the current cache size covers them well.
However, the raw number of use cases is not indicative of the performance benefit.
What is more appropriate here is to see how many integer objects would be re-used at each cache size.
Below is a graph of the extra integers re-used with each incremental increase of the cache size.

[graph: additional integers re-used per cache-size increment]

So the current value of 256 allows re-use of 99M (11 + 54 + 34) integers generated by these range calls.
Increasing it to 1000 would add an extra 33M.
Increasing it to 10000 would add an extra 240M (33 + 2 + 5 + 200).
And so on.
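
The counting behind these figures is straightforward: every occurrence of range(N) re-uses min(N, cache_size) cached integers. A minimal sketch with a made-up histogram of range arguments (the real counts come from the GitHub search above):

# Hypothetical histogram: range argument N -> number of occurrences found.
range_uses = {10: 2_000_000, 100: 500_000, 300: 120_000, 5_000: 20_000, 100_000: 3_000}

def ints_reused(cache_size):
    # Each range(N) touches the integers 0..N-1; only those below
    # cache_size are served from the small-int cache.
    return sum(min(n, cache_size) * count for n, count in range_uses.items())

baseline = ints_reused(256)
for size in (1_000, 10_000, 100_000):
    extra = ints_reused(size) - baseline
    print(f"cache size {size:>7}: {extra / 1e6:.0f}M extra integers re-used vs. 256")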

So this is in line with findings from benchmark results above:

  1. There is a benefit in increasing the cache size to a medium-sized number <= 10K.
  2. Increasing the number further has performance benefits up to very high values. Although these can be significant, they are rare cases, and the average user should not be paying for them.

So my best guess is that the upper bound for this number is ~10K, which:

  1. Increases startup time by 1% (with every extra 10K).
  2. Takes 280KB (2-3% of the initial Python process) of memory; see the sketch below.
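
As a sanity check on the memory figure (assuming ~28 bytes per small int object, which is what sys.getsizeof reports for a one-digit int on a typical 64-bit build):

import sys

CACHE_SIZE = 10_000
per_int = sys.getsizeof(12_345)   # ~28 bytes on a 64-bit build
print(CACHE_SIZE * per_int)       # ~280_000 bytes, i.e. roughly 280 KB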

There should be a very good reason to sacrifice anything more solely for integer storage.
(And I suspect even this is higher than what others would agree on.)

From all the information above, together with some analyses from @eendebakpt in faster-cpython/ideas#725, it seems that both 1024 and 2048 could be good candidates.


So this is my best inference.
What do others think?

  1. Is such change possible?
  2. If yes, is upper bound above reasonable?
  3. What is the optimal number to set?

Has this already been discussed elsewhere?

I have already discussed this feature proposal on Discourse

Links to previous discussion of this feature:

faster-cpython/ideas#725

Linked PRs

Labels

interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage), type-feature (A feature request or enhancement)
