
Commit 206a997

Merge branch 'master' of https://github.com/pandas-dev/pandas into ref-libreduction-5

2 parents: 2201ece + b9a9769

File tree: 249 files changed, +6472 / -1999 lines


Makefile

Lines changed: 7 additions & 0 deletions
@@ -25,3 +25,10 @@ doc:
 	cd doc; \
 	python make.py clean; \
 	python make.py html
+
+check:
+	python3 scripts/validate_unwanted_patterns.py \
+		--validation-type="private_function_across_module" \
+		--included-file-extensions="py" \
+		--excluded-file-paths=pandas/tests,asv_bench/,pandas/_vendored \
+		pandas/
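The `check` target above delegates to `scripts/validate_unwanted_patterns.py`. As a rough, hypothetical sketch of what such a "private function across module" scan does (this is illustrative simplified logic, not the actual script), a regex pass over source lines might look like:

```python
import re

# Hypothetical, simplified sketch: flag usages like `other_module._helper(...)`
# while ignoring access to private attributes of `self`.
PRIVATE_ACCESS = re.compile(r"\b(?!self\b)(\w+)\._(\w+)\b")

def find_private_access(source: str):
    """Return (line_number, match_text) pairs for cross-module private access."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in PRIVATE_ACCESS.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits

code = "import lib\nx = lib._internal(1)\ny = self._ok()\n"
print(find_private_access(code))  # flags lib._internal, not self._ok
```

The real script is considerably more careful (it parses files, honors the excluded paths, and formats errors for CI); this only conveys the shape of the check.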

ci/code_checks.sh

Lines changed: 23 additions & 4 deletions
@@ -116,6 +116,14 @@ if [[ -z "$CHECK" || "$CHECK" == "lint" ]]; then
     fi
     RET=$(($RET + $?)) ; echo $MSG "DONE"

+    MSG='Check for use of private module attribute access' ; echo $MSG
+    if [[ "$GITHUB_ACTIONS" == "true" ]]; then
+        $BASE_DIR/scripts/validate_unwanted_patterns.py --validation-type="private_function_across_module" --included-file-extensions="py" --excluded-file-paths=pandas/tests,asv_bench/,pandas/_vendored --format="##[error]{source_path}:{line_number}:{msg}" pandas/
+    else
+        $BASE_DIR/scripts/validate_unwanted_patterns.py --validation-type="private_function_across_module" --included-file-extensions="py" --excluded-file-paths=pandas/tests,asv_bench/,pandas/_vendored pandas/
+    fi
+    RET=$(($RET + $?)) ; echo $MSG "DONE"
+
     echo "isort --version-number"
     isort --version-number

@@ -179,6 +187,10 @@ if [[ -z "$CHECK" || "$CHECK" == "patterns" ]]; then
     invgrep -R --include="*.py" -E "super\(\w*, (self|cls)\)" pandas
     RET=$(($RET + $?)) ; echo $MSG "DONE"

+    MSG='Check for use of builtin filter function' ; echo $MSG
+    invgrep -R --include="*.py" -P '(?<!def)[\(\s]filter\(' pandas
+    RET=$(($RET + $?)) ; echo $MSG "DONE"
+
     # Check for the following code in testing: `np.testing` and `np.array_equal`
     MSG='Check for invalid testing' ; echo $MSG
     invgrep -r -E --include '*.py' --exclude testing.py '(numpy|np)(\.testing|\.array_equal)' pandas/tests/

@@ -226,15 +238,22 @@ if [[ -z "$CHECK" || "$CHECK" == "patterns" ]]; then
     invgrep -R --include=*.{py,pyx} '!r}' pandas
     RET=$(($RET + $?)) ; echo $MSG "DONE"

+    # -------------------------------------------------------------------------
+    # Type annotations
+
     MSG='Check for use of comment-based annotation syntax' ; echo $MSG
     invgrep -R --include="*.py" -P '# type: (?!ignore)' pandas
     RET=$(($RET + $?)) ; echo $MSG "DONE"

-    # https://github.com/python/mypy/issues/7384
-    # MSG='Check for missing error codes with # type: ignore' ; echo $MSG
-    # invgrep -R --include="*.py" -P '# type: ignore(?!\[)' pandas
-    # RET=$(($RET + $?)) ; echo $MSG "DONE"
+    MSG='Check for missing error codes with # type: ignore' ; echo $MSG
+    invgrep -R --include="*.py" -P '# type:\s?ignore(?!\[)' pandas
+    RET=$(($RET + $?)) ; echo $MSG "DONE"
+
+    MSG='Check for use of Union[Series, DataFrame] instead of FrameOrSeriesUnion alias' ; echo $MSG
+    invgrep -R --include="*.py" --exclude=_typing.py -E 'Union\[.*(Series.*DataFrame|DataFrame.*Series).*\]' pandas
+    RET=$(($RET + $?)) ; echo $MSG "DONE"

+    # -------------------------------------------------------------------------
     MSG='Check for use of foo.__class__ instead of type(foo)' ; echo $MSG
     invgrep -R --include=*.{py,pyx} '\.__class__' pandas
     RET=$(($RET + $?)) ; echo $MSG "DONE"
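The newly enabled `# type: ignore` check above hinges on the perl-style negative lookahead `(?!\[)`, which grep receives via `-P`. Python's `re` supports the same syntax, so the pattern's behavior can be demonstrated directly (the sample lines below are illustrative, not from the pandas codebase):

```python
import re

# The same pattern the check passes to grep via `invgrep ... -P`:
# flag `# type: ignore` comments that lack an error code like [attr-defined].
pattern = re.compile(r"# type:\s?ignore(?!\[)")

lines = [
    "x = f()  # type: ignore",                # flagged: no error code
    "y = g()  # type: ignore[attr-defined]",  # ok: explicit error code
    "z = h()  # type:ignore",                 # flagged: \s? also allows no space
]
for line in lines:
    print(bool(pattern.search(line)), line)
```

The lookahead rejects a match only when `ignore` is immediately followed by `[`, which is exactly how mypy spells an error code.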

doc/source/development/contributing_docstring.rst

Lines changed: 5 additions & 5 deletions
@@ -32,18 +32,18 @@ The next example gives an idea of what a docstring looks like:
     Parameters
     ----------
     num1 : int
-        First number to add
+        First number to add.
     num2 : int
-        Second number to add
+        Second number to add.

     Returns
     -------
     int
-        The sum of `num1` and `num2`
+        The sum of `num1` and `num2`.

     See Also
     --------
-    subtract : Subtract one integer from another
+    subtract : Subtract one integer from another.

     Examples
     --------

@@ -998,4 +998,4 @@ mapping function names to docstrings. Wherever possible, we prefer using

 See ``pandas.core.generic.NDFrame.fillna`` for an example template, and
 ``pandas.core.series.Series.fillna`` and ``pandas.core.generic.frame.fillna``
-for the filled versions.
+for the filled versions.

doc/source/reference/frame.rst

Lines changed: 16 additions & 0 deletions
@@ -37,6 +37,7 @@ Attributes and underlying data
    DataFrame.shape
    DataFrame.memory_usage
    DataFrame.empty
+   DataFrame.set_flags

 Conversion
 ~~~~~~~~~~

@@ -276,6 +277,21 @@ Time Series-related
    DataFrame.tz_convert
    DataFrame.tz_localize

+.. _api.frame.flags:
+
+Flags
+~~~~~
+
+Flags refer to attributes of the pandas object. Properties of the dataset (like
+the date it was recorded, the URL it was accessed from, etc.) should be stored
+in :attr:`DataFrame.attrs`.
+
+.. autosummary::
+   :toctree: api/
+
+   Flags
+
 .. _api.frame.metadata:

 Metadata

doc/source/reference/general_utility_functions.rst

Lines changed: 1 addition & 0 deletions
@@ -37,6 +37,7 @@ Exceptions and warnings

    errors.AccessorRegistrationWarning
    errors.DtypeWarning
+   errors.DuplicateLabelError
    errors.EmptyDataError
    errors.InvalidIndexError
    errors.MergeError

doc/source/reference/series.rst

Lines changed: 15 additions & 0 deletions
@@ -39,6 +39,8 @@ Attributes
    Series.empty
    Series.dtypes
    Series.name
+   Series.flags
+   Series.set_flags

 Conversion
 ----------

@@ -527,6 +529,19 @@ Sparse-dtype specific methods and attributes are provided under the
    Series.sparse.from_coo
    Series.sparse.to_coo

+.. _api.series.flags:
+
+Flags
+~~~~~
+
+Flags refer to attributes of the pandas object. Properties of the dataset (like
+the date it was recorded, the URL it was accessed from, etc.) should be stored
+in :attr:`Series.attrs`.
+
+.. autosummary::
+   :toctree: api/
+
+   Flags

 .. _api.series.metadata:

doc/source/user_guide/computation.rst

Lines changed: 3 additions & 0 deletions
@@ -361,6 +361,9 @@ compute the mean absolute deviation on a rolling basis:
     @savefig rolling_apply_ex.png
     s.rolling(window=60).apply(mad, raw=True).plot(style='k')

+Using the Numba engine
+~~~~~~~~~~~~~~~~~~~~~~
+
 .. versionadded:: 1.0

 Additionally, :meth:`~Rolling.apply` can leverage `Numba <https://numba.pydata.org/>`__

doc/source/user_guide/duplicates.rst

Lines changed: 210 additions & 0 deletions
@@ -0,0 +1,210 @@
+.. _duplicates:
+
+****************
+Duplicate Labels
+****************
+
+:class:`Index` objects are not required to be unique; you can have duplicate row
+or column labels. This may be a bit confusing at first. If you're familiar with
+SQL, you know that row labels are similar to a primary key on a table, and you
+would never want duplicates in a SQL table. But one of pandas' roles is to clean
+messy, real-world data before it goes to some downstream system. And real-world
+data has duplicates, even in fields that are supposed to be unique.
+
+This section describes how duplicate labels change the behavior of certain
+operations, and how to prevent duplicates from arising during operations, or to
+detect them if they do.
+
+.. ipython:: python
+
+   import pandas as pd
+   import numpy as np
+
+Consequences of Duplicate Labels
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Some pandas methods (:meth:`Series.reindex` for example) just don't work with
+duplicates present. The output can't be determined, and so pandas raises.
+
+.. ipython:: python
+   :okexcept:
+
+   s1 = pd.Series([0, 1, 2], index=['a', 'b', 'b'])
+   s1.reindex(['a', 'b', 'c'])
+
+Other methods, like indexing, can give very surprising results. Typically
+indexing with a scalar will *reduce dimensionality*. Slicing a ``DataFrame``
+with a scalar will return a ``Series``. Slicing a ``Series`` with a scalar will
+return a scalar. But with duplicates, this isn't the case.
+
+.. ipython:: python
+
+   df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=['A', 'A', 'B'])
+   df1
+
+We have duplicates in the columns. If we slice ``'B'``, we get back a ``Series``.
+
+.. ipython:: python
+
+   df1['B']  # a Series
+
+But slicing ``'A'`` returns a ``DataFrame``.
+
+.. ipython:: python
+
+   df1['A']  # a DataFrame
+
+This applies to row labels as well.
+
+.. ipython:: python
+
+   df2 = pd.DataFrame({"A": [0, 1, 2]}, index=['a', 'a', 'b'])
+   df2
+   df2.loc['b', 'A']  # a scalar
+   df2.loc['a', 'A']  # a Series
+
+Duplicate Label Detection
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can check whether an :class:`Index` (storing the row or column labels) is
+unique with :attr:`Index.is_unique`:
+
+.. ipython:: python
+
+   df2
+   df2.index.is_unique
+   df2.columns.is_unique
+
+.. note::
+
+   Checking whether an index is unique is somewhat expensive for large datasets.
+   Pandas does cache this result, so re-checking on the same index is very fast.
+
+:meth:`Index.duplicated` will return a boolean ndarray indicating whether a
+label is repeated.
+
+.. ipython:: python
+
+   df2.index.duplicated()
+
+This can be used as a boolean filter to drop duplicate rows.
+
+.. ipython:: python
+
+   df2.loc[~df2.index.duplicated(), :]
+
+If you need additional logic to handle duplicate labels, rather than just
+dropping the repeats, using :meth:`~DataFrame.groupby` on the index is a common
+trick. For example, we'll resolve duplicates by taking the average of all rows
+with the same label.
+
+.. ipython:: python
+
+   df2.groupby(level=0).mean()
+
+.. _duplicates.disallow:
+
+Disallowing Duplicate Labels
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. versionadded:: 1.2.0
+
+As noted above, handling duplicates is an important feature when reading in raw
+data. That said, you may want to avoid introducing duplicates as part of a data
+processing pipeline (from methods like :meth:`pandas.concat`,
+:meth:`~DataFrame.rename`, etc.). Both :class:`Series` and :class:`DataFrame`
+can *disallow* duplicate labels by calling ``.set_flags(allows_duplicate_labels=False)``
+(the default is to allow them). If there are duplicate labels, an exception
+will be raised.
+
+.. ipython:: python
+   :okexcept:
+
+   pd.Series(
+       [0, 1, 2],
+       index=['a', 'b', 'b']
+   ).set_flags(allows_duplicate_labels=False)
+
+This applies to both row and column labels for a :class:`DataFrame`.
+
+.. ipython:: python
+   :okexcept:
+
+   pd.DataFrame(
+       [[0, 1, 2], [3, 4, 5]], columns=["A", "B", "C"],
+   ).set_flags(allows_duplicate_labels=False)
+
+This attribute can be checked or set with :attr:`~DataFrame.flags.allows_duplicate_labels`,
+which indicates whether that object can have duplicate labels.
+
+.. ipython:: python
+
+   df = (
+       pd.DataFrame({"A": [0, 1, 2, 3]},
+                    index=['x', 'y', 'X', 'Y'])
+       .set_flags(allows_duplicate_labels=False)
+   )
+   df
+   df.flags.allows_duplicate_labels
+
+:meth:`DataFrame.set_flags` can be used to return a new ``DataFrame`` with attributes
+like ``allows_duplicate_labels`` set to some value.
+
+.. ipython:: python
+
+   df2 = df.set_flags(allows_duplicate_labels=True)
+   df2.flags.allows_duplicate_labels
+
+The new ``DataFrame`` returned is a view on the same data as the old ``DataFrame``.
+Or the property can just be set directly on the same object.
+
+.. ipython:: python
+
+   df2.flags.allows_duplicate_labels = False
+   df2.flags.allows_duplicate_labels
+
+When processing raw, messy data you might initially read in the messy data
+(which potentially has duplicate labels), deduplicate, and then disallow duplicates
+going forward, to ensure that your data pipeline doesn't introduce duplicates.
+
+.. code-block:: python
+
+   >>> raw = pd.read_csv("...")
+   >>> deduplicated = raw.groupby(level=0).first()  # remove duplicates
+   >>> deduplicated.flags.allows_duplicate_labels = False  # disallow going forward
+
+Setting ``allows_duplicate_labels=True`` on a ``Series`` or ``DataFrame`` with duplicate
+labels, or performing an operation that introduces duplicate labels on a ``Series`` or
+``DataFrame`` that disallows duplicates, will raise an
+:class:`errors.DuplicateLabelError`.
+
+.. ipython:: python
+   :okexcept:
+
+   df.rename(str.upper)
+
+This error message contains the labels that are duplicated, and the numeric positions
+of all the duplicates (including the "original") in the ``Series`` or ``DataFrame``.
+
+Duplicate Label Propagation
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In general, disallowing duplicates is "sticky". It's preserved through
+operations.
+
+.. ipython:: python
+   :okexcept:
+
+   s1 = pd.Series(0, index=['a', 'b']).set_flags(allows_duplicate_labels=False)
+   s1
+   s1.head().rename({"a": "b"})
+
+.. warning::
+
+   This is an experimental feature. Currently, many methods fail to
+   propagate the ``allows_duplicate_labels`` value. In future versions
+   it is expected that every method taking or returning one or more
+   DataFrame or Series objects will propagate ``allows_duplicate_labels``.
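The ``Index.duplicated`` semantics described in this new user-guide page (mark every occurrence of a label after its first appearance) can be mimicked in plain Python. This is a sketch of the semantics only, not pandas' actual hashtable-based implementation:

```python
def duplicated(labels):
    """Mimic pandas' Index.duplicated(keep='first'): True for every
    occurrence of a label after its first appearance."""
    seen = set()
    mask = []
    for label in labels:
        mask.append(label in seen)  # repeat if we've seen this label before
        seen.add(label)
    return mask

print(duplicated(['a', 'a', 'b']))  # [False, True, False]
```

Negating this mask and using it as a row filter is exactly the `df2.loc[~df2.index.duplicated(), :]` idiom shown in the guide.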

doc/source/user_guide/enhancingperf.rst

Lines changed: 7 additions & 0 deletions
@@ -373,6 +373,13 @@ nicer interface by passing/returning pandas objects.

 In this example, using Numba was faster than Cython.

+Numba as an argument
+~~~~~~~~~~~~~~~~~~~~
+
+Additionally, we can leverage the power of `Numba <https://numba.pydata.org/>`__
+by calling it as an argument in :meth:`~Rolling.apply`. See :ref:`Computation tools
+<stats.rolling_apply>` for an extensive example.
+
 Vectorize
 ~~~~~~~~~
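The rolling ``mad`` example these new doc sections cross-reference can be expressed without pandas or Numba at all. Below is a plain-Python sketch, under the assumption that `mad` means mean absolute deviation as in the computation guide; the helper names are illustrative, and in real use one would pass `mad` to `Rolling.apply(..., engine='numba')` for the JIT speedup:

```python
def mad(window):
    """Mean absolute deviation of one window of values."""
    mean = sum(window) / len(window)
    return sum(abs(x - mean) for x in window) / len(window)

def rolling_apply(values, window, func):
    """Apply func over each full trailing window; None until the window fills.
    A sketch of what Rolling.apply computes for a fixed integer window."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough observations yet
        else:
            out.append(func(values[i + 1 - window : i + 1]))
    return out

print(rolling_apply([1.0, 2.0, 3.0, 4.0], window=2, func=mad))
```

For a window of 2, each pair `(x, x+1)` has mean `x + 0.5`, so every filled window yields a deviation of 0.5.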
