Commit 3a5118b

Merge branch 'index-values' into pandas-array-upstream+fu1
2 parents: b4de5c4 + 3af8a21


64 files changed: +2230 −1465 lines

ci/requirements-3.6.run

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ lxml
 html5lib
 jinja2
 sqlalchemy
-pymysql<0.8.0
+pymysql
 feather-format
 pyarrow
 psycopg2

doc/source/dsintro.rst

Lines changed: 65 additions & 20 deletions
@@ -95,7 +95,7 @@ constructed from the sorted keys of the dict, if possible.
 
 NaN (not a number) is the standard missing data marker used in pandas.
 
-**From scalar value**
+**From scalar value**
 
 If ``data`` is a scalar value, an index must be
 provided. The value will be repeated to match the length of **index**.
@@ -154,7 +154,7 @@ See also the :ref:`section on attribute access<indexing.attribute_access>`.
 Vectorized operations and label alignment with Series
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-When working with raw NumPy arrays, looping through value-by-value is usually
+When working with raw NumPy arrays, looping through value-by-value is usually
 not necessary. The same is true when working with Series in pandas.
 Series can also be passed into most NumPy methods expecting an ndarray.
 
@@ -324,7 +324,7 @@ From a list of dicts
 From a dict of tuples
 ~~~~~~~~~~~~~~~~~~~~~
 
-You can automatically create a multi-indexed frame by passing a tuples
+You can automatically create a multi-indexed frame by passing a tuples
 dictionary.
 
 .. ipython:: python
@@ -347,7 +347,7 @@ column name provided).
 **Missing Data**
 
 Much more will be said on this topic in the :ref:`Missing data <missing_data>`
-section. To construct a DataFrame with missing data, we use ``np.nan`` to
+section. To construct a DataFrame with missing data, we use ``np.nan`` to
 represent missing values. Alternatively, you may pass a ``numpy.MaskedArray``
 as the data argument to the DataFrame constructor, and its masked entries will
 be considered missing.
@@ -370,7 +370,7 @@ set to ``'index'`` in order to use the dict keys as row labels.
 
 ``DataFrame.from_records`` takes a list of tuples or an ndarray with structured
 dtype. It works analogously to the normal ``DataFrame`` constructor, except that
-the resulting DataFrame index may be a specific field of the structured
+the resulting DataFrame index may be a specific field of the structured
 dtype. For example:
 
 .. ipython:: python
@@ -506,25 +506,70 @@ to be inserted (for example, a ``Series`` or NumPy array), or a function
 of one argument to be called on the ``DataFrame``. A *copy* of the original
 DataFrame is returned, with the new values inserted.
 
+.. versionchanged:: 0.23.0
+
+   Starting with Python 3.6 the order of ``**kwargs`` is preserved. This allows
+   for *dependent* assignment, where an expression later in ``**kwargs`` can refer
+   to a column created earlier in the same :meth:`~DataFrame.assign`.
+
+.. ipython:: python
+
+   dfa = pd.DataFrame({"A": [1, 2, 3],
+                       "B": [4, 5, 6]})
+   dfa.assign(C=lambda x: x['A'] + x['B'],
+              D=lambda x: x['A'] + x['C'])
+
+In the second expression, ``x['C']`` refers to the newly created column,
+which is equal to ``dfa['A'] + dfa['B']``.
+
+To write code compatible with all versions of Python, split the assignment in two.
+
+.. ipython:: python
+
+   dependent = pd.DataFrame({"A": [1, 1, 1]})
+   (dependent.assign(A=lambda x: x['A'] + 1)
+             .assign(B=lambda x: x['A'] + 2))
+
 .. warning::
 
-   Since the function signature of ``assign`` is ``**kwargs``, a dictionary,
-   the order of the new columns in the resulting DataFrame cannot be guaranteed
-   to match the order you pass in. To make things predictable, items are inserted
-   alphabetically (by key) at the end of the DataFrame.
-
-   All expressions are computed first, and then assigned. So you can't refer
-   to another column being assigned in the same call to ``assign``. For example:
-
-   .. ipython::
-      :verbatim:
-
-      In [1]: # Don't do this, bad reference to `C`
-              df.assign(C = lambda x: x['A'] + x['B'],
-                        D = lambda x: x['A'] + x['C'])
-
-      In [2]: # Instead, break it into two assigns
-              (df.assign(C = lambda x: x['A'] + x['B'])
-                 .assign(D = lambda x: x['A'] + x['C']))
+   Dependent assignment may subtly change the behavior of your code between
+   Python 3.6 and older versions of Python.
+
+   If you wish to write code that supports versions of Python before and after 3.6,
+   you'll need to take care when passing ``assign`` expressions that
+
+   * update an existing column, and
+   * refer to the newly updated column in the same ``assign``.
+
+   For example, we'll update column "A" and then refer to it when creating "B".
+
+   .. code-block:: python
+
+      >>> dependent = pd.DataFrame({"A": [1, 1, 1]})
+      >>> dependent.assign(A=lambda x: x["A"] + 1,
+      ...                  B=lambda x: x["A"] + 2)
+
+   For Python 3.5 and earlier the expression creating ``B`` refers to the
+   "old" value of ``A``, ``[1, 1, 1]``. The output is then
+
+   .. code-block:: text
+
+         A  B
+      0  2  3
+      1  2  3
+      2  2  3
+
+   For Python 3.6 and later, the expression creating ``A`` refers to the
+   "new" value of ``A``, ``[2, 2, 2]``, which results in
+
+   .. code-block:: text
+
+         A  B
+      0  2  4
+      1  2  4
+      2  2  4
 
 Indexing / Selection
 ~~~~~~~~~~~~~~~~~~~~
@@ -914,7 +959,7 @@ For example, using the earlier example data, we could do:
 Squeezing
 ~~~~~~~~~
 
-Another way to change the dimensionality of an object is to ``squeeze`` a 1-len
+Another way to change the dimensionality of an object is to ``squeeze`` a 1-len
 object, similar to ``wp['Item1']``.
 
 .. ipython:: python
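
(Editorial aside on the ``assign`` documentation added above: a minimal, runnable sketch of the version-independent pattern the new docs recommend — chaining two ``assign`` calls rather than relying on ``**kwargs`` order. The frame and column names here are illustrative only.)

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

   # Works on any Python version: the second assign sees the column
   # created by the first, so evaluation order is unambiguous.
   result = (df.assign(C=lambda x: x["A"] + x["B"])
               .assign(D=lambda x: x["A"] + x["C"]))
   print(result)
   #    A  B  C   D
   # 0  1  4  5   6
   # 1  2  5  7   9
   # 2  3  6  9  12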

doc/source/whatsnew/v0.23.0.txt

Lines changed: 69 additions & 2 deletions
@@ -248,6 +248,46 @@ Current Behavior:
 
    pd.RangeIndex(1, 5) / 0
 
+.. _whatsnew_0230.enhancements.assign_dependent:
+
+``.assign()`` accepts dependent arguments
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:func:`DataFrame.assign` now accepts dependent keyword arguments on Python 3.6 and later (see also `PEP 468
+<https://www.python.org/dev/peps/pep-0468/>`_). Later keyword arguments may now refer to earlier ones if the argument is a callable. See the
+:ref:`documentation here <dsintro.chained_assignment>` (:issue:`14207`).
+
+.. ipython:: python
+
+   df = pd.DataFrame({'A': [1, 2, 3]})
+   df
+   df.assign(B=df.A, C=lambda x: x['A'] + x['B'])
+
+.. warning::
+
+   This may subtly change the behavior of your code when you're
+   using ``.assign()`` to update an existing column. Previously, callables
+   referring to other variables being updated would get the "old" values.
+
+   Previous Behavior:
+
+   .. code-block:: ipython
+
+      In [2]: df = pd.DataFrame({"A": [1, 2, 3]})
+
+      In [3]: df.assign(A=lambda df: df.A + 1, C=lambda df: df.A * -1)
+      Out[3]:
+         A  C
+      0  2 -1
+      1  3 -2
+      2  4 -3
+
+   New Behavior:
+
+   .. ipython:: python
+
+      df.assign(A=df.A + 1, C=lambda df: df.A * -1)
+
 .. _whatsnew_0230.enhancements.other:
 
 Other Enhancements
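
(For background on the ``.assign()`` entry above: PEP 468, cited there, guarantees that ``**kwargs`` preserves the caller's keyword order on Python 3.6+, which is what lets later ``assign`` expressions see earlier ones. A quick illustrative check — the function name ``kwarg_order`` is ours, not pandas':)

.. code-block:: python

   def kwarg_order(**kwargs):
       # On Python 3.6+, **kwargs is an insertion-ordered dict (PEP 468).
       return list(kwargs)

   print(kwarg_order(b=1, a=2, c=3))  # ['b', 'a', 'c'] on Python 3.6+
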
@@ -460,6 +500,29 @@ To restore previous behavior, simply set ``expand`` to ``False``:
 
    extracted
    type(extracted)
 
+.. _whatsnew_0230.api_breaking.cdt_ordered:
+
+Default value for the ``ordered`` parameter of ``CategoricalDtype``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The default value of the ``ordered`` parameter for :class:`~pandas.api.types.CategoricalDtype` has changed from ``False`` to ``None`` to allow updating of ``categories`` without impacting ``ordered``. Behavior should remain consistent for downstream objects, such as :class:`Categorical` (:issue:`18790`)
+
+In previous versions, the default value for the ``ordered`` parameter was ``False``. This could potentially lead to the ``ordered`` parameter unintentionally being changed from ``True`` to ``False`` when users attempt to update ``categories`` if ``ordered`` is not explicitly specified, as it would silently default to ``False``. The new behavior for ``ordered=None`` is to retain the existing value of ``ordered``.
+
+New Behavior:
+
+.. ipython:: python
+
+   from pandas.api.types import CategoricalDtype
+   cat = pd.Categorical(list('abcaba'), ordered=True, categories=list('cba'))
+   cat
+   cdt = CategoricalDtype(categories=list('cbad'))
+   cat.astype(cdt)
+
+Notice in the example above that the converted ``Categorical`` has retained ``ordered=True``. Had the default value for ``ordered`` remained as ``False``, the converted ``Categorical`` would have become unordered, despite ``ordered=False`` never being explicitly specified. To change the value of ``ordered``, explicitly pass it to the new dtype, e.g. ``CategoricalDtype(categories=list('cbad'), ordered=False)``.
+
+Note that the unintentional conversion of ``ordered`` discussed above did not arise in previous versions due to separate bugs that prevented ``astype`` from doing any type of category-to-category conversion (:issue:`10696`, :issue:`18593`). These bugs have been fixed in this release, and motivated changing the default value of ``ordered``.
+
 .. _whatsnew_0230.api:
 
 Other API Changes
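
(To make the ``ordered=None`` semantics above concrete, a small sketch assuming pandas 0.23+; ``ordered`` is only overridden when passed explicitly:)

.. code-block:: python

   import pandas as pd
   from pandas.api.types import CategoricalDtype

   cat = pd.Categorical(list("abcaba"), ordered=True, categories=list("cba"))

   # ordered not specified -> defaults to None -> existing value is kept
   print(cat.astype(CategoricalDtype(categories=list("cbad"))).ordered)  # True

   # ordered passed explicitly -> overrides the existing value
   print(cat.astype(CategoricalDtype(categories=list("cbad"),
                                     ordered=False)).ordered)  # False
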
@@ -581,6 +644,7 @@ Performance Improvements
 - Improved performance of :func:`DataFrame.median` with ``axis=1`` when bottleneck is not installed (:issue:`16468`)
 - Improved performance of :func:`MultiIndex.get_loc` for large indexes, at the cost of a reduction in performance for small ones (:issue:`18519`)
 - Improved performance of pairwise ``.rolling()`` and ``.expanding()`` with ``.cov()`` and ``.corr()`` operations (:issue:`17917`)
+- Improved performance of :func:`DataFrameGroupBy.rank` (:issue:`15779`)
 
 .. _whatsnew_0230.docs:
 
@@ -639,7 +703,7 @@ Datetimelike
 - Bug in :class:`Series` floor-division where operating on a scalar ``timedelta`` raises an exception (:issue:`18846`)
 - Bug in :class:`Series`` with ``dtype='timedelta64[ns]`` where addition or subtraction of ``TimedeltaIndex`` had results cast to ``dtype='int64'`` (:issue:`17250`)
 - Bug in :class:`TimedeltaIndex` where division by a ``Series`` would return a ``TimedeltaIndex`` instead of a ``Series`` (issue:`19042`)
-- Bug in :class:`Series` with ``dtype='timedelta64[ns]`` where addition or subtraction of ``TimedeltaIndex`` could return a ``Series`` with an incorrect name (issue:`19043`)
+- Bug in :class:`Series` with ``dtype='timedelta64[ns]`` where addition or subtraction of ``TimedeltaIndex`` could return a ``Series`` with an incorrect name (:issue:`19043`)
 - Bug in :class:`DatetimeIndex` where the repr was not showing high-precision time values at the end of a day (e.g., 23:59:59.999999999) (:issue:`19030`)
 - Bug where dividing a scalar timedelta-like object with :class:`TimedeltaIndex` performed the reciprocal operation (:issue:`19125`)
 - Bug in ``.astype()`` to non-ns timedelta units would hold the incorrect dtype (:issue:`19176`, :issue:`19223`, :issue:`12425`)
@@ -649,6 +713,7 @@ Datetimelike
 - Bug in comparison of :class:`DatetimeIndex` against ``None`` or ``datetime.date`` objects raising ``TypeError`` for ``==`` and ``!=`` comparisons instead of all-``False`` and all-``True``, respectively (:issue:`19301`)
 - Bug in :class:`Timestamp` and :func:`to_datetime` where a string representing a barely out-of-bounds timestamp would be incorrectly rounded down instead of raising ``OutOfBoundsDatetime`` (:issue:`19382`)
 - Bug in :func:`Timestamp.floor` :func:`DatetimeIndex.floor` where time stamps far in the future and past were not rounded correctly (:issue:`19206`)
+- Bug in :func:`to_datetime` where passing an out-of-bounds datetime with ``errors='coerce'`` and ``utc=True`` would raise ``OutOfBoundsDatetime`` instead of parsing to ``NaT`` (:issue:`19612`)
 -
 
 Timezones
@@ -663,6 +728,7 @@ Timezones
 - Bug in tz-aware :class:`DatetimeIndex` where addition/subtraction with a :class:`TimedeltaIndex` or array with ``dtype='timedelta64[ns]'`` was incorrect (:issue:`17558`)
 - Bug in :func:`DatetimeIndex.insert` where inserting ``NaT`` into a timezone-aware index incorrectly raised (:issue:`16357`)
 - Bug in the :class:`DataFrame` constructor, where tz-aware Datetimeindex and a given column name will result in an empty ``DataFrame`` (:issue:`19157`)
+- Bug in :func:`Timestamp.tz_localize` where localizing a timestamp near the minimum or maximum valid values could overflow and return a timestamp with an incorrect nanosecond value (:issue:`12677`)
 
 Offsets
 ^^^^^^^
@@ -756,6 +822,7 @@ Sparse
 - Bug in which creating a ``SparseDataFrame`` from a dense ``Series`` or an unsupported type raised an uncontrolled exception (:issue:`19374`)
 - Bug in :class:`SparseDataFrame.to_csv` causing exception (:issue:`19384`)
 - Bug in :class:`SparseSeries.memory_usage` which caused segfault by accessing non sparse elements (:issue:`19368`)
+- Bug in constructing a ``SparseArray``: if ``data`` is a scalar and ``index`` is defined it will coerce to ``float64`` regardless of scalar's dtype. (:issue:`19163`)
 
 Reshaping
 ^^^^^^^^^
@@ -772,7 +839,7 @@ Reshaping
 - Bug in timezone comparisons, manifesting as a conversion of the index to UTC in ``.concat()`` (:issue:`18523`)
 - Bug in :func:`concat` when concatting sparse and dense series it returns only a ``SparseDataFrame``. Should be a ``DataFrame``. (:issue:`18914`, :issue:`18686`, and :issue:`16874`)
 - Improved error message for :func:`DataFrame.merge` when there is no common merge key (:issue:`19427`)
--
+- Bug in :func:`DataFrame.join` which does an *outer* instead of a *left* join when being called with multiple DataFrames and some have non-unique indices (:issue:`19624`)
 
 Other
 ^^^^^
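
(As an illustration of the :func:`DataFrame.join` fix listed under Reshaping above: joining a list of frames should stay a *left* join on the calling frame's index. A sketch of the intended behavior — the frames are ours, the indices are kept unique for readability, and the printed output is what we'd expect rather than a captured result; the bug itself specifically involved non-unique indices:)

.. code-block:: python

   import pandas as pd

   left = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
   right1 = pd.DataFrame({"b": [10, 20]}, index=["x", "z"])   # 'z' absent from left
   right2 = pd.DataFrame({"c": [100, 200]}, index=["x", "y"])

   # With the fix, only `left`'s index survives (a left join); the
   # buggy path performed an outer join and introduced the 'z' row.
   print(left.join([right1, right2]))
   #    a     b    c
   # x  1  10.0  100
   # y  2   NaN  200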

pandas/_libs/algos.pxd

Lines changed: 8 additions & 0 deletions
@@ -11,3 +11,11 @@ cdef inline Py_ssize_t swap(numeric *a, numeric *b) nogil:
     a[0] = b[0]
     b[0] = t
     return 0
+
+cdef enum TiebreakEnumType:
+    TIEBREAK_AVERAGE
+    TIEBREAK_MIN,
+    TIEBREAK_MAX
+    TIEBREAK_FIRST
+    TIEBREAK_FIRST_DESCENDING
+    TIEBREAK_DENSE

pandas/_libs/algos.pyx

Lines changed: 0 additions & 8 deletions
@@ -31,14 +31,6 @@ cdef double nan = NaN
 
 cdef int64_t iNaT = get_nat()
 
-cdef:
-    int TIEBREAK_AVERAGE = 0
-    int TIEBREAK_MIN = 1
-    int TIEBREAK_MAX = 2
-    int TIEBREAK_FIRST = 3
-    int TIEBREAK_FIRST_DESCENDING = 4
-    int TIEBREAK_DENSE = 5
-
 tiebreakers = {
     'average': TIEBREAK_AVERAGE,
     'min': TIEBREAK_MIN,
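
(For context on the tiebreak refactor in the two files above: the module-level ``int`` constants in ``algos.pyx`` move into a shared ``cdef enum`` in ``algos.pxd``, and the ``tiebreakers`` dict maps the public string ``method`` names onto those enum values. At the Python level the mapping surfaces through ``rank``; a quick illustrative sketch:)

.. code-block:: python

   import pandas as pd

   s = pd.Series([1, 2, 2, 3])

   # Each method name corresponds to one TIEBREAK_* enum value
   # via the `tiebreakers` dict shown in the diff.
   for method in ["average", "min", "max", "first", "dense"]:
       print(method, s.rank(method=method).tolist())
   # average [1.0, 2.5, 2.5, 4.0]
   # min     [1.0, 2.0, 2.0, 4.0]
   # max     [1.0, 3.0, 3.0, 4.0]
   # first   [1.0, 2.0, 3.0, 4.0]
   # dense   [1.0, 2.0, 2.0, 3.0]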
