API: should apply also follow result_type for axis=0 ? #19570

Open
@jorisvandenbossche

Follow-up issue on #18577

In that PR, @jreback cleaned up the result-shape inconsistencies of apply(..., axis=1), and we added the result_type keyword to control the shape of the result.

For example, when the applied function returns an array or a list, apply now returns a Series of those objects by default, or expands them into multiple columns if you pass result_type explicitly:

In [1]: df = pd.DataFrame(np.tile(np.arange(3), 4).reshape(4, -1) + 1, columns=['A', 'B', 'C'], index=pd.date_range("2012-01-01", periods=4))

In [2]: df
Out[2]: 
            A  B  C
2012-01-01  1  2  3
2012-01-02  1  2  3
2012-01-03  1  2  3
2012-01-04  1  2  3

In [3]: df.apply(lambda x: np.array([0, 1, 2]), axis=1)
Out[3]: 
2012-01-01    [0, 1, 2]
2012-01-02    [0, 1, 2]
2012-01-03    [0, 1, 2]
2012-01-04    [0, 1, 2]
Freq: D, dtype: object

In [4]: df.apply(lambda x: np.array([0, 1, 2]), axis=1, result_type='expand')
Out[4]: 
            0  1  2
2012-01-01  0  1  2
2012-01-02  0  1  2
2012-01-03  0  1  2
2012-01-04  0  1  2

In [5]: df.apply(lambda x: np.array([0, 1, 2]), axis=1, result_type='broadcast')
Out[5]: 
            A  B  C
2012-01-01  0  1  2
2012-01-02  0  1  2
2012-01-03  0  1  2
2012-01-04  0  1  2
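
For anyone who wants to reproduce the axis=1 session above outside IPython, here is a minimal self-contained sketch. It assumes a pandas version that has the result_type keyword (added in the PR referenced above); the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# Same frame as in the session above: 4 rows, columns A/B/C, daily index.
df = pd.DataFrame(
    np.tile(np.arange(3), 4).reshape(4, -1) + 1,
    columns=["A", "B", "C"],
    index=pd.date_range("2012-01-01", periods=4),
)


def func(row):
    # Ignores its input and always returns a length-3 array.
    return np.array([0, 1, 2])


# Default: one array per row, collected in an object-dtype Series.
as_series = df.apply(func, axis=1)

# 'expand': the array elements become columns 0, 1, 2.
expanded = df.apply(func, axis=1, result_type="expand")

# 'broadcast': the values are written into the original A/B/C layout.
broadcast = df.apply(func, axis=1, result_type="broadcast")

print(type(as_series).__name__, as_series.shape)
print(list(expanded.columns))
print(list(broadcast.columns))
```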

However, for axis=0 (the default), we don't yet follow the same rules or respect the keyword in all cases. Some examples:

  • For a list, the result depends on its length (and if the length matches the index, the original index is preserved instead of a new RangeIndex):

    In [16]: df.apply(lambda x: [0, 1, 2, 3])
    Out[16]: 
                A  B  C
    2012-01-01  0  0  0
    2012-01-02  1  1  1
    2012-01-03  2  2  2
    2012-01-04  3  3  3
    
    In [17]: df.apply(lambda x: [0, 1, 2, 3, 4])
    Out[17]: 
    A    [0, 1, 2, 3, 4]
    B    [0, 1, 2, 3, 4]
    C    [0, 1, 2, 3, 4]
    dtype: object
    

    (result_type='expand' and result_type='broadcast' do work correctly here)

  • For an array, the result is expanded even when the length does not match (so this differs from axis=1, and also from the list case):

    In [23]: df.apply(lambda x: np.array([0, 1, 2, 3]))
    Out[23]: 
                A  B  C
    2012-01-01  0  0  0
    2012-01-02  1  1  1
    2012-01-03  2  2  2
    2012-01-04  3  3  3
    
    In [24]: df.apply(lambda x: np.array([0, 1, 2, 3, 4]))
    Out[24]: 
       A  B  C
    0  0  0  0
    1  1  1  1
    2  2  2  2
    3  3  3  3
    4  4  4  4
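
To check which of these axis=0 cases your installed pandas treats differently, something like the following sketch may help. The summarize helper and the case labels are hypothetical (not pandas API), and the script deliberately prints what it finds rather than assuming the outputs shown above, since the behaviour may differ between versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.tile(np.arange(3), 4).reshape(4, -1) + 1,
    columns=["A", "B", "C"],
    index=pd.date_range("2012-01-01", periods=4),
)


def summarize(result):
    """Hypothetical helper: return (type name, shape) of an apply result."""
    return type(result).__name__, result.shape


# The four cases from the list above; each function ignores its input.
cases = {
    "list, matching length": lambda x: [0, 1, 2, 3],
    "list, other length": lambda x: [0, 1, 2, 3, 4],
    "array, matching length": lambda x: np.array([0, 1, 2, 3]),
    "array, other length": lambda x: np.array([0, 1, 2, 3, 4]),
}

# axis=0 is the default, so no axis argument is passed here.
summary = {label: summarize(df.apply(func)) for label, func in cases.items()}

for label, (kind, shape) in summary.items():
    print(f"{label:24s} -> {kind} {shape}")
```

If the list and array rows report different kinds or shapes for the same length, that is the inconsistency this issue is about.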
    

So the question is: should we follow the same rules for axis=0 as for axis=1?
I would say: ideally yes. But doing so might break some existing behaviour (although it might be possible to make the change gradually, with warnings).
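
To make the "with warnings" idea concrete, here is a rough sketch of what a transition could look like from the user's side. This is not how pandas implements anything: apply_with_warning is a hypothetical wrapper, the warning text is made up, and probing the function on the first column means it gets called one extra time.

```python
import warnings

import numpy as np
import pandas as pd


def apply_with_warning(df, func, result_type=None):
    """Hypothetical wrapper (not part of pandas): warn when an axis=0 apply
    returns a list-like and no result_type was given."""
    if result_type is None and df.shape[1] > 0:
        # Probe the function on the first column to see what it returns.
        probe = func(df.iloc[:, 0])
        if isinstance(probe, (list, tuple, np.ndarray)):
            warnings.warn(
                "apply(axis=0) with a list-like return value may change "
                "behaviour in the future; pass result_type explicitly.",
                FutureWarning,
                stacklevel=2,
            )
    return df.apply(func, axis=0, result_type=result_type)


df = pd.DataFrame(
    np.tile(np.arange(3), 4).reshape(4, -1) + 1,
    columns=["A", "B", "C"],
    index=pd.date_range("2012-01-01", periods=4),
)

# Warns: list-like return value and no explicit result_type.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    apply_with_warning(df, lambda x: [0, 1, 2, 3])
print([str(w.message) for w in caught])

# Passing result_type explicitly skips the wrapper's warning.
apply_with_warning(df, lambda x: [0, 1, 2, 3], result_type="expand")
```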

Labels: API Design, Apply, Enhancement, Needs Discussion
