ENH: generalize __init__ on a dict to collections.abc.Mapping and __getitem__ on a list to collections.abc.Sequence #58803

Open
@MilesCranmer

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I wish that DataFrame was more compatible with initializing from more general classes of dictionaries, and being indexed by more general classes of sequences.

Feature Description

Initialization

This is motivated by attempting to use pandas from Julia via PythonCall.jl, where I found that, strangely:

using PythonCall

pd = pyimport("pandas")

df = pd.DataFrame(Dict([
    "a" => [1, 2, 3],
    "b" => [4, 5, 6]
]))

would result in the following dataframe:

julia> df
Python:
   0
0  b
1  a

As @cjdoris discovered, this is due to the fact that "pandas.DataFrame.__init__ explicitly checks if its argument is a dict and Py(::Dict) is not a dict (it becomes a juliacall.DictValue in Python)."

The way to fix this would be to have __init__ check whether its input is an instance of collections.abc.Mapping instead, which covers both dict and juliacall.DictValue.
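To make the proposed check concrete, here is a minimal sketch using only the standard library. `WrappedDict` and `normalize_data` are hypothetical names made up for illustration: the first is a stand-in for a foreign wrapper such as juliacall.DictValue, and the second is not pandas code, just one way such a check could look.

```python
from collections.abc import Mapping


class WrappedDict(Mapping):
    """Hypothetical stand-in for a foreign dict wrapper (e.g. juliacall.DictValue)."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


def normalize_data(data):
    """Hypothetical helper: turn any Mapping into a plain dict, pass the rest through."""
    if isinstance(data, Mapping):  # broader than isinstance(data, dict)
        return dict(data)
    return data


wrapped = WrappedDict({"a": [1, 2, 3], "b": [4, 5, 6]})

print(isinstance(wrapped, dict))     # False: an exact dict check rejects the wrapper
print(isinstance(wrapped, Mapping))  # True: the abstract check accepts it
print(list(wrapped))                 # ['a', 'b']: plain iteration yields only the keys
print(normalize_data(wrapped))       # {'a': [1, 2, 3], 'b': [4, 5, 6]}
```

The `list(wrapped)` line is my guess at why the constructor produced a single column of keys above: an object that fails the dict check but is still iterable would be consumed as a sequence of its keys.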

Indexing

The second incompatibility is that indexing with list-like objects that are not exactly list does not work as expected. For example, if I try to index a pandas dataframe using a Julia sequence of strings, I get this error:

julia> df[["a", "b"]]
ERROR: Python: TypeError: Julia: MethodError: objects of type Vector{String} are not callable
Use square brackets [] for indexing an Array.
Python stacktrace:
 [1] __call__
   @ ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/any.jl:223
 [2] apply_if_callable
   @ pandas.core.common ~/Documents/pysr_projects/arya/bigbench/.CondaPkg/env/lib/python3.12/site-packages/pandas/core/common.py:384
 [3] __getitem__
   @ pandas.core.frame ~/Documents/pysr_projects/arya/bigbench/.CondaPkg/env/lib/python3.12/site-packages/pandas/core/frame.py:4065
Stacktrace:
 [1] pythrow()
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/err.jl:92
 [2] errcheck
   @ ~/.julia/packages/PythonCall/S5MOg/src/Core/err.jl:10 [inlined]
 [3] pygetitem(x::Py, k::Vector{String})
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/builtins.jl:171
 [4] getindex(x::Py, i::Vector{String})
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/Py.jl:292
 [5] top-level scope
   @ REPL[18]:1

As @cjdoris found, this is due to __getitem__ checking for list.

The solution would be to check for the more general collections.abc.Sequence, which includes both list and juliacall.VectorValue.
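A rough sketch of what a Sequence-based key check could look like, again standard library only. `WrappedVector`, `is_sequence_key`, and `select_columns` are hypothetical names, and the "table" is a plain dict standing in for a DataFrame. One caveat I am assuming matters here: str, bytes, and tuple are also Sequences, and pandas treats those as single labels, so a real check would probably need to exclude them.

```python
from collections.abc import Sequence


class WrappedVector(Sequence):
    """Hypothetical stand-in for a foreign list wrapper (e.g. juliacall.VectorValue)."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)


def is_sequence_key(key):
    """True for list-like keys; str, bytes, and tuple stay single labels."""
    return isinstance(key, Sequence) and not isinstance(key, (str, bytes, tuple))


def select_columns(table, key):
    """Toy column lookup on a dict of column name -> values (not pandas code)."""
    if is_sequence_key(key):
        # A list-like key selects several columns, in the order given.
        return {name: table[name] for name in key}
    # Anything else is treated as one column label.
    return table[key]


table = {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]}

print(isinstance(WrappedVector(["a", "b"]), list))   # False
print(select_columns(table, WrappedVector(["a", "b"])))  # {'a': [1, 2, 3], 'b': [4, 5, 6]}
print(select_columns(table, "a"))                    # [1, 2, 3]
```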

Alternative Solutions

Alternative solution 1

There is https://github.com/JuliaData/DataFrames.jl, which implements a dataframe type from scratch, but pandas is more actively maintained and I find it easier to use, so I would like to call it from Julia.

Alternative solution 2

The current workaround is to use these operations as follows (thanks to @mrkn)

julia> df = pd.DataFrame(pydict(Dict("a" => [1, 2, 3], "b" => [4, 5, 6])))
Python:
   b  a
0  4  1
1  5  2
2  6  3

where we explicitly convert to a Python dict object. However, this is not obvious, and there is no error when you omit pydict; only by checking the contents do you see that it produced an unexpected dataframe.

Similarly, for indexing:

julia> df[pylist(["a", "b"])]
Python:
   a  b
0  1  4
1  2  5
2  3  6

However, this is not immediately obvious, and is a bit verbose. My immediate thought is that passing a list-like object to pandas getitem should work for indexing, and passing a dict-like object should work for initialization. Perhaps this is subjective but I thought I'd see what others think.

Checking the more general class of types would fix this, and also be compatible with any other list-like or dict-like inputs.

Alternative solution 3

As brought up by @cjdoris in JuliaPy/PythonCall.jl#501, one option is to change PythonCall so that it automatically converts a Julia Dict to a Python dict whenever it is passed to a Python object.

However, this would prevent people from mutating Julia dicts from within Python code, so would be a massive change in behavior. And converting Julia Vectors to Python lists automatically would also prevent mutation from Python, so is basically a non-option.
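The mutation concern can be illustrated in pure Python. This is only an analogy: `DictView` is a hypothetical wrapper that forwards writes to a backing store (standing in for a Julia Dict exposed to Python), and `dict(view)` stands in for an automatic conversion.

```python
from collections.abc import MutableMapping


class DictView(MutableMapping):
    """Hypothetical wrapper that forwards every operation to a backing store."""

    def __init__(self, backing):
        self._backing = backing

    def __getitem__(self, key):
        return self._backing[key]

    def __setitem__(self, key, value):
        self._backing[key] = value

    def __delitem__(self, key):
        del self._backing[key]

    def __iter__(self):
        return iter(self._backing)

    def __len__(self):
        return len(self._backing)


backing = {"a": 1}        # plays the role of the original Julia-side container
view = DictView(backing)  # wrapper: writes reach the original
view["b"] = 2

converted = dict(view)    # eager conversion: an independent copy
converted["c"] = 3        # this write never reaches the original

print(backing)            # {'a': 1, 'b': 2}
print("c" in backing)     # False
```

With a wrapper, Python code can still mutate the original container; once the data is converted to a plain dict, it cannot.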

Additional Context

Discussed in JuliaPy/PythonCall.jl#501

Labels: API Design, Constructors (Series/DataFrame/Index/pd.array Constructors), Enhancement, Needs Discussion (Requires discussion from core team before further action)
