Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
I wish that DataFrame was more compatible with initialization from more general classes of dictionaries, and with indexing by more general classes of sequences.
Feature Description
Initialization
This is motivated by attempting to use pandas from Julia via PythonCall.jl, where I ran into this strange behavior:
using PythonCall
pd = pyimport("pandas")
df = pd.DataFrame(Dict([
"a" => [1, 2, 3],
"b" => [4, 5, 6]
]))
This results in the following dataframe:
julia> df
Python:
   0
0  b
1  a
As @cjdoris discovered, this is due to the fact that "pandas.DataFrame.__init__ explicitly checks if its argument is a dict and Py(::Dict) is not a dict (it becomes a juliacall.DictValue in Python)."
The way to fix this would be to have __init__ check whether its input is an instance of collections.abc.Mapping instead, which includes both dict and juliacall.DictValue.
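To illustrate the difference between the two checks, here is a minimal Python sketch that does not touch pandas at all. WrappedDict is a hypothetical stand-in for a foreign mapping type such as juliacall.DictValue, and is_dict_strict / is_dict_like are illustrative names, not pandas functions.

```python
from collections.abc import Mapping


class WrappedDict(Mapping):
    """Hypothetical stand-in for a foreign mapping such as juliacall.DictValue."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


def is_dict_strict(obj):
    # The narrow check: only dict and its subclasses pass.
    return isinstance(obj, dict)


def is_dict_like(obj):
    # The broader check proposed here: any Mapping passes, including dict.
    return isinstance(obj, Mapping)


plain = {"a": [1, 2, 3], "b": [4, 5, 6]}
wrapped = WrappedDict(plain)

strict_results = (is_dict_strict(plain), is_dict_strict(wrapped))
broad_results = (is_dict_like(plain), is_dict_like(wrapped))

# A Mapping can always be turned into a real dict once it has been recognized.
as_dict = dict(wrapped)
```

With the strict check the wrapper is rejected even though it behaves like a dict; with the Mapping check both inputs are accepted and can be normalized with dict(...).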
Indexing
The second incompatibility is that indexing with list-like objects that are not exactly list does not work as expected. For example, if I try to index a pandas dataframe using a sequence of strings from Julia, I get this error:
julia> df[["a", "b"]]
ERROR: Python: TypeError: Julia: MethodError: objects of type Vector{String} are not callable
Use square brackets [] for indexing an Array.
Python stacktrace:
[1] __call__
@ ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/any.jl:223
[2] apply_if_callable
@ pandas.core.common ~/Documents/pysr_projects/arya/bigbench/.CondaPkg/env/lib/python3.12/site-packages/pandas/core/common.py:384
[3] __getitem__
@ pandas.core.frame ~/Documents/pysr_projects/arya/bigbench/.CondaPkg/env/lib/python3.12/site-packages/pandas/core/frame.py:4065
Stacktrace:
[1] pythrow()
@ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/err.jl:92
[2] errcheck
@ ~/.julia/packages/PythonCall/S5MOg/src/Core/err.jl:10 [inlined]
[3] pygetitem(x::Py, k::Vector{String})
@ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/builtins.jl:171
[4] getindex(x::Py, i::Vector{String})
@ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/Py.jl:292
[5] top-level scope
@ REPL[18]:1
As @cjdoris found, this is due to __getitem__ checking for list.
The solution would be to check for the more general collections.abc.Sequence, which includes both list and juliacall.VectorValue.
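A rough sketch of what such a check could look like in plain Python follows. WrappedVector is a hypothetical stand-in for juliacall.VectorValue, and is_column_list is an illustrative helper, not an existing pandas function. One caveat: str and bytes are also Sequences, so they need to be excluded explicitly, and I am assuming tuples should keep being treated as a single key rather than a list of columns.

```python
from collections.abc import Sequence


class WrappedVector(Sequence):
    """Hypothetical stand-in for a foreign sequence such as juliacall.VectorValue."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)


def is_column_list(key):
    # str and bytes are Sequences too, but should be treated as a single label.
    # Tuples are excluded as well (assumption: a tuple is one key, not many).
    if isinstance(key, (str, bytes, tuple)):
        return False
    return isinstance(key, Sequence)


checks = {
    "list": is_column_list(["a", "b"]),
    "wrapped": is_column_list(WrappedVector(["a", "b"])),
    "str": is_column_list("a"),
    "tuple": is_column_list(("a", "b")),
}

# Once recognized, the wrapper can be normalized to a real list.
normalized = list(WrappedVector(["a", "b"]))
```

Here the list and the wrapper are both accepted as a list of column labels, while a plain string or tuple is still treated as a single key.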
Alternative Solutions
Alternative solution 1
There is https://github.com/JuliaData/DataFrames.jl which implements a dataframe type from scratch, but pandas is more actively maintained, and I find it easier to use, so I would like to call it from Julia.
Alternative solution 2
The current workaround is to perform these operations as follows (thanks to @mrkn):
julia> df = pd.DataFrame(pydict(Dict("a" => [1, 2, 3], "b" => [4, 5, 6])))
Python:
   b  a
0  4  1
1  5  2
2  6  3
where we explicitly convert to a Python dict object. However, this is not obvious, and there is no error when you omit pydict; only by checking the contents do you see that it resulted in an unexpected dataframe.
Similarly, for indexing:
julia> df[pylist(["a", "b"])]
Python:
   a  b
0  1  4
1  2  5
2  3  6
However, this is not immediately obvious, and is a bit verbose. My immediate thought is that passing a list-like object to pandas getitem should work for indexing, and passing a dict-like object should work for initialization. Perhaps this is subjective but I thought I'd see what others think.
Checking the more general class of types would fix this, and also be compatible with any other list-like or dict-like inputs.
Alternative solution 3
As brought up by @cjdoris in JuliaPy/PythonCall.jl#501, one option is to change PythonCall so that it automatically converts a Julia Dict to a Python dict whenever it is passed to a Python object.
However, this would prevent people from mutating Julia dicts from within Python code, so would be a massive change in behavior. And converting Julia Vectors to Python lists automatically would also prevent mutation from Python, so is basically a non-option.
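The mutation concern can be sketched in plain Python. SharedMapping is a hypothetical wrapper that writes through to its backing store (a Python dict standing in for the Julia Dict here); it is only meant to show the difference between wrapping and converting, not how PythonCall is actually implemented.

```python
from collections.abc import MutableMapping


class SharedMapping(MutableMapping):
    """Hypothetical wrapper that writes through to a backing store."""

    def __init__(self, store):
        self._store = store  # kept by reference, no copy

    def __getitem__(self, key):
        return self._store[key]

    def __setitem__(self, key, value):
        self._store[key] = value

    def __delitem__(self, key):
        del self._store[key]

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)


backing = {"a": 1}  # stands in for the Julia-side Dict

# Wrapping: mutations made through the wrapper reach the backing store.
wrapper = SharedMapping(backing)
wrapper["b"] = 2

# Converting: dict(...) makes a copy, so later mutations are lost to the original.
copied = dict(wrapper)
copied["c"] = 3
```

After this runs, backing contains "a" and "b" but not "c", which is the behavior change that automatic conversion would introduce.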
Additional Context
Discussed in JuliaPy/PythonCall.jl#501