ENH: access arrow-backed map as a python dictionary #61427

Open
@mikelui

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Users should be able to access a dataframe element that is an Arrow-backed map with normal Python dict semantics.

Today, accessing an Arrow-backed map element returns a list of (key, value) tuples, as produced by as_py() on a MapScalar, so it has list semantics rather than dictionary access semantics. Historically, this is because Arrow allows duplicate keys and does not enforce ordering. Converting to a Python dictionary therefore loses both properties: (1) duplicate keys are collapsed and (2) the ordering may change. In practice, maps that rely on duplicate keys or on ordering are not the common case, so the current default makes the common case hard.
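To make the trade-off concrete, here is a minimal pure-Python sketch of what is lost when a list of (key, value) tuples is turned into a dict. The values are made up for illustration and nothing here touches Arrow or pandas.

```python
# Association list shaped like what a map element looks like today
# (illustrative values only).
pairs = [("a", 1), ("b", 2), ("a", 3)]

# Plain dict construction keeps the last value seen for a duplicate key.
as_dict = dict(pairs)

# True when at least one duplicate key was collapsed during conversion.
collapsed = len(as_dict) < len(pairs)

print(as_dict, collapsed)
```

When keys are unique, the conversion is lossless, which is the common case this issue is about.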

The common case is that users want to interact with a map using traditional key/value access semantics. Having to convert manually on every access is a burden and a frequent source of confusion, a la:

# pseudocode
df = table.to_pandas(types_mapper=pd.ArrowDtype)
my_dict = df["col_a"].iloc[0]

val = my_dict["key"]  # error, no key/value access semantics
val = dict(my_dict)["key"]  # users need to manually convert to a dict on each access

This behavior should also be available when using imperative, iteration-based methods like .iterrows(), which is another common pattern for accessing data element by element.

Feature Description

We can have a configuration for this in ArrowExtensionArray.

Arrow already has a maps_as_pydicts option, e.g. .to_pandas(maps_as_pydicts="lossy") (accepted values are None, "lossy", and "strict"), which controls this behavior only for numpy-backed dataframes, not for pyarrow-backed ones. This feature is already widely used in at least one large company.

The flag generates a native Python dictionary instead of a Python list of (key, value) tuples. It has also made its way into lower-level APIs and has come up in competing dataframe libraries.

There's not an obvious place to put this in the types_mapper API. But we can already see unexpected behavior when combining maps_as_pydicts with types_mapper=pd.ArrowDtype:

# pseudocode
df = table.to_pandas(types_mapper=pd.ArrowDtype, maps_as_pydicts="lossy")

# my_dict is still a list of (key, value) tuples, not a dict!
my_dict = df["col_a"].iloc[0]

When combined, maps_as_pydicts is effectively ignored, because the code path taken for types_mapper=pd.ArrowDtype makes no use of the flag.

In short, when both flags are passed, the configuration should be propagated to pandas so that it is applied during element access [1, 2].

Such a change requires changes in both Arrow and Pandas.

Alternative Solutions

Alternatively, we can save some state in the underlying pyarrow array, so that calling as_py() on the MapScalar will automatically do the right thing.

Some breadcrumbs for context:

  • a MapScalar is generated when accessing a pyarrow MapArray [1, 2]
  • this is accessed when retrieving an element from an ArrowExtensionArray [1, 2]

So, one can imagine that this information is saved in the MapArray/Table itself. However, that also introduces action at a distance when converting a table to a dataframe and then performing element access. It would be more straightforward to configure this during the conversion to pandas and hold that configuration state in the dataframe.
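To illustrate what "holding the configuration state" could mean, here is a rough pure-Python sketch. MapColumnView is a hypothetical name invented for this example; it is not a pandas or pyarrow class and not a proposal for the actual implementation. It only borrows the None / "lossy" / "strict" vocabulary from the existing Arrow flag.

```python
from typing import Any, List, Optional, Sequence, Tuple

Pairs = List[Tuple[Any, Any]]


class MapColumnView:
    """Hypothetical wrapper that remembers how map elements should be returned."""

    def __init__(self, rows: Sequence[Pairs], maps_as_pydicts: Optional[str] = None):
        if maps_as_pydicts not in (None, "lossy", "strict"):
            raise ValueError("maps_as_pydicts must be None, 'lossy', or 'strict'")
        self._rows = rows
        self._mode = maps_as_pydicts  # configured once, at conversion time

    def __getitem__(self, index: int):
        pairs = self._rows[index]
        if self._mode is None:
            return pairs  # today's behavior: list of (key, value) tuples
        result = dict(pairs)
        if self._mode == "strict" and len(result) != len(pairs):
            raise ValueError("duplicate keys in map element")
        return result


view = MapColumnView([[("key", 1), ("other", 2)]], maps_as_pydicts="strict")
val = view[0]["key"]
print(val)
```

The point is that the flag is read once at construction and consulted on each element access, so there is no hidden state in the underlying array.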


Another partial alternative is making a .map accessor. I lack context on these accessors and don't know if they are an obvious solution, or a ham-fisted one.

Additional Context

Performance can be a consideration. When doing an element access, we'd be doing a conversion from the native Arrow array to a Python dictionary.

However, this is already the case. Element access on a MapScalar already traverses the underlying MapArray and converts it to a Python list [1, 2].
