ENH: Add a `safe` option to hash_pandas_object with default value set to True

### Feature Type

- [X] Adding new functionality to pandas

- [X] Changing existing functionality in pandas

- [ ] Removing existing functionality in pandas


### Problem Description

The current implementation of `hash_pandas_object` is not collision resistant. The developers are aware of this, but it is not prominently documented, and the function is already widely used in downstream AI platforms such as MLflow, AutoGluon, and others. These platforms use `hash_pandas_object` to reduce a DataFrame to hash values and then apply MD5 or SHA-256 to the result for uniqueness checks, which enables caching and related functionality. This makes them more vulnerable to maliciously crafted datasets.

Therefore, I propose adding a `safe` option that defaults to True. This would directly improve the security of a large number of downstream applications. Failing that, the documentation should explicitly state that the function does not provide collision resistance and should not be used for caching or similar tasks.
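The downstream pattern described above can be sketched roughly as follows. This is a minimal illustration, not the exact code of any of the platforms mentioned; `dataframe_digest` is a hypothetical helper name.

```python
import hashlib

import pandas as pd


def dataframe_digest(df: pd.DataFrame) -> str:
    """Hypothetical helper mirroring the downstream caching pattern."""
    # Per-row uint64 hashes from pandas (not collision resistant).
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    # A cryptographic digest over those hashes. It cannot restore
    # information already lost if two rows hash to the same value.
    return hashlib.sha256(row_hashes.to_numpy().tobytes()).hexdigest()


df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
digest = dataframe_digest(df)
```

Because the cryptographic hash only sees the row hashes, any collision in `hash_pandas_object` carries through to the final digest.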

### Feature Description

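```
def hash_pandas_object(*args, safe=True, **kwargs):
    # *args / **kwargs stand in for the existing parameters, which stay unchanged
    if safe:
        # Proposed collision-resistant implementation (does not exist yet)
        return safe_hash_pandas_object(*args, **kwargs)
    # Existing code
```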
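One possible shape for the safe path is sketched below, purely as an illustration. `safe_hash_rows` is a hypothetical name, and hashing the `repr()` of each row with SHA-256 is an assumption made for brevity, not a proposed pandas implementation.

```python
import hashlib

import pandas as pd


def safe_hash_rows(df: pd.DataFrame) -> pd.Series:
    """Hypothetical per-row SHA-256 hash; not a pandas API."""
    digests = [
        # repr() of the (index, value, ...) tuple is the hashed payload.
        hashlib.sha256(repr(row).encode("utf8")).hexdigest()
        for row in df.itertuples(index=True, name=None)
    ]
    return pd.Series(digests, index=df.index, dtype="object")


df1 = pd.DataFrame({"A": [1604090909467468979, 2], "B": [4, 4]})
df2 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
h1, h2 = safe_hash_rows(df1), safe_hash_rows(df2)
```

A real implementation would need a canonical byte encoding per dtype and would likely be much slower than the current vectorized hashing, which is part of the trade-off behind making `safe` configurable.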

### Alternative Solutions

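Alternatively, users who need to work around this themselves can serialize the DataFrame with pickle and hash the resulting bytes. Note that `to_pickle()` writes to a path or file object and returns None, so `pickle.dumps` is used here to obtain the bytes directly. Pickle output is not guaranteed to be identical across pandas or Python versions.

```
import hashlib
import pickle

df_bytes = pickle.dumps(df)
hash_object = hashlib.sha256(df_bytes)
```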

### Additional Context

AutoGluon code:
https://github.com/autogluon/autogluon/blob/082d8bae7343f02e9dc9ce3db76bc3f305027b10/common/src/autogluon/common/utils/utils.py#L176

MLflow code:
https://github.com/mlflow/mlflow/blob/615c4cbafd616e818ff17bfcd964e8366a5cd3ed/mlflow/data/digest_utils.py#L39

Graphistry code:
https://github.com/graphistry/pygraphistry/blob/52ea49afbea55291c41962f79a90d74d76c721b9/graphistry/util.py#L84

Developer discussion on pandas functionality: https://github.com/pandas-dev/pandas/issues/16372#issuecomment-428545609  

Documentation link for `hash_pandas_object`: https://pandas.pydata.org/docs/reference/api/pandas.util.hash_pandas_object.html#pandas.util.hash_pandas_object

A demo:
```
import pandas as pd

# Define two data dictionaries that differ only in the first row
data1 = {
    'A': [1604090909467468979, 2],
    'B': [4, 4]
}
data2 = {
    'A': [1, 2],
    'B': [3, 4]
}
# Convert dictionaries to DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Calculate the per-row hash values for each DataFrame
hash_df1 = pd.util.hash_pandas_object(df1)
hash_df2 = pd.util.hash_pandas_object(df2)
# Compare the per-row hashes of the two DataFrames
print(hash_df1.equals(hash_df2))
```

ENH: Add a safe Option to hash_pandas_object with Default Value Set to True #60428
