HistGradientBoostingClassifier does not support pd.Int64Dtype in v1.4.0 #28317

Closed
@timvink

Describe the bug

Fitting a HistGradientBoostingClassifier on a DataFrame in which one of the features has the nullable pd.Int64Dtype dtype raises an error:

AttributeError: 'Int64Dtype' object has no attribute 'byteorder'

Steps/Code to Reproduce

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
# Add a constant feature and convert it to the nullable integer dtype
X_train['i'] = 1
X_train['i'] = X_train['i'].astype(pd.Int64Dtype())
clf = LogisticRegression()
clf.fit(X_train, y_train)  # all good
clf = RandomForestClassifier()
clf.fit(X_train, y_train)  # all good
clf = HistGradientBoostingClassifier()
clf.fit(X_train, y_train)  # breaks with the AttributeError above

Expected Results

No error is thrown.
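
Until that is the case, one possible workaround is to convert nullable integer columns to a plain NumPy dtype before fitting. The sketch below is an assumption on my side, not something taken from scikit-learn: the helper name `cast_nullable_ints_to_float` is hypothetical, and it casts to float64 so that any missing values become NaN.

```python
import pandas as pd


def cast_nullable_ints_to_float(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with nullable integer columns cast to float64.

    Hypothetical helper (not part of pandas or scikit-learn).
    pd.NA entries become NaN in the converted columns.
    """
    out = df.copy()
    for name in out.columns:
        dtype = out[name].dtype
        # Extension-backed integer dtypes such as pd.Int64Dtype()
        is_nullable_int = (
            pd.api.types.is_extension_array_dtype(dtype)
            and pd.api.types.is_integer_dtype(dtype)
        )
        if is_nullable_int:
            out[name] = out[name].astype("float64")
    return out


example = pd.DataFrame(
    {"i": pd.array([1, None, 3], dtype="Int64"), "x": [0.5, 1.5, 2.5]}
)
converted = cast_nullable_ints_to_float(example)
print(converted.dtypes)
```

With the reproducer above, the last step would then read `clf.fit(cast_nullable_ints_to_float(X_train), y_train)`. This should sidestep the failing dtype lookup because float64 is a regular NumPy dtype, but I have not confirmed it on every pandas and scikit-learn combination.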

Actual Results

The stacktrace suggests this is related to HistGradientBoostingClassifier gaining support for categorical dtypes in v1.4.0.

stacktrace
File /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:558, in BaseHistGradientBoosting.fit(self, X, y, sample_weight)
    556 # time spent predicting X for gradient and hessians update
    557 acc_prediction_time = 0.0
--> 558 X, known_categories = self._preprocess_X(X, reset=True)
    559 y = _check_y(y, estimator=self)
    560 y = self._encode_y(y)

File /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:271, in BaseHistGradientBoosting._preprocess_X(self, X, reset)
    268     return self._preprocessor.transform(X)
    270 # At this point, reset is False, which runs during `fit`.
--> 271 self.is_categorical_ = self._check_categorical_features(X)
    273 if self.is_categorical_ is None:
    274     self._preprocessor = None

File /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:374, in BaseHistGradientBoosting._check_categorical_features(self, X)
    371 if hasattr(X, "__dataframe__"):
    372     X_is_dataframe = True
    373     categorical_columns_mask = np.asarray(
--> 374         [
    375             c.dtype[0].name == "CATEGORICAL"
    376             for c in X.__dataframe__().get_columns()
    377         ]
    378     )
    379     X_has_categorical_columns = categorical_columns_mask.any()
    380 # pandas versions < 1.5.1 do not support the dataframe interchange
    381 # protocol so we inspect X.dtypes directly

File /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:375, in <listcomp>(.0)
    371 if hasattr(X, "__dataframe__"):
    372     X_is_dataframe = True
    373     categorical_columns_mask = np.asarray(
    374         [
--> 375             c.dtype[0].name == "CATEGORICAL"
    376             for c in X.__dataframe__().get_columns()
    377         ]
    378     )
    379     X_has_categorical_columns = categorical_columns_mask.any()
    380 # pandas versions < 1.5.1 do not support the dataframe interchange
    381 # protocol so we inspect X.dtypes directly

File properties.pyx:36, in pandas._libs.properties.CachedProperty.__get__()

File /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/pandas/core/interchange/column.py:128, in PandasColumn.dtype(self)
    126     raise NotImplementedError("Non-string object dtypes are not supported yet")
    127 else:
--> 128     return self._dtype_from_pandasdtype(dtype)

File /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/pandas/core/interchange/column.py:147, in PandasColumn._dtype_from_pandasdtype(self, dtype)
    145     byteorder = dtype.base.byteorder  # type: ignore[union-attr]
    146 else:
--> 147     byteorder = dtype.byteorder
    149 return kind, dtype.itemsize * 8, dtype_to_arrow_c_fmt(dtype), byteorder

AttributeError: 'Int64Dtype' object has no attribute 'byteorder'
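
The last frames are inside pandas' dataframe interchange code, so the failure can probably be isolated without scikit-learn. The following is a rough diagnostic sketch under that assumption: `columns_failing_interchange` is a hypothetical helper name, and it only repeats the `c.dtype` access from the traceback for each column and records which ones raise.

```python
import pandas as pd


def columns_failing_interchange(df: pd.DataFrame) -> list[tuple[str, str]]:
    """List (column name, error) pairs for columns whose interchange dtype raises.

    Hypothetical diagnostic helper, not part of pandas or scikit-learn.
    """
    if not hasattr(df, "__dataframe__"):
        return []  # interchange protocol not available in this pandas version
    xchg = df.__dataframe__()
    failing = []
    for name, column in zip(xchg.column_names(), xchg.get_columns()):
        try:
            _ = column.dtype  # same attribute access as in the list comprehension
        except Exception as exc:
            failing.append((str(name), f"{type(exc).__name__}: {exc}"))
    return failing


frame = pd.DataFrame({"x": [0.1, 0.2, 0.3]})
frame["i"] = pd.array([1, 1, 1], dtype="Int64")
print(columns_failing_interchange(frame))
```

With the versions listed below, I would expect the `i` column to show up with the same AttributeError, while the float column should not. Other pandas versions may return an empty list.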

Versions

system information
System:
    python: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
executable: /anaconda/envs/ds_data_schemas/bin/python
   machine: Linux-5.15.0-1053-azure-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.4.0
          pip: 23.3.2
   setuptools: 69.0.3
        numpy: 1.26.3
        scipy: 1.12.0
       Cython: None
       pandas: 2.2.0
   matplotlib: 3.8.2
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 4
         prefix: libgomp
       filepath: /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 4
         prefix: libopenblas
       filepath: /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: SkylakeX

       user_api: blas
   internal_api: openblas
    num_threads: 4
         prefix: libopenblas
       filepath: /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: SkylakeX
