Description
Discussion copied over from #49450
In the OP of #49450 (which discusses turning on the _item_cache for CoW),
Context:
Currently, we use an item cache for DataFrame columns -> Series. Whenever we access a certain column, we cache the resulting Series in df._item_cache, and the next time we access a column, we first check if that column already exists in the cache and if so return that directly. I suppose this was done to make repeated column access faster (although the Series construction overhead for this fast-path case has also improved, I think). But it also has some behavioral consequences, i.e. Series objects from column access can be identical objects, depending on the context:

    >>> df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    >>> s1 = df["a"]
    >>> s2 = df["a"]
    >>> df['b'] = 10  # set existing column -> clears the item cache
    >>> s3 = df["a"]
    >>> s1 is s2
    True
    >>> s1 is s3
    False
This caching can also have other side effects, though. In investigating #29411, I found that methods like memory_usage (from a quick glance at frame.py, round and duplicated may also be affected) that iterate through all the columns by calling .items() will actually cause all the columns to be cached in _item_cache, which blows up memory usage.
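To make the mechanism concrete, here is a rough pure-Python sketch of the caching pattern described above. It is a toy model, not pandas's actual implementation: ToyFrame and ToyColumn are hypothetical names, and only the _item_cache attribute name is borrowed from the discussion. It shows both the identity behavior and how iterating over items() ends up caching every column.

```python
class ToyColumn:
    """Hypothetical stand-in for a Series wrapping one column."""

    def __init__(self, name, values):
        self.name = name
        self.values = values


class ToyFrame:
    """Hypothetical stand-in for a DataFrame with a per-column item cache."""

    def __init__(self, data):
        self._data = {name: list(values) for name, values in data.items()}
        self._item_cache = {}

    def __getitem__(self, key):
        # Return the cached wrapper if this column was accessed before,
        # otherwise build a new wrapper and remember it.
        try:
            return self._item_cache[key]
        except KeyError:
            col = ToyColumn(key, self._data[key])
            self._item_cache[key] = col
            return col

    def __setitem__(self, key, values):
        self._data[key] = list(values)
        # Setting any column clears the whole cache in this toy model.
        self._item_cache.clear()

    def items(self):
        # Iteration goes through __getitem__, so every column gets cached.
        for key in self._data:
            yield key, self[key]


df = ToyFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
s1 = df["a"]
s2 = df["a"]
same_before = s1 is s2               # both come from the cache
df["b"] = [10, 10, 10]               # clears the cache
s3 = df["a"]
same_after = s1 is s3                # s3 is a freshly built wrapper

cache_before = len(df._item_cache)   # only "a" is cached at this point
total = sum(len(col.values) for _, col in df.items())
cache_after = len(df._item_cache)    # iterating cached every column
```

In this sketch a read-only pass over items() leaves one cached wrapper per column behind, which is the same shape of side effect as the memory_usage case above.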
This might be tricky to do, though, as Joris noted, since this would be a behavior change.
We should discuss here how we want to go about doing this (does it need a deprecation?).