Understanding string interning in pd.read_csv vs. other methods of creating large object columns

### Pandas version checks

- [X] I have checked that this issue has not already been reported.

- [X] I have confirmed this bug exists on the [latest version](https://pandas.pydata.org/docs/whatsnew/index.html) of pandas.

- [X] I have confirmed this bug exists on the [main branch](https://pandas.pydata.org/docs/dev/getting_started/install.html#installing-the-development-version-of-pandas) of pandas.


### Reproducible Example

Let's say I have an HDF5 file and a CSV file that each contain a single column/dataset of equivalent string data. When I read it in via HDF5:

```python
import h5py
import pandas as pd

# `file` is the HDF5 path and `column` the dataset name
foo = pd.DataFrame()
dataset = h5py.File(file)[column][:]  # dtype = S10, length = 10 million
foo['a'] = dataset  # dtype is still S10, described in issue #52617
foo['a'] = foo['a'].astype(object)
```
At this point, memory usage explodes to something colossal. By contrast, when I do

```python
foo = pd.read_csv(large_file)
```
the memory stays really low, as though the `read_csv` code path is interning the strings or, more likely, doing some internal optimization such as `column.astype('category').astype(object)` for all the string columns.

Can you confirm whether this is true? If not, can you describe why there is a discrepancy?
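
To make the question concrete, here is a minimal sketch of how the sharing hypothesis could be probed. It counts distinct Python object identities in an object column and compares that with the number of distinct values. The helper name `sharing_report` and the toy data are my own illustration, not anything from pandas, and the `read_csv` part assumes the column comes back as object dtype. As far as I can tell, `memory_usage(deep=True)` sums the size of every element, so it would count a shared object once per reference and cannot reveal sharing on its own.

```python
import io

import numpy as np
import pandas as pd


def sharing_report(series: pd.Series) -> dict:
    """Compare the number of distinct values with the number of distinct
    Python objects referenced by an object column."""
    values = series.to_numpy()
    return {
        "rows": len(values),
        "distinct_values": len(set(values)),
        "distinct_objects": len({id(v) for v in values}),
    }


# Toy stand-in for the HDF5 dataset: fixed-width bytes, only two unique values.
raw = np.array([b"alpha", b"beta"] * 2000, dtype="S10")

# Converting the NumPy array to object builds one bytes object per row.
as_object = pd.Series(raw.astype(object), dtype=object)
report = sharing_report(as_object)
print("astype(object):", report)

# Same values routed through read_csv, for comparison.
csv_text = "a\n" + "\n".join(["alpha", "beta"] * 2000) + "\n"
from_csv = pd.read_csv(io.StringIO(csv_text))["a"]
print("read_csv:", sharing_report(from_csv))
```

If `distinct_objects` is close to `rows`, every row holds its own object; if it is close to `distinct_values`, equal strings are being shared.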

Lastly, in the former case, can you recommend a similar optimization? I'm nervous about casting everything to categories because, on the off chance that there is a tremendous number of unique strings, it could blow up the memory even worse.
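
The kind of optimization I have in mind is roughly the following sketch: convert to object dtype while routing every value through a dict, so equal values end up as references to a single Python object. Unlike a categorical, there is no separate codes array; in the worst case of all-unique values the extra cost is the dict itself. The helper name `to_shared_object_array` and its `chunk_size` parameter are hypothetical, and the toy array stands in for the HDF5 dataset.

```python
import numpy as np
import pandas as pd


def to_shared_object_array(fixed_width: np.ndarray, chunk_size: int = 100_000) -> np.ndarray:
    """Convert a fixed-width bytes array to object dtype, keeping a single
    Python object per distinct value."""
    cache = {}
    out = np.empty(len(fixed_width), dtype=object)
    i = 0
    # Work in chunks so only `chunk_size` temporary objects exist at a time.
    for start in range(0, len(fixed_width), chunk_size):
        for value in fixed_width[start:start + chunk_size].tolist():
            # setdefault returns the first object stored for an equal value.
            out[i] = cache.setdefault(value, value)
            i += 1
    return out


raw = np.array([b"alpha", b"beta"] * 2000, dtype="S10")
foo = pd.DataFrame({"a": to_shared_object_array(raw)})

# Number of distinct Python objects actually referenced by the column.
distinct_objects = len({id(v) for v in foo["a"].to_numpy()})
print(len(foo), distinct_objects)
```

The values stay `bytes` here, since that is what an `S10` array yields; decoding to `str` could happen inside the loop before the `setdefault` call.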


### Issue Description

See above

### Expected Behavior

See above

### Installed Versions

<details>

INSTALLED VERSIONS
------------------
commit           : 66e3805b8cabe977f40c05259cc3fcf7ead5687d
python           : 3.8.13.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.18.0-348.23.1.el8_5.x86_64
Version          : #1 SMP Tue Apr 12 11:20:32 EDT 2022
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.3.5
numpy            : 1.23.3
pytz             : 2022.4
dateutil         : 2.8.2
pip              : 22.2.2
setuptools       : 65.4.1
Cython           : 0.29.32
pytest           : 7.0.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.8.0
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.5.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : 1.3.5
fsspec           : 2022.8.2
fastparquet      : None
gcsfs            : None
matplotlib       : 3.6.0
numexpr          : 2.8.3
odfpy            : None
openpyxl         : 3.0.10
pandas_gbq       : None
pyarrow          : 9.0.0
pyxlsb           : None
s3fs             : None
scipy            : 1.9.1
sqlalchemy       : 1.4.41
tables           : 3.6.1
tabulate         : 0.9.0
xarray           : 2022.9.0
xlrd             : 2.0.1
xlwt             : None
numba            : 0.56.2
</details>


Understanding string interning in pd.read_csv vs. other methods of creating large object columns #52639