
str.extract raises ValueError with group named "name" #11385

Closed
@tdhock


For a Series S, I find the S.str.extract method very useful. It is great that the columns of the resulting DataFrame are named after the named capture groups in the regular expression.

However, there seems to be a bug when a capture group is named "name". For example:

>>> import re
>>> import pandas as pd
>>> import numpy as np
>>> data = {
...     'Dave': 'dave@google.com',
...     'multiple': 'rob@gmail.com some text [email protected]',
...     'none': np.nan,
...     }
>>> pattern = r'''
... (?P<name>[a-z]+)
... @
... (?P<domain>[a-z]+)
... \.
... (?P<tld>[a-z]{2,4})
... '''
>>> S = pd.Series(data)
>>> result = S.str.extract(pattern, re.VERBOSE)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/strings.py", line 1370, in extract
    return self._wrap_result(result, name=name)
  File "pandas/core/strings.py", line 1088, in _wrap_result
    name = kwargs.get('name') or getattr(result, 'name', None) or self.series.name
  File "pandas/core/generic.py", line 730, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> from pandas.util.print_versions import show_versions
>>> show_versions()

INSTALLED VERSIONS
------------------
commit: 5d953e3fba420b6721c7f1c5d53e5812fe113bbc
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.8.0-44-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8

pandas: 0.17.0+73.g5d953e3
nose: 1.1.2
pip: None
setuptools: 0.6
Cython: 0.20.1
numpy: 1.9.1
scipy: 0.14.0
statsmodels: None
IPython: 0.12.1
sphinx: None
patsy: None
dateutil: 1.5
pytz: 2015.6
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.7.2
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
>>>
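
My reading of the last traceback frames, which is a guess rather than a confirmed diagnosis: when the result has a column called "name", `getattr(result, 'name', None)` returns that column as a Series, and the `or` chain then asks for its truth value. A minimal sketch of that behaviour on its own, using an illustrative frame rather than the real `extract` internals:

```python
import pandas as pd

# Illustrative frame with a column called "name", like the one extract builds
frame = pd.DataFrame({"name": ["dave", "rob"], "domain": ["google", "gmail"]})

# Attribute lookup resolves to the column, not to a missing attribute
attr = getattr(frame, "name", None)
is_series = isinstance(attr, pd.Series)

try:
    # Same shape as the failing line: `or` asks for the Series' truth value
    chosen = None or attr or "fallback"
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(is_series)
print(error_message)
```

If this is indeed the cause, any group name that does not collide with an existing attribute would not trigger it, which matches what I see with the other group names.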

The result I expected was:

>>> exp_list = [
...     ("dave", "google", "com"),
...     ("rob", "gmail", "com"),
...     (np.nan, np.nan, np.nan),
...     ]
>>> exp = pd.DataFrame(
...     exp_list,
...     ["Dave", "multiple", "none"],
...     ["name", "domain", "tld"])
>>> print exp
          name  domain  tld
Dave      dave  google  com
multiple   rob   gmail  com
none       NaN     NaN  NaN
>>>
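
In the meantime, the same frame can be built with `re` directly. This is only a sketch of a workaround, not how `str.extract` is implemented; the helper `first_match` is a name I made up, and the input strings are illustrative values chosen to match the expected rows above:

```python
import re

import numpy as np
import pandas as pd

pattern = re.compile(r'''
(?P<name>[a-z]+)
@
(?P<domain>[a-z]+)
\.
(?P<tld>[a-z]{2,4})
''', re.VERBOSE)

# Illustrative inputs (assumed), one matching string per expected row
data = {
    "Dave": "dave@google.com",
    "multiple": "rob@gmail.com some text",
    "none": np.nan,
}

def first_match(value):
    # Non-strings (such as NaN) and non-matching strings give an all-NaN row
    match = pattern.search(value) if isinstance(value, str) else None
    if match is None:
        return {group: np.nan for group in pattern.groupindex}
    return match.groupdict()

rows = [first_match(value) for value in data.values()]
# Order the columns by group number so they follow the pattern
columns = sorted(pattern.groupindex, key=pattern.groupindex.get)
manual = pd.DataFrame(rows, index=list(data), columns=columns)
print(manual)
```

Only the first match per string is kept, which is the behaviour I expected from `extract` for the "multiple" row.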

Labels: Bug, Strings
