How can I split a text from a parenthesis in a CSV, and create another column with it

Problem description

I'm completely new to the Python world, and I've been struggling with this issue for a couple of days now. Thank you in advance.

I have been trying to split the text of a single row and column into three different columns. To explain myself better, here's where I am.

So this is my pandas dataframe from a csv:

In[2]:

df = pd.read_csv('raw_csv/consejo_judicatura_guerrero.csv', header=None)
df.columns = ["institution"]
df

Out[2]:

       institution 
0      1.1.2. Consejo Nacional de Ciencias (CNCOO00012)     

First, I try to separate the 1.1.2. into a new column called number, which I more or less managed:

In[3]:

new_df = pd.DataFrame(df['institution'].str.split('. ',1).tolist(),columns=['number', 'institution'])

Out[3]:

       number institution 
0      1.1.2. Consejo Nacional de Ciencias (CNCOO00012)     

Finally, when I try to split the (CNCOO00012) into a new column called unit_id, I get the following:

In[4]:

new_df['institution'] = pd.DataFrame(new_df['institution'].str.split('(').tolist(),columns=['institution', 'unit_id'])

Out[4]:

------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-24-70d13206881c> in <module>
----> 1 new_df['institution'] = pd.DataFrame(new_df['institution'].str.split('(').tolist(),columns=['institution', 'unit_id'])

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    472                     if is_named_tuple(data[0]) and columns is None:
    473                         columns = data[0]._fields
--> 474                     arrays, columns = to_arrays(data, columns, dtype=dtype)
    475                     columns = ensure_index(columns)
    476 

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py in to_arrays(data, columns, coerce_float, dtype)
    459         return [], []  # columns if columns is not None else []
    460     if isinstance(data[0], (list, tuple)):
--> 461         return _list_to_arrays(data, columns, coerce_float=coerce_float, dtype=dtype)
    462     elif isinstance(data[0], abc.Mapping):
    463         return _list_of_dict_to_arrays(

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py in _list_to_arrays(data, columns, coerce_float, dtype)
    491     else:
    492         # list of lists
--> 493         content = list(lib.to_object_array(data).T)
    494     # gh-26429 do not raise user-facing AssertionError
    495     try:

pandas/_libs/lib.pyx in pandas._libs.lib.to_object_array()

TypeError: object of type 'NoneType' has no len()

What can I do to successfully achieve this task?

Tags: python, regex, pandas

Solution


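You can use assign with str.split, as below. Note that this only works if every row follows the same format: the number must be the first whitespace-separated token and the unit ID the last.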

# Split on whitespace: the first token is the number, the last is the unit ID
df.assign(number=df.institution.str.split().str[0],
          unit_id=df.institution.str.split().str[-1])

Output:

                                        institution  number       unit_id
0  1.1.2. Consejo Nacional de Ciencias (CNCOO00012)  1.1.2.  (CNCOO00012)

Or, if you want to strip the parentheses from unit_id, use:

# Same split, then remove the surrounding "(" and ")" from the last token
df.assign(number=df.institution.str.split().str[0],
          unit_id=df.institution.str.split().str[-1].str.strip('()'))

Output:

                                        institution  number     unit_id
0  1.1.2. Consejo Nacional de Ciencias (CNCOO00012)  1.1.2.  CNCOO00012
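
If the format is not perfectly uniform, a regex with str.extract may be a more forgiving alternative. The sketch below is not part of the answer above, only one possible approach under stated assumptions: the second sample row is hypothetical (added to show a row with no parenthesized code), and the pattern and the group name "name" are illustrative choices. It also returns the institution name on its own, which gives the three columns the question asks for.

```python
import pandas as pd

# First row is taken from the question; the second row is a hypothetical
# entry with no "(ID)" suffix, added only to show the failure mode.
df = pd.DataFrame({
    "institution": [
        "1.1.2. Consejo Nacional de Ciencias (CNCOO00012)",
        "1.1.3. Otra Unidad Sin Codigo",
    ]
})

# Whitespace split: the last token is taken as unit_id even when the row
# has no parenthesized code, so the second row gets a wrong value.
naive_unit_id = df["institution"].str.split().str[-1].str.strip("()")

# Regex with three named groups (an assumed pattern, adjust to your data):
#   number  - dotted numbering such as "1.1.2."
#   name    - the institution name, matched lazily
#   unit_id - optional text inside a trailing pair of parentheses
pattern = (
    r"^(?P<number>\d+(?:\.\d+)*\.)\s+"
    r"(?P<name>.*?)\s*"
    r"(?:\((?P<unit_id>[^()]*)\))?\s*$"
)

# str.extract returns one column per named group; an optional group that
# does not match is left as NaN instead of raising an error.
parts = df["institution"].str.extract(pattern)
print(parts)
```

With this pattern, a row that lacks the code ends up with a missing unit_id rather than a wrong one, so such rows can be found afterwards with parts["unit_id"].isna().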
