'NoneType' object has no attribute 'lower': error when cleaning text

Problem description

Below is the code I am running in Databricks, followed by the error it produces.

data = d.select("*").toPandas()
train, test = train_test_split(data, test_size = .20, random_state = True)
train['set'] = 'train'
test['set'] = 'test'
data = pd.concat([train,test], ignore_index=True)

def clean_text(text):
  return "".join([c for c in text.lower() if c not in punctuation])

data['text_cleaned'] = data['text'].map(clean_text)

tfidf = TfidfVectorizer()
tfidf.fit(data['text_cleaned'])

Error:

AttributeError: 'NoneType' object has no attribute 'lower'
/local_disk0/tmp/1582551158268-0/PythonShell.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

/local_disk0/tmp/1582551158268-0/PythonShell.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import 

AttributeError: 'NoneType' object has no attribute 'lower'         

Tags: python, pyspark, databricks, tfidfvectorizer

Solution


You can filter out the None values in clean_text, checking for None before .lower() is called:

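# punctuation is assumed to be string.punctuation; the question does not show its imports
from string import punctuation

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

data = d.select("*").toPandas()
train, test = train_test_split(data, test_size = .20, random_state = True)
train['set'] = 'train'
test['set'] = 'test'
data = pd.concat([train,test], ignore_index=True)

def clean_text(text):
    # The None check must run before text.lower() is evaluated;
    # a missing text becomes an empty string.
    if text is None:
        return ""
    return "".join(c for c in text.lower() if c not in punctuation)

data['text_cleaned'] = data['text'].map(clean_text)

tfidf = TfidfVectorizer()
tfidf.fit(data['text_cleaned'])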
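The None handling can be tried without Spark or scikit-learn. The following is only a minimal sketch: the sample list `texts` and the name `clean_text_safe` are hypothetical, and `punctuation` is assumed to come from Python's `string` module. It shows that the check has to happen before `.lower()` is called, and that there are two ways to deal with missing values: map them to an empty string, or drop them before cleaning.

```python
from string import punctuation

# Hypothetical sample values standing in for the `text` column;
# None mimics a null value that survives toPandas().
texts = ["Hello, World!", None, "Good-bye..."]


def clean_text_safe(text):
    """Lowercase the text and strip punctuation; None becomes an empty string."""
    # Return early so that .lower() is never called on None.
    if text is None:
        return ""
    return "".join(c for c in text.lower() if c not in punctuation)


# Option 1: keep every entry, turning missing text into "".
cleaned = [clean_text_safe(t) for t in texts]

# Option 2: drop the missing entries first, then clean the rest.
filtered = [clean_text_safe(t) for t in texts if t is not None]

print(cleaned)
print(filtered)
```

With a pandas DataFrame, the first option corresponds to `data['text'].map(clean_text_safe)` and the second to calling `data.dropna(subset=['text'])` before mapping. If the column holds NaN rather than None, `pd.isna(text)` would be the check to use instead of `text is None`.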
