I get the error AttributeError: 'Series' object has no attribute 'split'

Problem description

I have 2 rows, first and second, with words in each column (each row is basically one text).

| Row | | | |
| -------- | ---- | ---- | ---- |
| first | word1 | word2 | word3.... |
| second | word1 | word2 | word3.... |

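I want to measure how similar the two rows are. I don't have frequencies, only words, but as far as I know this algorithm also gives me frequencies. The problem is that it raises this error: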

AttributeError                            Traceback (most recent call last)
<ipython-input-11-1000d05112e2> in <module>
     28     return result
     29
---> 30 get_jaccard_sim(first, second)
     31

<ipython-input-11-1000d05112e2> in get_jaccard_sim(first, second)
     22
     23 def get_jaccard_sim(first, second):
---> 24     a = set(first.split())
     25     b = set(second.split())
     26     c = a.intersection(b)

~\anaconda\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5128             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5129                 return self[name]
-> 5130             return object.__getattribute__(self, name)
   5131
   5132     def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute 'split'

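I want to split out each word and get the word frequencies and the similarity between the two texts. I also tried to get rid of the NaN values entirely, but without success (I still see NaN when I print the columns).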

import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from scipy import spatial
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import jaccard_score

data = pd.read_csv("articles of Shoshana Solomon2.csv", index_col ="article_id")


# retrieve rows by integer position with iloc
first = data.iloc[2]
second = data.iloc[3]
#third = data.iloc[4]
#fourth=data.iloc[5]
list_no_nan=[first,second]
print(list_no_nan)

#list1 = [x for x in list_no_nan if str(list_no_nan) != 'nan']
#print(list1)


def get_jaccard_sim(first, second):
    a = set(first.split())
    b = set(second.split())
    c = a.intersection(b)
    result = float(len(c)) / (len(a) + len(b) - len(c))
    return result

get_jaccard_sim(first, second)

What should I fix here? Thanks!

Tags: python, nlp, series, similarity, cosine-similarity

Solution


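first is a pandas.core.series.Series object built from a single row of the data frame, so it has no split method of its own. I think your question is about how to split the words in these columns and collect them into a set. String operations on a Series go through its .str accessor: first.str.split() creates a new Series holding a list of the split words for each cell. You can then iterate over those lists to build the set, and itertools.chain.from_iterable is a convenient way to do that.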

>>> import pandas as pd
>>> import itertools
>>> df=pd.DataFrame({"A":["one and"], "B":["two and and and"], "C":["three"]})
>>> first = df.iloc[0]
>>> print(first)
A            one and
B    two and and and
C              three
Name: 0, dtype: object
>>> split = first.str.split()
>>> print(split)
A              [one, and]
B    [two, and, and, and]
C                 [three]
Name: 0, dtype: object
>>> final = set(itertools.chain.from_iterable(split))
>>> print(final)
{'three', 'and', 'one', 'two'}
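
Putting this together with the function from the question, the following is a minimal sketch of a Jaccard similarity that accepts two rows directly. It assumes each non-empty cell holds whitespace-separated words; the helper name row_words and the sample data are made up for illustration. Calling dropna() before splitting is one way to keep NaN cells out of the result, and collections.Counter gives the word frequencies.

```python
import itertools
from collections import Counter

import pandas as pd


def row_words(row: pd.Series) -> list:
    """Return every word in a row as a flat list, skipping NaN cells.

    row_words is a hypothetical helper name, not part of pandas.
    """
    # dropna() removes missing cells; astype(str) guards against non-string values
    split = row.dropna().astype(str).str.split()
    return list(itertools.chain.from_iterable(split))


def get_jaccard_sim(first: pd.Series, second: pd.Series) -> float:
    """Jaccard similarity |A & B| / |A | B| of the word sets of two rows."""
    a = set(row_words(first))
    b = set(row_words(second))
    union = a | b
    if not union:
        # Both rows are empty; returning 0.0 here is a choice, not a rule
        return 0.0
    return len(a & b) / len(union)


# Illustrative data only: the second row has a missing cell in column B
df = pd.DataFrame(
    {
        "A": ["one and", "one or"],
        "B": ["two and and and", float("nan")],
        "C": ["three", "three"],
    }
)
first = df.iloc[0]
second = df.iloc[1]

similarity = get_jaccard_sim(first, second)
counts = Counter(row_words(first))

print(similarity)             # 0.4 (2 shared words out of 5 distinct words)
print(counts.most_common(1))  # [('and', 4)]
```

Note that Jaccard similarity only looks at which words occur, not how often, so the frequencies in counts do not affect the score.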
