Singleton array cannot be considered a valid collection when splitting a dataset


I am loading data from an ES index into a DataFrame. It has the following columns: tags, text, title.

I am trying to split the data from this DataFrame with the following code:

# Get the labels
tags = df.tags

# Get the text
texts = df.text

# Split the dataset
x_train,x_test,y_train,y_test = train_test_split(texts, tags, test_size = 0.2, random_state = 7)


TypeError                                 Traceback (most recent call last)
<ipython-input-180-b8381ee0d3c2> in <module>
      5 # Split the dataset
----> 6 x_train,x_test,y_train,y_test = train_test_split(df['text'], tags, test_size = 0.2, random_state = 7)
      8 # Initialize a TfidfVectorizer

~/opt/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py in train_test_split(*arrays, **options)
   2116         raise TypeError("Invalid parameters passed: %s" % str(options))
-> 2118     arrays = indexable(*arrays)
   2120     n_samples = _num_samples(arrays[0])

~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in indexable(*iterables)
    246     """
    247     result = [_make_indexable(X) for X in iterables]
--> 248     check_consistent_length(*result)
    249     return result

~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    206     """
--> 208     lengths = [_num_samples(X) for X in arrays if X is not None]
    209     uniques = np.unique(lengths)
    210     if len(uniques) > 1:

~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in <listcomp>(.0)
    206     """
--> 208     lengths = [_num_samples(X) for X in arrays if X is not None]
    209     uniques = np.unique(lengths)
    210     if len(uniques) > 1:

~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in _num_samples(x)
    150         if len(x.shape) == 0:
    151             raise TypeError("Singleton array %r cannot be considered"
--> 152                             " a valid collection." % x)
    153         # Check that shape is returning an integer or default to len
    154         # Dask dataframes may not return numeric shape[0] value

TypeError: Singleton array array(kt-rOnMBAC-oqacdW1Q-    On Monday night, Donald Trump traveled to West...
k9-rOnMBAC-oqacdW1Q-    Donald Trump is very busy right now trying to ...
lN-rOnMBAC-oqacdW1Q-    By now, we all know that upon having emergency...
ld-rOnMBAC-oqacdW1Q-    Donald Trump s horrible decisions and disgusti...
lt-rOnMBAC-oqacdW1Q-    It s tough sometimes to imagine that Donald Tr...
Y-CvOnMBAC-oqacdBwEJ    BRUSSELS (Reuters) - NATO allies on Tuesday we...
Z-CvOnMBAC-oqacdBwEJ    JAKARTA (Reuters) - Indonesia will buy 11 Sukh...
ZOCvOnMBAC-oqacdBwEJ    LONDON (Reuters) - LexisNexis, a provider of l...
ZeCvOnMBAC-oqacdBwEJ    MINSK (Reuters) - In the shadow of disused Sov...
ZuCvOnMBAC-oqacdBwEJ    MOSCOW (Reuters) - Vatican Secretary of State ...
Name: text, Length: 44908, dtype: object, dtype=object) cannot be considered a valid collection.

However, when I check the .shape of texts and tags, both are the same: (44908, 1).
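Matching shapes do not rule this error out. The `array(..., dtype=object)` wrapper in the message suggests that the whole column object was wrapped into one zero-dimensional NumPy array, which is what scikit-learn calls a singleton array. Here is a minimal sketch of that effect. `ColumnLike` is a hypothetical stand-in invented for illustration (it is not Eland's actual class), and the sketch assumes the real object likewise cannot be indexed like a sequence:

```python
import numpy as np


class ColumnLike:
    """Hypothetical column object: reports a shape but is not a sequence."""

    def __init__(self, n_rows):
        self.shape = (n_rows, 1)


col = ColumnLike(44908)

# NumPy cannot iterate over or index the object, so it stores the object
# itself as the single element of a 0-d array with dtype=object.
wrapped = np.array(col)

print(col.shape)      # (44908, 1)
print(wrapped.shape)  # ()
print(wrapped.dtype)  # object
```

An empty shape tuple is exactly the condition the `len(x.shape) == 0` check in the traceback rejects, so checking `type(texts)` is likely to be more revealing than checking `.shape`.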

Tags: python, scikit-learn


I found the solution. I was using Eland to fetch the data from ES with the following code:

import eland as ed
from elasticsearch import Elasticsearch

es = Elasticsearch("localhost:9200")
ed_df = ed.DataFrame(es_client=es,
                     columns=['tags', 'text', 'title'])

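What I did not know is that an Eland DataFrame is not exactly the same as a pandas DataFrame, so I converted it to pandas before splitting: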


df = ed.eland_to_pandas(ed_df)
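Once `df` is a real pandas DataFrame, the original split should work unchanged. The sketch below shows the idea without needing an Elasticsearch cluster. The toy rows are made up for illustration, and `split_text_and_tags` is a hypothetical helper (not part of scikit-learn or Eland) that fails early with a clearer message if it is handed something other than a pandas DataFrame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_text_and_tags(df, test_size=0.2, random_state=7):
    """Hypothetical helper: check the type first, then split text and tags."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError(f"expected pandas.DataFrame, got {type(df).__name__}")
    return train_test_split(
        df["text"], df["tags"], test_size=test_size, random_state=random_state
    )


# Toy data standing in for the converted Eland frame (illustrative only).
df = pd.DataFrame({
    "tags": ["fake", "real"] * 5,
    "text": [f"article {i}" for i in range(10)],
    "title": [f"title {i}" for i in range(10)],
})

x_train, x_test, y_train, y_test = split_text_and_tags(df)
print(len(x_train), len(x_test))  # 8 2
```

Passing the unconverted Eland object to this helper would raise the explicit `TypeError` from the type check instead of the singleton-array message.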
