python - Singleton array cannot be considered a valid collection when splitting a dataset
Problem description
I am loading data from an ES index into a DataFrame. It has the columns tags, text, and title.
I am trying to split the data in this DataFrame with the following code:
from sklearn.model_selection import train_test_split

# Get the labels
tags = df.tags
# Get the text
texts = df.text
# Split the dataset
x_train,x_test,y_train,y_test = train_test_split(texts, tags, test_size = 0.2, random_state = 7)
But it does not work, and I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-180-b8381ee0d3c2> in <module>
4
5 # Split the dataset
----> 6 x_train,x_test,y_train,y_test = train_test_split(df['text'], tags, test_size = 0.2, random_state = 7)
7
8 # Initialize a TfidfVectorizer
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py in train_test_split(*arrays, **options)
2116 raise TypeError("Invalid parameters passed: %s" % str(options))
2117
-> 2118 arrays = indexable(*arrays)
2119
2120 n_samples = _num_samples(arrays[0])
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in indexable(*iterables)
246 """
247 result = [_make_indexable(X) for X in iterables]
--> 248 check_consistent_length(*result)
249 return result
250
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
206 """
207
--> 208 lengths = [_num_samples(X) for X in arrays if X is not None]
209 uniques = np.unique(lengths)
210 if len(uniques) > 1:
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in <listcomp>(.0)
206 """
207
--> 208 lengths = [_num_samples(X) for X in arrays if X is not None]
209 uniques = np.unique(lengths)
210 if len(uniques) > 1:
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in _num_samples(x)
150 if len(x.shape) == 0:
151 raise TypeError("Singleton array %r cannot be considered"
--> 152 " a valid collection." % x)
153 # Check that shape is returning an integer or default to len
154 # Dask dataframes may not return numeric shape[0] value
TypeError: Singleton array array(kt-rOnMBAC-oqacdW1Q- On Monday night, Donald Trump traveled to West...
k9-rOnMBAC-oqacdW1Q- Donald Trump is very busy right now trying to ...
lN-rOnMBAC-oqacdW1Q- By now, we all know that upon having emergency...
ld-rOnMBAC-oqacdW1Q- Donald Trump s horrible decisions and disgusti...
lt-rOnMBAC-oqacdW1Q- It s tough sometimes to imagine that Donald Tr...
...
Y-CvOnMBAC-oqacdBwEJ BRUSSELS (Reuters) - NATO allies on Tuesday we...
Z-CvOnMBAC-oqacdBwEJ JAKARTA (Reuters) - Indonesia will buy 11 Sukh...
ZOCvOnMBAC-oqacdBwEJ LONDON (Reuters) - LexisNexis, a provider of l...
ZeCvOnMBAC-oqacdBwEJ MINSK (Reuters) - In the shadow of disused Sov...
ZuCvOnMBAC-oqacdBwEJ MOSCOW (Reuters) - Vatican Secretary of State ...
Name: text, Length: 44908, dtype: object, dtype=object) cannot be considered a valid collection.
However, when I check .shape on texts and tags, both report the same value: (44908, 1).
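Matching shapes only show that the two objects report the same dimensions; they do not show what kind of object is being passed to train_test_split. A quick way to find out is to look at the class and the package it comes from. The sketch below is a minimal illustration using only the standard library; the helper name `describe` is hypothetical, and built-in objects stand in for `texts` and `tags`.

```python
def describe(obj):
    """Summarise where an object's class comes from and how sequence-like it is."""
    cls = type(obj)
    return {
        "package": cls.__module__.split(".")[0],   # top-level package of the class
        "class": cls.__qualname__,
        "has_len": hasattr(obj, "__len__"),
        "has_getitem": hasattr(obj, "__getitem__"),
    }


# Built-in objects are used here for demonstration; in the question one
# would call describe(texts) and describe(tags) instead.
print(describe([1, 2, 3]))
print(describe(42))
```

For a genuine pandas Series the reported package would be expected to be pandas; any other value suggests a look-alike object from a different library.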
Solution
I found the solution. I was using Eland to fetch the data from ES, with the following code:
import eland as ed
from elasticsearch import Elasticsearch

es = Elasticsearch("localhost:9200")
ed_df = ed.DataFrame(es_client=es,
                     es_index_pattern='news',
                     columns=['tags', 'text', 'title'])
What I did not know is that an Eland DataFrame is not exactly the same as a pandas DataFrame, so I had to add the following line to convert it:
df = ed.eland_to_pandas(ed_df)
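This matches what the traceback shows: the error message prints the whole Series wrapped inside `array(..., dtype=object)`, meaning the object was turned into a zero-dimensional NumPy array instead of being treated as 44908 samples. The sketch below reproduces that mechanism with a hypothetical stand-in class, `OpaqueFrame`, which reports a shape but does not implement the sequence protocol. It is an illustration under that assumption, not a description of Eland's internals.

```python
import numpy as np


class OpaqueFrame:
    """Hypothetical stand-in: reports a shape, but is not a sequence."""

    def __init__(self, n_rows):
        self.shape = (n_rows, 1)


def n_dims_after_wrapping(obj):
    """Number of dimensions NumPy sees when asked to wrap obj in an array."""
    return np.array(obj, dtype=object).ndim


opaque = OpaqueFrame(44908)
print(opaque.shape)                        # the object's own claim about its shape
print(n_dims_after_wrapping(opaque))       # NumPy treats it as one opaque item
print(n_dims_after_wrapping(["a", "b"]))   # a real sequence keeps one dimension
```

After the conversion with `ed.eland_to_pandas`, `df.text` and `df.tags` are ordinary pandas Series, so the original train_test_split call can be used unchanged.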