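python - How do I split sentences into sentence ID, words, and labels with pandas?
Problem description
I want to convert my pandas DataFrame into a format that can be used for an NER model.
I have a pandas DataFrame like this: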
```
Sentence_id  Sentence                                                     labels
1            Did not enjoy the new Windows 8 and touchscreen functions.  Windows 8
1            Did not enjoy the new Windows 8 and touchscreen functions.  touchscreen functions
```
Can it be converted into the following format?
```
Sentence_id words labels
1 Did O
1 not O
1 enjoy O
1 the O
1 new O
1 Windows B
1 8 I
1 and O
1 touchscreen B
1 functions I
1 . O
```
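The first word of a label should be tagged "B" (Beginning), and the following words of that label should be tagged "I" (Inside). All other words and punctuation marks should be tagged "O" (Outside).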
Solution
The solution is a bit long, but you can do it with df.iterrows().
```python
import string

import pandas as pd

# Assumes your DataFrame is named df, with columns Sentence_id, Sentence and labels.

def get_label(word, labels):
    """Return 'B' for the first word of the label, 'I' for its other words, else 'O'."""
    if word == labels[0]:
        return 'B'
    elif word in labels:
        return 'I'
    else:
        return 'O'

data = {}  # word -> [Sentence_id, tag]
exclude = set(string.punctuation)

for _, row in df.iterrows():
    # Split the sentence into words (punctuation stripped) and punctuation marks.
    words = ''.join(ch for ch in row['Sentence'] if ch not in exclude).split()
    puncts = ''.join(ch for ch in row['Sentence'] if ch in exclude).split()
    labels = row['labels'].split()
    for word in words:
        if word in data:
            # Already seen in an earlier row: only update the tag if this row labels the word.
            if word in labels:
                data[word][1] = get_label(word, labels)
        else:
            data[word] = [row['Sentence_id'], get_label(word, labels)]
    for punct in puncts:
        data[punct] = [row['Sentence_id'], 'O']

# Turn the dictionary into columns for the new DataFrame.
ids = []
words = []
labels = []
for key, val in data.items():
    words.append(key)
    ids.append(val[0])
    labels.append(val[1])

new_df = pd.DataFrame({'Sentence_id': ids, 'words': words, 'labels': labels})
new_df
```
Output:
```
    Sentence_id        words labels
0             1          Did      O
1             1            .      O
2             1          not      O
3             1        enjoy      O
4             1          the      O
5             1          new      O
6             1      Windows      B
7             1            8      I
8             1          and      O
9             1  touchscreen      B
10            1    functions      I
```
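Note that the dictionary above is keyed by word, so a word that occurs more than once (in one sentence or across sentences) is kept only once, and the row order is not guaranteed to follow the sentence. If that matters, one possible alternative is to tag tokens by position instead. The following is only a sketch under some assumptions: the helper names `tokenize`, `bio_tags` and `to_token_frame` are made up for illustration, the regex tokenization (runs of word characters, plus each punctuation mark on its own) is my choice rather than anything from the question, and label phrases are matched exactly and case-sensitively.

```python
import re

import pandas as pd


def tokenize(text):
    # Hypothetical helper: runs of word characters become tokens,
    # and every other non-space character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)


def bio_tags(tokens, phrases):
    # Hypothetical helper: start with all 'O', then mark every exact
    # occurrence of each label phrase as 'B' followed by 'I's.
    tags = ["O"] * len(tokens)
    for phrase in phrases:
        target = tokenize(phrase)
        n = len(target)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == target:
                tags[start] = "B"
                tags[start + 1:start + n] = ["I"] * (n - 1)
    return tags


def to_token_frame(df):
    # Hypothetical helper: one group per sentence, so all label rows
    # that belong to the same sentence are applied together.
    rows = []
    groups = df.groupby(["Sentence_id", "Sentence"], sort=False)
    for (sid, sentence), group in groups:
        tokens = tokenize(sentence)
        tags = bio_tags(tokens, group["labels"].dropna().tolist())
        for word, tag in zip(tokens, tags):
            rows.append({"Sentence_id": sid, "words": word, "labels": tag})
    return pd.DataFrame(rows, columns=["Sentence_id", "words", "labels"])


# Sample data from the question.
sentence = "Did not enjoy the new Windows 8 and touchscreen functions."
df = pd.DataFrame({
    "Sentence_id": [1, 1],
    "Sentence": [sentence, sentence],
    "labels": ["Windows 8", "touchscreen functions"],
})

new_df = to_token_frame(df)
print(new_df)
```

For the sample data this gives eleven rows in sentence order, with the tags O O O O O B I O B I O and the final "." as its own row. Because matching is positional, a repeated word such as a second "the" gets its own row and its own tag.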