python-3.x - NLTK ne_tree: word-tokenize and chunk the rows of a column (Python/Pandas/Jupyter)
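Question
I've just started learning the Natural Language Toolkit (NLTK). I'm trying to classify words; essentially I'm looking for people, places, and organizations.
So far, defining a single line of text in the script works.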
import nltk
from nltk import pos_tag, word_tokenize
ex = 'John'
ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(ex)))
print(ne_tree)
Output:
(S (PERSON John/NNP))
My question is: can I run this script against an entire column?
My table looks like this.
Example 2:
Order Text
0 John
1 Chicago
2 stuff
3 question
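Order is basically an index I created. The idea is that later I can break sentences into words, keep a key, and then melt. Text is the column I want to tag.
When I run this code, I get the following error. Maybe I'm calling it incorrectly and need to specify the column? Thanks for your help.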
ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(ex2)))
print(ne_tree)
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-80-5d4582e937dd> in <module>
----> 1 ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(ex3)))
2 print(ne_tree)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\__init__.py in word_tokenize(text, language, preserve_line)
142 :type preserve_line: bool
143 """
--> 144 sentences = [text] if preserve_line else sent_tokenize(text, language)
145 return [
146 token for sent in sentences for token in _treebank_word_tokenizer.tokenize(sent)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\__init__.py in sent_tokenize(text, language)
104 """
105 tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
--> 106 return tokenizer.tokenize(text)
107
108
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in tokenize(self, text, realign_boundaries)
1275 Given a text, returns a list of the sentences in that text.
1276 """
-> 1277 return list(self.sentences_from_text(text, realign_boundaries))
1278
1279 def debug_decisions(self, text):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in sentences_from_text(self, text, realign_boundaries)
1329 follows the period.
1330 """
-> 1331 return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
1332
1333 def _slices_from_text(self, text):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in <listcomp>(.0)
1329 follows the period.
1330 """
-> 1331 return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
1332
1333 def _slices_from_text(self, text):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in span_tokenize(self, text, realign_boundaries)
1319 if realign_boundaries:
1320 slices = self._realign_boundaries(text, slices)
-> 1321 for sl in slices:
1322 yield (sl.start, sl.stop)
1323
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _realign_boundaries(self, text, slices)
1360 """
1361 realign = 0
-> 1362 for sl1, sl2 in _pair_iter(slices):
1363 sl1 = slice(sl1.start + realign, sl1.stop)
1364 if not sl2:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _pair_iter(it)
316 it = iter(it)
317 try:
--> 318 prev = next(it)
319 except StopIteration:
320 return
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self, text)
1333 def _slices_from_text(self, text):
1334 last_break = 0
-> 1335 for match in self._lang_vars.period_context_re().finditer(text):
1336 context = match.group() + match.group('after_tok')
1337 if self.text_contains_sentbreak(context):
TypeError: expected string or bytes-like object
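Solution
word_tokenize expects a single string, not a DataFrame, so you have to apply the function to each row value of the column: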
ex2['results'] = ex2.Text.apply(lambda x: nltk.ne_chunk(pos_tag(word_tokenize(x))))
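The traceback ends in a regular-expression call (finditer) that accepts only a single string, which is why passing the whole table fails while passing one cell works. Below is a minimal sketch of the row-wise pattern. The names tag_column and ner_tree are hypothetical, the frame is rebuilt from the table in the question, and the NLTK call appears only as a comment because it needs the NLTK data packages to be downloaded first. Skipping non-string cells (for example NaN) is an assumption about the desired behavior, not something the original answer does.

```python
import pandas as pd


def tag_column(df, tagger, column="Text", out="results"):
    """Return a copy of df with `tagger` applied to every string in `column`.

    `tagger` is any callable that takes one string. Non-string cells
    (e.g. NaN) are stored as None instead of being passed to the tagger.
    """
    result = df.copy()
    result[out] = result[column].apply(
        lambda value: tagger(value) if isinstance(value, str) else None
    )
    return result


# Frame rebuilt from the table in the question.
ex2 = pd.DataFrame(
    {"Order": [0, 1, 2, 3], "Text": ["John", "Chicago", "stuff", "question"]}
)

# With NLTK and its data installed, the tagger from the question would be:
#   def ner_tree(text):
#       return nltk.ne_chunk(pos_tag(word_tokenize(text)))
#   tagged = tag_column(ex2, ner_tree)
# A stand-in tagger (plain whitespace split) keeps this sketch runnable
# without the NLTK data packages:
tagged = tag_column(ex2, str.split)
print(tagged)
```

Because the Order column is carried along unchanged, each result stays tied to its key, which should make the later split-and-melt step described in the question easier.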