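python - How to use universal POS tags with the nltk.pos_tag() function?
Problem description
I have a text and I want to count how many "ADJ", "PRON", "VERB", "NOUN", etc. tokens it contains. I know there is the .pos_tag()
function, but it gives me a different set of tags, and I want the results labelled 'ADJ', 'PRON', 'VERB', 'NOUN'. Here is my code: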
import nltk
from nltk.corpus import state_union, brown
from nltk.corpus import stopwords
from nltk import ne_chunk
from nltk.tokenize import PunktSentenceTokenizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from collections import Counter
sentence = "this is my sample text that I want to analyze with programming language"
# tokenizing text (make a list with every word)
sample_tokenization = word_tokenize(sentence)
print("THIS IS TOKENIZED SAMPLE TEXT, LIST OF WORDS:\n\n", sample_tokenization)
print()
# tagging words
taged_words = nltk.pos_tag(sample_tokenization)  # word_tokenize already returns a list of tokens
print(taged_words)
print()
# showing the count of every type of word for new text
count_of_word_type = Counter(word_type for word,word_type in taged_words)
count_of_word_type_list = count_of_word_type.most_common() # making a list of tuples counts
print(count_of_word_type_list)
for w_type, num in count_of_word_type_list:
    print(w_type, num)
print()
The code above works, but I would like to find a way to get tags of this kind:
Tag    Meaning              English Examples
ADJ    adjective            new, good, high, special, big, local
ADP    adposition           on, of, at, with, by, into, under
ADV    adverb               really, already, still, early, now
CONJ   conjunction          and, or, but, if, while, although
DET    determiner, article  the, a, some, most, every, no, which
NOUN   noun                 year, home, costs, time, Africa
NUM    numeral              twenty-four, fourth, 1991, 14:24
PRT    particle             at, on, out, over per, that, up, with
PRON   pronoun              he, their, her, its, my, I, us
VERB   verb                 is, say, told, given, playing, would
.      punctuation marks    . , ; !
X      other                ersatz, esprit, dunno, gr8, univeristy
I saw that there is a chapter about this here: https://www.nltk.org/book/ch05.html
It says:
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
But I don't know how to apply this to my example sentence. Thanks for your help.
Solution
From https://github.com/nltk/nltk/blob/develop/nltk/tag/__init__.py#L135
>>> from nltk.tag import pos_tag
>>> from nltk.tokenize import word_tokenize
# Default Penn Treebank tagset.
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
# Universal POS tags.
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
[('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]
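To connect this back to the counting step in the question, here is a minimal sketch. The helper names `count_universal_tags` and `tag_with_universal` are made up for illustration and are not part of NLTK. The tagged pairs are copied from the universal-tagset output above, so the counting part runs without NLTK; the tagging helper assumes NLTK is installed along with the data packages it needs for tokenizing, tagging and the universal tagset mapping.

```python
from collections import Counter


def count_universal_tags(tagged_words):
    """Count how often each tag occurs in a list of (word, tag) pairs."""
    return Counter(tag for _word, tag in tagged_words)


def tag_with_universal(text):
    """Tokenize `text` and tag it with the universal tagset.

    Hypothetical convenience wrapper; assumes NLTK and its data are available.
    """
    # Imported lazily so the counting helper can be used without NLTK.
    from nltk import pos_tag, word_tokenize
    return pos_tag(word_tokenize(text), tagset="universal")


# (word, tag) pairs copied from the universal-tagset output shown above.
example_tagged = [
    ("John", "NOUN"), ("'s", "PRT"), ("big", "ADJ"), ("idea", "NOUN"), ("is", "VERB"),
    ("n't", "ADV"), ("all", "DET"), ("that", "DET"), ("bad", "ADJ"), (".", "."),
]

counts = count_universal_tags(example_tagged)
for tag, num in counts.most_common():
    print(tag, num)
```

For the sentence in the question, the same idea would be `count_universal_tags(tag_with_universal(sentence))`, which only differs from the original code by passing `tagset='universal'` to `pos_tag`.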