Which NLP measures should I use to compare the importance/centrality of certain terms across different documents?

Problem Description

Which NLP (natural language processing) measures can I use to gauge the importance and centrality of different words in a text or collection of texts?

Example: Suppose I have two corpora of judicial opinions. Corpus A contains opinions in which courts held manufacturers liable for negligently made products. Corpus B contains opinions with similar facts but opposite outcomes. What measures would let me say that certain terms are more "important" or "central" to the cases in corpus A than to the cases in corpus B?

What I've tried (a minimal sketch follows the list):
- Raw word frequency
- TF-IDF
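For concreteness, here is roughly what the TF-IDF attempt looked like, using scikit-learn's TfidfVectorizer; the two documents are toy placeholders standing in for the opinion texts:

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy placeholder documents standing in for the judicial opinions.
docs = [
    "The manufacturer was negligent in designing the product.",
    "The court found the manufacturer showed no intent to harm.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # rows = documents, columns = terms

# Rank the first document's terms by TF-IDF weight.
terms = vectorizer.get_feature_names_out()
weights = tfidf.toarray()[0]
print(sorted(zip(terms, weights), key=lambda pair: pair[1], reverse=True)[:5])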

I know there are more (e.g., from graph theory), but I'm not sure where to start and have a limited background. I would appreciate any suggestions or explanations of which measures to use, along with the pros and cons of each. FWIW, I'm using NLTK and am also somewhat familiar with spaCy.

Background: I'm planning an academic law review article that tries to explain why similar cases on a certain topic were decided differently in two different jurisdictions. My hypothesis: different terms are both more common and more important in each set of cases. For example, the cases in set A might use the term "intent" more than similar cases in set B, suggesting that the first set of cases is more concerned with that concept.

Tags: nlp, nltk, data-analysis, natural-language-processing

Solution


Regarding your question of how to "measure the importance and centrality of different words in a text": the textacy library has several algorithms for extracting key terms easily. The current state-of-the-art algorithm for this task seems to be YAKE, and textacy has an easy-to-use implementation of it. Note that textacy is built on top of spacy, so it would work something like this:

from textacy.ke import yake
import spacy
import wikipedia

nlp = spacy.load("en_core_web_sm")
text = wikipedia.page("Emmanuel Macron").content  # some string to extract key terms from
doc = nlp(text)

# yake returns the top terms as a list of (keyterm, score) tuples
keywords = yake(doc, normalize="lemma", include_pos=["NOUN", "PROPN", "ADJ"], window_size=5, topn=10)
print(keywords)

Regarding your final goal to "explain why similar cases [texts] on a certain topic are decided differently in two different jurisdictions": I'm not sure keyword extraction is going to help you much here. If I understand correctly, you have two corpora: corpus A with a group of texts with known outcome X, and corpus B with known outcome Y. You can apply keyword extraction to both corpora separately, but this will just return the words that are most central to the respective corpus. It will not tell you which keywords are most exclusive to corpus A compared to corpus B. (Though you might still gain some insights by interpreting each corpus's keywords qualitatively; a rough sketch of that follows.)
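To illustrate that qualitative route, here is a rough sketch that runs yake over each corpus and compares the resulting term sets; corpus_a_texts and corpus_b_texts are hypothetical placeholders for your opinion texts:

from textacy.ke import yake
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical placeholder corpora; substitute your actual opinion texts.
corpus_a_texts = ["The manufacturer was held liable for negligent design."]
corpus_b_texts = ["The court found no intent and rejected the negligence claim."]

def corpus_keyterms(texts, topn=20):
    # Collect YAKE key terms across all documents of one corpus.
    terms = set()
    for text in texts:
        doc = nlp(text)
        terms.update(term for term, score in yake(doc, normalize="lemma", topn=topn))
    return terms

keyterms_a = corpus_keyterms(corpus_a_texts)
keyterms_b = corpus_keyterms(corpus_b_texts)

# Set differences show terms surfacing in one corpus but not the other;
# useful for qualitative inspection, not a statistical measure of exclusivity.
print(keyterms_a - keyterms_b)
print(keyterms_b - keyterms_a)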

One alternative might be topic modeling. A topic model tells you which words are most exclusive to one topic compared to another (a topic being defined as a group of words that often occur together within and across texts). You could combine your two corpora A and B, run topic modeling on the combined corpus, and hope that certain topics (word combinations) are correlated with outcome X or Y. The best library in Python for topic modeling is Gensim (though I'm less familiar with it, and I have the impression that topic modeling libraries in R are more comprehensive than those in Python).
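If you go that route, a minimal Gensim LDA sketch could look like the following; the all_texts list is a hypothetical placeholder for the combined corpora A and B:

from gensim import corpora, models
from gensim.utils import simple_preprocess

# Hypothetical placeholder for the combined corpus (A + B).
all_texts = [
    "The manufacturer acted with intent and was held liable.",
    "The court found no negligence in the product design.",
]

# Tokenize and lowercase each document.
tokenized = [simple_preprocess(text) for text in all_texts]

# Map tokens to integer ids and build bag-of-words vectors.
dictionary = corpora.Dictionary(tokenized)
bow = [dictionary.doc2bow(tokens) for tokens in tokenized]

# Fit a small LDA model; num_topics is a free parameter you would tune.
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)

# Inspect the highest-weighted words per topic.
for topic_id, words in lda.print_topics():
    print(topic_id, words)

You could then compare each document's topic distribution (e.g., lda[bow[i]]) against the known outcomes X and Y to see whether certain topics track one outcome.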

