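scala - How can we extract named entities in Scala using any NLP library
Problem description
I have a huge text file, and I need to extract only the named entities from it. I am using Scala on a Databricks cluster for this.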
val input = sc.textFile("....Mypath...").flatMap(line => line.split("""\W+"""))
val namedEnt = something(input)
Can anyone tell me what code I need to write to get the named entities?
Solution
If you convert your input to a DataFrame (e.g. with .toDF), this is how you can get the named entities out:
An example of installing Spark NLP when starting spark-shell:
spark-shell --packages JohnSnowLabs:spark-nlp:2.4.0
Actual example:
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
SparkNLP.version()
// make sure you are using the latest release 2.4.x
// Download and load the pre-trained pipeline that has NER in English
// Full list: https://github.com/JohnSnowLabs/spark-nlp-models
val pipeline = PretrainedPipeline("recognize_entities_dl", lang="en")
// Transform your DataFrame into a new DataFrame that has an NER column
val annotation = pipeline.transform(inputDF)
// This would look something like this:
/*
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| id| text| document| sentence| token| embeddings| ner| entities|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| 1|Google has announ...|[[document, 0, 10...|[[document, 0, 10...|[[token, 0, 5, Go...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 5, Go...|
| 2|Donald John Trump...|[[document, 0, 92...|[[document, 0, 92...|[[token, 0, 5, Do...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 16, D...|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
*/
// This is where the results for entities are:
annotation.select("entities.result").show
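The example above assumes that inputDF already exists. Below is a minimal sketch of how it could be prepared and how the entities column could be flattened into one row per entity. The helper names (NerHelpers, readAsTextDF, entityStrings) are hypothetical and not part of Spark NLP. The sketch also assumes that the pretrained pipeline reads its input from a column named text, as the output table above suggests, and that each row holds a full line of text rather than single words, since the pipeline does its own tokenization.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical helpers for illustration only; they use plain Spark SQL calls.
object NerHelpers {

  // Read the file as one row per line into a single column named "text"
  // (assumed here to be the input column the pretrained pipeline expects).
  def readAsTextDF(spark: SparkSession, path: String): DataFrame =
    spark.read.textFile(path).toDF("text")

  // Turn the "entities" annotation column (an array of structs with a
  // "result" field) into one row per extracted entity string.
  def entityStrings(annotated: DataFrame): DataFrame =
    annotated.select(explode(col("entities.result")).as("entity"))
}

// Possible usage in spark-shell or a Databricks notebook, where `spark`
// and the `pipeline` value from the example above are already defined:
//
//   val inputDF    = NerHelpers.readAsTextDF(spark, "....Mypath...")
//   val annotation = pipeline.transform(inputDF)
//   NerHelpers.entityStrings(annotation).show(false)
```

Keeping whole lines in the text column, instead of splitting on \W+ as in the question, leaves the surrounding context available to the NER model.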
Let me know if you have any questions or problems with your input data and I'll update my answer.