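r - I extracted a list of words from a larger list of 72 items. How can I determine which list number (1-72) each word came from?
Problem description
I imported 720 sentences from this website (https://www.cs.columbia.edu/~hgs/audio/harvard.html). There are 72 lists, each containing 10 sentences, and I saved them in a suitable structure. I did these steps in R. The code is shown below.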
#Q.1a
library(xml2)
library(rvest)
url <- 'https://www.cs.columbia.edu/~hgs/audio/harvard.html'
sentences <- read_html(url) %>%
  html_nodes("li") %>%
  html_text()
headers <- read_html(url) %>%
  html_nodes("h2") %>%
  html_text()
#Q.1b
harvardList <- list()
sentenceList <- list()
n <- 1
for(sentence in sentences){
  sentenceList <- c(sentenceList, sentence)
  print(sentence)
  if(length(sentenceList) == 10) { #if we have 10 sentences
    harvardList[[headers[n]]] <- sentenceList #Those 10 sentences and the respective list from which they are derived, are appended to the harvard list
    sentenceList <- list() #emptying our temporary list which those 10 sentences were shuffled into
    n <- n+1 #set our list name to the next one
  }
}
#Q.1c
sentences1 <- split(sentences, ceiling(seq_along(sentences)/10))
getwd()
setwd("/Users/juliayudkovicz/Documents/Homework 4 Datascience")
sentences.df <- do.call("rbind", lapply(sentences1, as.data.frame))
names(sentences.df)[1] <- "Sentences"
write.csv(sentences.df, file = "sentences1.csv", row.names = FALSE)
Then, in Python, I built a list of all words ending in "ing" together with their frequencies, that is, how many times each one appears across all 72 lists.
import os
import pandas as pd

path = "/Users/juliayudkovicz/Documents/Homework 4 Datascience"
os.chdir(path)
cwd1 = os.getcwd()
print(cwd1)

df = pd.read_csv(r'/Users/juliayudkovicz/Documents/Homework 4 Datascience/sentences1.csv', sep='\t', engine='python')
print(df)
# regex=False treats "." as a literal period rather than a regex wildcard
df['Sentences'] = df['Sentences'].str.replace(".", "", regex=False)
print(df)
sen_List = df['Sentences'].values.tolist()
print(sen_List)

# Collect every whitespace-separated word that ends in "ing".
# Note: a word followed by a comma or a quote (e.g. 'ring,') is not matched here.
ingWordList = []
for line in sen_List:
    for word in line.split():
        if word.endswith('ing'):
            ingWordList.append(word)

# Count occurrences, ignoring case and stray double quotes
ingWordCountDictionary = {}
for word in ingWordList:
    word = word.replace('"', "")
    word = word.lower()
    if word in ingWordCountDictionary:
        ingWordCountDictionary[word] = ingWordCountDictionary[word] + 1
    else:
        ingWordCountDictionary[word] = 1
print(ingWordCountDictionary)

# Write one "word, count" pair per line
with open("ingWordCountDictionary.txt", "w") as f:
    for key, value in ingWordCountDictionary.items():
        keyValuePairToWrite = "%s, %s\n" % (key, value)
        f.write(keyValuePairToWrite)
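The counting step above can also be sketched with `collections.Counter` and a regular expression that drops punctuation before the `ing` check. This is only a minimal sketch, not the asker's code: the sample sentences and the helper name `count_ing_words` are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical sample sentences, standing in for the 720 scraped ones.
sample_sentences = [
    "The king was singing, and the bells were ringing.",
    "Bring the ring to the king.",
    '"Singing" is fun.',
]

def count_ing_words(sentences):
    """Count lowercase words ending in 'ing' across all sentences."""
    counts = Counter()
    for sentence in sentences:
        # [a-z']+ keeps only letters and apostrophes, so trailing commas,
        # periods and quotes never become part of a word.
        for word in re.findall(r"[a-z']+", sentence.lower()):
            if word.endswith("ing"):
                counts[word] += 1
    return counts

ing_counts = count_ing_words(sample_sentences)
print(ing_counts.most_common())
```

Because punctuation is removed first, a word such as `singing,` is counted too, which the `split()`-based loop would skip.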
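Now I have been asked to create a dataset that shows which list (1 of 72) each "ing" word comes from. This is the part I don't know how to do. I know the words are a subset of the big 72-list collection, but how do I work out which list each word came from?
The expected output should look like this:
[List Number] [-ing Word]
List 1 swing, ring, etc.,
List 2 moving
and so on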
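One possible way to get that shape, assuming the sentences are still in page order and every list holds exactly 10 of them, is to derive the list number from each sentence's position. The sketch below uses placeholder sentences and a hypothetical helper `ing_words_by_list`. It matches on spelling alone, so nouns such as "ring" are included.

```python
import re
from collections import defaultdict

LIST_SIZE = 10  # each list holds 10 sentences, per the question

def ing_words_by_list(sentences, list_size=LIST_SIZE):
    """Map a 1-based list number to the sorted '-ing' words found in that list."""
    words_by_list = defaultdict(set)
    for index, sentence in enumerate(sentences):
        # Sentences 0-9 belong to list 1, 10-19 to list 2, and so on.
        list_number = index // list_size + 1
        for word in re.findall(r"[a-z']+", sentence.lower()):
            if word.endswith("ing"):
                words_by_list[list_number].add(word)
    return {number: sorted(words) for number, words in sorted(words_by_list.items())}

# Hypothetical data: 20 placeholder sentences, i.e. two lists of 10.
sample = ["Filler sentence."] * 20
sample[3] = "The swing hit the ring."
sample[12] = "They were moving."

result = ing_words_by_list(sample)
for number, words in result.items():
    print(f"List {number}\t{', '.join(words)}")
```

Lists with no matching word are simply absent from the result.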
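Solution
Here is one way to do it. Judging from your expected result, you seem to want verbs in the progressive form (V-ing). (I do not understand why you have king in your result. If you have king, then spring should be there as well, for example.) If you need to take lexical classes into account, I think you want the koRpus package. If you do not, you could use the textstem package, for example.
First, I scraped the link and created a data frame. Then I split the sentences into words with unnest_tokens() from the tidytext package and kept only the words ending in "ing". After that, I used treetag() from the koRpus package. You need to install TreeTagger yourself before using the package. Finally, I counted how many times these progressive-form verbs appear in the data set. I hope this helps.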
library(tidyverse)
library(rvest)
library(tidytext)
library(koRpus)
read_html("https://www.cs.columbia.edu/~hgs/audio/harvard.html") %>%
  html_nodes("h2") %>%
  html_text() -> so_list
read_html("https://www.cs.columbia.edu/~hgs/audio/harvard.html") %>%
  html_nodes("li") %>%
  html_text() -> so_text
# Create a data frame with one row per sentence, labelled with its list name
sodf <- tibble(list_name = rep(so_list, each = 10),
               text = so_text)
# Split sentences into words and keep the words ending in "ing".
unnest_tokens(sodf, input = text, output = word) %>%
  filter(grepl(x = word, pattern = "ing$")) -> sowords
# Use the koRpus package to tag and lemmatize the words in sowords$word.
treetag(sowords$word, treetagger = "manual", format = "obj",
        TT.tknz = FALSE, lang = "en", encoding = "UTF-8",
        TT.options = list(path = "C:\\tree-tagger-windows-3.2\\TreeTagger",
                          preset = "en")) -> out
# Access the data frame and filter the words. It seems that you are looking
# for verbs, so that is what I did here.
filter(out@TT.res, grepl(x = token, pattern = "ing$") & wclass == "verb") %>%
  count(token)
# A tibble: 16 x 2
#    token         n
#    <chr>     <int>
#  1 adding        1
#  2 bring         4
#  3 changing      1
#  4 drenching     1
#  5 dying         1
#  6 lodging       1
#  7 making        1
#  8 raging        1
#  9 shipping      1
# 10 sing          1
# 11 sleeping      2
# 12 wading        1
# 13 waiting       1
# 14 wearing       1
# 15 winding       2
# 16 working       1
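The count above is per word and no longer carries the list name. As a rough sketch of how the list label could be kept next to each word, the Python below pairs every sentence with its list name in the same way as rep(so_list, each = 10) and counts per (list name, word). It assumes plain spelling-based matching with no part-of-speech tagging, so it is not equivalent to the TreeTagger filter. The helper `count_ing_by_list` and the sample data are hypothetical.

```python
import re
from collections import Counter

def count_ing_by_list(list_names, sentences, per_list=10):
    """Count '-ing' words per (list name, word) pair."""
    if len(sentences) != len(list_names) * per_list:
        raise ValueError("expected exactly per_list sentences for every list name")
    counts = Counter()
    for index, sentence in enumerate(sentences):
        # Same pairing as rep(so_list, each = 10): consecutive blocks share a name.
        list_name = list_names[index // per_list]
        for word in re.findall(r"[a-z']+", sentence.lower()):
            if word.endswith("ing"):
                counts[(list_name, word)] += 1
    return counts

# Hypothetical data with 2 sentences per list to keep the example short.
names = ["List 1", "List 2"]
texts = ["Bring the ring.", "Bring it now.", "They kept moving.", "No match here."]

counts = count_ing_by_list(names, texts, per_list=2)
for (list_name, word), n in sorted(counts.items()):
    print(list_name, word, n)
```

With the real data, `names` would be the scraped h2 headers and `texts` the 720 sentences, with `per_list` left at its default of 10.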