I extracted a list of words from a larger list of 72 items. How do I determine which list number (1-72) the words came from?

Problem description

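I imported 720 sentences from this website (https://www.cs.columbia.edu/~hgs/audio/harvard.html). There are 72 lists, each containing 10 sentences, and I saved them in an appropriate structure. I did these steps in R. The code is shown below.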

#Q.1a
library(xml2)
library(rvest)
url <- 'https://www.cs.columbia.edu/~hgs/audio/harvard.html'
sentences <- read_html(url) %>%
  html_nodes("li") %>%
  html_text()
headers <- read_html(url) %>%
  html_nodes("h2") %>%
  html_text()

#Q.1b
harvardList <- list()
sentenceList <- list()
n <- 1

for(sentence in sentences){
  sentenceList <- c(sentenceList, sentence)
  print(sentence)
  if(length(sentenceList) == 10) { #if we have 10 sentences
    harvardList[[headers[n]]] <- sentenceList #Those 10 sentences and the respective list from which they are derived, are appended to the harvard list
    sentenceList <- list() #emptying our temporary list which those 10 sentences were shuffled into
    n <- n+1 #set our list name to the next one
  }
}

#Q.1c
sentences1 <- split(sentences, ceiling(seq_along(sentences)/10))
getwd()
setwd("/Users/juliayudkovicz/Documents/Homework 4 Datascience")
sentences.df <- do.call("rbind", lapply(sentences1, as.data.frame))
names(sentences.df)[1] <- "Sentences"
write.csv(sentences.df, file = "sentences1.csv", row.names = FALSE)
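The grouping step in Q.1b (every 10 consecutive sentences form one named list) can be sketched in Python as well. This is only an illustration: the sentences and the header names below are hypothetical placeholders, not the scraped data, and the header format is an assumption.

```python
# Hypothetical stand-in data: 30 placeholder sentences instead of the 720 scraped ones.
sentences = [f"Sentence {i}." for i in range(1, 31)]
headers = [f"List {k}" for k in range(1, 4)]  # assumed header format

LIST_SIZE = 10  # each list holds 10 sentences

# Slice the flat list into consecutive blocks of 10 and key each block by its header.
harvard_lists = {
    header: sentences[i * LIST_SIZE:(i + 1) * LIST_SIZE]
    for i, header in enumerate(headers)
}

print(len(harvard_lists))          # number of lists built
print(harvard_lists["List 2"][0])  # first sentence of the second block
```

Because the blocks are consecutive, the position of a sentence in the flat vector is enough to recover its list later on.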

Then, in Python, I computed a list of all words ending in "ing" along with their frequencies, that is, how many times they appear across all 72 lists.

import os

path = "/Users/juliayudkovicz/Documents/Homework 4 Datascience"
os.chdir(path)
cwd1 = os.getcwd()
print(cwd1)

import pandas as pd
df = pd.read_csv(r'/Users/juliayudkovicz/Documents/Homework 4 Datascience/sentences1.csv', sep='\t', engine='python')
print(df)
# regex=False treats "." as a literal period rather than a regex wildcard
df['Sentences'] = df['Sentences'].str.replace(".", "", regex=False)
print(df)
sen_List = df['Sentences'].values.tolist()
print(sen_List)

ingWordList = []
for line in sen_List:
    for word in line.split():
        if word.endswith('ing'):
            ingWordList.append(word)

ingWordCountDictionary = {}

for word in ingWordList:
    word = word.replace('"', "")
    word = word.lower()
    if word in ingWordCountDictionary:
        ingWordCountDictionary[word] = ingWordCountDictionary[word] + 1
    else: 
        ingWordCountDictionary[word] = 1

print(ingWordCountDictionary)

f = open("ingWordCountDictionary.txt", "w")

for key, value in ingWordCountDictionary.items():
    keyValuePairToWrite = "%s, %s\n"%(key, value)
    f.write(keyValuePairToWrite)


f.close()
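As a side note, the counting loop above can be written more compactly with `collections.Counter`. The sketch below is an alternative, not the code used for the results in this question: it uses hypothetical sample sentences, and it swaps the `endswith` test for a regular expression so that capital letters and adjacent punctuation (commas, quotes) do not hide a match.

```python
import re
from collections import Counter

# Hypothetical sample sentences, not the real data set.
sentences = [
    "The swing was moving slowly.",
    "Bring the ring, then sing.",
    '"Moving" is hard work.',
]

# Whole lowercase words ending in "ing"; word boundaries ignore nearby punctuation.
ing_pattern = re.compile(r"\b[a-z]+ing\b")

ing_counts = Counter(
    word
    for sentence in sentences
    for word in ing_pattern.findall(sentence.lower())
)

print(ing_counts)
```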

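Now I have been asked to create a data set that shows which list (1 of 72) each "ing" word comes from. This is the part I don't know how to do. I know the words are a subset of the full collection of 72 lists, but how do I work out which list each word came from?

The expected output should look like this: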

[List Number] [-ing Word]
List 1        swing, ring, etc.,
List 2        moving

and so on.

Tags: r, string

Solution


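Here is one approach for you. Judging from your expected result, you seem to want verbs in the progressive form (V-ing). (I do not understand why king is in your result. If king is there, then spring should be there as well, for example.) If you need to take word classes into account, I think you want to use the koRpus package. If not, you could use the textstem package, for example.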
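To illustrate why a suffix test alone cannot separate progressive verbs from other words, here is a tiny Python sketch with a hypothetical word sample: every word passes the test, including nouns such as king and spring.

```python
# Hypothetical sample of words that all end in "ing".
words = ["king", "spring", "ring", "moving", "sleeping"]

# A plain suffix test keeps every one of them, nouns included.
suffix_matches = [w for w in words if w.endswith("ing")]
print(suffix_matches)
```

Telling these apart needs part-of-speech information, which is what a tagger provides.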

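First, I scraped the link and created a data frame. Then I split the sentences into words with unnest_tokens() from the tidytext package and kept only the words ending in "ing". Next, I used treetag() from the koRpus package; you need to install TreeTagger yourself before using it. Finally, I counted how many times these progressive-form verbs appear in the data set. I hope this helps.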

library(tidyverse)
library(rvest)
library(tidytext)
library(koRpus)

read_html("https://www.cs.columbia.edu/~hgs/audio/harvard.html") %>% 
  html_nodes("h2") %>% 
  html_text() -> so_list

read_html("https://www.cs.columbia.edu/~hgs/audio/harvard.html") %>% 
  html_nodes("li") %>% 
  html_text() -> so_text


# Create a data frame

sodf <- tibble(list_name = rep(so_list, each = 10),
           text = so_text)

# Split sentences into words and get words ending with ING.

unnest_tokens(sodf, input = text, output = word) %>% 
  filter(grepl(x = word, pattern = "ing$")) -> sowords

# Use koRpus package to lemmatize the words in sowords$word.

treetag(sowords$word, treetagger = "manual", format = "obj",
        TT.tknz = FALSE , lang = "en", encoding = "UTF-8",
        TT.options = list(path = "C:\\tree-tagger-windows-3.2\\TreeTagger",
                          preset = "en")) -> out

# Access to the data frame and filter the words. It seems that you are looking
# for verbs. So I did that here.

filter(out@TT.res, grepl(x = token, pattern = "ing$") & wclass == "verb") %>% 
  count(token)

# A tibble: 16 x 2
#   token         n
#   <chr>     <int>
# 1 adding        1
# 2 bring         4
# 3 changing      1
# 4 drenching     1
# 5 dying         1
# 6 lodging       1
# 7 making        1
# 8 raging        1
# 9 shipping      1
#10 sing          1
#11 sleeping      2
#12 wading        1
#13 waiting       1
#14 wearing       1
#15 winding       2
#16 working       1
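The count above is taken over the whole data set. For the per-list table the question asks for, the list a word belongs to follows from the position of its sentence, since every 10 consecutive sentences form one list. The Python sketch below is a minimal illustration under stated assumptions: the sentences are hypothetical placeholders kept in page order, and it uses a plain "ing" suffix match rather than the word-class filter, so nouns such as ring are included.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the 720 sentences in page order; 20 sentences = 2 lists.
sentences = (
    ["The swing hung from the ring."] + ["Plain filler text."] * 9
    + ["They kept moving on."] + ["More filler text."] * 9
)

LIST_SIZE = 10
ing_pattern = re.compile(r"\b[a-z]+ing\b")

words_by_list = defaultdict(list)
for index, sentence in enumerate(sentences):
    list_number = index // LIST_SIZE + 1  # sentences 0-9 -> list 1, 10-19 -> list 2, ...
    for word in ing_pattern.findall(sentence.lower()):
        if word not in words_by_list[list_number]:  # keep each word once per list
            words_by_list[list_number].append(word)

for list_number in sorted(words_by_list):
    print(f"List {list_number}\t{', '.join(words_by_list[list_number])}")
```

With the real data, `sentences` would be the `Sentences` column read from the CSV, provided the row order written by R is left unchanged.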
