How do I get a vocabulary from the columns of a data frame?

Problem description

I have a fairly large dataset stored in a data frame. So large, in fact, that trying to pare the dataset down to produce a sample dataset crashed my text editor. So instead, here is a link to the dataset I am working with:

https://github.com/moonman239/Capstone-project/blob/master/data.zip

For planning purposes, I need to retrieve a vocabulary of the words in the question, article_title, and paragraph_context columns.

However, it seems that somewhere in the process of splitting and merging columns, I inadvertently created some words by joining two words end to end (e.g. "raised" and "in" became "raisedin", "catalans" and "what" became "catalanswhat").

### Loading JSON datasets

import json
import re

import pandas as pd

# Collapse any run of non-word characters into a single space.
regex = re.compile(r'\W+')
def readFile(filename):
  with open(filename) as file:
    fields = []
    JSON = json.loads(file.read())
    for article in JSON["data"]:
      articleTitle = article["title"]
      for paragraph in article["paragraphs"]:
        paragraphContext = paragraph["context"]
        for qas in paragraph["qas"]:
          question = qas["question"]
          for answer in qas["answers"]:
            fields.append({
              "question": question,
              "answer_text": answer["text"],
              "answer_start": answer["answer_start"],
              "paragraph_context": paragraphContext,
              "article_title": articleTitle,
            })
  fields = pd.DataFrame(fields)
  fields["question"] = fields["question"].str.replace(regex," ")
  assert not (fields["question"].str.contains("catalanswhat").any())
  fields["paragraph_context"] = fields["paragraph_context"].str.replace(regex," ")
  fields["answer_text"] = fields["answer_text"].str.replace(regex," ")
  assert not (fields["answer_text"].str.contains("catalanswhat").any())
  fields["article_title"] = fields["article_title"].str.replace("_"," ")
  assert not (fields["article_title"].str.contains("catalanswhat").any())
  return fields
# Load training dataset.
trainingData = readFile("train-v1.1.json")

# Vocabulary functions
def vocabulary():
  data_frame = trainingData
  data_frame = data_frame.astype("str")
  text_split = pd.concat((data_frame["question"],data_frame["paragraph_context"],data_frame["article_title"]),ignore_index=True)
  text_split = text_split.str.split()
  words = set()
  text_split.apply(words.update)
  return words
def vocabularySize():
  return len(vocabulary())
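As a sanity check, the concat/split/update pipeline above can be exercised on a toy frame (the column names mirror the real dataset, but the rows here are invented):

```python
import pandas as pd

# Toy frame with the same three columns the real dataset uses.
df = pd.DataFrame({
    "question": ["What is pandas", "Who wrote it"],
    "paragraph_context": ["pandas is a library", "Wes McKinney wrote it"],
    "article_title": ["pandas library", "pandas history"],
})

# Same approach as vocabulary(): stack the columns into one Series,
# split each row on whitespace, and pour every token into a set.
text = pd.concat((df["question"], df["paragraph_context"], df["article_title"]),
                 ignore_index=True)
tokens = text.str.split()
words = set()
tokens.apply(words.update)

print(sorted(words))
```

On clean input this yields only genuine tokens, which is why the run-together words in the question must be introduced upstream, before the split.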

Alternative code, which also fails:

def vocabulary():
  data_frame = trainingData
  data_frame = data_frame.astype("str")
  concat = data_frame["question"].str.cat(sep=" ",others=[data_frame["paragraph_context"],data_frame["article_title"]])
  concat = concat.str.split(" ")
  words = set()
  concat.apply(words.update)
  print(words)
  assert "raisedin" not in words
  return words
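For reference, `Series.str.cat` with `others=` concatenates element-wise, joining row i of each Series into one string, and omitting `sep` glues the strings together with nothing in between. A minimal sketch with made-up rows shows how that produces exactly the kind of run-together token being chased here:

```python
import pandas as pd

q = pd.Series(["what is x", "who did y"])
ctx = pd.Series(["x is a thing", "y was done"])

# With sep=" ": row i of q is joined to row i of ctx with a space between.
merged = q.str.cat(others=[ctx], sep=" ")
print(merged.tolist())

# Without sep the pieces are glued directly ("...xx is a thing"),
# merging the last word of one column with the first word of the next.
merged_nosep = q.str.cat(others=[ctx])
print(merged_nosep.tolist())
```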

Tags: python, pandas

Solution


Here is how I solved the problem:

import pandas as pd
from pandas import json_normalize
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_json('train-v1.1.json')

words = []
for idx, row in df.iterrows():
    #title
    words.append(json_normalize(df['data'][idx])['title'].str.replace("_"," ").to_string(index = False))
    #paragraph context
    words.append(json_normalize(df['data'][idx], record_path = 'paragraphs')['context'].to_string(index = False))
    #question
    words.append(json_normalize(df['data'][idx], record_path = ['paragraphs', 'qas'])['question'].to_string(index = False))


vectorizer = CountVectorizer()
count = vectorizer.fit_transform(words)
vectorizer.get_feature_names()  # on scikit-learn >= 1.0, use get_feature_names_out()

sklearn has a feature that does literally what you want to do: get all the individual words out of a collection of texts. To use it, we need to get all of the data into one list or Series.

We build that list by first reading the file. I noticed that there are many JSON records embedded inside it, so next we loop over the different JSON entries, extract the data we want, and append it to a list called words.

We extract the information we need with the following:

json_normalize(df['data'][idx], record_path = ['paragraphs', 'qas'])['question'].to_string(index = False)

We look at the data column of df, which holds the individual JSON records. We navigate inside each JSON record down to the rows we want via record_path. Next we take the column we want, convert it all to one string, and append that string to our master words list. We do this for all the different JSON entries.
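To make the record_path navigation concrete, here is a toy record shaped like one entry of the SQuAD-style data list (the contents are invented):

```python
from pandas import json_normalize

article = {
    "title": "Poultry",
    "paragraphs": [
        {"context": "Chickens are farmed birds.",
         "qas": [{"question": "What are chickens?"},
                 {"question": "Where are chickens farmed?"}]},
    ],
}

# record_path walks into the nested lists: paragraphs -> qas,
# yielding one flattened row per question.
questions = json_normalize(article, record_path=["paragraphs", "qas"])["question"]
print(questions.tolist())
```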

If you want to perform more string manipulation (such as replacing "_" with " "), you can do it inside the for loop or on the master words list. I only did that for the titles in my case.

Finally, we count up the words. We create a CountVectorizer called vectorizer, and we fit and transform our list. At the end, we can interrogate our CountVectorizer via the get_feature_names() function to see every individual word. Note that if there are typos in the text, they will show up too.

Edit

You can use the code below to search for words and see where they occur. Change the values in checking to whatever you want.

import pandas as pd
from pandas import json_normalize
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_json('train-v1.1.json')

vectorizer = CountVectorizer()
checking = ['raisedin']

for idx, row in df.iterrows():
    title = []
    para = []
    quest = []

    getTitle = json_normalize(df['data'][idx])['title'].str.replace("_"," ")
    getPara = json_normalize(df['data'][idx], record_path = 'paragraphs')['context']
    getQuest = json_normalize(df['data'][idx], record_path = ['paragraphs', 'qas'])['question']

    title.append(getTitle.str.replace("_"," ").to_string(index = False))
    para.append(getPara.to_string(index = False))
    quest.append(getQuest.to_string(index = False))

    for word in checking:
        for allwords in [getTitle, getPara, getQuest]:
            count = vectorizer.fit_transform(allwords)
            test = vectorizer.get_feature_names()
            if word in test:
                print(getTitle)
                print(f"{word} is in: " +  allwords.loc[allwords.str.contains(word)])

0    Poultry
Name: title, dtype: object
93    raisedin is in: How long does it take for an broiler raisedin...
Name: question, dtype: object

