Putting Word documents into a data frame

Problem description

I have several Word documents, and I want to collect their contents into a data frame before exporting to Excel. So far I have this code:

import glob

import docx2txt

my_word_files = glob.glob(r"C:\Users\.......\*.docx")

for file in my_word_files:
    # Extract the document's text as a single string.
    word = docx2txt.process(file)

This converts each Word document into a string whose contents look like this:

Question

: What is your full name?

<ANSWER_1>

John Smith

<ANSWER_1>

etc....

Each question starts with ":" and each answer is enclosed between two <ANSWER_...> markers. What I want is to turn this into a data frame like the following:

What is your full name?   Question2    Question3   etc...
John Smith                Answer2      Answer3

Each row would hold the answers from one Word file, so that everything is neatly collated.

Tags: python, pandas, dataframe

Solution


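Assuming you start with the text loaded into a pandas data frame, one line per row, we can apply a few logical operations before pivoting to the output you want.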

from io import StringIO
import pandas as pd
import numpy as np

d = """: What is your full name?

<ANSWER_1>

John Smith

<ANSWER_1>
: What is your Age
<ANSWER_1>
25
<ANSWER_1>
<ANSWER_1>
: Where are you located
<ANSWER_1>
London, fam.
<ANSWER_1>
<ANSWER_2>
Engurrland
<ANSWER_2>

"""

df = pd.read_csv(StringIO(d),sep='\t',header=None)
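
The `read_csv` call above treats every non-blank line as one row, but it may misparse lines that contain tab or double-quote characters. As a minimal sketch of an alternative, assuming the string returned by `docx2txt.process` is available in a variable (the name `text` and the helper `text_to_frame` below are illustrative, not part of the original answer), the text can be split on line breaks instead:

```python
import pandas as pd


def text_to_frame(text):
    """Return a one-column data frame with one row per non-blank line.

    Hypothetical helper: the single column is labelled 0 so that the
    later steps, which refer to df[0], keep working unchanged.
    """
    rows = [line.strip() for line in text.splitlines() if line.strip()]
    return pd.DataFrame({0: rows})


# Made-up sample in the same shape as the question's text.
text = ": What is your full name?\n\n<ANSWER_1>\n\nJohn Smith\n\n<ANSWER_1>\n"
df = text_to_frame(text)
print(df)
```

Blank lines are dropped here as well, so the row numbering should match what `read_csv` produces with its default settings.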

First, let's build a sequence so that the questions and answers have a logical order, and drop any rows whose text contains < or >.

# .copy() avoids a SettingWithCopyWarning when new columns are added below.
df = df[~df[0].str.contains('<|>')].copy()

df['sequence'] = df.groupby(df[0].str.contains('^:').cumsum()).cumcount()

df['questionSequence'] = np.where(
    df[0].str.contains("^:"), df.groupby(df[0].str.contains("^:")).cumcount(), np.nan
)

df['questionSequence'] = df['questionSequence'].ffill()

print(df)

                          0     sequence  questionSequence
0   : What is your full name?         0               0.0
2                  John Smith         1               0.0
4          : What is your Age         0               1.0
6                          25         1               1.0
9     : Where are you located         0               2.0
11               London, fam.         1               2.0
14                 Engurrland         2               2.0
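
The `sequence` column comes from a common pandas idiom: a running sum over a boolean "is this a question?" flag gives every line a block number, and `cumcount` then numbers the lines inside each block. A small sketch on made-up lines (the variable names are illustrative only):

```python
import pandas as pd

# Made-up input: two questions, the second with two answer lines.
lines = pd.Series([": Q1", "a", ": Q2", "b", "c"])

# True on question lines, False on answer lines.
is_question = lines.str.startswith(":")

# The running sum increases by one at each question, so every line
# carries the number of the question block it belongs to.
block = is_question.cumsum()

# Position of each line inside its block: 0 is the question itself,
# 1 and above are its answers.
sequence = lines.groupby(block).cumcount()

print(pd.DataFrame({"line": lines, "block": block, "sequence": sequence}))
```

Rows with `sequence` equal to 0 are therefore the questions, which is what the later filtering step relies on.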

Next, we split the questions out into their own column, separate from the answers, and strip the pesky colon while we are at it.

df['question'] = np.where(df['sequence'].eq(0),df[0],np.nan)
df['question'] = df['question'].ffill().str.strip(': ')

This is shaping up nicely. Now let's remove the question rows from column 0 by filtering on the sequence column.

df1 = df[df['sequence'].ne(0)].copy()

print(df1)

              0    sequence  questionSequence                 question
2     John Smith         1               0.0  What is your full name?
6             25         1               1.0         What is your Age
11  London, fam.         1               2.0    Where are you located
14    Engurrland         2               2.0    Where are you located

Finally, we use a MultiIndex on the columns to build your pivot table with the questions in their original order.

final = pd.crosstab(
    df1["sequence"], [df1["questionSequence"], df1["question"]], df1[0], aggfunc="first"
)
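
The question asks for one row per Word file. As a rough sketch of how the steps above might be wrapped up for several documents, the function below turns one document's text into a question-to-answer mapping, and the mappings are then stacked into a single table. Everything named here (`document_to_row`, the file names, the sample texts, and the choice to join multiple answers with "; ") is an assumption made for illustration, not part of the original answer.

```python
import pandas as pd


def document_to_row(text):
    """Return {question: answer} for one document's text (hypothetical helper)."""
    lines = pd.Series([ln.strip() for ln in text.splitlines() if ln.strip()])
    # Drop the <ANSWER_n> marker lines, as in the filtering step above.
    lines = lines[~lines.str.contains("<|>")]
    is_question = lines.str.startswith(":")
    # Label every line with the most recent question, minus the colon.
    question = lines.where(is_question).ffill().str.strip(": ")
    answers = lines[~is_question]
    # Assumption: several answers to one question are joined with "; ".
    # Lines before the first question have no label and are dropped.
    joined = answers.groupby(question[~is_question], sort=False).agg("; ".join)
    return joined.to_dict()


# Made-up texts standing in for the strings docx2txt.process would return.
text_a = (
    "Question\n\n: What is your full name?\n\n<ANSWER_1>\n\nJohn Smith\n\n"
    "<ANSWER_1>\n: Where are you located\n<ANSWER_1>\nLondon, fam.\n"
    "<ANSWER_1>\n<ANSWER_2>\nEngurrland\n<ANSWER_2>\n"
)
text_b = (
    ": What is your full name?\n<ANSWER_1>\nJane Doe\n<ANSWER_1>\n"
    ": Where are you located\n<ANSWER_1>\nLeeds\n<ANSWER_1>\n"
)
documents = {"a.docx": text_a, "b.docx": text_b}

# One row per document, one column per question.
table = pd.DataFrame(
    [document_to_row(t) for t in documents.values()],
    index=list(documents.keys()),
)
print(table)
# table.to_excel("answers.xlsx")  # optional export; needs an Excel writer such as openpyxl
```

In real use the `documents` mapping would be filled inside the loop over `my_word_files`, with each file path as the key and the output of `docx2txt.process(file)` as the value.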
