How can I get table data and plain text separately from a docx file?

Problem description

For example, in my .docx file I have:

How can I separate all of these items from the text?

I previously used this function:

# cElementTree was removed in Python 3.9; ElementTree uses the C accelerator itself.
from xml.etree.ElementTree import XML
import zipfile

import pandas as pd  # required for the DataFrames returned below

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'


def get_docx_text(path):
    """
    Take the path of a docx file as argument and return three DataFrames:
    body text, footer text, and header text (in that order).
    """
    document = zipfile.ZipFile(path)
    contentToRead = ["header2.xml", "document.xml", "footer2.xml"]
    paragraphs = []
    text, footer, header = [], [], []
    
    for xmlfile in contentToRead:
        xml_content = document.read('word/{}'.format(xmlfile))
        tree = XML(xml_content)
        # Element.getiterator() was removed in Python 3.9; iter() is the replacement.
        for paragraph in tree.iter(PARA):
            texts = [node.text
                     for node in paragraph.iter(TEXT)
                     if node.text]
            if texts:
                textData = ''.join(texts)
                if xmlfile == "footer2.xml":
                    footer.append(textData)
                elif xmlfile == "header2.xml":
                    header.append(textData)
                else:
                    text.append(textData)

    document.close()
    return (pd.DataFrame(text, columns=['Text']),
            pd.DataFrame(footer, columns=['Text']),
            pd.DataFrame(header, columns=['Text']))

But this part

else:
    text.append(textData)

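picks up all of the data on the page (tables, plain text, and so on). How can I get the table data separately, the same way the footer and header are separated?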

Tags: python, docx

Solution
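
One possible approach, offered as a minimal sketch rather than a definitive answer: in WordprocessingML, tables are `w:tbl` elements whose rows (`w:tr`) and cells (`w:tc`) wrap ordinary `w:p` paragraphs, so the paragraphs inside a table can be set aside and everything else treated as plain text. The helper names `paragraph_text`, `split_tables_and_text`, and `get_docx_tables_and_text` are hypothetical, and the sketch assumes rows and cells are direct children of their table and row (no content-control wrappers). A nested table is reported as its own entry.

```python
import zipfile
from xml.etree.ElementTree import XML

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
TABLE = WORD_NAMESPACE + 'tbl'
ROW = WORD_NAMESPACE + 'tr'
CELL = WORD_NAMESPACE + 'tc'


def paragraph_text(paragraph):
    """Join the text of every w:t node inside one paragraph."""
    return ''.join(node.text for node in paragraph.iter(TEXT) if node.text)


def split_tables_and_text(tree):
    """Return (tables, text) for a parsed word/document.xml tree.

    tables is a list of tables, each a list of rows, each a list of cell strings.
    text is a list of non-empty paragraphs that are not inside any table.
    """
    tables = []
    table_paragraph_ids = set()
    for table in tree.iter(TABLE):
        rows = []
        for row in table.findall(ROW):          # assumption: rows are direct children
            cells = []
            for cell in row.findall(CELL):      # assumption: cells are direct children
                parts = [paragraph_text(p) for p in cell.findall(PARA)]
                cells.append('\n'.join(part for part in parts if part))
            rows.append(cells)
        tables.append(rows)
        # Remember every paragraph under this table so it is skipped as plain text.
        table_paragraph_ids.update(id(p) for p in table.iter(PARA))

    text = []
    for paragraph in tree.iter(PARA):
        if id(paragraph) in table_paragraph_ids:
            continue
        content = paragraph_text(paragraph)
        if content:
            text.append(content)
    return tables, text


def get_docx_tables_and_text(path):
    """Read word/document.xml from a docx file and split tables from text."""
    with zipfile.ZipFile(path) as document:
        tree = XML(document.read('word/document.xml'))
    return split_tables_and_text(tree)
```

Each entry in `tables` is a list of rows, so it can be passed to `pd.DataFrame(rows)` if a DataFrame per table is wanted, mirroring how the header and footer are returned in the question's function.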

