How can I visualize the schema of a PDF file so that I can parse it?

Problem description

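I am trying to parse resumes in docx and PDF format. I want to parse the resume information section by section, for example experience, education, email ID, phone number, date of birth, and so on. I have tried libraries such as docx, pdfminer, and pdf2, but have not found a solution.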
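For the single-value fields (email ID, phone number), a regular-expression pass over the extracted text may be enough. The following is only a minimal sketch: the patterns, the `MIN_PHONE_DIGITS` threshold, and the function name `extract_contact_fields` are assumptions made for illustration, not something taken from the libraries above, and they will need tuning for real resumes.

```python
import re

# Deliberately loose patterns (assumptions for illustration); tune them for real data.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Optional "+", a digit, then digits/spaces/dots/dashes/parentheses, ending in a digit.
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d ().-]{7,}\d")
# Hypothetical threshold: candidates with fewer digits (e.g. "2015-2019") are dropped.
MIN_PHONE_DIGITS = 9


def extract_contact_fields(text):
    """Return the e-mail addresses and phone-like numbers found in text."""
    emails = EMAIL_RE.findall(text)
    phones = []
    for match in PHONE_RE.finditer(text):
        candidate = match.group().strip()
        # Count only the digits, ignoring separators such as spaces and dashes.
        if sum(ch.isdigit() for ch in candidate) >= MIN_PHONE_DIGITS:
            phones.append(candidate)
    return {"emails": emails, "phones": phones}
```

The digit-count filter is a crude way to keep year ranges from being reported as phone numbers; phone numbers shorter than the threshold would be missed.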

https://github.com/acrosson/nlp/blob/master/information-extraction.py https://github.com/divapriya/Language_Processing

This is the code that extracts the PDF content as text:

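import io

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_text_from_pdf(pdf_path):
    """Yield the text of each page of the PDF at pdf_path, one page at a time."""
    with open(pdf_path, 'rb') as fh:
        # iterate over all pages of the PDF document
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            # create a resource manager
            resource_manager = PDFResourceManager()
            # create an in-memory text buffer to receive the converter output
            fake_file_handle = io.StringIO()
            # create a text converter object
            converter = TextConverter(resource_manager, fake_file_handle, codec='utf-8', laparams=LAParams())
            # create a page interpreter
            page_interpreter = PDFPageInterpreter(resource_manager, converter)
            # process the current page
            page_interpreter.process_page(page)
            # extract the text, then close the handles before yielding,
            # so they are released even if the caller stops iterating early
            text = fake_file_handle.getvalue()
            converter.close()
            fake_file_handle.close()
            yield text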

def calling_extract_text_from_pdf(pdf_path):
    # collect the text of every page, each prefixed with a space
    fullPDFText = [' ' + page for page in extract_text_from_pdf(pdf_path)]
    # collapse blank lines and drop form feeds and bullet glyphs
    pdf_extract_skill_text_1 = [
        line.replace('\n\n', '\n').replace('\n\x0c', '').replace('\n\uf0d8', '') for line in fullPDFText if line
    ]
    pdf_fullTextString_1 = ''.join(pdf_extract_skill_text_1)
    # want to divide this text into sections as per labels (Education, Experience, Skills, etc.)
    print(pdf_fullTextString_1)
    # also return the text so that later steps can work on it
    return pdf_fullTextString_1

I want to parse the information in PDF and docx documents section by section, for example: education, skills, experience, and so on.
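One simple way to approach this, once the text has been extracted, is to treat any line that consists only of a known label as a section heading and to group the following lines under it. This is only a sketch under stated assumptions: the heading list `SECTION_HEADINGS` and the function name `split_into_sections` are hypothetical, and it only works when each label sits on its own line in the extracted text.

```python
import re

# Hypothetical heading vocabulary; extend it to match the resumes at hand.
SECTION_HEADINGS = ("education", "experience", "skills", "projects", "summary")


def split_into_sections(text, headings=SECTION_HEADINGS):
    """Group the non-empty lines of text under the most recent heading line."""
    # A heading is a line containing only one of the labels, optionally followed by ":".
    pattern = re.compile(
        r"^\s*(" + "|".join(re.escape(h) for h in headings) + r")\s*:?\s*$",
        re.IGNORECASE,
    )
    sections = {"header": []}  # lines before the first heading (name, contacts)
    current = "header"
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            current = match.group(1).lower()
            sections.setdefault(current, [])
        elif line.strip():
            sections[current].append(line.strip())
    return sections
```

The result is a dictionary keyed by lower-cased label, so `split_into_sections(text).get("education", [])` would give the lines found under an "Education" heading. Resumes that use other wording ("Work History", "Academic Background") would need those labels added to the list.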

Tags: python-3.x, machine-learning, deep-learning

Solution


https://stackoverflow.com/questions/52683133/text-scraping-a-pdf-with-python-pdfquery
https://www.reddit.com/r/Python/comments/4bnjha/scraping_pdf_files_with_python/

These are some links I found. However, it is still difficult to extract segments or sections from a PDF.
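One thing that may help with seeing how a PDF is laid out is to list each text line together with its font size, since section headings are often set in a larger size than the body text. The sketch below assumes pdfminer.six and its `extract_pages` layout API; the helper names `iter_lines_with_size` and `flag_headings` are hypothetical, and the "larger than the most common size" rule is only a heuristic that fails when headings differ from the body by weight alone.

```python
from collections import Counter


def iter_lines_with_size(pdf_path):
    """Yield (text, font_size) for each text line found by pdfminer.six layout analysis."""
    # Imported lazily so flag_headings() can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue  # skip figures, lines, rectangles, and so on
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                # Collect the size of every character on the line.
                sizes = [round(ch.size, 1) for ch in line if isinstance(ch, LTChar)]
                text = line.get_text().strip()
                if text and sizes:
                    yield text, max(sizes)


def flag_headings(lines):
    """Return (text, size, is_heading); a heading is larger than the most common size."""
    lines = list(lines)
    if not lines:
        return []
    # Assumption: the most frequent font size is the body text size.
    body_size = Counter(size for _, size in lines).most_common(1)[0][0]
    return [(text, size, size > body_size) for text, size in lines]
```

Printing the result of `flag_headings(iter_lines_with_size(pdf_path))` gives a rough view of the document structure. If the headings turn out to be bold rather than larger, the `fontname` attribute of `LTChar` could be inspected in the same loop instead.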

