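python - Python: convert PDF to CSV/JSON
Problem description
I am new to converting files with Python. I am trying to convert a PDF to CSV with the code below, which is based on this Git repo: https://github.com/bhishan/PDFMiningUsingLessAndSubprocess
I get an error like "failed for file A Test Suite for Evaluation of English-to-Korean.pdf". Everything works except "subprocess.Popen". What am I doing wrong here?
Link to the PDF file (I could not add an attachment on git): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.503.4016&rep=rep1&type=pdf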
import subprocess
import glob
import time
import csv

csvwriter = csv.writer(file('translation.csv', 'wb'))
csvwriter.writerow(['title','contributornames','institutions','abstract'])

def parse_pdf_buffer(buffer_file):
    with open(buffer_file, 'rb') as f:
        all_content = f.readlines()
        for each_line in all_content[0:29]:
            title_part = each_line[0]
            contributornames = each_line[1]
            institutions = each_line[1:3]
            abstract = each_line[7:27]
            title_part = " ".join(desc_part.split())
            contributornames_part = " ".join(withdraw_part.split())
            institutions_part = " ".join(desc_part.split())
            abstract_part = " ".join(desc_part.split())
            csvwriter.writerow(['title','contributornames','institutions','abstract'])

def read_pdf_file(file_name):
    print file_name
    try:
        fileptr = open('koreanenglish_extracted.txt', 'wb') #parsed it from different code
        command_out = subprocess.Popen(['less', file_name], stdout=fileptr, stderr=subprocess.STDOUT) #ERROR occurs here
        time.sleep(2)
        parse_pdf_buffer('koreanenglish_extracted.txt') #parsed it from different code
    except:
        print "failed for file", file_name

def main():
    for file_name in glob.glob("*.pdf"): #capture all the pdf
        read_pdf_file(file_name)

if __name__ == '__main__':
    main()
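Contents of FortunatoScienceParsed.txt, copied and pasted here as text. Sorry, I could not upload the file as an attachment. If needed, I will send the whole koreanenglish_extracted.txt in chat. Thank you very much for your help!!!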
A Test Suite for Evaluation of English-to-Korean Machine Translation Systems
Sungryong Koh, Jinee Maeng, Ji-Young Lee, Young-Sook Chae, Key-Sun ChoiKorea Terminology Research Center for Language and Knowledge Engineering (KORTERM)
Korea Advanced Institute of Science and Technology (KAIST)
Kusong-dong Yusong-gu Taejon 305-701 Korea
{koh,aphroditejin,jinny206}@world.kaist.ac.kr
, pinochae@chollian.net
, kschoi@cs.kaist.ac.kr
Abstract
This paper describes KORTERM™s test suite and their practicability.
The test-sets have been being constructed on the basis of f
ine-
grained classification of linguistic phenomena
to evaluate the technical st
atus of English-to-Korean
MT systems systematically.
They
consist of about 5000 test-sets and are growi
ng.
Solution
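No answer was recorded for this question, so the following is only a hedged sketch of where the posted code may be going wrong, not a confirmed fix. Reading the code as posted: the bare `except:` prints "failed for file" for any exception, so the failure may not come from `subprocess.Popen` at all. `parse_pdf_buffer` uses `desc_part` and `withdraw_part`, which are never defined, so it would raise a `NameError` as soon as the text file has at least one line. In addition, `each_line[0]` indexes characters of one line rather than lines of the file, the final `writerow` writes the header strings instead of the parsed values, and `time.sleep(2)` does not guarantee that `less` has finished. The sketch below assumes Python 3 (the question uses Python 2), and the helper names `extract_text`, `squeeze` and `parse_lines` are hypothetical. Whether `less` emits readable text for a PDF depends on an input preprocessor being configured on the machine, which is an assumption about the environment. The line slices are the ones chosen in the question.

```python
import csv
import glob
import subprocess
import traceback


def extract_text(command, out_path):
    """Run an extraction command, wait for it, and raise if it fails."""
    with open(out_path, "wb") as out:
        # run() returns only after the child process has exited, and
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(command, stdout=out, stderr=subprocess.STDOUT, check=True)


def squeeze(lines):
    """Join whole lines and collapse runs of whitespace to single spaces."""
    return " ".join(" ".join(lines).split())


def parse_lines(lines):
    """Build one CSV row by slicing the list of lines, not one line's characters."""
    return [
        squeeze(lines[0:1]),   # title
        squeeze(lines[1:2]),   # contributor names
        squeeze(lines[1:3]),   # institutions (slice taken from the question)
        squeeze(lines[7:27]),  # abstract (slice taken from the question)
    ]


def main():
    with open("translation.csv", "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["title", "contributornames", "institutions", "abstract"])
        for pdf_name in glob.glob("*.pdf"):
            try:
                extract_text(["less", pdf_name], "koreanenglish_extracted.txt")
                with open("koreanenglish_extracted.txt",
                          encoding="utf-8", errors="replace") as f:
                    writer.writerow(parse_lines(f.read().splitlines()))
            except Exception:
                print("failed for file", pdf_name)
                traceback.print_exc()  # show the real cause instead of hiding it
```

Calling `main()` in a directory that contains the PDFs would write one row per file. Printing the traceback is the main point of the sketch: it shows whether the failure comes from the external command or from the parsing step.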