python - Use PyPDF2 to detect non-embedded fonts in PDF file generated by Google Docs
Problem description
I'm hoping someone can help me write a Python function that detects any fonts in a PDF file that are not embedded in it. I've tried the script linked here; it detects the document's fonts, but it does not detect which fonts are embedded. I've pasted the script below for convenience:
from pprint import pprint
import sys

from PyPDF2 import PdfFileReader

fontkeys = set(['/FontFile', '/FontFile2', '/FontFile3'])


def walk(obj, fnt, emb):
    # A /BaseFont entry names a font used by the document.
    if '/BaseFont' in obj:
        fnt.add(obj['/BaseFont'])
    # A /FontName next to one of the /FontFile* keys marks an embedded font.
    elif '/FontName' in obj and fontkeys.intersection(set(obj)):
        emb.add(obj['/FontName'])
    # Recurse into nested dictionary-like objects.
    for k in obj:
        if hasattr(obj[k], 'keys'):
            walk(obj[k], fnt, emb)
    return fnt, emb


if __name__ == '__main__':
    fname = sys.argv[1]
    pdf = PdfFileReader(fname)
    fonts = set()
    embedded = set()
    for page in pdf.pages:
        obj = page.getObject()
        f, e = walk(obj['/Resources'], fonts, embedded)
        fonts = fonts.union(f)
        embedded = embedded.union(e)
    unembedded = fonts - embedded
    print('Font List')
    pprint(sorted(list(fonts)))
    if unembedded:
        print('\nUnembedded Fonts')
        pprint(unembedded)
For example, I've downloaded a PDF from Google Docs (type some stuff, save as PDF) with the Arial font, and Adobe Reader has confirmed that the font is embedded. However, the script returns ['/ArialMT'] as a font, and an empty set for embedded fonts. Additionally, it does not look like any of the recursive objects have the keys {'/FontFile', '/FontFile2', '/FontFile3'}. I've tried it on other PDFs and it works, so it must be something weird with the Google Docs PDFs. Let me know what other debug information I can give for this PDF file.
One thought was that Google Docs might only embed fonts that are not among the 14 standard PDF fonts. However, I tried it with an unusual font (Pacifico), and the script also reported this font as not embedded, while Adobe Reader says it is.
I tried it with this PDF, and the script correctly stated that these 14 fonts were embedded.
Solution
The issue is that the script does not handle lists (PDF arrays). In the Google Docs example, the font object has this structure:
{'/Encoding': '/Identity-H', '/Type': '/Font', '/BaseFont': '/Pacifico-Regular', '/ToUnicode': IndirectObject(9, 0), '/DescendantFonts': [IndirectObject(16, 0)], '/Subtype': '/Type0'}
The key '/DescendantFonts' maps to a list of values which, if you recurse into them, contain the keys for the font files. You have to modify the script to test for arrays as well, for example:
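import PyPDF2

# Inside walk(): also descend into PDF arrays such as /DescendantFonts.
if isinstance(obj, PyPDF2.generic.ArrayObject):  # duck typing would also work here
    for i in obj:
        i = i.getObject()  # array entries may be indirect references
        if hasattr(i, 'keys'):
            walk(i, fnt, emb)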
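Putting the two pieces together, a revised walk might look like the sketch below. It is an illustration under assumptions, not the original script: the resolve helper and the FONT_FILE_KEYS name are hypothetical, it uses duck typing instead of checking for PyPDF2.generic.ArrayObject, and it runs on plain dicts and lists that stand in for parsed PDF objects so it can be tried without a PDF file.

```python
FONT_FILE_KEYS = {'/FontFile', '/FontFile2', '/FontFile3'}


def resolve(obj):
    """Hypothetical helper: dereference objects that offer getObject()
    (as older PyPDF2 objects do); anything else passes through unchanged."""
    getter = getattr(obj, 'getObject', None)
    return getter() if callable(getter) else obj


def walk(obj, fonts, embedded):
    """Add used font names to `fonts` and embedded font names to `embedded`."""
    obj = resolve(obj)
    if hasattr(obj, 'keys'):  # dictionary-like object
        if '/BaseFont' in obj:
            fonts.add(obj['/BaseFont'])
        if '/FontName' in obj and FONT_FILE_KEYS.intersection(obj.keys()):
            embedded.add(obj['/FontName'])
        for key in obj.keys():
            walk(obj[key], fonts, embedded)
    elif isinstance(obj, (list, tuple)):  # array-like object, e.g. /DescendantFonts
        for item in obj:
            walk(item, fonts, embedded)
    return fonts, embedded


# Stand-in for a page's /Resources, shaped like the Google Docs example:
# the font file key is only reachable through the /DescendantFonts array.
resources = {
    '/Font': {
        '/F1': {
            '/Type': '/Font',
            '/Subtype': '/Type0',
            '/BaseFont': '/Pacifico-Regular',
            '/DescendantFonts': [
                {
                    '/BaseFont': '/Pacifico-Regular',
                    '/FontDescriptor': {
                        '/FontName': '/Pacifico-Regular',
                        '/FontFile2': 'font stream placeholder',
                    },
                },
            ],
        },
    },
}

fonts, embedded = walk(resources, set(), set())
print(sorted(fonts - embedded))  # prints []
```

With a real file you would pass each page's '/Resources' object instead of the stand-in dictionary. Note that this sketch does not guard against reference cycles between objects, so a visited set may be needed for PDFs whose resources refer back to their parents.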