javascript - How is PDF.js getTextContent returning multiple character strings
Problem description
PDF.js getTextContent() sometimes returns strings containing multiple characters, and I'm trying to work out why.
I've been regularly crawling 90+ page PDFs and extracting their text content with PDF.js, with about 99% of the results broken down into individual characters (thank you, Mozilla).
let doc = await pdfjsLib.getDocument(name).promise;
let numPages = doc.numPages;
let lines = [];
let index = 0;
for (let pageNum = 1; pageNum <= numPages; pageNum++) {
    let pageData = await doc.getPage(pageNum);
    // Recent PDF.js 2.x releases take a parameter object here, not a bare scale number
    let viewport = pageData.getViewport({ scale: 1 });
    console.log(viewport);
    let pageBase = (numPages - pageNum) * 2000;
    let pageHeight = viewport.viewBox[3];
    let pageWidth = viewport.viewBox[2];
    let yArr = [];
    let textContent = await pageData.getTextContent();
    // *** Then we do some callback stuff with text - works fine ***
}
textContent usually looks like this:
Recently I ran into a bug and realized that some strings are coming in longer, usually page numbers (see index 497):
Once again, after running a filter:
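The filter itself is not shown in the post. As a minimal sketch of what such a check could look like, the snippet below lists every entry whose `str` is longer than one character, along with its index in the array. The helper name `findMultiCharItems` and the sample data are hypothetical; the only assumption about PDF.js is that each entry of `textContent.items` carries a `str` property.

```javascript
// Hypothetical helper: report which entries of textContent.items hold more
// than one character, together with their index in the array.
function findMultiCharItems(items) {
  return items
    .map((item, index) => ({ index, str: item.str }))
    // Array.from counts code points, so a surrogate pair is one character.
    .filter(({ str }) => Array.from(str).length > 1);
}

// Stand-in data shaped like textContent.items (values are made up).
const sampleItems = [{ str: "P" }, { str: "D" }, { str: "12" }, { str: "F" }];

console.log(findMultiCharItems(sampleItems));
// logs [ { index: 2, str: '12' } ]
```

Running this over each page's `textContent.items` would show whether the longer strings are confined to page numbers or turn up elsewhere as well.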
I'm really looking to ascertain what exactly causes some of these strings to register with PDF.js as multi-character strings when the majority are single characters. I'd hate for this to surprise me down the road in a way I can't anticipate.
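Whatever the cause turns out to be, downstream code that expects one character per item can be made tolerant of longer strings. The sketch below is one defensive approach, not something PDF.js provides: it splits each multi-character item into single-character items. It assumes horizontal, left-to-right text, that `transform[4]` is the item's x position, and that `width` can be divided evenly among the characters, which is only an approximation for proportional fonts. The function names are hypothetical.

```javascript
// Hypothetical helper: split one text item into per-character items.
// Assumes left-to-right horizontal text and evenly spaced glyphs.
function splitItemIntoChars(item) {
  const chars = Array.from(item.str);
  if (chars.length <= 1) return [item];
  const step = item.width / chars.length; // approximate width per character
  return chars.map((ch, i) => {
    const transform = item.transform.slice();
    transform[4] = item.transform[4] + i * step; // shift x for each character
    return { ...item, str: ch, width: step, transform };
  });
}

// Apply the split to a whole items array, preserving order.
function splitAllIntoChars(items) {
  return items.flatMap(splitItemIntoChars);
}

// Stand-in data shaped like textContent.items (values are made up).
const sample = [
  { str: "A", width: 5, transform: [1, 0, 0, 1, 10, 20] },
  { str: "12", width: 10, transform: [1, 0, 0, 1, 100, 20] },
];

console.log(splitAllIntoChars(sample).map((item) => item.str));
// logs [ 'A', '1', '2' ]
```

The original items are left unmodified, so the split output can be compared against the raw `textContent.items` while debugging.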
PDF.js: v2.6.347, Node.js: v12.18.2
Debugging in VSCode: Version: 1.52.1 Electron: 9.3.5 Chrome: 83.0.4103.122 Node.js: 12.14.1 V8: 8.3.110.13-electron.0 OS: Darwin x64 19.6.0
Thanks!