How is PDF.js getTextContent returning multiple character strings

Problem description

PDF.js getTextContent() is returning string values of multiple characters at times and I'm trying to discern why that is.

I've been regularly crawling 90+ page PDFs and extracting their text content with PDF.js, with 99% of the results broken down into individual characters (thank you, Mozilla).

  let doc = await pdfjsLib.getDocument(name).promise;
  let numPages = doc.numPages;

  let lines = [];
  let index = 0;
  for (let pageNum = 1; pageNum <= numPages; pageNum++) {
    let pageData = await doc.getPage(pageNum);

    // PDF.js 2.x documents getViewport as taking an options object, not a bare scale.
    let viewport = pageData.getViewport({ scale: 1 });
    console.log(viewport);
    let pageBase = (numPages - pageNum) * 2000;
    let pageHeight = viewport.viewBox[3];
    let pageWidth = viewport.viewBox[2];

    let yArr = [];
    let textContent = await pageData.getTextContent();
    // *** Then we do some callback stuff with the text - works fine ***
  }

textContent USUALLY looks like this: 99% of the data has a 'str' value of a single character.

I recently ran into a bug and realized that some strings are coming in longer, usually page numbers (see index 497).

The same thing appears once again after running a filter.
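A filter along the lines of the one mentioned above could be sketched as follows. This is only an illustration, not the code actually used here: the helper name findMultiCharItems and the sample data are hypothetical, and the only assumption about PDF.js is that textContent.items is an array of objects carrying a 'str' property.

```javascript
// Hypothetical helper: report every text item whose `str` is longer than one
// character, together with its position in the items array.
// `items` is assumed to have the shape of `textContent.items`.
function findMultiCharItems(items) {
  return items
    .map((item, index) => ({ index, str: item.str }))
    .filter(({ str }) => typeof str === "string" && str.length > 1);
}

// Mock data standing in for textContent.items (not taken from a real PDF).
const sampleItems = [{ str: "H" }, { str: "i" }, { str: "12" }, { str: " " }];

console.log(findMultiCharItems(sampleItems));
// prints [ { index: 2, str: '12' } ]
```

Inside the page loop this would be called as findMultiCharItems(textContent.items). Note that str.length counts UTF-16 code units, so a single character outside the Basic Multilingual Plane would also be reported.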

I'm really looking to ascertain what exactly is causing some of these strings to register with PDF.js as multi-character strings when the majority are single characters. I'd hate for this thing to surprise me down the road in a way I can't anticipate.

PDF.js: v2.6.347 Node.js v12.18.2

Debugging in VSCode: Version: 1.52.1 Electron: 9.3.5 Chrome: 83.0.4103.122 Node.js: 12.14.1 V8: 8.3.110.13-electron.0 OS: Darwin x64 19.6.0

Thanks!

Tags: javascript, node.js, pdf, pdf.js

Solution

