
Jonas Jenwald f90f9466e3 [api-minor] Reduce postMessage overhead, in PartialEvaluator.getTextContent, by sending text chunks in batches (issue 13962)

Following the STR in the issue, this patch reduces the number of `PartialEvaluator.getTextContent`-related `postMessage`-calls by approximately 78 percent.[1]
Note that by enforcing a relatively low value when batching text chunks, we should thus improve worst-case scenarios while not negatively affecting all `textLayer` building.

While working on these changes I noticed, thanks to our unit-tests, that the implementation of the `appendEOL` function unfortunately means that the number and content of the textItems could actually be affected by the particular chunking used.
That seems *extremely* unfortunate, since in practice this means that the particular chunking used is thus observable through the API. Obviously that should be a completely internal implementation detail, which is why this patch also modifies `appendEOL` to mitigate that.[2]

Given that this patch adds a *minimum* batch size in `enqueueChunk`, there's obviously nothing preventing a batch from becoming a lot larger than the limit (depending e.g. on the PDF structure and the CPU load/speed).
While sending more text chunks at once isn't an issue in itself, it could become problematic at the main-thread during `textLayer` building. Note how both the `PartialEvaluator` and `CanvasGraphics` implementations utilize `Date.now()`-checks, to prevent long-running parsing/rendering from "hanging" the respective thread. In the `textLayer` building we don't utilize such a construction[3], and streaming of textContent is thus essentially acting as a *simple* stand-in for that functionality.
This is why we want to avoid choosing too large a minimum batch size, since that could indirectly affect main-thread performance negatively.
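The batching idea described above can be illustrated with a minimal sketch. This is not the code from the patch: `createBatcher`, `send`, and the threshold of 10 are hypothetical names and values chosen for illustration, and only the `enqueueChunk` name comes from the commit message. The sketch holds back small batches until a minimum size is reached, and flushes whatever is left when called without the `batch` flag.

```javascript
// Hypothetical threshold; the actual value is not stated in the commit message.
const MIN_BATCH_SIZE = 10;

// `send` stands in for whatever forwards a chunk to the main thread
// (for example a postMessage-based sink); it is an assumption of this sketch.
function createBatcher(send, minBatchSize = MIN_BATCH_SIZE) {
  let pending = [];
  return {
    // Collect newly parsed text items.
    push(items) {
      pending.push(...items);
    },
    // With `batch = true`, hold back until at least `minBatchSize` items are
    // pending. Without it, flush everything that is pending.
    enqueueChunk(batch = false) {
      if (pending.length === 0) {
        return;
      }
      if (batch && pending.length < minBatchSize) {
        return;
      }
      send(pending);
      pending = [];
    },
  };
}

// Usage: 25 single-item chunks result in 3 sends instead of 25.
const sizes = [];
const batcher = createBatcher(items => sizes.push(items.length));
for (let i = 0; i < 25; i++) {
  batcher.push([`item ${i}`]);
  batcher.enqueueChunk(/* batch = */ true);
}
batcher.enqueueChunk(); // Final flush of the remaining items.
console.log(sizes); // [ 10, 10, 5 ]
```

Because the threshold is only a minimum, a single `push` with many items still produces one large batch, which is the reason given above for keeping the minimum small.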

---
[1] While it'd be possible to go even lower, that'd likely require more invasive re-factoring/changes to the `PartialEvaluator.getTextContent`-code to ensure that the batches don't become too large.

[2] This should also, as far as I can tell, explain some of the regressions observed in the "enhance" text-selection tests back in PR 13257.
Looking closer at the `appendEOL` function it should potentially be changed even more, however that should probably not be done here.

[3] I'd really like to avoid implementing something like that for the `textLayer` building as well, given that it'd require adding a fair bit of complexity.

2021-09-09 00:01:07 +02:00

.github

Remove the npm test-command

2021-08-27 16:29:55 +02:00

docs

Remove the -es5/-legacy special handling in the gulp wintersmith task (PR 12978 follow-up)

2021-05-30 16:08:54 +02:00

examples

Export the XFA/StructTree-layers in the viewer components

2021-08-28 18:43:08 +02:00

extensions

[api-minor] Introduce a new annotationMode-option, in PDFPageProxy.{render, getOperatorList}

2021-08-24 01:13:02 +02:00

external

Fix Viewer API definitions and include in CI

2021-08-25 18:45:46 -04:00

l10n

Update l10n files

2021-09-05 10:01:23 +02:00

src

[api-minor] Reduce postMessage overhead, in PartialEvaluator.getTextContent, by sending text chunks in batches (issue 13962)

2021-09-09 00:01:07 +02:00

test

Merge pull request #13985 from Snuffleupagus/issue-11088

2021-09-08 22:15:27 +02:00

web

[GENERIC viewer] Add fallback logic for the old PDFPageView.update method signature

2021-09-04 11:39:34 +02:00

.editorconfig

Uses editorconfig to maintain consistent coding styles

2015-11-14 07:32:18 +05:30

.eslintignore

Include the test/resources/ folder when running ESLint/Stylelint

2021-08-04 13:50:44 +02:00

.eslintrc

[Regression] Re-factor the *internal* includeAnnotationStorage handling, since it's currently subtly wrong

2021-08-18 10:09:03 +02:00

.gitattributes

Fixing C++,PHP and Pascal presence in the repo

2015-10-29 13:03:51 -05:00

.gitignore

Include package-lock.json for reproducible builds

2018-06-02 20:29:47 +02:00

.gitmodules

Update fonttools location and version (issue 6223)

2015-07-17 12:51:09 +02:00

.gitpod.Dockerfile

Simplifies code contributions by automating the dev setup with gitpod.io

2019-11-06 04:12:19 +00:00

.gitpod.yml

Simplifies code contributions by automating the dev setup with gitpod.io

2019-11-06 04:12:19 +00:00

.mailmap

Add mgol's name to AUTHORS, add .mailmap

2017-11-22 10:46:11 +01:00

.prettierrc

Update Prettier to version 2.0

2020-04-14 12:28:14 +02:00

.stylelintignore

Include the test/resources/ folder when running ESLint/Stylelint

2021-08-04 13:50:44 +02:00

.stylelintrc

Enable the Stylelint shorthand-property-no-redundant-values rule

2021-01-22 14:36:02 +01:00

AUTHORS

Add SehyunPark to AUTHORS

2017-11-29 22:24:08 +09:00

CODE_OF_CONDUCT.md

Add Mozilla Code of Conduct file

2019-03-27 21:00:01 -07:00

EXPORT

Adds ECCN response statement

2017-10-23 13:31:36 -05:00

gulpfile.js

Remove the npm test-command

2021-08-27 16:29:55 +02:00

LICENSE

cleaned whitespace

2015-02-17 11:07:37 -05:00

package-lock.json

Update the webpack-stream package to the latest version

2021-09-05 09:58:59 +02:00

package.json

Update the webpack-stream package to the latest version

2021-09-05 09:58:59 +02:00

pdfjs.config

Bump versions in pdfjs.config

2021-07-25 13:39:57 +02:00

README.md

[api-minor] Rename -es5 to -legacy, to reduce confusion over what's actually supported (issue 12976)

2021-02-10 16:01:59 +01:00

systemjs.config.js

Enable the ESLint no-var rule globally

2021-03-13 16:12:53 +01:00

README.md

PDF.js

PDF.js is a Portable Document Format (PDF) viewer that is built with HTML5.

PDF.js is community-driven and supported by Mozilla. Our goal is to create a general-purpose, web standards-based platform for parsing and rendering PDFs.

Contributing

PDF.js is an open source project and always looking for more contributors. To get involved, visit:

Feel free to stop by our Matrix room for questions or guidance.

Getting Started

Online demo

Please note that the "Modern browsers" version assumes native support for features such as async/await, ReadableStream, optional chaining, and nullish coalescing.

Modern browsers: https://mozilla.github.io/pdf.js/web/viewer.html
Older browsers: https://mozilla.github.io/pdf.js/legacy/web/viewer.html
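As a rough sketch of how a page could choose between the two demo URLs, the features listed above can be probed at runtime. The function names `supportsModernBuild` and `pickViewerUrl` are hypothetical and not part of PDF.js. The syntax check relies on `new Function`, which may be blocked by a strict Content Security Policy, so treat this as an illustration rather than a recommended detection method.

```javascript
var MODERN_VIEWER = "https://mozilla.github.io/pdf.js/web/viewer.html";
var LEGACY_VIEWER = "https://mozilla.github.io/pdf.js/legacy/web/viewer.html";

// Written with `var` and plain functions so that this script itself still
// parses in older browsers.
function supportsModernBuild() {
  if (typeof ReadableStream === "undefined") {
    return false;
  }
  try {
    // Constructing the function only parses the body; a SyntaxError here means
    // async/await, optional chaining, or nullish coalescing is unsupported.
    new Function(
      "return async function () { var o = null; return o?.x ?? 1; };"
    );
    return true;
  } catch (e) {
    return false;
  }
}

function pickViewerUrl() {
  return supportsModernBuild() ? MODERN_VIEWER : LEGACY_VIEWER;
}
```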

Browser Extensions

Firefox

PDF.js is built into version 19+ of Firefox.

Chrome

The official extension for Chrome can be installed from the Chrome Web Store. This extension is maintained by @Rob--W.
Build Your Own - Get the code as explained below and run gulp chromium. Then open Chrome, go to Tools > Extensions and load the (unpacked) extension from the directory build/chromium.

Getting the Code

To get a local copy of the current code, clone it using git:

$ git clone https://github.com/mozilla/pdf.js.git
$ cd pdf.js

Next, install Node.js via the official package or via nvm. You need to install the gulp package globally (see also gulp's getting started):

$ npm install -g gulp-cli

If everything worked out, install all dependencies for PDF.js:

$ npm install

Finally, you need to start a local web server as some browsers do not allow opening PDF files using a file:// URL. Run:

$ gulp server

and then you can open:

http://localhost:8888/web/viewer.html

Please keep in mind that this requires a modern and fully up-to-date browser; refer to Building PDF.js for non-development usage of the PDF.js library.

It is also possible to view all test PDF files on the right side by opening:

http://localhost:8888/test/pdfs/?frame

Building PDF.js

In order to bundle all src/ files into two production scripts and build the generic viewer, run:

$ gulp generic

If you need to support older browsers, run:

$ gulp generic-legacy

This will generate pdf.js and pdf.worker.js in the build/generic/build/ directory (respectively build/generic-legacy/build/). Both scripts are needed but only pdf.js needs to be included since pdf.worker.js will be loaded by pdf.js. The PDF.js files are large and should be minified for production.

Using PDF.js in a web application

To use PDF.js in a web application you can choose to use a pre-built version of the library or to build it from source. We supply pre-built versions for usage with NPM and Bower under the pdfjs-dist name. For more information and examples please refer to the wiki page on this subject.
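For orientation, a minimal sketch of loading a document and reading the text of one page could look as follows. The helper name `extractPageText` and its parameters are hypothetical. `pdfjsLib` is assumed to be the object exposed by the library (for example the `pdfjs-dist` module or the global created by including `pdf.js`), and it is passed in as an argument so that the sketch does not depend on how the library was loaded.

```javascript
// Hypothetical helper: loads `url` and returns the page count together with
// the plain text of one page.
async function extractPageText(pdfjsLib, url, workerSrc, pageNumber = 1) {
  // Tell pdf.js where to find the worker script, e.g. the pdf.worker.js
  // produced by the build.
  pdfjsLib.GlobalWorkerOptions.workerSrc = workerSrc;

  const pdf = await pdfjsLib.getDocument(url).promise;
  const page = await pdf.getPage(pageNumber);
  const textContent = await page.getTextContent();

  // Keep only items that carry a string and join them with spaces.
  const text = textContent.items
    .filter(item => typeof item.str === "string")
    .map(item => item.str)
    .join(" ");

  return { numPages: pdf.numPages, text };
}

// Example call (the file name and worker path are placeholders):
// extractPageText(pdfjsLib, "example.pdf", "build/generic/build/pdf.worker.js")
//   .then(result => console.log(result.numPages, result.text));
```

Joining items with a single space is a simplification. Real text extraction usually needs to take item positions and line breaks into account.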

Including via a CDN

PDF.js is hosted on several free CDNs:

Learning

You can play with the PDF.js API directly from your browser using the live demos below:

Interactive examples

More examples can be found in the examples folder. Some of them use the pdfjs-dist package, which can be built and installed in this repo directory via the gulp dist-install command.

For an introduction to the PDF.js code, check out the presentation by our contributor Julian Viereck:

https://www.youtube.com/watch?v=Iv15UY-4Fg8

More learning resources can be found at:

https://github.com/mozilla/pdf.js/wiki/Additional-Learning-Resources

The API documentation can be found at:

https://mozilla.github.io/pdf.js/api/

Questions

Check out our FAQs and get answers to common questions:

https://github.com/mozilla/pdf.js/wiki/Frequently-Asked-Questions

Talk to us on Matrix:

https://chat.mozilla.org/#/room/#pdfjs:mozilla.org

File an issue:

https://github.com/mozilla/pdf.js/issues/new

Follow us on Twitter:

https://twitter.com/pdfjs