Jonas Jenwald ca719ecaa4 Add local caching of Functions, by reference, in the PDFFunctionFactory (issue 2541)
Note that compared to other structures, such as e.g. Images and ColorSpaces, `Function`s are not referred to by name, which however does bring the advantage of being able to share the cache for an *entire* page.
Furthermore, similar to ColorSpaces, the parsing of individual `Function`s is generally fast enough to not really warrant trying to cache them in any "smarter" way than by reference. (Hence trying to do caching similar to e.g. Fonts would most likely be a losing proposition, given the amount of data lookup/parsing that'd be required.)

Originally I tried implementing this similar to e.g. the recently added ColorSpace caching (and in a couple of different ways), however it unfortunately turned out to be quite ugly/unwieldy given the sheer number of functions/methods where you'd thus need to pass in a `LocalFunctionCache` instance. (Also, the affected functions/methods didn't exactly have short signatures as-is.)
After going back and forth on this for a while it seemed to me that the simplest, or least "invasive" if you will, solution would be if each `PartialEvaluator` instance had its *own* `PDFFunctionFactory` instance (since the latter is already passed to all of the required code). This way each `PDFFunctionFactory` instance could have a local `Function` cache, without it being necessary to provide a `LocalFunctionCache` instance manually at every `PDFFunctionFactory.{create, createFromArray}` call-site.

Obviously, with this patch, there are now (potentially) more `PDFFunctionFactory` instances than before, when the entire document shared just one. However, each such instance is really quite small and it's also tied to a `PartialEvaluator` instance, and those are *not* kept alive and/or cached. To reduce the impact of these changes, I've tried to make as many of these structures as possible *lazily initialized*, specifically:

 - The `PDFFunctionFactory`, on `PartialEvaluator` instances, since not all kinds of general parsing actually require it. For example: `getTextContent` calls won't cause any `Function` to be parsed, and even some `getOperatorList` calls won't trigger `Function` parsing (if a page contains e.g. no Patterns or "complex" ColorSpaces).

 - The `LocalFunctionCache`, on `PDFFunctionFactory` instances, since only certain parsing requires it. Generally speaking, only e.g. Patterns, "complex" ColorSpaces, and/or (some) SoftMasks will trigger any `Function` parsing.
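
The arrangement described above could look roughly like the following minimal sketch. It is not the actual PDF.js code: `SketchFunctionFactory`, `SketchLocalFunctionCache`, the `parseFunction` callback, and the string form of the reference key are all hypothetical stand-ins chosen for illustration. The only assumption is that an indirect reference can be turned into a unique string key.

```javascript
// Hypothetical stand-in for a local, by-reference cache (not the PDF.js class).
class SketchLocalFunctionCache {
  constructor() {
    this._byRef = new Map();
  }

  getByRef(refKey) {
    return this._byRef.get(refKey) || null;
  }

  set(refKey, fn) {
    this._byRef.set(refKey, fn);
  }
}

// Counter that exists only to make the lazy initialization observable.
let cachesCreated = 0;

class SketchFunctionFactory {
  constructor(parseFunction) {
    this._parseFunction = parseFunction; // hypothetical parser callback
    this._cache = null; // the cache is created on first use only
  }

  get _localFunctionCache() {
    if (this._cache === null) {
      this._cache = new SketchLocalFunctionCache();
      cachesCreated++;
    }
    return this._cache;
  }

  // `refKey` is an assumed string form of an indirect reference.
  create(refKey, rawData) {
    const cached = this._localFunctionCache.getByRef(refKey);
    if (cached) {
      return cached; // same reference seen before: skip parsing entirely
    }
    const fn = this._parseFunction(rawData);
    this._localFunctionCache.set(refKey, fn);
    return fn;
  }
}

// Usage: the second `create` call for the same key does not parse again.
let parseCalls = 0;
const usedFactory = new SketchFunctionFactory(data => {
  parseCalls++;
  return x => x * data.scale;
});
const firstResult = usedFactory.create("12R0", { scale: 2 });
const secondResult = usedFactory.create("12R0", { scale: 2 });

// A factory that never parses anything never allocates a cache.
const unusedFactory = new SketchFunctionFactory(() => null);
```

Because the cache hangs off the factory, callers only need the factory they were already given; nothing extra has to be threaded through their signatures.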

To put these changes into perspective, when loading/rendering all (14) pages of the default `tracemonkey.pdf` file there's now a total of 6 `PDFFunctionFactory` and 1 `LocalFunctionCache` instances created thanks to the lazy initialization.
(If you instead kept the document-"global" `PDFFunctionFactory` instance and passed around `LocalFunctionCache` instances everywhere, the numbers for the `tracemonkey.pdf` file would instead be something like 1 `PDFFunctionFactory` and 6 `LocalFunctionCache` instances.)
All-in-all, I thus don't think that the `PDFFunctionFactory` changes should be generally problematic.

With these changes, we can also modify (some) call-sites to pass in a `Reference` rather than the actual `Function` data. This is nice since `Function`s can also be `Streams`, which are not cached on the `XRef` instance (given their potential size), and this way we can avoid unnecessary lookups and thus save some additional time/resources.

Obviously I had intended to include (standard) benchmark results with these changes, but for reasons I don't really understand the test run-time (even with `master`) of the document in issue 2541 is quite a bit slower than in the development viewer.
However, logging the time it takes for the relevant `PDFFunctionFactory`/`PDFFunction` parsing shows that it takes *approximately* `0.5 ms` for the `Function` in question. Looking up a cached `Function`, on the other hand, is *one order of magnitude faster*, which does add up when the same `Function` is invoked close to 2000 times.
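
To get a feel for how those numbers add up, here is a back-of-the-envelope estimate. It is only a sketch under stated assumptions: "one order of magnitude" is taken as exactly 10x, "close to 2000" is rounded to 2000, and every invocation after the first is assumed to hit the cache. None of these values were measured beyond the approximate figures quoted above.

```javascript
const parseMs = 0.5; // approximate parse time quoted above
const lookupMs = parseMs / 10; // assumption: cached lookup is exactly 10x faster
const invocations = 2000; // assumption: "close to 2000" rounded to 2000

// Every invocation parses the Function from scratch.
const withoutCacheMs = parseMs * invocations;

// The first invocation parses, all later ones are cache lookups.
const withCacheMs = parseMs + lookupMs * (invocations - 1);

console.log(withoutCacheMs.toFixed(1), withCacheMs.toFixed(1));
```

Under these assumptions the total drops from about one second to about a tenth of a second for this one `Function`.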
2020-07-04 00:55:18 +02:00
.github Update links from IRC to Matrix. 2020-02-27 16:26:17 -08:00
docs Bump versions in pdfjs.config and update the getting started page of the website for the new release 2020-06-01 12:45:04 +02:00
examples [api-minor] Use the NodeCanvasFactory/NodeCMapReaderFactory classes as defaults in Node.js environments (issue 11900) 2020-07-02 04:44:23 +02:00
extensions Replace non-inclusive "whitelist" term with "allowlist" 2020-06-29 17:15:14 +02:00
external Prevent the (old) preprocessor from appending trailing whitespace when removing closing HTML comments 2020-06-11 12:15:18 +02:00
l10n Update l10n files 2020-06-27 11:37:41 +02:00
src Add local caching of Functions, by reference, in the PDFFunctionFactory (issue 2541) 2020-07-04 00:55:18 +02:00
test [api-minor] Use the NodeCanvasFactory/NodeCMapReaderFactory classes as defaults in Node.js environments (issue 11900) 2020-07-02 04:44:23 +02:00
web Merge pull request #11997 from Snuffleupagus/nullish-coalescing 2020-06-13 00:07:32 +02:00
.editorconfig Uses editorconfig to maintain consistent coding styles 2015-11-14 07:32:18 +05:30
.eslintignore Replace the bundled ReadableStream polyfill with the web-streams-polyfill npm package (issue 11157) 2019-09-23 22:16:59 +02:00
.eslintrc Enable the no-promise-executor-return ESLint rule 2020-07-01 13:01:39 +02:00
.gitattributes Fixing C++,PHP and Pascal presence in the repo 2015-10-29 13:03:51 -05:00
.gitignore Include package-lock.json for reproducible builds 2018-06-02 20:29:47 +02:00
.gitmodules Update fonttools location and version (issue 6223) 2015-07-17 12:51:09 +02:00
.gitpod.Dockerfile Simplifies code contributions by automating the dev setup with gitpod.io 2019-11-06 04:12:19 +00:00
.gitpod.yml Simplifies code contributions by automating the dev setup with gitpod.io 2019-11-06 04:12:19 +00:00
.mailmap Add mgol's name to AUTHORS, add .mailmap 2017-11-22 10:46:11 +01:00
.prettierrc Update Prettier to version 2.0 2020-04-14 12:28:14 +02:00
.travis.yml Use Node LTS releases to fix Travis CI builds (issue 10790) 2020-04-22 00:06:27 +02:00
AUTHORS Add SehyunPark to AUTHORS 2017-11-29 22:24:08 +09:00
CODE_OF_CONDUCT.md Add Mozilla Code of Conduct file 2019-03-27 21:00:01 -07:00
EXPORT Adds ECCN response statement 2017-10-23 13:31:36 -05:00
gulpfile.js [api-minor] Use the NodeCanvasFactory/NodeCMapReaderFactory classes as defaults in Node.js environments (issue 11900) 2020-07-02 04:44:23 +02:00
LICENSE cleaned whitespace 2015-02-17 11:07:37 -05:00
package-lock.json Fix *most* vulnerabilities reported by npm audit 2020-06-27 11:32:51 +02:00
package.json Update npm packages 2020-06-27 11:30:30 +02:00
pdfjs.config Bump versions in pdfjs.config and update the getting started page of the website for the new release 2020-06-01 12:45:04 +02:00
README.md Replace Mozilla Labs by just Mozilla 2020-06-29 17:48:35 +02:00
systemjs.config.js docs: Fix simple typo, occurences -> occurrences 2020-04-18 07:53:18 +10:00

PDF.js

PDF.js is a Portable Document Format (PDF) viewer that is built with HTML5.

PDF.js is community-driven and supported by Mozilla. Our goal is to create a general-purpose, web standards-based platform for parsing and rendering PDFs.

Contributing

PDF.js is an open source project and always looking for more contributors. To get involved, visit:

Feel free to stop by our Matrix room for questions or guidance.

Getting Started

Online demo

Please note that the "Modern browsers" version assumes native support for features such as async/await, Promise, and ReadableStream.

Browser Extensions

Firefox

PDF.js is built into version 19+ of Firefox.

Chrome

  • The official extension for Chrome can be installed from the Chrome Web Store. This extension is maintained by @Rob--W.
  • Build Your Own - Get the code as explained below and issue gulp chromium. Then open Chrome, go to Tools > Extension and load the (unpackaged) extension from the directory build/chromium.

Getting the Code

To get a local copy of the current code, clone it using git:

$ git clone https://github.com/mozilla/pdf.js.git
$ cd pdf.js

Next, install Node.js via the official package or via nvm. You need to install the gulp package globally (see also gulp's getting started):

$ npm install -g gulp-cli

If everything worked out, install all dependencies for PDF.js:

$ npm install

Finally, you need to start a local web server as some browsers do not allow opening PDF files using a file:// URL. Run:

$ gulp server

and then you can open:

Please keep in mind that this requires an ES6 compatible browser; refer to Building PDF.js for usage with older browsers.

It is also possible to view all test PDF files on the right side by opening:

Building PDF.js

In order to bundle all src/ files into two production scripts and build the generic viewer, run:

$ gulp generic

This will generate pdf.js and pdf.worker.js in the build/generic/build/ directory. Both scripts are needed but only pdf.js needs to be included since pdf.worker.js will be loaded by pdf.js. The PDF.js files are large and should be minified for production.
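
As a rough illustration of how the two generated scripts fit together, the sketch below renders the first page of a document onto a canvas. The function name `renderFirstPage`, the scale of 1.5, and passing the library object in as a parameter are choices made for this example only, and the `workerSrc` path is an assumption about where the build output is served from.

```javascript
// Renders page 1 of the PDF at `url` into `canvas`.
// `pdfjsLib` is the object exposed by the generated pdf.js script; it is passed
// in explicitly so that the sketch does not depend on how the script was loaded.
async function renderFirstPage(pdfjsLib, url, canvas) {
  // Assumed location of the worker script relative to the page.
  pdfjsLib.GlobalWorkerOptions.workerSrc = "build/generic/build/pdf.worker.js";

  const pdf = await pdfjsLib.getDocument(url).promise;
  const page = await pdf.getPage(1); // page numbers start at 1
  const viewport = page.getViewport({ scale: 1.5 });

  // Size the canvas to match the page at the chosen scale.
  canvas.width = viewport.width;
  canvas.height = viewport.height;

  await page.render({
    canvasContext: canvas.getContext("2d"),
    viewport,
  }).promise;

  return { width: canvas.width, height: canvas.height };
}
```

In a page that includes the generated pdf.js script, this could be called with the library object that script exposes, a URL to a PDF file, and a `<canvas>` element.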

Using PDF.js in a web application

To use PDF.js in a web application you can choose to use a pre-built version of the library or to build it from source. We supply pre-built versions for usage with NPM and Bower under the pdfjs-dist name. For more information and examples please refer to the wiki page on this subject.

Including via a CDN

PDF.js is hosted on several free CDNs:

Learning

You can play with the PDF.js API directly from your browser using the live demos below:

More examples can be found in the examples folder. Some of them use the pdfjs-dist package, which can be built and installed in this repo directory via the gulp dist-install command.

For an introduction to the PDF.js code, check out the presentation by our contributor Julian Viereck:

More learning resources can be found at:

The API documentation can be found at:

Questions

Check out our FAQs and get answers to common questions:

Talk to us on Matrix:

File an issue:

Follow us on Twitter: @pdfjs