In general it's recommended to pass a *parameter object* when calling the `getDocument`-function in the API, since that's the only way to provide additional options, and the fact that it also accepts a URL or TypedArray directly is now mostly for backwards compatibility reasons.
However, the `getDocument`-function also accepts a direct `PDFDataRangeTransport`-instance which just seems unnecessary.
*Please note:* The `PDFDataRangeTransport`-implementation was added specifically for the *built-in* Firefox PDF Viewer, however it's most likely not commonly used by any third-party (given that it requires manual PDF-data loading).
Furthermore, the default-viewer always provides a *parameter object* when calling the `getDocument`-function and it's thus completely unaffected by these changes.
This patch removes the recently introduced `transferPdfData` API-option, and simply enables transferring of TypedArray data *by default* instead of copying it. This will help reduce main-thread memory usage, however it will take ownership of the TypedArrays. Currently this only applies to the following cases:
- TypedArrays passed to the `getDocument`-function in the API, in order to open PDF documents from binary data.
- TypedArrays passed to a `PDFDataRangeTransport`-instance, used to support custom PDF document fetching/loading (see e.g. the Firefox PDF Viewer).
*PLEASE NOTE:* To avoid being affected by this, please simply *copy* any TypedArray data before passing it to either of the functions/methods mentioned above.
Now that we transfer TypedArray data that we previously only copied, we need to be more careful with input validation. Given how the `{IPDFStreamReader, IPDFStreamRangeReader}.read` methods will always return ArrayBuffer data, which is then transferred to the worker-thread[1], the actual TypedArray data passed to the API thus need to have the same exact size as its underlying ArrayBuffer to prevent issues.
Hence we'll check for this and only allow transferring of *safe* TypedArray data, and fallback to simply copying the data just as before. This obviously shouldn't be an issue in the Firefox PDF Viewer, but for the general PDF.js library we need to be more careful here.
---
[1] See e09ad99973/src/display/api.js (L2492-L2506) respectively e09ad99973/src/display/api.js (L2578-L2590)
Note how in the API we're transferring the PDF data that's fetched over the network[1]:
- f28bf23a31/src/display/api.js (L2467-L2480)
- f28bf23a31/src/display/api.js (L2553-L2564)
To support that functionality we have the `PDFDataTransportStream`, `PDFFetchStream`, `PDFNetworkStream`, and `PDFNodeStream` implementations. Here these stream-implementations vary slightly in how they handle `ArrayBuffer`s internally, w.r.t. transferring or copying the data:
- In `PDFDataTransportStream` we optionally, after PR 15908, allow transferring of the PDF data as provided externally (used e.g. in the Firefox PDF Viewer).
- In `PDFFetchStream` we're currenly always copying the PDF data returned by the Fetch API, which seems unnecessary. As discussed in PR 15908, it'd seem very weird if this sort of browser API didn't allow transferring of the returned data.
- In `PDFNetworkStream` we're already, since many years, transferring the PDF data returned by the `XMLHttpRequest` functionality. Note how the `getArrayBuffer` helper function simply returns an `ArrayBuffer` response as-is.
- In `PDFNodeStream` we're currently copying the PDF data, however this is unfortunately necessary since Node.js returns data as a `Buffer` object[2].
Given that the `PDFNetworkStream` has been, indirectly, supporting transferring of PDF data for years it would seem really strange if this didn't also apply to the `PDFFetchStream`-implementation.
Hence this patch simply enables transferring of PDF data, when accessed using the Fetch API, unconditionally to help reduced main-thread memory usage since the `PDFFetchStream`-implementation is used *by default* in browsers (for the GENERIC build).
---
[1] As opposed to PDF data being provided as e.g. a TypedArray when calling `getDocument` in the API.
[2] This is a "special" Node.js object, see https://nodejs.org/api/buffer.html#buffer, which doesn't exist in browsers.
Also, removes the `initialData`-parameter JSDocs for the `getDocument`-function given that this parameter has been completely unused since PR 8982 (over five years ago). Note that the `initialData`-parameter is, and always was, intended to be provided when initializing a `PDFDataRangeTransport`-instance.
Given that this is internal functionality, not exposed in the official API, it's not entirely clear (at least to me) why we can't just initialize this directly in `src/display/api.js` instead.
When testing both the development viewer and all the ways in which we run tests, everthing still appears to work just fine with this patch.
This was deprecated in PR 15758 but it's unfortunately quite difficult to tell if third-party users are depending on this, e.g. to implement custom error reporting, and if so to what extent.
However, thanks to the pre-processor we can limit *most* of this code to GENERIC builds which still seem like a worthwhile change.
These changes reduce the bundle size of the Firefox PDF Viewer by 3.8 kB in total.
This was deprecated in PR 15758 and given that it's quite unlikely that any third-party users are relying on this functionality, since it was only ever added to support telemetry reporting in the Firefox PDF Viewer, it should hopefully be fine to remove this fairly quickly.
These changes reduce the bundle size of the Firefox PDF Viewer by 4.5 kB in total.
Given that the Fetch API only supports the http/https protocols, worker-thread fetching of CMaps and Standard-fonts may thus fail in certain cases. To improve the default behaviour we'll now also check that the `cMapUrl` and `standardFontDataUrl` options are appropriate, except in Firefox where this should always work.
Note how, in the scripting initialization in the viewer, we only ever invoke `PDFPageProxy.getJSActions` *once* per page in order to improve overall performance; see a575aa13b9/web/pdf_scripting_manager.js (L372-L375)
Hence it really shouldn't be necessary to cache its result in the API, especially when that is done *manually* rather than using something like `shadow`.
When we're destroying a `PDFPageProxy`-instance, during full document destruction, we'll force-abort any worker-thread parsing of operatorLists. Hence we should make sure that any pending cancel timeout is always aborted, since a later `PDFPageProxy._abortOperatorList` call should always "replace" a previous one.
*Please note:* Technically this was always wrong, but with the changes in PR 15825 it became *ever so slightly* easier to trigger this thanks to the potentially longer timeout.
By moving this code the "pageviewer"-component example will become slightly more usable on its own, it may simplify a future addition of XFA Foreground document support, and finally also serves as preparation for the following patches.
This is done to support upcoming viewer-changes, and in order to prevent third-party users from outright breaking things we'll simply ignore too large values.
This was essentially done only to compensate for the viewer calling `PDFPageProxy.getAnnotations` unconditionally on every annotationLayer-rendering invocation. With the previous patch that's no longer happening, and this API-caching should thus no longer be necessary.
Compared to the recent PR 15722 for the `textLayer` this one should be a (comparatively) much a smaller win overall, since most documents don't have any structTree-data and the required parsing should be cheaper. However, it seems to me that it cannot hurt to improve this nonetheless.
Note that by moving the `structTreeLayer` initialization we remove the need for the "textlayerrendered" event listener, which thus simplifies the code a little bit.
Also, removes the API-caching of the structTree-data since this was basically done to offset the lack of caching in the viewer.
Add a deprecation notification for PDFDocumentLoadingTask.onUnsupportedFeature and PDFDocumentProxy.stats
which are likely useless.
The unsupported feature stuff have initially been added in (#4048) in order to be able to display a
warning bar and to help to have some numbers to know how a feature was used.
Those data are no more used in Firefox.
This can't be a particularly common feature, since we've supported Optional Content for over two years and this is the very first TilingPattern-case we've seen.
Note how we're currently skipping all main-thread cleanup when document destruction has started, but for some reason we're still dispatching the "Cleanup" message.
This seems like a simple oversight, since destruction will already invoke the `BasePdfManager.cleanup` method (on the worker-thread) to fully clear-out all caches.
Looking at the code on the worker-thread, there doesn't appear to be any particular reason for placing *some* of the properties in a `source`-object when sending them with "GetDocRequest".
As is often the case the explanation for this structure is rather "for historical reasons", since originally we simply sent the `source`-object as-is. Doing that was obviously a bad idea, for a couple of reasons:
- It makes it less clear what is/isn't actually needed on the worker-thread.
- Sending unused properties will unnecessarily increase memory usage.
- The `source`-object may contain unclonable data, which would break the library.
Rather than sending all of these parameters individually and then grouping them together on the worker-thread, we can simply handle that in the API instead.
This patch first of all makes `isOffscreenCanvasSupported` configurable, defaulting to `true` in browsers and `false` in Node.js environments, with a new `getDocument` parameter. While you normally want to use this, in order to improve performance, it should still be possible for users to control it (similar to e.g. `isEvalSupported`).
The specific problem, as reported in issue 14952, is that the SVG back-end doesn't support the new ImageMask data-format that's introduced in PR 14754. In particular:
- When the SVG back-end is used in Node.js environments, this patch will "just work" without the user needing to make any code changes.
- If the SVG back-end is used in browsers, this patch will require that `isOffscreenCanvasSupported: false` is added to the `getDocument`-call.
This is yet another small piece of clean-up of the `FontLoader`-code, since we've not used this `id`-property for anything ever since PR 6571 (which landed almost seven years ago). Furthermore, by default we're also not even using that code-path now since the Font Loading API will always be used when available.
*Please note:* This is tagged `[api-minor]` since it's technically observable from the outside, however no user ought to be directly interacting with these CSS font rules.
This patch updates a bunch of older code, that makes conditional function calls, to use optional chaining rather than `if`-blocks.
These mostly mechanical changes reduce the size of the `gulp mozcentral` build by a little over 1 kB.
There's three notable exceptions here:
- The `saveDocument` one is converted into a permanent `warn`, since it still works when the `annotationStorage` is empty although it's (obviously) less efficient than `getData`.
- The `fallbackWorkerSrc` functionality (for browsers), since just removing it would risk too much third-party breakage.
- The SVG back-end, since a final decision is yet to be made. (It might be completely removed, or left as-is in an essentially "frozen" state.)
- Remove the `typeof Worker` check, since all browsers have had `Worker` support for many years now; see https://developer.mozilla.org/en-US/docs/Web/API/Worker#browser_compatibility
Furthermore the `new Worker(...)` call is wrapped in try-catch, which means that we'll still fallback to "fake workers" if necessary.
- Limit the `fallbackWorkerSrc` handling, in the `PDFWorker.workerSrc` getter, to only GENERIC builds since that's the only place where it's defined anyway.
While this has always worked, as a consequence of the implementation, it's never been officially supported.
In addition to adding basic unit-tests, this patch also introduces a couple of new JSDoc `@typedef`s in the API to avoid overly long lines.
To improve performance of the sidebar we use the page-canvases to generate the thumbnails whenever possible, since that avoids unnecessary re-rendering when the sidebar is open. This works generally well, however there's an old problem in PDF documents that contain interactive forms (when those are enabled): Note how the thumbnails become partially (or fully) blank, since those Annotations are not included in the OperatorList.[1]
We obviously want to keep using the `PDFThumbnailView.setImage`-method for most documents, however we need a way to skip it only for those pages that contain interactive forms.
As it turns out it's unfortunately not all that simple to tell, after the fact, from looking only at the OperatorList that some Annotations were skipped. While it might have been possible to try and infer that in the viewer, it'd not have been pretty considering that at the time when rendering finishes the annotationLayer has not yet been built.
The overall simplest solution that I could come up with, was instead to include a *summary* of the interactive form-state when doing the final "flushing" of the OperatorList and expose that information in the API.
---
[1] Some examples from our test-suite: `annotation-tx2.pdf` where the thumbnail is completely blank, and `bug1737260.pdf` where the thumbnail is missing the "buttons" found on the page.
Given that printing is triggered *synchronously* in browsers, it's thus possible for scripting (in PDF documents) to modify the Annotation-data while printing is currently ongoing.
To work-around that we add a new printing-specific `AnnotationStorage`, where the serializable data is *frozen* upon initialization, which the viewer can thus create/utilize during printing.
This patch contains a small optimization specifically for the case when `getDocument` is called with TypedArray-data. In that case we'll still hold onto that data, which could obviously be large, even after the "GetDocRequest"-message has been sent to the worker-thread.
In practice this will most likely not affect memory usage in any noticeable way, since the application calling `getDocument` will probably also be keeping a reference to the TypedArray-data. However, it seems like a good idea to ensure that the PDF.js API *itself* won't unnecessarily keep this data alive.
- Use Canvas & CanvasText color when they don't have their default value
as background and foreground colors.
- The colors used to draw (stroke/fill) in a pdf are replaced by the bg/fg
ones according to their luminance.
The current `lastModified`-getter, which only contains a time-stamp, is a fairly crude way of detecting if the stored data has actually been changed. In particular, when the `getRawValue`-method is used, the `lastModified`-getter doesn't cope with data being modified from the "outside".
To fix these issues[1], and to prevent any future bugs in this code, this patch introduces a new `AnnotationStorage.hash`-getter which computes a hash of the currently stored data. To simplify things this re-uses the existing `MurmurHash3_64`-implementation, which required moving that file into the `src/shared/`-folder, since its performance should be good enough here.
---
[1] Given how the `AnnotationStorage.lastModified`-getter was used, this would have been limited to *printing* of forms.
Given that the `isNodeJS`-constant will, after PR 14858, *always* be `false` in non-GENERIC builds we can simplify a couple of `getDocument`-parameter default values slightly.
The old format, with inline `PDFJSDev`-checks, wasn't exactly a wonder of readability; which was my fault.
This first of all simplifies the file, since we no longer need dummy-classes and can instead *directly* define the actual classes.
Furthermore, and more importantly, this means that we no longer need to bundle this code in e.g. MOZCENTRAL-builds which reduces the size of *built* `pdf.js` file slightly.
Because of a bug in previous `core-js` versions, which caused an Error to be thrown if its `structuredClone` polyfill was called with an *explicit* `null`/`undefined` transfer-parameter, the `LoopbackPort`-class contained a work-around.
In the latest `core-js` version this has been fixed, and we can thus simplify our code ever so slightly; please see https://github.com/zloirock/core-js/releases/tag/v3.22.0
- it aims to partially fix performance issue reported: https://bugzilla.mozilla.org/show_bug.cgi?id=857031;
- the idea is too avoid to use byte arrays but use ImageBitmap which are a way faster to draw:
* an ImageBitmap is Transferable which means that it can be built in the worker instead of in the main thread:
- this is achieved in using an OffscreenCanvas when it's available, there is a bug to enable them
for pdf.js: https://bugzilla.mozilla.org/show_bug.cgi?id=1763330;
- or in using createImageBitmap: in Firefox a task is sent to the main thread to build the bitmap so
it's slightly slower than using an OffscreenCanvas.
* it's transfered from the worker to the main thread by "reference";
* the byte buffers used to create the image data have a very short lifetime and ergo the memory used is globally
less than before.
- Use the localImageCache for the mask;
- Fix the pdf issue4436r.pdf: it was expected to have a binary stream for the image;
- Move the singlePixel trick from operator_list to image: this way we can use this trick even if it isn't in a set
as defined in operator_list.
There's a couple of `getDocument` parameters that should be numbers, but which are currently not *fully* validated to prevent issues elsewhere in the code-base.
Also, improves validation of the `ownerDocument` parameter since we currently accept more-or-less anything here.
Given that we now only use Workers when `postMessage` transfers are supported, there's really no point in trying to send a "test" message *without* transfers present.
Hence, if `postMessage` transfers are not supported by the browser, we'll now fallback to "fake" Workers immediately instead. The comment about Opera is also removed, since it was originally added back in PR 983 and mentions Opera `11.60` [which was released in 2011](https://en.wikipedia.org/wiki/History_of_the_Opera_web_browser#Version_11).