The `pageColors`-option was removed from the `CanvasGraphics`-constructor in PR 16075, hence the code in the API no longer needs to pass in that option; this is something that I missed during review.
The idea is to apply an overall filter on each page: the main advantage
is to have some filtered images which could help to make them visible for
some users.
With the previous commit this is now completely unused in API, hence it can be removed. This is done in a separate commit to make it easier to re-instate it, would the need ever arise.
This patch extends PR 16115 to work in all browsers, regardless of their `OffscreenCanvas` support, such that transfer functions will be applied to general rendering (and not just image data).
In order to do this we introduce the `BaseFilterFactory` that is then extended in browsers/Node.js environments, similar to all the other factories used in the API, such that we always have the necessary factory available in `src/display/canvas.js`.
These changes help simplify the existing `putBinaryImageData` function, and the new method can easily be stubbed-out in the Firefox PDF Viewer.
*Please note:* This patch removes the old *partial* transfer function support, which only applied to image data, from Node.js environments since the `node-canvas` package currently doesn't support filters. However, this should hopefully be fine given that:
- Transfer functions are not very commonly used in PDF documents.
- Browsers in general, and Firefox in particular, are the *primary* development target for the PDF.js library.
- The FAQ only lists Node.js as *mostly* supported, see https://github.com/mozilla/pdf.js/wiki/Frequently-Asked-Questions#faq-support
In the general PDF.js library multiple PDF documents may be opened on the same web-page, which is why we many years ago started using document-specific identifiers to prevent issues with global data such e.g. with fonts.
Hence we need to treat the identifiers generated by the `FilterFactory` in the same way, since the SVG-filters for two separate PDF documents may otherwise get identical ids.
The current value originated in PR 2317, and in the decade that have passed the amount of RAM available in (most) devices should have increased a fair bit.
Nowadays we also do a much better job of detecting repeated images at both the page- and document-level, which helps reduce overall memory-usage in many documents.
Finally the constant is also moved into the `src/shared/util.js` file, since it was implicitly used on both the main- and worker-thread previously.
Currently in PDF documents with large images we immediately cleanup once rendering has finished, in order to reduce memory-usage.
Normally that shouldn't be a big problem, however when e.g. repeated zooming happens in the viewer that could easily lead to a lot of wasted resources (and waiting).
Hence this patch, which introduces a new `PDFPageProxy` method that will slightly delay cleanup after rendering.
This simply extends the approach in PR 10727 to also cover Patterns, which shouldn't be a common occurrence in Type3 fonts (since this is the first issue we've seen).
This was deprecated in PR 15758, which has now been included in three official PDF.js releases.
While PR 15880 did limit the bundle-size impact of this functionality on e.g. the Firefox PDF Viewer, it still leads to some unnecessary "bloat" that these changes remove.
Furthermore, with this being deprecated there'd also be no effort put into e.g. extending the `UNSUPPORTED_FEATURES` list when handling future error cases.
This was deprecated in PR 15943, which has now been included in two official PDF.js releases.
Given that `PDFDataRangeTransport` is somewhat unlikely to be used outside of the *built-in* Firefox PDF Viewer, it doesn't seem necessary to wait longer before removing this.
Also, removes the specific error-message for GENERIC builds to not unnecessarily "advertise" using non-objects when calling the `getDocument`-function.
*Please note:* This patch is written using the GitHub UI, since I'm currently without a dev machine, so hopefully it works correctly.
We introduced the use of OffscreenCanvas in #14754 and this patch aims
to use them for all kind of images.
It'll slightly improve performances (and maybe slightly decrease memory use).
Since an image can be rendered in using some transfer maps but because of
OffscreenCanvas we don't have the underlying pixels array the transfer maps
stuff is re-implemented in using the SVG filter feComponentTransfer.
Rather than repeatedly initializing a `canvasFactory`-instance for every page, move it to the document-level instead.
*Please note:* This patch is written using the GitHub UI, since I'm currently without a dev machine, so hopefully it works correctly.
I noticed several 'Path not found' errors because of a field called #subform[2].
From the XFA specs, the hash is used for a class of elements in the template tree.
When we're looking for a node in the datasets tree, it doesn't make sense to search
for a class. Hence the path element starting with a hash are just skipped.
With upcoming changes we'll potentially start to cache `ImageBitmap` data at the document-level, in addition to just at the page-level.
Hence we need to ensure that such data is actually released on clean-up, and rather than duplicating the existing *manual* handling this code is instead moved into the `PDFObjects.clear` method. (In my opinion, this is an overall improvement even without globally cached `ImageBitmap` data.)
*Please note:* This patch is written using the GitHub UI, since I'm currently without a dev machine, so hopefully it's correct and makes sense.
The `Buffer`-object is Node.js specific functionality[1], thus (obviously) not found in browsers. Please note that the PDF.js library has never officially supported/documented that binary data can be passed as a `Buffer`, and that *internally* in the `src/core`-code we only work with standard `Uint8Array`s.
This means that if, in Node.js environments, a `Buffer` is passed to the API we need to wrap it into a `Uint8Array`, which essentially means creating a copy of the data and thus increasing memory usage.
---
[1] Refer to https://nodejs.org/api/buffer.html#buffer
By default we're using worker-thread fetching (in browsers) of this data nowadays, however in Node.js environments or if the user provides custom factories we still fallback to main-thread fetching.
Hence it makes sense, as far as I'm concerned, to move this initialization into the `getDocument` function to ensure that the factories can actually be initialized *before* attempting to load the document.
Also, this further reduces the amount of `getDocument` parameters that we need to pass into into the `WorkerTransport` class.
Currently we're passing all available parameters to this function respectively class, despite that not actually being necessary.
By splitting the parameters we not only improve the structure, and basically "document" the code a little bit, but we can also simplify the `_fetchDocument` function considerably.
This is very old code, where we loop through the user-provided options and build an internal parameter object. To prevent errors we also need to ensure that the parameters are correct/valid, which is especially important for the ones that are sent to the worker-thread such that structured cloning won't fail.[1]
Over the years this has led to more and more code being added in `getDocument` to validate the user-provided options, and at this point *most* of them have at least basic validation. However the way that this is implemented feels slightly backwards, since we first build the internal parameter object and only *afterwards* validate those parameters.[2]
Hence this patch changes the `getDocument` function to instead check/validate the supported options upfront, and then *explicitly* build the internal parameter object with only the needed properties.
---
[1] Note the supported types at https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Structured_clone_algorithm#supported_types
[2] The internal parameter object may also, because of the loop, end up with lots of unnecessary properties since anything that the user provides is being copied.
A number of methods have their Promises cached, to avoid repeated worker round-trips, since they're expected to be called more than once from the default viewer. The way that the caching is currently implemented means that we need to remember to manually clear these Promises on document cleanup/destruction, and it'd be nice to avoid that.
With this patch the relevant Promises are now instead placed in just one `Map`, which is easy to clear, and a new helper method is also introduced to reduce duplication for *simple* `WorkerTransport` methods.
The only parameter that we actually need here is the `PDFDataRangeTransport`-instance, since the others are not necessary.
- The `url` parameter, as passed to the `getDocument` function in the API, is simply being ignored; see 2d87a2eb1c/src/display/api.js (L447-L458)
- The `length` parameter, as passed to the `getDocument` function in the API, is always being overwritten; see 2d87a2eb1c/src/display/api.js (L519-L525)
Until PR 12563 is deemed safe to land, I'd still like to be able to use worker-modules in the viewer during local development.
Hence this patch which *temporarily* adds a new `workerModules` hash-parameter, only available in non-PRODUCTION mode, that allows using worker-modules in the development viewer.
To enable this functionality, simply use http://localhost:8888/web/viewer.html#workerModules=true
The initial CMap support was added in PR 4259 using the "raw" Adobe files, however they were quickly deemed to be unnecessarily large. As a result PR 4470 introduced the more compact "binary" CMap format, with both of those PRs being included in the very same release (version `0.8.1334`) .
Please note that we've thus never shipped anything *except* the "binary" CMap files with the PDF library, and furthermore note that we've not even once updated the CMap files since they were originally added almost nine years ago.
Requiring users to remember that `cMapPacked = true` is necessary, in addition to setting the `cMapUrl` parameter, in order for CMap loading to work feels like a less than ideal API.
Hence this patch, which suggests that we simply let `cMapPacked` default to `true` now.
In general it's always recommended to pass a *parameter object* when calling the `getDocument`-function in the API, since that's the only way to provide additional options, and the fact that it also accepts a URL or TypedArray directly is now mostly for backwards compatibility reasons.
Unfortunately we cannot really remove this, since that code has existed since "forever", however we can limit it to only the GENERIC build to avoid completely unnecessary checks in e.g. the Firefox PDF Viewer.
Finally, note that the default-viewer always provides a *parameter object* when calling the `getDocument`-function and it's thus completely unaffected by these changes.
- Use a `URL`-instance directly, since it's by definition an absolute URL.
- Actually limit the "raw" url-string handling to Node.js environments, as intended.
- Skip the warning, since we're already throwing an Error if the `url`-parameter is invalid.
In general it's recommended to pass a *parameter object* when calling the `getDocument`-function in the API, since that's the only way to provide additional options, and the fact that it also accepts a URL or TypedArray directly is now mostly for backwards compatibility reasons.
However, the `getDocument`-function also accepts a direct `PDFDataRangeTransport`-instance which just seems unnecessary.
*Please note:* The `PDFDataRangeTransport`-implementation was added specifically for the *built-in* Firefox PDF Viewer, however it's most likely not commonly used by any third-party (given that it requires manual PDF-data loading).
Furthermore, the default-viewer always provides a *parameter object* when calling the `getDocument`-function and it's thus completely unaffected by these changes.
This patch removes the recently introduced `transferPdfData` API-option, and simply enables transferring of TypedArray data *by default* instead of copying it. This will help reduce main-thread memory usage, however it will take ownership of the TypedArrays. Currently this only applies to the following cases:
- TypedArrays passed to the `getDocument`-function in the API, in order to open PDF documents from binary data.
- TypedArrays passed to a `PDFDataRangeTransport`-instance, used to support custom PDF document fetching/loading (see e.g. the Firefox PDF Viewer).
*PLEASE NOTE:* To avoid being affected by this, please simply *copy* any TypedArray data before passing it to either of the functions/methods mentioned above.
Now that we transfer TypedArray data that we previously only copied, we need to be more careful with input validation. Given how the `{IPDFStreamReader, IPDFStreamRangeReader}.read` methods will always return ArrayBuffer data, which is then transferred to the worker-thread[1], the actual TypedArray data passed to the API thus need to have the same exact size as its underlying ArrayBuffer to prevent issues.
Hence we'll check for this and only allow transferring of *safe* TypedArray data, and fallback to simply copying the data just as before. This obviously shouldn't be an issue in the Firefox PDF Viewer, but for the general PDF.js library we need to be more careful here.
---
[1] See e09ad99973/src/display/api.js (L2492-L2506) respectively e09ad99973/src/display/api.js (L2578-L2590)
Note how in the API we're transferring the PDF data that's fetched over the network[1]:
- f28bf23a31/src/display/api.js (L2467-L2480)
- f28bf23a31/src/display/api.js (L2553-L2564)
To support that functionality we have the `PDFDataTransportStream`, `PDFFetchStream`, `PDFNetworkStream`, and `PDFNodeStream` implementations. Here these stream-implementations vary slightly in how they handle `ArrayBuffer`s internally, w.r.t. transferring or copying the data:
- In `PDFDataTransportStream` we optionally, after PR 15908, allow transferring of the PDF data as provided externally (used e.g. in the Firefox PDF Viewer).
- In `PDFFetchStream` we're currenly always copying the PDF data returned by the Fetch API, which seems unnecessary. As discussed in PR 15908, it'd seem very weird if this sort of browser API didn't allow transferring of the returned data.
- In `PDFNetworkStream` we're already, since many years, transferring the PDF data returned by the `XMLHttpRequest` functionality. Note how the `getArrayBuffer` helper function simply returns an `ArrayBuffer` response as-is.
- In `PDFNodeStream` we're currently copying the PDF data, however this is unfortunately necessary since Node.js returns data as a `Buffer` object[2].
Given that the `PDFNetworkStream` has been, indirectly, supporting transferring of PDF data for years it would seem really strange if this didn't also apply to the `PDFFetchStream`-implementation.
Hence this patch simply enables transferring of PDF data, when accessed using the Fetch API, unconditionally to help reduced main-thread memory usage since the `PDFFetchStream`-implementation is used *by default* in browsers (for the GENERIC build).
---
[1] As opposed to PDF data being provided as e.g. a TypedArray when calling `getDocument` in the API.
[2] This is a "special" Node.js object, see https://nodejs.org/api/buffer.html#buffer, which doesn't exist in browsers.
Also, removes the `initialData`-parameter JSDocs for the `getDocument`-function given that this parameter has been completely unused since PR 8982 (over five years ago). Note that the `initialData`-parameter is, and always was, intended to be provided when initializing a `PDFDataRangeTransport`-instance.