On my computer, it takes a few tenths of a second to load a local font.
Since a font can be used several times in a document, the cache will
improve performance.
- Replace FoxitSans with LiberationSans: LiberationSans is already there (for XFA) and we can use
it as a good replacement for FoxitSans.
- For now we only try to substitute the standard fonts; the strategy is the following (see the sketch after this list):
* we try to find a font locally from a hardcoded list;
* if that fails we use Liberation as a fallback (only for Helvetica for the moment);
* else we just fall back on the system serif/sans-serif/monospace font.
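A minimal sketch of how such a fallback chain could look; the helper name, the font list, and the return shape are all illustrative assumptions, not the actual implementation:

```js
// Hardcoded list of locally-available candidates (illustrative).
const LOCAL_CANDIDATES = {
  Helvetica: ["Arial", "Helvetica Neue"],
  "Times-Roman": ["Times New Roman"],
};

function getFontSubstitution(baseFontName) {
  // 1. Try to find a font locally, from the hardcoded list.
  const candidates = LOCAL_CANDIDATES[baseFontName];
  if (candidates) {
    return { css: candidates.map(name => `local("${name}")`).join(",") };
  }
  // 2. If that fails, use Liberation as a fallback (Helvetica only for now).
  if (baseFontName.startsWith("Helvetica")) {
    return { url: "LiberationSans-Regular.ttf" };
  }
  // 3. Else fall back on a generic system font.
  return { css: "sans-serif" };
}
```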
This patch updates the minimum supported browsers as follows:
- Safari 15.4, which was released on 2022-03-15; see https://en.wikipedia.org/wiki/Safari_version_history#Safari_15
Nowadays we usually try, where feasible, to support browsers that are about two years old. The reasons for limiting support to a *somewhat* more recent Safari version include:
- Throughout the history of the PDF.js project, Safari has always been the worst browser to attempt to support. Compared to other browsers there's a disproportionate number of bugs affecting Safari, especially on iOS, and in most cases those are browser-specific issues that we simply cannot address.[1]
- Safari has often been a lot slower, compared to other browsers, at implementing new web-platform features. Historically this has sometimes blocked usage of new features, for the benefit of the Firefox PDF Viewer, and it's very often meant having to include and maintain polyfills *only* for Safari.
- The current (minimum) supported Safari version lacks enough functionality that the polyfills placed in the `src/shared/compatibility.js` file are unfortunately not sufficient; it also requires a bunch of special-cases in both the `gulpfile` and in the `web/`-code.
- Given that the *built-in* Firefox PDF Viewer is the primary development target for the PDF.js library, and the general development pace these days, we need to limit the maintenance "overhead" caused by other browsers.
---
[1] In a few cases a work-around might be possible, however it'd negatively affect e.g. performance, readability, and/or maintainability of the code.
Originally we only used the `structuredClone` polyfill in the `LoopbackPort`-implementation, and that obviously isn't used anywhere within the various image decoders.
At this point in time we've started to use `structuredClone` a little bit more, hence it seems overall simpler to just bundle the polyfill even in the `legacy`-version of the IMAGE_DECODERS built-target.
For some time these checks have only targeted Node.js environments, since the features in question exist in all supported browsers (even when a `legacy`-build is used).
Now that we've updated the minimum supported Node.js version to 18, a number of polyfills are thus (finally) no longer necessary in that environment. Hence for certain *basic* functionality, such as e.g. text-extraction, it's now possible to use either a modern- or a `legacy`-build of the PDF.js library in Node.js environments.
*Please note:* For e.g. canvas-rendering in Node.js environments it's still necessary to use a `legacy`-build, since that functionality requires various polyfills.
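For example, basic text-extraction in Node.js could now look something like this (a minimal sketch; the exact `pdfjs-dist` import path is an assumption and may differ between releases):

```js
import fs from "fs";
import { getDocument } from "pdfjs-dist/build/pdf.mjs";

// Wrap the Buffer in a Uint8Array, since that's what the API expects.
const data = new Uint8Array(fs.readFileSync("document.pdf"));

const pdf = await getDocument({ data }).promise;
const page = await pdf.getPage(1);
const textContent = await page.getTextContent();
console.log(textContent.items.map(item => item.str).join(""));
```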
This patch updates the minimum supported environments as follows:
- Node.js 18, which was released on 2022-04-19; see https://en.wikipedia.org/wiki/Node.js#Releases
Note also that Node.js 16 will soon reach EOL, and thus no longer receive any security updates.
The /Decode-implementation in our JPEG decoder, i.e. `src/core/jpg.js`, seems to only handle *inverting* of images properly. To support arbitrary /Decode-entries correctly we'll always use the `PDFImage.decodeBuffer` method, even for "simple" JPEG images, which should be fine since non-default /Decode-entries aren't a very common occurrence.
*Please note:* This patch will lead to a little bit of movement in some existing test-cases, however it should be virtually imperceptible to the naked eye.
This property was added in PR 12726 specifically for use in the `getFontType` function, indirectly used by the `PDFDocumentProxy.stats` getter in the API.
In PR 15880 that functionality was removed, but I forgot to remove this now unused font-property.
Now that we no longer depend on the old Babel version in SystemJS we can remove the `static get ...` work-arounds used to define constants, which leads to slightly more compact code.
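Illustrative example of the work-around in question (class and constant names are hypothetical):

```js
// Before: a constant defined via a getter, to stay compatible with the
// old Babel version used by SystemJS.
class OldLexer {
  static get MAX_DEPTH() {
    return 10;
  }
}

// After: a plain static class field, which is slightly more compact.
class NewLexer {
  static MAX_DEPTH = 10;
}
```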
Now that https://bugzilla.mozilla.org/show_bug.cgi?id=1247687 has landed in Firefox, we're able to use worker-modules during development :-)
This removes the final piece of SystemJS usage from the PDF.js library, thus allowing a fair bit of clean-up, and we now use *only* native `import`/`export` statements everywhere in development mode.
When the `GlobalImageCache` implementation originally landed, back in PR 11912, the image handling was slightly more complex (with e.g. browser-decoding of some JPEG images). At this point it no longer seems necessary to manually handle pageIndexes in this way, and we should be able to simply inline that in the `GlobalImageCache.shouldCache` method.
This method was added in PR 4938, almost nine years ago, however it doesn't appear to ever have been used.
Given the similarities between the `PDF17` and `PDF20` classes, and how they're used, if the `PDF20.hash` method was actually necessary you'd also expect a similar method in the `PDF17` class.
The "binary" CMap-format is specific to the PDF.js library, and is used to reduce the size of the built-in CMap data-files.
By moving this code to its own file we can remove the nowadays unnecessary closures, which helps to slightly reduce the size of this code.
Some Arabic chars like \ufe94 can be searched for in a PDF, hence they must be normalized
when creating the search query. To avoid duplicating the normalization code,
everything is moved into the find controller.
The previous code to normalize text used NFKC but with a hardcoded map; it
has been replaced by the use of normalize("NFKC"), which helps to reduce the bundle size
by 30kb.
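For example, the Arabic presentation form mentioned above normalizes to its general form without any hardcoded table:

```js
// \ufe94 (ARABIC LETTER TEH MARBUTA FINAL FORM) normalizes to \u0629
// (ARABIC LETTER TEH MARBUTA) under NFKC.
console.log("\ufe94".normalize("NFKC") === "\u0629"); // true
```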
In playing with this \ufe94 char, I noticed that the bidi algorithm wasn't taking into
account some RTL unicode ranges, the generated font wasn't embedding the mapping for this
char, and the unicode ranges in the OS/2 table weren't up-to-date.
When normalized, some chars can be replaced by several ones, which led to
some extra chars in the text layer. To avoid any regression, when copying some text
from the text layer, the copied string is normalized (NFKC) before being put in the
clipboard (this is how it works in both Acrobat and Chrome).
This *special* build-target is very old, and was introduced with the first pre-processor, which only used comments to enable/disable code.
When the new pre-processor was added `PRODUCTION` effectively became redundant, at least in JavaScript code, since `typeof PDFJSDev === "undefined"` checks now do the same thing.
This patch proposes that we remove `PRODUCTION` from the JavaScript code, since that simplifies the conditions and thus improves readability in many cases.
*Please note:* There's not, nor has there ever been, any gulp-task that set `PRODUCTION = false` during building.
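Illustrative example of the kind of simplification this enables, based on the common pre-processor pattern in the code-base:

```js
// Before:
if (typeof PDFJSDev === "undefined" || PDFJSDev.test("!PRODUCTION || GENERIC")) {
  // Development / GENERIC-build only code.
}

// After:
if (typeof PDFJSDev === "undefined" || PDFJSDev.test("GENERIC")) {
  // Development / GENERIC-build only code.
}
```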
This method was originally added in PR 1320, eleven years ago, however it doesn't appear to ever have been used (not even from the start).
Furthermore, this method also tries to access a property that doesn't exist (`this.out`) and then call a method that also doesn't exist (`writeByteArray`).
In looking at https://bugs.ghostscript.com/show_bug.cgi?id=706451 I noticed that bug2.pdf was pretty
slow to load for such a basic file.
In profiling I noticed that a lot of time was spent in Array.concat, hence this patch uses Array.push
where possible (it's now ~3 times faster).
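A minimal sketch of the difference (illustrative data):

```js
const chunks = [[1, 2], [3, 4], [5, 6]];

// Before: `concat` allocates and copies a brand new array on every
// iteration, which quickly becomes expensive for many/large chunks.
let slow = [];
for (const chunk of chunks) {
  slow = slow.concat(chunk);
}

// After: `push` appends to the existing array in place.
const fast = [];
for (const chunk of chunks) {
  fast.push(...chunk);
}
```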
The changes in PR 16238 were intended specifically for Node.js environments, however they accidentally applied to older browsers as well.
*Please note:* In up-to-date browsers `Path2D` is available in Workers, which is presumably connected to the introduction of `OffscreenCanvas`.
Apparently the `structuredClone` polyfill doesn't handle transfers correctly, and `DOMException`s may thus be thrown. This is particularly problematic in Node.js environments, where that exception (obviously) isn't available.
To work-around these issues we'll simply ignore any transfers in `legacy`-builds, since those *may* use the `structuredClone` polyfill. This will obviously lead to slightly higher memory usage in those builds, however this really only affects Node.js environments. (Browsers are only affected if workers are disabled, however that's never been an officially recommended/supported configuration.)
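A minimal sketch of the work-around (the helper name and `isLegacyBuild` parameter are hypothetical; the real code decides based on the build type):

```js
function cloneValue(value, transfer, isLegacyBuild) {
  if (isLegacyBuild) {
    // The `structuredClone` polyfill may throw a `DOMException` when a
    // transfer list is provided, hence simply copy the data instead.
    transfer = undefined;
  }
  return structuredClone(value, transfer ? { transfer } : undefined);
}
```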
Currently we have two separate image-caches on the worker-thread:
- A local one, which is unique to each `PartialEvaluator.getOperatorList` invocation. This one caches both names *and* references, since image-resources may be accessed in either way.
- A global one, which applies to the entire PDF document and all its pages. This one only caches references, since nothing else would work.
This patch introduces a third image-cache, which essentially sits "between" the two existing ones. The new `RegionalImageCache`[1] will be usable throughout a `PartialEvaluator` instance, and consequently it *only* caches references, which thus allows us to keep track of repeated image-resources found in e.g. different /Form and /SMask objects.
---
[1] For lack of a better word, since naming things is hard...
*Please note:* This parameter has never been used within the PDF.js library/viewer itself, and it was only ever added for backwards compatibility reasons.
This parameter was added in PR 7475, over six years ago, to try and optionally maintain the previous *default* text-extraction behaviour.
However as part of the general text-extraction improvements in PR 13257, almost two years ago, the `disableCombineTextItems` functionality was accidentally "broken" in various ways. Note how the only (very basic) unit-test was updated in a way that doesn't really make sense, since generally speaking you'd expect that using the option should result in *more* (or at least the same number of) text-items. Furthermore there's also the recent issue 16209, where the option causes almost all textContent to be concatenated together.
Hence this patch proposes that we simply remove the `disableCombineTextItems` option since it's essentially unused/untested functionality, as evident from the fact that it took almost two years for someone to notice that it's broken.
Originally we used helper functions for checking if something was a Dictionary or Stream, and then having an initial `typeof` check probably made sense.
However, given that we're using `instanceof` nowadays, the additional check no longer seems necessary.
Currently we're *virtually* duplicating the same code, for validating quotation marks, twice in this helper function.
The size decrease is quite small (107 bytes) and this makes the code slightly harder to read, hence I completely understand if this patch is rejected.
Given that this functionality only applies in the viewer, when `PDFBug` is being enabled and used, it can't hurt to slightly reduce the size of this code.
Having just reviewed a patch touching this code, I couldn't help noticing that an `Object` isn't really the optimal data-structure for this and nowadays we can do better by using a `Set` instead.
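Illustrative example of the pattern being replaced (hypothetical values):

```js
// Before: an Object used as a makeshift set.
const supportedTypes = Object.create(null);
supportedTypes["text"] = true;
if (supportedTypes["text"]) {
  /* ... */
}

// After: a real Set, with clearer intent and direct membership tests.
const supportedTypesSet = new Set(["text"]);
if (supportedTypesSet.has("text")) {
  /* ... */
}
```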
This is something that I completely overlooked in PR 16162, which in some cases causes the default viewer to incorrectly print warnings.
This can be reproduced with the PAGE scrolling-mode, and/or the PresentationMode, and this patch simply works around it by checking the visibility as well (since the warning is a best-effort solution anyway).
The `pageColors`-option was removed from the `CanvasGraphics`-constructor in PR 16075, hence the code in the API no longer needs to pass in that option; this is something that I missed during review.
The idea is to apply an overall filter on each page: the main advantage
is to have filtered images, which could help to make them visible to
some users.
During review of PR 16151 this method was simplified, however I overlooked the fact that we now can (and really should) improve this by removing duplication.
Unfortunately I don't believe that we can simply add a default `--scale-factor` CSS-variable to the `container`-element, since that might not be entirely appropriate/correct in all cases.[1]
However, we can at least print a console-error to hopefully make this situation more apparent to users. (This is purposely not using the `warn` helper-function, since those messages can be disabled.)
---
[1] One example is in our reference-tests, where we don't need to add it to the `container`-element itself.
With the previous commit this is now completely unused in the API, hence it can be removed. This is done in a separate commit to make it easier to re-instate, should the need ever arise.
This patch extends PR 16115 to work in all browsers, regardless of their `OffscreenCanvas` support, such that transfer functions will be applied to general rendering (and not just image data).
In order to do this we introduce the `BaseFilterFactory` that is then extended in browsers/Node.js environments, similar to all the other factories used in the API, such that we always have the necessary factory available in `src/display/canvas.js`.
These changes help simplify the existing `putBinaryImageData` function, and the new method can easily be stubbed-out in the Firefox PDF Viewer.
*Please note:* This patch removes the old *partial* transfer function support, which only applied to image data, from Node.js environments since the `node-canvas` package currently doesn't support filters. However, this should hopefully be fine given that:
- Transfer functions are not very commonly used in PDF documents.
- Browsers in general, and Firefox in particular, are the *primary* development target for the PDF.js library.
- The FAQ only lists Node.js as *mostly* supported, see https://github.com/mozilla/pdf.js/wiki/Frequently-Asked-Questions#faq-support
In the general PDF.js library multiple PDF documents may be opened on the same web-page, which is why we many years ago started using document-specific identifiers to prevent issues with global data, e.g. with fonts.
Hence we need to treat the identifiers generated by the `FilterFactory` in the same way, since the SVG-filters for two separate PDF documents may otherwise get identical ids.
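A minimal sketch of what document-specific ids could look like (the class shape and property names are assumptions), mirroring the existing font identifiers:

```js
class FilterFactorySketch {
  #id = 0;

  constructor({ docId }) {
    this.docId = docId;
  }

  createId() {
    // Prefixing with `docId` keeps the SVG-filter ids unique even when
    // multiple PDF documents are opened on the same web-page.
    return `g_${this.docId}_filter_${this.#id++}`;
  }
}
```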
PDF gradients do not have color stops but an arbitrary PDF function of
the type f(t) -> color. CSS gradients are only based on color stops.
Most PDF gradient functions are produced from color stop oriented
gradients.
Take advantage of this by sampling the PDF function at a higher
frequency but not converting any samples which could be interpolated to
color stops. The sampling frequency is chosen to be the least common
multiple of as many values as practical to exactly re-create the common
case of the PDF function implementing equally spaced linearly
interpolated stops in RGB color space. This also allows for better
approximation of other smooth PDF functions (non-linear, non-equally
spaced, or in a different color space).
Fixes: #10572, #14165
The current value originated in PR 2317, and in the decade that has passed the amount of RAM available in (most) devices should have increased a fair bit.
Nowadays we also do a much better job of detecting repeated images at both the page- and document-level, which helps reduce overall memory-usage in many documents.
Finally the constant is also moved into the `src/shared/util.js` file, since it was implicitly used on both the main- and worker-thread previously.
Currently in PDF documents with large images we immediately clean up once rendering has finished, in order to reduce memory-usage.
Normally that shouldn't be a big problem, however when e.g. repeated zooming happens in the viewer that could easily lead to a lot of wasted resources (and waiting).
Hence this patch, which introduces a new `PDFPageProxy` method that will slightly delay cleanup after rendering.
The dimensions still need to be fixed (from time to time they're in px)
but it doesn't have to be postponed anymore.
To test it: draw something and, when resizing, look at the dimensions of the div
in devtools; the units must be %.
This simply extends the approach in PR 10727 to also cover Patterns, which shouldn't be a common occurrence in Type3 fonts (since this is the first issue we've seen).
This patch updates the minimum supported environments as follows:
- Node.js 16, which was released on 2021-04-20; see https://en.wikipedia.org/wiki/Node.js#Releases
Note also that Node.js 14 will very soon reach EOL, and thus no longer receive any security updates.
This was deprecated in PR 15758, which has now been included in three official PDF.js releases.
While PR 15880 did limit the bundle-size impact of this functionality on e.g. the Firefox PDF Viewer, it still leads to some unnecessary "bloat" that these changes remove.
Furthermore, with this being deprecated there'd also be no effort put into e.g. extending the `UNSUPPORTED_FEATURES` list when handling future error cases.
The idea is to encode large images in BMP format (which is very simple and doesn't
require computing any checksums) and then use createImageBitmap with a BMP blob
(which doesn't suffer from the Canvas/ImageData limits).
From a performance point of view it isn't great (generating a large blob + decoding
it on the main thread is really not ideal), but at least we have something to display,
which is way better than a blank page (and one can notice that most of the time is
spent decoding the image from the PDF stream).
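A minimal sketch of the approach, assuming a hypothetical `encodeBMP` helper that prepends the (checksum-free) BMP headers to the raw pixel rows, plus an existing canvas context `ctx`:

```js
// `encodeBMP` is a stand-in for the actual BMP-encoding code.
const bmpData = encodeBMP(pixels, width, height);
const blob = new Blob([bmpData], { type: "image/bmp" });
// `createImageBitmap` isn't subject to the Canvas/ImageData size limits.
const bitmap = await createImageBitmap(blob);
ctx.drawImage(bitmap, 0, 0);
```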
PDF 32000-1:2008 7.10.5.1 "Type 4 (PostScript Calculator) Functions"
defers to the PostScript Language Reference for the description of these
functions. The PostScript Language Reference, third edition chapter 8
"Operators" defines the `angle` type as a "number of degrees". Section
8.1 defines "angle `sin` real", "angle `cos` real", and "num den `atan`
angle". The documentation for `atan` further states that it will return
an angle in degrees between 0 and 360.
Handle these operators correctly in `PostScriptEvaluator.execute`.
Convert the inputs to `sin` and `cos` from degrees to radians for use
with `Math.sin` and `Math.cos`. Correctly pop two values from the stack
for `atan`, use `Math.atan2`, and convert from radians to (positive)
degrees.
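A minimal sketch of the degree-based handling (simplified from the actual switch-statement in `PostScriptEvaluator.execute`):

```js
function execute(operator, stack) {
  let a, b;
  switch (operator) {
    case "sin":
      a = stack.pop();
      stack.push(Math.sin((a / 180) * Math.PI)); // Input is in degrees.
      break;
    case "cos":
      a = stack.pop();
      stack.push(Math.cos((a / 180) * Math.PI)); // Input is in degrees.
      break;
    case "atan":
      b = stack.pop(); // den
      a = stack.pop(); // num
      a = (Math.atan2(a, b) / Math.PI) * 180; // Result in degrees...
      if (a < 0) {
        a += 360; // ...and always positive, per the specification.
      }
      stack.push(a);
      break;
  }
}
```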
This was deprecated in PR 15943, which has now been included in two official PDF.js releases.
Given that `PDFDataRangeTransport` is somewhat unlikely to be used outside of the *built-in* Firefox PDF Viewer, it doesn't seem necessary to wait longer before removing this.
Also, removes the specific error-message for GENERIC builds to not unnecessarily "advertise" using non-objects when calling the `getDocument`-function.
*Please note:* This patch is written using the GitHub UI, since I'm currently without a dev machine, so hopefully it works correctly.
We introduced the use of OffscreenCanvas in #14754 and this patch aims
to use it for all kinds of images.
It'll slightly improve performance (and maybe slightly decrease memory use).
An image can be rendered using some transfer maps, but because of
OffscreenCanvas we don't have access to the underlying pixel array, so the
transfer-map handling is re-implemented using the SVG filter feComponentTransfer.
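Illustrative sketch of the idea (the filter id and table values are made-up), given a canvas context `ctx` and a decoded `bitmap`:

```js
const svg = `
  <svg xmlns="http://www.w3.org/2000/svg" width="0" height="0">
    <filter id="pdfjs_transfer_map">
      <feComponentTransfer>
        <feFuncR type="table" tableValues="0 0.5 1"/>
        <feFuncG type="table" tableValues="0 0.5 1"/>
        <feFuncB type="table" tableValues="0 0.5 1"/>
      </feComponentTransfer>
    </filter>
  </svg>`;
document.body.insertAdjacentHTML("beforeend", svg);

// Apply the transfer maps while drawing, without touching the pixels.
ctx.filter = "url(#pdfjs_transfer_map)";
ctx.drawImage(bitmap, 0, 0);
ctx.filter = "none";
```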
Rather than repeatedly initializing a `canvasFactory`-instance for every page, move it to the document-level instead.
*Please note:* This patch is written using the GitHub UI, since I'm currently without a dev machine, so hopefully it works correctly.
Currently we repeat the `FeatureTest.isOffscreenCanvasSupported` checks all over the worker-thread code, and with upcoming changes this will become even "worse".
Hence this patch, which changes the *worker-thread* default value for the `isOffscreenCanvasSupported`-parameter to `false` and moves the feature-testing into the `BasePdfManager`-constructor.
*Please note:* This patch is written using the GitHub UI, since I'm currently without a dev machine, so hopefully it works correctly.
Currently some `getCtx` calls will have `isOffscreenCanvasSupported === undefined` set, meaning that `OffscreenCanvas` isn't being used as intended, since no `TextLayerRenderTask._isOffscreenCanvasSupported` property exists.
*Please note:* This patch is written using the GitHub UI, since I'm currently without a dev machine, so hopefully it works correctly.
I noticed several 'Path not found' errors because of a field called #subform[2].
From the XFA specs, the hash is used for a class of elements in the template tree.
When we're looking for a node in the datasets tree, it doesn't make sense to search
for a class. Hence path elements starting with a hash are just skipped.
With upcoming changes we'll potentially start to cache `ImageBitmap` data at the document-level, in addition to just at the page-level.
Hence we need to ensure that such data is actually released on clean-up, and rather than duplicating the existing *manual* handling this code is instead moved into the `PDFObjects.clear` method. (In my opinion, this is an overall improvement even without globally cached `ImageBitmap` data.)
*Please note:* This patch is written using the GitHub UI, since I'm currently without a dev machine, so hopefully it's correct and makes sense.
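A rough sketch of the consolidated clean-up (simplified; the real `PDFObjects` internals may differ):

```js
class PDFObjectsSketch {
  #objs = Object.create(null);

  clear() {
    for (const objId in this.#objs) {
      const { data } = this.#objs[objId];
      // Release `ImageBitmap` data explicitly, rather than relying on
      // every call-site to remember doing so manually.
      data?.bitmap?.close();
    }
    this.#objs = Object.create(null);
  }
}
```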
The `Buffer`-object is Node.js specific functionality[1], thus (obviously) not found in browsers. Please note that the PDF.js library has never officially supported/documented that binary data can be passed as a `Buffer`, and that *internally* in the `src/core`-code we only work with standard `Uint8Array`s.
This means that if, in Node.js environments, a `Buffer` is passed to the API we need to wrap it into a `Uint8Array`, which essentially means creating a copy of the data and thus increasing memory usage.
---
[1] Refer to https://nodejs.org/api/buffer.html#buffer
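A minimal sketch of the wrapping (the helper name is made-up):

```js
function toUint8Array(value) {
  if (typeof Buffer !== "undefined" && value instanceof Buffer) {
    // `Buffer` is Node.js-only; copy it into a standard Uint8Array.
    return new Uint8Array(value);
  }
  return value;
}
```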
Currently we duplicate the same code more than once in the `test/driver.js` file, which we can avoid by adding a new `AnnotationStorage` helper method instead.
Given that this helper function is only used on the worker-thread, there's no reason to duplicate it in both of the *built* `pdf.js` and `pdf.worker.js` files.
Currently these classes take a bunch of parameters (somewhat randomly ordered), probably because this is very old code that's been extended over the years.
Hence this patch changes the constructors to use parameter-objects instead, which improves consistency and (slightly) reduces the amount of code as well.
*Please note:* Also removes the `msgHandler`-property on these classes, since I cannot find a single call-site that accesses it.
Currently this helper function only has two call-sites, and both of them only pass in `ArrayBuffer` data. Given how it's implemented there's a couple of code-paths that are completely unused (e.g. the "string" one), and in particular the intended fast-paths don't actually work.
This patch re-factors and simplifies the helper function, and it'll no longer accept anything except `ArrayBuffer` data (hence why it's also re-named).
Note that at the time when `arraysToBytes` was added we still supported browsers without TypedArray functionality, and we'd then simulate them using regular Arrays.
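A rough sketch of the simplified helper (the actual implementation may differ): concatenate ArrayBuffers into a single Uint8Array.

```js
function arrayBuffersToBytes(arr) {
  const length = arr.length;
  if (length === 0) {
    return new Uint8Array(0);
  }
  if (length === 1) {
    return new Uint8Array(arr[0]); // Fast-path for a single buffer.
  }
  let dataLength = 0;
  for (const buffer of arr) {
    dataLength += buffer.byteLength;
  }
  const data = new Uint8Array(dataLength);
  let pos = 0;
  for (const buffer of arr) {
    data.set(new Uint8Array(buffer), pos);
    pos += buffer.byteLength;
  }
  return data;
}
```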
When printing the PDF in #12233 in Acrobat, we can see that the combo for country
is empty: that's because the V entry doesn't have to be one of the options.
We're using this helper function when reading data from the [`PDFWorkerStreamReader.read`](a49d1d1615/src/core/worker_stream.js (L90-L98)) and [`PDFWorkerStreamRangeReader.read`](a49d1d1615/src/core/worker_stream.js (L122-L128)) methods, and as can be seen they always return `ArrayBuffer` data. Hence we can simply get the `byteLength` directly, and don't need to use the helper function.
Note that at the time when `arrayByteLength` was added we still supported browsers without TypedArray functionality, and we'd then simulate them using regular Arrays.
In the Mac case we don't want to take the scaleFactor threshold into account,
because if it's too big another move could start and then subsequent
events wouldn't be considered wheel events.
It isn't really ideal, and at some point we'll need to find a way, at
least for the Firefox case, to get the real events instead of the fake
wheel ones.
By default we're using worker-thread fetching (in browsers) of this data nowadays, however in Node.js environments or if the user provides custom factories we still fallback to main-thread fetching.
Hence it makes sense, as far as I'm concerned, to move this initialization into the `getDocument` function to ensure that the factories can actually be initialized *before* attempting to load the document.
Also, this further reduces the amount of `getDocument` parameters that we need to pass into the `WorkerTransport` class.
Currently we're passing all available parameters to both this function and this class, despite that not actually being necessary.
By splitting the parameters we not only improve the structure, and basically "document" the code a little bit, but we can also simplify the `_fetchDocument` function considerably.
This is very old code, where we loop through the user-provided options and build an internal parameter object. To prevent errors we also need to ensure that the parameters are correct/valid, which is especially important for the ones that are sent to the worker-thread such that structured cloning won't fail.[1]
Over the years this has led to more and more code being added in `getDocument` to validate the user-provided options, and at this point *most* of them have at least basic validation. However the way that this is implemented feels slightly backwards, since we first build the internal parameter object and only *afterwards* validate those parameters.[2]
Hence this patch changes the `getDocument` function to instead check/validate the supported options upfront, and then *explicitly* build the internal parameter object with only the needed properties.
---
[1] Note the supported types at https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Structured_clone_algorithm#supported_types
[2] The internal parameter object may also, because of the loop, end up with lots of unnecessary properties since anything that the user provides is being copied.
A number of methods have their Promises cached, to avoid repeated worker round-trips, since they're expected to be called more than once from the default viewer. The way that the caching is currently implemented means that we need to remember to manually clear these Promises on document cleanup/destruction, and it'd be nice to avoid that.
With this patch the relevant Promises are now instead placed in just one `Map`, which is easy to clear, and a new helper method is also introduced to reduce duplication for *simple* `WorkerTransport` methods.
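A minimal sketch of the approach (names are approximations, not the actual `WorkerTransport` code):

```js
class WorkerTransportSketch {
  #methodPromises = new Map();

  constructor(messageHandler) {
    this.messageHandler = messageHandler;
  }

  #cacheSimpleMethod(name, data = null) {
    const cachedPromise = this.#methodPromises.get(name);
    if (cachedPromise) {
      return cachedPromise; // Avoid a repeated worker round-trip.
    }
    const promise = this.messageHandler.sendWithPromise(name, data);
    this.#methodPromises.set(name, promise);
    return promise;
  }

  getMetadata() {
    return this.#cacheSimpleMethod("GetMetadata");
  }

  destroy() {
    // One single call now clears *all* cached Promises.
    this.#methodPromises.clear();
  }
}
```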
The only parameter that we actually need here is the `PDFDataRangeTransport`-instance, since the others are not necessary.
- The `url` parameter, as passed to the `getDocument` function in the API, is simply being ignored; see 2d87a2eb1c/src/display/api.js (L447-L458)
- The `length` parameter, as passed to the `getDocument` function in the API, is always being overwritten; see 2d87a2eb1c/src/display/api.js (L519-L525)
Until PR 12563 is deemed safe to land, I'd still like to be able to use worker-modules in the viewer during local development.
Hence this patch which *temporarily* adds a new `workerModules` hash-parameter, only available in non-PRODUCTION mode, that allows using worker-modules in the development viewer.
To enable this functionality, simply use http://localhost:8888/web/viewer.html#workerModules=true
The initial CMap support was added in PR 4259 using the "raw" Adobe files, however they were quickly deemed to be unnecessarily large. As a result PR 4470 introduced the more compact "binary" CMap format, with both of those PRs being included in the very same release (version `0.8.1334`).
Please note that we've thus never shipped anything *except* the "binary" CMap files with the PDF library, and furthermore note that we've not even once updated the CMap files since they were originally added almost nine years ago.
Requiring users to remember that `cMapPacked = true` is necessary, in addition to setting the `cMapUrl` parameter, in order for CMap loading to work feels like a less than ideal API.
Hence this patch, which suggests that we simply let `cMapPacked` default to `true` now.
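Example usage after this change (the `cMapUrl` value is illustrative):

```js
const loadingTask = getDocument({
  url: "document.pdf",
  cMapUrl: "../node_modules/pdfjs-dist/cmaps/",
  // cMapPacked: true, // No longer necessary, since it defaults to `true`.
});
```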
Until just recently the only existing `Path2D` polyfill didn't have support for Node.js and/or the `node-canvas` package. Given that this was just fixed, in the latest version, we can now finally remove our inline-checks at the relevant call-sites; please also see https://github.com/nilzona/path2d-polyfill#usage-with-node-canvas
In PR #15757, a value is automatically converted into a number when possible,
but the case of numbers like "000123" was overlooked and their format must
be preserved.
When a script does something like "foo.value + bar.value" and the values are
numbers, then "foo.value" must return a number, but the displayed value must be
what the user entered or what a script set. So this patch adds a field
_orginalValue in order to track the value as it was defined.
Some people are used to using a comma as the decimal separator, hence this must
be taken into account when a value is parsed into a number.
This patch fixes a regression introduced by #15757.
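A minimal sketch of the idea (not the actual implementation):

```js
function parseNumber(value) {
  // Accept a comma as the decimal separator, e.g. "123,45" -> 123.45.
  const number = parseFloat(value.replace(",", "."));
  return Number.isNaN(number) ? null : number;
}
```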
There's really no need for these "complicated" default value assignments, since `GlobalWorkerOptions` is a local variable at this point, and this is rather a case of too much copy-and-paste.
Note that years ago, when all options were set using a global `PDFJS` object, it's possible that options had been set (from the outside) *before* the object had been properly initialized; see e.g. a89071bdef/src/display/global.js
In general it's always recommended to pass a *parameter object* when calling the `getDocument`-function in the API, since that's the only way to provide additional options, and the fact that it also accepts a URL or TypedArray directly is now mostly for backwards compatibility reasons.
Unfortunately we cannot really remove this, since that code has existed since "forever", however we can limit it to only the GENERIC build to avoid completely unnecessary checks in e.g. the Firefox PDF Viewer.
Finally, note that the default-viewer always provides a *parameter object* when calling the `getDocument`-function and it's thus completely unaffected by these changes.
- Use a `URL`-instance directly, since it's by definition an absolute URL.
- Actually limit the "raw" url-string handling to Node.js environments, as intended.
- Skip the warning, since we're already throwing an Error if the `url`-parameter is invalid.
*Please note:* I cannot reproduce the problem reported in bug 1811668, regarding the context menu, and in any case it's not clear that that part is even a PDF Viewer bug.
Looking at bug 1811668 I couldn't help but noticing that the textLayer isn't correct, and it's unfortunately once again a problem with the `adjustType1ToUnicode` function. That's intended to help improve text-selection for fonts without a /ToUnicode-entry, and in many cases it does help (the original PR fixed lots of issues) however it's also caused some problems.
In order to improve text-selection in bug 1811668, we'll now properly ignore fonts that have a predefined *named* encoding specified since that's really the intention with PR 14050.
The JBIG2 images in this PDF document are corrupt enough that even Adobe Reader warns about it when opening the file.
*Please note:* I don't really know the JBIG2 image format at all, however from a very brief look at the specification it seems that integers should be 32-bit.
In general it's recommended to pass a *parameter object* when calling the `getDocument`-function in the API, since that's the only way to provide additional options, and the fact that it also accepts a URL or TypedArray directly is now mostly for backwards compatibility reasons.
However, the `getDocument`-function also accepts a direct `PDFDataRangeTransport`-instance which just seems unnecessary.
*Please note:* The `PDFDataRangeTransport`-implementation was added specifically for the *built-in* Firefox PDF Viewer, however it's most likely not commonly used by any third-party (given that it requires manual PDF-data loading).
Furthermore, the default-viewer always provides a *parameter object* when calling the `getDocument`-function and it's thus completely unaffected by these changes.
The relevant TrueType font is missing both /ToUnicode *and* /Encoding entries, either of which would have prevented the (current) broken textLayer rendering.
My first idea was that we could use the `post` table in the TrueType font, see https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6post.html, to get the actual glyphNames and amend the fallback ToUnicode-map that way. Unfortunately that didn't work, since the `post` table only contained ".notdef" and "" (i.e. empty string) entries.
Instead we try to use the `name` table in the TrueType font, see https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6name.html, to determine if the platform is Windows and thus fallback to generate a ToUnicode-map from the `WinAnsiEncoding`.
Note how all over the `src/core/annotation.js`-code we're assuming that if an `appearance`-entry exists it's also a Stream. However, we're not actually checking that thoroughly enough which causes issues in some badly generated PDF documents.
This patch removes the recently introduced `transferPdfData` API-option, and simply enables transferring of TypedArray data *by default* instead of copying it. This will help reduce main-thread memory usage, however it will take ownership of the TypedArrays. Currently this only applies to the following cases:
- TypedArrays passed to the `getDocument`-function in the API, in order to open PDF documents from binary data.
- TypedArrays passed to a `PDFDataRangeTransport`-instance, used to support custom PDF document fetching/loading (see e.g. the Firefox PDF Viewer).
*PLEASE NOTE:* To avoid being affected by this, please simply *copy* any TypedArray data before passing it to either of the functions/methods mentioned above.
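For example, assuming `pdfData` is a TypedArray that you still need after the call:

```js
// Pass a *copy* of the data, so that only the copy gets transferred.
const loadingTask = getDocument({ data: pdfData.slice() });
// `pdfData` remains fully usable here.
```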
Now that we transfer TypedArray data that we previously only copied, we need to be more careful with input validation. Given how the `{IPDFStreamReader, IPDFStreamRangeReader}.read` methods will always return ArrayBuffer data, which is then transferred to the worker-thread[1], the actual TypedArray data passed to the API thus need to have the same exact size as its underlying ArrayBuffer to prevent issues.
Hence we'll check for this and only allow transferring of *safe* TypedArray data, and fallback to simply copying the data just as before. This obviously shouldn't be an issue in the Firefox PDF Viewer, but for the general PDF.js library we need to be more careful here.
---
[1] See e09ad99973/src/display/api.js (L2492-L2506) respectively e09ad99973/src/display/api.js (L2578-L2590)
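A minimal sketch of the *safe* condition (the helper name is made-up):

```js
function canTransfer(data) {
  // The TypedArray must span its entire underlying ArrayBuffer, since
  // it's the ArrayBuffer itself that gets transferred.
  return data.byteOffset === 0 && data.byteLength === data.buffer.byteLength;
}
```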
Note how in the API we're transferring the PDF data that's fetched over the network[1]:
- f28bf23a31/src/display/api.js (L2467-L2480)
- f28bf23a31/src/display/api.js (L2553-L2564)
To support that functionality we have the `PDFDataTransportStream`, `PDFFetchStream`, `PDFNetworkStream`, and `PDFNodeStream` implementations. Here these stream-implementations vary slightly in how they handle `ArrayBuffer`s internally, w.r.t. transferring or copying the data:
- In `PDFDataTransportStream` we optionally, after PR 15908, allow transferring of the PDF data as provided externally (used e.g. in the Firefox PDF Viewer).
- In `PDFFetchStream` we're currently always copying the PDF data returned by the Fetch API, which seems unnecessary. As discussed in PR 15908, it'd seem very weird if this sort of browser API didn't allow transferring of the returned data.
- In `PDFNetworkStream` we're already, since many years, transferring the PDF data returned by the `XMLHttpRequest` functionality. Note how the `getArrayBuffer` helper function simply returns an `ArrayBuffer` response as-is.
- In `PDFNodeStream` we're currently copying the PDF data, however this is unfortunately necessary since Node.js returns data as a `Buffer` object[2].
Given that the `PDFNetworkStream` has been, indirectly, supporting transferring of PDF data for years it would seem really strange if this didn't also apply to the `PDFFetchStream`-implementation.
Hence this patch simply enables transferring of PDF data, when accessed using the Fetch API, unconditionally to help reduce main-thread memory usage since the `PDFFetchStream`-implementation is used *by default* in browsers (for the GENERIC build).
---
[1] As opposed to PDF data being provided as e.g. a TypedArray when calling `getDocument` in the API.
[2] This is a "special" Node.js object, see https://nodejs.org/api/buffer.html#buffer, which doesn't exist in browsers.
Also, removes the `initialData`-parameter JSDocs for the `getDocument`-function given that this parameter has been completely unused since PR 8982 (over five years ago). Note that the `initialData`-parameter is, and always was, intended to be provided when initializing a `PDFDataRangeTransport`-instance.
Given that this is internal functionality, not exposed in the official API, it's not entirely clear (at least to me) why we can't just initialize this directly in `src/display/api.js` instead.
When testing both the development viewer and all the ways in which we run tests, everything still appears to work just fine with this patch.
*Please note:* The reduced test-case is *not* a perfect reproduction of the original PDF document, since this one fails to open in e.g. Adobe Reader, but I do believe that it captures the most important points here.
For corrupt *and* encrypted PDF documents, it's possible that only some trailer dictionaries actually contain an /Encrypt-entry. Previously we could easily miss that, since we generally pick the first not obviously corrupt trailer dictionary, and the solution implemented here is to simply pre-parse all trailer dictionaries to see if there are any /Encrypt-entries.