In addition to the existing /Root and /Pages validation, also check that the /Pages-entry actually is a dictionary and that it has a valid /Count-entry.
This way we can avoid picking a trailer candidate which e.g. the `Catalog.numPages` getter will just end up rejecting, thus breaking PDF document loading completely.
* in some pdf, there are actions with "event.source.hidden = ..."
* in order to handle visibility when printing, annotationStorage is extended to store multiple properties (value, hidden, editable, ...)
* When no actions then set it to null instead of empty object
* Even if a field has no actions, it needs to listen to events from the sandbox in order to be updated if an action changes something in it.
The `PartialEvaluator.hasBlendModes` method is necessary to determine if there's any blend modes on a page, which unfortunately requires *synchronous* parsing of the /Resources of each page before its rendering can start (see the "StartRenderPage"-message).
In practice it's not uncommon for certain /Resources-entries to be found on more than one page (referenced via the XRef-table), which thus leads to unnecessary re-fetching/re-parsing of data in `PartialEvaluator.hasBlendModes`.
To improve performance, especially in pathological cases, we can cache /Resources-entries when it's absolutely clear that they do not contain *any* blend modes at all[1]. This way, subsequent `PartialEvaluator.hasBlendModes` calls can be made significantly more efficient.
This patch was tested using the PDF file from issue 6961, i.e. https://github.com/mozilla/pdf.js/files/121712/test.pdf:
```
[
{ "id": "issue6961",
"file": "../web/pdfs/issue6961.pdf",
"md5": "a80e4357a8fda758d96c2c76f2980b03",
"rounds": 100,
"type": "eq"
}
]
```
which gave the following results when comparing this patch against the `master` branch:
```
-- Grouped By browser, page, stat --
browser | page | stat | Count | Baseline(ms) | Current(ms) | +/- | % | Result(P<.05)
------- | ---- | ------------ | ----- | ------------ | ----------- | ---- | ------ | -------------
firefox | 0 | Overall | 100 | 1034 | 555 | -480 | -46.39 | faster
firefox | 0 | Page Request | 100 | 489 | 7 | -482 | -98.67 | faster
firefox | 0 | Rendering | 100 | 545 | 548 | 2 | 0.45 |
firefox | 1 | Overall | 100 | 912 | 428 | -484 | -53.06 | faster
firefox | 1 | Page Request | 100 | 487 | 1 | -486 | -99.77 | faster
firefox | 1 | Rendering | 100 | 425 | 427 | 2 | 0.51 |
```
---
[1] In the case where blend modes *are* found, it becomes a lot more difficult to know if it's generally safe to skip /Resources-entries. Hence we don't cache anything in that case, however note that most document/pages do not utilize blend modes anyway.
* it's faster to generate the color code in using a table for components
* it's very likely a way faster to parse (when setting the color in the canvas)
Given that the `Map`-pattern apparently has undesirable performance characteristics, change this getter back to using an Object instead and check its size before returning it.
Rather than returning an *empty* Object[1] we should be returning `null` instead, since that's consistent with existing API-functionality.
To avoid having to *manually* track if the Object is empty, this patch also introduces a small helper function to check its size.
As can be seen in `src/core/fonts.js`, this method only accepts *one* parameter, hence it's somewhat difficult to understand what the Annotation-code is actually attempting to do here.
The only possible explanation that I can imagine, is that the intention was initially to call `Font.charToGlyph` *directly* instead. However, note that that'd would not actually have been correct, since that'd ignore one level of font-caching (see `this.charsCache`). Hence the unused arguments are removed, in `src/core/annotation.js`, and the `Font.charToGlyph` method is now marked as "private" as intended.
This patch removes unnecessary escape-sequence in (mostly) strings, as a first step, since the ones in regular expressions probably requires more careful testing (just in case).
The only exception is a regular expression in `src/core/annotation.js`, since we should have both unit- and reference-tests for this code *and* given [this information on MDN](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Character_Classes#Types):
> Inside a character set, the dot loses its special meaning and matches a literal dot.
Please find additional details about the ESLint rule at https://eslint.org/docs/rules/no-useless-escape
This provides a work-around to avoid having to conditionally try to initialize the `openAction`-object in multiple places.
Given that `Object.fromEntries` doesn't seem to *guarantee* that a `null` prototype is used, we thus hack around that by using `Object.assign` with `Object.create(null)`.
Different fonts incorrectly end up with *identical* hashes, despite having different /ToUnicode data.
The issue, and it's very interesting that we've apparently not seen it before, appears to be caused by the fact that different /ToUnicode entries share the *same* underlying `ArrayBuffer`, which thus becomes problematic at the `const dataUint32 = new Uint32Array(data.buffer, 0, blockCounts);` line. The simplest solution thus seem to be to just *copy* the input, when it's an `ArrayBuffer`, rather than using it as-is. (Note that if we'd stringified the input, when calling `MurmurHash3_64.update`, the issue would also have been fixed. In this case, we're already creating an unique TypedArray.)
*Please note:* Once https://bugzilla.mozilla.org/show_bug.cgi?id=1247687 is implemented, and we've removed SystemJS completely, this entire patch can (and even should) be reverted.
This is similar to the existing `getLookupTableFactory` helper function, but is implemented as outlined in issue 6774.
The re-formatting of the tables were done automatically, by using find-and-replace with regular expressions.
For reasons that I don't even pretend to understand, using this particular structure for these *very* long lookup tables allow SystemJS to process the files correctly/quickly and the development viewer thus works as intended.
While the *built* `pdf.worker.js` file still works correctly with these changes, despite these two files being excluded by Babel[1], the development viewer does not because of issues with SystemJS[2] and/or its Babel-plugin (both of which are old).
Furthermore, note also that excluding these two files from Babel-processing isn't *generally* necessary since e.g. the `gulp mozcentral` command works anyway. The explanation is rather that it's actually the source-map generation which fails for these huge sequences when building the `pdf.worker.js` file.
However, not using standard `import`/`export` statements in all files means we also need to use SystemJS when e.e. running the unit-tests. This is very unfortunate, since SystemJS (or its old Babel-version) doesn't support modern ECMAScript features such as e.g. optional chaining and nullish coalescing.
Unfortunately it also seems that https://bugzilla.mozilla.org/show_bug.cgi?id=1247687, which tracks the implementation of worker-modules in Firefox, has stalled since there hasn't been any updates for six months now.
To hopefully address all of the above, this patch is the first in a series that attempts to further reduce our reliance on SystemJS.
---
[1] The only difference being how the dependencies are handled, in the Webpack-bundled file.
[2] Parsing takes way too long and consumes too much memory, thus rendering the development viewer essentially unusable.
- Handle the arguments correctly in `PartialEvaluator.handleColorN`.
For TilingPatterns with a base-ColorSpace, we're currently using the `args` when computing the color. However, as can be seen we're passing the Array as-is to the `ColorSpace.getRgb` method, which means that the `Name` is included as well.[1]
Thankfully this hasn't, as far as I know, caused any actual bugs, but that may be more luck than anything else given how the `ColorSpace` code is implemented. This can be easily fixed though, simply by popping the `Name`-object off of the `args` Array.
- Cache TilingPatterns using the `Name`-string, rather than the object directly.
This is not only consistent with other caches in `PartialEvaluator`, but importantly it also ensures that the cache lookup always works correctly. Note that since `Name`-objects, similar to other primitives, uses a cache themselves a *manually* triggered `cleanup`-call could thus (theoretically) cause the `LocalTilingPatternCache` to not find an existing entry. While the likelihood of this happening is *extremely* small, it's still something that we should fix.
---
[1] The `args` Array can e.g. look like this: `[0.043, 0.09, 0.188, 0.004, /P1]`, which means that we're passing in the `Name`-object to the `ColorSpace` method.
For an invalid Annotation, there's one code-path where `undefined` is returned from `AnnotationFactory._create`. That'd currently, incorrectly, trigger an error during the `PDFDocument._collectFieldObjects` parsing which thus seem good to avoid.
Along these lines, the filtering in `PDFDocument.fieldObjects` is also updated to handle both `null` and `undefined` the same way.
Some pdf softwares don't remove highlight annotations but make the QuadPoints array empty.
And the Rect for the annotation can be [-32768, -32768, 32768, 32768] so it leads to have a giant div which catches all the mouse events and make the pdf unusable when there are some forms elements.
Given that *all* fonts are, ever since PR 7347, now cached in the "normal" `fontCache` there's actually no reason for the special `font.translated` construction. (Given how Objects in JavaScript are references, rather than raw values, the old code shouldn't have caused any significant memory overhead.)
Instead we can simply store the `cacheKey`, which is a simple string, on only the Font `Dict`s where it's needed and thus look-up all fonts using the `fontCache` instead.
If this method is ever passed invalid/unexpected data, or if during the course of parsing (since it's used recursively) such data is found, it will fail in a non-graceful way.
Hence this patch, which ensures that we don't attempt to access non-existent properties and also that errors such as the one fixed in PR 12479 wouldn't have occured.
*This patch is based on a couple of smaller things that I noticed when working on PR 12479.*
- Don't store the /Fields on the `formInfo` getter, since that feels like overloading it with unintended (and too complex) data, and utilize a `hasFields` boolean instead.
This functionality was originally added in PR 12271, to help determine what kind of form data a PDF document contains, and I think that we should ensure that the return value of `formInfo` only consists of "simple" data.
With these changes the `fieldObjects` getter instead has to look-up the /Fields manually, however that shouldn't be a problem since the access is guarded by a `formInfo.hasFields` check which ensures that the data both exists and is valid. Furthermore, most documents doesn't even have any /AcroForm data anyway.
- Determine the `hasFields` property *first*, to ensure that it's always correct even if there's errors when checking e.g. the /XFA or /SigFlags entires, since the `fieldObjects` getter depends on it.
- Simplify a loop in `fieldObjects`, since the object being accessed is a `Map` and those have built-in iteration support.
- Use a higher logging level for errors in the `formInfo` getter, and include the actual error message, since that'd have helped with fixing PR 12479 a lot quicker.
- Update the JSDoc comment in `src/display/api.js` to list the return values correctly, and also slightly extend/improve the description.
This allows us to make a slight simplification in `PartialEvaluator.loadFont`, which thus removes an old TODO-comment from the method.
Furthermore, in `PartialEvaluator.translateFont`, the CMap-handling is now limited to only *composite* fonts to avoid having to wait for a "dummy"-Promise for most fonts.
- Actually register/unregister the `WorkerTask`s, used when saving each page, correctly.
To prevent issues when terminating the Worker, we purposely wait for all running `WorkerTask`s to complete first. Hence we need to actually handle `WorkerTask`s the same way in "SaveDocument" as in the rest of this file, see e.g. "GetOperatorList" and "GetTextContent".
- Access `PDFDocument` properties in a generally safe/consistent way.
While the current code works fine, given how the PDF document is being loaded, it still seems like a very good idea to be *consistent* in how we access these kind of properties (since in general you need to avoid `MissingDataException` everywhere in this file).
- Change a variable name, since there's essentially no precedent in the code-base for *local* variable names to start with an underscore.
Support for the `scope` parameter, in `MessageHandler.on`, was removed in PR 11110 however this particular case was unused/unnecessary for years prior to that change. (From a quick look through the history, I'm not even sure if it was actually needed in the first place.)
In practice it's not uncommon for PDF documents to re-use the same TilingPatterns more than once, and parsing them is essentially equal to parsing of a (small) page since a `getOperatorList` call is required.
By caching the internal TilingPattern representation we can thus avoid having to re-parse the same data over and over, and there's also *less* asynchronous parsing required for repeated TilingPatterns.
Initially I had intended to include (standard) benchmark results with this patch, however it's not entirely clear that this is actually necessary here given the preliminary results.
When testing this manually in the development viewer, using `pdfBug=Stats`, the following (approximate) reduction in rendering times were observed when comparing `master` against this patch:
- http://pubs.usgs.gov/sim/3067/pdf/sim3067sheet-2.pdf (from issue 2765): `6800 ms` -> `4100 ms`.
- https://github.com/mozilla/pdf.js/files/1046131/stepped.pdf (from issue 8473): `54000 ms` -> `13000 ms`
- https://github.com/mozilla/pdf.js/files/1046130/proof.pdf (from issue 8473): `5900 ms` -> `2500 ms`
As always, whenever you're dealing with documents which are "slow", there's usually a certain level of subjectivity involved with regards to what's deemed acceptable performance.
Hence it's not clear to me that we want to regard any of the referenced issues as fixed, however the improvements are significant enough to warrant caching of TilingPatterns in my opinion.