*Please note:* The majority of this patch was written by Yury, and it's simply been rebased and slightly extended to prevent issues when dealing with `RenderingCancelledException`.
By leveraging streams, this (finally) provides a simple way in which parsing can be aborted on the worker-thread, which will ultimately help save resources.
With this patch worker-thread parsing will *only* be aborted when the document is destroyed, and not when rendering is cancelled. There are a couple of reasons for this:
- The API currently expects the *entire* OperatorList to be extracted, or an Error to occur, once it's been started. Hence additional re-factoring/re-writing of the API code will be necessary to properly support cancelling and re-starting of OperatorList parsing in cases where the `lastChunk` hasn't yet been seen.
- Even with the above addressed, immediately cancelling when encountering a `RenderingCancelledException` would lead to worse performance in e.g. the default viewer. When the document is zoomed and/or rotated, it's very likely that `cancel` will be (almost) immediately followed by a new `render` call. In that case you'd obviously *not* want to abort parsing on the worker-thread, since you'd risk throwing away a partially parsed Page and thus be forced to re-parse it, which would regress perceived performance.
- This patch is already *somewhat* risky, given that it touches fundamentally important/critical code, and trying to keep it somewhat small should hopefully reduce the risk of regressions (and simplify reviewing as well).
Time permitting, once this has landed and been in Nightly for a while, I'll try to work on the remaining points outlined above.
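To illustrate the general shape of the streams approach, here's a minimal sketch using the standard `ReadableStream` API; `parseNextChunk` is a hypothetical helper, not pdf.js' actual code:
```
const operatorListStream = new ReadableStream({
  pull(controller) {
    // Parse the next chunk of the OperatorList and hand it to the consumer.
    const chunk = parseNextChunk(); // Hypothetical worker-thread parsing step.
    controller.enqueue(chunk);
    if (chunk.lastChunk) {
      controller.close();
    }
  },
  cancel(reason) {
    // Invoked when the consumer cancels; this is what finally allows parsing
    // to be aborted (with this patch, only when the document is destroyed).
  },
});
```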
Co-Authored-By: Yury Delendik <ydelendik@mozilla.com>
Co-Authored-By: Jonas Jenwald <jonas.jenwald@gmail.com>
Furthermore, it's possible to re-use the same Array for all `textDiv`s on the page and the resulting padding string also becomes a lot more compact.
Please note that the `paddingLeft` branch was moved, since the padding values need to be ordered as `top, right, bottom, left`.
Finally, with this re-factoring it's no longer necessary to cache the original `style` string for every `textDiv` when `enhanceTextSelection` is enabled.
Given that browsers will reject padding values smaller than zero (which may be caused by limited numerical precision during calculations in the `expand` code), it makes no sense to include those when expanding the `textDiv`s.
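A rough sketch of the resulting pattern (the names here are illustrative, not the actual `text_layer.js` code):
```
// Re-used for all textDivs on the page; CSS shorthand order matters:
// top, right, bottom, left.
const scratchPadding = [0, 0, 0, 0];

function applyPadding(textDiv, top, right, bottom, left) {
  scratchPadding[0] = top;
  scratchPadding[1] = right;
  scratchPadding[2] = bottom;
  scratchPadding[3] = left;
  // Values <= 0 would be rejected by the browser anyway, so emit a bare `0`
  // for those, which also keeps the padding string compact.
  textDiv.style.padding = scratchPadding
    .map(p => (p > 0 ? `${p}px` : 0))
    .join(" ");
}
```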
Given that the different types of `Stream`s will never be cached, it follows that the `XRef.cache` Array will *always* be more-or-less sparse.
Generally speaking, the longer the document the more sparse the `XRef.cache` will become. For example, looking at the `pdf.pdf` file from the test-suite: the length of the `XRef.cache` Array will be a few hundred thousand elements, with approximately 95% of them being empty.
Hence it seems pretty clear that an Array isn't really the best data-structure for this kind of cache, and this patch thus changes it to a Map instead.
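A minimal sketch of the change (`isStream` is pdf.js' existing type check, while `_fetchUncached` is a hypothetical helper standing in for the real fetching logic):
```
class XRef {
  constructor() {
    // Before: `this.cache = [];` -- a sparse Array indexed by object number.
    this._cacheMap = new Map(); // Only ever holds entries that actually exist.
  }

  fetch(ref) {
    const cached = this._cacheMap.get(ref.num);
    if (cached !== undefined) {
      return cached;
    }
    const obj = this._fetchUncached(ref);
    if (!isStream(obj)) {
      // Streams are never cached, which is what made the Array so sparse.
      this._cacheMap.set(ref.num, obj);
    }
    return obj;
  }
}
```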
This patch-series was tested using the PDF file from issue 2618, i.e. http://bugzilla-attachments.gnome.org/attachment.cgi?id=226471, with the following manifest file:
```
[
    { "id": "issue2618",
      "file": "../web/pdfs/issue2618.pdf",
      "md5": "",
      "rounds": 200,
      "type": "eq"
    }
]
```
which gave the following results when comparing this patch-series against the `master` branch:
```
-- Grouped By browser, stat --
browser | stat | Count | Baseline(ms) | Current(ms) | +/- | % | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | --- | ----- | -------------
Firefox | Overall | 200 | 2736 | 2736 | 1 | 0.02 |
Firefox | Page Request | 200 | 2 | 2 | 0 | -8.26 | faster
Firefox | Rendering | 200 | 2733 | 2734 | 1 | 0.03 |
```
The relevant methods are usually not hot enough for these changes to have an easily measurable effect, however there have been a lot of other cases where similar inlining has helped performance. (And these changes may help offset the changes made in the next patch.)
For very large and complex PDF files this will help performance *slightly*, since `Parser.getObj` is called *a lot* during parsing in the worker.
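The gist of this kind of micro-optimization, as a representative sketch (the actual call-sites differ, and the same pattern recurs in the `EvaluatorPreprocessor.read` and `Parser.shift` patches below):
```
// Before: a helper-function call on a very hot path.
if (isCmd(this.buf1, "ID")) {
  /* ... */
}
// After: the check is inlined at the call-site, avoiding the call overhead.
if (this.buf1 instanceof Cmd && this.buf1.cmd === "ID") {
  /* ... */
}
```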
This patch was tested using the PDF file from issue 2618, i.e. http://bugzilla-attachments.gnome.org/attachment.cgi?id=226471, with the following manifest file:
```
[
    { "id": "issue2618",
      "file": "../web/pdfs/issue2618.pdf",
      "md5": "",
      "rounds": 200,
      "type": "eq"
    }
]
```
which gave the following results when comparing this patch against the `master` branch:
```
-- Grouped By browser, stat --
browser | stat | Count | Baseline(ms) | Current(ms) | +/- | % | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | --- | ----- | -------------
Firefox | Overall | 200 | 2847 | 2830 | -17 | -0.60 | faster
Firefox | Page Request | 200 | 2 | 2 | 0 | -7.14 |
Firefox | Rendering | 200 | 2844 | 2827 | -17 | -0.60 | faster
```
Looking at this again, it struck me that the added functionality in `Util.intersect` is probably more confusing than helpful in general; sorry about the churn in this code!
Based on the parameter name you'd probably expect it to only match when the intersection is `[0, 0, 0, 0]`, and not when just one component is zero, hence the `skipEmpty` parameter feels too tightly coupled to the `Page.view` getter.
This is based on a real-world PDF file I encountered very recently[1], although I'm currently unable to recall where I saw it.
Note that different PDF viewers handle this sort of error differently, with Adobe Reader outright failing to render the attached PDF file whereas PDFium mostly handles it "correctly".
The patch makes the following notable changes:
- Refactor the `cropBox` and `mediaBox` getters, on the `Page`, to reduce unnecessary duplication. (This will also help in the future, if support for extracting additional page bounding boxes is added to the API.)
- Ensure that the page bounding boxes, i.e. `cropBox` and `mediaBox`, are never empty to prevent issues/weirdness in the viewer.
- Ensure that the `view` getter on the `Page` will never return an empty intersection of the `cropBox` and `mediaBox`.
- Add an *optional* parameter to `Util.intersect`, to allow checking that the computed intersection isn't actually empty.
- Change `Util.intersect` to have consistent return types, since Arrays are of type `Object` and falling back to returning a `Boolean` thus seems strange (see the sketch below).
---
[1] In that case I believe that only the `cropBox` was empty, but it seemed like a good idea to attempt to fix a bunch of related cases all at once.
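A rough sketch of the consistent-return idea from the last point (simplified; the actual `Util.intersect` normalizes the rectangles differently, and per the follow-up above the extra `skipEmpty` parameter was later dropped):
```
// Sketch only: returns the intersection of two [x1, y1, x2, y2] rectangles,
// or `null` (rather than the old Boolean `false`) when they don't overlap.
function intersect(rect1, rect2) {
  const xLow = Math.max(Math.min(rect1[0], rect1[2]), Math.min(rect2[0], rect2[2]));
  const xHigh = Math.min(Math.max(rect1[0], rect1[2]), Math.max(rect2[0], rect2[2]));
  if (xLow > xHigh) {
    return null;
  }
  const yLow = Math.max(Math.min(rect1[1], rect1[3]), Math.min(rect2[1], rect2[3]));
  const yHigh = Math.min(Math.max(rect1[1], rect1[3]), Math.max(rect2[1], rect2[3]));
  if (yLow > yHigh) {
    return null;
  }
  return [xLow, yLow, xHigh, yHigh];
}
```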
The current code will only consider the `cropBox` and `mediaBox` as equal when they both point to the *same* underlying Array. In the case where a PDF file actually specifies both boxes independently, with the exact same values in each, the comparison will currently fail and lead to an unneeded intersection computation.
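Comparing the box *values* instead avoids that; a sketch (pdf.js has a small `isArrayEqual` helper along these lines, and the `view` usage below is illustrative):
```
function isArrayEqual(arr1, arr2) {
  if (arr1.length !== arr2.length) {
    return false;
  }
  return arr1.every((element, index) => element === arr2[index]);
}

// The intersection now only needs to be computed when the boxes actually
// contain different values, not merely different Array instances.
const view = isArrayEqual(cropBox, mediaBox)
  ? mediaBox
  : Util.intersect(cropBox, mediaBox) || mediaBox;
```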
With the changes to the `StreamType`/`FontType` "enums" in PR 11029, one unfortunate result is that `getStats` now *always* returns empty Arrays. Something that everyone, myself included, apparently missed is that you obviously cannot index an Array with Strings :-)
I wrongly assumed that the unit-tests would catch any bugs, but they apparently suffered from the same issue as the code in `src/core/`.
Another possible option would be to use `Set`s, rather than objects, but that would require larger changes since `LoopbackPort` (in `src/display/api.js`) doesn't support them.
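To spell the bug out (sketch):
```
const StreamType = { FLATE: "FLATE" /* , ... */ };

const streamTypes = [];
streamTypes[StreamType.FLATE] = true; // Sets a plain *property* named "FLATE"...
console.log(streamTypes.length);      // ...so this logs 0; the Array stays empty.

// Using an object instead works as intended:
const streamTypesObj = Object.create(null);
streamTypesObj[StreamType.FLATE] = true;
```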
Firefox telemetry now supports string labels. Convert the integers that we
used for categories to just use strings.
The upstream work will happen in:
https://bugzilla.mozilla.org/show_bug.cgi?id=1566882
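For reference, a sketch of what recording a categorical label looks like on the Firefox side (the chrome-only `Services.telemetry` API; the probe name here is illustrative):
```
// Previously an integer category was recorded; with categorical histograms
// the label is simply the string itself.
Services.telemetry.getHistogramById("PDF_VIEWER_STREAM_TYPES").add("FLATE");
```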
There are a number of spots in the current code, and tests, where `cancel` methods are not called with appropriate arguments (leading to Promises not being rejected with Errors as intended).
In some cases the cancel `reason` is implicitly set to `undefined`, and in others the cancel `reason` is just a plain String. To address this inconsistency, the patch changes things such that cancelling is done with `AbortException`s everywhere instead.
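The resulting pattern, sketched (`AbortException` is pdf.js' existing exception type in `src/shared/util.js`; the messages are illustrative):
```
import { AbortException } from "../shared/util.js";

// Instead of `reader.cancel()`, or `reader.cancel("abort")`:
reader.cancel(new AbortException("Worker was terminated."));

// ...downstream code can then rely on the rejection reason being an Error:
readPromise.catch(reason => {
  if (reason instanceof AbortException) {
    return; // Deliberate, i.e. expected, cancellation.
  }
  throw reason;
});
```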
Add a work-around, in `glyphlist.js`, for bad PDF generators which use a non-standard `/f_f` string in the `Encoding` dictionary when referring to the ff ligature (issue 11016)
This patch will not incur any (measurable) overhead, since the glyphlist is already quite long and one more entry won't really matter, which is important given that this sort of PDF corruption ought to be very rare.
Furthermore, this patch purposely does *not* add a bunch of similarly modified ligature names on pure speculation. Any similar additions, for other ligatures, should only be made if there's real-world examples of PDF files where that's actually necessary.
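The work-around itself amounts to a single extra glyph-name mapping, along these lines (sketch; the actual list in `src/core/glyphlist.js` is structured differently):
```
// Both names now map to U+FB00 (LATIN SMALL LIGATURE FF):
glyphs["ff"] = 0xfb00;  // The standard glyph name.
glyphs["f_f"] = 0xfb00; // Non-standard alias used by bad PDF generators.
```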
For very large and complex PDF files this will help performance slightly, since `EvaluatorPreprocessor.read` is called a lot during parsing in the worker.
This patch was tested using the PDF file from issue 2618, i.e. http://bugzilla-attachments.gnome.org/attachment.cgi?id=226471, using the following manifest file:
```
[
    { "id": "issue2618",
      "file": "../web/pdfs/issue2618.pdf",
      "md5": "",
      "rounds": 200,
      "type": "eq"
    }
]
```
This gave the following results when comparing this patch against the `master` branch:
```
-- Grouped By browser, stat --
browser | stat | Count | Baseline(ms) | Current(ms) | +/- | % | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | --- | ----- | -------------
Firefox | Overall | 200 | 3402 | 3358 | -43 | -1.28 | faster
Firefox | Page Request | 200 | 1 | 1 | 0 | 26.71 |
Firefox | Rendering | 200 | 3401 | 3357 | -44 | -1.28 | faster
```
For very large and complex PDF files this will help performance slightly, since `Parser.shift` is called *a lot* during parsing.
This patch was tested using the PDF file from issue 2618, i.e. http://bugzilla-attachments.gnome.org/attachment.cgi?id=226471 (with well over *four million* `Parser.shift` calls for just the one page), using the following manifest file:
```
[
    { "id": "issue2618",
      "file": "../web/pdfs/issue2618.pdf",
      "md5": "",
      "rounds": 100,
      "type": "eq"
    }
]
```
This gave the following results when comparing this patch against the `master` branch:
```
-- Grouped By browser, stat --
browser | stat | Count | Baseline(ms) | Current(ms) | +/- | % | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | --- | ----- | -------------
Firefox | Overall | 100 | 3386 | 3322 | -65 | -1.92 | faster
Firefox | Page Request | 100 | 1 | 1 | 0 | -8.08 |
Firefox | Rendering | 100 | 3385 | 3321 | -65 | -1.92 | faster
```
The way that this method handles documents without an `ID` entry in the Trailer dictionary feels overly complicated to me. Hence this patch adds `getByteRange` methods to the various Stream implementations[1] (sketched below), and utilizes those rather than manually calling `ensureRange` when computing a fallback `fingerprint`.
---
[1] Note that `PDFDocument` is only ever initialized with either a `Stream` or a `ChunkedStream`, hence why the `DecodeStream.getByteRange` method isn't implemented.
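For the basic `Stream` the new method is essentially just a clamped `subarray`; a sketch (the `ChunkedStream` version additionally needs to ensure that the requested range has actually been downloaded first):
```
class Stream {
  // ...
  getByteRange(begin, end) {
    if (begin < 0) {
      begin = 0;
    }
    if (end > this.end) {
      end = this.end;
    }
    return this.bytes.subarray(begin, end);
  }
}
```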
The `finalize` helper function has only a *single* call-site, and it's just a one-liner too. Furthermore, it's only ever called with a `Promise` as its argument, meaning that it's unnecessarily convoluted as well (i.e. the `Promise.resolve()` part shouldn't be necessary).
Hence this code can be both simplified *and* inlined at its only call-site instead.
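Sketched (the "before" shape of the helper is hypothetical, reconstructed from the description above):
```
// Before:
function finalize(promise) {
  return Promise.resolve(promise).then(null, function () {});
}
finalize(capability.promise);

// After, inlined at the single call-site; the argument is already a Promise,
// so the `Promise.resolve()` wrapping was never necessary:
capability.promise.then(null, function () {});
```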
Currently `wrapReason` is manually called at *every* `resolveOrReject` call-site, despite it being completely unnecessary unless there's an actual error being handled. This is obviously inefficient, and it's easy enough to avoid by having `resolveOrReject` handle this only when actually needed.
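A sketch of the idea (the actual code in `src/shared/message_handler.js` differs slightly):
```
// Before: every call-site eagerly wrapped the reason, even on success:
//   resolveOrReject(capability, data.success, wrapReason(data.reason));

function resolveOrReject(capability, data) {
  if (data.success) {
    capability.resolve();
  } else {
    // The wrapping now only happens when there's an actual error to handle.
    capability.reject(wrapReason(data.reason));
  }
}
```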