Note how `XRef.fetchUncompressed`, which is used *a lot* for most PDF documents, is calling the `makeSubStream` method without providing a `length` argument.
In practice this results in the `makeSubStream` method, on the `ChunkedStream` instance, calling the `ensureRange` method with `NaN` as the end position, thus resulting in no data being requested despite it possibly being necessary.
This may be quite bad, since in this particular case it will lead to a new `ChunkedStream` being created *and* also a new `Parser`/`Lexer` instance. Given that it's quite possible that even the very first `Parser.getObj` call could throw `MissingDataException`, this could thus lead to wasted time/resources (since re-parsing is necessary once the data finally arrives).
You obviously need to be very careful to not have `ChunkedStream.makeSubStream` accidentally requesting the *entire* file, hence its `this.end` property is of no use here, but it should be possible to at least check that the `start` of the data is present before any potentially expensive parsing occurs.
Without this some fonts may incorrectly end up with matching `hash`es, thus breaking rendering since we'll not actually try to load/parse some of the fonts.
Note that `PartialEvaluator.preEvaluateFont` will return an empty string when no hash was computed. This will complete short-circuit the `fontAlias` comparison in `PartialEvaluator.loadFont`, since fonts which are totally different will then match if their `hash`es are empty.
This function is currently called with the `OperatorList` instance as its argument, hence I cannot think of any good reason for not just moving it into the `OperatorList` properly. (This will also help with other planned changes regarding the `ImageCache` functionality.)
By transfering `ArrayBuffer`s you can avoid having two copies of the same data, i.e. one copy on each of the worker/main-thread, for data that's used only *once* on the worker-thread.
Note how the code in [`PDFImage.createMask`](80135378ca/src/core/image.js (L284-L285)) goes to great lengths to actually enable tranfering of the image data. However in [`PartialEvaluator.buildPaintImageXObject`](80135378ca/src/core/evaluator.js (L336)) the `cached` property is always set to `true`, which disqualifies the image data from being transfered; see [`getTransfers`](80135378ca/src/core/operator_list.js (L552-L554)).
For most ImageMask data this patch won't matter, since images found in the `/Resources -> /XObject` dictionary will always be indexed by name. However for *inline* images which contains ImageMask data, where only "small" images are cached (in both `parser.js` and `evaluator.js`), the current code will result in some unnecessary memory usage.
For Type3 fonts text-selection is often not that great, and there's a couple of heuristics used to try and improve things. This patch simple extends those heuristics a bit, and fixes a pre-existing "naive" array comparison, but this all feels a bit brittle to say the least.
The existing Type3 test-coverage isn't that great in general, and in particular Type3 `text` tests are few and far between, hence why this patch adds *two* different new `text` tests.
Notable changes:
- Remove the `return this;` from the `MurmurHash3_64.update` method, since it's completely unused and doesn't make a lot of sense.
- Remove the loop(s) from the `MurmurHash3_64.hexdigest` method, since creating a temporary array and then looping over it is wasteful given how simple this can be written with modern JavaScript.
Currently for every single parsed/rendered page there's no less than *four* `Date.now()` calls being made on the worker-side. This seems totally unnecessary, since the result of these calls are, by default, not used for anything *unless* the verbosity level is set to `INFO`.
The `src/shared/util.js` file is being bundled into both the `pdf.js` and `pdf.worker.js` files, meaning that its code is by definition duplicated.
Some main-thread only utility functions have already been moved to a separate `src/display/display_utils.js` file, and this patch simply extends that concept to utility functions which are used *only* on the worker-thread.
Note in particular the `getInheritableProperty` function, which expects a `Dict` as input and thus *cannot* possibly ever be used on the main-thread.
After PR 9340 all glyphs are now re-mapped to a Private Use Area (PUA) which means that if a font fails to load, for whatever reason[1], all glyphs in the font will now render as Unicode glyph outlines.
This obviously doesn't look good, to say the least, and might be seen as a "regression" since previously many glyphs were left in their original positions which provided a slightly better fallback[2].
Hence this patch, which implements a *general* fallback to the PDF.js built-in font renderer for fonts that fail to load (i.e. are rejected by the sanitizer). One caveat here is that this only works for the Font Loading API, since it's easy to handle errors in that case[3].
The solution implemented in this patch does *not* in any way delay the loading of valid fonts, which was the problem with my previous attempt at a solution, and will only require a bit of extra work/waiting for those fonts that actually fail to load.
*Please note:* This patch doesn't fix any of the underlying PDF.js font conversion bugs that's responsible for creating corrupt font files, however it does *improve* rendering in a number of cases; refer to this possibly incomplete list:
[Bug 1524888](https://bugzilla.mozilla.org/show_bug.cgi?id=1524888)
Issue 10175
Issue 10232
---
[1] Usually because the PDF.js font conversion code wasn't able to parse the font file correctly.
[2] Glyphs fell back to some default font, which while not accurate was more useful than the current state.
[3] Furthermore I'm not sure how to implement this generally, assuming that's even possible, and don't really have time/interest to look into it either.
pdf.js had a problem when copying characters on supplementary planes
(0xPPXXXX where PP is nonzero). This is because certain methods of
PartialEvaluator use classic String.fromCharCode instead of ES6's
String.fromCodePoint.
Despite the fact that readToUnicode method *tried* to parse out-of-UCS2
code points by parsing UTF-16BE, it was inadequate because
String.fromCharCode only supports UCS-2 range of Unicode.
All objects in the PDF document follow this pattern:
```
0000000001 0 obj
<<
% Some content here...
>>
endobj
0000000002 0 obj
<<
% More content here...
endobj
```
In many cases in the code you don't actually care about the index itself, but rather just want to know if something exists in a String/Array or if a String starts in a particular way. With modern JavaScript functionality, it's thus possible to remove a number of existing `indexOf` cases.
The `toString` method always creates two string objects (for the 'R'
character and for the `num` concatenation) and in the worst case
creates three string objects (one more for the `gen` concatenation).
For the Tracemonkey paper alone, this resulted in 12000 string
objects when scrolling from the top to the bottom of the document.
Since this is a hot function, it's worth minimizing the number of string
objects, especially for large documents, to reduce peak memory usage.
This commit refactors the `toString` method to always create only one
string object.
For PDF documents with sufficiently broken XRef tables, it's usually quite obvious when you need to fallback to indexing the entire file. However, for certain kinds of corrupted PDF documents the XRef table will, for all intents and purposes, appear to be valid. It's not until you actually try to fetch various objects that things will start to break, which is the case in the referenced issues[1].
Since there's generally a real effort being in made PDF.js to load even corrupt PDF documents, this patch contains a suggested approach to attempt to do a bit more validation of the XRef table during the initial document loading phase.
Here the choice is made to attempt to load the *first* page, as a basic sanity check of the validity of the XRef table. Please note that attempting to load a more-or-less arbitrarily chosen object without any context of what it's supposed to be isn't a very useful, which is why this particular choice was made.
Obviously, just because the first page can be loaded successfully that doesn't guarantee that the *entire* XRef table is valid, however if even the first page fails to load you can be reasonably sure that the document is *not* valid[2].
Even though this patch won't cause any significant increase in the amount of parsing required during initial loading of the document[3], it will require loading of more data upfront which thus delays the initial `getDocument` call.
Whether or not this is a problem depends very much on what you actually measure, please consider the following examples:
```javascript
console.time('first');
getDocument(...).promise.then((pdfDocument) => {
console.timeEnd('first');
});
console.time('second');
getDocument(...).promise.then((pdfDocument) => {
pdfDocument.getPage(1).then((pdfPage) => { // Note: the API uses `pageNumber >= 1`, the Worker uses `pageIndex >= 0`.
console.timeEnd('second');
});
});
```
The first case is pretty much guaranteed to show a small regression, however the second case won't be affected at all since the Worker caches the result of `getPage` calls. Again, please remember that the second case is what matters for the standard PDF.js use-case which is why I'm hoping that this patch is deemed acceptable.
---
[1] In issue 7496, the problem is that the document is edited without the XRef table being correctly updated.
In issue 10326, the generator was sorting the XRef table according to the offsets rather than the objects.
[2] The idea of checking the first page in particular came from the "standard" use-case for the PDF.js library, i.e. the default viewer, where a failure to load the first page basically means that nothing will work; note how `{BaseViewer, PDFThumbnailViewer}.setDocument` depends completely on being able to fetch the *first* page.
[3] The only extra parsing is caused by, potentially, having to traverse *part* of the `Pages` tree to find the first page.
Note that the OpenAction dictionary may contain other information besides just a destination array, e.g. instructions for auto-printing[1].
Given first of all that an arbitrary `Dict` cannot be sent from the Worker (since cloning would fail), and second of all that the data obviously needs to be validated, this patch purposely only adds support for fetching a destination from the OpenAction entry[2].
---
[1] This information is, currently in PDF.js, being included through the `getJavaScript` API method.
[2] This significantly reduces the complexity of the implementation, which seems fine for now. If there's ever need for other kinds of OpenAction to be fetched, additional API methods could/should be implemented as necessary (could e.g. follow the `getOpenActionWhatever` naming scheme).