Looking at this again, it struck me that added functionality in `Util.intersect` is probably more confusing than helpful in general; sorry about the churn in this code!
Based on the parameter name you'd probably expect it to only match when the intersection is `[0, 0, 0, 0]` and not when only one component is zero, hence the `skipEmpty` parameter thus feels too tightly coupled to the `Page.view` getter.
This is based on a real-world PDF file I encountered very recently[1], although I'm currently unable to recall where I saw it.
Note that different PDF viewers handle these sort of errors differently, with Adobe Reader outright failing to render the attached PDF file whereas PDFium mostly handles it "correctly".
The patch makes the following notable changes:
- Refactor the `cropBox` and `mediaBox` getters, on the `Page`, to reduce unnecessary duplication. (This will also help in the future, if support for extracting additional page bounding boxes are added to the API.)
- Ensure that the page bounding boxes, i.e. `cropBox` and `mediaBox`, are never empty to prevent issues/weirdness in the viewer.
- Ensure that the `view` getter on the `Page` will never return an empty intersection of the `cropBox` and `mediaBox`.
- Add an *optional* parameter to `Util.intersect`, to allow checking that the computed intersection isn't actually empty.
- Change `Util.intersect` to have consistent return types, since Arrays are of type `Object` and falling back to returning a `Boolean` thus seem strange.
---
[1] In that case I believe that only the `cropBox` was empty, but it seemed like a good idea to attempt to fix a bunch of related cases all at once.
Firefox telemetry supports using string labels now. Convert our integers
that we used for categories to just use strings.
The upstream work will happen in:
https://bugzilla.mozilla.org/show_bug.cgi?id=1566882
The `finalize` helper function has only a *single* call-site, and furthermore it's just a one-liner too. Furthermore it's only ever called with a `Promise` as its argument, meaning that it's unnecessarily convoluted as well (i.e. the `Promise.resolve()` part shouldn't be necessary).
Hence this code can be both simplified *and* inlined at its only call-site instead.
Currently `wrapReason` is manually called at *every* `resolveOrReject` call-site, despite it being completely unnecessary unless there's an actual error being handled. This is obviously inefficient, and it's easy enough to avoid by having `resolveOrReject` handle this only when actually needed.
For Type3 fonts text-selection is often not that great, and there's a couple of heuristics used to try and improve things. This patch simple extends those heuristics a bit, and fixes a pre-existing "naive" array comparison, but this all feels a bit brittle to say the least.
The existing Type3 test-coverage isn't that great in general, and in particular Type3 `text` tests are few and far between, hence why this patch adds *two* different new `text` tests.
Given that the function is (purposely) independent of the verbosity level and that its message is worded to only apply on the main-thread, there's no reason to duplicate this across the built `pdf.js`/`pdf.worker.js` files.
The `src/shared/util.js` file is being bundled into both the `pdf.js` and `pdf.worker.js` files, meaning that its code is by definition duplicated.
Some main-thread only utility functions have already been moved to a separate `src/display/display_utils.js` file, and this patch simply extends that concept to utility functions which are used *only* on the worker-thread.
Note in particular the `getInheritableProperty` function, which expects a `Dict` as input and thus *cannot* possibly ever be used on the main-thread.
After PR 9340 all glyphs are now re-mapped to a Private Use Area (PUA) which means that if a font fails to load, for whatever reason[1], all glyphs in the font will now render as Unicode glyph outlines.
This obviously doesn't look good, to say the least, and might be seen as a "regression" since previously many glyphs were left in their original positions which provided a slightly better fallback[2].
Hence this patch, which implements a *general* fallback to the PDF.js built-in font renderer for fonts that fail to load (i.e. are rejected by the sanitizer). One caveat here is that this only works for the Font Loading API, since it's easy to handle errors in that case[3].
The solution implemented in this patch does *not* in any way delay the loading of valid fonts, which was the problem with my previous attempt at a solution, and will only require a bit of extra work/waiting for those fonts that actually fail to load.
*Please note:* This patch doesn't fix any of the underlying PDF.js font conversion bugs that's responsible for creating corrupt font files, however it does *improve* rendering in a number of cases; refer to this possibly incomplete list:
[Bug 1524888](https://bugzilla.mozilla.org/show_bug.cgi?id=1524888)
Issue 10175
Issue 10232
---
[1] Usually because the PDF.js font conversion code wasn't able to parse the font file correctly.
[2] Glyphs fell back to some default font, which while not accurate was more useful than the current state.
[3] Furthermore I'm not sure how to implement this generally, assuming that's even possible, and don't really have time/interest to look into it either.
Unsurprisingly IE11 doesn't support this, so a polyfill is needed since otherwise the sidebar can no longer be opened.
Also, simplifies the existing `classList.toggle` polyfill.
This polyfill is currently used in only *one* file, i.e. `src/display/api.js`, and only when trying to build a *fallback* `workerSrc` path.
Given that the global `workerSrc` should *always* be set[1] when using the PDF.js library[2], and that the fallback `workerSrc` should only be regarded as a best-effort solution anyway, there isn't a particularily strong reason to keep the compatibility code in my opinion.
---
[1] Other supported options include setting the global `workerPort`, or passing in a `PDFWorker` instance as part of the `getDocument` call.
[2] Which is clearly mentioned in the JSDocs in `src/display/worker_options.js`.
Based on the discussion in https://bugzilla.mozilla.org/show_bug.cgi?id=1521413, this patch simply removes the `ReadableStream` polyfill completely from MOZCENTRAL builds.
With this patch, the size of the `gulp mozcentral` build target is thus further reduced (building on PR 10470):
| | `build/mozcentral`
|-------|-------------------
|master | 3 339 666
|patch | 3 209 572
With https://bugzilla.mozilla.org/show_bug.cgi?id=1505122 landing in Firefox 65, the native `ReadableStream` implementation is now enabled by default in Firefox.
Obviously it would be nice to simply stop bundling the polyfill in MOZCENTRAL builds altogether, however given that it's still possible to disable[1] `ReadableStream` this is probably not a good idea just yet.
Nonetheless, now that native support is available, it seems unnecessary (and wasteful) to keep bundling the polyfill twice[2] in MOZCENTRAL builds. Hence this patch, which contains a suggest approach for packing the polyfill in a *separate* file which is then *only* loaded if/when needed.
With this patch, the size of the `gulp mozcentral` build target is thus reduced accordingly:
| | `build/mozcentral`
|-------|-------------------
|master | 3 461 089
|patch | 3 340 268
Besides the PDF.js files taking up less space in Firefox this way, the additional benefit is that there's (by default) less code that needs to be loaded and parsed when the PDF Viewer is used which also cannot hurt.
---
[1] In `about:config`, by toggling the `javascript.options.streams` preference.
[2] Once in the `build/pdf.js` file, and once in the `build/pdf.worker.js` file.
In many cases in the code you don't actually care about the index itself, but rather just want to know if something exists in a String/Array or if a String starts in a particular way. With modern JavaScript functionality, it's thus possible to remove a number of existing `indexOf` cases.
For PDF documents with sufficiently broken XRef tables, it's usually quite obvious when you need to fallback to indexing the entire file. However, for certain kinds of corrupted PDF documents the XRef table will, for all intents and purposes, appear to be valid. It's not until you actually try to fetch various objects that things will start to break, which is the case in the referenced issues[1].
Since there's generally a real effort being in made PDF.js to load even corrupt PDF documents, this patch contains a suggested approach to attempt to do a bit more validation of the XRef table during the initial document loading phase.
Here the choice is made to attempt to load the *first* page, as a basic sanity check of the validity of the XRef table. Please note that attempting to load a more-or-less arbitrarily chosen object without any context of what it's supposed to be isn't a very useful, which is why this particular choice was made.
Obviously, just because the first page can be loaded successfully that doesn't guarantee that the *entire* XRef table is valid, however if even the first page fails to load you can be reasonably sure that the document is *not* valid[2].
Even though this patch won't cause any significant increase in the amount of parsing required during initial loading of the document[3], it will require loading of more data upfront which thus delays the initial `getDocument` call.
Whether or not this is a problem depends very much on what you actually measure, please consider the following examples:
```javascript
console.time('first');
getDocument(...).promise.then((pdfDocument) => {
console.timeEnd('first');
});
console.time('second');
getDocument(...).promise.then((pdfDocument) => {
pdfDocument.getPage(1).then((pdfPage) => { // Note: the API uses `pageNumber >= 1`, the Worker uses `pageIndex >= 0`.
console.timeEnd('second');
});
});
```
The first case is pretty much guaranteed to show a small regression, however the second case won't be affected at all since the Worker caches the result of `getPage` calls. Again, please remember that the second case is what matters for the standard PDF.js use-case which is why I'm hoping that this patch is deemed acceptable.
---
[1] In issue 7496, the problem is that the document is edited without the XRef table being correctly updated.
In issue 10326, the generator was sorting the XRef table according to the offsets rather than the objects.
[2] The idea of checking the first page in particular came from the "standard" use-case for the PDF.js library, i.e. the default viewer, where a failure to load the first page basically means that nothing will work; note how `{BaseViewer, PDFThumbnailViewer}.setDocument` depends completely on being able to fetch the *first* page.
[3] The only extra parsing is caused by, potentially, having to traverse *part* of the `Pages` tree to find the first page.
Note how nowhere in the code `canvasInRendering.get()` is ever called, and that this structure is really only used to store references to `<canvas>` DOM elements.
The reason for this being a `WeakMap` is probably because at the time we weren't using `core-js` polyfills yet, and since there already existed a manually implemented `WeakMap` polyfill it was probably simpler to use that.
There have been lots of problems with trying to map glyphs to their unicode
values. It's more reliable to just use the private use areas so the browser's
font renderer doesn't mess with the glyphs.
Using the private use area for all glyphs did highlight other issues that this
patch also had to fix:
* small private use area - Previously, only the BMP private use area was used
which can't map many glyphs. Now, the (much bigger) PUP 16 area can also be
used.
* glyph zero not shown - Browsers will not use the glyph from a font if it is
glyph id = 0. This issue was less prevalent when we mapped to unicode values
since the fallback font would be used. However, when using the private use
area, the glyph would not be drawn at all. This is illustrated in one of the
current test cases (issue #8234) where there's an "ä" glyph at position
zero. The PDF looked like it rendered correctly, but it was actually not
using the glyph from the font. To properly show the first glyph it is always
duplicated and appended to the glyphs and the maps are adjusted.
* supplementary characters - The private use area PUP 16 is 4 bytes, so
String.fromCodePoint must be used where we previously used
String.fromCharCode. This is actually an issue that should have been fixed
regardless of this patch.
* charset - Freetype fails to load fonts when the charset size doesn't match
number of glyphs in the font. We now write out a fake charset with the
correct length. This also brought up the issue that glyphs with seac/endchar
should only ever write a standard charset, but we now write a custom one.
To get around this the seac analysis is permanently enabled so those glyphs
are instead always drawn as two glyphs.
For proof-of-concept, this patch converts a couple of `Promise` returning methods to use `async` instead.
Please note that the `generic` build, based on this patch, has been successfully testing in IE11 (i.e. the viewer loads and nothing is obviously broken).
Being able to use modern JavaScript features like `async`/`await` is a huge plus, but there's one (obvious) side-effect: The size of the built files will increase slightly (unless `SKIP_BABEL == true`). That's unavoidable, but seems like a small price to pay in the grand scheme of things.
Finally, note that the `chromium` build target was changed to no longer skip Babel translation, since the Chrome extension still supports version `49` of the browser (where native `async` support isn't available).
Compared to all the other (static) methods in `Util`, the `toRoman` one looks slightly out of place. Even more so considering that `Util` is being exposed through `pdfjsLib`, where access to a Roman numerals conversion method doesn't make much sense.
This moves/exposes the `URL` polyfill similarily to the existing `ReadableStream` polyfill, rather than exposing it globally, to avoid interfering with any "outside" code.
Both the `URL` and `ReadableStream` polyfills are now exposed on the `pdfjsLib` object, such that they are accessible to the viewer components.
Furthermore, the `no-restricted-globals` ESLint rule is also enabled to prevent accidental usage of the native `URL`/`ReadableStream` implementations directly in the `src/` and `web/` folders; see also https://eslint.org/docs/rules/no-restricted-globals
Addresses the remaining TODO in https://github.com/mozilla/pdf.js/projects/6
Please note that the standalone `pdf.image_decoders.js` file will be including the complete `src/shared/util.js` file, despite only using parts of it.[1] This was done *purposely*, to not negatively impact the readability/maintainability of the core PDF.js code.
Furthermore, to ensure that the compatibility is the same in the regular PDF.js library *and* in the the standalone image decoders, `src/shared/compatibility.js` was included as well.
To (hopefully) prevent future complaints about the size of the built `pdf.image_decoders.js` file, a few existing async-related polyfills are being skipped (since all of the image decoders are completely synchronous).
Obviously this required adding a couple of pre-processor statements, but given that these are all limited to "compatibility" code, I think this might be OK!?
---
[1] However, please note that previous commits moved `PageViewport` and `MessageHandler` out of `src/shared/util.js` which reduced its size.
Not only is the `Util.loadScript` helper function unused on the Worker side, even trying to use it there would throw an Error (since `document` isn't defined/available in Workers).
Hence this helper function is moved, and its code modernized slightly by having it return a Promise rather than needing a callback function.
Finally, to reduced code duplication, the "new" loadScript function is exported and used in the viewer.
The `MessageHandler` itself, and its assorted helper functions, are currently the single largest[1] piece of code in the `src/shared/util.js` file. By moving this code into its own file, `src/shared/util.js` thus becomes smaller and more manageable.
This function combines the logic of two separate methods into one.
The loop limit is also a good thing to have for the calls in
`src/core/annotation.js`.
Moreover, since this is important functionality, a set of unit tests and
documentation is added.
Unfortunately, as far as I can tell, we still need the ability to adjust certain API options depending on the browser environment in PDF.js version `2.0`. However, we should be able to separate this from the general compatibility code in the `src/shared/compatibility.js` file.