The old `update`-signature started to annoy me back when I added optional content support to the viewer, since we're (often) forced to pass in a bunch of arguments that we don't care about whenever these methods are called.
This is tagged `api-minor` since `PDFPageView` is being used in the `pageviewer` component example, and it's thus possible that these changes could affect some users; the next commit adds fallback handling for the old format.
In the referenced PDF document the /Contents stream contains MarkedContent-operators, however no optional content dictionary exists; according to [the specification](https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G7.3883825):
> Null values or references to deleted objects shall be ignored. If this entry is
not present, is an empty array, or contains references only to null or deleted
objects, the membership dictionary shall have no effect on the visibility of
any content.
In the PDF document some of the glyphs have bogus `differences`-entries[1] that cannot be resolved to valid glyph names, thus causing the glyph mapping to fail.
My initial idea was to use a similar approach as in the `PartialEvaluator._simpleFontToUnicode`-method, to extract the charCodes from those entries, however it turned out that that didn't actually help in this case (the mapping was still wrong).
To fix this I'm thus proposing that we fallback to the /ToUnicode map when no other useable data exists (e.g. no post-table), since it *hopefully* shouldn't make things any worse than leaving parts of the glyph map empty (which currently happens).
---
[1] As can be seem below, some of the entries are completely normal while others are non-standard:
```
Differences (array)
0 = 65
1 = /g5167
2 = /space
3 = /g11927
4 = /g17737
5 = /g11540
6 = /g2180
7 = /K
8 = /P
9 = /two
10 = /zero
11 = /one
12 = /five
13 = /four
14 = /g6932
15 = /g7246
16 = /g1691
17 = /g2343
18 = /g14792
19 = /g3325
20 = /g4280
21 = /g20383
22 = /g18166
23 = /g16988
24 = /g17943
25 = /g19223
26 = /g10830
27 = 97
28 = /g982
29 = /g1226
30 = /g5059
31 = /g2677
32 = /g1042
33 = /g11568
34 = /L
35 = /three
36 = /seven
37 = /g2364
38 = /g12063
39 = /g5356
40 = /g2173
41 = /g17877
42 = /g7273
43 = /g7647
44 = /g7224
45 = /g19327
46 = /g5054
47 = /g2342
48 = /g10136
49 = /g6856
50 = /g13381
51 = /g7257
52 = /g12093
53 = /g2359
```
- aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1720179
- in some pdfs the XFA array in AcroForm dictionary doesn't contain an entry for 'datasets' (which contains saved data), so basically this patch allows to overwrite the AcroForm dictionary with an updated XFA array when doing an incremental update.
- when some named nodes in the template don't have their counterpart in datasets we create some nodes: the main node mustn't belong to the datasets namespace because it doesn't make sense and Acrobat Reader isn't able to read pdf with such nodes.
- so created nodes under a datasets node have a namespaceId set to -1 and consequently when serialized no namespace prefix will appear.
There's a fair number of regular expressions througout the code-base which are slightly more verbose than strictly necessary, in particular:
- We have a lot of regular expressions that use `[0-9]` explicitly, and those can be simplified to use `\d` instead.
- We have one instance of a regular expression containing a `A-Za-z0-9_` sequence, which can be simplified to use `\w` instead.
Despite its name, the fonts in ItcSymbol-family are "regular" fonts and not Symbol ones. However, given that the font name contains the word "Symbol" we ended up picking the wrong code-path in the `Font.fallbackToSystemFont`-method.
*Please note:* While this patch ensures that the text becomes readable, by falling back a standard font, the rendering will obviously not be perfect. However, that's the PDF generators "fault" since non-embedded fonts cannot be guaranteed to render correctly in all environments.
Given that the Fetch API is normally being used now, these changes are probably less important now than they used to be. However, given that it's simple enough to implement this I figured why not just fix issue 9883 (better late than never I suppose).
Testing:
- delete the pdf file while the initial request is inflight
- delete the pdf file after the initial request has finished
Repeat for a small file and large file, exercising both one-off and
chunked transports.
This patch changes the `PDFDocumentLoadingTask.destroy`-method and the `_fetchDocument`-function to be `async`, which slightly simplifies the relevant code.
Furthermore, remove the catch-handler from the `WorkerTransport.getPageIndex`-method since it's no longer needed. Given that the `MessageHandler` is nowadays wrapping every possible Exception, it's no longer necessary to try and re-wrap the reason here.
While e.g. the `simpleviewer` and `singlepageviewer` examples work, since they're based on the `BaseViewer`-class, the standalone `pageviewer` example currently doesn't support either XFA- or StructTree-layers. This seems like an obvious oversight, which can be easily addressed simply by exporting the necessary functionality through `pdf_viewer.component.js`, similar to the existing Text/Annotation-layers.
While working on, and testing, these changes I happened to notice a number of smaller things that's also fixed in this patch:
- Ensure that `XfaLayerBuilder.render` always have a *consistent* return type, to prevent possible run-time failures in `PDFPageView`; PR 13908 follow-up.
- Change the order of the options in the `XfaLayerBuilder`-constructor to agree with the parameter order in the `DefaultXfaLayerFactory.createXfaLayerBuilder`-method.
- Add a missing `textHighlighterFactory`-option, in the JSDocs for the `PDFPageView`-class.
- A couple of small tweaks in the `TextLayerBuilder.render`-method: Re-use an existing Array rather than creating a new one, and replace an `if` with optional chaining instead.
*Please note:* For now XFA-support is currently disabled by default, similar to the regular viewer.
While running the unit-tests with some logging statements added to this code, I noticed that `PasswordException` was missing from the list of potential Errors that could be passed to the `wrapReason` function.
A lot of the code in this getter has existed ever since the initial PresentationMode-implementation was first added all the way back in PR 1938 (which is nine years ago now).
At this point in time however, there's now a simpler way detect if a browser supports the FullScreen API and we should thus be able to simplify this getter; please refer to https://developer.mozilla.org/en-US/docs/Web/API/Document/fullscreenEnabled#browser_compatibility
This command was added all the way back when basic CI-support was first introduced (using Travis at the time), however it's never really intended to be used e.g. for local development.
By having a `npm test`-command listed in the `package.json` file, there's a very real risk that someone unfamiliar with the code-base would only run that one and thus miss all the other (more important) test-suites[1].
Hence this patch which removes the `npm test`-command, and instead simply calls the relevant gulp-task[2] directly in the GitHub Actions configuration.
---
[1] Which consist of the unit-tests (run in browsers), the font-tests (potentially), the reference-tests, and the integration-tests.
[2] Which is also renamed slightly, to better fit its current usage.
Something that I *just* realized is that while PR 13854 fixed an issue as reported, it could still cause bugs in other similarily broken documents since we'll not insert a matching endMarkedContent-operator in the operatorList.
Currently, in the `PartialEvaluator`, we only support Optional Content in Form-/XObjects. Hence this patch adds support for Image-/XObjects as well, which looks like a simple oversight in PR 12095 since the canvas-implementation already contains the necessary code to support this.
The Viewer API definitions do not compile because of missing imports and
anonymous objects are typed as `Object`. These issues were not caught
during CI because the test project was not compiling anything from the
Viewer API.
As an example of the first problem:
```
/**
* @implements MyInterface
*/
export class MyClass {
...
}
```
will generate a broken definition that doesn’t import MyInterface:
```
/**
* @implements MyInterface
*/
export class MyClass implements MyInterface {
...
}
```
This can be fixed by adding a typedef jsdoc to specify the import:
```
/** @typedef {import("./otherFile").MyInterface} MyInterface */
```
See https://github.com/jsdoc/jsdoc/issues/1537 and
https://github.com/microsoft/TypeScript/issues/22160 for more details.
As an example of the second problem:
```
/**
* Gets the size of the specified page, converted from PDF units to inches.
* @param {Object} An Object containing the properties: {Array} `view`,
* {number} `userUnit`, and {number} `rotate`.
*/
function getPageSizeInches({ view, userUnit, rotate }) {
...
}
```
generates the broken definition:
```
function getPageSizeInches({ view, userUnit, rotate }: Object) {
...
}
```
The jsdoc should specify the type of each nested property:
```
/**
* Gets the size of the specified page, converted from PDF units to inches.
* @param {Object} options An object containing the properties: {Array} `view`,
* {number} `userUnit`, and {number} `rotate`.
* @param {number[]} options.view
* @param {number} options.userUnit
* @param {number} options.rotate
*/
```
- Use `Node.TEXT_NODE` rather than a magical constant, in `TextHighlighter._convertMatches`, to improve readability. According to MDN, this has been supported since "forever": https://developer.mozilla.org/en-US/docs/Web/API/Node/nodeType#browser_compatibility
- Remove the `pageIdx`-property, on `TextLayerBuilder`-instances, since the re-factoring in PR 13908 meant that it's now unused.
- Remove the `matches`-property, on `TextLayerBuilder`-instances, since the re-factoring in PR 13908 meant that it's now unused.