While some of the output looks worse to my eye, this behavior more
closely matches what I see when I open the PDFs in Adobe acrobat.
Fixes: #4706, #9713, #8245, #1344
In the referenced PDF document the /Contents stream contains MarkedContent-operators, however no optional content dictionary exists; according to [the specification](https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G7.3883825):
> Null values or references to deleted objects shall be ignored. If this entry is
not present, is an empty array, or contains references only to null or deleted
objects, the membership dictionary shall have no effect on the visibility of
any content.
In the PDF document some of the glyphs have bogus `differences`-entries[1] that cannot be resolved to valid glyph names, thus causing the glyph mapping to fail.
My initial idea was to use a similar approach as in the `PartialEvaluator._simpleFontToUnicode`-method, to extract the charCodes from those entries, however it turned out that that didn't actually help in this case (the mapping was still wrong).
To fix this I'm thus proposing that we fallback to the /ToUnicode map when no other useable data exists (e.g. no post-table), since it *hopefully* shouldn't make things any worse than leaving parts of the glyph map empty (which currently happens).
---
[1] As can be seem below, some of the entries are completely normal while others are non-standard:
```
Differences (array)
0 = 65
1 = /g5167
2 = /space
3 = /g11927
4 = /g17737
5 = /g11540
6 = /g2180
7 = /K
8 = /P
9 = /two
10 = /zero
11 = /one
12 = /five
13 = /four
14 = /g6932
15 = /g7246
16 = /g1691
17 = /g2343
18 = /g14792
19 = /g3325
20 = /g4280
21 = /g20383
22 = /g18166
23 = /g16988
24 = /g17943
25 = /g19223
26 = /g10830
27 = 97
28 = /g982
29 = /g1226
30 = /g5059
31 = /g2677
32 = /g1042
33 = /g11568
34 = /L
35 = /three
36 = /seven
37 = /g2364
38 = /g12063
39 = /g5356
40 = /g2173
41 = /g17877
42 = /g7273
43 = /g7647
44 = /g7224
45 = /g19327
46 = /g5054
47 = /g2342
48 = /g10136
49 = /g6856
50 = /g13381
51 = /g7257
52 = /g12093
53 = /g2359
```
- aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1720179
- in some pdfs the XFA array in AcroForm dictionary doesn't contain an entry for 'datasets' (which contains saved data), so basically this patch allows to overwrite the AcroForm dictionary with an updated XFA array when doing an incremental update.
- when some named nodes in the template don't have their counterpart in datasets we create some nodes: the main node mustn't belong to the datasets namespace because it doesn't make sense and Acrobat Reader isn't able to read pdf with such nodes.
- so created nodes under a datasets node have a namespaceId set to -1 and consequently when serialized no namespace prefix will appear.
There's a fair number of regular expressions througout the code-base which are slightly more verbose than strictly necessary, in particular:
- We have a lot of regular expressions that use `[0-9]` explicitly, and those can be simplified to use `\d` instead.
- We have one instance of a regular expression containing a `A-Za-z0-9_` sequence, which can be simplified to use `\w` instead.
Despite its name, the fonts in ItcSymbol-family are "regular" fonts and not Symbol ones. However, given that the font name contains the word "Symbol" we ended up picking the wrong code-path in the `Font.fallbackToSystemFont`-method.
*Please note:* While this patch ensures that the text becomes readable, by falling back a standard font, the rendering will obviously not be perfect. However, that's the PDF generators "fault" since non-embedded fonts cannot be guaranteed to render correctly in all environments.
Currently, in the `PartialEvaluator`, we only support Optional Content in Form-/XObjects. Hence this patch adds support for Image-/XObjects as well, which looks like a simple oversight in PR 12095 since the canvas-implementation already contains the necessary code to support this.
The Viewer API definitions do not compile because of missing imports and
anonymous objects are typed as `Object`. These issues were not caught
during CI because the test project was not compiling anything from the
Viewer API.
As an example of the first problem:
```
/**
* @implements MyInterface
*/
export class MyClass {
...
}
```
will generate a broken definition that doesn’t import MyInterface:
```
/**
* @implements MyInterface
*/
export class MyClass implements MyInterface {
...
}
```
This can be fixed by adding a typedef jsdoc to specify the import:
```
/** @typedef {import("./otherFile").MyInterface} MyInterface */
```
See https://github.com/jsdoc/jsdoc/issues/1537 and
https://github.com/microsoft/TypeScript/issues/22160 for more details.
As an example of the second problem:
```
/**
* Gets the size of the specified page, converted from PDF units to inches.
* @param {Object} An Object containing the properties: {Array} `view`,
* {number} `userUnit`, and {number} `rotate`.
*/
function getPageSizeInches({ view, userUnit, rotate }) {
...
}
```
generates the broken definition:
```
function getPageSizeInches({ view, userUnit, rotate }: Object) {
...
}
```
The jsdoc should specify the type of each nested property:
```
/**
* Gets the size of the specified page, converted from PDF units to inches.
* @param {Object} options An object containing the properties: {Array} `view`,
* {number} `userUnit`, and {number} `rotate`.
* @param {number[]} options.view
* @param {number} options.userUnit
* @param {number} options.rotate
*/
```
*This is a follow-up to PRs 13867 and 13899.*
This patch is tagged `api-minor` for the following reasons:
- It replaces the `renderInteractiveForms`/`includeAnnotationStorage`-options, in the `PDFPageProxy.render`-method, with the single `annotationMode`-option that controls which annotations are being rendered and how. Note that the old options were mutually exclusive, and setting both to `true` would result in undefined behaviour.
- For improved consistency in the API, the `annotationMode`-option will also work together with the `PDFPageProxy.getOperatorList`-method.
- It's now also possible to disable *all* annotation rendering in both the API and the Viewer, since the other changes meant that this could now be supported with a single added line on the worker-thread[1]; fixes 7282.
---
[1] Please note that in order to simplify the overall implementation, we'll purposely only support disabling of *all* annotations and that the option is being shared between the API and the Viewer. For any more "specialized" use-cases, where e.g. only some annotation-types are being rendered and/or the API and Viewer render different sets of annotations, that'll have to be handled in third-party implementations/forks of the PDF.js code-base.
Moves the logic out of TextLayerBuilder to handle
highlighting matches into a new separate class `TextHighlighter`
that can be used with regular PDFs and XFA PDFs.
To mimic the current find functionality in XFA, two arrays
from the XFA rendering are created to get the text content
and map those to DOM nodes.
Fixes#13878
This patch removes the only remaining closure in the `src/display/api.js` file, utilizing a similar approach as used in lots of other parts of the code-base, which results in a small decrease in the size of the *build* `pdf.js` file.
Given that `PDFWorker` is exposed through the *public* API, this complicates things somewhat since there's a couple of worker-related properties that really should stay *private*. Initially, while working on PR 13813, I believed that we'd need support for private (static) class fields in order to get rid of this closure, however I've managed to come up with what's hopefully deemed an acceptable work-around here.
Furthermore, some helper functions were simply moved into the `PDFWorker` class as static methods, thus simplifying the overall implementation (e.g. we don't need to manually cache the Promise in the `PDFWorker._setupFakeWorkerGlobal`-method).
Finally, as part of this re-factoring a number of missing JSDoc-comments were added which *together* with the removal of the closure significantly improves the `gulp jsdoc` output for the `PDFWorker` class.
*Please note:* This patch is tagged with `api-minor` since it deprecates `PDFWorker.getWorkerSrc()` in favor of the shorter `PDFWorker.workerSrc`, with the fallback limited to `GENERIC` builds.
It shouldn't be necessary to assign these variables to the global scope (as far as I can tell), either explicitly with `window` or implicitly with `var`, and this way we don't need to disable the ESLint `no-undef` rule; fixes another small part of issue 13862.
*Please note:* I wasn't going to put additional work into this code after PR 13869, however these changes looked so simple that I figured trying to get rid of the few remaining "Code scanning alerts" wouldn't hurt.
However, this file would still very much benefit from additional clean-up and re-factoring work, since it's quite old and currently contains some dead code (commented out).
With the changes made in PR 13746 the *internal* renderingIntent handling became somewhat "messy", since we're now having to do string-matching in various spots in order to handle the "oplist"-intent correctly.
Hence this patch, which implements the idea from PR 13746 to convert the `intent`-strings, used in various API-methods, into an *internal* renderingIntent that's implemented using a bit-field instead. *Please note:* This part of the patch, in itself, does *not* change the public API (but see below).
This patch is tagged `api-minor` for the following reasons:
1. It changes the *default* value for the `intent` parameter, in the `PDFPageProxy.getAnnotations` method, to "display" in order to be consistent across the API.
2. In order to get *all* annotations, with the `PDFPageProxy.getAnnotations` method, you now need to explicitly set "any" as the `intent` parameter.
3. The `PDFPageProxy.getOperatorList` method will now also support the new "any" intent, to allow accessing the operatorList of all annotations (limited to those types that have one).
4. Finally, for consistency across the API, the `PDFPageProxy.render` method also support the new "any" intent (although I'm not sure how useful that'll be).
Points 1 and 2 above are the significant, and thus breaking, changes in *default* behaviour here. However, unfortunately I cannot see a good way to improve the overall API while also keeping `PDFPageProxy.getAnnotations` unchanged.
Given that issue 13862 tracks updating/modernizing the code, this patch purposely limits the scope of the changes. In particular, the following things are still left to address:
- The ESLint `no-undef` errors; for now the rule is simply disabled globally in this file.
- A couple of unused variables are commented out for now, but could perhaps just be removed.
The new command is a *variation* of the standard `gulp test` command and will run all unit/font/integration-tests just as normal, while *only* running ref-tests for XFA-documents to speed up development.
Given that we currently have (some) unit-tests for XFA-documents, and that we may also (in the future) want to add integration-tests, it thus makes sense to run all test-suites in my opinion.
*Please note:* Once this patch has landed, I'll submit a follow-up patch to https://github.com/mozilla/botio-files-pdfjs such that we can also run the new command on the bots.