Given that we're using modules, meaning that only explicitly `export`ed things are visible to the outside, it's no longer necessary to wrap all of the code in a closure.
This is done automatically with `gulp lint --fix` and the following
manual changes:
```diff
diff --git a/src/core/image.js b/src/core/image.js
index 35c06b8ab..e718b9937 100644
--- a/src/core/image.js
+++ b/src/core/image.js
@@ -97,7 +97,7 @@ class PDFImage {
if (isName(filter)) {
switch (filter.name) {
case "JPXDecode":
- var jpxImage = new JpxImage();
+ const jpxImage = new JpxImage();
jpxImage.parseImageProperties(image.stream);
image.stream.reset();
```
Using `for...of` is a modern and generally much nicer pattern, since it gets rid of unnecessary callback-functions. (In a couple of spots, a "regular" `for` loop had to be used.)
This patch first of all moves all checking/validation into the `appendIfJavaScriptDict` function, to avoid duplicating it in multiple places. Secondly, also removes what's now an outdated/incorrect comment since we have implemented scripting support.
Given that we're (almost) always iterating through the result of the `getAll`-calls, using a `Map` seems nicer overall since it's more suited to iteration compared to a regular Object.
Also, add a couple of `Dict`-checks in existing code touched by this patch, since it really cannot hurt to prevent *potential* errors in a corrupt PDF document.
First of all, while it should be very unlikely that the /ID-entry is an *indirect* object, note how we're using `Dict.get` when parsing it e.g. in `PDFDocument.fingerprint`. Hence we definitely should be consistent here, since if the /ID-entry is an *indirect* object the existing code in `src/core/writer.js` would already fail.
Secondly, to fix the referenced issue, we also need to check that the /ID-entry actually is an Array before attempting to access its contents in `src/core/writer.js`.
*Drive-by change:* In the `xrefInfo` object passed to the `incrementalUpdate` function, re-name the `encrypt` property to `encryptRef` since its data is fetched using `Dict.getRaw` (given the names of the other properties fetched similarly).
For CFF fonts without proper `ToUnicode`/`Encoding` data, utilize the "charset"/"Encoding"-data from the font file to improve text-selection (issue 13260)
By not waiting for the /Properties to load, before parsing of the operatorList/textContent starts, there's a very real risk that a `MissingDataException` will be thrown when trying to access the data in the `PartialEvaluator.parseMarkedContentProps` method.
If this ever happens it will thus lead to incomplete and/or outright broken rendering, and with e.g. `disableAutoFetch=true` set the likelihood of this occuring would increase quite a bit.
*Please note:* While I've not yet seen this error in an actual PDF document, it can happen during loading if you're unlucky enough with e.g. the structure of the PDF document and/or the download speed offered by the server.
- Use `PDFManager.ensureDoc`, rather than `PDFManager.ensure`, in a couple of spots in the code. If there exists a short-hand format, we should obviously use it whenever possible.
- Fix a unit-test helper, to account for the previous changes. (Also, converts a function to be `async` instead.)
- Add one more exists-check in `PDFDocument.loadXfaFonts`, which I missed to suggest in PR 13146, to prevent any possible errors if the method is ever called in a situation where it shouldn't be.
Also, print a warning if the actual font-loading fails since that could help future debugging. (Finally, reduce overall indentation in the loop.)
- Slightly unrelated, but make a small tweak of a comment in `src/core/fonts.js` to reduce possible confusion.
- Different fonts can be used in xfa and some of them are embedded in the pdf.
- Load all the fonts in window.document.
Update src/core/document.js
Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>
Update src/core/worker.js
Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>
The size of the `src/core/obj.js` file has increased slowly over the years, and it also contains a fair amount of *distinct* functionality.
In order to improve readability and make it easier to navigate through the code, this patch moves the `XRef` into its own file.
The size of the `src/core/obj.js` file has increased slowly over the years, and it also contains a fair amount of *distinct* functionality.
In order to improve readability and make it easier to navigate through the code, this patch moves `NameTree`/`NumberTree` into its own file.
The size of the `src/core/obj.js` file has increased slowly over the years, and it also contains a fair amount of *distinct* functionality.
In order to improve readability and make it easier to navigate through the code, this patch moves the `FileSpec` into its own file.
The size of the `src/core/obj.js` file has increased slowly over the years, and it also contains a fair amount of *distinct* functionality.
In order to improve readability and make it easier to navigate through the code, this patch moves the `ObjectLoader` into its own file.
Remove the unused "GetIsPureXfa" message handler; and avoid unnecessary parsing when no structTree is available (PR 13069 follow-up, PR 13221 follow-up)
With the current implementation of `PDFDocument.hasJSActions`, in the worker-thread, we're not actually handling not-yet-loaded data correctly. This can thus fail in *two* different ways:
- The `PDFDocument.fieldObjects` getter (and its helper method), while it may *return* a Promise, still fetches all of its data synchronously and it can thus throw a `MissingDataException` during parsing.
- The `Catalog.jsActions` getter, which is completely synchronous, can obviously throw a `MissingDataException` during parsing.
If either of these cases occur currently, the `PDFDocumentProxy.hasJSActions` method in the API can either return a *rejected* Promise (which it never should) or possibly "hang" and never resolve.
*Please note:* While I've not *yet* seen this error in an actual PDF document, it can happen during loading if you're unlucky enough with e.g. the structure of the PDF document and/or the download speed offered by the server.
This patch is thus based on code-inspection *and* on manually throwing a `MissingDataException` on the first access of `Catalog.jsActions` to simulate this situation.
Finally, this patch adds a couple of *API* unit-tests for this (since none existed).
Given that this only an internal helper method, used by the `Catalog.{javaScript, jsActions}` getters, this change simplifies iteration of the returned data.
We can also (slightly) re-factor the code of the `jsActions` getter, and remove an obsolete[1] JSDoc-comment from the `openAction` getter.
---
[1] Not really relevant now that we've got proper scripting support.
Similar to all other data accesses, note e.g. the "GetDocJSActions" handler just above, we need to ensure that a `MissingDataException` isn't propagated to the main-thread if this data is accessed while the PDF document is still loading.
It's obviously (a bit) more efficient to return early in `Page.getStructTree`, rather than trying to first "parse" an *empty* structTree-root.
*Somehow I didn't think of this yesterday, but this feels like a much better solution overall; sorry about the churn here!*
Looking at the API, there's no code which actually sends this message. Most likely it's a left-over from a previous version of PR 13069, since the `isPureXfa` parameter is being included in the "GetDoc" message.
This is first of all consistent with existing API-methods, where we return `null` when the data in question doesn't exist. Secondly, it should also be (slightly) more efficient since there's less dummy-data that we need to transfer between threads.
Finally, this prevents us from adding an empty/unnecessary span to *every* single page even in documents without any structure tree data.
- but don't validate them for now;
- Firefox will display a bar to warn that the signature validation is not supported (see https://bugzilla.mozilla.org/show_bug.cgi?id=854315)
- almost all (all ?) pdf readers display signatures;
- validation is done in edge but for now it's behind a pref.
When a PDF is "marked" we now generate a separate DOM that represents
the structure tree from the PDF. This DOM is inserted into the <canvas>
element and allows screen readers to walk the tree and have more
information about headings, images, links, etc. To link the structure
tree DOM (which is empty) to the text layer aria-owns is used. This
required modifying the text layer creation so that marked items are
now tracked.
The fontName, as defined in the PDF document, cannot be found in *any* of the "name"-tables in the TrueType Collection font. To work-around that, this patch adds a *fallback* code-path to allow using an approximately matching fontName rather than outright failing.
Given that what we actually want is only to keep track of the loadedFont-names, rather than storing any actual data, using an object isn't really necessary here. Furthermore, in the current code, we're also using `in` when checking if the data exists, which is generally less efficient than just checking for the value directly.
Currently the `fontName`-property contains an actual /Name-instance, which is a problem given that its fallback value is an empty string; see ca7f546828/src/core/default_appearance.js (L35)
The reason that this is a problem can be seen in ca7f546828/src/core/primitives.js (L30-L34), since an empty string short-circuits the cache. Essentially, in PDF documents, a /Name-instance cannot be empty and the way that the `DefaultAppearanceEvaluator` does things is unfortunately not entirely correct.
Hence the `fontName`-property is changed to instead contain a string, rather than a /Name-instance, which simplifies the code overall.
*Please note:* I'm tagging this patch with "[api-minor]", since PR 12831 is included in the current pre-release (although we're not using the `fontName`-property in the display-layer).
Fixes#13107
In the issue, some TrueType glyph names have the format `uniXXXX`.
Font's `Encoding` dictionary has the entry `Differences` but no
`BaseEncoding`. `uniXXXX` names are converted to glyph indices
using font's `post` table but currently that is done only when
`BaseEncoding` exists. We must enable the conversion also when only
`Differences` exists.
The reasons for making this change are:
- This property is not, nor has it ever been, used anywhere in the PDF.js display-layer.
- Related to the previous point, the format of the `defaultAppearance`-string is such that it'd be difficult to use it as-is in the display-layer anyway.
- It (usually) contains the "raw" appearance-string, from the PDF document, which is neither parsed nor validated and could thus be bogus.
- We now expose a `defaultAppearanceData`-property, which is first of all used in the display-layer and secondly contains actually parsed/validated data.
- In the event that a third-party implementation needs the `defaultAppearance`-string, it could be easily constructed from the recently added `defaultAppearanceData`-property.
All-in-all, I'm thus suggesting that we stop exposing an unused and unnecessary property on all Annotation-instances.
* JS - Handle correctly hierarchy of fields
- it aims to fix#13132;
- annotations can inherit their actions from the parent field;
- there are some fields which act as a container for other fields:
- they can be access through js so need to add them with an empty type (nothing in the spec about that but checked in Acrobat);
- calculation order list (CO) can reference them so need make them through this.getField;
- getArray method must return kids.
- field values are number, string, ... depending of their type but nothing in the spec on how to know what's the type:
- according to the comment for Canonical Format: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#page=461
- it seems that this "type" can be guessed from js action Format (when setting a type in Acrobat DC, the only affected thing is this action).
- util.scand with an empty string returns the current date.
- implement few positioning properties: position, width, height, anchor;
- implement font element;
- implement fill element (used by font) and its children (linear, radial, ...);
- font property is inherited from ancestor container (see https://www.pdfa.org/wp-content/uploads/2020/07/XFA-3_3.pdf#page=43) so let CSS handles that stuff;
- in order to reduce the number of properties to set, only set non default properties and put the default in CSS;
- set a background to some containers to be able to see them (will be removed in a future commit).
- add an option to enable XFA rendering if any;
- for now, let the canvas layer: it could be useful to implement XFAF forms (embedded pdf in xml stream for the background and xfa form for the foreground);
- ui elements in template DOM are pretty close to their html counterpart so we generate a fake html DOM from template one:
- it makes easier to translate template properties to html ones;
- it makes faster the creation of the html element in the main thread.
While there is nothing *outright* wrong with the existing implementation, it can however lead to increased memory usage in one particular case (that I completely overlooked when implementing this):
For "data:"-URLs, which by definition contains the entire PDF document and can thus be arbitrarily large, we obviously want to avoid sending, storing, and/or logging the "raw" docBaseUrl in that case.
To address this, this patch makes the following changes:
- Ignore any non-string in the `docBaseUrl` option passed to `getDocument`, since those are unsupported anyway, already on the main-thread.
- Ignore "data:"-URLs in the `docBaseUrl` option passed to `getDocument`, to avoid having to send what could potentially be a *very* long string to the worker-thread.
- Parse the `docBaseUrl` option *directly* in the `BasePdfManager`-constructors, on the worker-thread, to avoid having to store the "raw" docBaseUrl in the first place.
This extends PR 13033 slightly, with a heuristic to support corrupt PDF documents where the `LineAnnotation`s have an empty /Rect-entry. Please note that while I have no idea if this is "correct", this patch at least makes us output the same /BBox as re-saving in Adobe Reader does.
This is mostly done using `gulp lint --fix` with a few manual changes in
the following diff:
```diff
diff --git a/src/core/pattern.js b/src/core/pattern.js
index 365491ed3..eedd8b686 100644
--- a/src/core/pattern.js
+++ b/src/core/pattern.js
@@ -105,7 +105,7 @@ const Pattern = (function PatternClosure() {
return Pattern;
})();
-var Shadings = {};
+const Shadings = {};
// A small number to offset the first/last color stops so we can insert ones to
// support extend. Number.MIN_VALUE is too small and breaks the extend.
@@ -597,16 +597,15 @@ Shadings.Mesh = (function MeshClosure() {
if (!(0 <= f && f <= 3)) {
throw new FormatError("Unknown type6 flag");
}
- var i, ii;
const pi = coords.length;
- for (i = 0, ii = f !== 0 ? 8 : 12; i < ii; i++) {
+ for (let i = 0, ii = f !== 0 ? 8 : 12; i < ii; i++) {
coords.push(reader.readCoordinate());
}
const ci = colors.length;
- for (i = 0, ii = f !== 0 ? 2 : 4; i < ii; i++) {
+ for (let i = 0, ii = f !== 0 ? 2 : 4; i < ii; i++) {
colors.push(reader.readComponents());
}
- var tmp1, tmp2, tmp3, tmp4;
+ let tmp1, tmp2, tmp3, tmp4;
switch (f) {
// prettier-ignore
case 0:
@@ -729,16 +728,15 @@ Shadings.Mesh = (function MeshClosure() {
if (!(0 <= f && f <= 3)) {
throw new FormatError("Unknown type7 flag");
}
- var i, ii;
const pi = coords.length;
- for (i = 0, ii = f !== 0 ? 12 : 16; i < ii; i++) {
+ for (let i = 0, ii = f !== 0 ? 12 : 16; i < ii; i++) {
coords.push(reader.readCoordinate());
}
const ci = colors.length;
- for (i = 0, ii = f !== 0 ? 2 : 4; i < ii; i++) {
+ for (let i = 0, ii = f !== 0 ? 2 : 4; i < ii; i++) {
colors.push(reader.readComponents());
}
- var tmp1, tmp2, tmp3, tmp4;
+ let tmp1, tmp2, tmp3, tmp4;
switch (f) {
// prettier-ignore
case 0:
@@ -897,7 +895,7 @@ Shadings.Mesh = (function MeshClosure() {
decodeType4Shading(this, reader);
break;
case ShadingType.LATTICE_FORM_MESH:
- var verticesPerRow = dict.get("VerticesPerRow") | 0;
+ const verticesPerRow = dict.get("VerticesPerRow") | 0;
if (verticesPerRow < 2) {
throw new FormatError("Invalid VerticesPerRow");
}
```
A significant portion of the code-base has now been converted to use `let`/`const`, rather than `var`, hence it should be possible to simply enable the ESLint `no-var` rule globally.
This way we can ensure that new code won't accidentally use `var`, and it also removes the need to manually enable the rule in various folders.
Obviously it makes sense to continue the efforts to replace `var`, but that should probably happen on a file and/or folder basis.
Please note that this patch excludes the following code:
- The `extensions/` folder, since that seemed easiest for now (and I don't know exactly what the support situation is for the Chromium-extension).
- The entire `external/` folder is ignored, since most of it's currently excluded from linting.
For the code that isn't imported from elsewhere (and should be ignored), we should probably (at some point) bring the code up to the same linting/formatting standard as the rest of the code-base.
- Various files in the `test/` folder are ignored, as necessary, since the way that a lot of this code is loaded will require some care (or perhaps larger re-factoring) when removing `var` usage.
Note that the majority of these changes were done automatically, by using `gulp lint --fix`, and the manual changes were limited to the following diff:
```diff
diff --git a/src/core/cff_parser.js b/src/core/cff_parser.js
index d684c200e..2e2b811e4 100644
--- a/src/core/cff_parser.js
+++ b/src/core/cff_parser.js
@@ -555,7 +555,7 @@ const CFFParser = (function CFFParserClosure() {
stackSize %= 2;
validationCommand = CharstringValidationData[value];
} else if (value === 10 || value === 29) {
- var subrsIndex;
+ let subrsIndex;
if (value === 10) {
subrsIndex = localSubrIndex;
} else {
@@ -886,15 +886,15 @@ const CFFParser = (function CFFParserClosure() {
format = bytes[pos++];
switch (format & 0x7f) {
case 0:
- var glyphsCount = bytes[pos++];
+ const glyphsCount = bytes[pos++];
for (i = 1; i <= glyphsCount; i++) {
encoding[bytes[pos++]] = i;
}
break;
case 1:
- var rangesCount = bytes[pos++];
- var gid = 1;
+ const rangesCount = bytes[pos++];
+ let gid = 1;
for (i = 0; i < rangesCount; i++) {
const start = bytes[pos++];
const left = bytes[pos++];
@@ -938,7 +938,7 @@ const CFFParser = (function CFFParserClosure() {
}
break;
case 3:
- var rangesCount = (bytes[pos++] << 8) | bytes[pos++];
+ const rangesCount = (bytes[pos++] << 8) | bytes[pos++];
for (i = 0; i < rangesCount; ++i) {
let first = (bytes[pos++] << 8) | bytes[pos++];
if (i === 0 && first !== 0) {
@@ -1173,7 +1173,7 @@ class CFFDict {
}
}
-var CFFTopDict = (function CFFTopDictClosure() {
+const CFFTopDict = (function CFFTopDictClosure() {
const layout = [
[[12, 30], "ROS", ["sid", "sid", "num"], null],
[[12, 20], "SyntheticBase", "num", null],
@@ -1229,7 +1229,7 @@ var CFFTopDict = (function CFFTopDictClosure() {
return CFFTopDict;
})();
-var CFFPrivateDict = (function CFFPrivateDictClosure() {
+const CFFPrivateDict = (function CFFPrivateDictClosure() {
const layout = [
[6, "BlueValues", "delta", null],
[7, "OtherBlues", "delta", null],
@@ -1265,11 +1265,12 @@ var CFFPrivateDict = (function CFFPrivateDictClosure() {
return CFFPrivateDict;
})();
-var CFFCharsetPredefinedTypes = {
+const CFFCharsetPredefinedTypes = {
ISO_ADOBE: 0,
EXPERT: 1,
EXPERT_SUBSET: 2,
};
+
class CFFCharset {
constructor(predefined, format, charset, raw) {
this.predefined = predefined;
@@ -1695,7 +1696,7 @@ class CFFCompiler {
// For offsets we just insert a 32bit integer so we don't have to
// deal with figuring out the length of the offset when it gets
// replaced later on by the compiler.
- var name = dict.keyToNameMap[key];
+ const name = dict.keyToNameMap[key];
// Some offsets have the offset and the length, so just record the
// position of the first one.
if (!offsetTracker.isTracking(name)) {
```
Now that we have scripting support, warning about e.g. JavaScript actions doesn't seem necessary anymore. Especially considering that scripting-related actions are/will not be parsed by the `Catalog.parseDestDictionary` method anyway, since it's intended for handling "simple" actions.
All of this code predates the existence of native JS classes, however we can now clean this up a bit. This patch thus let us remove some variable "shadowing" from the code.
This helper function is first of all only called *twice*, and secondly it also leads to unnecessary intermediate allocations given how the `TypedArray`s are handled.
Hence we can simply inline this small function, and thus directly allocate the combined `TypedArray` instead.
The `compareByteArrays` is first of all duplicated in multiple closures in the `src/core/crypto.js` file. Secondly, despite its name, it's also functionally equivalent to the now existing `isArrayEqual` helper function.
The `isArrayEqual` helper function is changed to use a standard `for`-loop, rather than `Array.prototype.every`, since that ought to be slightly more efficient given that we're now using it with (potentially) larger data.
All of this code predates the existence of native JS classes, however we can now clean this up a bit. This patch thus let us remove some variable "shadowing" from the code.
Rather than converting the `AnnotationStorage`-data to an Object, before sending it to the worker-thread, we should be able to simply send the internal `Map` directly.
The "structured clone algorithm" doesn't have a problem with `Map`s, however the `LoopbackPort` used when workers are *disabled* (e.g. in Node.js environments) didn't use to support them. With PR 12997 having lifted that restriction, we should now be able to simply send the `AnnotationStorage`-data as-is rather than having to iterate through it to first create an Object.
*Please note:* The changes in `src/core/annotation.js` could have been a lot more compact if we were able to use optional chaining in the `src/core` folder. Unfortunately that's still not possible, since SystemJS is being used in the development viewer (i.g. `gulp server`) and fixing that is *still* blocked by [bug 1247687](https://bugzilla.mozilla.org/show_bug.cgi?id=1247687).
With the previous patch this function is now *only* accessed on the worker-thread, hence it's no longer necessary to include it in the *built* `pdf.js` file.
With the previous patch this functionality is now *only* accessed on the worker-thread, hence it's no longer necessary to include it in the *built* `pdf.js` file.
The only reason, as far as I can tell, for parsing the Metadata on the main-thread is how it was originally implemented. When Metadata support was first implemented, it utilized the [`DOMParser`](https://developer.mozilla.org/en-US/docs/Web/API/DOMParser) which isn't available in workers.
Today, with the custom XML-parser being used, that's no longer an issue and it seems reasonable to move the Metadata parsing to the worker-thread[1], since that's where all parsing should happen (for performance reasons).
Based on these changes, we'll be able to reduce the now unnecessary duplication of the XML-parser (and related code) in both of the *built* `pdf.js`/`pdf.worker.js` files.
Finally, this patch changes the `_repair` method to use "Array + join" rather than string concatenation.
---
[1] This needed the previous patch, to enable sending of `Map`s between threads with workers disabled.
The following checks are all unneeded, and could easily cause confusion when reading the code. (All of them are my fault as well, since I've sometimes added those checks without really thinking about the surrounding code.)
- In `PartialEvaluator.hasBlendModes` there cannot be any `MissingDataException`s thrown, given that the `Page.getOperatorList` method waits for all the necessary /Resources to load first. Furthermore, note also that if an error is thrown from `PartialEvaluator.hasBlendModes` then it'd completely break rendering of that page, since any errors thrown from `Page.getOperatorList` are simply sent to the main-thread.
- In `PartialEvaluator.handleColorN` there cannot be any `MissingDataException`s thrown, given that again the `Page.getOperatorList` method waits for all the necessary /Resources to load before operatorList parsing starts.
- In `XRef.readXRef` there cannot be any `MissingDataException`s thrown, given that we're *explicitly* requesting (and waiting for) the entire document in `pdfManagerReady` (in `src/core/worker.js`) before re-parsing of a corrupt document starts.
* don't set a value in annotationStorage by default:
- having an undefined when the annotation is rendered for saving/printing means nothing has changed so use normal appearance
- aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1681687
* change the way to compute font size when this one is null in DA:
- make fontSize proportional to line height
- in multiline case, take into account the number of lines for text entered to adapt the font size
The *third* page of the referenced PDF document currently fails to render completely, since one of its font files fail to load.
Since that error isn't handled, a large part of the text is thus missing which looks quite bad. By "replacing" the font data with an *empty* stream, we'll thus be able to fallback to rendering the text with a standard font (instead of using `ErrorFont`). While there's obviously no guarantee that things will look perfect, actually rendering the text at all should be an improvement in general.
Also, print a warning in `PartialEvaluator.loadFont` when the `PartialEvaluator.translateFont` method rejects, since that'd have helped debug/fix the issue faster.
*As far as I can tell, this has been broken ever since PR 3289 (back in 2013) without anyone noticing.*
For any non-`MissingDataException` errors encountered in `ObjectLoader._walk`, we're simply throwing immediately which thus has the potential to *completely* break rendering of an entire page.
In practice this is obviously only an issue for PDF documents which are in one way or another corrupt, since that's the only way that `XRef.fetch` will throw non-`MissingDataException` errors. To make matters worse these errors are *intermittent*, since they can only occur if the document is still loading when the `ObjectLoader`-code runs (note the early return in `ObjectLoader.load`).
Please note that we cannot simply catch the error and let "normal" parsing continue in `ObjectLoader._walk`, since that could lead to errors elsewhere given that resources "below" the current one (in the graph) might not be checked as intended then.
All-in-all, the only way to make absolutely sure that we won't cause *unexpected* `MissingDataException`s somewhere else in the code-base is to fallback to fetching the *entire* document in this edge-case.
- in order to evaluate SOM expressions nodes and their attributes must be checked in the same order as in the xml;
- add an object XFAObjectArray with a parameter max to handle multiple children with the same name.
- the parser is base on a class extending XMLParserBase
- it handle xml namespaces:
* each namespace is assocated with a builder
* builder builds nodes belonging to the namespace
* when a node is inserted in the parent namespace compatibility is checked (if required)
- to avoid name collision between xml names and object properties, use Symbol.
Given that we'll only cache `/XObject`s of the `Image`-type globally, we can utilize that in `PartialEvaluator.getTextContent` as well. This way, in cases such as e.g. issue 12098, we can avoid having to fetch/parse `/XObject`s that we already know to be `Image`s. This is helpful, since `Stream`s are not cached on the `XRef` instance (given their potential size) and the lookup can thus be somewhat expensive in general.
Also, skip a redundant `RefSetCache.has` check in the `GlobalImageCache.getData` method.
The widths property should be an object to match what metrics returns.
In ZapfDingbats.pdf I was getting a data clone error with pdfBug enabled.
In buildCharCodeToWidth() there was an encoding with the name "at" which
is also the name of a method on an array. buildCharCodeToWidth assumes an
object is passed in, so when it checked for the "at" property, it found the
method and copied it over.
This only seemed to affect Firefox.
When implementing the `GlobalImageCache` functionality I was mostly worried about the effect of *very large* images, hence the maximum number of cached images were purposely kept quite low[1].
However, there's one fairly obvious problem with that approach: In documents with hundreds, or even thousands, of *small* images the `GlobalImageCache` as implemented becomes essentially pointless.
Hence this patch, where the `GlobalImageCache`-implementation is changed in the following ways:
- We're still guaranteed to be able to cache a *minimum* number of images, set to `10` (similar as before).
- If the *total* size of all the cached image data is below a threshold[2], we're allowed to cache additional images.
This patch thus *improve*, but doesn't completely fix, issue 12098. Note that that document is created by a *very poor* PDF generator, since every single page contains the *entire* document (with all of its /Resources) and to create the individual pages clipping is used.[3]
---
[1] Currently set to `10` images; imagine what would happen to overall memory usage if we encountered e.g. 50 images each 10 MB in size.
[2] This value was chosen, somewhat randomly, to be `40` megabytes; basically five times the [maximum individual image size per page](6249ef517d/src/display/api.js (L2483-L2484)).
[3] This surely has to be some kind of record w.r.t. how badly PDF generators can mess things up...
Note how, in the `if (this.stateManager.stateStack.length !== 0) {` branch, we're attempting to access the not yet defined variable[1] `args`. If this code-path is ever hit, an Error will be thrown and parsing will thus be aborted immediately (likely leading to e.g. rendering bugs).
Note that I found this purely by accident, since I happened to glance at the LGTM report. However, I've since found that the error is also present during the unit-test[2] and with this patch we're actually testing the *intended* thing here.
As part of fixing this, and to avoid re-introducing a similar bug in the future, we'll now instead always reset `args.length` *before* attempting to read the next operator.
Also, we can use the existing `EvaluatorPreprocessor.savedStatesDepth` getter to simplify the save/restore detection a tiny bit.
---
[1] The ESLint rule `no-use-before-define` would have helped catch this problem, but unfortunately we cannot enable that without quite a bit of refactoring all over the code-base.
[2] The unit-test was updated such that it would fail in the `master`-branch.
With the changes in PR 12831, it's no longer necessary to keep track of the `fontName`-string separately since it's available through the `_defaultAppearanceData`-property as well.
As can be seen in 2cba290361/src/core/evaluator.js (L986) the `gStateObj` (which is actually an Array despite its name), is wrapped in Array when it's inserted into the OperatorList. Hence we obviously need to take this into account when accessing it in `TranslatedFont._removeType3ColorOperators`; this mistake happened because we don't have any test-cases for this particular code-path as far as I know.
By changing this a `shadow`ed getter, we can simply access it directly and not worry about it being initialized. I have no idea why I didn't just implement it this way in the first place.
* Add a parser to get font data from the default appearance
- pdfium & poppler use a special parser too to get these info.
* Update src/core/default_appearance.js
Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>
Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>
While PR 12725 fixed bug 1671312 as reported, i.e. the "In the upper right corner "Purposes' has bad kerning."-part, it however broke other parts of the text rendering.
Note in particular the tables, e.g. on page 2 and beyond, where the glyphs are now rendered too close together. The reason for this is that the fonts in question are non-embedded ArialNarrow, which we just replace with Helvetica which obviously is not narrow. Given that the font replacement isn't a perfect fit for non-embedded ArialNarrow, we still need to re-measure the glyph widths in this case.
There's built-in ESLint rule, see `sort-imports`, to ensure that all `import`-statements are sorted alphabetically, since that often helps with readability.
Unfortunately there's no corresponding rule to sort `export`-statements alphabetically, however there's an ESLint plugin which does this; please see https://www.npmjs.com/package/eslint-plugin-sort-exports
The only downside here is that it's not automatically fixable, but the re-ordering is a one-time "cost" and the plugin will help maintain a *consistent* ordering of `export`-statements in the future.
*Note:* To reduce the possibility of introducing any errors here, the re-ordering was done by simply selecting the relevant lines and then using the built-in sort-functionality of my editor.
Given that the PDF document in the issue contains the same very large JPEG image *three* times, this patch includes a test-case where only the first page has been extracted from it.
Currently any errors thrown in `preEvaluateFont`, which is a *synchronous* method, will not be handled at all in the `loadFont` method and we were thus failing to return an `ErrorFont`-instance as intended here.
Also, add an *explicit* check in `PartialEvaluator.preEvaluateFont` to ensure that Type0-fonts always have a *valid* dictionary.
Similar to other markers that we currently skip, by ignoring unsupported Coding style default (COD) options we'll at least render *something* here (although some JPEG 2000 images may look slightly wrong).
Note that if the unsupported COD options lead to additional errors, during parsing, we'll still abort parsing of the JPEG 2000 image.
* the goal is to execute actions like Open or OpenAction
* can be tested with issue6106.pdf (auto-print)
* once #12701 is merged, we can add page actions
Similar to other markers that we currently skip, by ignoring the Coding style component (COC) marker we'll at least prevent outright errors (although some JPEG 2000 images may look slightly wrong).
There doesn't seem to be anything definitive about this in
the spec, but from experimenting, it seems acrobat lets
PDFs override the widths of the standard fonts.
In addition to the existing /Root and /Pages validation, also check that the /Pages-entry actually is a dictionary and that it has a valid /Count-entry.
This way we can avoid picking a trailer candidate which e.g. the `Catalog.numPages` getter will just end up rejecting, thus breaking PDF document loading completely.
* in some pdf, there are actions with "event.source.hidden = ..."
* in order to handle visibility when printing, annotationStorage is extended to store multiple properties (value, hidden, editable, ...)
* When no actions then set it to null instead of empty object
* Even if a field has no actions, it needs to listen to events from the sandbox in order to be updated if an action changes something in it.
The `PartialEvaluator.hasBlendModes` method is necessary to determine if there's any blend modes on a page, which unfortunately requires *synchronous* parsing of the /Resources of each page before its rendering can start (see the "StartRenderPage"-message).
In practice it's not uncommon for certain /Resources-entries to be found on more than one page (referenced via the XRef-table), which thus leads to unnecessary re-fetching/re-parsing of data in `PartialEvaluator.hasBlendModes`.
To improve performance, especially in pathological cases, we can cache /Resources-entries when it's absolutely clear that they do not contain *any* blend modes at all[1]. This way, subsequent `PartialEvaluator.hasBlendModes` calls can be made significantly more efficient.
This patch was tested using the PDF file from issue 6961, i.e. https://github.com/mozilla/pdf.js/files/121712/test.pdf:
```
[
{ "id": "issue6961",
"file": "../web/pdfs/issue6961.pdf",
"md5": "a80e4357a8fda758d96c2c76f2980b03",
"rounds": 100,
"type": "eq"
}
]
```
which gave the following results when comparing this patch against the `master` branch:
```
-- Grouped By browser, page, stat --
browser | page | stat | Count | Baseline(ms) | Current(ms) | +/- | % | Result(P<.05)
------- | ---- | ------------ | ----- | ------------ | ----------- | ---- | ------ | -------------
firefox | 0 | Overall | 100 | 1034 | 555 | -480 | -46.39 | faster
firefox | 0 | Page Request | 100 | 489 | 7 | -482 | -98.67 | faster
firefox | 0 | Rendering | 100 | 545 | 548 | 2 | 0.45 |
firefox | 1 | Overall | 100 | 912 | 428 | -484 | -53.06 | faster
firefox | 1 | Page Request | 100 | 487 | 1 | -486 | -99.77 | faster
firefox | 1 | Rendering | 100 | 425 | 427 | 2 | 0.51 |
```
---
[1] In the case where blend modes *are* found, it becomes a lot more difficult to know if it's generally safe to skip /Resources-entries. Hence we don't cache anything in that case, however note that most document/pages do not utilize blend modes anyway.
* it's faster to generate the color code in using a table for components
* it's very likely a way faster to parse (when setting the color in the canvas)
Given that the `Map`-pattern apparently has undesirable performance characteristics, change this getter back to using an Object instead and check its size before returning it.
Rather than returning an *empty* Object[1] we should be returning `null` instead, since that's consistent with existing API-functionality.
To avoid having to *manually* track if the Object is empty, this patch also introduces a small helper function to check its size.
As can be seen in `src/core/fonts.js`, this method only accepts *one* parameter, hence it's somewhat difficult to understand what the Annotation-code is actually attempting to do here.
The only possible explanation that I can imagine, is that the intention was initially to call `Font.charToGlyph` *directly* instead. However, note that that'd would not actually have been correct, since that'd ignore one level of font-caching (see `this.charsCache`). Hence the unused arguments are removed, in `src/core/annotation.js`, and the `Font.charToGlyph` method is now marked as "private" as intended.
This patch removes unnecessary escape-sequence in (mostly) strings, as a first step, since the ones in regular expressions probably requires more careful testing (just in case).
The only exception is a regular expression in `src/core/annotation.js`, since we should have both unit- and reference-tests for this code *and* given [this information on MDN](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Character_Classes#Types):
> Inside a character set, the dot loses its special meaning and matches a literal dot.
Please find additional details about the ESLint rule at https://eslint.org/docs/rules/no-useless-escape
This provides a work-around to avoid having to conditionally try to initialize the `openAction`-object in multiple places.
Given that `Object.fromEntries` doesn't seem to *guarantee* that a `null` prototype is used, we thus hack around that by using `Object.assign` with `Object.create(null)`.
Different fonts incorrectly end up with *identical* hashes, despite having different /ToUnicode data.
The issue, and it's very interesting that we've apparently not seen it before, appears to be caused by the fact that different /ToUnicode entries share the *same* underlying `ArrayBuffer`, which thus becomes problematic at the `const dataUint32 = new Uint32Array(data.buffer, 0, blockCounts);` line. The simplest solution thus seem to be to just *copy* the input, when it's an `ArrayBuffer`, rather than using it as-is. (Note that if we'd stringified the input, when calling `MurmurHash3_64.update`, the issue would also have been fixed. In this case, we're already creating an unique TypedArray.)
*Please note:* Once https://bugzilla.mozilla.org/show_bug.cgi?id=1247687 is implemented, and we've removed SystemJS completely, this entire patch can (and even should) be reverted.
This is similar to the existing `getLookupTableFactory` helper function, but is implemented as outlined in issue 6774.
The re-formatting of the tables were done automatically, by using find-and-replace with regular expressions.
For reasons that I don't even pretend to understand, using this particular structure for these *very* long lookup tables allow SystemJS to process the files correctly/quickly and the development viewer thus works as intended.
While the *built* `pdf.worker.js` file still works correctly with these changes, despite these two files being excluded by Babel[1], the development viewer does not because of issues with SystemJS[2] and/or its Babel-plugin (both of which are old).
Furthermore, note also that excluding these two files from Babel-processing isn't *generally* necessary since e.g. the `gulp mozcentral` command works anyway. The explanation is rather that it's actually the source-map generation which fails for these huge sequences when building the `pdf.worker.js` file.
However, not using standard `import`/`export` statements in all files means we also need to use SystemJS when e.e. running the unit-tests. This is very unfortunate, since SystemJS (or its old Babel-version) doesn't support modern ECMAScript features such as e.g. optional chaining and nullish coalescing.
Unfortunately it also seems that https://bugzilla.mozilla.org/show_bug.cgi?id=1247687, which tracks the implementation of worker-modules in Firefox, has stalled since there hasn't been any updates for six months now.
To hopefully address all of the above, this patch is the first in a series that attempts to further reduce our reliance on SystemJS.
---
[1] The only difference being how the dependencies are handled, in the Webpack-bundled file.
[2] Parsing takes way too long and consumes too much memory, thus rendering the development viewer essentially unusable.
- Handle the arguments correctly in `PartialEvaluator.handleColorN`.
For TilingPatterns with a base-ColorSpace, we're currently using the `args` when computing the color. However, as can be seen we're passing the Array as-is to the `ColorSpace.getRgb` method, which means that the `Name` is included as well.[1]
Thankfully this hasn't, as far as I know, caused any actual bugs, but that may be more luck than anything else given how the `ColorSpace` code is implemented. This can be easily fixed though, simply by popping the `Name`-object off of the `args` Array.
- Cache TilingPatterns using the `Name`-string, rather than the object directly.
This is not only consistent with other caches in `PartialEvaluator`, but importantly it also ensures that the cache lookup always works correctly. Note that since `Name`-objects, similar to other primitives, uses a cache themselves a *manually* triggered `cleanup`-call could thus (theoretically) cause the `LocalTilingPatternCache` to not find an existing entry. While the likelihood of this happening is *extremely* small, it's still something that we should fix.
---
[1] The `args` Array can e.g. look like this: `[0.043, 0.09, 0.188, 0.004, /P1]`, which means that we're passing in the `Name`-object to the `ColorSpace` method.
For an invalid Annotation, there's one code-path where `undefined` is returned from `AnnotationFactory._create`. That'd currently, incorrectly, trigger an error during the `PDFDocument._collectFieldObjects` parsing which thus seem good to avoid.
Along these lines, the filtering in `PDFDocument.fieldObjects` is also updated to handle both `null` and `undefined` the same way.
Some pdf softwares don't remove highlight annotations but make the QuadPoints array empty.
And the Rect for the annotation can be [-32768, -32768, 32768, 32768] so it leads to have a giant div which catches all the mouse events and make the pdf unusable when there are some forms elements.
Given that *all* fonts are, ever since PR 7347, now cached in the "normal" `fontCache` there's actually no reason for the special `font.translated` construction. (Given how Objects in JavaScript are references, rather than raw values, the old code shouldn't have caused any significant memory overhead.)
Instead we can simply store the `cacheKey`, which is a simple string, on only the Font `Dict`s where it's needed and thus look-up all fonts using the `fontCache` instead.
If this method is ever passed invalid/unexpected data, or if during the course of parsing (since it's used recursively) such data is found, it will fail in a non-graceful way.
Hence this patch, which ensures that we don't attempt to access non-existent properties and also that errors such as the one fixed in PR 12479 wouldn't have occured.
*This patch is based on a couple of smaller things that I noticed when working on PR 12479.*
- Don't store the /Fields on the `formInfo` getter, since that feels like overloading it with unintended (and too complex) data, and utilize a `hasFields` boolean instead.
This functionality was originally added in PR 12271, to help determine what kind of form data a PDF document contains, and I think that we should ensure that the return value of `formInfo` only consists of "simple" data.
With these changes the `fieldObjects` getter instead has to look-up the /Fields manually, however that shouldn't be a problem since the access is guarded by a `formInfo.hasFields` check which ensures that the data both exists and is valid. Furthermore, most documents doesn't even have any /AcroForm data anyway.
- Determine the `hasFields` property *first*, to ensure that it's always correct even if there's errors when checking e.g. the /XFA or /SigFlags entires, since the `fieldObjects` getter depends on it.
- Simplify a loop in `fieldObjects`, since the object being accessed is a `Map` and those have built-in iteration support.
- Use a higher logging level for errors in the `formInfo` getter, and include the actual error message, since that'd have helped with fixing PR 12479 a lot quicker.
- Update the JSDoc comment in `src/display/api.js` to list the return values correctly, and also slightly extend/improve the description.
This allows us to make a slight simplification in `PartialEvaluator.loadFont`, which thus removes an old TODO-comment from the method.
Furthermore, in `PartialEvaluator.translateFont`, the CMap-handling is now limited to only *composite* fonts to avoid having to wait for a "dummy"-Promise for most fonts.
- Actually register/unregister the `WorkerTask`s, used when saving each page, correctly.
To prevent issues when terminating the Worker, we purposely wait for all running `WorkerTask`s to complete first. Hence we need to actually handle `WorkerTask`s the same way in "SaveDocument" as in the rest of this file, see e.g. "GetOperatorList" and "GetTextContent".
- Access `PDFDocument` properties in a generally safe/consistent way.
While the current code works fine, given how the PDF document is being loaded, it still seems like a very good idea to be *consistent* in how we access these kind of properties (since in general you need to avoid `MissingDataException` everywhere in this file).
- Change a variable name, since there's essentially no precedent in the code-base for *local* variable names to start with an underscore.
Support for the `scope` parameter, in `MessageHandler.on`, was removed in PR 11110 however this particular case was unused/unnecessary for years prior to that change. (From a quick look through the history, I'm not even sure if it was actually needed in the first place.)
In practice it's not uncommon for PDF documents to re-use the same TilingPatterns more than once, and parsing them is essentially equal to parsing of a (small) page since a `getOperatorList` call is required.
By caching the internal TilingPattern representation we can thus avoid having to re-parse the same data over and over, and there's also *less* asynchronous parsing required for repeated TilingPatterns.
Initially I had intended to include (standard) benchmark results with this patch, however it's not entirely clear that this is actually necessary here given the preliminary results.
When testing this manually in the development viewer, using `pdfBug=Stats`, the following (approximate) reduction in rendering times were observed when comparing `master` against this patch:
- http://pubs.usgs.gov/sim/3067/pdf/sim3067sheet-2.pdf (from issue 2765): `6800 ms` -> `4100 ms`.
- https://github.com/mozilla/pdf.js/files/1046131/stepped.pdf (from issue 8473): `54000 ms` -> `13000 ms`
- https://github.com/mozilla/pdf.js/files/1046130/proof.pdf (from issue 8473): `5900 ms` -> `2500 ms`
As always, whenever you're dealing with documents which are "slow", there's usually a certain level of subjectivity involved with regards to what's deemed acceptable performance.
Hence it's not clear to me that we want to regard any of the referenced issues as fixed, however the improvements are significant enough to warrant caching of TilingPatterns in my opinion.
This simplifies/consolidates the ESLint configuration slightly in the `src/` folder, and prevents the addition of any new files where `var` is being used.[1]
Hence we no longer need to manually add `/* eslint no-var: error */` in files, which is easy to forget, and can instead disable the rule in the `src/core/` files where `var` is still in use.
---
[1] Obviously the `no-var` rule can, in the same way as every other rule, be disabled on a case-by-case basis where actually necessary.
- Check that the "Info"-entry, in the XRef-trailer, is actually a dictionary before accessing it. This is similar to the `PDFDocument.documentInfo` method and follows the general principal of validating data carefully before accessing it, given how often PDF-software may create corrupt PDF files.
- Slightly simplify the "XFA"-lookup, since there's no point in trying to fetch something from the empty dictionary.
Currently there's nothing that prevents modification of the `Dict.empty` primitive, which obviously needs to be *truly* empty to prevent any future (hard to find) bugs.
This patch contains a possible approach for fixing issue 12294, which compared to other PRs is purposely limited to the affected `WidgetAnnotation` code.
As mentioned elsewhere, considering that we're (at least for now) trying to fix *one specific* case, I think that we should avoid modifying the `Dict` primitive[1] and/or avoid a solution that (indirectly) modifies an existing `Dict`-instance[2].
This patch simply fixes the issue at hand, since that seems easiest for now, and I'd suggest that we worry about a more general approach if/when that actually becomes necessary.
Hence the solution implemented here, for `WidgetAnnotation`, is to simply use a combination of the local *and* AcroForm /DR resources during OperatorList-parsing to ensure that things work correctly regardless of where a particular /Font resource is found.
For saving of form-data, on the other hand, we want to avoid increasing the file-size unnecessarily and need to be smarter than just merging all of the available resources. To achive this, a new `WidgetAnnotation._getSaveFieldResources` method will when necessary produce a combined resources `Dict` with only the minimum amount of data from the AcroForm /DR resources included.
---
[1] You want to avoid anything that could cause the general `Dict` implementation to become slower, or more complex, just for handling an edge-case in my opinion.
[2] If an existing `Dict`-instance is modified unexpectedly, that could very easily lead to problems elsewhere since e.g. `Dict`-instances created during parsing are not expected to be changed.
For these streams, compared to `Stream` and `ChunkedStream`, there's no well defined concept of length and consequently no `length` getter.[1] However, attempting to access the non-existent `length` won't currently error, but just return `undefined`, which could thus easily lead to bugs elsewhere in the code-base.
---
[1] However, note that *all* stream implementations have an `isEmpty` getter which can be used instead.