The `sizes` property doesn't appear to have been used ever since the code was first split into main/worker-threads, which is so many years ago that I wasn't able to easily find exactly in which PR/commit it became unused.
The `encoding` property is always assigned the `properties.baseEncoding` value, however the `PartialEvaluator` doesn't actually compute/set that value any more. Again it was difficult to determine when it became unused, but it's been that way for years.
Given the way that "classes" were previously implemented in PDF.js, using regular functions and closures, there's a fair number of false positives when the `no-shadow` ESLint rule was enabled.
Note that while *some* of these `eslint-disable` statements can be removed if/when the relevant code is converted to proper `class`es, we'll probably never be able to get rid of all of them given our naming/coding conventions (however I don't really see this being a problem).
This rule is *not* currently enabled in mozilla-central, but it appears commented out[1] in the ESLint definition file; see https://searchfox.org/mozilla-central/rev/c80fa7258c935223fe319c5345b58eae85d4c6ae/tools/lint/eslint/eslint-plugin-mozilla/lib/configs/recommended.js#238-239
Unfortunately this rule is, for fairly obvious reasons, impossible to `--fix` automatically (even partially) and each case thus required careful manual analysis.
Hence this ESLint rule is, by some margin, probably the most difficult one that we've enabled thus far. However, using this rule does seem like a good idea in general since allowing variable shadowing could lead to subtle (and difficult to find) bugs or at the very least confusing code.
Please find additional details about the ESLint rule at https://eslint.org/docs/rules/no-shadow
---
[1] Most likely, a very large number of lint errors have prevented this rule from being enabled thus far.
Fixes#11718 in which the `ff` ligature glyph is at index zero in a CFF font. Beacuse this is a CIDFont, glyph names are CIDs, which are integers. Thus the string `".notdef"` is not correct. The rest of the charset data is already parsed correctly as integers when the boolean argument `cid` is true.
*This is part of a series of patches that will try to split PR 11566 into smaller chunks, to make reviewing more feasible.*
Once all the code has been fixed, we'll be able to eventually enable the ESLint no-shadow rule; see https://eslint.org/docs/rules/no-shadow
Please note that the `setPDFNetworkStreamFactory` functionality isn't exposed in the public API, i.e. not listed among the exports in the `src/pdf.js` file, and that even if it were it wouldn't really be useful considering that none of the `PDFNetworkStream`/`PDFFetchStream`/`PDFNodeStream` classes are exported either.
These methods were deprecated already in PDF.js version `2.1.266`, see PRs 10246 and 10369, and were converted to throw `Error`s upon invocation in PDF.js version `2.4.456`, see PR 11219.
Hence it ought to be possible to remove these methods now.
*This is part of a series of patches that will try to split PR 11566 into smaller chunks, to make reviewing more feasible.*
Once all the code has been fixed, we'll be able to eventually enable the ESLint no-shadow rule; see https://eslint.org/docs/rules/no-shadow
When JPEG images are decoded by the browser, on the main-thread, there's a handful of short-lived copies of the image data; see c3f4690bde/src/display/api.js (L2364-L2408)
That code thus becomes quite problematic for very big JPEG images, since it increases peak memory usage a lot during decoding. In the referenced issue there's a couple of JPEG images whose dimensions are `10006 x 7088` (i.e. ~68 mega-pixels), which causes the *peak* memory usage to increase by close to `1 GB` (i.e. one giga-byte) in my testing.
By letting the PDF.js JPEG decoder, rather than the browser, handle very large images the *peak* memory usage is considerably reduced and the allocated memory also seem to be reclaimed faster.
*Please note:* This will lead to movement in some existing `eq` tests.
This property has never been documented and/or *intentionally* exposed through the API, instead the `PDFPageProxy.pageNumber` property is the documented/intended API to use here.
Hence pageIndex is changed to a "private" property on `PDFPageProxy` instances, and internal API functionality is also updated to *consistently* use `this._pageIndex` rather than a mix of formats.
Don't accidentally accept invalid glyphNames which *appear* to follow the Cdd{d}/cdd{d} format in `PartialEvaluator._buildSimpleFontToUnicode` (issue 11697)
The /Differences array of the problematic font contains a `/c.1` entry, which is consequently detected as a *possible* Cdd{d}/cdd{d} glyphName by the existing heuristics.
Because of how the base 10 conversion is implemented, which is necessary for the base 16 special case, the parsed charCode becomes `0.1` thus causing `String.fromCodePoint` to throw since that obviously isn't a valid code point.
To fix the referenced issue, and to hopefully prevent similar ones in the future, the patch adds *additional* validation of the charCode found by the heuristics.
Trying to enable the ESLint rule `no-shadow`, against the `master` branch, would result in a fair number of errors in the `Glyph` class in `src/core/fonts.js`.
Since the glyphs are exposed through the API, we can't very well change the `isSpace` property on `Glyph` instances. Thus the best approach seems, at least to me, to simply rename the `isSpace` helper function to `isWhiteSpace` which shouldn't cause any issues given that it's only used in the `src/core/` folder.
Rather than duplicating the lookup and caching in multiple files, it seems easier to simply move all of this functionality into `src/shared/util.js` instead.
This will also help avoid a bunch of ESLint errors once the `no-shadow` rule is eventually enabled.
*This patch fixes something that's annoyed me every now and then over the years, when debugging/fixing corrupt PDF documents.*
For corrupt PDF documents where `Lexer.getHexString` encounters invalid characters, there's very rarely just a handful of them. In practice it's not uncommon for there to be many hundreds, or even many thousands, invalid hex characters found.
Not only is the resulting console warning spam utterly useless in these cases, there's often enough of it that performance may even suffer; hence this patch which limits the amount of messages that any one `Lexer.getHexString` invocation may print.
The PDF document in question is *corrupt*, since it contains an XObject with a truncated dictionary and where the stream contents start without a "stream" operator.
The only reason for the `return undefined;` lines was to appease the ESLint `consistent-return` rule, but that's not actually necessary if you make use of the fact that the method is `async` and that we can thus await the Promise rather than returning it.
Note that `Dict.set` will only be called with values returned through `Parser.getObj`, and thus indirectly via `Lexer.getObj`. Since neither of those methods will ever return `undefined`, we can simply assert that that's the case when inserting data into the `Dict` and thus get rid of `in` checks when doing the data lookups.
In this case, since `Dict.set` is fairly hot, the patch utilizes an *inline check* and when necessary a direct call to `unreachable` to not affect performance of `gulp server/test` too much (rather than always just calling `assert`).
For very large and complex PDF files this will help performance *slightly*, since `Dict.{get, getAsync, has}` is called *a lot* during parsing in the worker.
This patch was tested using the PDF file from issue 2618, i.e. http://bugzilla-attachments.gnome.org/attachment.cgi?id=226471, with the following manifest file:
```
[
{ "id": "issue2618",
"file": "../web/pdfs/issue2618.pdf",
"md5": "",
"rounds": 250,
"type": "eq"
}
]
```
which gave the following results when comparing this patch against the `master` branch:
```
-- Grouped By browser, stat --
browser | stat | Count | Baseline(ms) | Current(ms) | +/- | % | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | --- | ----- | -------------
Firefox | Overall | 250 | 2838 | 2820 | -18 | -0.65 | faster
Firefox | Page Request | 250 | 1 | 2 | 0 | 11.92 | slower
Firefox | Rendering | 250 | 2837 | 2818 | -19 | -0.65 | faster
```
This patch deprecates the existing `getOpenActionDestination` API method, in favor of a better and more general `getOpenAction` method instead. (For now JavaScript actions, related to printing, are still handled as before.)
By clearly separating "regular" Print actions from the JavaScript handling, it's thus possible to get rid of the somewhat annoying and strictly incorrect warning when the viewer loads.
Fixes#11477
The PDF draws many space characters but the embedded fonts don't have a glyph named `space`, so `.notdef` should be drawn instead. PDF.js assumed that Type1 fonts define `.notdef` as the first glyph (index 0). However, now the fonts have the glyph `A` at index 0 and `.notdef` is the last one, so `A` appears where spaces are expected.
Because the rest of the font machinery in `core/fonts.js` assumes `.notdef` is at index zero, it's easiest to modify `core/type1_parser.js` so that it "repairs" fonts and makes sure `.notdef` is at index 0.
The PDF document in question is *corrupt*, since it contains multiple instances of incorrect operators.
We obviously don't want to slow down parsing of *all* documents (since most are valid), just to accommodate a particular bad PDF generator, hence the reason for the inline check before calling the `ensureStateFont` method.
It's no longer necessary to special-case this getter in the `GenericFontLoader` case, since the GENERIC build hasn't been using `mozPrintCallback` for years now (furthermore Firefox 63 is really old as well).
This patch extends the existing heuristics, which are really the best that we can do in general for these kinds of non-embedded *and* non-standard fonts.
Furthermore, this patch also tries to improve the copy-and-paste behaviour for non-embedded Wingdings fonts by also using the `ZapfDingbatsEncoding` in this case.
*Note:* I'm not sure that adding additional tests for Wingdings fonts matters that much, given how limited our "support" for them really is.
Given that all of these primitives implement caching, to avoid unnecessarily duplicating those objects *a lot* during parsing, it would thus be good to actually enforce usage of `Cmd.get()`/`Name.get()`/`Ref.get()` in the code-base.
Luckily it turns out that there's an ESLint rule, which is fairly easy to use, that can be used to disallow arbitrary JavaScript syntax.
Please find additional details about the ESLint rule at https://eslint.org/docs/rules/no-restricted-syntax
*This whole patch feels somewhat arbitrary, and I'd be slightly worried about possibly breaking something else.*
To limit the impact of these changes, we only re-parse JPEG images using a reduced `scanLines` value if and only if: An unexpected EOI (End of Image) marker was encountered during decoding of Scan data *and* the "actual" `scanLines` value is at least one order of magnitude smaller than expected.
In some cases PDF documents can contain JPEG images that the native browser decoder cannot handle, e.g. images with DNL (Define Number of Lines) markers or images where the SOF (Start of Frame) marker contains a wildly incorrect `scanLines` parameter.
Currently, for "simple" JPEG images, we're relying on native image decoding to *fail* before falling back to the implementation in `src/core/jpg.js`. In some cases, note e.g. issue 10880, the native image decoder doesn't outright fail and thus some images may not render.
In an attempt to improve the current situation, this patch adds additional validation of the JPEG image SOF data to force the use of `src/core/jpg.js` directly in cases where the native JPEG decoder cannot be trusted to do the right thing.
The only way to implement this is unfortunately to parse the *beginning* of the JPEG image data, looking for a SOF marker. To limit the impact of this extra parsing, the result is cached on the `JpegStream` instance and this code is only run for images which passed all of the pre-existing "can the JPEG image be natively rendered and/or decoded" checks.
---
*Slightly off-topic:* Working on this *really* makes me start questioning if native rendering/decoding of JPEG images is actually a good idea.
There's certain kinds of JPEG images not supported natively, and all of the validation which is now necessary isn't "free". At this point, in the `NativeImageDecoder`, we're having to check for certain properties in the image dictionary, parse the `ColorSpace`, and finally read the actual image data to find the SOF marker.
Furthermore, we cannot just send the image to the main-thread and be done in the "JpegStream" case, but we also need to wait for rendering to complete (or fail) before continuing with other parsing.
In the "JpegDecode" case we're even having to parse part of the image on the main-thread, which seems completely at odds with the principle of doing all heavy parsing in the Worker, and there's also a couple of potentially large (temporary) allocations/copies of TypedArray data involved as well.
Given that this is completely unused, and that a "normal" function call may be a *tiny* bit more efficient, there's no good reason as far as I can tell to keep it.
Please note that these changes do *not* affect the *public* interface of the `Metadata` class, but only touches internal structures.[1]
These changes were prompted by looking at the `getAll` method, which simply returns the "private" metadata object to the consumer. This seems wrong conceptually, since it allows way too easy/accidental changes to the internal parsed metadata.
As part of fixing this, the internal metadata was changed to use a `Map` rather than a plain Object.
---
[1] Basically, we shouldn't need to worry about someone depending on internal implementation details.
- Remove the "capturing group" in the regular expression that removes leading "junk" from the raw metadata, since it's not necessary here (it's simply a case of too much copy-pasting in a prior patch).
According to [MDN](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Cheatsheet#Groups_and_ranges) you want to, for performance reasons, avoid "capturing groups" unless actually needed.
- Add inline comments to document a bunch of magic values in the code.
Given that all `TypedArray` polyfills were removed in PDF.js version `2.0`, since native support is now required, this branch has been dead code for awhile.
In practice it's extremely rare[1] for the padding to be zero in *all* components, hence it seems better to just set it directly rather than creating a temporary variable and checking for the "no padding"-case.
---
[1] In the `tracemonkey.pdf` file that only happens with `0.08%` of all text elements.
Over the years there's been a fair number of issues/PRs opened, where people have wanted to add `hasOwnProperty` checks in (hot) loops in the font parsing code. This has always been rejected, since we don't want to risk reducing performance in the Firefox PDF viewer simply because some users of the general PDF.js library are *incorrectly* extending the `Array.prototype` with enumerable properties.
With this patch the general PDF.js library will now fail immediately with a hopefully useful Error message, rather than having (some) fonts fail to render, when the `Array.prototype` is incorrectly extended.
Note that I did consider making this a warning, but ultimately decided against it since it's first of all possible to disable those (with the `verbosity` parameter). Secondly, even when printed, warnings can be easy to overlook and finally a warning may also *seem* OK to ignore (as opposed to an actual Error).
This patch makes the following changes, to improve these API methods:
- Let `PDFPageProxy.cleanup` return a boolean indicating if clean-up actually happened, since ongoing rendering will block clean-up.
Besides being used in other parts of this patch, it seems that an API user may also be interested in the return value given that clean-up isn't *guaranteed* to happen.
- Let `PDFDocumentProxy.cleanup` return the promise indicating when clean-up is finished.
- Improve the JSDoc comment for `PDFDocumentProxy.cleanup` to mention that clean-up is triggered on *both* threads (without going into unnecessary specifics regarding what *exactly* said data actually is).
Add a note in the JSDoc comment about not calling this method when rendering is ongoing.
- Change `WorkerTransport.startCleanup` to throw an `Error` if it's called when rendering is ongoing, to prevent rendering from breaking.
Please note that this won't stop *worker-thread* clean-up from happening (since there's no general "something is rendering"-flag), however I'm not sure if that's really a problem; but please don't quote me on that :-)
All of the caches that's being cleared in `Catalog.cleanup`, on the worker-thread, *should* be re-filled automatically even if cleared *during* parsing/rendering, and the only thing that probably happens is that e.g. font data would have to be re-parsed.
On the main-thread, on the other hand, clearing the caches is more-or-less guaranteed to cause rendering errors, since the rendering code in `src/display/canvas.js` isn't able to re-request any image/font data that's suddenly being pulled out from under it.
- Last, but not least, add a couple of basic unit-tests for the clean-up functionality.
While it would be nice to change the `PDFFormatVersion` property, as returned through `PDFDocumentProxy.getMetadata`, to a number (rather than a string) that would unfortunately be a breaking API change.
However, it does seem like a good idea to at least *validate* the PDF header version on the worker-thread, rather than potentially returning an arbitrary string.
This should hopefully be useful in environments where restrictive CSPs are in effect.
In most cases the replacement is entirely straighforward, and there's only a couple of special cases:
- For the `src/display/font_loader.js` and `web/pdf_outline_viewer.js `cases, since the elements aren't appended to the document yet, it shouldn't matter if the style properties are set one-by-one rather than all at once.
- For the `web/debugger.js` case, there's really no need to set the `padding` inline at all and the definition was simply moved to `web/viewer.css` instead.
*Please note:* There's still *a single* case left, in `web/toolbar.js` for setting the width of the zoom dropdown, which is left intact for now.
The reasons are that this particular case shouldn't matter for users of the general PDF.js library, and that it'd make a lot more sense to just try and re-factor that very old code anyway (thus fixing the `setAttribute` usage in the process).
Based on the PDF spec, with `v` operator, current point should be used as the first control point of the curve.
Do not overwrite current point before an SVG curve is built, so it can b actually used as first control point.
As can be seen in the code, the `xScaleBlockOffset` typed array doesn't depend on the actual image data but only on the width and x-scale. The width is obviously consistent for an image, and it turns out that in practice the `componentScaleX` is quite often identical between two (or more) adjacent image components.
All-in-all it's thus not necessary to *unconditionally* re-compute the `xScaleBlockOffset` when getting the JPEG image data.
While avoiding, in many cases, one or more loops can never be a bad thing these changes are unfortunately completely dominated by the rest of the JpegImage code and consequently doesn't really show up in benchmark results. *Hence I'd understand if this patch is ultimately deemed not necessary.*
*Note:* This is inspired by PR 5473, which made similar changes for another kind of JPEG data.
Since the implementation in `src/core/jpg.js` only supports 8-bit data, as opposed to similar code in `src/core/colorspace.js`, the computations can be further simplified since the `scale` is always constant.
By updating the coefficients, effectively inlining the `scale`, we'll thus avoid *four* multiplications for each loop iteration.
Unfortunately I wasn't able, based on a quick look through the test-files, to find a sufficiently *large* CMYK JPEG image in order for these changes to really show up in benchmark results. However, when testing the `cmykjpeg.pdf` manually there's a total of `120 000` fewer multiplication with this patch.
In the current `AnnotationLayer` implementation, Popup annotations require that the parent annotation have already been rendered (otherwise they're simply ignored).
Usually the annotations are ordered, in the `/Annots` array, in such a way that this isn't a problem, however there's obviously no guarantee that all PDF generators actually do so. Hence we simply ensure, when rendering the `AnnotationLayer`, that the Popup annotations are handled last.
Interestingly the viewer already seem to work correctly as-is, with workers disabled and a non-standard `verbosity` level.
Hence this is possibly Node.js specific, but given that the issue is lacking *both* the PDF file in question and a runnable test-case, so this patch is essentially a best-effort guess at what the problem could be.
- Re-factor the "incorrect encoding" check, since this can be easily achieved using the general `findNextFileMarker` helper function (with a suitable `startPos` argument).
- Tweak a condition, to make it easier to see that the end of the data has been reached.
- Add a reference test for issue 1877, since it's what prompted the "incorrect encoding" check.
The other image decoders, i.e. the JBIG2 and JPEG 2000 ones, are using the common helper function `readUint16`. Most likely, the only reason that the JPEG decoder is doing it this way is because it originated *outside* of the PDF.js library.
Hence we can simply re-factor `src/core/jpg.js` to use the common `readUint16` helper function, which is especially nice given that the functionality was essentially *duplicated* in the code.
This covers cases that the `--fix` command couldn't deal with, and in a few cases (notably `src/core/jbig2.js`) the code was changed to use block-scoped variables instead.
Please find additional details about the ESLint rule at https://eslint.org/docs/rules/prefer-const
With the recent introduction of Prettier this sort of mass enabling of ESLint rules becomes a lot easier, since the code will be automatically reformatted as necessary to account for e.g. changed line lengths.
Note that this patch is generated automatically, by using the ESLint `--fix` argument, and will thus require some additional clean-up (which is done separately).
These two cases should have been whitelisted prior to re-formatting respectively had the comments fixed afterwards, however I unfortunately missed them because of the massive size of the diff.
Fixes#11403
The PDF uses the non-embedded Type1 font Helvetica. Character codes 194 and 160 (`Â` and `NBSP`) are encoded as `.notdef`. We shouldn't show those glyphs because it seems that Acrobat Reader doesn't draw glyphs that are named `.notdef` in fonts like this.
In addition to testing `glyphName === ".notdef"`, we must test also `glyphName === ""` because the name `""` is used in `core/encodings.js` for undefined glyphs in encodings like `WinAnsiEncoding`.
The solution above hides the `Â` characters but now the replacement character (space) appears to be too wide. I found out that PDF.js ignores font's `Widths` array if the font has no `FontDescriptor` entry. That happens in #11403, so the default widths of Helvetica were used as specified in `core/metrics.js` and `.nodef` got a width of 333. The correct width is 0 as specified by the `Widths` array in the PDF. Thus we must never ignore `Widths`.
This removes a couple of, thanks to preceeding code, unnecessary `typeof PDFJSDev` checks, and also fixes a couple of incorrectly implemented (my fault) checks intended for `TESTING` builds.
After PR 9566, which removed all of the old Firefox extension code, the `FIREFOX` build flag is no longer used for anything.
It thus seems to me that it should be removed, for a couple of reasons:
- It's simply dead code now, which only serves to add confusion when looking at the `PDFJSDev` calls.
- It used to be that `MOZCENTRAL` and `FIREFOX` was *almost* always used together. However, ever since PR 9566 there's obviously been no effort put into keeping the `FIREFOX` build flags up to date.
- In the event that a new, Webextension based, Firefox addon is created in the future you'd still need to audit all `MOZCENTRAL` (and possibly `CHROME`) build flags to see what'd make sense for the addon.
This particular JSDoc comment is fairly old and it also contains some now unrelated/confusing information.
The only way to *guarantee* that the PDF.js library works as expected is to correctly set the global `workerSrc`[1], hence giving the impression that the option isn't strictly necessary is thus incorrect.
---
[1] Since advertising the fallbackWorkerSrc functionality definitely seems like the *wrong* thing to do.
This way we'll benefit from the existing font caching, and can thus avoid re-creating a fallback font over and over again during parsing.
(Thece changes necessitated the previous patch, since otherwise breakage could occur e.g. with fake workers.)