In the referenced PDF document the fonts have /CIDToGIDMap-entries that cannot be loaded. Hence, only when `ignoreErrors` is set, we'll now ignore these corrupt /CIDToGIDMap-entries and fallback to simply assume that no such data is available.
Given that this is *clearly* a case of a corrupt PDF document, there's no guarantee that this will "fix" things in the general case since a /CIDToGIDMap may be *required* in order for some composite fonts to render correctly. However, attempting to render *something* is surely better than skipping a font altogether.
Note how we currently throw a "raw" Error, which is problematical since all of the `PartialEvaluator.loadFont` call-sites expect a Promise to be returned. Furthermore, this also means that we don't benefit from the fallback code-path that now exists below.
*Please note:* Unfortunately I don't have a test-case that fails without this patch, since it's something I happened to notice when reading the code while working on another patch.
This extends PR 13461, by also building a fallback bounding box for Type3 fonts that contain a much too small /FontBBox-entry.
*Please note:* While this patch improves things overall, copy-and-pasting still doesn't work perfectly for this document. In particular the lowercase letter "c" cannot be selected/copied, however this can be reproduced in both Adobe Reader and PDFium (in Google Chrome) too, which is caused by a lack of proper /ToUnicode-data in the PDF document.
In the `src/display/canvas.js` code the `d1` operator will be used to set the clipping region, and it obviously cannot be empty since that prevents the Type3-glyph from rendering.
Also, the patch removes an outdated comment; refer to PR 12718.
This limits the heuristics for handling of incomplete path operators, see PR 9838, to only apply to *sequences* of such operators. In practice a couple of invalid path operators are (hopefully) unlikely to completely break rendering, whereas a sequence of them will easily lead to fairly chaotic rendering artifacts.
The current `lastModified`-getter, which only contains a time-stamp, is a fairly crude way of detecting if the stored data has actually been changed. In particular, when the `getRawValue`-method is used, the `lastModified`-getter doesn't cope with data being modified from the "outside".
To fix these issues[1], and to prevent any future bugs in this code, this patch introduces a new `AnnotationStorage.hash`-getter which computes a hash of the currently stored data. To simplify things this re-uses the existing `MurmurHash3_64`-implementation, which required moving that file into the `src/shared/`-folder, since its performance should be good enough here.
---
[1] Given how the `AnnotationStorage.lastModified`-getter was used, this would have been limited to *printing* of forms.
As it turns out, most of the code-paths in the `PDFImage`-class won't actually pass the TypedArray (containing the image-data) to the `ColorSpace`-code. Hence we *generally* don't need to force the image-data to be a `Uint8ClampedArray`, and can just as well directly use a `Uint8Array` instead.
In the following cases we're returning the data without any `ColorSpace`-parsing, and the exact TypedArray used shouldn't matter:
- b72a448327/src/core/image.js (L714)
- b72a448327/src/core/image.js (L751)
In the following cases the image-data is only used *internally*, and again the exact TypedArray used shouldn't matter:
- b72a448327/src/core/image.js (L762) with the actual image-data being defined (as `Uint8ClampedArray`) further below
- b72a448327/src/core/image.js (L837)
*Please note:* This is tagged `api-minor` because it's API-observable, given that *some* image/mask-data will now be returned as `Uint8Array` rather than using `Uint8ClampedArray` unconditionally. However, that seems like a small price to pay to (slightly) reduce memory usage during image-conversion.
While working on PR 14825, I couldn't help noticing that the code to increment the `count` for cached ImageMasks was repeated multiple times. Hence it makes sense, as far as I'm concerned, to move this into a helper function instead.
Currently we only insert optionalContent-data into the operatorList the first time that an image is parsed, which will (in hindsight) obviously cause problems for cached images.
Hence we also need to insert the optionalContent-data in the various worker-thread image caches, such that it can be accessed in the fast-paths that are used to skip re-parsing of images.
In order to reduce the amount of repeated code, this patch also adds a new `OperatorList`-method that takes care of inserting the necessary data in the operatorList.
In the referenced PDF document the fonts have /Encoding-entries that are Streams (containing completely bogus data), which are thus obviously not valid here.
Hence, only when `ignoreErrors` is set, we'll now ignore these corrupt /Encoding-entries and fallback to the existing code to try and infer a usable encoding.
Given that this is *clearly* a case of corrupt PDF documents, there's no guarantee that this will "fix" all such cases, however it's the best that we do here and shouldn't really be worse than ignoring an entire font.
- most of the time the current transform is a scaling one (modulo translation),
hence it's possible to avoid to apply the transform on each bbox and then apply
it a posteriori;
- compute the bbox when it's possible in the worker.
- it's the second part of the fix for https://bugzilla.mozilla.org/show_bug.cgi?id=857031;
- some image masks can be used several times but at different positions;
- an image need to be pre-process before to be rendered:
* rescale it;
* use the fill color/pattern.
- the two operations above are time consuming so we can cache the generated canvas;
- the cache key is based on the current transform matrix (without the translation part)
and the current fill color when it isn't a pattern.
- the rendering of the pdf in the above bug is really faster than without this patch.
- it aims to partially fix performance issue reported: https://bugzilla.mozilla.org/show_bug.cgi?id=857031;
- the idea is too avoid to use byte arrays but use ImageBitmap which are a way faster to draw:
* an ImageBitmap is Transferable which means that it can be built in the worker instead of in the main thread:
- this is achieved in using an OffscreenCanvas when it's available, there is a bug to enable them
for pdf.js: https://bugzilla.mozilla.org/show_bug.cgi?id=1763330;
- or in using createImageBitmap: in Firefox a task is sent to the main thread to build the bitmap so
it's slightly slower than using an OffscreenCanvas.
* it's transfered from the worker to the main thread by "reference";
* the byte buffers used to create the image data have a very short lifetime and ergo the memory used is globally
less than before.
- Use the localImageCache for the mask;
- Fix the pdf issue4436r.pdf: it was expected to have a binary stream for the image;
- Move the singlePixel trick from operator_list to image: this way we can use this trick even if it isn't in a set
as defined in operator_list.
- it aims to fix issue #14627;
- the basic idea of the recent text refactoring was to only consider the rendered visible whitespaces.
But sometimes, the heuristics aren't correct and although some whitespaces are in the text stream
they weren't in the text chunks because they were too small. Hence we added some exceptions, for example,
we always add a whitespace when it is between two non-whitespace chars but only when in the same Tj.
So basically, this patch removes the constraint to have the chars in the same Tj
(in using a circular buffer to save the two last chars) but don't add a space when the visible space is really
too small (hence `NOT_A_SPACE_FACTOR`).
Note that the Prettier update made it possible to move a couple of comments after `default:`-cases back to their original/intended positions, please see https://prettier.io/blog/2022/03/16/2.6.0.html
This patch removes the existing `forEach` methods, in favor of making the classes properly iterable instead. Given that the classes are using a `Set` respectively a `Map` internally, implementing this is very easy/efficient and allows us to simplify some existing code.
The situation described in issue 14626 seems like a fairly special case, and it thus seem reasonable that we simply follow the same pattern as elsewhere in the `PartialEvaluator` when the `stopAtErrors` API-option is being used.
The call-sites are replaced by direct `typeof`-checks instead, which removes unnecessary function calls. Note that in the `src/`-folder we already had more `typeof`-cases than `isString`-calls.
The call-sites are replaced by direct `typeof`-checks instead, which removes unnecessary function calls. Note that in the `src/`-folder we already had more `typeof`-cases than `isNum`-calls.
These changes were *mostly* done using regular expression search-and-replace, with two exceptions:
- In `Font._charToGlyph` we no longer unconditionally update the `width`, since that seems completely unnecessary.
- In `PDFDocument.documentInfo`, when parsing custom entries, we now do the `typeof`-check once.
Unless you actually need to check that something is both a `Name` and also of the *correct* type, using `instanceof Name` directly should be a tiny bit more efficient since it avoids one function call and an unnecessary `undefined` check.
This patch uses ESLint to enforce this, since we obviously still want to keep the `isName` helper function for where it makes sense.
Unless you actually need to check that something is both a `Dict` and also of the *correct* type, using `instanceof Dict` directly should be a tiny bit more efficient since it avoids one function call and an unnecessary `undefined` check.
This patch uses ESLint to enforce this, since we obviously still want to keep the `isDict` helper function for where it makes sense.
This helper function is not really needed, since it's just a wrapper around a simple `instanceof` check, and it only adds unnecessary indirection in the code.
At this point all the various Stream-classes extends an abstract base-class, hence this helper function is no longer necessary and only adds unnecessary indirection in the code.
This appears to be consistent with the behaviour in both Adobe Reader and PDFium (in Google Chrome); this is essentially the same approach as used for a single decimal point in PR 9827.
With these changes, we'll now *always* replace all whitespaces with standard spaces (0x20). This behaviour is already, since many years, the default in both the viewer and the browser-tests.
- it aims to fix#14497;
- previously, only rotations with an angle 0, 90, 180 or 270 were taken into account;
- so generalize to any angle but keep the fast path for 0, 90, ... because they're likely more common than anything else.
After the changes in PR 14428 we can *directly*, and more efficiently, handle whitespace conversion in `PartialEvaluator.getTextContent` when the `normalizeWhitespace` option is being used.
This way we no longer need a separate helper function for this, and can avoid having to (again) iterate through the text and checking each character. Finally, this also removes the need for using a regular expression on e.g. all non-ASCII text.
In corrupt PDF documents Type3 fonts may introduce circular dependencies, thus resulting in the affected font(s) never loading and parsing/rendering never completing.
Note that I've not seen any real-world examples of this kind of font corruption, but the attached PDF document was rather found in https://github.com/pdf-association/safedocs/tree/main/Miscellaneous%20Targeted%20Test%20PDFs
*Please note:* That repository contains a number of reduced test-cases that are specifically intended to test interoperability (between PDF viewer) and parsing/rendering for various kinds of strange/corrupt PDF documents.
Some of the test-cases found there may thus not make sense to try and "fix" upfront, in my opinion, unless the problems are also found in real-world PDF documents.
*Please note:* These changes will primarily benefit longer documents, somewhat at the expense of e.g. one-page documents.
The existing `PDFDocumentProxy.getStats` function, which in the default viewer is called for each rendered page, requires a round-trip to the worker-thread in order to obtain the current document stats. In the default viewer, we currently make one such API-call for *every rendered* page.
This patch proposes replacing that method with a *synchronous* `PDFDocumentProxy.stats` getter instead, combined with re-factoring the worker-thread code by adding a `DocStats`-class to track Stream/Font-types and *only send* them to the main-thread *the first time* that a type is encountered.
Note that in practice most PDF documents only use a fairly limited number of Stream/Font-types, which means that in longer documents most of the `PDFDocumentProxy.getStats`-calls will return the same data.[1]
This re-factoring will obviously benefit longer document the most[2], and could actually be seen as a regression for one-page documents, since in practice there'll usually be a couple of "DocStats" messages sent during the parsing of the first page. However, if the user zooms/rotates the document (which causes re-rendering), note that even a one-page document would start to benefit from these changes.
Another benefit of having the data available/cached in the API is that unless the document stats change during parsing, repeated `PDFDocumentProxy.stats`-calls will return *the same identical* object.
This is something that we can easily take advantage of in the default viewer, by now *only* reporting "documentStats" telemetry[3] when the data actually have changed rather than once per rendered page (again beneficial in longer documents).
---
[1] Furthermore, the maximium number of `StreamType`/`FontType` are `10` respectively `12`, which means that regardless of the complexity and page count in a PDF document there'll never be more than twenty-two "DocStats" messages sent; see 41ac3f0c07/src/shared/util.js (L206-L232)
[2] One example is the `pdf.pdf` document in the test-suite, where rendering all of its 1310 pages only result in a total of seven "DocStats" messages being sent from the worker-thread.
[3] Reporting telemetry, in Firefox, includes using `JSON.stringify` on the data and then sending an event to the `PdfStreamConverter.jsm`-code.
In that code the event is handled and `JSON.parse` is used to retrieve the data, and in the "documentStats"-case we'll then iterate through the data to avoid double-reporting telemetry; see https://searchfox.org/mozilla-central/rev/8f4c180b87e52f3345ef8a3432d6e54bd1eb18dc/toolkit/components/pdfjs/content/PdfStreamConverter.jsm#515-549
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=931481;
- real space chars are pushed in the chunk but when there is an extra spacing, the next char position must be compared with the previous one;
- for example, an extra spacing can cancel a space so visually there are no space.
In `beginGroup` we create a new canvas that is the size of the
bounding box and we translate it to the offset. This means we don't need to
also apply the bounding box during `paintFormXObjectBegin`.
This improves #6961 quite a bit, but it still is missing the indention
in the ruler.
- PR #13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues.
- the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn;
- no space are "drawn": it just moves the cursor but they aren't added in the chunk;
- so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one.
- to make difference between real spaces and tracking ones, we used a factor of the space width (from the font)
- it was a pretty good idea in general but it fails with some fonts where space was too big:
- in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
This is unfortunately *yet another* bug in the `preEvaluateFont`-implementation, and I've lost count of the number of times I've had to tweak this code over the years :-(
I really cannot help thinking that PR 4423 was way too simplistic, since it missed a bunch of cases that leads to broken font rendering in many PDF documents.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1734802
In this particular case the `CMap`-data that we create contains only numbers, but no strings, which causes `PartialEvaluator.readToUnicode` to create a ToUnicode-map with only empty strings.
*Please note:* This is yet another case where I don't know if it's necessarily the best and most correct solution, but it does fix the referenced issue.