There's obviously no guarantee that this will work in general, if the document is sufficiently corrupt, but it should hopefully be better than just throwing `InvalidPDFException` as currently happens.
Please note that, as is often the case with corrupt documents, it's somewhat difficult to know if we're rendering the document "correctly" with this patch[1]. In this case even Adobe Reader cannot open the document, which is always a good sign that it's *really* corrupt, however we're at least able to render *something* with this patch.
---
[1] Whatever "correct" even means when dealing with corrupt PDF documents, where often times different PDF viewers won't agree completely.
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=931481;
- real space chars are pushed in the chunk but when there is an extra spacing, the next char position must be compared with the previous one;
- for example, an extra spacing can cancel a space so visually there are no space.
In `beginGroup` we create a new canvas that is the size of the
bounding box and we translate it to the offset. This means we don't need to
also apply the bounding box during `paintFormXObjectBegin`.
This improves #6961 quite a bit, but it still is missing the indention
in the ruler.
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1739502;
- when the target area was the current content area, everything was pushed in it instead of creating a new one (and consequently a new pageArea is created).
- the pdf shows an alignment issue on page 4:
- the hAlign is "center" but the subform was the width of its parent, so compute the real width of the subform with tb layout;
- there is an extra empty page at the end of the pdf:
- there is a subform with some hidden elements which are not rendered for now (since there is no plugged JS engine it isn't possible to draw them in changing their visibility).
- so in case a subform is empty and has no real dimensions (at least one is 0), we just consider it as empty.
Subfrom nomin displays even though it's subform is set to <occur max=-1 min=0>
If we look through specs of XFA 3.3 : https://www.pdfa.org/norm-refs/XFA-3_3.pdf
- The min attribute is used when processing a form that contains data. Regardless of the data at least this number of instances is included. It is permissible to set this value to zero, in which case the container is entirely excluded if there is no data for it.
However, in our case it doesn't happen, because we let our empty dataNode get through. Though by setting a clause:
- eliminate unmatched data with occur min=0
we are checking our empty data and sending it to uselessNode array where at the end it gets removed;
Very short strings can narrowly miss the existing Bidi-detection threshold, leading to incorrect text-selection and copying behaviour.
In my testing, neither Adobe Reader or PDFium seem to handle copying "correctly" for this document. Hence it's not entirely clear to me that we actually want to fix this, since tweaking these heuristics can *obviously* cause regressions elsewhere (and our test coverage for RTL-text isn't exactly great).
Trying to render these Annotation-types, when the borderWidth is `0`, causes a "hairline" border to appear. If these Annotations included an appearance stream, as they are supposed to, this wouldn't have happened and the simplest solution here seem to be to just ignore these particular Annotations.
- PR #13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues.
- the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn;
- no space are "drawn": it just moves the cursor but they aren't added in the chunk;
- so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one.
- to make difference between real spaces and tracking ones, we used a factor of the space width (from the font)
- it was a pretty good idea in general but it fails with some fonts where space was too big:
- in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
*Please note:* This is a tentative patch, since I don't have the necessary a11y-software to actually test it.
To avoid having to add a new API-method just for a single string, I figured that adding the new property to the existing `documentInfo`-data (accessed via `PDFDocumentProxy.getMetadata` in the API) will hopefully be deemed acceptable.
In these cases there's no good reason, in my opinion, to duplicate the `shadow`-lines since that unnecessarily increases the risk of simple typos (see the previous patch).
With this typo the shadowing doesn't actually work, which causes these checks to be unnecessarily repeated. In this particular case it didn't have a significant performance impact, however we should definately fix this nonetheless.
If a PDF included an embedded TrueType font whose preferred character
map (cmap) was in "format 2", the code would select that character map
and then refuse to read it because of an unsupported format, thus
causing the characters not to be rendered. This commit implements
support for format 2 as described at the link below.
https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html
This is unfortunately *yet another* bug in the `preEvaluateFont`-implementation, and I've lost count of the number of times I've had to tweak this code over the years :-(
I really cannot help thinking that PR 4423 was way too simplistic, since it missed a bunch of cases that leads to broken font rendering in many PDF documents.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1734802
- the exact sentence from the spec:
"The token SOLIDUS (a slash followed by no regular characters) introduces a unique valid name defined by the empty sequence of characters."
- so just remove the warning.
Note how both the annotationLayer and the document outline will apply various URL-related options when creating the link-elements.
For consistency the `xfaLayer`-rendering should obviously use the same options, to ensure that the existing options are indeed applied to all URLs regardless of where they originate.
- it aims to fix#12721.
- Thanks to PR #14023, we've now the fieldObjects in the annotation layer so we can easily map fields names on their id if needed.
- Reset values in the storage, in the JS sandbox and in the visible html elements.
In the referenced bug, the embedded fonts contain custom CMap-data that only include strings. Note how for embedded composite TrueType fonts we're using the CMap-data when building the glyph mapping, and currently we end up with a completely empty map because the code expects only CID *numbers*.
Furthermore, just fixing the glyph mapping alone isn't sufficient to fully address the bug, since we also need to consider this "special" kind of CMap-data when looking up glyph widths.
Having recently worked with, and reviewed patches touching, this code it seemed that it's probably not a bad idea to move that functionality into `createValidAbsoluteUrl` as new options instead.
For the `addDefaultProtocolToUrl` functionality in particular, the existing helper function was not only moved but slightly improved as well. Looking at the code, I realized that there's a small risk that it would incorrectly match a *relative* URL-string too.
With these changes, the `createValidAbsoluteUrl` call-sites in the `src/core/`-code can be simplified a little bit.
*Please note:* This patch may, indirectly, change the format of the `unsafeUrl`-property returned with relevant Annotations and OutlineItems; hence the `api-minor` tag.
However, I'd argue that it's actually more correct this way since the whole purpose of `unsafeUrl` is/was to return the URL data as-is without any parsing done.
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1716758;
- some buttons have a JS action with the pattern `app.launchURL(...)` (or similar) so extract when it's possible the url and generate a <a> element with the href equals to the found url;
- pdf.js already had some code to handle that so this patch slightly refactor that.
- it aims to fix#14021;
- the N dict is empty here so just create a default one;
- it implies that the checked checkbox has no appearance so create a default one too in order to print it;
- in the pdf in the issue, a checked box is not printed because it has no default appearance so we need to guess its appearance from its state.
In order to implement this, we utilize the existing `bidi` function to infer the text-direction of /T and /Contents entries. While this may not be perfect in cases where one PopupAnnotation mixes LTR and RTL languages, it should work well enough in most cases.
To avoid having to add *two new* properties in lots of annotations, supplementing the existing `title`/`contents`-properties, this patch instead re-factors the existing code such that the properties are replaced by Objects (containing `str` and `dir`).
*Please note:* In order avoid breaking existing third-party implementations, `GENERIC`-builds of the PDF.js library will still provide the old `title`/`contents`-properties on annotations returned by `PDFPageProxy.getAnnotations`.
- In the pdf in issue #14071, some select fields don't contain any values;
- the corresponding node has a bindItems and a bind elements and _bindItems function was just not called.