Commit Graph

2495 Commits

Author SHA1 Message Date
Calixte Denizet
cfdaa57353 Handle sub/super-scripts in rich text
- it aims to fix #14317;
 - change the fontSize and the verticalAlign properties according to the position of the text.
2021-11-27 16:06:09 +01:00
Jonas Jenwald
4c56214ab4 Convert PDFDocument._getLinearizationPage to an async method
This, ever so slightly, simplifies the code and reduces overall indentation.
2021-11-26 19:57:47 +01:00
Jonas Jenwald
080996ac68 Change the _pagePromises cache, in the worker, from an Array to a Map
Given that not all pages necessarily are being accessed, or that the pages may be accessed out of order, using a `Map` seems like a more appropriate data-structure here.
Furthermore, this patch also adds (currently missing) caching for XFA-documents. Loading a couple of such documents in the viewer, with logging added, shows that we're currently re-creating `Page`-instances unnecessarily for XFA-documents.
2021-11-26 19:53:57 +01:00
Jonas Jenwald
ca8d2bdce4 Abort parsing when the XRef /W-array contain bogus entries (issue 14303)
For this particular PDF document, we have `/W [1 2 166666666666666666666666666]` which obviously makes no sense.

While this patch makes no attempt at actually validating the entries in the /W-array, we'll now simply abort all processing when the end of the PDF document has been reached (thus preventing hanging the browser).
Please note that this patch doesn't enable the PDF document to be loaded/rendered, but at least it fails "correctly" now.

Fixes one of the issues listed in issue 14303, namely the `REDHAT-1531897-0.pdf`document.
2021-11-25 18:35:08 +01:00
Jonas Jenwald
ae4f1ae3e7 Ensure that ChunkedStream won't attempt to request data *beyond* the document size (issue 14303)
This bug was surprisingly difficult to track down, since it didn't just depend on range-requests being used but also on how quickly the document was loaded. To even be able to reproduce this locally, I had to use a very small `rangeChunkSize`-value (note the unit-test).

The cause of this bug is a bogus entry in the XRef-table, causing us to attempt to request data from *beyond* the actual document size and thus getting into an infinite loop.

Fixes *one* of the issues listed in issue 14303, namely the `PDFBOX-4352-0.pdf` document.
2021-11-24 19:19:43 +01:00
Jonas Jenwald
6da0944fc7 [api-minor] Replace PDFDocumentProxy.getStats with a synchronous PDFDocumentProxy.stats getter
*Please note:* These changes will primarily benefit longer documents, somewhat at the expense of e.g. one-page documents.

The existing `PDFDocumentProxy.getStats` function, which in the default viewer is called for each rendered page, requires a round-trip to the worker-thread in order to obtain the current document stats. In the default viewer, we currently make one such API-call for *every rendered* page.
This patch proposes replacing that method with a *synchronous* `PDFDocumentProxy.stats` getter instead, combined with re-factoring the worker-thread code by adding a `DocStats`-class to track Stream/Font-types and *only send* them to the main-thread *the first time* that a type is encountered.

Note that in practice most PDF documents only use a fairly limited number of Stream/Font-types, which means that in longer documents most of the `PDFDocumentProxy.getStats`-calls will return the same data.[1]
This re-factoring will obviously benefit longer document the most[2], and could actually be seen as a regression for one-page documents, since in practice there'll usually be a couple of "DocStats" messages sent during the parsing of the first page. However, if the user zooms/rotates the document (which causes re-rendering), note that even a one-page document would start to benefit from these changes.

Another benefit of having the data available/cached in the API is that unless the document stats change during parsing, repeated `PDFDocumentProxy.stats`-calls will return *the same identical* object.
This is something that we can easily take advantage of in the default viewer, by now *only* reporting "documentStats" telemetry[3] when the data actually have changed rather than once per rendered page (again beneficial in longer documents).

---
[1] Furthermore, the maximium number of `StreamType`/`FontType` are `10` respectively `12`, which means that regardless of the complexity and page count in a PDF document there'll never be more than twenty-two "DocStats" messages sent; see 41ac3f0c07/src/shared/util.js (L206-L232)

[2] One example is the `pdf.pdf` document in the test-suite, where rendering all of its 1310 pages only result in a total of seven "DocStats" messages being sent from the worker-thread.

[3] Reporting telemetry, in Firefox, includes using `JSON.stringify` on the data and then sending an event to the `PdfStreamConverter.jsm`-code.
In that code the event is handled and `JSON.parse` is used to retrieve the data, and in the "documentStats"-case we'll then iterate through the data to avoid double-reporting telemetry; see https://searchfox.org/mozilla-central/rev/8f4c180b87e52f3345ef8a3432d6e54bd1eb18dc/toolkit/components/pdfjs/content/PdfStreamConverter.jsm#515-549
2021-11-20 12:20:55 +01:00
Tim van der Meij
41ac3f0c07
Merge pull request #14291 from Snuffleupagus/force-postMessageTransfers
[api-minor] Only use Workers when `postMessage` transfers are supported (PR 11123 follow-up)
2021-11-19 20:02:51 +01:00
Brendan Dahl
c6cb39ef30
Merge pull request #14262 from Snuffleupagus/issue-14261
Include the /Lang-property, when it exists, in the StructTree-data (issue 14261)
2021-11-19 07:51:21 -08:00
Jonas Jenwald
6f22327e61 [api-minor] Only use Workers when postMessage transfers are supported (PR 11123 follow-up)
Given that all modern browsers now support `postMessage` transfers, and have for years, it no longer seems necessary for the PDF.js library to support using Workers unless the `postMessage` transfers functionality is available.
This patch is a follow-up to PR 11123, which made it impossible to *manually* disable `postMessage` transfers for performance reasons (since it increases memory usage), which hasn't caused any bug reports as far as I know.[1]

Hence we'll now only support *proper* Worker implementations, with fully working `postMessage` transfers, and fallback to using "fake" Workers otherwise.

---
[1] At the time of that PR we still "supported" IE, which is why this code was left intact.
2021-11-19 16:47:58 +01:00
Brendan Dahl
3209c013c4
Merge pull request #14247 from calixteman/button
[api-minor] Render pushbuttons on their own canvas (bug 1737260)
2021-11-16 08:10:40 -08:00
Jonas Jenwald
971ac8e993 Include the /Lang-property, when it exists, in the StructTree-data (issue 14261)
*Please note:* This is a tentative patch, since I don't have the necessary a11y-software to actually test it.
2021-11-14 12:37:41 +01:00
Jonas Jenwald
a54bed4963 Enable the ESLint no-loss-of-precision rule
Please refer to https://eslint.org/docs/rules/no-loss-of-precision
2021-11-14 10:48:50 +01:00
Jonas Jenwald
afcc99a86d When parsing corrupt documents without any trailer-dictionary, fallback to the "top"-dictionary (issue 14269)
There's obviously no guarantee that this will work in general, if the document is sufficiently corrupt, but it should hopefully be better than just throwing `InvalidPDFException` as currently happens.

Please note that, as is often the case with corrupt documents, it's somewhat difficult to know if we're rendering the document "correctly" with this patch[1]. In this case even Adobe Reader cannot open the document, which is always a good sign that it's *really* corrupt, however we're at least able to render *something* with this patch.

---
[1] Whatever "correct" even means when dealing with corrupt PDF documents, where often times different PDF viewers won't agree completely.
2021-11-13 13:21:38 +01:00
Jonas Jenwald
28fb3975eb
Merge pull request #14266 from calixteman/bug931481
Don't consider space as real space when there is an extra spacing (bug 931481)
2021-11-12 21:42:32 +01:00
Calixte Denizet
a88ff34eb7 Don't consider space as real space when there is an extra spacing (bug 931481)
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=931481;
 - real space chars are pushed in the chunk but when there is an extra spacing, the next char position must be compared with the previous one;
 - for example, an extra spacing can cancel a space so visually there are no space.
2021-11-12 18:53:48 +01:00
Calixte Denizet
5b7e1f5232 XFA - Avoid an exception when looking for a font in a parent node
- it aims to fix issue https://github.com/mozilla/pdf.js/issues/14150;
  - a parent can be null in case the root has been reached, so just add a check.
2021-11-12 16:27:08 +01:00
Calixte Denizet
33ea817b20 [api-minor] Render pushbuttons on their own canvas (bug 1737260)
- First step to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1737260;
 - several interactive pdfs use the possibility to hide/show buttons to show different icons;
 - render pushbuttons on their own canvas and then insert it the annotation_layer;
 - update test/driver.js in order to convert canvases for pushbuttons into images.
2021-11-12 15:37:33 +01:00
Jonas Jenwald
ea1c348c67 Always prefer abbreviated keys, over full ones, when doing any dictionary lookups (issue 14256)
Note that issue 14256 was specifically about *inline* images, please refer to:
 - https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G7.1852045
 - https://www.pdfa.org/safedocs-unearths-pdf-inline-image-issue/
 - https://pdf-issues.pdfa.org/32000-2-2020/clause08.html#H8.9.7

However, during review of the initial PR in https://github.com/mozilla/pdf.js/pull/14257#issuecomment-964469710, it was suggested that we instead do this *unconditionally for all* dictionary lookups.
In addition to re-ordering the existing call-sites in the `src/core`-code, and adding non-PRODUCTION/TESTING asserts to catch future errors, for consistency a number of existing `if`/`switch`-blocks were re-factored to also check the abbreviated keys first.
2021-11-10 11:56:18 +01:00
calixteman
4bb9de4b00
Merge pull request #14239 from calixteman/1739502
XFA - Fix a breakBefore issue when target is a contentArea and startNew is 1 (bug 1739502)
2021-11-08 03:14:42 -08:00
Calixte Denizet
13ae6d493a XFA - Encode tag names in UTF-8 when saving (fix #14249) 2021-11-07 21:41:37 +01:00
calixteman
efb4455749
Merge pull request #14240 from calixteman/14014
XFA - Get each page asynchronously in order to avoid blocking the event loop (#14014)
2021-11-06 13:21:43 -07:00
Calixte Denizet
1681e25008 XFA - Get each page asynchronously in order to avoid blocking the event loop (#14014) 2021-11-06 13:25:03 +01:00
Brendan Dahl
8161d3f29d Don't double apply a group xobject's bbox.
In `beginGroup` we create a new canvas that is the size of the
bounding box and we translate it to the offset. This means we don't need to
also apply the bounding box during `paintFormXObjectBegin`.

This improves #6961 quite a bit, but it still is missing the indention
in the ruler.
2021-11-05 15:40:58 -07:00
Calixte Denizet
a08763f4aa XFA - Fix a breakBefore issue when target is a contentArea and startNew is 1 (bug 1739502)
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1739502;
 - when the target area was the current content area, everything was pushed in it instead of creating a new one (and consequently a new pageArea is created).
 - the pdf shows an alignment issue on page 4:
   - the hAlign is "center" but the subform was the width of its parent, so compute the real width of the subform with tb layout;
 - there is an extra empty page at the end of the pdf:
   - there is a subform with some hidden elements which are not rendered for now (since there is no plugged JS engine it isn't possible to draw them in changing their visibility).
   - so in case a subform is empty and has no real dimensions (at least one is 0), we just consider it as empty.
2021-11-05 18:59:55 +01:00
calixteman
e136afbabc
Merge pull request #14218 from janekotovich/subform_min_0
XFA subform with occur min=0 and no bound data displaying.
2021-11-05 04:12:34 -07:00
Jane-Kotovich
56b502391c XFA subform with occur min=0 and no bound data displaying
Subfrom nomin displays even though it's subform is set to <occur max=-1 min=0>
If we look through specs of XFA 3.3 : https://www.pdfa.org/norm-refs/XFA-3_3.pdf
- The min attribute is used when processing a form that contains data. Regardless of the data at least this number of instances is included. It is permissible to set this value to zero, in which case the container is entirely excluded if there is no data for it.

However, in our case it doesn't happen, because we let our empty dataNode get through. Though by setting a clause:
- eliminate unmatched data with occur min=0
we are checking our empty data and sending it to uselessNode array where at the end it gets removed;
2021-11-04 20:22:05 +10:00
Jonas Jenwald
5f77d3719b Tweak the Bidi-detection heuristics for very short RTL strings (issue 11656)
Very short strings can narrowly miss the existing Bidi-detection threshold, leading to incorrect text-selection and copying behaviour.

In my testing, neither Adobe Reader or PDFium seem to handle copying "correctly" for this document. Hence it's not entirely clear to me that we actually want to fix this, since tweaking these heuristics can *obviously* cause regressions elsewhere (and our test coverage for RTL-text isn't exactly great).
2021-11-03 20:31:57 +01:00
Jonas Jenwald
8c70258065
Merge pull request #14182 from calixteman/richtext
Support rich content in markup annotation
2021-10-31 14:41:56 +01:00
Calixte Denizet
cf8dc750d6 Support rich content in markup annotation
- use the xfa parser but in the xhtml namespace.
2021-10-31 13:44:51 +01:00
Tim van der Meij
ec1633c33c
Merge pull request #14201 from Snuffleupagus/bug-1219400
Use the correct border-style for Annotations, when a dash array is specified (bug 1219400)
2021-10-30 12:39:46 +02:00
Tim van der Meij
0e7614df7f
Merge pull request #14180 from Snuffleupagus/bug-1627427
Handle ranges that "overflow" the last byte in `CMap.mapBfRange` (bug 1627427)
2021-10-27 20:06:09 +02:00
Jonas Jenwald
884caf602e Use the correct border-style for Annotations, when a dash array is specified (bug 1219400)
Even though we cannot use the dash array in the display layer, at least ensure that we use the correct border-style.
2021-10-27 13:20:21 +02:00
Jane-Kotovich
91fc643ff9 [api-minor] Implement securityHandler in the scripting API (bug 1731578) 2021-10-26 23:42:04 +10:00
Jonas Jenwald
aa1b78684f Handle ranges that "overflow" the last byte in CMap.mapBfRange (bug 1627427) 2021-10-24 13:48:38 +02:00
Brendan Dahl
b66239d6dc
Merge pull request #14114 from Snuffleupagus/issue-14110
[api-minor] Include the /Lang-property in the `documentInfo`, and use it in the viewer (issue 14110)
2021-10-19 08:08:08 -07:00
Jonas Jenwald
68e6622c57 Ignore Square/Circle-annnotations with a zero borderWidth when creating a fallback appearance stream (issue 14164)
Trying to render these Annotation-types, when the borderWidth is `0`, causes a "hairline" border to appear. If these Annotations included an appearance stream, as they are supposed to, this wouldn't have happened and the simplest solution here seem to be to just ignore these particular Annotations.
2021-10-19 15:27:42 +02:00
calixteman
bbb64369f1
Merge pull request #13424 from calixteman/chunks2
[api-minor] Fix issues in text selection
2021-10-18 06:14:15 -07:00
Calixte Denizet
61d1063276 Fix issues in text selection
- PR #13257 fixed a lot of issues but not all and this patch aims to fix almost all remaining issues.
  - the idea in this new patch is to compare position of new glyph with the last position where a glyph has been drawn;
    - no space are "drawn": it just moves the cursor but they aren't added in the chunk;
    - so this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one.
  - to make difference between real spaces and tracking ones, we used a factor of the space width (from the font)
    - it was a pretty good idea in general but it fails with some fonts where space was too big:
    - in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
2021-10-17 16:27:05 +02:00
Jonas Jenwald
00720d059a [api-minor] Include the /Lang-property in the documentInfo, and use it in the viewer (issue 14110)
*Please note:* This is a tentative patch, since I don't have the necessary a11y-software to actually test it.

To avoid having to add a new API-method just for a single string, I figured that adding the new property to the existing `documentInfo`-data (accessed via `PDFDocumentProxy.getMetadata` in the API) will hopefully be deemed acceptable.
2021-10-16 14:27:47 +02:00
Jonas Jenwald
0041230072 Re-name the XFAFactory.numberPages getter to XFAFactory.numPages for consistency
All other similar getters are called `numPages` throughout the code-base, and improved consistency should always be a good thing.
2021-10-16 12:56:21 +02:00
Jonas Jenwald
0e5348180e Fix the inconsistent return type of the PDFDocument.isPureXfa getter
Also (slightly) simplifies a couple of small getters/methods related to the `XFAFactory`-instance.
2021-10-16 12:56:20 +02:00
Jonas Jenwald
cd94a44ca1 Remove some duplication in *simple* shadowed getters in src/core/-code
In these cases there's no good reason, in my opinion, to duplicate the `shadow`-lines since that unnecessarily increases the risk of simple typos (see the previous patch).
2021-10-16 12:56:17 +02:00
Jonas Jenwald
1450da4168 Fix a xfaFaxtory typo in the shadowing in the PDFDocument.xfaFactory getter
With this typo the shadowing doesn't actually work, which causes these checks to be unnecessarily repeated. In this particular case it didn't have a significant performance impact, however we should definately fix this nonetheless.
2021-10-16 11:54:12 +02:00
Jane-Kotovich
c2af309917 XFA - Embedded image is missing 2021-10-15 21:12:29 +10:00
Jay Berkenbilt
586295fad6 Implement TrueType character map "format 2" (fixes #14117)
If a PDF included an embedded TrueType font whose preferred character
map (cmap) was in "format 2", the code would select that character map
and then refuse to read it because of an unsupported format, thus
causing the characters not to be rendered. This commit implements
support for format 2 as described at the link below.

https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html
2021-10-13 07:37:14 -04:00
Tim van der Meij
56e3ef68d4
Merge pull request #14106 from calixteman/names
Empty name is allowed in ISO 32000
2021-10-09 14:29:10 +02:00
Jonas Jenwald
69a97bcba7 Take the /CIDToGIDMap data into account when computing the hash, in PartialEvaluator.preEvaluateFont, for composite fonts (bug 1734802)
This is unfortunately *yet another* bug in the `preEvaluateFont`-implementation, and I've lost count of the number of times I've had to tweak this code over the years :-(
I really cannot help thinking that PR 4423 was way too simplistic, since it missed a bunch of cases that leads to broken font rendering in many PDF documents.

Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1734802
2021-10-08 13:15:21 +02:00
Calixte Denizet
f384ad2356 Empty name is allowed in ISO 32000
- the exact sentence from the spec:
    "The token SOLIDUS (a slash followed by no regular characters) introduces a unique valid name defined by the empty sequence of characters."
  - so just remove the warning.
2021-10-06 20:50:39 +02:00
Jonas Jenwald
bb9c905c5d Ensure that various URL-related options are applied in the xfaLayer too
Note how both the annotationLayer and the document outline will apply various URL-related options when creating the link-elements.
For consistency the `xfaLayer`-rendering should obviously use the same options, to ensure that the existing options are indeed applied to all URLs regardless of where they originate.
2021-10-02 09:32:23 +02:00
Jonas Jenwald
284d259054
Merge pull request #14057 from Snuffleupagus/bug-920426
Support CMap-data with only strings, when parsing TrueType composite fonts (bug 920426)
2021-10-01 23:22:25 +02:00
Calixte Denizet
aecbd7cd89 AcroForm: Add support for ResetForm action
- it aims to fix #12721.
  - Thanks to PR #14023, we've now the fieldObjects in the annotation layer so we can easily map fields names on their id if needed.
  - Reset values in the storage, in the JS sandbox and in the visible html elements.
2021-09-30 22:02:33 +02:00
Jonas Jenwald
d3ca28bc34 Support CMap-data with only strings, when parsing TrueType composite fonts (bug 920426)
In the referenced bug, the embedded fonts contain custom CMap-data that only include strings. Note how for embedded composite TrueType fonts we're using the CMap-data when building the glyph mapping, and currently we end up with a completely empty map because the code expects only CID *numbers*.
Furthermore, just fixing the glyph mapping alone isn't sufficient to fully address the bug, since we also need to consider this "special" kind of CMap-data when looking up glyph widths.
2021-09-30 18:10:47 +02:00
Tim van der Meij
9a74f3e6e0
Merge pull request #14049 from calixteman/bg_from_mk
Annotation - Use border and background colors from MK dictionary
2021-09-29 21:13:20 +02:00
Calixte Denizet
0776cd9b90 Annotation - Use border and background colors from MK dictionary
- it aims to fix #13003;
  - set the bg and fg colors as they're in the pdf;
  - put a transparent overlay to help to see the fields.
2021-09-26 20:49:26 +02:00
Jonas Jenwald
e6e04694f4 [api-minor] Move the addDefaultProtocolToUrl/tryConvertUrlEncoding functionality into the createValidAbsoluteUrl function
Having recently worked with, and reviewed patches touching, this code it seemed that it's probably not a bad idea to move that functionality into `createValidAbsoluteUrl` as new options instead.

For the `addDefaultProtocolToUrl` functionality in particular, the existing helper function was not only moved but slightly improved as well. Looking at the code, I realized that there's a small risk that it would incorrectly match a *relative* URL-string too.

With these changes, the `createValidAbsoluteUrl` call-sites in the `src/core/`-code can be simplified a little bit.

*Please note:* This patch may, indirectly, change the format of the `unsafeUrl`-property returned with relevant Annotations and OutlineItems; hence the `api-minor` tag.
However, I'd argue that it's actually more correct this way since the whole purpose of `unsafeUrl` is/was to return the URL data as-is without any parsing done.
2021-09-26 14:29:54 +02:00
Calixte Denizet
558e58f354 XFA - Add <a> element in button when an url is detected (bug 1716758)
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1716758;
  - some buttons have a JS action with the pattern `app.launchURL(...)` (or similar) so extract when it's possible the url and generate a <a> element with the href equals to the found url;
  - pdf.js already had some code to handle that so this patch slightly refactor that.
2021-09-25 21:59:39 +02:00
Calixte Denizet
c0e9108d00 Annotation - Some checkboxes have an empty N dictionary
- it aims to fix #14021;
  - the N dict is empty here so just create a default one;
  - it implies that the checked checkbox has no appearance so create a default one too in order to print it;
  - in the pdf in the issue, a checked box is not printed because it has no default appearance so we need to guess its appearance from its state.
2021-09-25 16:00:47 +02:00
Tim van der Meij
cc110b8542
Merge pull request #14064 from Snuffleupagus/issue-13845
Fallback to font name matching, when checking for serif fonts (issue 13845)
2021-09-25 12:41:57 +02:00
Jonas Jenwald
b23b8d8a5d
Merge pull request #14074 from Snuffleupagus/issue-14046
[api-minor] Add basic support for RTL text-content in PopupAnnotations (issue 14046)
2021-09-25 12:37:44 +02:00
Tim van der Meij
36dc93fe5d
Merge pull request #14065 from Snuffleupagus/fewer-EXPORT_DATA_PROPERTIES
[api-minor] Stop exporting, by default, a few additional Font properties (PR 11777 follow-up)
2021-09-25 12:25:56 +02:00
Jonas Jenwald
1dcd2f0cd3 [api-minor] Add basic support for RTL text-content in PopupAnnotations (issue 14046)
In order to implement this, we utilize the existing `bidi` function to infer the text-direction of /T and /Contents entries. While this may not be perfect in cases where one PopupAnnotation mixes LTR and RTL languages, it should work well enough in most cases.
To avoid having to add *two new* properties in lots of annotations, supplementing the existing `title`/`contents`-properties, this patch instead re-factors the existing code such that the properties are replaced by Objects (containing `str` and `dir`).

*Please note:* In order avoid breaking existing third-party implementations, `GENERIC`-builds of the PDF.js library will still provide the old `title`/`contents`-properties on annotations returned by `PDFPageProxy.getAnnotations`.
2021-09-25 09:18:58 +02:00
calixteman
104e049338
Merge pull request #14073 from calixteman/bindItems
XFA - Bind items when there's a bindItems entry
2021-09-24 09:01:52 -07:00
Calixte Denizet
97c1e076a1 XFA - Bind items when there's a bindItems entry
- In the pdf in issue #14071, some select fields don't contain any values;
  - the corresponding node has a bindItems and a bind elements and _bindItems function was just not called.
2021-09-24 16:08:58 +02:00
Calixte Denizet
cd73e282eb XFA - Create a new page in case of overflow
- it aims to fix #14071;
  - a subform is overflowing and the the target in case of overflow is itself. In this case we must create a new page.
2021-09-24 14:57:55 +02:00
Calixte Denizet
4b0538d07a Don't save anything in XFA entry if no XFA! (bug 1732344)
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1732344
  - rename some variables to have a more clear code;
  - and last but no least, add a unit test to test saving.
2021-09-23 19:51:23 +02:00
Jonas Jenwald
9acfe486d4 Fallback to font name matching, when checking for serif fonts (issue 13845)
In order to handle fonts that specify completely bogus /Flags-entries, fallback to font name matching to determine if the font is a serif one.
2021-09-23 01:11:57 +02:00
Jonas Jenwald
e027748627 [api-minor] Stop exporting, by default, a few additional Font properties (PR 11777 follow-up)
*This is similar to the "isSymbolicFont"-property, which is no longer exported by default after PR 11777.*

Both "isMonospace" and "isSerifFont" are internal properties, used during font parsing and building of the glyph mapping on the worker-thread.
However both of these properties are completely unused on the main-thread and/or in the API, and accessing them they will now require setting the `fontExtraProperties`-option when calling `getDocument`.
2021-09-23 00:44:43 +02:00
Jonas Jenwald
81a1c1cef7 Correctly validate URLs in XFA documents (bug 1731240)
With this patch we'll ensure that only valid absolute URLs can be used in XFA documents, similar to the existing validation done for "regular" PDF documents.
Furthermore, we'll also attempt to add a default protocol (i.e. `http`) to URLs beginning with "www." in XFA documents as well; this on its own is enough to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1731240
2021-09-21 21:21:01 +02:00
Jonas Jenwald
8ea27ce157 Tweak how fonts with an /Encoding are handled in adjustToUnicode (issue 14048, PR 13277 follow-up)
Currently we only exclude /Encoding entries that also contains a /Differences array, which is the cause of the text-selection problem in the referenced issue.
In order to address this we'll now also exclude /Encoding entries that contain one of the predefined *named* encodings, and no longer require that it also contains a /Differences array.

*Please note:* This patch cases a small "regression" in the `bug1130815-text` test-case, however this is actually an improvement when compared with Adobe Reader and PDFium (in Google Chrome).
2021-09-18 22:44:25 +02:00
Tim van der Meij
83d3bb43f4
Merge pull request #14041 from Snuffleupagus/issue-9367
Support cmaps with only CID characters, when building the ToUnicode-map (issue 9367)
2021-09-18 16:47:06 +02:00
Calixte Denizet
2fc10727c5 XFA - Only warn about the wrong xfa type when there is an xfa thing 2021-09-18 15:44:05 +02:00
Jonas Jenwald
e3223b68fc Extract some of the glyphMap handling, for non-embedded composite standard fonts, into a helper function
This reduces some unnecessary duplication, since we currently have essentially the same code in a handful of places in the `Font.fallbackToSystemFont`-method.
2021-09-18 12:39:48 +02:00
Jonas Jenwald
ed73cf6d50 Support cmaps with only CID characters, when building the ToUnicode-map (issue 9367)
In this particular case the `CMap`-data that we create contains only numbers, but no strings, which causes `PartialEvaluator.readToUnicode` to create a ToUnicode-map with only empty strings.

*Please note:* This is yet another case where I don't know if it's necessarily the best and most correct solution, but it does fix the referenced issue.
2021-09-18 00:26:15 +02:00
Calixte Denizet
5bef8120e7 Annotation - For checkboxes, get field value from AS (if any) instead of V (bug 1722036)
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1722036.
  - AS and V should share the same value for checkbox: it's at least what the specs say;
  - the pdf in the above bug opens correctly in Acrobat so it likely means that AS is chosen over V.
2021-09-17 13:04:16 +02:00
Jonas Jenwald
a11343e9af Improve glyph mapping for non-embedded composite standard fonts with a /CIDToGIDMap (issue 11915)
*Please note:* All of this feels very handwavy, but at least it passes all tests locally. Hopefully we have enough tests for this part of the font code.

For non-embedded composite standard fonts with an "incomplete" /CIDToGIDMap, we'll now fallback to an *explicitly defined* /ToUnicode map even when that one happens to be an /Identity-H or /Identity-V map.

The `Font.fallbackToSystemFont` method is unfortunately getting more and more special-cases, however that might be unavoidable given all the weird non-embedded fonts found in the wild :-(
2021-09-15 11:30:40 +02:00
Calixte Denizet
9812e35916 XFA - Don't create images for unsupported mime types 2021-09-14 10:55:25 +02:00
Jonas Jenwald
7025b9f859 [src/core/writer.js] Support null values in the writeValue function
*This fixes something that I noticed, having recently looked at both the `Lexer.getObj` and `writeValue` code.*

Please note that I unfortunately don't have an example of a form where saving fails without this patch. However, given its overall simplicity and that unit-tests are added, it's hopefully deemed useful to fix this potential issue pro-actively rather than waiting for a bug report.

At this point one might, and rightly so, wonder if there's actually any real-world PDF documents where a `null` value is being used?
Unfortunately the answer is *yes*, and we have a couple of examples in the test-suite (although none of those are related to forms); please see: `issue1015`, `issue2642`, `issue10402`, `issue12823`, `issue13823`, and `pr12564`.
2021-09-12 18:24:37 +02:00
Jonas Jenwald
5d578ea36a [src/core/writer.js] Remove unnecessary string-wrapping for boolean values in writeValue (PR 13998 follow-up) 2021-09-12 15:45:45 +02:00
Jonas Jenwald
761519ef3f
Merge pull request #13998 from calixteman/bug1729971
Write boolean value when saving a form (bug 1729971)
2021-09-12 15:38:10 +02:00
Jonas Jenwald
a47844d1fc Let Lexer.getObj return a dummy-Cmd for commands that start with a non-visible ASCII character (issue 13999)
This way we avoid breaking badly generated PDF documents where a non-visible ASCII character is "glued" to a valid command.
2021-09-11 19:54:13 +02:00
Tim van der Meij
e97f01b17c
Merge pull request #13977 from Snuffleupagus/enqueueChunk-batch
[api-minor] Reduce `postMessage` overhead, in `PartialEvaluator.getTextContent`, by sending text chunks in batches (issue 13962)
2021-09-11 13:34:07 +02:00
Jonas Jenwald
9ce63a6dc6
Merge pull request #13991 from brendandahl/interpolate
Enable/disable image smoothing based on image interpolate value. (bug 1722191)
2021-09-11 10:02:53 +02:00
Brendan Dahl
f38fb42b42 Enable/disable image smoothing based on image interpolate value. (bug 1722191)
While some of the output looks worse to my eye, this behavior more
closely matches what I see when I open the PDFs in Adobe acrobat.

Fixes: #4706, #9713, #8245, #1344
2021-09-10 14:23:35 -07:00
Calixte Denizet
474ab7c86d Write boolean value when saving a form (bug 1729971)
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1729971#c4.
2021-09-10 14:10:25 +02:00
Calixte Denizet
c5841b3794 XFA - Handle shorcut in SOM expression (issue #13994) 2021-09-09 19:54:45 +02:00
Jonas Jenwald
45ddb12f61 Remove no-op onPull/onCancel streamSink callbacks from the "GetTextContent"-handler
The `MessageHandler`-implementation already handles either of these callbacks being undefined, hence there's no particular reason (as far as I can tell) to add no-op functions here.

Also, in a couple of `MessageHandler`-methods, utilize an already existing local variable more.
2021-09-09 00:01:10 +02:00
Jonas Jenwald
f90f9466e3 [api-minor] Reduce postMessage overhead, in PartialEvaluator.getTextContent, by sending text chunks in batches (issue 13962)
Following the STR in the issue, this patch reduces the number of `PartialEvaluator.getTextContent`-related `postMessage`-calls by approximately 78 percent.[1]
Note that by enforcing a relatively low value when batching text chunks, we should thus improve worst-case scenarios while not negatively affect all `textLayer` building.

While working on these changes I noticed, thanks to our unit-tests, that the implementation of the `appendEOL` function unfortunately means that the number and content of the textItems could actually be affected by the particular chunking used.
That seems *extremely* unfortunate, since in practice this means that the particular chunking used is thus observable through the API. Obviously that should be a completely internal implementation detail, which is why this patch also modifies `appendEOL` to mitigate that.[2]

Given that this patch adds a *minimum* batch size in `enqueueChunk`, there's obviously nothing preventing it from becoming a lot larger then the limit (depending e.g. on the PDF structure and the CPU load/speed).
While sending more text chunks at once isn't an issue in itself, it could become problematic at the main-thread during `textLayer` building. Note how both the `PartialEvaluator` and `CanvasGraphics` implementations utilize `Date.now()`-checks, to prevent long-running parsing/rendering from "hanging" the respective thread. In the `textLayer` building we don't utilize such a construction[3], and streaming of textContent is thus essentially acting as a *simple* stand-in for that functionality.
Hence why we want to avoid choosing a too large minimum batch size, since that could thus indirectly affect main-thread performance negatively.

---
[1] While it'd be possible to go even lower, that'd likely require more invasive re-factoring/changes to the `PartialEvaluator.getTextContent`-code to ensure that the batches don't become too large.

[2] This should also, as far as I can tell, explain some of the regressions observed in the "enhance" text-selection tests back in PR 13257.
    Looking closer at the `appendEOL` function it should potentially be changed even more, however that should probably not be done here.

[3] I'd really like to avoid implementing something like that for the `textLayer` building as well, given that it'd require adding a fair bit of complexity.
2021-09-09 00:01:07 +02:00
Jonas Jenwald
69034ab8dc Improve glyph mapping for non-embedded composite standard fonts (issue 11088)
For non-embedded CIDFontType2 fonts with a non-/Identity encoding, use the /ToUnicode data to improve the glyph mapping.
2021-09-08 15:15:33 +02:00
Tim van der Meij
680f33c31c
Merge pull request #13961 from Snuffleupagus/simpler-regexp
Simplify some regular expressions
2021-09-04 15:39:30 +02:00
Jonas Jenwald
3ccf277f58 Fallback to the /ToUnicode map for TrueType fonts with (3, 1) and (1, 0) cmap-tables (issue 13316)
In the PDF document some of the glyphs have bogus `differences`-entries[1] that cannot be resolved to valid glyph names, thus causing the glyph mapping to fail.
My initial idea was to use a similar approach as in the `PartialEvaluator._simpleFontToUnicode`-method, to extract the charCodes from those entries, however it turned out that that didn't actually help in this case (the mapping was still wrong).

To fix this I'm thus proposing that we fallback to the /ToUnicode map when no other useable data exists (e.g. no post-table), since it *hopefully* shouldn't make things any worse than leaving parts of the glyph map empty (which currently happens).

---
[1] As can be seem below, some of the entries are completely normal while others are non-standard:
```
Differences (array)
    0 = 65
    1 = /g5167
    2 = /space
    3 = /g11927
    4 = /g17737
    5 = /g11540
    6 = /g2180
    7 = /K
    8 = /P
    9 = /two
    10 = /zero
    11 = /one
    12 = /five
    13 = /four
    14 = /g6932
    15 = /g7246
    16 = /g1691
    17 = /g2343
    18 = /g14792
    19 = /g3325
    20 = /g4280
    21 = /g20383
    22 = /g18166
    23 = /g16988
    24 = /g17943
    25 = /g19223
    26 = /g10830
    27 = 97
    28 = /g982
    29 = /g1226
    30 = /g5059
    31 = /g2677
    32 = /g1042
    33 = /g11568
    34 = /L
    35 = /three
    36 = /seven
    37 = /g2364
    38 = /g12063
    39 = /g5356
    40 = /g2173
    41 = /g17877
    42 = /g7273
    43 = /g7647
    44 = /g7224
    45 = /g19327
    46 = /g5054
    47 = /g2342
    48 = /g10136
    49 = /g6856
    50 = /g13381
    51 = /g7257
    52 = /g12093
    53 = /g2359
```
2021-09-04 07:38:22 +02:00
Brendan Dahl
da15dbf962
Merge pull request #13698 from linfangrong/master
[FIX] fix jpx tag tree decode (issue 11957)
2021-09-03 10:00:19 -07:00
Brendan Dahl
a8ce15a2d7
Merge pull request #13966 from calixteman/no_ns
XFA - Created data node mustn't belong to datasets namespace
2021-09-03 09:59:40 -07:00
Calixte Denizet
77b9657e57 XFA - Overwrite AcroForm dictionary when saving if no datasets in XFA (bug 1720179)
- aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1720179
  - in some pdfs the XFA array in AcroForm dictionary doesn't contain an entry for 'datasets' (which contains saved data), so basically this patch allows to overwrite the AcroForm dictionary with an updated XFA array when doing an incremental update.
2021-09-03 17:04:03 +02:00
Calixte Denizet
57ae3a5a76 XFA - Created data node mustn't belong to datasets namespace
- when some named nodes in the template don't have their counterpart in datasets we create some nodes: the main node mustn't belong to the datasets namespace because it doesn't make sense and Acrobat Reader isn't able to read pdf with such nodes.
  - so created nodes under a datasets node have a namespaceId set to -1 and consequently when serialized no namespace prefix will appear.
2021-09-03 15:43:25 +02:00
Brendan Dahl
804abb3786
Merge pull request #13959 from calixteman/encrypt
Correctly pad strings when saving an encrypted pdf (bug 1726789)
2021-09-02 11:41:02 -07:00
Jonas Jenwald
c42887221a Simplify some regular expressions
There's a fair number of regular expressions througout the code-base which are slightly more verbose than strictly necessary, in particular:
 - We have a lot of regular expressions that use `[0-9]` explicitly, and those can be simplified to use `\d` instead.
 - We have one instance of a regular expression containing a `A-Za-z0-9_` sequence, which can be simplified to use `\w` instead.
2021-09-02 11:50:42 +02:00
Calixte Denizet
9619bf92be Correctly pad strings when saving an encrypted pdf (bug 1726789) 2021-09-02 10:37:21 +02:00
Tim van der Meij
0a366dda6a
Merge pull request #13955 from Snuffleupagus/issue-13433
Always prefer the post-table for TrueType fonts with (0, x) cmap-tables (issue 13433)
2021-09-01 21:46:34 +02:00
Jonas Jenwald
b7b6076294 Always prefer the post-table for TrueType fonts with (0, x) cmap-tables (issue 13433)
While I don't know if this is necessarily the "correct" solution, it does fix issue 13433 without breaking any of the existing reference-tests.
2021-09-01 12:35:49 +02:00
Jonas Jenwald
ba9f004097 Extend getNonStdFontMap for non-embedded versions of the ItcSymbol font (issue 11532)
Despite its name, the fonts in ItcSymbol-family are "regular" fonts and not Symbol ones. However, given that the font name contains the word "Symbol" we ended up picking the wrong code-path in the `Font.fallbackToSystemFont`-method.

*Please note:* While this patch ensures that the text becomes readable, by falling back a standard font, the rendering will obviously not be perfect. However, that's the PDF generators "fault" since non-embedded fonts cannot be guaranteed to render correctly in all environments.
2021-08-31 23:21:16 +02:00