In practice it's not uncommon for PDF documents to re-use the same TilingPatterns more than once, and parsing them is essentially equal to parsing of a (small) page since a `getOperatorList` call is required.
By caching the internal TilingPattern representation we can thus avoid having to re-parse the same data over and over, and there's also *less* asynchronous parsing required for repeated TilingPatterns.
Initially I had intended to include (standard) benchmark results with this patch, however it's not entirely clear that this is actually necessary here given the preliminary results.
When testing this manually in the development viewer, using `pdfBug=Stats`, the following (approximate) reduction in rendering times were observed when comparing `master` against this patch:
- http://pubs.usgs.gov/sim/3067/pdf/sim3067sheet-2.pdf (from issue 2765): `6800 ms` -> `4100 ms`.
- https://github.com/mozilla/pdf.js/files/1046131/stepped.pdf (from issue 8473): `54000 ms` -> `13000 ms`
- https://github.com/mozilla/pdf.js/files/1046130/proof.pdf (from issue 8473): `5900 ms` -> `2500 ms`
As always, whenever you're dealing with documents which are "slow", there's usually a certain level of subjectivity involved with regards to what's deemed acceptable performance.
Hence it's not clear to me that we want to regard any of the referenced issues as fixed, however the improvements are significant enough to warrant caching of TilingPatterns in my opinion.
This simplifies/consolidates the ESLint configuration slightly in the `src/` folder, and prevents the addition of any new files where `var` is being used.[1]
Hence we no longer need to manually add `/* eslint no-var: error */` in files, which is easy to forget, and can instead disable the rule in the `src/core/` files where `var` is still in use.
---
[1] Obviously the `no-var` rule can, in the same way as every other rule, be disabled on a case-by-case basis where actually necessary.
- Check that the "Info"-entry, in the XRef-trailer, is actually a dictionary before accessing it. This is similar to the `PDFDocument.documentInfo` method and follows the general principal of validating data carefully before accessing it, given how often PDF-software may create corrupt PDF files.
- Slightly simplify the "XFA"-lookup, since there's no point in trying to fetch something from the empty dictionary.
Currently there's nothing that prevents modification of the `Dict.empty` primitive, which obviously needs to be *truly* empty to prevent any future (hard to find) bugs.
This patch contains a possible approach for fixing issue 12294, which compared to other PRs is purposely limited to the affected `WidgetAnnotation` code.
As mentioned elsewhere, considering that we're (at least for now) trying to fix *one specific* case, I think that we should avoid modifying the `Dict` primitive[1] and/or avoid a solution that (indirectly) modifies an existing `Dict`-instance[2].
This patch simply fixes the issue at hand, since that seems easiest for now, and I'd suggest that we worry about a more general approach if/when that actually becomes necessary.
Hence the solution implemented here, for `WidgetAnnotation`, is to simply use a combination of the local *and* AcroForm /DR resources during OperatorList-parsing to ensure that things work correctly regardless of where a particular /Font resource is found.
For saving of form-data, on the other hand, we want to avoid increasing the file-size unnecessarily and need to be smarter than just merging all of the available resources. To achive this, a new `WidgetAnnotation._getSaveFieldResources` method will when necessary produce a combined resources `Dict` with only the minimum amount of data from the AcroForm /DR resources included.
---
[1] You want to avoid anything that could cause the general `Dict` implementation to become slower, or more complex, just for handling an edge-case in my opinion.
[2] If an existing `Dict`-instance is modified unexpectedly, that could very easily lead to problems elsewhere since e.g. `Dict`-instances created during parsing are not expected to be changed.
For these streams, compared to `Stream` and `ChunkedStream`, there's no well defined concept of length and consequently no `length` getter.[1] However, attempting to access the non-existent `length` won't currently error, but just return `undefined`, which could thus easily lead to bugs elsewhere in the code-base.
---
[1] However, note that *all* stream implementations have an `isEmpty` getter which can be used instead.
* Move display/xml_parser.js in shared to use it in worker
* Save form data in XFA datasets when pdf is a mix of acroforms and xfa
Co-authored-by: Brendan Dahl <brendan.dahl@gmail.com>
This allows for merging of dictionaries one level deeper than previously. This could be useful e.g. for /Resources dictionaries, where you want to e.g. merge their respective /Font dictionaries (and other) together rather than picking just the first one.
The `/Order` array is used to improve the display of Optional Content groups in PDF viewers, and it allows a PDF document to e.g. specify that Optional Content groups should be displayed as a (collapsable) tree-structure rather than as just a list.
Note that not all available Optional Content groups must be present in the `/Order` array, and PDF viewers will often (by default) hide those toggles in the UI.
To allow us to improve the UX around toggling of Optional Content groups, in the default viewer, these hidden-by-default groups are thus appended to the parsed `/Order` array under a *custom* nesting level (with `name == null`).
Finally, the patch also slightly tweaks an `OptionalContentConfig` related JSDoc-comment in the API.
In issue 12120, the font has a 1,0 cmap and is marked symbolic which
according to the spec means we should directly use the cmap instead of
the extra steps that are defined in 9.6.6.4.
However, just fixing that caused bug 1057544 to break. The font in bug
1057544 has a 0,1 cmap (Unicode 1.1) which we were not using, but is
easy to support. We're also easily able to support some of the other
unicode cmaps, so I added those as well.
There was also a second issue with bug 1057544, the cmap doesn't have
a mapping for the "quoteright" glyph, but it is defined in the post
table. To handle this, I've moved post table as a fallback for any
font that has an encoding.
Now that the `parse` method is simplified we can inline the `setup`
method in the `parse` method since it's only two lines of code. This
avoids some indirection.
Good form type detection is important to get reliable telemetry and to
only show the fallback bar if a form cannot be filled out by the user.
PDF.js only supports AcroForm data, so XFA data is explicitly unsupported
(tracked in issue #2373). However, the previous form type detection
couldn't separate AcroForm and XFA well enough, causing form type
telemetry to be incorrect sometimes and the fallback bar to be shown for
forms that could in fact be filled out by the user.
The solution in this commit is found by studying the specification and
the form documents that are available to us. In a nutshell the rules are:
- There is XFA data if the `XFA` entry is a non-empty array or stream.
- There is AcroForm data if the `Fields` entry is a non-empty array and
it doesn't consist of only document signatures.
The document signatures part was not handled in the old code, causing a
document with only XFA data to also be marked as having AcroForm data.
Moreover, the old code didn't check all the data types.
Now that AcroForm and XFA can be distinguished, the viewer is configured
to only show the fallback bar for documents that only have XFA data. If
a document also has AcroForm data, the viewer can use that to render the
form. We have not found documents where the XFA data was necessary in
that case.
Finally, we include unit tests to ensure that all cases are covered and
move the form type detection out of the `parse` function so that it's
only executed if the document information is actually requested
(potentially making initial parsing a tiny bit faster).
Not only is `catDict` never accessed anymore outside of this file, it
should also never happen since it's internal to the catalog. If data
from it is needed elsewhere, the catalog should provide a getter for it
that can do basic data integrity checks and abstract away any
unnecessary details.
The `AcroForm` entry is part of the catalog, not of the document, so its
logic should be placed there instead. The document should look in the
catalog to fetch it, and not have knowledge of `catDict`, which is a
member internal to the catalog.
Moreover, make the AcroForm member private on the document instance. It's
only used internally and was also never intended to be public. For users
it's exposed by the `getMetadata` API endpoint as `IsAcroFormPresent`.
Only a boolean is exposed, so we now also only store the boolean on the
document instance.
Finally, the annotation code needs access to the full AcroForm
dictionary, so it's updated to fetch the data from the catalog instead
of the document that now only holds the boolean.
The `Collection` entry is part of the catalog, not of the document, so
its logic should be placed there instead. The document should look in the
catalog to fetch it, and not have knowledge of `catDict`, which is a
member internal to the catalog.
Moreover, remove the collection member from the document instance. It's
only used internally and was also never intended to be public. For users
it's exposed by the `getMetadata` API endpoint as `IsCollectionPresent`.
Moving this out of the `parse` function makes sure that the getter is
only executed if the document information is actually requested
(potentially making initial parsing a tiny bit faster).
The `Version` entry is part of the catalog, not of the document, so its
logic should be placed there instead. The document should look in the
catalog to fetch it, and not have knowledge of `catDict`, which is a
member internal to the catalog.
Moreover, make the version member private on the document instance. It's
only used internally and was also never intended to be public. For users
it's exposed by the `getMetadata` API endpoint as `PDFFormatVersion`.
Finally, clarify how the version from the header and the version from
the catalog are treated using a comment.
Even though the code obviously works as-is, given that we have unit-tests for it, it still feels incorrect to just *assume* that the `Catalog`-instance has all of its properties immediately available. Especially when (almost) all of the other handlers, in `src/core/worker.js`, protect their data accesses with appropriate `pdfManager.ensure` calls.
Even though the code obviously works as-is, given that we have unit-tests for it, it still feels incorrect to just *assume* that the `XRef`-instance has all of its properties immediately available. Especially when (almost) all of the other handlers, in `src/core/worker.js`, protect their data accesses with appropriate `pdfManager.ensure` calls.
In `display/canvas.js` the accent offsets must be multiplied by `fontSize` to make the offsets large enough. Another problem is in `core/type1_parser.js` when the Type1 command `seac` is handled. There is an error in the Adobe Type1 spec. See chapter 6 in Type1 Font Format Supplement, which provides an errata: The arguments of `seac` specify the offset of the left side bearing (LSB) points, not the offset of origins. This can be fixed in `core/type1_parser.js` by adding the difference of the LSB values.
The down appearance (`D`) is optional and not available in the document
from #12233, so the checkboxes are never saved/printed as checked
because the checked appearance is based on the export value that is
missing because the `D` entry is not available.
Instead, we should use the normal appearance (`N`) since that one is
required and therefore always available.
Finally, the /Off appearance is optional according to section 12.7.4.2.3
of the specification, so that needs to be taken into account to match
the specification and to fix reference test failures for the
`annotation-button-widget-print` test. That is a file that doesn't
specify an /Off appearance in the normal appearance dictionary.
The helper method `_decodeFormValue` is used to ensure that it happens
in one place. Note that form values are field values, display values
and export values.
The specification states that the field value is `null` if no item is
selected and we didn't handle this case properly. Even though this did
not break the rendering because we always convert the value to an array
and the `includes` check in the display layer would simply not match,
the field value would be `[null]` which is not expected and strange from
an API perspective.
This commit fixes that by ensuring that we return an empty array in
case the field value is `null`. The API therefore still always gives an
array for the field value, but now the code is more specific so that the
value is either an empty array or an array of strings.
This is *similar* to the existing transfer function support for SMasks, but extended to simple image data.
Please note that the extra amount of data now being sent to the worker-thread, for affected /ExtGState entries, is limited to *at most* 4 `Uint8Array`s each with a length of 256 elements.
Refer to https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G9.1658137 for additional details.
I completely overlooked the fact that `PartialEvaluator.handleSetFont` also updates the current `state`, which means that currently we're not actually handling font data correctly for cached /ExtGState data. (Thankfully, using /ExtGState to set a font is somewhat rare in practice.)
Some fonts have loca tables that aren't sorted or use 0 as an offset to
signal a missing glyph. This fixes the bad loca tables by sorting them
and then rewriting the loca table and potentially re-ordering the glyf
table to match.
Fixes#11131 and bug 1650302.