Similar to all other data accesses, note e.g. the "GetDocJSActions" handler just above, we need to ensure that a `MissingDataException` isn't propagated to the main-thread if this data is accessed while the PDF document is still loading.
It's obviously (a bit) more efficient to return early in `Page.getStructTree`, rather than trying to first "parse" an *empty* structTree-root.
*Somehow I didn't think of this yesterday, but this feels like a much better solution overall; sorry about the churn here!*
Looking at the API, there's no code which actually sends this message. Most likely it's a left-over from a previous version of PR 13069, since the `isPureXfa` parameter is being included in the "GetDoc" message.
This is first of all consistent with existing API-methods, where we return `null` when the data in question doesn't exist. Secondly, it should also be (slightly) more efficient since there's less dummy-data that we need to transfer between threads.
Finally, this prevents us from adding an empty/unnecessary span to *every* single page even in documents without any structure tree data.
- but don't validate them for now;
- Firefox will display a bar to warn that the signature validation is not supported (see https://bugzilla.mozilla.org/show_bug.cgi?id=854315)
- almost all (all ?) pdf readers display signatures;
- validation is done in edge but for now it's behind a pref.
When a PDF is "marked" we now generate a separate DOM that represents
the structure tree from the PDF. This DOM is inserted into the <canvas>
element and allows screen readers to walk the tree and have more
information about headings, images, links, etc. To link the structure
tree DOM (which is empty) to the text layer aria-owns is used. This
required modifying the text layer creation so that marked items are
now tracked.
Note how we purposely don't expose the `AnnotationStorage`-class directly in the official API (see `src/pdf.js`), since trying to use *multiple* ones simultaneously doesn't really make sense (e.g. in the viewer).
Instead we lazily initialize, and cache, just *one* instance via `PDFDocumentProxy.annotationStorage` which should thus be available internally in the API itself without having to be manually passed to various methods.
To support these changes, the `AnnotationStorage`-instance initialization is moved into the `WorkerTransport`-class to allow both `PDFDocumentProxy` and `PDFPageProxy` to access it.
This patch implements the following simplifications:
- Remove the `annotationStorage`-parameter from `PDFDocumentProxy.saveDocument`, since it's already available internally.
Furthermore, while it's currently possible to call that method without an `AnnotationStorage`-instance, that really does *not* make any sense at all. In this case you're effectively reducing `PDFDocumentProxy.saveDocument` to a "regular" `PDFDocumentProxy.getData` call, but with *a lot* more overhead, which was obviously not the intention of the `PDFDocumentProxy.saveDocument`-method.
- Try to discourage third-party users from calling `PDFDocumentProxy.saveDocument` unconditionally, as a replacement for `PDFDocumentProxy.getData` (note the previous point).
- Replace the `annotationStorage`-parameter, in `PDFPageProxy.render`, with a boolean `includeAnnotationStorage`-parameter which simply indicates if the (internally available) `AnnotationStorage`-instance should be used during rendering (e.g. for printing).
- By removing the need to *manually* provide `annotationStorage`-parameters to various API-methods, using the API should become simpler (e.g. for third-parties) since you no longer need to worry about manually fetching and passing around this data.
The fontName, as defined in the PDF document, cannot be found in *any* of the "name"-tables in the TrueType Collection font. To work-around that, this patch adds a *fallback* code-path to allow using an approximately matching fontName rather than outright failing.
While this method has only been deprecated in one releases now, the `AnnotationStorage`-functionality is new enough that third-party implementations hopefully don't rely heavily on it just yet. (And removing this quickly should help reduce the likelihood that someone starts using it.)
When removing tasks we're currently forced to *indirectly* iterate through the array, which can be avoided by using a Set instead.
Furthermore, we can also (slightly) modernize the code responsible for initializing the `renderTasks`.
Given that what we actually want is only to keep track of the loadedFont-names, rather than storing any actual data, using an object isn't really necessary here. Furthermore, in the current code, we're also using `in` when checking if the data exists, which is generally less efficient than just checking for the value directly.
As mentioned in the JSDoc comment, this should not be used unless you know what you're doing, since it will lead to increased memory usage. However, in some situations (e.g. SVG-rendering), we still want to be able to run general clean-up on both the main/worker-thread while keeping loaded fonts attached to the DOM.[1]
As part of these changes, `WorkerTransport.startCleanup` is converted to an async method and we'll also skip clean-up when destruction has started (since it's redundant).
---
[1] The SVG-rendering mode is obviously not officially supported, since it's both rather incomplete and inherently slower. However with recent changes, whereby we cache repeated images on the document rather than the page level, memory usage can be *a lot* worse than before if we never attempt to release e.g. cached image-data when the viewer is in SVG-rendering mode.
These two properties were *never* intended to be anything but "private", hence it really cannot hurt to actually indicate that they're *not* part of any official API.
Currently the `fontName`-property contains an actual /Name-instance, which is a problem given that its fallback value is an empty string; see ca7f546828/src/core/default_appearance.js (L35)
The reason that this is a problem can be seen in ca7f546828/src/core/primitives.js (L30-L34), since an empty string short-circuits the cache. Essentially, in PDF documents, a /Name-instance cannot be empty and the way that the `DefaultAppearanceEvaluator` does things is unfortunately not entirely correct.
Hence the `fontName`-property is changed to instead contain a string, rather than a /Name-instance, which simplifies the code overall.
*Please note:* I'm tagging this patch with "[api-minor]", since PR 12831 is included in the current pre-release (although we're not using the `fontName`-property in the display-layer).
Fixes#13107
In the issue, some TrueType glyph names have the format `uniXXXX`.
Font's `Encoding` dictionary has the entry `Differences` but no
`BaseEncoding`. `uniXXXX` names are converted to glyph indices
using font's `post` table but currently that is done only when
`BaseEncoding` exists. We must enable the conversion also when only
`Differences` exists.
Currently only URL-strings are officially supported by `getDocument`, however at this point in time I cannot really see any compelling reason to not support `URL`-objects as well.
Most likely the reason that we've don't already support `URL`-objects, in `getDocument`, is that historically `URL` wasn't fully implemented across browsers and our old polyfill wasn't perfect; see https://developer.mozilla.org/en-US/docs/Web/API/URL/URL#browser_compatibility
*Please note:* Because of how the `url` parameter is currently handled, there's actually *some* cases where passing a `URL`-object to `getDocument` already works. That, in my opinion, provides additional motivation for supporting `URL`-objects officially, since it makes the API more consistent.
The following is an attempt to summarize the *current* situation, based on the actual code rather than the JSDocs:
- `getDocument("url string")` works and is documented.[1]
- `getDocument({ url: "url string", })` works and is documented.[1]
- `getDocument(new URL(...))` throws immediately, since no supported parameters are found.
- `getDocument({ url: new URL(...), })` actually works even though it's not documented.[1] Originally, when data was fetched on the worker-thread, this would likely have thrown since `URL` isn't clonable.[2]
- `getDocument({ url: { abc: 123, }, })`, or some similarily meaningless input, will be "accepted" by `getDocument` and then throw a `MissingPDFException` when attempting to fetch the bogus data.
With the changes in this patch, not only is `URL`-objects now officially supported and documented when calling `getDocument`, but we'll also do a much better job at actually validating any URL-data passed to `getDocument` (and instead fail early).
---
[1] In *browsers*, we create a valid URL thus indirectly validating the input. In Node.js environments, on the other hand, no validation is done since obtaining a baseUrl is more difficult (and PDF.js is primarily written for browsers anyway).
[2] https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Structured_clone_algorithm#supported_types
Given the number of parameters that we now need to parse here, this code is no longer as readable as one would like. Hence this re-factoring, which will improve overall readability and also help with the next patch.
The reasons for making this change are:
- This property is not, nor has it ever been, used anywhere in the PDF.js display-layer.
- Related to the previous point, the format of the `defaultAppearance`-string is such that it'd be difficult to use it as-is in the display-layer anyway.
- It (usually) contains the "raw" appearance-string, from the PDF document, which is neither parsed nor validated and could thus be bogus.
- We now expose a `defaultAppearanceData`-property, which is first of all used in the display-layer and secondly contains actually parsed/validated data.
- In the event that a third-party implementation needs the `defaultAppearance`-string, it could be easily constructed from the recently added `defaultAppearanceData`-property.
All-in-all, I'm thus suggesting that we stop exposing an unused and unnecessary property on all Annotation-instances.
* JS - Handle correctly hierarchy of fields
- it aims to fix#13132;
- annotations can inherit their actions from the parent field;
- there are some fields which act as a container for other fields:
- they can be access through js so need to add them with an empty type (nothing in the spec about that but checked in Acrobat);
- calculation order list (CO) can reference them so need make them through this.getField;
- getArray method must return kids.
- field values are number, string, ... depending of their type but nothing in the spec on how to know what's the type:
- according to the comment for Canonical Format: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#page=461
- it seems that this "type" can be guessed from js action Format (when setting a type in Acrobat DC, the only affected thing is this action).
- util.scand with an empty string returns the current date.
Based on this compatibility information, given that IE 11 is now *explicitly* unsupported, we should no longer need to bundle a `URL` polyfill in any builds: https://developer.mozilla.org/en-US/docs/Web/API/URL/URL#browser_compatibility
Note that the caveat listed for older Safari-versions doesn't apply to any code in the PDF.js library, since we never call `new URL(url, undefined)` in the code-base.
Note also that Node.js has a web-compatible `URL` implementation, which according to the "History" section at https://nodejs.org/api/url.html#url_the_whatwg_url_api has been available since Node.js `10.0.0` (according to https://nodejs.org/en/about/releases/ that branch is one month away from being EOL-ed).
The rotation handling that's currently living in `PDFViewerApplication` is *very* old, and pre-dates the introduction of the viewer components by years.
As can be seen in the `BaseViewer.pagesRotation` setter, we're not actually normalizing the rotation as intended and instead rely on the caller to handle that correctly. This is first of all inconsistent, given how other setters are implemented, and secondly it could also lead to the rotation being set to a value outside of the `[0, 360)`-range.
Finally, for improved consistency the rotation handling in `PageViewport` is updated similarly. Please note that this case, it's *not* changing the pre-existing logic.
- implement few positioning properties: position, width, height, anchor;
- implement font element;
- implement fill element (used by font) and its children (linear, radial, ...);
- font property is inherited from ancestor container (see https://www.pdfa.org/wp-content/uploads/2020/07/XFA-3_3.pdf#page=43) so let CSS handles that stuff;
- in order to reduce the number of properties to set, only set non default properties and put the default in CSS;
- set a background to some containers to be able to see them (will be removed in a future commit).
Similar to the existing `annotationsPromise` and `_jsActionsPromise` properties, the new `_xfaPromise` should obviously also be reset, since otherwise you might end up holding onto a lot of data for pages that are no longer active.
(That caching wasn't present in the original version of PR 13069, which is why I didn't spot it until now.)
- add an option to enable XFA rendering if any;
- for now, let the canvas layer: it could be useful to implement XFAF forms (embedded pdf in xml stream for the background and xfa form for the foreground);
- ui elements in template DOM are pretty close to their html counterpart so we generate a fake html DOM from template one:
- it makes easier to translate template properties to html ones;
- it makes faster the creation of the html element in the main thread.
While there is nothing *outright* wrong with the existing implementation, it can however lead to increased memory usage in one particular case (that I completely overlooked when implementing this):
For "data:"-URLs, which by definition contains the entire PDF document and can thus be arbitrarily large, we obviously want to avoid sending, storing, and/or logging the "raw" docBaseUrl in that case.
To address this, this patch makes the following changes:
- Ignore any non-string in the `docBaseUrl` option passed to `getDocument`, since those are unsupported anyway, already on the main-thread.
- Ignore "data:"-URLs in the `docBaseUrl` option passed to `getDocument`, to avoid having to send what could potentially be a *very* long string to the worker-thread.
- Parse the `docBaseUrl` option *directly* in the `BasePdfManager`-constructors, on the worker-thread, to avoid having to store the "raw" docBaseUrl in the first place.
It seems reasonable to place this alongside the *similar* `getFilenameFromUrl` helper function. This way, with the changes in the next patch, we also avoid having to expose the `isDataScheme` function in the API itself and we instead expose `getPdfFilenameFromUrl` in the API (which feels overall more appropriate).
This extends PR 13033 slightly, with a heuristic to support corrupt PDF documents where the `LineAnnotation`s have an empty /Rect-entry. Please note that while I have no idea if this is "correct", this patch at least makes us output the same /BBox as re-saving in Adobe Reader does.