*Please note:* I'm completely fine with this patch being rejected, and the issue instead closed as WONTFIX, since this is unfortunately a case where the TypeScript definitions dictate how we can/cannot write JavaScript code.
Apparently the TypeScript definitions generation converts the existing `PixelsPerInch` code into a `namespace` and simply ignores the getter; please see a7fc0d33a1/types/src/display/display_utils.d.ts (L223-L226)
Initially I tried tagging `PixelsPerInch` as an `@enum`, see https://jsdoc.app/tags-enum.html, however that unfortunately didn't help.
Hence the only good/simple solution, as far as I'm concerned, is to convert `PixelsPerInch` into a class with `static` properties. This patch results in the following diff, for the `gulp types` build target:
```diff
@@ -195,9 +195,10 @@
*/
static toDateObject(input: string): Date | null;
}
-export namespace PixelsPerInch {
- const CSS: number;
- const PDF: number;
+export class PixelsPerInch {
+ static CSS: number;
+ static PDF: number;
+ static PDF_TO_CSS_UNITS: number;
}
declare const RenderingCancelledException_base: any;
export class RenderingCancelledException extends RenderingCancelledException_base {
```
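For reference, a minimal sketch of what the converted class looks like in plain JavaScript; 96 and 72 are the standard CSS-pixel respectively PDF-point densities per inch:

```js
// Sketch of the class-based replacement; the values are the standard ones.
class PixelsPerInch {
  static CSS = 96.0; // CSS pixels per physical inch.
  static PDF = 72.0; // PDF units (points) per physical inch.
  static PDF_TO_CSS_UNITS = this.CSS / this.PDF; // = 4 / 3 (unitless).
}
```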
Soft masks can be enabled/disabled at anytime and at different
points in the save/restore stack. This can lead to
the amount of save/restores becoming unbalanced across the
two canvases. Instead of save/restoring on the temporary canvas
change it so we only track state on the main (suspended canvas).
I was also getting an out-of-balance stack from patterns, so I've also
fixed that and added a warning that will at least show up in Chrome.
It would be nice to add this for Firefox at some point too.
Fixes #11328, #14297 and bug 1755507
At this point all the various Stream-classes extend an abstract base-class, hence this helper function is no longer necessary and only adds unnecessary indirection in the code.
Unfortunately I don't have a test-case that breaks without this change, however the `stringToPDFString` helper function will fail if anything other than a string is passed to it.
The changes in this patch thus make this code more-or-less identical to that found in the `Catalog.{_collectJavaScript, parseDestDictionary}` methods.
According to the MDN compatibility data, see https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream#browser_compatibility, all browsers that we support have native `ReadableStream` implementations (and have had for quite some time).
Hence only Node.js is now lagging behind w.r.t. `ReadableStream` support, and its experimental implementation doesn't really help us given the life-span of the LTS releases (see https://en.wikipedia.org/wiki/Node.js#Releases).
It seems quite unfortunate to bundle a `ReadableStream` polyfill in the `legacy` builds when it's unnecessary in browsers, given its overall size, but fortunately we can avoid that by simply listing `web-streams-polyfill` as a dependency for the `pdfjs-dist` library.
With recent changes, specifically PR 14515 *and* the previous patch, the `createObjectURL` helper function is now only used with the SVG back-end.
All other call-sites, throughout the code-base, are now using `URL.createObjectURL(...)` directly and it no longer seems necessary to keep exposing the helper function in the API.
Finally, the `createObjectURL` helper function is moved into the `src/display/svg.js` file to avoid unnecessarily duplicating this code on both the main- and worker-threads.
This is essentially a *continuation* of PR 7926, where we added support for rejecting the current `PDFDocumentLoadingTask`-promise by throwing inside of the `onPassword`-callback.
Hence the naive way to address [bug 1754421](https://bugzilla.mozilla.org/show_bug.cgi?id=1754421) would be to simply throw in the `onPassword`-callback used in the default viewer. However it unfortunately turns out to not work, since the password input/validation is asynchronous, and we thus need another approach.
The simplest solution that I can come up with here, is thus to *extend* the `onPassword`-callback to also reject the current `PDFDocumentLoadingTask`-instance if an `Error` is explicitly passed as the input to the callback function. (This doesn't feel great, but I cannot see a better solution that isn't really complicated.)
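A sketch of how a consumer could use the extended callback; `promptUserForPassword` is an illustrative stand-in for the viewer's asynchronous password prompt:

```js
const loadingTask = pdfjsLib.getDocument(url);

loadingTask.onPassword = (updatePassword, reason) => {
  // The prompt is asynchronous, so we cannot simply throw here.
  promptUserForPassword(reason).then(
    password => updatePassword(password),
    () => {
      // Explicitly passing an Error rejects the loadingTask-promise.
      updatePassword(new Error("Password prompt was cancelled."));
    }
  );
};
```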
This appears to be consistent with the behaviour in both Adobe Reader and PDFium (in Google Chrome); this is essentially the same approach as used for a single decimal point in PR 9827.
Please note that while we "support" some (by now) fairly old browsers, that essentially means that the library (and viewer) will load and that the basic functionality will work as intended.[1]
However, in older browsers, some functionality may not be available and generally we'll ask users to update to a modern browser when bugs (specific to old browsers) are reported.[2]
There's always a question of just how old the browsers are that the PDF.js contributors can realistically support, and here I'm suggesting that we place the cut-off point at approximately *three* years.
With that in mind, this patch updates the *minimum* supported browsers (and environments) as follows:
- Chrome 73, which was released on 2019-03-12; see https://en.wikipedia.org/wiki/Google_Chrome_version_history
- Firefox ESR (as before); see https://wiki.mozilla.org/Release_Management/Calendar
- Safari 12.1, which was released on 2019-03-25; see https://en.wikipedia.org/wiki/Safari_version_history#Safari_12
- Node.js 12, which was released on 2019-04-23 (and will soon reach EOL); see https://en.wikipedia.org/wiki/Node.js#Releases
---
[1] Assuming a `legacy`-build is being used, of course.
[2] In general it's never a good idea to use an old/outdated browser, since those may contain *known* security vulnerabilities.
With these changes, we'll now *always* replace all whitespaces with standard spaces (0x20). This behaviour is already, since many years, the default in both the viewer and the browser-tests.
- it aims to fix #14502 and bug 1721335;
- Acrobat and PDFium do the same;
- it'll avoid having truncated data when printing;
- change the factor used to compute the font size from the field height: lineHeight = 1.35 * fontSize (see the sketch after this list);
- this is the value used by Acrobat;
- in order to not have truncated strings at the bottom, add a few basic metrics for standard fonts.
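A small sketch of that font-size computation; the constant and function names are illustrative:

```js
// Assumed names, for illustration; 1.35 is the Acrobat-derived factor
// mentioned in the list above.
const LINE_FACTOR = 1.35;

function computeAutoFontSize(fieldHeight) {
  // lineHeight = 1.35 * fontSize, hence:
  return fieldHeight / LINE_FACTOR;
}
```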
This allows us to remove the manually implemented `structuredClone` polyfill, thus reducing the maintenance burden for the `LoopbackPort` class; refer to https://github.com/zloirock/core-js#structuredclone
*Please note:* While `structuredClone` support landed already in Firefox 94, Google Chrome only added it in version 98 (currently in Beta). However, given that the `LoopbackPort` will only be used together with *fake workers* in browsers this shouldn't be too much of a problem.[1]
For Node.js environments, where *fake workers* are unfortunately necessary, using a `legacy/`-build is already required which thus guarantees that the `structuredClone` polyfill is available.
Also, the patch updates core-js to the latest version since that one includes `structuredClone` improvements; please see https://github.com/zloirock/core-js/releases/tag/v3.20.3
---
[1] Given that we only support browsers with proper worker support, if *fake workers* are being used that essentially indicates a configuration problem/error.
- it aims to fix #14497;
- previously, only rotations with an angle of 0, 90, 180 or 270 degrees were taken into account;
- so generalize to any angle, but keep the fast path for 0, 90, ... because they're likely more common than anything else.
This commit fixes Bug 1743245 (grid lines in PDF files rendered too thick), which was introduced by the fix for #12868.
The lineWidth was set to round(1 * this._combinedScaleFactor) when the pixel is drawn as a parallelogram with a height < 1. This fix changes that to floor(1 * this._combinedScaleFactor).
This change shows a visual result comparable to Chrome and Acrobat.
Regarding the last PR, 3 statements in canvas.js are affected and will change with this commit (stroke and paintChar).
Renaming the reference files to match the naming convention.
Given that the regular expression has already become more complex (after the initial patch adding it), it seems to me that it probably cannot hurt to add a global cache to reduce unnecessary re-parsing.
Obviously the `Glyph`-instances are being cached *per* font, however in most documents multiple fonts are being used and in practice there's very often a fair amount of overlap between the /ToUnicode-data in different fonts[1].
Consider for example loading and rendering the entire `tracemonkey.pdf` document (from the test-suite), which isn't a particularly large document. In that case the `getCharUnicodeCategory` function is being called a total of `601` times, however there's only `106` *unique* unicode-chars being checked.
*Please note:* In practice I suppose that this won't have a *huge* effect on overall performance, however given the relative simplicity of this patch I figured that it'd not hurt to submit it for review.
---
[1] Consider e.g. how there's usually different fonts used for regular, bold, respectively italic text.
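A minimal sketch of such a global cache, assuming a `getCategory` helper that runs the (comparatively expensive) regular expression:

```js
// Global cache, shared across all fonts; keys are single characters.
const charCategoryCache = new Map();

function getCharUnicodeCategory(char) {
  let category = charCategoryCache.get(char);
  if (!category) {
    category = getCategory(char); // Parse once per unique char...
    charCategoryCache.set(char, category); // ...then re-use the result.
  }
  return category;
}
```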
- it aims to fix issue #14307;
- this event has been added recently in Firefox and we can now use it;
- fix a few bugs in aform.js and in annotation_layer.js;
- add some integration tests to test keystroke events (see `AFSpecial_Keystroke`);
- make dispatchEvent in the quickjs sandbox async.
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1749563;
- use some helper functions to get (u|i)int** values in the buffer: it helps to have clearer code;
- in composite glyphs the translation values in a transformation are signed, so consequently read int8 instead of uint8;
- add a few TODOs.
Please refer to https://www.pdfa.org/norm-refs/Type1Fonts.pdf#page=15 for the expected format for the /CharStrings entries.
In the referenced PDF document the /CharStrings are missing the expected end-token, which causes us to swallow the start of the next glyph name.
After the changes in PR 14428 we can *directly*, and more efficiently, handle whitespace conversion in `PartialEvaluator.getTextContent` when the `normalizeWhitespace` option is being used.
This way we no longer need a separate helper function for this, and can avoid having to (again) iterate through the text and checking each character. Finally, this also removes the need for using a regular expression on e.g. all non-ASCII text.
Inlining the checks should be a *tiny bit* more efficient, since it avoids having to make *unconditional* function calls in these fairly commonly used helper functions.
This patch implements this by looking for the UTF-8 BOM, i.e. `\xEF\xBB\xBF`, in order to determine the encoding.[1]
The actual conversion is done using the `TextDecoder` interface, which should be available in all environments/browsers that we support; please see https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder#browser_compatibility
---
[1] Assuming that everything lacking a UTF-16 BOM would have to be UTF-8 encoded really doesn't seem correct.
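A minimal sketch of the detection, assuming the input is a `Uint8Array`:

```js
// Sketch of the described detection: anything starting with the UTF-8
// BOM (\xEF\xBB\xBF) is decoded as UTF-8 via TextDecoder.
function decodeIfUtf8(bytes) {
  if (bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return new TextDecoder("utf-8", { fatal: true }).decode(
      bytes.subarray(3) // Skip past the BOM itself.
    );
  }
  return null; // Not UTF-8; handled by the existing code-paths.
}
```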
In corrupt PDF documents Type3 fonts may introduce circular dependencies, thus resulting in the affected font(s) never loading and parsing/rendering never completing.
Note that I've not seen any real-world examples of this kind of font corruption, but the attached PDF document was rather found in https://github.com/pdf-association/safedocs/tree/main/Miscellaneous%20Targeted%20Test%20PDFs
*Please note:* That repository contains a number of reduced test-cases that are specifically intended to test interoperability (between PDF viewer) and parsing/rendering for various kinds of strange/corrupt PDF documents.
Some of the test-cases found there may thus not make sense to try and "fix" upfront, in my opinion, unless the problems are also found in real-world PDF documents.
While `PageViewport` apparently makes sense in TypeScript environments, given that it's being returned by the `PDFPageProxy.getViewport`-method in the API, we really don't want to extend the *public* API by simply exporting the class directly in `src/pdf.js` since it should never be called/initialized manually.
Hence we follow the same pattern as in PR 14013, and also extend the API unit-tests to ensure that `PDFPageProxy.getViewport` always returns a `PageViewport`-instance as expected.
This prevents the `BaseSVGFactory.create`-method from throwing, and thus preventing any remaining Annotations (on the page) from rendering in corrupt documents.
This helper function has never been used in e.g. the worker-thread, hence its placement in `src/shared/util.js` led to a *small* amount of unnecessary duplication.
After the previous patches this helper function is now *only* used in the viewer, hence it no longer seems necessary to expose it through the official API.
*Please note:* It seems somewhat unlikely that third-party users were relying *directly* on this helper function, which is why it's not being exported as part of the viewer components. (If necessary, we can always change this later on.)
As part of the changes/improvement in PR 14092, we're no longer using the `addLinkAttributes` directly in e.g. the AnnotationLayer-code.
Given that the helper function is now *only* used in the viewer, it no longer seems necessary to expose it through the official API.
*Please note:* It seems somewhat unlikely that third-party users were relying *directly* on the helper function, which is why it's not being exported as part of the viewer components. (If necessary, we can always change this later on.)
The patch in PR 14335 *essentially* re-introduced the old code from before PR 3848, however looking at this code a bit closer it should be possible to simplify it by making the method asynchronous.
While this method is currently only used as a *fallback* in corrupt documents, the way that `MissingDataException`s are handled is less than ideal. Note that if a `MissingDataException` is thrown, we're forced to re-parse the *entire* /Pages tree[1].
With this method now being asynchronous, we're able to handle fetching of References in a *much* easier/nicer way than before without having to throw `MissingDataException`s and re-parse anything.
These changes also let us simplify the call-site slightly, by calling the method *directly* instead of using the `PDFManager`-instance (since again it will no longer throw `MissingDataException`s).
Furthermore, this patch contains the following other changes:
- Reduce unnecessary duplication in the various `catch` handlers throughout the method, by simply moving the `XRefEntryException` handling into the `addPageError` helper function instead.
- Move the "circular references"-check to occur slightly earlier, since there's obviously no point in asynchronously fetching data just to then throw an Error *immediately* afterwards.
---
[1] Imagine e.g. a thousand page document, where there's a `MissingDataException` thrown when fetching/parsing page 900.
This method is now being used a lot more, compared to when it was added, since it's now used together with scripting as part of the `PDFDocument.fieldObjects` parsing (called during viewer initialization).
For /Page Dictionaries that we've already parsed, the `pageIndex` corresponding to a particular Reference is already known and we're thus able to skip *all* parsing in the `Catalog.getPageIndex` method for those cases.
Besides converting `Catalog.getPageDict` to an `async` method, thus simplifying the code, this patch also allows us to pro-actively fix an existing issue.
Note how we're looking up References in such a way that `MissingDataException`s won't cause trouble, however it's *technically possible* that the entries (i.e. /Count, /Kids, and /Type) in a /Pages Dictionary could actually be indirect objects as well. In the existing code this could lead to *some*, or even all, pages failing to load/render as intended.
In practice that doesn't *appear* to happen in real-world PDF documents, but given all the weird things that PDF software do I'd prefer to fix this pro-actively (rather than waiting for a bug report).
With `Catalog.getPageDict` being `async` this is now really simple to address, however I didn't want to introduce a bunch more *unconditional* asynchronicity in this method if it could be avoided (since that could slow things down). Hence we'll *synchronously* lookup the *raw* data in a /Pages Dictionary, and only fallback to asynchronous data lookup when a Reference was encountered.
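Concretely, the lookup pattern described above amounts to something like this sketch (names illustrative):

```js
// Look up the raw value synchronously, and only await when a Reference
// was actually encountered; this keeps the common case synchronous.
let count = currentPagesDict.getRaw("Count");
if (count instanceof Ref) {
  count = await this.xref.fetchAsync(count);
}
```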
In addition to the above, this patch also makes the following notable changes:
- Let `Catalog.getPageDict` *consistently* reject with the actual error, regardless of what data we're fetching. Previously we'd "swallow" the actual errors except when looking up Dictionary entries, which is inconsistent and thus seems unfortunate. As can be seen from the updated unit-tests this change is API-observable, hence why the patch is tagged `[api-minor]`.
- Improve the consistency of the Dictionary /Type-checks in both the `Catalog.getPageDict` and `Catalog.getAllPageDicts` methods.
In `Catalog.getPageDict` there's a fallback code-path where we're *incorrectly* checking the /Page Dictionary for a /Contents-entry, which is wrong since a /Page Dictionary doesn't need to have a /Contents-entry in order to be valid.
For consistency the `Catalog.getAllPageDicts` method is also updated to handle errors in the /Type-lookup correctly.
- Reduce the `PagesCountLimit.PAUSE_EAGER_PAGE_INIT` viewer constant, to further improve loading/rendering performance of the *second* page during initialization of very long documents; PR 14359 follow-up.
This patch circumvents the issues seen when trying to update TypeScript to version `4.5`, by "simply" fixing the broken/missing JSDocs and `typedef`s such that `gulp typestest` now passes.
As always, given that I don't really know anything about TypeScript, I cannot tell if this is a "correct" and/or proper way of doing things; we'll need TypeScript users to help out with testing!
*Please note:* I'm sorry about the size of this patch, but given how intertwined all of this unfortunately is it just didn't seem easy to split this into smaller parts.
However, one good thing about this TypeScript update is that it helped uncover a number of pre-existing bugs in our JSDocs comments.
In PR 14114 this was only added to the default viewer, which means that in the viewer components the user would need to *manually* implement /Lang handling. This was (obviously) a bad choice, since the viewer components already support e.g. structTrees by default; sorry about overlooking this!
To avoid having to make *two* `getMetadata` API-calls[1] very early during initialization, in the default viewer, the API will now cache its result. This will also come in handy elsewhere in the default viewer, e.g. by reducing parsing when opening the "document properties" dialog.
---
[1] This not only includes a round-trip to the worker-thread, but also having to re-parse the /Metadata-entry when it exists.
After the changes in PR 14338, specifically in the `XRef.parse`-method, the /Pages-entry will now always have been fetched/validated when the `Catalog`-instance is created.
Hence we can directly access the /Pages-entry in `Catalog.getPageDict` and thus avoid *one* asynchronous data-lookup per page in the document. (In practice this is unlikely to show up in e.g. benchmarks, but it really cannot hurt.)
Finally, make sure that the `getPageDict`/`getAllPageDicts`-methods track the /Pages-tree reference correctly to prevent circular references in corrupt documents.
Rather than "swallowing" the actual Errors, when data fetching fails, ensure that they're always being propagated as intended to the call-site instead.
Note that we purposely handle `XRefEntryException` specially, to make it possible to fallback to indexing all XRef objects.
Rather than trying, and failing, to fetch the entire /Pages-tree for documents with corrupt XRef tables, let's fallback to indexing all objects *before* trying to invoke the `Catalog.getAllPageDicts` method.
Given that not all pages necessarily are being accessed, or that the pages may be accessed out of order, using a `Map` seems like a more appropriate data-structure here.
Finally, also changes the `pagePromises` to a *private* property since it's not supposed to be accessed from the "outside".
Given that not all pages necessarily are being accessed, or that the pages may be accessed out of order, using a `Map` seems like a more appropriate data-structure here.
For one thing, this simplifies iteration since we no longer have to worry about/check if `pageCache`-entries are undefined (which will happen for *sparse* `Array`s).
Of particular note is that we're no longer attempting to "null" the `pageCache`-entry from within the `PDFPageProxy._destroy`-method. Given that *synchronous* JavaScript will always run to completion[1] and that we're looping through all pages in `WorkerTransport.destroy` and immediately clear the cache afterwards, that code did/does not really make a lot of sense (as far as I can tell).
Finally, also changes the `pageCache` to a *private* property since it's not supposed to be accessed from the "outside".
---
[1] Unless there are errors, of course.
PR 8207 added caching to improve the performance of `Catalog.getPageDict`, by not having to repeatedly fetch the same data and also reducing the asynchronicity of that method.
However, because of *another* oversight on my part, we're only caching /Page references once we've found the correct page. As long as all pages are loaded *in order* this doesn't really matter (happens by default in the viewer), but when `disableAutoFetch` is used the pages may be fetched in a more random order (this patch reduces the asynchronicity of `Catalog.getPageDict` slightly in that case).
PR 8207 added caching to improve the performance of `Catalog.getPageDict`, by not having to repeatedly fetch the same data and also reducing the asynchronicity of that method.
However, because of annoying off-by-one errors[1] the caching became less efficient than it could/should be.[2] Note here that the /Pages-tree is zero-indexed, and that e.g. `pageIndex = 5` thus corresponds to the *sixth* page of the document.
---
[1] In particular the `currentPageIndex + count < pageIndex` part.
[2] For example, even when loading a relatively small/simple document such as `tracemonkey.pdf` in the viewer, the number of `xref.fetchAsync(currentNode)` calls are reduced from `56` to `44` with this patch.
Trying to shadow a non-existent property is always an implementation mistake, since it leads to the `shadow`-call not having any effect.
In PR 14152 I overlooked the fact that it's fairly easy to enforce this during development/testing, since that can help catch e.g. simple spelling bugs.
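A hedged sketch of the `shadow` helper with such a development/testing check added (the build-time guard and `assert` helper are assumptions based on existing pdf.js conventions):

```js
// `shadow` replaces a getter with a fixed value on first access; the
// check below catches attempts to shadow a non-existent property.
function shadow(obj, prop, value) {
  if (typeof PDFJSDev === "undefined" || PDFJSDev.test("TESTING")) {
    assert(prop in obj, `shadow: Property "${prop}" does not exist.`);
  }
  Object.defineProperty(obj, prop, {
    value,
    enumerable: true,
    configurable: true,
    writable: false,
  });
  return value;
}
```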
Currently the `Catalog.metadata` getter only handles errors during parsing, however in a *corrupt* PDF document fetching of the raw /Metadata can obviously fail as well.
Without this patch the `PDFDocumentProxy.getMetadata` method, in the API, can thus fail which it *never* should and this will cause the viewer to not initialize all state as expected.
Fixes one of the documents in issue 14305.
*This patch improves handling of a couple of PDF documents from issue 14303.*
- Update `XRef.indexObjects` to actually clear *all* XRef-caches. Invalid XRef tables *usually* cause issues early enough during parsing that we've not populated the XRef-cache, however to prevent any issues we obviously need to clear that one as well.
- Improve the /Root dictionary validation in `XRef.parse` (PR 9827 follow-up). In addition to checking that a /Pages entry exists, we'll now also check that it can be successfully fetched *and* that it's of the correct type. There's really no point trying to use a /Root dictionary that e.g. `Catalog.toplevelPagesDict` will reject, and this way we'll be able to fallback to indexing the objects in corrupt documents.
- Throw an `InvalidPDFException`, rather than a general `FormatError`, in `XRef.parse` when no usable /Root dictionary could be found. That really seems more appropriate overall, since all attempts at parsing/recovery have failed. (This part of the patch is API-observable, hence the tag.)
With these changes, two existing test-cases are improved and the unit-tests are updated/re-factored to highlight that. In particular `GHOSTSCRIPT-698804-1-fuzzed.pdf` will now both load and "render" correctly, whereas `poppler-395-0-fuzzed.pdf` will now fail immediately upon loading (rather than *appearing* to work).
*Please note:* This is similar to the method that existed prior to PR 3848, but the new method will *only* be used as a fallback when parsing of corrupt PDF documents.
The implementation in PR 14311 unfortunately turned out to be *way* too simplistic, as evident by the recently added test-files in issue 14303, since it may *cause* infinite loops in `PDFDocument.checkLastPage` for some corrupt PDF documents.[1]
To avoid this, the easiest solution that I could come up with was to fallback to eagerly parsing the *entire* /Pages-tree when the /Count-entry validation fails during document initialization.
Fixes *at least* two of the issues listed in issue 14303, namely the `poppler-395-0.pdf...` and `GHOSTSCRIPT-698804-1.pdf...` documents.
---
[1] The whole point of PR 14311 was obviously to *get rid of* infinite loops during document initialization, not to introduce any more of those.
This was added in PR 14311, but given that I completely forgot to update the `PDFDocument.getPage` signature accordingly it's completely unused.
Given that things work just as fine as-is, let's simply remove that optional parameter for now; sorry about the churn here!
This only applies to severely corrupt documents, where it's possible that the `Parser` throws when we try to access e.g. a /Kids-entry in the /Pages-tree.
Fixes two of the issues listed in issue 14303, namely the `poppler-742-0.pdf...` and `poppler-937-0.pdf...` documents.
*Please note:* While this patch on its own is sufficient to prevent the worker-thread from hanging, in combination with PR 14311 these PDF documents will both load *and* render correctly.
Rather than focusing on the particular structure of these PDF documents, it seemed (at least to me) to make sense to try and prevent all circular references when fetching/looking-up data using the XRef table.
To avoid a solution that required tracking the references manually everywhere, the implementation settled on here instead handles that internally in the `XRef.fetch`-method. This should work, since that method *and* the `Parser`/`Lexer`-implementations are completely synchronous.
Note also that the existing `XRef`-caching, used for all data-types *except* Streams, should hopefully help to lessen the performance impact of these changes.
One *potential* problem with these changes could be certain *browser* exceptions, since those are generally not catchable in JavaScript code, however those would most likely "stop" worker-thread parsing anyway (at least I hope so).
Finally, note that I settled on returning dummy-data rather than throwing an exception. This was done to allow parsing, for the rest of the document, to continue such that *one* bad reference doesn't prevent an entire document from loading.
Fixes two of the issues listed in issue 14303, namely the `poppler-91414-0.zip-2.gz-53.pdf` and `poppler-91414-0.zip-2.gz-54.pdf` documents.
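A rough sketch of the internal tracking in `XRef.fetch`; the Set-based bookkeeping, the dummy value, and the `_fetchUncached` helper are illustrative rather than the exact implementation:

```js
fetch(ref) {
  const key = ref.toString();
  if (this._pendingRefs.has(key)) {
    warn(`Ignoring circular reference: ${key}.`);
    return Name.get("CircularRef"); // Dummy-data, so parsing continues.
  }
  this._pendingRefs.add(key);
  try {
    return this._fetchUncached(ref);
  } finally {
    // Always reset the tracking, even when parsing throws.
    this._pendingRefs.delete(key);
  }
}
```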
*This patch basically extends the approach from PR 10392, by also checking the last page.*
Currently, in e.g. the `Catalog.numPages`-getter, we're simply assuming that if the /Pages-tree has an *integer* /Count entry it must also be correct/valid.
As can be seen in the referenced PDF documents, that entry may be completely bogus, which causes general parsing to break down elsewhere in the worker-thread (and hangs the browser).
Rather than hoping that the /Count entry is correct, similar to all other data found in PDF documents, we obviously need to validate it. This turns out to be a little less straightforward than one would like, since the only way to do this (as far as I know) is to parse the *entire* /Pages-tree and essentially count the pages.
To avoid doing that for all documents, this patch tries to take a short-cut by checking if the last page (based on the /Count entry) can be successfully fetched. If so, we assume that the /Count entry is correct and use it as-is, otherwise we'll iterate through (potentially) the *entire* /Pages-tree to determine the number of pages.
Unfortunately these changes will have a number of *somewhat* negative side-effects, please see a possibly incomplete list below, however I cannot see a better way to address this bug.
- This will slow down initial loading/rendering of all documents, at least by some amount, since we now need to fetch/parse more of the /Pages-tree in order to be able to access the *last* page of the PDF documents.
- For poorly generated PDF documents, where the entire /Pages-tree only has *one* level, we'll unfortunately need to fetch/parse the *entire* /Pages-tree to get to the last page. While there's a cache to help reduce repeated data lookups, this will affect initial loading/rendering of *some* long PDF documents.
- This will affect the `disableAutoFetch = true` mode negatively, since we now need to fetch/parse more data during document initialization. While the `disableAutoFetch = true` mode should still be helpful in larger/longer PDF documents, for smaller ones the effect/usefulness may unfortunately be lost.
As one *small* additional bonus, we should now also be able to support opening PDF documents where the /Pages-tree /Count entry is completely invalid (e.g. contains a non-integer value).
Fixes two of the issues listed in issue 14303, namely the `poppler-67295-0.pdf` and `poppler-85140-0.pdf` documents.
Given that not all pages necessarily are being accessed, or that the pages may be accessed out of order, using a `Map` seems like a more appropriate data-structure here.
Furthermore, this patch also adds (currently missing) caching for XFA-documents. Loading a couple of such documents in the viewer, with logging added, shows that we're currently re-creating `Page`-instances unnecessarily for XFA-documents.
For this particular PDF document, we have `/W [1 2 166666666666666666666666666]` which obviously makes no sense.
While this patch makes no attempt at actually validating the entries in the /W-array, we'll now simply abort all processing when the end of the PDF document has been reached (thus preventing hanging the browser).
Please note that this patch doesn't enable the PDF document to be loaded/rendered, but at least it fails "correctly" now.
Fixes one of the issues listed in issue 14303, namely the `REDHAT-1531897-0.pdf` document.
This bug was surprisingly difficult to track down, since it didn't just depend on range-requests being used but also on how quickly the document was loaded. To even be able to reproduce this locally, I had to use a very small `rangeChunkSize`-value (note the unit-test).
The cause of this bug is a bogus entry in the XRef-table, causing us to attempt to request data from *beyond* the actual document size and thus getting into an infinite loop.
Fixes *one* of the issues listed in issue 14303, namely the `PDFBOX-4352-0.pdf` document.
*Please note:* These changes will primarily benefit longer documents, somewhat at the expense of e.g. one-page documents.
The existing `PDFDocumentProxy.getStats` function requires a round-trip to the worker-thread in order to obtain the current document stats. In the default viewer, we currently make one such API-call for *every rendered* page.
This patch proposes replacing that method with a *synchronous* `PDFDocumentProxy.stats` getter instead, combined with re-factoring the worker-thread code by adding a `DocStats`-class to track Stream/Font-types and *only send* them to the main-thread *the first time* that a type is encountered.
Note that in practice most PDF documents only use a fairly limited number of Stream/Font-types, which means that in longer documents most of the `PDFDocumentProxy.getStats`-calls will return the same data.[1]
This re-factoring will obviously benefit longer documents the most[2], and could actually be seen as a regression for one-page documents, since in practice there'll usually be a couple of "DocStats" messages sent during the parsing of the first page. However, if the user zooms/rotates the document (which causes re-rendering), note that even a one-page document would start to benefit from these changes.
Another benefit of having the data available/cached in the API is that unless the document stats change during parsing, repeated `PDFDocumentProxy.stats`-calls will return *the same identical* object.
This is something that we can easily take advantage of in the default viewer, by now *only* reporting "documentStats" telemetry[3] when the data actually have changed rather than once per rendered page (again beneficial in longer documents).
---
[1] Furthermore, the maximum number of `StreamType`/`FontType` values are `10` and `12` respectively, which means that regardless of the complexity and page count of a PDF document there'll never be more than twenty-two "DocStats" messages sent; see 41ac3f0c07/src/shared/util.js (L206-L232)
[2] One example is the `pdf.pdf` document in the test-suite, where rendering all of its 1310 pages only results in a total of seven "DocStats" messages being sent from the worker-thread.
[3] Reporting telemetry, in Firefox, includes using `JSON.stringify` on the data and then sending an event to the `PdfStreamConverter.jsm`-code.
In that code the event is handled and `JSON.parse` is used to retrieve the data, and in the "documentStats"-case we'll then iterate through the data to avoid double-reporting telemetry; see https://searchfox.org/mozilla-central/rev/8f4c180b87e52f3345ef8a3432d6e54bd1eb18dc/toolkit/components/pdfjs/content/PdfStreamConverter.jsm#515-549
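A rough sketch of the worker-thread `DocStats` idea described above; the message-handler plumbing and method names are illustrative:

```js
class DocStats {
  constructor(handler) {
    this._handler = handler;
    this._streamTypes = new Set();
    this._fontTypes = new Set();
  }

  // addFontType(type) would be analogous, using `_fontTypes`.
  addStreamType(type) {
    if (this._streamTypes.has(type)) {
      return; // Already reported once; don't send another message.
    }
    this._streamTypes.add(type);
    this._send();
  }

  _send() {
    this._handler.send("DocStats", {
      streamTypes: [...this._streamTypes],
      fontTypes: [...this._fontTypes],
    });
  }
}
```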
Given that all modern browsers now support `postMessage` transfers, and have for years, it no longer seems necessary for the PDF.js library to support using Workers unless the `postMessage` transfers functionality is available.
This patch is a follow-up to PR 11123, which made it impossible to *manually* disable `postMessage` transfers for performance reasons (since it increases memory usage), which hasn't caused any bug reports as far as I know.[1]
Hence we'll now only support *proper* Worker implementations, with fully working `postMessage` transfers, and fallback to using "fake" Workers otherwise.
---
[1] At the time of that PR we still "supported" IE, which is why this code was left intact.
There's obviously no guarantee that this will work in general, if the document is sufficiently corrupt, but it should hopefully be better than just throwing `InvalidPDFException` as currently happens.
Please note that, as is often the case with corrupt documents, it's somewhat difficult to know if we're rendering the document "correctly" with this patch[1]. In this case even Adobe Reader cannot open the document, which is always a good sign that it's *really* corrupt, however we're at least able to render *something* with this patch.
---
[1] Whatever "correct" even means when dealing with corrupt PDF documents, where often times different PDF viewers won't agree completely.
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=931481;
- real space chars are pushed into the chunk, but when there is extra spacing the next char position must be compared with the previous one;
- for example, extra spacing can cancel out a space, so visually there is no space.
- First step to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1737260;
- several interactive pdfs use the possibility to hide/show buttons to show different icons;
- render pushbuttons on their own canvas and then insert it into the annotation_layer;
- update test/driver.js in order to convert canvases for pushbuttons into images.
Previously, when we created a shading pattern canvas we created it
as the same size as the page. This was good for caching if the same
pattern was used over and over again, but when lots of different
shadings are created that caused us to create many full page
canvases.
Instead of creating the full page canvases, create the canvas
as the same size as the current path bounding box. This reduces memory
consumption by a lot since most paths are pretty small. Also, in real world
PDFs it's rare for a shading (non shading fill) to be reused over and over again.
Bug 1721949 is an example where the same pattern is reused and it will be slightly
slower than before.
In `beginGroup` we create a new canvas that is the size of the
bounding box and we translate it to the offset. This means we don't need to
also apply the bounding box during `paintFormXObjectBegin`.
This improves #6961 quite a bit, but it is still missing the indentation
in the ruler.
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1739502;
- when the target area was the current content area, everything was pushed into it instead of creating a new one (which would consequently create a new pageArea).
- the pdf shows an alignment issue on page 4:
- the hAlign is "center" but the subform was the width of its parent, so compute the real width of the subform with tb layout;
- there is an extra empty page at the end of the pdf:
- there is a subform with some hidden elements which are not rendered for now (since there is no plugged-in JS engine, it isn't possible to draw them by changing their visibility).
- so in case a subform is empty and has no real dimensions (at least one is 0), we just consider it as empty.
We were incorrectly using the transform in the pattern before it had been
adjusted causing the pattern to be misplaced relative to the page.
Fixes: ShowText-ShadingPattern.pdf (already in corpus)
Fixes: #8111
Fixes: #9243
A subform with no min displays even though its subform is set to <occur max=-1 min=0>.
If we look through the XFA 3.3 specification, https://www.pdfa.org/norm-refs/XFA-3_3.pdf:
- "The min attribute is used when processing a form that contains data. Regardless of the data at least this number of instances is included. It is permissible to set this value to zero, in which case the container is entirely excluded if there is no data for it."
However, in our case that doesn't happen, because we let our empty dataNode get through. Though by adding a clause:
- eliminate unmatched data with occur min=0
we check our empty data and send it to the uselessNode array, where at the end it gets removed.
Very short strings can narrowly miss the existing Bidi-detection threshold, leading to incorrect text-selection and copying behaviour.
In my testing, neither Adobe Reader nor PDFium seems to handle copying "correctly" for this document. Hence it's not entirely clear to me that we actually want to fix this, since tweaking these heuristics can *obviously* cause regressions elsewhere (and our test coverage for RTL-text isn't exactly great).
Starting a new path will wipe out any of the current subpaths in the
current graphics state, so we should reset the min/maxes.
This makes a number of the bounding boxes smaller and reduces the number
of composed pixels. For the smask tests in the corpus, the number of
composed pixels goes from 19,872,109 to 19,676,905. The difference is much
larger on other PDFs though.
Embedded JS in PDFs keeps throwing alerts regarding a specific version of Acrobat (Spanish, and version 5.0 or greater).
This happens because:
- JS in the pdf is enabled
- the PDF contains some unsupported features (e.g. XFA)
The alert comes when `app.formVersion === undefined || app.formVersion < 5.0`.
In pdf.js we were using FORM_VERSION = undefined. After researching, based on https://opensource.adobe.com/dc-acrobat-sdk-docs/acrobatsdk/pdfs/acrobatsdk_jsapiref.pdf#G4.1993509 and Acrobat DC, we decided to go with the larger number to avoid unnecessary popups.
Through investigation we realised that VIEWER_VERSION should have the same value, a number.
Due to all that, we implemented 21.00720099 as the value for both FORMS_VERSION and VIEWER_VERSION.
This allows us to compose much smaller regions of the soft mask,
making them much faster. This should also allow
for further optimizations in the pattern code.
For example locally I see issue #6573 go from 55s
to 5s with this change.
Fixes #6573
The old method of handling soft masks had a number of issues where the temporary
drawing canvas and the suspended main canvas could get out of sync
(e.g. mismatched save/restores or clip state) or we could end up compositing at
the wrong time. A good example of things getting out sync is the reduced test
case in #9017.
To fix this I've changed two big things:
1) Duplicate all the needed graphics state from the temporary canvas to the
suspended main canvas. This ensures the canvases stay in sync so that when we
switch back to the main canvas the graphics state stack is the same
(e.g. transforms, clip paths).
2) Immediately composite after each drawing operation. This ensures that if
there's an active clip region that we'll still be able to composite the correct
portions of the canvas. Note: This solution could be avoided by using
getImageData and putImageData since those ignore clipping region, but this is
very very slow. Note2: I also think the old way of only compositing at the end
of the soft mask is incorrect and can lead to wrong colors if drawing over the
same region, but in practice this doesn't seem to matter much.
Fixes: #5781, #5853, #7267, #7891, #8403, #8624, #12798, #13891, and #9017 (reduced test case)
Fixes: https://bugzilla.mozilla.org/show_bug.cgi?id=1703683
With ResetForm-action support added in PR 14083, there's a regression in the `issue12716` test-case. More specifically the border around the "Clear Form"-link is now rendered *twice*, once in the canvas via the appearance-stream and once in the annotationLayer via the border-data.
This looks slightly weird, and was most likely not intended, which is why this patch suggests that we ignore the border in the annotationLayer when an appearance-stream exists.
Apparently Node.js has added *global* `URL.createObjectURL` support, but not done the same thing for `Blob`. Hence we also need to check for the availability of `Blob` in the `createObjectURL` helper function, and it's probably a good idea to also update `examples/node/pdf2svg.js` to work-around this until these changes reach an official PDF.js release.
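A sketch of the extra availability check, assuming the helper otherwise falls back to manually building a "data:" URL:

```js
// Only use URL.createObjectURL when Blob is also available (Node.js
// currently exposes the former globally, but not the latter).
function createObjectURL(data, contentType = "") {
  if (URL.createObjectURL && typeof Blob !== "undefined") {
    return URL.createObjectURL(new Blob([data], { type: contentType }));
  }
  // ...otherwise fall back to building a base64 "data:" URL manually.
}
```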
Trying to render these Annotation-types, when the borderWidth is `0`, causes a "hairline" border to appear. If these Annotations included an appearance stream, as they are supposed to, this wouldn't have happened and the simplest solution here seem to be to just ignore these particular Annotations.
- PR #13257 fixed a lot of issues, but not all, and this patch aims to fix almost all remaining ones.
- the idea in this new patch is to compare the position of a new glyph with the last position where a glyph has been drawn;
- no spaces are "drawn": they just move the cursor but aren't added to the chunk;
- this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one.
- to tell the difference between real spaces and tracking ones, we used a factor of the space width (from the font);
- it was a pretty good idea in general, but it fails with some fonts where the space width is too big:
- in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies a tracking space; see the sketch below).
*Please note:* This is a tentative patch, since I don't have the necessary a11y-software to actually test it.
To avoid having to add a new API-method just for a single string, I figured that adding the new property to the existing `documentInfo`-data (accessed via `PDFDocumentProxy.getMetadata` in the API) will hopefully be deemed acceptable.
In these cases there's no good reason, in my opinion, to duplicate the `shadow`-lines since that unnecessarily increases the risk of simple typos (see the previous patch).
With this typo the shadowing doesn't actually work, which causes these checks to be unnecessarily repeated. In this particular case it didn't have a significant performance impact, however we should definitely fix this nonetheless.
If a PDF included an embedded TrueType font whose preferred character
map (cmap) was in "format 2", the code would select that character map
and then refuse to read it because of an unsupported format, thus
causing the characters not to be rendered. This commit implements
support for format 2 as described at the link below.
https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html
This patch (slightly) simplifies a couple of `onProgress` and `onUnsupportedFeature` call-sites.
Finally, while unrelated, also removes some unnecessary `return undefined;` statements (PR 11601 follow-up).
For Circle, Square, and Polygon Annotations it's currently only possible to toggle the associated PopupAnnotation by clicking on its border. Depending on the border width, and also the current zoom-level in the viewer, that can make interacting with certain Annotations *practically* impossible (which is the case in issue 14107).
Hence, in order to improve this, change the "fill"-property of the SVG element in the annotationLayer to make the *entire* element part of the click/mouse-over target.
*Please note:* Given that this is a viewer-related issue, there's no simple way to test this as far as I can tell.
This is unfortunately *yet another* bug in the `preEvaluateFont`-implementation, and I've lost count of the number of times I've had to tweak this code over the years :-(
I really cannot help thinking that PR 4423 was way too simplistic, since it missed a bunch of cases that leads to broken font rendering in many PDF documents.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1734802
- the exact sentence from the spec:
"The token SOLIDUS (a slash followed by no regular characters) introduces a unique valid name defined by the empty sequence of characters."
- so just remove the warning.
With a recent addition to the HTML specification, the internal structured clone algorithm used in browsers is (or will be, once it's implemented) *directly* accessible to JavaScript; please see https://developer.mozilla.org/en-US/docs/Web/API/WindowOrWorkerGlobalScope/structuredClone
Hence we'll *eventually* not need to maintain our own structured clone functionality in the `LoopbackPort`-class in the API, however for the time being we'll feature detect `structuredClone` and fallback to the existing PDF.js implementation.
Given that https://bugzilla.mozilla.org/show_bug.cgi?id=1722576 has landed in Firefox 94, we should no longer need the manually implemented `cloneValue`-functionality in MOZCENTRAL builds. Note also that in the Firefox built-in PDF Viewer it's not possible for users to *easily* disable workers, which should further reduce the risk of these changes.
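A minimal sketch of that feature detection; `cloneValue` stands for the existing, manually implemented, PDF.js fallback:

```js
function clone(value) {
  if (typeof structuredClone === "function") {
    return structuredClone(value); // Native browser implementation.
  }
  return cloneValue(value); // Existing PDF.js implementation.
}
```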
This patch helps reduce some duplication, given that we now have a few essentially identical `addLinkAttributes` call-sites in the code-base.
To prevent runtime errors in the Annotation/XFA-layer code, we'll warn if a custom/incomplete `PDFLinkService` is being used (limited to GENERIC builds).
Note how both the annotationLayer and the document outline will apply various URL-related options when creating the link-elements.
For consistency the `xfaLayer`-rendering should obviously use the same options, to ensure that the existing options are indeed applied to all URLs regardless of where they originate.
Given that `NodeList`s can be iterated using `for..of` we can use that instead, since it's a little bit nicer and easier to read than the `Array.prototype.forEach` format.
- it aims to fix #12721.
- Thanks to PR #14023, we now have the fieldObjects in the annotation layer, so we can easily map field names to their ids if needed.
- Reset values in the storage, in the JS sandbox and in the visible html elements.
In the referenced bug, the embedded fonts contain custom CMap-data that only include strings. Note how for embedded composite TrueType fonts we're using the CMap-data when building the glyph mapping, and currently we end up with a completely empty map because the code expects only CID *numbers*.
Furthermore, just fixing the glyph mapping alone isn't sufficient to fully address the bug, since we also need to consider this "special" kind of CMap-data when looking up glyph widths.
Having recently worked with, and reviewed patches touching, this code it seemed that it's probably not a bad idea to move that functionality into `createValidAbsoluteUrl` as new options instead.
For the `addDefaultProtocolToUrl` functionality in particular, the existing helper function was not only moved but slightly improved as well. Looking at the code, I realized that there's a small risk that it would incorrectly match a *relative* URL-string too.
With these changes, the `createValidAbsoluteUrl` call-sites in the `src/core/`-code can be simplified a little bit.
*Please note:* This patch may, indirectly, change the format of the `unsafeUrl`-property returned with relevant Annotations and OutlineItems; hence the `api-minor` tag.
However, I'd argue that it's actually more correct this way since the whole purpose of `unsafeUrl` is/was to return the URL data as-is without any parsing done.
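As a purely illustrative example, a call-site using the new options might look as follows; the option name is an assumption based on the description above:

```js
// Hypothetical usage sketch, not the exact signature.
const url = createValidAbsoluteUrl(rawUrl, /* baseUrl = */ null, {
  addDefaultProtocol: true, // Prepend "http://" to "www."-prefixed URLs.
});
```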
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1716758;
- some buttons have a JS action with the pattern `app.launchURL(...)` (or similar), so when possible extract the url and generate an <a> element with its href equal to the found url;
- pdf.js already had some code to handle that, so this patch slightly refactors it.
- it aims to fix #14021;
- the N dict is empty here so just create a default one;
- it implies that the checked checkbox has no appearance so create a default one too in order to print it;
- in the pdf in the issue, a checked box is not printed because it has no default appearance so we need to guess its appearance from its state.
In order to implement this, we utilize the existing `bidi` function to infer the text-direction of /T and /Contents entries. While this may not be perfect in cases where one PopupAnnotation mixes LTR and RTL languages, it should work well enough in most cases.
To avoid having to add *two new* properties in lots of annotations, supplementing the existing `title`/`contents`-properties, this patch instead re-factors the existing code such that the properties are replaced by Objects (containing `str` and `dir`).
*Please note:* In order to avoid breaking existing third-party implementations, `GENERIC`-builds of the PDF.js library will still provide the old `title`/`contents`-properties on annotations returned by `PDFPageProxy.getAnnotations`.
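A hedged sketch of the resulting data shape; the property names are assumptions for illustration, and the `bidi` helper is assumed to return the string together with its inferred direction:

```js
// Assumed names, for illustration only.
annotation.titleObj = bidi(rawTitle); // e.g. { str: "Title", dir: "ltr" }
annotation.contentsObj = bidi(rawContents); // e.g. { str: "...", dir: "rtl" }
```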
- In the pdf in issue #14071, some select fields don't contain any values;
- the corresponding node has bindItems and bind elements, and the _bindItems function was just not called.
In particular the `_processStreamMessage`-method is a bit cumbersome to read, given the way that the current streamController/streamSink is accessed, which we can improve with a couple of local variables.
After PR 11601, the `paintJpegXObject` operator is no longer used for anything. While I don't think we can just remove it, and essentially leave a "hole" in the `OPS` structure, we should at least mark it as explicitly unused to aid readability/maintainability of the code.
This replaces direct `document.getElementsByName` lookups with a helper method which:
- Lets the AnnotationLayer use the data returned by the `PDFDocumentProxy.getFieldObjects` API-method, such that we can directly lookup only the necessary DOM elements.
- Fallback to using `document.getElementsByName` as before, such that e.g. the standalone viewer components still work.
Finally, to fix the problems reported in issue 14003, regardless of the code-path we now also enforce that the DOM elements found were actually created by the AnnotationLayer code.
With these changes we'll thus be able to update form elements on all visible pages just as before, but we'll additionally update the AnnotationStorage for not-yet-rendered elements thus fixing a pre-existing bug.
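A loose sketch of the helper described above, under the assumption that the AnnotationLayer marks its own elements with a recognizable attribute (all names and attributes here are illustrative, not the actual API):

```js
function getElements(name, fieldObjects) {
  const fields = fieldObjects?.[name];
  if (fields) {
    // Use the `PDFDocumentProxy.getFieldObjects` data to look up only
    // the necessary DOM elements, by their ids.
    return fields.map(({ id }) => document.getElementById(id));
  }
  // Fallback, e.g. for the standalone viewer components, keeping only
  // elements that were actually created by the AnnotationLayer code.
  return [...document.getElementsByName(name)].filter(
    el => "annotationLayer" in el.dataset
  );
}
```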
*This is similar to the "isSymbolicFont"-property, which is no longer exported by default after PR 11777.*
Both "isMonospace" and "isSerifFont" are internal properties, used during font parsing and building of the glyph mapping on the worker-thread.
However both of these properties are completely unused on the main-thread and/or in the API, and accessing them will now require setting the `fontExtraProperties`-option when calling `getDocument`.
With this patch we'll ensure that only valid absolute URLs can be used in XFA documents, similar to the existing validation done for "regular" PDF documents.
Furthermore, we'll also attempt to add a default protocol (i.e. `http`) to URLs beginning with "www." in XFA documents as well; this on its own is enough to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1731240
Rather than re-computing this value in a number of different places throughout the code-base[1], we can expose this in the API via the existing `PixelsPerInch`-structure instead.
There's also been feature requests asking for the old `CSS_UNITS` viewer constant to be made accessible, such that it could be used in third-party implementations.
I suppose that it could be argued that it's somewhat confusing to place a unitless property in `PixelsPerInch`, however given that the `PDF_TO_CSS_UNITS`-property is defined strictly in terms of the existing properties this is hopefully deemed reasonable.
---
[1] These include:
- The viewer, with the `CSS_UNITS` name.
- The reference-tests.
- The display-layer, when rendering images; see PR 13991.
Currently we only exclude /Encoding entries that also contains a /Differences array, which is the cause of the text-selection problem in the referenced issue.
In order to address this we'll now also exclude /Encoding entries that contain one of the predefined *named* encodings, and no longer require that it also contains a /Differences array.
*Please note:* This patch causes a small "regression" in the `bug1130815-text` test-case, however this is actually an improvement when compared with Adobe Reader and PDFium (in Google Chrome).
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1719148;
- JS can set a property for a non-rendered annotation using the annotationStorage but the other unset default properties must be used when the annotation is finally rendered;
- so this patch just adds the properties already set in the annotationStorage to the default value.
This reduces some unnecessary duplication, since we currently have essentially the same code in a handful of places in the `Font.fallbackToSystemFont`-method.
In this particular case the `CMap`-data that we create contains only numbers, but no strings, which causes `PartialEvaluator.readToUnicode` to create a ToUnicode-map with only empty strings.
*Please note:* This is yet another case where I don't know if it's necessarily the best and most correct solution, but it does fix the referenced issue.
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1722036.
- AS and V should share the same value for checkboxes: that's at least what the specs say;
- the pdf in the above bug opens correctly in Acrobat so it likely means that AS is chosen over V.
- it aims to fix #14024.
- this patch adds an attribute `acroformExportValue` to the HTML input in order to set the checked attribute, taking into account the exportValue for checkboxes with the same name.
*Please note:* All of this feels very handwavy, but at least it passes all tests locally. Hopefully we have enough tests for this part of the font code.
For non-embedded composite standard fonts with an "incomplete" /CIDToGIDMap, we'll now fallback to an *explicitly defined* /ToUnicode map even when that one happens to be an /Identity-H or /Identity-V map.
The `Font.fallbackToSystemFont` method is unfortunately getting more and more special-cases, however that might be unavoidable given all the weird non-embedded fonts found in the wild :-(
While these types apparently makes sense in TypeScript environments, we really don't want to extend the *public* API by simply exporting the relevant classes directly in `src/pdf.js` (since they should never be called/initialized manually).
Please see e.g. issue 12384 where this was first requested, and note that a possible work-around was also provided there. This patch simply implements that work-around[1], which will hopefully be helpful to TypeScript users.
---
[1] Based on the discussion in PR 13957, the two previous patches appear to be necessary for this to actually work.
*This fixes something that I noticed, having recently looked at both the `Lexer.getObj` and `writeValue` code.*
Please note that I unfortunately don't have an example of a form where saving fails without this patch. However, given its overall simplicity and that unit-tests are added, it's hopefully deemed useful to fix this potential issue pro-actively rather than waiting for a bug report.
At this point one might, and rightly so, wonder if there's actually any real-world PDF documents where a `null` value is being used?
Unfortunately the answer is *yes*, and we have a couple of examples in the test-suite (although none of those are related to forms); please see: `issue1015`, `issue2642`, `issue10402`, `issue12823`, `issue13823`, and `pr12564`.
While some of the output looks worse to my eye, this behavior more closely matches what I see when I open the PDFs in Adobe Acrobat.
Fixes: #4706, #9713, #8245, #1344
The `MessageHandler`-implementation already handles either of these callbacks being undefined, hence there's no particular reason (as far as I can tell) to add no-op functions here.
Also, in a couple of `MessageHandler`-methods, make more use of an already existing local variable.
Following the STR in the issue, this patch reduces the number of `PartialEvaluator.getTextContent`-related `postMessage`-calls by approximately 78 percent.[1]
Note that by enforcing a relatively low value when batching text chunks, we should thus improve worst-case scenarios while not negatively affecting `textLayer` building overall.
While working on these changes I noticed, thanks to our unit-tests, that the implementation of the `appendEOL` function unfortunately means that the number and content of the textItems could actually be affected by the particular chunking used.
That seems *extremely* unfortunate, since in practice this means that the particular chunking used is thus observable through the API. Obviously that should be a completely internal implementation detail, which is why this patch also modifies `appendEOL` to mitigate that.[2]
Given that this patch adds a *minimum* batch size in `enqueueChunk`, there's obviously nothing preventing it from becoming a lot larger than the limit (depending e.g. on the PDF structure and the CPU load/speed).
While sending more text chunks at once isn't an issue in itself, it could become problematic on the main-thread during `textLayer` building. Note how both the `PartialEvaluator` and `CanvasGraphics` implementations utilize `Date.now()`-checks, to prevent long-running parsing/rendering from "hanging" the respective thread. In the `textLayer` building we don't utilize such a construction[3], and streaming of textContent thus essentially acts as a *simple* stand-in for that functionality.
Hence why we want to avoid choosing a too large minimum batch size, since that could thus indirectly affect main-thread performance negatively.
---
[1] While it'd be possible to go even lower, that'd likely require more invasive re-factoring/changes to the `PartialEvaluator.getTextContent`-code to ensure that the batches don't become too large.
[2] This should also, as far as I can tell, explain some of the regressions observed in the "enhance" text-selection tests back in PR 13257.
Looking closer at the `appendEOL` function it should potentially be changed even more, however that should probably not be done here.
[3] I'd really like to avoid implementing something like that for the `textLayer` building as well, given that it'd require adding a fair bit of complexity.
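A rough sketch of the batching idea (hypothetical names, not the actual implementation):
```js
// Buffer parsed text chunks and only send them once a *minimum* batch
// size is reached, thus reducing the number of postMessage-calls while
// keeping individual messages small enough for the main-thread.
const MIN_BATCH_SIZE = 10;

function enqueueChunk(textChunks, sink, force = false) {
  if (textChunks.length === 0) {
    return;
  }
  if (textChunks.length < MIN_BATCH_SIZE && !force) {
    return; // Keep batching; more text chunks are expected.
  }
  // `splice(0)` empties the buffer and hands its contents to the sink.
  sink.enqueue({ items: textChunks.splice(0) });
}
```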
While these changes will obviously not have a significant effect on overall memory usage, it cannot hurt as far as I'm concerned. This patch makes the following changes:
- Clear out `_textDivProperties` once rendering is done, since those properties are only necessary to keep alive when *enhanced* text-selection is being used.
- Reduce the size of the `_textDivProperties`-entries by default, since a majority of the properties are only relevant when *enhanced* text-selection is being used.
In the referenced PDF document the /Contents stream contains MarkedContent-operators, however no optional content dictionary exists; according to [the specification](https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G7.3883825):
> Null values or references to deleted objects shall be ignored. If this entry is
not present, is an empty array, or contains references only to null or deleted
objects, the membership dictionary shall have no effect on the visibility of
any content.
In the PDF document some of the glyphs have bogus `differences`-entries[1] that cannot be resolved to valid glyph names, thus causing the glyph mapping to fail.
My initial idea was to use a similar approach as in the `PartialEvaluator._simpleFontToUnicode`-method, to extract the charCodes from those entries, however it turned out that that didn't actually help in this case (the mapping was still wrong).
To fix this I'm thus proposing that we fallback to the /ToUnicode map when no other usable data exists (e.g. no post-table), since it *hopefully* shouldn't make things any worse than leaving parts of the glyph map empty (which currently happens).
---
[1] As can be seen below, some of the entries are completely normal while others are non-standard:
```
Differences (array)
0 = 65
1 = /g5167
2 = /space
3 = /g11927
4 = /g17737
5 = /g11540
6 = /g2180
7 = /K
8 = /P
9 = /two
10 = /zero
11 = /one
12 = /five
13 = /four
14 = /g6932
15 = /g7246
16 = /g1691
17 = /g2343
18 = /g14792
19 = /g3325
20 = /g4280
21 = /g20383
22 = /g18166
23 = /g16988
24 = /g17943
25 = /g19223
26 = /g10830
27 = 97
28 = /g982
29 = /g1226
30 = /g5059
31 = /g2677
32 = /g1042
33 = /g11568
34 = /L
35 = /three
36 = /seven
37 = /g2364
38 = /g12063
39 = /g5356
40 = /g2173
41 = /g17877
42 = /g7273
43 = /g7647
44 = /g7224
45 = /g19327
46 = /g5054
47 = /g2342
48 = /g10136
49 = /g6856
50 = /g13381
51 = /g7257
52 = /g12093
53 = /g2359
```
- aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1720179
- in some pdfs the XFA array in the AcroForm dictionary doesn't contain an entry for 'datasets' (which contains saved data), so basically this patch allows overwriting the AcroForm dictionary with an updated XFA array when doing an incremental update.
- when some named nodes in the template don't have their counterpart in datasets we create some nodes: the main node mustn't belong to the datasets namespace because it doesn't make sense and Acrobat Reader isn't able to read a pdf with such nodes.
- so created nodes under a datasets node have a namespaceId set to -1, and consequently no namespace prefix will appear when they're serialized.
There's a fair number of regular expressions throughout the code-base which are slightly more verbose than strictly necessary, in particular:
- We have a lot of regular expressions that use `[0-9]` explicitly, and those can be simplified to use `\d` instead.
- We have one instance of a regular expression containing a `A-Za-z0-9_` sequence, which can be simplified to use `\w` instead.
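For example (illustrative patterns, not taken verbatim from the patch):
```js
// `[0-9]` can be written as `\d`:
/^[0-9]+$/  // before
/^\d+$/     // after, equivalent but terser

// ...and `A-Za-z0-9_` as `\w`:
/[A-Za-z0-9_]/  // before
/\w/            // after
```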
Despite its name, the fonts in the ItcSymbol-family are "regular" fonts and not Symbol ones. However, given that the font name contains the word "Symbol" we ended up picking the wrong code-path in the `Font.fallbackToSystemFont`-method.
*Please note:* While this patch ensures that the text becomes readable, by falling back to a standard font, the rendering will obviously not be perfect. However, that's the PDF generator's "fault" since non-embedded fonts cannot be guaranteed to render correctly in all environments.
Given that the Fetch API is normally being used now, these changes are probably less important now than they used to be. However, given that it's simple enough to implement this I figured why not just fix issue 9883 (better late than never I suppose).
Testing:
- delete the pdf file while the initial request is inflight
- delete the pdf file after the initial request has finished
Repeat for a small file and large file, exercising both one-off and
chunked transports.
This patch changes the `PDFDocumentLoadingTask.destroy`-method and the `_fetchDocument`-function to be `async`, which slightly simplifies the relevant code.
Furthermore, remove the catch-handler from the `WorkerTransport.getPageIndex`-method since it's no longer needed. Given that the `MessageHandler` is nowadays wrapping every possible Exception, it's no longer necessary to try and re-wrap the reason here.
While running the unit-tests with some logging statements added to this code, I noticed that `PasswordException` was missing from the list of potential Errors that could be passed to the `wrapReason` function.
Something that I *just* realized is that while PR 13854 fixed an issue as reported, it could still cause bugs in other similarly broken documents, since we'll not insert a matching endMarkedContent-operator in the operatorList.
Currently, in the `PartialEvaluator`, we only support Optional Content in Form-/XObjects. Hence this patch adds support for Image-/XObjects as well, which looks like a simple oversight in PR 12095 since the canvas-implementation already contains the necessary code to support this.
*This is done separately from the previous patch, to make it easier to revert these changes once they've been included in a couple of releases.*
Please note that because these two options are mutually exclusive, which is a large part of the reason for the previous patch, it's not guaranteed that the fallback-values will always be correct in every situation (but it's the best that we can do).
*This is a follow-up to PRs 13867 and 13899.*
This patch is tagged `api-minor` for the following reasons:
- It replaces the `renderInteractiveForms`/`includeAnnotationStorage`-options, in the `PDFPageProxy.render`-method, with the single `annotationMode`-option that controls which annotations are being rendered and how. Note that the old options were mutually exclusive, and setting both to `true` would result in undefined behaviour.
- For improved consistency in the API, the `annotationMode`-option will also work together with the `PDFPageProxy.getOperatorList`-method.
- It's now also possible to disable *all* annotation rendering in both the API and the Viewer, since the other changes meant that this could now be supported with a single added line on the worker-thread[1]; fixes 7282.
---
[1] Please note that in order to simplify the overall implementation, we'll purposely only support disabling of *all* annotations and that the option is being shared between the API and the Viewer. For any more "specialized" use-cases, where e.g. only some annotation-types are being rendered and/or the API and Viewer render different sets of annotations, that'll have to be handled in third-party implementations/forks of the PDF.js code-base.
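A usage sketch, assuming the `AnnotationMode` values exported by the library (with `pdfDocument`, `ctx` and `viewport` set up elsewhere):
```js
import { AnnotationMode } from "pdfjs-dist";

const page = await pdfDocument.getPage(1);
// Render form-annotations with their values taken from the current
// AnnotationStorage contents (the old `includeAnnotationStorage` case):
await page.render({
  canvasContext: ctx,
  viewport,
  annotationMode: AnnotationMode.ENABLE_STORAGE,
}).promise;
```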
Moves the logic out of TextLayerBuilder to handle
highlighting matches into a new separate class `TextHighlighter`
that can be used with regular PDFs and XFA PDFs.
To mimic the current find functionality in XFA, two arrays
from the XFA rendering are created to get the text content
and map those to DOM nodes.
Fixes#13878
This way there cannot be any *incorrect* cache hits, since Refs are guaranteed to be unique.
Please note that the reason for caching by Ref rather than doing something along the lines of the `localShadingPatternCache` (which uses a `Map` directly), is that TilingPatterns are streams and those cannot be cached on the `XRef`-instance (this way we avoid unnecessary parsing).
*This patch is very similar to the recently fixed `renderInteractiveForms`-options, see PR 13867.*
As far as I can tell, this *subtle* bug has existed ever since `AnnotationStorage`-support was first added in PR 12106 (a little over a year ago).
The value of the `includeAnnotationStorage`-option, as passed to the `PDFPageProxy.render` method, will (potentially) affect the size/content of the operatorList that's returned from the worker (for documents with forms).
Given that operatorLists will generally, unless they contain huge images, be cached in the API, repeated `PDFPageProxy.render` calls where the form-data has been changed by the user in between, can thus *wrongly* return a cached operatorList.
In the viewer we're only using the `includeAnnotationStorage`-option when printing, which is probably why this has gone unnoticed for so long. Note that we, for performance reasons, don't cache printing-operatorLists in the API.
However, there's nothing stopping an API-user from using the `includeAnnotationStorage`-option during "normal" rendering, which could thus result in *subtle* (and difficult to understand) rendering bugs.
In order to handle this, we need to know if the `AnnotationStorage`-instance has been updated since the last `PDFPageProxy.render` call. The most "correct" solution would obviously be to create a hash of the `AnnotationStorage` contents, however that would require adding a bunch of code, complexity, and runtime overhead.
Given that operatorList caching in the API doesn't have to be perfect[1], but only has to avoid *false* cache-hits, we can simplify things significantly by only keeping track of the last time that the `AnnotationStorage`-data was modified.
*Please note:* While working on this patch, I also noticed that the `renderInteractiveForms`- and `includeAnnotationStorage`-options in the `PDFPageProxy.render` method are mutually exclusive.[2]
Given that the various Annotation-related options in `PDFPageProxy.render` have been added at different times, this has unfortunately led to the current "messy" situation.[3]
---
[1] Note how we're already not caching operatorLists for pages with *huge* images, in order to save memory, hence there's no guarantee that operatorLists will always be cached.
[2] Setting both to `true` will result in undefined behaviour, since trying to insert `AnnotationStorage`-values into fields that are being excluded from the operatorList-building will obviously not work, which isn't at all clear from the documentation.
[3] My intention is to try and fix this in a follow-up PR, and I've got a WIP patch locally, however it will result in a number of API-observable changes.
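A rough sketch of that tracking idea (hypothetical names, including the `lastModified` token):
```js
// Only *false* cache-hits are harmful, hence it suffices to remember a
// modification token rather than hashing the AnnotationStorage contents.
class OperatorListCache {
  #operatorList = null;
  #storageToken = null;

  get(annotationStorage) {
    if (
      this.#operatorList &&
      annotationStorage.lastModified === this.#storageToken
    ) {
      return this.#operatorList; // Safe hit: storage unchanged since caching.
    }
    return null;
  }

  set(operatorList, annotationStorage) {
    this.#operatorList = operatorList;
    this.#storageToken = annotationStorage.lastModified;
  }
}
```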
At this point in time, all of the supported browsers (in the PDF.js project) have native `ReadableStream` implementations; see https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream#browser_compatibility
Hence the polyfill is *only* necessary in Node.js environments now, and we shouldn't need to do any detailed feature detection either (since that was only done for the non-Chromium versions of the MS Edge browser).
Finally, we can slightly reduce the size of the Chromium-extension since the polyfill shouldn't be needed there either.
By not adding any additional non-`Dict` entries to the list of candidates for merging of sub-dictionaries, we can very slightly reduce the amount of parsing required by not having to *again* iterate through unmergeable data.
Once we're finally able to get rid of SystemJS, which is unfortunately still blocked on [bug 1247687](https://bugzilla.mozilla.org/show_bug.cgi?id=1247687), we might also want to clean-up (or even completely remove) the `BaseException` abstraction and simply extend `Error` directly instead.
At that point we'd need to (explicitly) set the `name` on each class anyway, so this patch is essentially preparing for future clean-up. Furthermore, after the `BaseException` abstraction was added there have been *multiple* issues filed about third-party minification breaking our code, since `this.constructor.name` is not guaranteed to always do what you intended.
While hard-coding the strings indeed feels quite unfortunate, it's likely the "best" solution to avoid the problem described above.
Rather than caching only the *last* `PDFPageProxy.getAnnotations` call, and having to handle the intent separately, we can instead implement the caching in exactly the same way as done in the `PDFPageProxy.{render, getOperatorList}` methods.
This patch removes the only remaining closure in the `src/display/api.js` file, utilizing a similar approach as used in lots of other parts of the code-base, which results in a small decrease in the size of the *build* `pdf.js` file.
Given that `PDFWorker` is exposed through the *public* API, this complicates things somewhat since there's a couple of worker-related properties that really should stay *private*. Initially, while working on PR 13813, I believed that we'd need support for private (static) class fields in order to get rid of this closure, however I've managed to come up with what's hopefully deemed an acceptable work-around here.
Furthermore, some helper functions were simply moved into the `PDFWorker` class as static methods, thus simplifying the overall implementation (e.g. we don't need to manually cache the Promise in the `PDFWorker._setupFakeWorkerGlobal`-method).
Finally, as part of this re-factoring a number of missing JSDoc-comments were added which *together* with the removal of the closure significantly improves the `gulp jsdoc` output for the `PDFWorker` class.
*Please note:* This patch is tagged with `api-minor` since it deprecates `PDFWorker.getWorkerSrc()` in favor of the shorter `PDFWorker.workerSrc`, with the fallback limited to `GENERIC` builds.
The value of the `renderInteractiveForms` parameter, as passed to the `PDFPageProxy.render` method, will (potentially) affect the size/content of the operatorList that's returned from the worker (for documents with forms).
Given that operatorLists will generally, unless they contain huge images, be cached in the API, repeated `PDFPageProxy.render` calls that *only* change the `renderInteractiveForms` parameter can thus return an incorrect operatorList.
As far as I can tell, this *subtle* bug has existed ever since `renderInteractiveForms`-support was first added in PR 7633 (which is almost five years ago).
With the previous patch, fixing this is now really simple by "encoding" the `renderInteractiveForms` parameter in the *internal* renderingIntent handling.
With the changes made in PR 13746 the *internal* renderingIntent handling became somewhat "messy", since we're now having to do string-matching in various spots in order to handle the "oplist"-intent correctly.
Hence this patch, which implements the idea from PR 13746 to convert the `intent`-strings, used in various API-methods, into an *internal* renderingIntent that's implemented using a bit-field instead. *Please note:* This part of the patch, in itself, does *not* change the public API (but see below).
This patch is tagged `api-minor` for the following reasons:
1. It changes the *default* value for the `intent` parameter, in the `PDFPageProxy.getAnnotations` method, to "display" in order to be consistent across the API.
2. In order to get *all* annotations, with the `PDFPageProxy.getAnnotations` method, you now need to explicitly set "any" as the `intent` parameter.
3. The `PDFPageProxy.getOperatorList` method will now also support the new "any" intent, to allow accessing the operatorList of all annotations (limited to those types that have one).
4. Finally, for consistency across the API, the `PDFPageProxy.render` method also supports the new "any" intent (although I'm not sure how useful that'll be).
Points 1 and 2 above are the significant, and thus breaking, changes in *default* behaviour here. However, unfortunately I cannot see a good way to improve the overall API while also keeping `PDFPageProxy.getAnnotations` unchanged.
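To illustrate the bit-field idea mentioned above (names and values are illustrative only, not the actual implementation):
```js
const RenderingIntentFlag = {
  ANY: 0x01,
  DISPLAY: 0x02,
  PRINT: 0x04,
  OPLIST: 0x100, // Internal-only, never part of the public `intent` strings.
};

function toRenderingIntent(intent, isOpList = false) {
  let renderingIntent;
  switch (intent) {
    case "any":
      renderingIntent = RenderingIntentFlag.ANY;
      break;
    case "print":
      renderingIntent = RenderingIntentFlag.PRINT;
      break;
    default:
      renderingIntent = RenderingIntentFlag.DISPLAY;
  }
  if (isOpList) {
    // A bit-wise OR replaces the old "oplist"-string matching.
    renderingIntent |= RenderingIntentFlag.OPLIST;
  }
  return renderingIntent;
}
```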
Given how trivial the `isEOF` function is, we can simply inline the check at the various call-sites and remove the function (which ought to be ever so slightly more efficient as well).
Furthermore, this patch also changes the `EOF` primitive itself to a `Symbol` instead of an Object since that has the nice benefit of making it unclonable (thus preventing *accidentally* trying to send `EOF` from the worker-thread).
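The gist of the change, as a sketch (with a hypothetical `lexer`):
```js
// A Symbol cannot be structured-cloned, so accidentally posting EOF
// from the worker-thread now throws instead of silently succeeding.
const EOF = Symbol("EOF");

function parseObjects(lexer) {
  const objects = [];
  let obj;
  // The former `isEOF(obj)` call is simply inlined as a strict equality check.
  while ((obj = lexer.getObj()) !== EOF) {
    objects.push(obj);
  }
  return objects;
}
```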
Given that the GitHub Advanced Security workflow now covers everything that LGTM does, but generally faster and with better GitHub-integration, there's no longer much point in also running LGTM separately.
As a follow-up to this patch, we should also disable/remove the LGTM-integration from the PDF.js repository.
While fixing issue 13794, I noticed that cancelling the `ReadableStream` returned by the `PDFPageProxy.streamTextContent`-method could lead to "Uncaught promise" messages in the console.[1]
Generally speaking, we don't really care about errors when *cancelling* a `ReadableStream` and it thus seems reasonable to simply suppress any output in those cases.
---
[1] Although, after that issue was fixed you'd now need to set the API-option `stopAtErrors = true` to actually trigger this.
This patch utilizes the same approach as used in lots of other parts of the code-base, which thus *slightly* reduces the size of this code.
By removing some of the (current) indirection, we can also simplify the JSDocs a little bit. Looking at the `gulp jsdoc` output, this actually seem to *improve* the documentation for this class.
The PDF in bug 1721949 uses many unique pattern objects that reference the same shading many times. This caused a new canvas pattern to be created and cached many times, driving up memory use.
To fix this, I've changed the cache in the worker to key off the shading object and instead send the shading and matrix separately. While that worked well to fix the above bug, there could be PDFs that use many shadings which could cause memory issues, so I've also added an LRU cache on the main thread for canvas patterns. This should prevent memory use from getting too high.
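A rough sketch of such an LRU cache (hypothetical, relying on `Map`'s insertion order):
```js
class LRUCache {
  constructor(maxSize = 10) {
    this._cache = new Map();
    this._maxSize = maxSize;
  }

  get(key) {
    const value = this._cache.get(key);
    if (value !== undefined) {
      // Re-insert the entry, to mark it as most-recently-used.
      this._cache.delete(key);
      this._cache.set(key, value);
    }
    return value;
  }

  set(key, value) {
    if (this._cache.size >= this._maxSize) {
      // Evict the least-recently-used entry, i.e. the first Map key.
      this._cache.delete(this._cache.keys().next().value);
    }
    this._cache.set(key, value);
  }
}
```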
- All the scale factors for the substitution fonts were wrong because of different glyph positions between Liberation and the other ones:
- regenerate all the factors
- Text may contain e.g. Polish characters, in which case the glyph widths were wrong:
- treat the substitution font as a composite one
- add a glyphIndex-to-unicode map for Liberation in order to generate the width array for the cid font
- In order to better compute text field sizes, use the line height with no gaps (and consequently guessed heights for text are slightly better in general).
- Fix the default background color in fields.
For code that's part of the core library, rather than e.g. the `web/`-folder, we should always be careful about *directly* accessing any DOM methods.
The `navigator` is one such structure, which shouldn't be assumed to always be available and we should thus check that it's actually present.[1]
Hence this patch re-factors the `navigator.platform` access, in the `AnnotationLayer`-code, to ensure that it's generally safe. Furthermore, to reduce unnecessary repeated string-matching to determine the current platform, we're now using a shadowed getter which is evaluated only once instead (at first access).
---
[1] Note e.g. the `isSyncFontLoadingSupported` getter, in the `src/display/font_loader.js` file.
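A sketch of the pattern (the `shadow` helper mirrors the one in `src/shared/util.js`; the class/property names here are made up):
```js
function shadow(obj, prop, value) {
  // Replace the getter with a plain data property, so the computation
  // below only ever runs on the *first* access.
  Object.defineProperty(obj, prop, {
    value,
    enumerable: true,
    configurable: true,
    writable: false,
  });
  return value;
}

class PlatformHelper {
  static get isWindows() {
    // Never assume that `navigator` exists in the current environment.
    const platform =
      typeof navigator !== "undefined" && typeof navigator.platform === "string"
        ? navigator.platform
        : "";
    return shadow(this, "isWindows", platform.includes("Win"));
  }
}
```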
- an Image element was created, attached to its parent but the $globalData property was not set and that led to an error.
- the pdf in bug 1722029 has 27 rendered rows (checked in Acrobat) when only one was displayed: this patch fixes some binding issues around the occur element.
This patch makes use of the existing `ignoreErrors` option, thus allowing a page to continue parsing/rendering even if (some of) its sub-streams are corrupt. Obviously this may cause *part* of a page to be broken/missing, however it should be better than (potentially) rendering nothing.
Also, to the best of my knowledge, this is the first bug of its kind that we've encountered.
To avoid having to pass in a bunch of, for a `BaseStream`-instance, mostly unrelated parameters when initializing a `StreamsSequenceStream`-instance, I settled on utilizing a callback function instead to allow conditional Error-suppression.
Note that the `StreamsSequenceStream`-class is a *special* stream-implementation that we only use when the `/Contents`-entry, in the `/Page`-dictionary, consists of an Array with streams.
There's no good reason, as far as I can tell, to explicitly define a bunch of methods to be `undefined`, which the current unconditional "copying" of methods will do.
Note that ~23 percent of the `OPS` don't, for various reasons, have an associated method on the `CanvasGraphics.prototype`.
For e.g. the `gulp mozcentral` command, the *built* `pdf.js` file decreases from `304 607` to `301 295` bytes with this patch. The improvement comes mostly from having less overall indentation in the code.
Apparently I didn't put one of the disable comments on the *correct* line, since I didn't read the instructions carefully enough, so let's try again.
Note that, most unfortunately, disabling of warnings isn't applied until *after* a patch has been merged.
Most of the warnings we don't really care about, and those are simply white-listed using inline comments; however two cases prompted actual code changes:
- In `src/display/pattern_helper.js` the branch in question is indeed unreachable, and should thus be safe to remove. (This code originated in PR 4192, which is now over seven years ago.)
- In `test/test.js`, the function in question indeed doesn't accept any arguments. (The patch also re-formats a string just above, which didn't seem worthy of a separate patch.)
This now leaves only *one* warning in the LGTM report, however that one is a false positive that we'll need to report upstream.
In cases where even the very *first* attempt at reading from an object will throw, simply ignoring such objects will help improve rendering of *some* corrupt documents.
Note that this will lead to more parsing in some cases, but considering that this only applies to *corrupt* documents that shouldn't be a big deal.
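The recovery idea, sketched with a hypothetical parser interface (assuming the parser advances past a broken object):
```js
function readObjects(parser) {
  const objects = [];
  while (!parser.done) {
    let obj;
    try {
      obj = parser.getObj();
    } catch (ex) {
      console.warn(`readObjects - ignoring corrupt object: "${ex}".`);
      continue; // Skip this object, but keep parsing the rest.
    }
    objects.push(obj);
  }
  return objects;
}
```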
Bug 1721218 has a shading pattern that was used thousands of times.
To improve performance of this PDF:
- add a cache for patterns in the evaluator and only send the IR form once
to the main thread (this also makes caching in canvas easier)
- cache the created canvas radial/axial patterns
- for shading fill radial/axial use the pattern directly instead of creating temporary
canvas
Currently the `!mergeSubDicts` code-path is essentially just duplicated code, which we can easily avoid by simply moving that check. (This may lead to ever so slightly more parsing for this case, but the difference ought to be negligible in practice.)
- in real life some xfa contain xml like <xfa:data><xfa:Foo><xfa:Bar>...</xfa:data>; since there are no Foo or Bar elements in the xfa namespace, the JS representations are empty and that leads to errors.
- so the idea is to make all nodes under xfa:data namespace-agnostic, which means that namespaces are removed from nodes in the parser, but only for xfa:data descendants.
*Please note:* The PDF document in issue 13751 is *dynamically* created (in e.g. Adobe Reader), with pages added when certain buttons are clicked, hence this patch simply fixes the breaking error and nothing more.
It looks like the current code contains a little bit too much copy-and-paste from the *similar* `index` branch above, since we cannot set the `startIndex` to a negative value. Note how it's being used to initialize the loop-variable, which is then used to lookup values in an Array and accessing the `-1`th element of an Array obviously makes no sense.
*This is yet another case where I've got no idea if the patch is correct, but it does at least fix a breaking error :-)*
Note how in the [`Binder._bindValue` method](683ce66a48/src/core/xfa/bind.js (L92-L93)), we're assuming that if a `data`-value exists then it'll also be possible to actually access it. For the `XFAAttribute`-implementation however, the second method is missing and that's what causes the breaking errors in issue 13748.
Please note that another possible way of "fixing" the error would've been to simply change the exists-check to return `false`, and I could see that being a preferred solution.
However, the reason for submitting the current patch is that we get *fewer* warnings about Nodes with mis-matched types this way.
As can be seen in the code (see below), the `searchNode` helper function will return `null` in some cases and all of its call-sites should protect against that before attempting to access the returned data.
While only one of these changes were necessary to fix the breaking errors in issue 13756, in order to prevent future bugs I've added similar defensive code throughout this file.
- 07955fa1d3/src/core/xfa/som.js (L169)
- 07955fa1d3/src/core/xfa/som.js (L239)
- 07955fa1d3/src/core/xfa/som.js (L254)
For e.g. `gulp mozcentral`, the *built* `pdf.worker.js` file decreases from `1 837 608` to `1 834 907` bytes with this patch-series.
The improvement comes first of all from less overall indentation in `PDFFunction`, and secondly from the removal of (now) unnecessary indirection in the code.
*This follows the exact same principle as PR 12083, but for the `PDFFunction` parsing instead.*
Given that the IR format is completely unused now, all that the current code does is add a bunch of unnecessary indirection/overhead to the handling of PDF-functions.
With this patch, the `PDFPageProxy.getOperatorList` method will now return `PDFOperatorList`-instances that also include Annotation-operatorLists (when those exist). Hence this closes a small, but potentially confusing, gap between the `render` and `getOperatorList` methods.
Previously we've been somewhat reluctant to do this, as explained below, but given that there's actual use-cases where it's required probably means that we'll *have* to implement it now.
Since we still need the ability to separate "normal" rendering operations from direct `getOperatorList` calls in the worker-thread, this API-change unfortunately causes the *internal* renderingIntent to become a bit "messy" which is indeed unfortunate (note the `"oplist-"` strings in various spots). As-is I suppose that it's not all that bad, but we may want to consider changing the *internal* renderingIntent to e.g. a bitfield in the future.
Besides fixing issue 13704, this patch would also be necessary if someone ever tries to implement e.g. issue 10165 (since currently `PDFPageProxy.getOperatorList` doesn't include Annotation-operatorLists).
*Please note:* This patch is *also* tagged "api-minor" for a second reason, which is that we're now including the Annotation-id in the `beginAnnotation` argument. The reason for this is to allow correlating the Annotation-data returned by `PDFPageProxy.getAnnotations`, with its corresponding operatorList-data (for those Annotations that have it).
When working on PR 13612, I mostly prioritized a simple solution that didn't require touching a lot of code. However, while working on PR 13735 I started to realize that the static `Name.empty` construction really wasn't a good idea.
In particular, having a special `Name`-instance where the `name`-property isn't actually a String is confusing (to put it mildly) and can easily lead to issues elsewhere. The only reason for not simply allowing the `name`-property to be an *empty* string, in PR 13612, was to avoid having to touch a lot of existing code. However, it turns out that this is only limited to a few methods in the `PartialEvaluator` and a few of the `BaseLocalCache`-implementations, all of which can be easily re-factored to handle *empty* `Name`-instances.
All-in-all, I think that this patch is even an *overall* improvement since we're now validating (what should always be) `Name`-data better in the `PartialEvaluator`.
This is what I ought to have done from the start, sorry about the code churn here!
*Please note:* This patch doesn't fix rendering of (various) patterns in browsers/environments without full `CanvasPattern.setTransform()` support, but it at least prevents outright failures and thus allows the rest of the page to render.
This patch provides a temporary work-around for Firefox 78 ESR[1], and for Node.js environments (see issue 13724), where rendering is currently completely broken.
---
[1] Please note that the `createMatrix` helper function doesn't actually work as intended. The reason is that it's not `DOMMatrix` itself which is unsupported in older Firefox versions, but rather calling `CanvasPattern.setTransform(...)` with a `DOMMatrix`-argument.
Furthermore, the `createSVGMatrix` fallback won't actually help either since that method doesn't accept any parameters and would thus require *manually* specifying the matrix-state; see e.g. https://developer.mozilla.org/en-US/docs/Web/API/CanvasPattern/setTransform#examples
Finally, given that it's less than a month to the [Firefox 91 ESR release](https://wiki.mozilla.org/RapidRelease/Calendar) and that as-is all patterns are completely broken e.g. when using the latest viewer in Firefox 78 ESR, I'm just not convinced that it's worth the "hassle" of providing a more proper work-around.
This way we ensure that these BBox values are *always* defined as expected for every `case`-block, and we also don't need to duplicate the lookup in multiple places. (Also, the patch removes a couple of unnecessary line-breaks in existing comments.)
Fixes https://github.com/mozilla/pdf.js/pull/13691#pullrequestreview-702356627, which was flagged by LGTM.
- font line height is taken into account by Acrobat when it isn't with masterpdfeditor: I extracted a font from a pdf, modified some ascent/descent properties thanks to ttx and then reinjected the font into the pdf: only Acrobat takes it into account. So in this patch, line heights for some substituted fonts are added.
- it seems that Acrobat is using a line height of 1.2 when the line height in the font is not enough (it's the only way I found to correctly fix bug 1718741).
- don't use flex in the wrapper container (which was causing a horizontal overflow in the above bug).
- consequently, the above fixes introduced a lot of small regressions, so in order to see real improvements on reftests, I fixed the regressions in this patch:
- replace margin by padding in some cases where padding is a part of a container's dimensions;
- remove some flex displays: some containers are wrongly sized when rendered;
- set letter-spacing to 0.01px: it helps to make sure that text is not broken because of insufficient width in Firefox.
With the changes in PR 13687 we're now checking if `target` is defined *twice* in a row, which shouldn't be necessary :-)
(I noticed this when glancing at the unofficial LGTM results; maybe we should re-evaluate the decision to not integrate that into the CI.)
When XFA support was added, the size of the *built* `pdf.worker.js` file increased quite a bit. Hence I think that it makes sense to, where easily possible, do what we can to (slightly) reduce the size of the PDF.js library.
The supplemental font data files (added for XFA rendering), containing rescale-factors respectively widths, seem like an excellent candidate here since they're not particularly large in either line-count or file sizes.
In this patch these files are instead merged into a *single* file per font, rather than four different ones, and even with these changes the resulting source files don't become all that large.[1]
For e.g. the `gulp mozcentral` build, this reduces the size of the *built* `pdf.worker.js` file by more than `3 kB`. Given the overall simplicity of the patch, that kind of size decrease definitely seem worthwhile to me.
---
[1] Especially when compared to truly large files such as e.g. `glyphlist.js`, `metrics.js`, and `unicode.js`.
Previously, when we filled image masks we didn't copy over the current transformation, which caused patterns to be misaligned when painted. Now we create a temporary canvas with the mask, have the transform copied over, and offset it relative to where the mask would be painted. We also weren't properly offsetting tiling patterns. This isn't usually noticeable since patterns repeat, but in the case of #13561 the pattern is only drawn once and has to be in the correct position to line up with the mask image.
These fixes broke #11473, but highlighted that we were drawing that correctly by accident and not correctly handling negative bounding boxes on tiling patterns.
Fixes #6297, #13561, #13441
Partially fixes #1344 (still blurry but boxes are in correct position now)
- Fix a typo in order to open the pdf in issue #13679
- After fixing the fill default color there were some regressions because of z-index, and when fixing z-index there were some regressions because of borders
- So fix the borders rendering.
The PDF.js API has only ever supported accessing the original file ID, however the second one that (should) exist in *modified* documents has thus far been completely inaccessible through the API.
That seems like a simple oversight, caused e.g. by the viewer not needing it, since it really shouldn't hurt to provide API-users with the ability to check if a PDF document has been modified since its creation.[1]
Please refer to https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G13.2261661 for additional information.
For an example of how to update existing code to use the new API, please see the changes in the `web/app.js` file included in this patch.
*Please note:* While I'm not sure if we'll ever be able to remove the old `PDFDocumentProxy.fingerprint` getter, given that it's existed since "forever", that probably isn't a big deal given that it's now limited to only `GENERIC`-builds.
---
[1] Although this obviously depends on the PDF software following the specification, by updating the second file ID as intended.
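A usage sketch, assuming the replacement getter is the plural `PDFDocumentProxy.fingerprints` returning a pair of IDs:
```js
// The first entry is the original ID; the second one is the "modified"
// ID, expected to be `null` when the document is unchanged since creation.
const [originalId, modifiedId] = pdfDocument.fingerprints;

if (modifiedId) {
  console.log(`Document modified after creation; current ID: ${modifiedId}`);
}
```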
Please refer to https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Structured_clone_algorithm
Based on that information, and manually testing our code, the implementation in `cloneValue` has the following shortcomings:
- Attempting to clone `function`s is only prevented when they're part of an Object, but is currently allowed when they occur standalone.
- Cloning of `Symbol`s is currently not prevented, which it should be since the native structured clone algorithm throws.
- Any disallowed types should be checked first, to reduce the risk of future changes accidentally allowing something that shouldn't be supported.
Using `instanceof Object` is generally problematic, since it's not guaranteed to always do the right thing for all Objects.
(I stumbled upon this while working on another patch, when I noticed that the `outlineView` was broken with workers disabled.)
- support paragraph margins, line height, letter spacing, ...
- compute missing field dimensions, based mostly on the dimensions of the caption contents.
- it aims to fix #13583;
- fix the switch to breakBefore target;
- force the layout of an unsplittable element on an empty page;
- don't fail when there is horizontal overflow (except in lr-tb);
- handle correctly overflow in the same content area (bug 1717805, bug 1717668);
- fix a typo in radial gradient first argument.
- some pdfs use fonts which are not embedded, or don't have any width array, or don't have any css info (e.g. for standard fonts such as Arial);
- so add widths arrays for the Liberation fonts in order to compute the ones for other fonts by using scale factor arrays.
- the break element has been deprecated in XFA 2.4 but some old documents can use it, so replace it with one (or more) of its possible substitutions:
- breakBefore;
- breakAfter;
- overflow.
- when binding (after parsing) we get a map between some template nodes and some data nodes;
- so set user data in input handlers by using data node uids in the annotation storage;
- to save the form, just put the value we have in the storage in the correct data nodes, serialize the xml as a string and then write the string at the end of the pdf using src/core/writer.js;
- fix few bugs around data bindings:
- the "Off" issue in Bug 1716980.
According to a comment in `readCmapTable`, we're assuming that the cmap tables (when more than one exist) are sorted in ascending order. If that's not the case, keep checking the following cmap tables in order to fix the referenced issue.
- when the CSS line-height property is set to 'normal' then the value depends on the user agent. So use a line height based on the font itself, and if for any reason this value is not available use 1.2 as the default.
- it's a partial fix for https://bugzilla.mozilla.org/show_bug.cgi?id=1717681.
This patch provides an overall simpler *and* more consistent way of handling the `viewport` parameter during printing of XFA forms, since it's now again guaranteed to always be an instance of `PageViewport`.
Furthermore, for anyone attempting to e.g. implement custom printing of XFA forms this probably cannot hurt either.
Apparently some really bad PDF software can create documents with *empty* `Name`-entries, which we thus need to somehow deal with.
While I don't know if this patch is necessarily the best solution, it should at least ensure that the *empty* `Name`-instance cannot accidentally match a proper `Name`-instance (and it doesn't require changes to a lot of existing code).[1]
---
[1] I briefly considered using a `Symbol` rather than an Object, but quickly decided against that since the former one [is not clonable](https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Structured_clone_algorithm#supported_types) and `Name`-instances may be sent to the API.
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1716838.
- some fonts in the pdf in the bug were bold when they shouldn't be, so write the font properties in the html to avoid using some wrong inherited ones.
- partial fix for https://bugzilla.mozilla.org/show_bug.cgi?id=1716980;
- some pdfs can contain an invalid font family (e.g. 'Windings 3'), so in this case remove the space;
- the font family in typeface attribute doesn't always match the one defined in the FontDescriptor dictionary.
Given that we're not imposing any font-type restrictions[1] in the non-/FontDescriptor case, it's not really clear to me why we'd actually need to do that in the general case.
Please note that there's some *expected* movement, all of which should be improvements, in the `fips197.pdf` file with this patch.
---
[1] With the exception of Type3-fonts, of course.
*Sorry about the churn here, since the change that I made in PR 13516 was not very smart.*
With the current code, it's now *impossible* for a user to actually control the `useSystemFonts` option manually. To prevent outright breakage we obviously still need to default to setting `useSystemFonts = false` when `disableFontFace === true`, however that should be possible for an API consumer to override.
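A sketch of the intended behaviour (option names as in the API):
```js
import { getDocument } from "pdfjs-dist";

const loadingTask = getDocument({
  url: "document.pdf",
  disableFontFace: true,
  // Without an explicit value, `disableFontFace === true` would make the
  // API fall back to `useSystemFonts = false`; a consumer can now override:
  useSystemFonts: true,
});
const pdfDocument = await loadingTask.promise;
```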
- PR #13554 is buggy, so this patch aims to fix bugs.
- check if a component fits into its parent, taking into account the parent layout.
- introduce the method isSplittable for template nodes, to know if a component can be split in case of overflow.
- some containers don't always have their 2 dimensions, and those dimensions are based on their contents;
- so in order to measure text, we must get the glyph widths (for the xfa fonts) before starting the layout;
- implement a word-wrap algorithm;
- handle font change during text layout.
Note, this only really fixes Radial/Axial shading patterns with masks.
I'm guessing tiling patterns and mesh patterns would also be broken
if applied like the test pdf. Hopefully I'll have some time to make
test cases for the other shadings.
Fixes #13372
- and fix a few bugs:
- avoid an infinite loop when laying out the document;
- avoid confusion between break and layout failure;
- don't add margin width in tb layout when getting available space.
Rather than having to create and check a *separate* `ToUnicodeMap` to handle these cases, we can simply use the `fallbackToUnicode`-data (when it exists) to directly supplement *missing* /ToUnicode entries in the regular `ToUnicodeMap` instead.
Note that all standard Encodings have the same length (i.e. `256` elements) and that missing entries are always represented by empty strings, hence why a separate exists-check isn't necessary in the `baseEncoding` case.
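A sketch of the supplementing step (`toUnicodeMap`, `baseEncoding` and `glyphNameToUnicode` are hypothetical stand-ins):
```js
function supplementToUnicode(toUnicodeMap, baseEncoding, glyphNameToUnicode) {
  // Standard Encodings always have 256 elements, with missing entries being
  // empty strings, hence no separate exists-check is needed for the encoding.
  for (let charCode = 0; charCode < 256; charCode++) {
    if (toUnicodeMap.has(charCode)) {
      continue; // Never overwrite an existing /ToUnicode entry.
    }
    const glyphName = baseEncoding[charCode]; // "" marks a missing entry.
    if (glyphName !== "") {
      toUnicodeMap.set(charCode, glyphNameToUnicode(glyphName));
    }
  }
}
```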
*This is somewhat similar to the recent changes, in PR 13277, for fonts with an /Encoding entry.*
Currently we're *completely* ignoring the `builtInEncoding`, from the font data itself, for fonts which have a built-in /ToUnicode map.
While it (obviously) doesn't seem like a good idea in general to simply overwrite existing built-in /ToUnicode entries, it should however not hurt to use the `builtInEncoding` to supplement *missing* /ToUnicode entries.
- it aims to avoid looping forever when opening the pdf in #13213;
- the idea is to consider subformSet as nonexistent when walking the tree. So if we've got subformA > subformSet > subformB, then subformB will be visited as a direct child of subformA.
This is first of all consistent with all of the other (similar) factories, and secondly it will also simplify a future addition of a corresponding `NodeSVGFactory` (if that's ever deemed necessary).
Given that `DOMMatrix` is, unsurprisingly, not supported in Node.js the `createMatrix` helper function in `src/display/pattern_helper.js` is most likely broken in Node.js environments. It will obviously try to fallback to the `DOMSVGFactory`, however that isn't intended for Node.js usage and errors will be thrown.
Rather than trying to implement a `NodeSVGFactory`, this patch takes the easier route of just adding a `DOMMatrix` polyfill using: https://www.npmjs.com/package/dommatrix
This isn't done only for simplicity, but it'll become necessary anyway since the `createMatrix` helper function is only temporary and will be removed in the future.
- a checkbox or radio button doesn't have to be rescaled when the container is large, so give the extra space to the caption to avoid some word wrapping.
- when the caption is on the right, put the ui on the left as the first element, and consequently remove the flex:row-reverse stuff.
We must force-fetch standard font data, when `disableFontFace = true` is set in the API, since otherwise rendering in e.g. the viewer is still broken (same as before PR 12726 landed).
*Please note:* We still need to also load standard font data for patterns and/or some text-rendering modes, however that will require larger changes so I figured that it cannot hurt to submit *this* patch right now.
*This implementation is basically a copy of the pre-existing `builtInCMapCache` implementation.*
For some, badly generated, PDF documents it's possible that we'll end up having to fetch the *same* standard font data over and over (which is obviously inefficient).
While not common, it's certainly possible that a PDF document uses *custom* font names where the actual font then references one of the standard fonts; see e.g. issue 11399 for one such example.
Note that I did suggest adding worker-thread caching of standard font data in PR 12726, however it wasn't deemed necessary at the time. Now that we have a real-world example that benefit from caching, I think that we should simply implement this now.
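A sketch of the caching, mirroring the `builtInCMapCache` approach (the factory's fetch call is illustrative):
```js
const standardFontDataCache = new Map();

async function fetchStandardFontData(filename, factory) {
  let data = standardFontDataCache.get(filename);
  if (!data) {
    // Only fetch the font data the *first* time a document requests it;
    // badly generated PDFs may otherwise do so over and over.
    data = await factory.fetch({ filename });
    standardFontDataCache.set(filename, data);
  }
  return data;
}
```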
- Some js files contain scale factors for each glyph in order to rescale Liberation to have a final font with the correct width.
- A lot of XFA documents have containers whose dimensions are based on their text content, so using the default font from the browser can lead to an almost unreadable pdf.
- a lot of xfa files are using Myriad Pro or Arial fonts without embedding them, and some containers have dimensions based on those font metrics. So not having the exact same font leads to a wrong display.
- since it's pretty hard to find a replacement font with the exact same metrics, this patch gives the possibility to read the glyf table, rescale each glyph and then write a new table.
- so once PR #12726 is merged we could for example rescale Helvetica to replace Myriad Pro.
Given that there's no fallback on the worker-thread, it shouldn't be necessary to initialize `CMapReaderFactory`/`StandardFontDataFactory` when `useWorkerFetch = true` is set.
Slightly unrelated, but this patch also ensures that the `useSystemFonts` default value only does the `isNodeJS` check in builds where that's actually necessary.
At this point in time, the `apiCompatibilityParams` is essentially unused with the sole exception of the `disableFontFace` handling for Node.js environments.
Given that `isNodeJS` is a constant now (originally it was a function), we can simply set the correct fallback value for `disableFontFace` directly in the API and clean-up the code a bit here.
This patch uses the new option added in PR 12726 to *also* allow fetching binary CMap data directly in the worker-thread in browsers.
Given that these changes remove the need to transfer data between threads for the default (browser) use-case, we can also revert the changes in PR 11118 since that simplifies the overall implementation.
Given that these factories are being used in *different* files, for Browser respectively Node.js implementations, it seems reasonable to move them into their own file instead.
- some elements weren't displayed because their rotation angle was not taken into account;
- fix box model (XFA concept):
- remove use of outline;
- position correctly border which isn't part of box dimensions;
- fix margins issues (see issue #13474).
- move border on button instead of having it on wrapping div;
Generally, in the `src/display/` folder, we utilize `DOMSVGFactory` rather than manually creating an SVG-element; hence let's do the same thing in `src/display/pattern_helper.js` as well.
While this prevents the error which is currently thrown by the `assert` in the `DOMSVGFactory.create` method, the pattern still doesn't actually render (visibly). However, in the interest of getting rid of some open issues, this patch should make (some) sense and there are already other issues about patterns in the SVG-backend.
Given that, as clearly [outlined in the FAQ](https://github.com/mozilla/pdf.js/wiki/Frequently-Asked-Questions#backends), the SVG-backend is *not* officially supported and that there's currently no development of it; this is probably the most that is reasonable to do here.
This is necessary now, since with the previous patch the /FontBBox potentially depends on the contents of the /CharProcs-streams.
Note that if `getOperatorList` is called *before* `getTextContent`, this patch doesn't matter since the font is already fully loaded/parsed. However, for e.g. the `text` test-cases this is necessary to ensure correct reference images.
For Type3 fonts where the /CharProcs-streams of the individual glyph starts with a `d1` operator, we can use that to build a fallback bounding box for the font and thus improve text-selection in some cases.
While these objects aren't exactly that big and/or complex, they are nonetheless *only* necessary for XFA documents.
However, currently these objects are initialized *eagerly* for all PDF documents. By using the same pattern as elsewhere in the code-base, it's very easy to make these lazily initialized; so let's just do that instead :-)
- attribute 'use' was already implemented but not usehref
- in general, usehref should make reference to current document
- add support for SOM expressions in use and usehref to search a node.
- get prototype for all nodes if any.
For HighlightAnnotations with a built-in appearance stream, we still rely on it to specify the opacity correctly via a suitable blend mode. However, if the Annotation-drawing operators are placed *within* a /XObject of the /Form-type, the /ExtGState won't apply to the final rendering and the result is that the highlighting obscures the underlying text.
The more *correct* and general solution would likely be to somehow modify the implementation in `src/display/canvas.js`, to special-case handling of /Form-type /XObjects when rendering Annotations. Since we can very easily work-around this problem for now by using the "no appearance stream" code-path, doing *something* here ought to be preferable.
This patch is (obviously) merely a work-around, but given that the referenced issue is (as far as I know) the first case we've seen of this problem a simple solution will hopefully suffice for now.
This fixes the colours, by respecting the strokeAlpha/fillAlpha-values, for a couple of Annotations in the PDF document from issue 13447.[1]
---
[1] Some of the annotations still won't render at all, when compared with Adobe Reader, but that could/should probably be handled separately.
- for xfa rendering, fonts are loaded and used in html;
- when printed and saved as pdf, on Linux, Firefox uses the cairo backend;
- when subsetting a font, cairo uses the font's postscript name, and when that name is empty it leads to a bug
(the append at 63f0d62684/src/cairo-cff-subset.c (L2049) is failing because of null length);
- so this patch adds a postscript name to the font to make cairo happy.
Currently `charsCache` is initialized *lazily*, which considering that it just contains a simple `Object` doesn't seem entirely necessary. This first of all forces us to do repeated exists-checks in the `Font.charsToGlyphs` method, and secondly the similar/related `glyphCache` is already initialized eagerly.
Furthermore, this patch also does a bit of clean-up in the `Font.charsToGlyphs` method since this code is quite old.
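The gist of the change, sketched (details elided):
```js
class Font {
  constructor(/* ... */) {
    // Both caches are now eagerly initialized, which removes the repeated
    // exists-checks that lazy initialization forced upon charsToGlyphs().
    this._charsCache = Object.create(null);
    this._glyphCache = Object.create(null);
  }

  charsToGlyphs(chars) {
    let glyphs = this._charsCache[chars];
    if (glyphs) {
      return glyphs;
    }
    glyphs = []; // ...build the glyph Array for `chars` here...
    this._charsCache[chars] = glyphs;
    return glyphs;
  }
}
```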
- the only goal of this patch is to be able to get the fake html synchronously when printing from Firefox:
- in order to print we need to inject some html in the beforeprint callback, but we cannot block while waiting for all the pages.
- from a memory point of view it doesn't change anything, since the fake HTML is deleted in the worker;
- this way we don't break any assumptions.
- I thought it was possible to rely on the browser layout engine to handle layout stuff, but it isn't possible:
- mainly because when a contentArea overflows, we must continue to layout in the next contentArea;
- when no more contentArea is available then we must go to the next page...
- we must handle breakBefore and breakAfter, which allow us to "break" the layout and go to the next container;
- Sometimes some containers don't provide their dimensions, so we must compute them in order to know where to put them in their parents, but to compute those dimensions we need to layout the container itself...
- See the top of the file layout.js for more explanations about layout.
- fix a few bugs in other places I met during my work on layout.
Given that `URL`s aren't supported by the structured clone algorithm, see https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Structured_clone_algorithm, the document in issue 1773 will cause the browser to throw `DataCloneError: The object could not be cloned.`-errors and nothing will render.
To fix this, we'll instead simply send the stringified version of the `URL` to prevent these errors from occurring.
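The work-around, in short (the message shape is illustrative):
```js
const url = new URL("https://example.com/doc.pdf");

// `URL` instances aren't supported by the structured clone algorithm,
// hence send the string representation instead of the object itself:
worker.postMessage({ docBaseUrl: url.href });
```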
The building of glyph paths, in the `FontRendererFactory`, can fail in various ways for corrupt font data. However, we're currently not attempting to handle any such errors in the evaluator, which means that a single broken glyph *can* prevent an entire page from rendering.
To address this we simply have to pass along, and check, the existing `ignoreErrors` option in `PartialEvaluator.buildFontPaths` similar to the rest of the `PartialEvaluator` code.
According to the specification, see https://web.archive.org/web/20210404042322if_/https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G6.2384179, the keys of a NameTree/NumberTree should be ordered.
For corrupt PDF files, which violate this assumption, it's thus possible that trying to lookup a single entry fails.
Previously, in PR 10274, we implemented a fallback that only applies to the "bottom" node of a NameTree/NumberTree, which in general might not actually help for sufficiently corrupt NameTree/NumberTree data.
Instead we remove the current *limited* fallback from `NameOrNumberTree.get`, and defer to the call-site to handle this case explicitly e.g. by using `NameOrNumberTree.getAll` for data where that makes sense. For well-formed documents, these changes should *not* lead to any additional data fetching/parsing.
Finally, as part of these changes, the validation of named destination data is improved in the `Catalog` and a new unit-test is also added.
To get the maximum benefit from something like Prettier, you obviously don't want to disable the automatic formatting unless absolutely necessary. When we added Prettier there were a number of cases, mostly involving larger Arrays, which required disabling of the automatic formatting for overall readability and/or to not break inline comments.
With changes in Prettier version `2.3.0`, see [the release notes](https://prettier.io/blog/2021/05/09/2.3.0.html#concise-formatting-of-number-only-arrays-10106httpsgithubcomprettierprettierpull10106-10160httpsgithubcomprettierprettierpull10160-by-thorn0httpsgithubcomthorn0), there's now better formatting support for Arrays containing only numbers. Hence we can now remove a number of `// prettier-ignore` comments, and thus get the benefit of automatic formatting in (slightly) more of the code-base.
- in the charstring specs, page 21 (section 4.2): "Also, it may appear in the charstring as the difference from nominalWidthX", so the number we have on the stack doesn't have to be positive.
- currently this bug has probably no visible effect
- but when the font is loaded to be used with XFA, then the rendering is incorrect.
As can be seen in PR 13371, some of the `no-var` changes in the `PartialEvaluator.{getOperatorList, getTextContent}` methods caused errors in `gulp server`-mode.
However, there's a handful of instances of `var` in other methods which should be completely *safe* to convert since there's no strange scope-issues present in that code.
With modern JavaScript modules, where only *explicitly* exported properties are visible to the outside, the `QueueOptimizerClosure` should no longer be necessary.
Furthermore, to reduce the possibility of `NullOptimizer` and `QueueOptimizer` getting out of sync (note e.g. the inconsistency fixed in PR 10784), we now let the latter extend the former.
This patch replaces the old structure with an abstract base-class, which the new ShadingPattern classes then inherit from.
The old `createMeshCanvasClosure` can now be removed, since it's not necessary any more with modern JavaScript, and the `createMeshCanvas` function is now instead a method on the new `MeshShadingPattern` class (avoids unnecessary parameter passing).
With the changes in PR 13361, we're now using the `CanvasPattern.setTransform()` method when rendering certain Shadings/Patterns.
Note that while `CanvasPattern` itself has been supported since basically "forever", its `setTransform` method is a slightly newer addition to the specification; please refer to https://developer.mozilla.org/en-US/docs/Web/API/CanvasPattern#browser_compatibility
Rather than trying to re-write PR 13361 to not use, or possibly spending time/effort (if possible) polyfilling, `CanvasPattern.setTransform()` this patch thus suggests that we simply update the *minimum* supported browser versions instead.
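For reference, the API in question is used like this (a self-contained sketch of the standard behaviour):
```js
const tile = document.createElement("canvas");
tile.width = tile.height = 16;

const ctx = document.createElement("canvas").getContext("2d");
const pattern = ctx.createPattern(tile, "repeat");
// The newer part of the specification: transforming the pattern itself.
pattern.setTransform(new DOMMatrix([1, 0, 0.5, 1, 0, 0])); // shear the tiles
ctx.fillStyle = pattern;
ctx.fillRect(0, 0, 100, 100);
```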
According to the compatibility data linked above, the *minimum* supported browser versions in the PDF.js library are now as follows:
- Chrome >= 68, which was released on 2018-07-24.[1]
- Firefox ESR, see https://wiki.mozilla.org/Release_Management/Calendar.
- Safari >= 11.1, which was released on 2018-03-29.[2]
(Given that the PDF.js contributors cannot realistically test a bunch of old browsers, it's not unimaginable that some older browser versions are already not working with the PDF.js library.)
Based on these changes, which we should ensure are reflected in the Wiki as well, we can also remove a number of now redundant polyfills. Furthermore we'll no longer "claim" to support Windows XP, note the `gulpfile.js` changes, which should definitely *not* be an issue given that it's no longer officially supported.[3]
---
[1] According to https://en.wikipedia.org/wiki/Google_Chrome_version_history
[2] According to https://en.wikipedia.org/wiki/Safari_version_history#Safari_11
[3] According to https://en.wikipedia.org/wiki/Windows_XP#End_of_support
This patch replaces the old structure with an abstract base-class, which the new RadialAxial/Mesh-shading classes then inherit from.[1]
The old `MeshClosure` can now be removed, since it's not necessary any more, and most of the functions inside of it are now instead methods on the new `MeshShading` class. This is particularly nice, in my opinion, since we previously were *manually* passing around a reference to the current `Mesh`-instance.
---
[1] If we want/need to, in the future, split e.g. the Mesh-handling into multiple classes that should now be easy to do.
Neither the `type` nor the `cs` property is used outside of the "constructors", and we can thus remove them.[1]
Note that a lot of this code is very old, and that it actually predates the main/worker-thread split before which the *same* file was used on both the main- *and* worker-threads.
---
[1] On the main-thread, a similar `type` property was removed in PR 12591.
- Right now, a glyph with an erroneous outline is replaced by an empty glyph.
If the error is far enough from the start, there's likely something to render,
so the idea is to replace a command with args by an endchar when no args are
on the stack: this way OTS is likely happy (no remaining args on the stack) and we
can draw something, which is likely better than nothing.
First of all, by using `Dict.getArray` in the `Page.content` getter we remove the need to manually iterate through and fetch the sub-streams (when they exist) in the `Page.getContentStream` method.
Secondly, we can simplify the code in `Page.{getOperatorList, extractTextContent}` by letting `Page.getContentStream` ensure that `content` is available and returning a Promise instead.
Similar to the `get`/`getAsync` methods, this should be a *tiny* bit more efficient which cannot hurt considering that `getArray` is now used a lot more than when initially added.
This reverts commit 0ef9b5aafc, since it causes a lot of warnings (see below) *locally* with e.g. the document from issue 9627.
Strangely enough, this only occurs with `gulp server`-mode and the actual builds are apparently fine. It seems that this *may* be some unfortunate interaction with the old Babel-plugin that's used together with SystemJS.
```
Warning: getTextContent - ignoring ExtGState: "FormatError: ExtGState should be a dictionary.".
```
Rather than taking the risk that this could actually cover a more serious bug, and since I cannot immediately figure out what's wrong, it thus seem safest to revert this for now and we can (carefully) revisit this once SystemJS has been removed (see PR 12563).
This patch was tested using the PDF file from issue 2618, i.e. https://bug570667.bugzilla-attachments.gnome.org/attachment.cgi?id=226471, with the following manifest file:
```
[
    { "id": "issue2618",
      "file": "../web/pdfs/issue2618.pdf",
      "md5": "",
      "rounds": 50,
      "type": "eq"
    }
]
```
which gave the following results when comparing this patch against the `master` branch:
```
-- Grouped By browser, stat --
browser | stat | Count | Baseline(ms) | Current(ms) | +/- | % | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | --- | ---- | -------------
firefox | Overall | 50 | 3417 | 3426 | 9 | 0.27 |
firefox | Page Request | 50 | 1 | 1 | 0 | 5.41 |
firefox | Rendering | 50 | 3416 | 3426 | 9 | 0.27 |
```
Based on these results, there's no significant performance regression from using standard classes and this patch should thus be OK.
Previously, we set the base transformation and pattern matrix
directly to the main rendering ctx of the page, however doing this
caused the current transform to be lost. This would cause issues
with things like shear missing so the pattern was misaligned or when
stroke was used the scale of the line width or dash would be wrong.
Instead we should leave the current transform and use setTransform
on the pattern so it is applied correctly. For axial and radial shadings I had
to create a temporary canvas to draw the shading so I could in turn
use setTransform.
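The approach can be sketched as follows (self-contained, with made-up sizes/colors):
```js
const main = document.createElement("canvas");
main.width = main.height = 300;
const mainCtx = main.getContext("2d");
mainCtx.transform(1, 0.3, 0, 1, 0, 0); // stand-in for the page's current transform

// Draw the shading on a temporary canvas...
const scratch = document.createElement("canvas");
scratch.width = scratch.height = 256;
const sctx = scratch.getContext("2d");
const gradient = sctx.createLinearGradient(0, 0, 256, 0);
gradient.addColorStop(0, "red");
gradient.addColorStop(1, "blue");
sctx.fillStyle = gradient;
sctx.fillRect(0, 0, 256, 256);

// ...then apply it as a pattern carrying its own matrix, leaving the main
// context's current transform untouched.
const pattern = mainCtx.createPattern(scratch, "no-repeat");
pattern.setTransform(new DOMMatrix([0.5, 0, 0, 0.5, 10, 10]));
mainCtx.fillStyle = pattern;
mainCtx.fillRect(0, 0, 300, 300);
```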
Fixes: #13325, #6769, #7847, #11018, #11597, #11473
The following tests, already in the corpus, are improved:
issue8078-page1
issue1877-page1
This patch moves the `PDFJSDev`-checks *inline*, similar to the rest of the code-base, such that the code in question is actually being removed from the *built* files in e.g. the official releases.
After PR 13117 it's now (finally) possible for *different* build targets to specify individual options/preferences, and we can utilize that to only expose the `renderer`-preference in builds where `SVGGraphics` is actually defined.
Note that for e.g. `MOZCENTRAL`-builds, trying to enable SVG-rendering will throw immediately and the preference thus doesn't make sense to include there.
Also, update the dummy `SVGGraphics` to use a class, tweak the `PDFJSDev`-check in `src/display/svg.js` to agree fully with the option/preference, and remove an unnecessary `eslint-disable`.
Reasons for the removal include:
- This functionality was always somewhat experimental and has never been enabled by default, partly because of worries about rendering bugs caused by e.g. bad/outdated graphics drivers.
- After the initial implementation, in PR 4286 (back in 2014), no additional functionality has been added to the WebGL implementation.
- The vast majority of all documents do not benefit from WebGL rendering, since only a couple of *specific* features are supported (e.g. some Soft Masks and Patterns).
- There is, and has always been, *zero* test-coverage for the WebGL implementation.
- Overall performance, in the PDF.js library, has improved since the experimental WebGL implementation was added.
Rather than shipping unused *and* untested code, it seems reasonable to simply remove the WebGL implementation for now; thanks to version control it's always possible to bring back the code should the need ever arise.
Compared to other data-structures, such as e.g. `Dict`s, we're purposely *not* caching Streams on the `XRef`-instance.[1]
The, somewhat unfortunate, effect of Streams not being cached is that repeatedly getting the *same* Stream-data requires re-parsing/re-initializing of a bunch of data; see `XRef.fetch` and related methods.
For the font-parsing in particular we're currently fetching the `toUnicode`-data, which is very often a Stream, in `PartialEvaluator.preEvaluateFont` and then *again* in `PartialEvaluator.extractDataStructures` soon afterwards.
By instead letting `PartialEvaluator.preEvaluateFont` export the "raw" `toUnicode`-data, we can avoid *some* unnecessary re-parsing/re-initializing when handling fonts.
*Please note:* In this particular case, given that `PartialEvaluator.preEvaluateFont` only accesses the "raw" `toUnicode` data, exporting a Stream should be safe.
---
[1] The reasons for this include:
- Streams, especially `DecodeStream`-instances, can become *very* large once read. Hence caching them really isn't a good idea simply because of the (potential) memory impact of doing so.
- Attempting to read from the *same* Stream-instance more than once won't work, unless it's `reset` in between, since using any method such as e.g. `getBytes` always starts at the current data position.
- Given that parsing, even in the worker-thread, is now fairly asynchronous it's generally impossible to assert that any one Stream-instance isn't being accessed "concurrently" by e.g. different `getOperatorList` calls. Hence `reset`-ing a cached Stream-instance isn't going to work in the general case.
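The second point above is easy to demonstrate with the core `Stream` class (the import path is illustrative):
```js
import { Stream } from "./src/core/stream.js"; // illustrative path

const stream = new Stream(new Uint8Array([1, 2, 3, 4]));
stream.getBytes(); // -> [1, 2, 3, 4]; the position is now at the end
stream.getBytes(); // -> [], since reading starts at the current position
stream.reset();
stream.getBytes(); // -> [1, 2, 3, 4] again, but only thanks to the reset()
```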
Rather than re-fetching/re-parsing these properties immediately in `PartialEvaluator.translateFont`, we can simply export them instead. (Obviously the effect will be really tiny, but there is less parsing overall this way.)
*Please note:* While I don't have a document that this patch fixes, the current code is however not entirely correct as far as I can tell.
Looking at how the `Widths` array is parsed in `PartialEvaluator.extractWidths`, it's clear that the implementation in `PartialEvaluator.preEvaluateFont` is a bit too simplistic. In particular, by only wrapping the data into a TypedArray, there's no attempt to handle *indirect* objects which could potentially lead to colliding `hash`es being computed.
To hopefully help prevent any future bugs, make sure that composite/non-composite fonts cannot accidentally get matching `hash`es. Given the differences between those font types, that's very unlikely to be useful or even correct in general.
Without this some *composite* fonts may incorrectly end up with matching `hash`es, thus breaking rendering since we'll not actually try to load/parse some of the fonts.
*Please note:* Given that the document, in the referenced issue, doesn't embed *any* of its fonts there's no guarantee that it renders correctly in all configurations even with this patch.
With modern JavaScript modules, where you explicitly list the properties that should be exported, it's no longer necessary to wrap *all* of the code within one file into a top-level closure.[1]
This patch reduces the size, of even the *built* `pdf.worker.js` file, since there's now a lot less unnecessary whitespace.
---
[1] For files which contain *different* functionality, some closures may however still make sense in order to separate the code.
It might be possible to remove some of those cases later, e.g. once private class fields becomes generally available/usable in browsers.
With modern JavaScript modules, where you explicitly list the properties that should be exported, it's no longer necessary to wrap all of the code in a closure.[1]
This patch also tries to clean-up/improve a couple of the existing JSDoc-comments.
---
[1] This reduces the size, even of the *built* `pdf.js` file, since there's now a lot less unnecessary whitespace.
*My apologies for breaking this; thankfully PR 13303 hasn't reached mozilla-central yet.*
It's (obviously) necessary to initialize a `PredictorStream`-instance fully, since otherwise breakage may occur if there's errors during the actual stream parsing.
To reproduce this issue, try opening the PDF document from issue 13051 locally and observe the following message in the console:
```
Warning: Invalid stream: "ReferenceError: this hasn't been initialised - super() hasn't been called"
```
- app.alert and a few other functions can use an object as a parameter ({cMsg: ...});
- support app.alert with a question and a yes/no answer;
- update field siblings when one is changed in an action;
- stop calculation if calculate is set to false in the middle of calculations;
- get a boolean, instead of a string, for checkboxes when they've been set through annotationStorage.
It shouldn't be possible for the `getBytes`-call to throw a `MissingDataException`, since all resources are loaded *before* e.g. font-parsing ever starts; see f0817015bd/src/core/object_loader.js (L111-L126)
Furthermore, even if we'd *somehow* re-throw a `MissingDataException` here that still won't help considering where the `Type1Font`-instance is created. Note how in the `Font`-constructor we simply catch any errors and fallback to a standard font, which means that a `MissingDataException` would just lead to rendering errors anyway; see f0817015bd/src/core/fonts.js (L648-L691)
All-in-all, it's not possible for a `MissingDataException` to be thrown in `getHeaderBlock` and this code-path can thus be removed.
Obviously the `Font`-class is still *very* large, given particularly how TrueType fonts are handled, however this patch-series at least improves things by moving a number of functions/classes into their own files.
As a follow-up it might make sense to try and re-factor/extract the TrueType parsing into its own file, since all of this code is quite old, however that's probably best left for another time.
For e.g. `gulp mozcentral`, the *built* `pdf.worker.js` files decreases from `1 620 332` to `1 617 466` bytes with this patch-series.
These changes were made automatically, using `gulp lint --fix`.
Given the large size of this patch, the manual fixes are done separately in the next commit.
These changes were made *mostly* automatically, using `gulp lint --fix`, with the following manual changes:
```diff
diff --git a/src/core/fonts_utils.js b/src/core/fonts_utils.js
index f88ce4a8c..c4b3f3808 100644
--- a/src/core/fonts_utils.js
+++ b/src/core/fonts_utils.js
@@ -167,8 +167,8 @@ function type1FontGlyphMapping(properties, builtInEncoding,
glyphNames) {
}
// Lastly, merge in the differences.
- let differences = properties.differences,
- glyphsUnicodeMap;
+ const differences = properties.differences;
+ let glyphsUnicodeMap;
if (differences) {
for (charCode in differences) {
const glyphName = differences[charCode];
```
- `FontFlags`, is used in both `src/core/fonts.js` and `src/core/evaluator.js`.
- `getFontType`, same as the above.
- `MacStandardGlyphOrdering`, is a fairly large data-structure and `src/core/fonts.js` is already a *very* large file.
- `recoverGlyphName`, a dependency of `type1FontGlyphMapping`; please see below.
- `SEAC_ANALYSIS_ENABLED`, is used by `Type1Font`, `CFFFont`, and unit-tests; please see below.
- `type1FontGlyphMapping`, is used by both `Type1Font` and `CFFFont` which a later patch will move to their own files.
This is done automatically with `gulp lint --fix` and the following
manual changes:
```diff
diff --git a/src/core/function.js b/src/core/function.js
index 878001057..b7e3e6ccf 100644
--- a/src/core/function.js
+++ b/src/core/function.js
@@ -131,7 +131,7 @@ function toNumberArray(arr) {
return arr;
}
-var PDFFunction = (function PDFFunctionClosure() {
+const PDFFunction = (function PDFFunctionClosure() {
const CONSTRUCT_SAMPLED = 0;
const CONSTRUCT_INTERPOLATED = 2;
const CONSTRUCT_STICHED = 3;
@@ -484,7 +484,9 @@ var PDFFunction = (function PDFFunctionClosure() {
// clip to domain
const v = clip(src[srcOffset], domain[0], domain[1]);
// calculate which bound the value is in
- for (var i = 0, ii = bounds.length; i < ii; ++i) {
+ const length = bounds.length;
+ let i;
+ for (i = 0; i < length; ++i) {
if (v < bounds[i]) {
break;
}
@@ -673,23 +675,21 @@ const PostScriptStack = (function PostScriptStackClosure() {
roll(n, p) {
const stack = this.stack;
const l = stack.length - n;
- let r = stack.length - 1,
- c = l + (p - Math.floor(p / n) * n),
- i,
- j,
- t;
- for (i = l, j = r; i < j; i++, j--) {
- t = stack[i];
+ const r = stack.length - 1;
+ const c = l + (p - Math.floor(p / n) * n);
+
+ for (let i = l, j = r; i < j; i++, j--) {
+ const t = stack[i];
stack[i] = stack[j];
stack[j] = t;
}
- for (i = l, j = c - 1; i < j; i++, j--) {
- t = stack[i];
+ for (let i = l, j = c - 1; i < j; i++, j--) {
+ const t = stack[i];
stack[i] = stack[j];
stack[j] = t;
}
- for (i = c, j = r; i < j; i++, j--) {
- t = stack[i];
+ for (let i = c, j = r; i < j; i++, j--) {
+ const t = stack[i];
stack[i] = stack[j];
stack[j] = t;
}
@@ -939,7 +939,7 @@ class PostScriptEvaluator {
// We can compile most of such programs, and at the same moment, we can
// optimize some expressions using basic math properties. Keeping track of
// min/max values will allow us to avoid extra Math.min/Math.max calls.
-var PostScriptCompiler = (function PostScriptCompilerClosure() {
+const PostScriptCompiler = (function PostScriptCompilerClosure() {
class AstNode {
constructor(type) {
this.type = type;
```
The `this.data` property is, when defined, sent from the worker-thread as a `Uint8Array` and there's thus no reason to re-initialize the TypedArray here.
Note also the `FontFaceObject.createNativeFontFace` method just above, where we simply use `this.data` as-is.
The explanation for this code looking like it does is, as is often the case, for historical reasons. Originally we only supported `@font-face`, before the Font Loading API existed, and back then we also polyfilled TypedArrays (using regular Arrays) which should explain this particular line of code.
Given that the `bytesToString(BaseStream.getBytes(...))` pattern is somewhat common throughout the `src/core/` code, it cannot hurt to add a new `BaseStream`-method which handles that case internally.
- Improve chunking in order to fix some bugs where the spaces aren't there:
* track the last position where a glyph has been drawn;
* when a new glyph (the first glyph in a chunk) is added, compare its position with the last saved one and add a space or a break:
- there are multiple ways to move the glyphs, and to avoid having to deal with all the different possibilities it's way easier to just compare positions;
- and so there is now one function (i.e. "compareWithLastPosition") where all the work is done.
- Add some breaks in order to get lines;
- Remove the multiple white spaces:
* some spaces were filled with several white spaces, which makes it harder to find some sequences of words using the search tool;
* other pdf readers replace such spaces with one white space.
Update src/core/evaluator.js
Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>
Looking at the `ChunkedStream` implementation, it's basically a "regular" `Stream` but with added functionality in order to deal with fetching/loading of missing data.
Hence, by letting `ChunkedStream` extend `Stream`, we can remove some duplicate methods from the `ChunkedStream` class.
For all of the other `DecodeStream`s we're not passing in a `Dict`-instance manually, but instead get it from the `stream`-parameter. Hence there's no particularly good reason, as far as I can tell, to not do the same thing in `Jbig2Stream`/`JpegStream`/`JpxStream` as well.
The way that `getBaseStreams` is currently handled has bothered me from time to time, especially how we're checking if the method exists before calling it.
By adding a dummy `BaseStream.getBaseStreams` method, and having the call-sites simply check the return value, we can improve some of the relevant code.
Note in particular how the `ObjectLoader._walk` method didn't actually check that the data in question is a Stream instance, and instead only checked the `currentNode` (which could be anything) for the existence of a `getBaseStreams` property.
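A sketch of the improved pattern (simplified class bodies):
```js
class BaseStream {
  getBaseStreams() {
    return null; // Dummy implementation; overridden by container streams.
  }
}

class DecodeStream extends BaseStream {
  constructor(str) {
    super();
    this.str = str;
  }
  getBaseStreams() {
    return this.str ? [this.str] : null;
  }
}

// Call-sites now check the return value, not the method's existence, and an
// `instanceof` check guards against non-Stream nodes:
function collectBaseStreams(node) {
  const baseStreams = node instanceof BaseStream ? node.getBaseStreams() : null;
  return baseStreams || [];
}
```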
By having an abstract base-class, it becomes a lot clearer exactly which methods/getters are expected to exist on all Stream instances.
Furthermore, since a number of the methods are *identical* for all Stream implementations, this reduces unnecessary code duplication in the `Stream`, `DecodeStream`, and `ChunkedStream` classes.
For e.g. `gulp mozcentral`, the *built* `pdf.worker.js` files decreases from `1 619 329` to `1 616 115` bytes with this patch-series.
Given that we're using modules, meaning that only explicitly `export`ed things are visible to the outside, it's no longer necessary to wrap all of the code in a closure.
This gets rid of *a lot* of boilerplate that stems from our old way of simulating classes, and it actually reduces the filesize noticeably.
For e.g. `gulp mozcentral`, the *built* `pdf.js` files decreases from `318 404` to `314 722` bytes (~1 percent) with this patch.
This is done automatically with `gulp lint --fix` and the following
manual changes:
```diff
diff --git a/src/core/image.js b/src/core/image.js
index 35c06b8ab..e718b9937 100644
--- a/src/core/image.js
+++ b/src/core/image.js
@@ -97,7 +97,7 @@ class PDFImage {
if (isName(filter)) {
switch (filter.name) {
case "JPXDecode":
- var jpxImage = new JpxImage();
+ const jpxImage = new JpxImage();
jpxImage.parseImageProperties(image.stream);
image.stream.reset();
```
Using `for...of` is a modern and generally much nicer pattern, since it gets rid of unnecessary callback-functions. (In a couple of spots, a "regular" `for` loop had to be used.)
- In case of a large string the sandbox initialization failed because of an OOM
* so allocate a new string in the heap
* and free it after use.
- it requires a quickjs update since we need to export some symbols (stringToNewUTF8 and free).
This patch first of all moves all checking/validation into the `appendIfJavaScriptDict` function, to avoid duplicating it in multiple places. Secondly, also removes what's now an outdated/incorrect comment since we have implemented scripting support.
Given that we're (almost) always iterating through the result of the `getAll`-calls, using a `Map` seems nicer overall since it's more suited to iteration compared to a regular Object.
Also, add a couple of `Dict`-checks in existing code touched by this patch, since it really cannot hurt to prevent *potential* errors in a corrupt PDF document.
First of all, while it should be very unlikely that the /ID-entry is an *indirect* object, note how we're using `Dict.get` when parsing it e.g. in `PDFDocument.fingerprint`. Hence we definitely should be consistent here, since if the /ID-entry is an *indirect* object the existing code in `src/core/writer.js` would already fail.
Secondly, to fix the referenced issue, we also need to check that the /ID-entry actually is an Array before attempting to access its contents in `src/core/writer.js`.
*Drive-by change:* In the `xrefInfo` object passed to the `incrementalUpdate` function, re-name the `encrypt` property to `encryptRef` since its data is fetched using `Dict.getRaw` (given the names of the other properties fetched similarly).
For CFF fonts without proper `ToUnicode`/`Encoding` data, utilize the "charset"/"Encoding"-data from the font file to improve text-selection (issue 13260)
Keeps screen readers from pausing on every span so
paragraphs are read more naturally. Note: this only seems
to affect Firefox; Chrome automatically combines the spans.
By not waiting for the /Properties to load, before parsing of the operatorList/textContent starts, there's a very real risk that a `MissingDataException` will be thrown when trying to access the data in the `PartialEvaluator.parseMarkedContentProps` method.
If this ever happens it will thus lead to incomplete and/or outright broken rendering, and with e.g. `disableAutoFetch=true` set the likelihood of this occurring would increase quite a bit.
*Please note:* While I've not yet seen this error in an actual PDF document, it can happen during loading if you're unlucky enough with e.g. the structure of the PDF document and/or the download speed offered by the server.
- Use `PDFManager.ensureDoc`, rather than `PDFManager.ensure`, in a couple of spots in the code. If there exists a short-hand format, we should obviously use it whenever possible.
- Fix a unit-test helper, to account for the previous changes. (Also, converts a function to be `async` instead.)
- Add one more exists-check in `PDFDocument.loadXfaFonts`, which I forgot to suggest in PR 13146, to prevent any possible errors if the method is ever called in a situation where it shouldn't be.
Also, print a warning if the actual font-loading fails since that could help future debugging. (Finally, reduce overall indentation in the loop.)
- Slightly unrelated, but make a small tweak of a comment in `src/core/fonts.js` to reduce possible confusion.
- Different fonts can be used in xfa and some of them are embedded in the pdf.
- Load all the fonts in window.document.
Update src/core/document.js
Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>
Update src/core/worker.js
Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>
The size of the `src/core/obj.js` file has increased slowly over the years, and it also contains a fair amount of *distinct* functionality.
In order to improve readability and make it easier to navigate through the code, this patch moves the `XRef` into its own file.
The size of the `src/core/obj.js` file has increased slowly over the years, and it also contains a fair amount of *distinct* functionality.
In order to improve readability and make it easier to navigate through the code, this patch moves `NameTree`/`NumberTree` into their own file.
The size of the `src/core/obj.js` file has increased slowly over the years, and it also contains a fair amount of *distinct* functionality.
In order to improve readability and make it easier to navigate through the code, this patch moves the `FileSpec` into its own file.
The size of the `src/core/obj.js` file has increased slowly over the years, and it also contains a fair amount of *distinct* functionality.
In order to improve readability and make it easier to navigate through the code, this patch moves the `ObjectLoader` into its own file.
`setSelectionRange(0, 0)` added in 44b24fcc29 for #12359, required only by Firefox ([bug](https://bugzilla.mozilla.org/show_bug.cgi?id=860329)), causes issues mozilla#13191, mozilla#12592 in Safari.
`scrollLeft = 0` is a fix that breaks the focus trap in Safari while **keeping Firefox behavior same for #12359**.
Remove the unused "GetIsPureXfa" message handler; and avoid unnecessary parsing when no structTree is available (PR 13069 follow-up, PR 13221 follow-up)
With the current implementation of `PDFDocument.hasJSActions`, in the worker-thread, we're not actually handling not-yet-loaded data correctly. This can thus fail in *two* different ways:
- The `PDFDocument.fieldObjects` getter (and its helper method), while it may *return* a Promise, still fetches all of its data synchronously and it can thus throw a `MissingDataException` during parsing.
- The `Catalog.jsActions` getter, which is completely synchronous, can obviously throw a `MissingDataException` during parsing.
If either of these cases occur currently, the `PDFDocumentProxy.hasJSActions` method in the API can either return a *rejected* Promise (which it never should) or possibly "hang" and never resolve.
*Please note:* While I've not *yet* seen this error in an actual PDF document, it can happen during loading if you're unlucky enough with e.g. the structure of the PDF document and/or the download speed offered by the server.
This patch is thus based on code-inspection *and* on manually throwing a `MissingDataException` on the first access of `Catalog.jsActions` to simulate this situation.
Finally, this patch adds a couple of *API* unit-tests for this (since none existed).
Given that this only an internal helper method, used by the `Catalog.{javaScript, jsActions}` getters, this change simplifies iteration of the returned data.
We can also (slightly) re-factor the code of the `jsActions` getter, and remove an obsolete[1] JSDoc-comment from the `openAction` getter.
---
[1] Not really relevant now that we've got proper scripting support.
Similar to all other data accesses, note e.g. the "GetDocJSActions" handler just above, we need to ensure that a `MissingDataException` isn't propagated to the main-thread if this data is accessed while the PDF document is still loading.
It's obviously (a bit) more efficient to return early in `Page.getStructTree`, rather than trying to first "parse" an *empty* structTree-root.
*Somehow I didn't think of this yesterday, but this feels like a much better solution overall; sorry about the churn here!*
Looking at the API, there's no code which actually sends this message. Most likely it's a left-over from a previous version of PR 13069, since the `isPureXfa` parameter is being included in the "GetDoc" message.
This is first of all consistent with existing API-methods, where we return `null` when the data in question doesn't exist. Secondly, it should also be (slightly) more efficient since there's less dummy-data that we need to transfer between threads.
Finally, this prevents us from adding an empty/unnecessary span to *every* single page even in documents without any structure tree data.
- but don't validate them for now;
- Firefox will display a bar to warn that the signature validation is not supported (see https://bugzilla.mozilla.org/show_bug.cgi?id=854315)
- almost all (all?) PDF readers display signatures;
- validation is done in Edge but, for now, it's behind a pref.
When a PDF is "marked" we now generate a separate DOM that represents
the structure tree from the PDF. This DOM is inserted into the <canvas>
element and allows screen readers to walk the tree and have more
information about headings, images, links, etc. To link the structure
tree DOM (which is empty) to the text layer aria-owns is used. This
required modifying the text layer creation so that marked items are
now tracked.
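Conceptually, the linking looks like this hand-written sketch (not the actual generated markup):
```js
// A text layer span that holds the rendered text:
const textSpan = document.createElement("span");
textSpan.id = "text-item-42"; // the id scheme is illustrative
textSpan.textContent = "Chapter 1";

// The corresponding (empty) structure-tree node, exposing the semantics and
// pointing at the text via aria-owns:
const structNode = document.createElement("span");
structNode.setAttribute("role", "heading");
structNode.setAttribute("aria-level", "1");
structNode.setAttribute("aria-owns", textSpan.id);
```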
Note how we purposely don't expose the `AnnotationStorage`-class directly in the official API (see `src/pdf.js`), since trying to use *multiple* ones simultaneously doesn't really make sense (e.g. in the viewer).
Instead we lazily initialize, and cache, just *one* instance via `PDFDocumentProxy.annotationStorage` which should thus be available internally in the API itself without having to be manually passed to various methods.
To support these changes, the `AnnotationStorage`-instance initialization is moved into the `WorkerTransport`-class to allow both `PDFDocumentProxy` and `PDFPageProxy` to access it.
This patch implements the following simplifications:
- Remove the `annotationStorage`-parameter from `PDFDocumentProxy.saveDocument`, since it's already available internally.
Furthermore, while it's currently possible to call that method without an `AnnotationStorage`-instance, that really does *not* make any sense at all. In this case you're effectively reducing `PDFDocumentProxy.saveDocument` to a "regular" `PDFDocumentProxy.getData` call, but with *a lot* more overhead, which was obviously not the intention of the `PDFDocumentProxy.saveDocument`-method.
- Try to discourage third-party users from calling `PDFDocumentProxy.saveDocument` unconditionally, as a replacement for `PDFDocumentProxy.getData` (note the previous point).
- Replace the `annotationStorage`-parameter, in `PDFPageProxy.render`, with a boolean `includeAnnotationStorage`-parameter which simply indicates if the (internally available) `AnnotationStorage`-instance should be used during rendering (e.g. for printing).
- By removing the need to *manually* provide `annotationStorage`-parameters to various API-methods, using the API should become simpler (e.g. for third-parties) since you no longer need to worry about manually fetching and passing around this data.
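With these simplifications, typical usage reads roughly as follows (a sketch; the exact `setValue` payload is an assumption, see the JSDocs for the authoritative signatures):
```js
import * as pdfjsLib from "pdfjs-dist";

const pdfDocument = await pdfjsLib.getDocument("form.pdf").promise;
// One lazily-created AnnotationStorage, shared by the whole document:
pdfDocument.annotationStorage.setValue("34R", { value: "Hello" });

const page = await pdfDocument.getPage(1);
const viewport = page.getViewport({ scale: 1 });
const canvas = document.createElement("canvas");
canvas.width = viewport.width;
canvas.height = viewport.height;
await page.render({
  canvasContext: canvas.getContext("2d"),
  viewport,
  includeAnnotationStorage: true, // e.g. when rendering for printing
}).promise;

// Saving picks up the storage internally; no parameter to pass anymore:
const data = await pdfDocument.saveDocument();
```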
The fontName, as defined in the PDF document, cannot be found in *any* of the "name"-tables in the TrueType Collection font. To work-around that, this patch adds a *fallback* code-path to allow using an approximately matching fontName rather than outright failing.
While this method has only been deprecated in one releases now, the `AnnotationStorage`-functionality is new enough that third-party implementations hopefully don't rely heavily on it just yet. (And removing this quickly should help reduce the likelihood that someone starts using it.)
When removing tasks we're currently forced to *indirectly* iterate through the array, which can be avoided by using a Set instead.
Furthermore, we can also (slightly) modernize the code responsible for initializing the `renderTasks`.
Given that what we actually want is only to keep track of the loadedFont-names, rather than storing any actual data, using an object isn't really necessary here. Furthermore, in the current code, we're also using `in` when checking if the data exists, which is generally less efficient than just checking for the value directly.
As mentioned in the JSDoc comment, this should not be used unless you know what you're doing, since it will lead to increased memory usage. However, in some situations (e.g. SVG-rendering), we still want to be able to run general clean-up on both the main/worker-thread while keeping loaded fonts attached to the DOM.[1]
As part of these changes, `WorkerTransport.startCleanup` is converted to an async method and we'll also skip clean-up when destruction has started (since it's redundant).
---
[1] The SVG-rendering mode is obviously not officially supported, since it's both rather incomplete and inherently slower. However with recent changes, whereby we cache repeated images on the document rather than the page level, memory usage can be *a lot* worse than before if we never attempt to release e.g. cached image-data when the viewer is in SVG-rendering mode.
These two properties were *never* intended to be anything but "private", hence it really cannot hurt to actually indicate that they're *not* part of any official API.
Currently the `fontName`-property contains an actual /Name-instance, which is a problem given that its fallback value is an empty string; see ca7f546828/src/core/default_appearance.js (L35)
The reason that this is a problem can be seen in ca7f546828/src/core/primitives.js (L30-L34), since an empty string short-circuits the cache. Essentially, in PDF documents, a /Name-instance cannot be empty and the way that the `DefaultAppearanceEvaluator` does things is unfortunately not entirely correct.
Hence the `fontName`-property is changed to instead contain a string, rather than a /Name-instance, which simplifies the code overall.
*Please note:* I'm tagging this patch with "[api-minor]", since PR 12831 is included in the current pre-release (although we're not using the `fontName`-property in the display-layer).
Fixes#13107
In the issue, some TrueType glyph names have the format `uniXXXX`.
The font's `Encoding` dictionary has the entry `Differences` but no
`BaseEncoding`. `uniXXXX` names are converted to glyph indices
using the font's `post` table, but currently that is done only when
`BaseEncoding` exists. We must enable the conversion also when only
`Differences` exists.
Currently only URL-strings are officially supported by `getDocument`, however at this point in time I cannot really see any compelling reason to not support `URL`-objects as well.
Most likely the reason that we don't already support `URL`-objects, in `getDocument`, is that historically `URL` wasn't fully implemented across browsers and our old polyfill wasn't perfect; see https://developer.mozilla.org/en-US/docs/Web/API/URL/URL#browser_compatibility
*Please note:* Because of how the `url` parameter is currently handled, there's actually *some* cases where passing a `URL`-object to `getDocument` already works. That, in my opinion, provides additional motivation for supporting `URL`-objects officially, since it makes the API more consistent.
The following is an attempt to summarize the *current* situation, based on the actual code rather than the JSDocs:
- `getDocument("url string")` works and is documented.[1]
- `getDocument({ url: "url string", })` works and is documented.[1]
- `getDocument(new URL(...))` throws immediately, since no supported parameters are found.
- `getDocument({ url: new URL(...), })` actually works even though it's not documented.[1] Originally, when data was fetched on the worker-thread, this would likely have thrown since `URL` isn't clonable.[2]
- `getDocument({ url: { abc: 123, }, })`, or some similarly meaningless input, will be "accepted" by `getDocument` and then throw a `MissingPDFException` when attempting to fetch the bogus data.
With the changes in this patch, not only is `URL`-objects now officially supported and documented when calling `getDocument`, but we'll also do a much better job at actually validating any URL-data passed to `getDocument` (and instead fail early).
---
[1] In *browsers*, we create a valid URL thus indirectly validating the input. In Node.js environments, on the other hand, no validation is done since obtaining a baseUrl is more difficult (and PDF.js is primarily written for browsers anyway).
[2] https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Structured_clone_algorithm#supported_types
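In other words, both of the following now work, and are validated early (assuming the usual `pdfjs-dist` import):
```js
import * as pdfjsLib from "pdfjs-dist";

const url = new URL("https://example.com/file.pdf");
const task1 = pdfjsLib.getDocument(url);
const task2 = pdfjsLib.getDocument({ url });
```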
Given the number of parameters that we now need to parse here, this code is no longer as readable as one would like. Hence this re-factoring, which will improve overall readability and also help with the next patch.
The reasons for making this change are:
- This property is not, nor has it ever been, used anywhere in the PDF.js display-layer.
- Related to the previous point, the format of the `defaultAppearance`-string is such that it'd be difficult to use it as-is in the display-layer anyway.
- It (usually) contains the "raw" appearance-string, from the PDF document, which is neither parsed nor validated and could thus be bogus.
- We now expose a `defaultAppearanceData`-property, which is first of all used in the display-layer and secondly contains actually parsed/validated data.
- In the event that a third-party implementation needs the `defaultAppearance`-string, it could be easily constructed from the recently added `defaultAppearanceData`-property.
All-in-all, I'm thus suggesting that we stop exposing an unused and unnecessary property on all Annotation-instances.
* JS - Handle correctly hierarchy of fields
- it aims to fix #13132;
- annotations can inherit their actions from the parent field;
- there are some fields which act as a container for other fields:
- they can be accessed through js, so we need to add them with an empty type (nothing in the spec about that but checked in Acrobat);
- the calculation order list (CO) can reference them, so we need to make them available through this.getField;
- getArray method must return kids.
- field values are number, string, ... depending on their type, but there's nothing in the spec on how to know what the type is:
- according to the comment for Canonical Format: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#page=461
- it seems that this "type" can be guessed from js action Format (when setting a type in Acrobat DC, the only affected thing is this action).
- util.scand with an empty string returns the current date.
Based on this compatibility information, given that IE 11 is now *explicitly* unsupported, we should no longer need to bundle a `URL` polyfill in any builds: https://developer.mozilla.org/en-US/docs/Web/API/URL/URL#browser_compatibility
Note that the caveat listed for older Safari-versions doesn't apply to any code in the PDF.js library, since we never call `new URL(url, undefined)` in the code-base.
Note also that Node.js has a web-compatible `URL` implementation, which according to the "History" section at https://nodejs.org/api/url.html#url_the_whatwg_url_api has been available since Node.js `10.0.0` (according to https://nodejs.org/en/about/releases/ that branch is one month away from being EOL-ed).
The rotation handling that's currently living in `PDFViewerApplication` is *very* old, and pre-dates the introduction of the viewer components by years.
As can be seen in the `BaseViewer.pagesRotation` setter, we're not actually normalizing the rotation as intended and instead rely on the caller to handle that correctly. This is first of all inconsistent, given how other setters are implemented, and secondly it could also lead to the rotation being set to a value outside of the `[0, 360)`-range.
Finally, for improved consistency the rotation handling in `PageViewport` is updated similarly. Please note that in this case, it's *not* changing the pre-existing logic.
- implement a few positioning properties: position, width, height, anchor;
- implement font element;
- implement fill element (used by font) and its children (linear, radial, ...);
- the font property is inherited from the ancestor container (see https://www.pdfa.org/wp-content/uploads/2020/07/XFA-3_3.pdf#page=43) so let CSS handle that stuff;
- in order to reduce the number of properties to set, only set non default properties and put the default in CSS;
- set a background to some containers to be able to see them (will be removed in a future commit).
Similar to the existing `annotationsPromise` and `_jsActionsPromise` properties, the new `_xfaPromise` should obviously also be reset, since otherwise you might end up holding onto a lot of data for pages that are no longer active.
(That caching wasn't present in the original version of PR 13069, which is why I didn't spot it until now.)
- add an option to enable XFA rendering if any;
- for now, keep the canvas layer: it could be useful to implement XFAF forms (embedded pdf in xml stream for the background and xfa form for the foreground);
- ui elements in the template DOM are pretty close to their html counterparts, so we generate a fake html DOM from the template one:
- it makes it easier to translate template properties to html ones;
- it makes the creation of the html elements in the main thread faster.
While there is nothing *outright* wrong with the existing implementation, it can however lead to increased memory usage in one particular case (that I completely overlooked when implementing this):
For "data:"-URLs, which by definition contains the entire PDF document and can thus be arbitrarily large, we obviously want to avoid sending, storing, and/or logging the "raw" docBaseUrl in that case.
To address this, this patch makes the following changes:
- Ignore any non-string in the `docBaseUrl` option passed to `getDocument`, since those are unsupported anyway, already on the main-thread.
- Ignore "data:"-URLs in the `docBaseUrl` option passed to `getDocument`, to avoid having to send what could potentially be a *very* long string to the worker-thread.
- Parse the `docBaseUrl` option *directly* in the `BasePdfManager`-constructors, on the worker-thread, to avoid having to store the "raw" docBaseUrl in the first place.
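A sketch of that parsing (the `data:`-check below is deliberately simplified compared to the real `isDataScheme` helper):
```js
function parseDocBaseUrl(url) {
  if (typeof url !== "string" || !url) {
    return null; // Ignore unsupported (non-string) values outright.
  }
  if (/^data:/i.test(url.trim())) {
    return null; // Never store, send, or log potentially huge "data:" URLs.
  }
  return url;
}
```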
It seems reasonable to place this alongside the *similar* `getFilenameFromUrl` helper function. This way, with the changes in the next patch, we also avoid having to expose the `isDataScheme` function in the API itself and we instead expose `getPdfFilenameFromUrl` in the API (which feels overall more appropriate).
This extends PR 13033 slightly, with a heuristic to support corrupt PDF documents where the `LineAnnotation`s have an empty /Rect-entry. Please note that while I have no idea if this is "correct", this patch at least makes us output the same /BBox as re-saving in Adobe Reader does.
This is mostly done using `gulp lint --fix` with a few manual changes in
the following diff:
```diff
diff --git a/src/core/pattern.js b/src/core/pattern.js
index 365491ed3..eedd8b686 100644
--- a/src/core/pattern.js
+++ b/src/core/pattern.js
@@ -105,7 +105,7 @@ const Pattern = (function PatternClosure() {
return Pattern;
})();
-var Shadings = {};
+const Shadings = {};
// A small number to offset the first/last color stops so we can insert ones to
// support extend. Number.MIN_VALUE is too small and breaks the extend.
@@ -597,16 +597,15 @@ Shadings.Mesh = (function MeshClosure() {
if (!(0 <= f && f <= 3)) {
throw new FormatError("Unknown type6 flag");
}
- var i, ii;
const pi = coords.length;
- for (i = 0, ii = f !== 0 ? 8 : 12; i < ii; i++) {
+ for (let i = 0, ii = f !== 0 ? 8 : 12; i < ii; i++) {
coords.push(reader.readCoordinate());
}
const ci = colors.length;
- for (i = 0, ii = f !== 0 ? 2 : 4; i < ii; i++) {
+ for (let i = 0, ii = f !== 0 ? 2 : 4; i < ii; i++) {
colors.push(reader.readComponents());
}
- var tmp1, tmp2, tmp3, tmp4;
+ let tmp1, tmp2, tmp3, tmp4;
switch (f) {
// prettier-ignore
case 0:
@@ -729,16 +728,15 @@ Shadings.Mesh = (function MeshClosure() {
if (!(0 <= f && f <= 3)) {
throw new FormatError("Unknown type7 flag");
}
- var i, ii;
const pi = coords.length;
- for (i = 0, ii = f !== 0 ? 12 : 16; i < ii; i++) {
+ for (let i = 0, ii = f !== 0 ? 12 : 16; i < ii; i++) {
coords.push(reader.readCoordinate());
}
const ci = colors.length;
- for (i = 0, ii = f !== 0 ? 2 : 4; i < ii; i++) {
+ for (let i = 0, ii = f !== 0 ? 2 : 4; i < ii; i++) {
colors.push(reader.readComponents());
}
- var tmp1, tmp2, tmp3, tmp4;
+ let tmp1, tmp2, tmp3, tmp4;
switch (f) {
// prettier-ignore
case 0:
@@ -897,7 +895,7 @@ Shadings.Mesh = (function MeshClosure() {
decodeType4Shading(this, reader);
break;
case ShadingType.LATTICE_FORM_MESH:
- var verticesPerRow = dict.get("VerticesPerRow") | 0;
+ const verticesPerRow = dict.get("VerticesPerRow") | 0;
if (verticesPerRow < 2) {
throw new FormatError("Invalid VerticesPerRow");
}
```
A significant portion of the code-base has now been converted to use `let`/`const`, rather than `var`, hence it should be possible to simply enable the ESLint `no-var` rule globally.
This way we can ensure that new code won't accidentally use `var`, and it also removes the need to manually enable the rule in various folders.
Obviously it makes sense to continue the efforts to replace `var`, but that should probably happen on a file and/or folder basis.
Please note that this patch excludes the following code:
- The `extensions/` folder, since that seemed easiest for now (and I don't know exactly what the support situation is for the Chromium-extension).
- The entire `external/` folder is ignored, since most of it is currently excluded from linting.
For the code that isn't imported from elsewhere (and should be ignored), we should probably (at some point) bring the code up to the same linting/formatting standard as the rest of the code-base.
- Various files in the `test/` folder are ignored, as necessary, since the way that a lot of this code is loaded will require some care (or perhaps larger re-factoring) when removing `var` usage.
While the JSDocs have never advertised `getDocument` as supporting Node.js `Buffer`s, that apparently doesn't stop users from passing such data structures to `getDocument`.
In theory the existing `instanceof Uint8Array` check ought to have caught Node.js `Buffer`s, however for reasons that I don't even pretend to understand that check actually passes. Hence this patch which, *only* in Node.js environments, will special-case `Buffer`s to hopefully provide a slightly better out-of-the-box behaviour in Node.js environments[1].
---
[1] Although I'm not sure that we necessarily want to advertise this in the JSDocs, given the specialized use-case.
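The special-casing could look roughly like this (Node.js only; treat the details as an assumption rather than the actual patch):
```js
function normalizeNodeData(src) {
  // Buffer extends Uint8Array in Node.js, but re-wrapping its memory as a
  // plain Uint8Array view avoids any Buffer-specific surprises downstream.
  if (typeof Buffer !== "undefined" && src instanceof Buffer) {
    return new Uint8Array(src.buffer, src.byteOffset, src.byteLength);
  }
  return src;
}
```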
Replace the `objectFromEntries` helper function with an `objectFromMap` one instead, and simplify the data lookup in the AnnotationStorage.getValue method
Note that the majority of these changes were done automatically, by using `gulp lint --fix`, and the manual changes were limited to the following diff:
```diff
diff --git a/src/core/cff_parser.js b/src/core/cff_parser.js
index d684c200e..2e2b811e4 100644
--- a/src/core/cff_parser.js
+++ b/src/core/cff_parser.js
@@ -555,7 +555,7 @@ const CFFParser = (function CFFParserClosure() {
stackSize %= 2;
validationCommand = CharstringValidationData[value];
} else if (value === 10 || value === 29) {
- var subrsIndex;
+ let subrsIndex;
if (value === 10) {
subrsIndex = localSubrIndex;
} else {
@@ -886,15 +886,15 @@ const CFFParser = (function CFFParserClosure() {
format = bytes[pos++];
switch (format & 0x7f) {
case 0:
- var glyphsCount = bytes[pos++];
+ const glyphsCount = bytes[pos++];
for (i = 1; i <= glyphsCount; i++) {
encoding[bytes[pos++]] = i;
}
break;
case 1:
- var rangesCount = bytes[pos++];
- var gid = 1;
+ const rangesCount = bytes[pos++];
+ let gid = 1;
for (i = 0; i < rangesCount; i++) {
const start = bytes[pos++];
const left = bytes[pos++];
@@ -938,7 +938,7 @@ const CFFParser = (function CFFParserClosure() {
}
break;
case 3:
- var rangesCount = (bytes[pos++] << 8) | bytes[pos++];
+ const rangesCount = (bytes[pos++] << 8) | bytes[pos++];
for (i = 0; i < rangesCount; ++i) {
let first = (bytes[pos++] << 8) | bytes[pos++];
if (i === 0 && first !== 0) {
@@ -1173,7 +1173,7 @@ class CFFDict {
}
}
-var CFFTopDict = (function CFFTopDictClosure() {
+const CFFTopDict = (function CFFTopDictClosure() {
const layout = [
[[12, 30], "ROS", ["sid", "sid", "num"], null],
[[12, 20], "SyntheticBase", "num", null],
@@ -1229,7 +1229,7 @@ var CFFTopDict = (function CFFTopDictClosure() {
return CFFTopDict;
})();
-var CFFPrivateDict = (function CFFPrivateDictClosure() {
+const CFFPrivateDict = (function CFFPrivateDictClosure() {
const layout = [
[6, "BlueValues", "delta", null],
[7, "OtherBlues", "delta", null],
@@ -1265,11 +1265,12 @@ var CFFPrivateDict = (function CFFPrivateDictClosure() {
return CFFPrivateDict;
})();
-var CFFCharsetPredefinedTypes = {
+const CFFCharsetPredefinedTypes = {
ISO_ADOBE: 0,
EXPERT: 1,
EXPERT_SUBSET: 2,
};
+
class CFFCharset {
constructor(predefined, format, charset, raw) {
this.predefined = predefined;
@@ -1695,7 +1696,7 @@ class CFFCompiler {
// For offsets we just insert a 32bit integer so we don't have to
// deal with figuring out the length of the offset when it gets
// replaced later on by the compiler.
- var name = dict.keyToNameMap[key];
+ const name = dict.keyToNameMap[key];
// Some offsets have the offset and the length, so just record the
// position of the first one.
if (!offsetTracker.isTracking(name)) {
```
Rather than first checking if data exists before fetching it from storage, we can simply do the lookup directly and then check its value.
Note that this follows the same pattern as utilized in the `AnnotationStorage.setValue` method.
Given that it's only used with `Map`s, and that it's currently implemented in such a way that we (indirectly) must iterate through the data *twice*, some simplification cannot hurt here.
Note that the only reason that we're not using `Object.fromEntries(...)` directly, at each call-site, is that that one won't guarantee that a `null` prototype is being used.
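A sketch of the resulting helper, which only needs a single pass over the `Map`:
```js
function objectFromMap(map) {
  const obj = Object.create(null); // guarantee a null prototype
  for (const [key, value] of map) {
    obj[key] = value;
  }
  return obj;
}
```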
Now that we have scripting support, warning about e.g. JavaScript actions doesn't seem necessary anymore. Especially considering that scripting-related actions are/will not be parsed by the `Catalog.parseDestDictionary` method anyway, since it's intended for handling "simple" actions.
All of this code predates the existence of native JS classes, however we can now clean this up a bit. This patch thus lets us remove some variable "shadowing" from the code.
This helper function is first of all only called *twice*, and secondly it also leads to unnecessary intermediate allocations given how the `TypedArray`s are handled.
Hence we can simply inline this small function, and thus directly allocate the combined `TypedArray` instead.
The `compareByteArrays` is first of all duplicated in multiple closures in the `src/core/crypto.js` file. Secondly, despite its name, it's also functionally equivalent to the now existing `isArrayEqual` helper function.
The `isArrayEqual` helper function is changed to use a standard `for`-loop, rather than `Array.prototype.every`, since that ought to be slightly more efficient given that we're now using it with (potentially) larger data.
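The changed helper thus looks roughly like this:
```js
function isArrayEqual(arr1, arr2) {
  if (arr1.length !== arr2.length) {
    return false;
  }
  for (let i = 0, ii = arr1.length; i < ii; i++) {
    if (arr1[i] !== arr2[i]) {
      return false;
    }
  }
  return true;
}
```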
All of this code predates the existence of native JS classes, however we can now clean this up a bit. This patch thus lets us remove some variable "shadowing" from the code.
Note that this particular helper function is, with the exception of the `GENERIC` default viewer and the (unsupported) SVG-backend, mostly unused at this point in time. Hence we should be able to clean-up this helper function slightly.
Also, fixes a small inconsistency in the `SVGGraphics` initialization in the viewer, by passing in the `disableCreateObjectURL` compatibility-option. Given that the SVG-backend isn't officially supported/recommended this shouldn't have been an issue, but given that I spotted this it can't hurt to fix it.
Note how the `PDFAttachmentViewer` handles PDF file attachments specially, by opening them in a new window/tab, rather than forcing them to be downloaded. This is done to improve the overall UX, since browsers in general are able to handle PDF files internally.
However, for file *annotations* we're currently not attempting to do the same thing and are instead just downloading them directly. In order to unify the behaviour, without having to duplicate a lot of code, the opening of PDF file attachments is thus moved into a new `DownloadManager.openOrDownloadData` method.
- strokeColor corresponds to borderColor;
- support fillColor and textColor;
- support colors on the different annotations;
- fix typo in aforms (+test).
This is similar to the other methods, and the only reason for this not having been done originally is that the `cancel` functionality is a later addition.
Rather than converting the `AnnotationStorage`-data to an Object, before sending it to the worker-thread, we should be able to simply send the internal `Map` directly.
The "structured clone algorithm" doesn't have a problem with `Map`s, however the `LoopbackPort` used when workers are *disabled* (e.g. in Node.js environments) didn't use to support them. With PR 12997 having lifted that restriction, we should now be able to simply send the `AnnotationStorage`-data as-is rather than having to iterate through it to first create an Object.
*Please note:* The changes in `src/core/annotation.js` could have been a lot more compact if we were able to use optional chaining in the `src/core` folder. Unfortunately that's still not possible, since SystemJS is being used in the development viewer (i.e. `gulp server`) and fixing that is *still* blocked by [bug 1247687](https://bugzilla.mozilla.org/show_bug.cgi?id=1247687).
*Please note:* The `defer` parameter has been enabled by default ever since PR 9777 (in 2018), which first shipped in PDF.js release `2.0.943`.
With workers *disabled*, e.g. in Node.js environments, this has been used ever since without any problems reported[1].
The impetus for this change was that I happened to notice that *if* the `LoopbackPort` was used with synchronous event dispatching, we'd simply send that data as-is to the listeners. This created an inconsistency in the data returned from the `pdf.worker.js` file, since `postMessage` used with *actual* workers (or the `LoopbackPort` with `defer = true`) will ignore/throw when encountering unclonable data.
Originally my intention was simply to just call `cloneValue` regardless of the event dispatching used in `LoopbackPort`, however looking at the use-cases (or lack thereof) of the `LoopbackPort` it seemed reasonable to simply remove the `defer` parameter instead.
This patch is tagged "[api-minor]" since the `LoopbackPort` is still exposed in the API, although I really hope that no third-party is using this (since disabling workers leads to bad performance).
Finally, this patch changes a `forEach` loop to `for...of` and makes use of optional chaining in existing code.
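A minimal sketch of what the always-cloning dispatch looks like (a simplification; the actual code uses its own cloning helper rather than `structuredClone`):
```js
class LoopbackPort {
  #listeners = new Set();

  postMessage(obj) {
    // Cloning mimics an actual Worker's postMessage: unclonable data
    // throws here too, instead of silently reaching the listeners.
    const event = { data: structuredClone(obj) };
    for (const listener of this.#listeners) {
      listener.call(this, event);
    }
  }

  addEventListener(name, listener) {
    this.#listeners.add(listener);
  }
}
```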
---
[1] As evidenced by the `npm test` command run by GitHub Actions, and previously by Travis.
With the previous patch this function is now *only* accessed on the worker-thread, hence it's no longer necessary to include it in the *built* `pdf.js` file.
With the previous patch this functionality is now *only* accessed on the worker-thread, hence it's no longer necessary to include it in the *built* `pdf.js` file.
The only reason, as far as I can tell, for parsing the Metadata on the main-thread is how it was originally implemented. When Metadata support was first implemented, it utilized the [`DOMParser`](https://developer.mozilla.org/en-US/docs/Web/API/DOMParser) which isn't available in workers.
Today, with the custom XML-parser being used, that's no longer an issue and it seems reasonable to move the Metadata parsing to the worker-thread[1], since that's where all parsing should happen (for performance reasons).
Based on these changes, we'll be able to reduce the now unnecessary duplication of the XML-parser (and related code) in both of the *built* `pdf.js`/`pdf.worker.js` files.
Finally, this patch changes the `_repair` method to use "Array + join" rather than string concatenation.
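To illustrate the "Array + join" pattern, here's a generic before/after (hypothetical functions, not the actual `_repair` code):
```js
// Before: repeated string concatenation creates a new intermediate
// string on every iteration.
function repairConcat(chars) {
  let data = "";
  for (const char of chars) {
    data += "&#x" + char.charCodeAt(0).toString(16).padStart(2, "0") + ";";
  }
  return data;
}

// After: collect the pieces in an Array and join once at the end.
function repairJoin(chars) {
  const buffer = [];
  for (const char of chars) {
    buffer.push("&#x", char.charCodeAt(0).toString(16).padStart(2, "0"), ";");
  }
  return buffer.join("");
}
```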
---
[1] This needed the previous patch, to enable sending of `Map`s between threads with workers disabled.
I happened to look at this code, and I can't for the life of me figure out why I didn't just implement it like this patch in the first place (since the current format feels overly verbose).
The following checks are all unneeded, and could easily cause confusion when reading the code. (All of them are my fault as well, since I've sometimes added those checks without really thinking about the surrounding code.)
- In `PartialEvaluator.hasBlendModes` there cannot be any `MissingDataException`s thrown, given that the `Page.getOperatorList` method waits for all the necessary /Resources to load first. Furthermore, if an error were thrown from `PartialEvaluator.hasBlendModes` then it'd completely break rendering of that page, since any errors thrown from `Page.getOperatorList` are simply sent to the main-thread.
- In `PartialEvaluator.handleColorN` there cannot be any `MissingDataException`s thrown, given that again the `Page.getOperatorList` method waits for all the necessary /Resources to load before operatorList parsing starts.
- In `XRef.readXRef` there cannot be any `MissingDataException`s thrown, given that we're *explicitly* requesting (and waiting for) the entire document in `pdfManagerReady` (in `src/core/worker.js`) before re-parsing of a corrupt document starts.
* don't set a value in annotationStorage by default:
- an undefined value, when the annotation is rendered for saving/printing, means that nothing has changed, so the normal appearance is used
- aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1681687
* change the way the font size is computed when it's null in the DA:
 - make the fontSize proportional to the line height
 - in the multiline case, take the number of entered text lines into account to adapt the font size
 - use the ascent of the fallback font, instead of the one from the PDF, to position the spans
 - use `TextMetrics.fontBoundingBoxAscent` if available, otherwise guess the ascent with a basic heuristic, drawing a char on a canvas, and as a last resort compute the ascent as a ratio of the font height (see the sketch after this list)
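A hedged sketch of that ascent-fallback chain (assumed names and details, not the actual pdf.js code):
```js
function guessAscent(fontFamily, fontSize) {
  const canvas = document.createElement("canvas");
  canvas.width = canvas.height = Math.ceil(fontSize * 2);
  const ctx = canvas.getContext("2d");
  ctx.font = `${fontSize}px ${fontFamily}`;

  // 1. Prefer the native metric when the browser implements it.
  const metrics = ctx.measureText("A");
  if (metrics.fontBoundingBoxAscent) {
    return metrics.fontBoundingBoxAscent;
  }

  // 2. Heuristic: draw a glyph and scan for the topmost inked pixel row;
  //    its distance to the baseline approximates the ascent.
  const baseline = Math.ceil(fontSize * 1.5);
  ctx.textBaseline = "alphabetic";
  ctx.fillText("A", 0, baseline);
  const { data, width, height } = ctx.getImageData(
    0, 0, canvas.width, canvas.height
  );
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      if (data[(y * width + x) * 4 + 3] !== 0) {
        return baseline - y;
      }
    }
  }

  // 3. Last resort: a fixed ratio of the font height.
  return 0.8 * fontSize;
}
```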
The *third* page of the referenced PDF document currently fails to render completely, since one of its font files fails to load.
Since that error isn't handled, a large part of the text is thus missing which looks quite bad. By "replacing" the font data with an *empty* stream, we'll thus be able to fallback to rendering the text with a standard font (instead of using `ErrorFont`). While there's obviously no guarantee that things will look perfect, actually rendering the text at all should be an improvement in general.
Also, print a warning in `PartialEvaluator.loadFont` when the `PartialEvaluator.translateFont` method rejects, since that'd have helped debug/fix the issue faster.
*As far as I can tell, this has been broken ever since PR 3289 (back in 2013) without anyone noticing.*
For any non-`MissingDataException` errors encountered in `ObjectLoader._walk`, we're simply throwing immediately which thus has the potential to *completely* break rendering of an entire page.
In practice this is obviously only an issue for PDF documents which are in one way or another corrupt, since that's the only way that `XRef.fetch` will throw non-`MissingDataException` errors. To make matters worse these errors are *intermittent*, since they can only occur if the document is still loading when the `ObjectLoader`-code runs (note the early return in `ObjectLoader.load`).
Please note that we cannot simply catch the error and let "normal" parsing continue in `ObjectLoader._walk`, since that could lead to errors elsewhere given that resources "below" the current one (in the graph) might not be checked as intended then.
All-in-all, the only way to make absolutely sure that we won't cause *unexpected* `MissingDataException`s somewhere else in the code-base is to fallback to fetching the *entire* document in this edge-case.
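In pattern form, the fallback amounts to something like this (placeholder names, not the actual `ObjectLoader` code):
```js
class MissingDataException extends Error {}

async function walkWithFallback(walk, requestEntireDocument) {
  try {
    await walk();
  } catch (reason) {
    if (reason instanceof MissingDataException) {
      throw reason; // handled by the existing chunked-loading machinery
    }
    // Any other error: fetch the *entire* document first, then retry,
    // so that no unexpected MissingDataExceptions can occur later on.
    await requestEntireDocument();
    await walk();
  }
}
```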
- in order to evaluate SOM expressions, nodes and their attributes must be checked in the same order as in the XML;
- add an XFAObjectArray object, with a `max` parameter, to handle multiple children with the same name.
- the parser is based on a class extending XMLParserBase
- it handles XML namespaces:
 * each namespace is associated with a builder
 * each builder builds the nodes belonging to its namespace
 * when a node is inserted into its parent, namespace compatibility is checked (if required)
- to avoid name collisions between XML names and object properties, `Symbol`s are used (see the sketch after this list).
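A minimal sketch of that `Symbol`-based separation (hypothetical names, not the actual XFA code): data coming from the XML lives on string keys, while internal state hides behind Symbols, so an XML attribute such as `children` can never clobber the parser's own child storage.
```js
const $children = Symbol("children");
const $namespaceId = Symbol("namespaceId");

class XFAObject {
  constructor(nsId, name) {
    this[$namespaceId] = nsId;
    this[$children] = [];
    this.name = name; // XML-derived data uses plain string keys
  }

  appendChild(child) {
    this[$children].push(child);
  }
}

const node = new XFAObject(0, "field");
node.children = "an XML attribute named 'children'"; // no collision
node.appendChild(new XFAObject(0, "caption"));
```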
Given that `FontFaceObject` is not exposed in the public API, but only accessed internally, there's no need to assume that a `FontFaceObject`-instance is ever initialized without `onUnsupportedFeature` being provided. This is also consistent with the `BaseFontLoader` implementation.