- most of the time the current transform is a scaling one (modulo translation),
hence it's possible to avoid applying the transform to each bbox and to apply
it a posteriori instead;
- compute the bbox when it's possible in the worker.
- it's the second part of the fix for https://bugzilla.mozilla.org/show_bug.cgi?id=857031;
- some image masks can be used several times but at different positions;
- an image needs to be pre-processed before being rendered:
* rescale it;
* use the fill color/pattern.
- the two operations above are time consuming so we can cache the generated canvas;
- the cache key is based on the current transform matrix (without the translation part)
and the current fill color when it isn't a pattern.
- the rendering of the pdf in the above bug is much faster than without this patch.
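For illustration, the caching idea above could be sketched as follows. This is not the code from the patch: `getMaskCacheKey`, `MaskCanvasCache` and the "only plain color strings are cacheable" check are hypothetical, and the cached "canvas" is simply whatever the `create` callback returns.

```javascript
// Hypothetical helper: build a cache key for a pre-processed image mask.
// `transform` is a 2D matrix [a, b, c, d, e, f], where e/f are the translation.
function getMaskCacheKey(transform, fillColor) {
  // Assumption: only plain color strings are cacheable; patterns are not.
  if (typeof fillColor !== "string") {
    return null;
  }
  // Drop the translation part, so that the same mask drawn at another
  // position maps to the same key.
  const [a, b, c, d] = transform;
  return [a, b, c, d, fillColor].join(",");
}

// Hypothetical per-mask cache of the generated canvases.
class MaskCanvasCache {
  constructor() {
    this._map = new Map();
  }

  getOrCreate(transform, fillColor, create) {
    const key = getMaskCacheKey(transform, fillColor);
    if (key === null) {
      return create(); // Not cacheable, always regenerate.
    }
    if (!this._map.has(key)) {
      this._map.set(key, create());
    }
    return this._map.get(key);
  }
}
```

With such a key, drawing the same mask at two different positions only pays the rescale/fill cost once, while a different scale or fill color produces a new entry.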
- avoid calling normalizeRect, which clones the rectangles: it's useless
and time consuming;
- when profiling the pdf in bug 1135277, the time spent in intersect drops
from ~1s to ~30ms.
Because of a bug in previous `core-js` versions, which caused an Error to be thrown if its `structuredClone` polyfill was called with an *explicit* `null`/`undefined` transfer-parameter, the `LoopbackPort`-class contained a work-around.
In the latest `core-js` version this has been fixed, and we can thus simplify our code ever so slightly; please see https://github.com/zloirock/core-js/releases/tag/v3.22.0
This CSS variable is only used together with the `annotationCanvasMap`-functionality in the canvas-code, however its value can be *trivially* computed by using the older `--zoom-factor` CSS variable together with the `PixelsPerInch`-structure.
Rather than having *two different* CSS variables that are this closely linked, it seems better to simplify things by using just one CSS variable instead.
- write uint32 values instead of uint8 ones to avoid the check before clamping;
- unroll the loop that writes data into the buffer;
- but keep a loop for the last element of a line: it likely doesn't hurt
that much since it's executed only once per line;
- I tested on a MacBook with an Apple chip, and on Firefox Nightly the new
code is almost 3.5x faster than before (~1.8x with Chrome).
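As a rough sketch of the technique (my own reading of the bullets above, not the patched function), the following expands a 1-bit-per-pixel image into 32-bit pixels. The function name `expand1bpp` and the default `zero`/`one` values are assumptions; note that how a 32-bit value maps to RGBA bytes depends on the platform endianness.

```javascript
// Sketch: expand a 1bpp image (rows padded to whole bytes) into 32-bit pixels.
// A set bit is written as `one`, a cleared bit as `zero`.
function expand1bpp(src, width, height, zero = 0xff000000, one = 0xffffffff) {
  const dest = new Uint32Array(width * height);
  const fullBytes = width >> 3; // Bytes whose 8 bits all belong to the row.
  const rest = width & 7; // Trailing pixels stored in the last byte of a row.
  let srcPos = 0;
  let destPos = 0;

  for (let row = 0; row < height; row++) {
    for (let i = 0; i < fullBytes; i++) {
      const b = src[srcPos++];
      // Unrolled: one 32-bit store per pixel, most significant bit first.
      dest[destPos++] = b & 0x80 ? one : zero;
      dest[destPos++] = b & 0x40 ? one : zero;
      dest[destPos++] = b & 0x20 ? one : zero;
      dest[destPos++] = b & 0x10 ? one : zero;
      dest[destPos++] = b & 0x08 ? one : zero;
      dest[destPos++] = b & 0x04 ? one : zero;
      dest[destPos++] = b & 0x02 ? one : zero;
      dest[destPos++] = b & 0x01 ? one : zero;
    }
    if (rest !== 0) {
      // Plain loop for the last, partial byte: runs at most once per row.
      const b = src[srcPos++];
      for (let j = 0; j < rest; j++) {
        dest[destPos++] = b & (1 << (7 - j)) ? one : zero;
      }
    }
  }
  return dest;
}
```

Writing whole 32-bit values means there's one store per pixel instead of four, and no per-byte clamping is involved.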
- it aims to partially fix the performance issue reported in https://bugzilla.mozilla.org/show_bug.cgi?id=857031;
- the idea is to avoid using byte arrays and to use ImageBitmaps instead, which are way faster to draw:
* an ImageBitmap is Transferable, which means that it can be built in the worker instead of in the main thread:
- this is achieved by using an OffscreenCanvas when it's available; there is a bug to enable them
for pdf.js: https://bugzilla.mozilla.org/show_bug.cgi?id=1763330;
- or by using createImageBitmap: in Firefox a task is sent to the main thread to build the bitmap so
it's slightly slower than using an OffscreenCanvas.
* it's transferred from the worker to the main thread by "reference";
* the byte buffers used to create the image data have a very short lifetime and ergo the memory used is globally
less than before.
- Use the localImageCache for the mask;
- Fix the pdf issue4436r.pdf: it was expected to have a binary stream for the image;
- Move the singlePixel trick from operator_list to image: this way we can use this trick even if it isn't in a set
as defined in operator_list.
We don't need to first check if the Dictionary contains the key, since trying to get a non-existent key simply returns `undefined` and we're already ensuring that the value is a boolean.
Furthermore, we shouldn't need to worry about the `Object.prototype` containing enumerable properties since the checks (in `src/core/worker.js`) done for `Array.prototype` *indirectly* also cover `Object`s. (Keep in mind that an `Array` is just a special kind of `Object` in JavaScript.)
*Please note:* This is possibly bad/wrong in general, but I figured that submitting it for review wouldn't hurt.
It seems that even Adobe Reader doesn't handle the non-ASCII characters that appear in some of the fields correctly, however it should be pretty easy to improve things on the PDF.js side.
- it aims to fix #14685;
- add a basic object to get values from the parsed datasets;
- these annotations don't have an appearance so we must create one when printing or saving.
- it aims to fix issue #14627;
- the basic idea of the recent text refactoring was to only consider the rendered, visible whitespaces.
But sometimes the heuristics aren't correct: although some whitespaces are in the text stream,
they weren't in the text chunks because they were too small. Hence we added some exceptions; for example,
we always add a whitespace when it is between two non-whitespace chars, but only when they're in the same Tj.
So basically, this patch removes the constraint of having the chars in the same Tj
(by using a circular buffer to save the last two chars) but doesn't add a space when the visible space is really
too small (hence `NOT_A_SPACE_FACTOR`).
Given that the textLayer-code has been using a `DocumentFragment` ever since PR 3356 (back in 2013), simply updating the type of the `container` property should be fine.
This patch also tries to, ever so slightly, improve the grammar of a couple of other properties in the typedef.
There's a couple of `getDocument` parameters that should be numbers, but which are currently not *fully* validated to prevent issues elsewhere in the code-base.
Also, improves validation of the `ownerDocument` parameter since we currently accept more-or-less anything here.
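A minimal sketch of this kind of number validation is shown below. The helper, the `chunkSize` parameter name and the default value are all made up for the example and are not taken from the API.

```javascript
// Hypothetical helper: keep `value` only if it's an integer >= `min`,
// otherwise use `defaultValue`.
function validateIntegerParam(value, min, defaultValue) {
  return Number.isInteger(value) && value >= min ? value : defaultValue;
}

// Example with a made-up parameter name and default.
const DEFAULT_CHUNK_SIZE = 65536;

function getChunkSize(params) {
  return validateIntegerParam(params.chunkSize, 1, DEFAULT_CHUNK_SIZE);
}
```

With this, things like `"1024"`, `NaN`, `0` or a missing value all end up as the default rather than leaking into the rest of the code.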
Note that the Prettier update made it possible to move a couple of comments after `default:`-cases back to their original/intended positions, please see https://prettier.io/blog/2022/03/16/2.6.0.html
This patch removes the existing `forEach` methods, in favor of making the classes properly iterable instead. Given that the classes are using a `Set` respectively a `Map` internally, implementing this is very easy/efficient and allows us to simplify some existing code.
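To illustrate the pattern, here's a small sketch with two stand-in classes (`KeySet` and `KeyCache` are hypothetical names, and plain string keys are used to keep things short):

```javascript
// Wraps a Set; iterating an instance yields the stored keys.
class KeySet {
  constructor() {
    this._set = new Set();
  }

  put(key) {
    this._set.add(key);
  }

  has(key) {
    return this._set.has(key);
  }

  [Symbol.iterator]() {
    return this._set.values();
  }
}

// Wraps a Map; iterating an instance yields the cached values.
class KeyCache {
  constructor() {
    this._map = new Map();
  }

  put(key, value) {
    this._map.set(key, value);
  }

  get(key) {
    return this._map.get(key);
  }

  [Symbol.iterator]() {
    return this._map.values();
  }
}
```

Call-sites can then use `for (const value of cache) { ... }` or spread syntax directly, rather than passing a callback to `forEach`.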
Given that we now only use Workers when `postMessage` transfers are supported, there's really no point in trying to send a "test" message *without* transfers present.
Hence, if `postMessage` transfers are not supported by the browser, we'll now fallback to "fake" Workers immediately instead. The comment about Opera is also removed, since it was originally added back in PR 983 and mentions Opera `11.60` [which was released in 2011](https://en.wikipedia.org/wiki/History_of_the_Opera_web_browser#Version_11).
These changes make sense for two reasons:
- Given that the parameters are potentially passed to the worker-thread, depending on the `useWorkerFetch` parameter, we need to prevent errors if the user provides values that aren't clonable.
- By ensuring that the default values are indeed `null`, we'll trigger main-thread fetching (of CMaps and Standard fonts) as intended in the `PartialEvaluator` and thus potentially provide better Error messages.
This function is currently placed in the `src/shared/util.js` file, which means that the code is duplicated in both of the *built* `pdf.js` and `pdf.worker.js` files. Furthermore, it only has a single call-site which is also specific to the `GENERIC`-build of the PDF.js library.
Hence this helper function is instead moved into the `src/display/api.js` file, in such a way that it's conditionally defined but still can be unit-tested.
Originally the code in the `src/`-folder was shared between the main/worker-threads, and back then it probably made sense that the `PDFDocument` constructor accepted different arguments.
However, for many years we've not been passing anything *except* Streams to `PDFDocument` and we should thus be able to slightly simplify that code. Note that for e.g. unit-tests of this code, using either a `NullStream` or a `StringStream` works just fine.
According to the MDN compatibility data, see https://developer.mozilla.org/en-US/docs/Web/API/DOMMatrix/DOMMatrix#browser_compatibility, all browsers that we support have native `DOMMatrix` implementations (since quite some time too).
Hence Node.js is the only environment that lacks `DOMMatrix` support, which probably isn't that surprising given that it's browser functionality.
While the `DOMMatrix` polyfill isn't that large, it nonetheless seems completely unnecessary to bundle it in the `legacy` builds when it's not needed in browsers. However, we can avoid that by simply listing `dommatrix` as a dependency for the `pdfjs-dist` library.
When there are *multiple* empty glyphs at the start of the data, ensure that the "first" glyph gets a correct `endOffset` to avoid skipping it during parsing in the `sanitizeGlyph` function.
The situation described in issue 14626 seems like a fairly special case, and it thus seems reasonable that we simply follow the same pattern as elsewhere in the `PartialEvaluator` when the `stopAtErrors` API-option is being used.
The "External: Promise"-page in the JSDocs pre-dates the introduction of `Promise`s, as a generally available standard JS feature, by a number of years. Hence it no longer seems necessary, as far as I can tell, to include this "special" page in the documentation.
Also, while unrelated to the rest of the patch, updates the `test/`-folder description in the documentation.
The call-sites are replaced by direct `typeof`-checks instead, which removes unnecessary function calls. Note that in the `src/`-folder we already had more `typeof`-cases than `isString`-calls.
This removes the `DocumentInfoValidators` structure, and thus (slightly) simplifies the code overall. With these changes we only have to iterate through, and validate, the actually available Dictionary entries.
- it aims to fix:
- https://bugzilla.mozilla.org/show_bug.cgi?id=1753075;
- https://bugzilla.mozilla.org/show_bug.cgi?id=1743245;
- https://bugzilla.mozilla.org/show_bug.cgi?id=1710019;
- issue #13211;
- issue #14521.
- previously we were trying to adjust lineWidth to get something correct after the current transform is applied, but this approach was not correct because in the end the pixel is rescaled with the same factors in both directions.
And sometimes those factors must be different (see bug 1753075).
- So the idea of this patch is to apply a scale matrix to the current transform just before setting lineWidth and stroking. This scale matrix is computed in order to ensure that, after the transform, a pixel has both of its thicknesses greater than 1.
All call-sites that use `wrapReason` should be passing a (possibly cloned) `Error` to the helper function, hence we shouldn't need to have a fallback code-path for any other data.
Note that for the `cancel`/`error` methods on Streams, since PR 11115 we've been asserting that the argument is in fact an `Error` as intended.
When calling `wrapReason` from *rejected* Promises, we should also be guaranteed that an `Error` is provided thanks to the ESLint rules `no-throw-literal` and `prefer-promise-reject-errors`.
Currently we'll happily attempt to send any argument passed to this method over to the worker-thread, without doing any sort of validation.
That could obviously be quite bad, since there's first of all no protection against sending unclonable data. Secondly, it's also possible to pass data that will cause the `Ref.get` call in the worker-thread to fail immediately.
In order to address all of these issues, we'll now properly validate the argument passed to `PDFDocumentProxy.getPageIndex` and when necessary reject already on the main-thread instead.
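The validation could be sketched roughly as follows. Here `isValidRef` is a hypothetical helper, `sendToWorker` is a stand-in for the actual worker round-trip, and the error message is only illustrative.

```javascript
// A reference is expected to look like `{ num, gen }`, with both values
// being non-negative integers.
function isValidRef(ref) {
  return (
    typeof ref === "object" &&
    ref !== null &&
    Number.isInteger(ref.num) &&
    ref.num >= 0 &&
    Number.isInteger(ref.gen) &&
    ref.gen >= 0
  );
}

function getPageIndex(ref, sendToWorker) {
  if (!isValidRef(ref)) {
    // Reject on the main-thread, without bothering the worker at all.
    return Promise.reject(new Error("Invalid pageIndex request."));
  }
  // Only send the validated (and thus clonable) data.
  return sendToWorker({ num: ref.num, gen: ref.gen });
}
```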
Trying to use a non-string argument in either a `Cmd` or a `Name` is not intended, and would basically be an implementation error. Hence we can add a non-PRODUCTION check to enforce this, similar to the existing one used e.g. in the `Dict.set` method.
This removes the `ViewerPreferencesValidators` structure, and thus (slightly) simplifies the code overall. With these changes we only have to iterate through, and validate, the actually available Dictionary entries.
Trying to use a non-string `key` in a `Dict` is not intended, and would basically be an implementation error. Hence we can add a non-PRODUCTION check to enforce this, complementing the existing `value` check added in PR 11672.
The call-sites are replaced by direct `typeof`-checks instead, which removes unnecessary function calls. Note that in the `src/`-folder we already had more `typeof`-cases than `isNum`-calls.
These changes were *mostly* done using regular expression search-and-replace, with two exceptions:
- In `Font._charToGlyph` we no longer unconditionally update the `width`, since that seems completely unnecessary.
- In `PDFDocument.documentInfo`, when parsing custom entries, we now do the `typeof`-check once.
Unless you actually need to check that something is both a `Name` and also of the *correct* type, using `instanceof Name` directly should be a tiny bit more efficient since it avoids one function call and an unnecessary `undefined` check.
This patch uses ESLint to enforce this, since we obviously still want to keep the `isName` helper function for where it makes sense.
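As a simplified sketch of the difference (stand-alone versions of the class and helper, written for this example):

```javascript
class Name {
  constructor(name) {
    this.name = name;
  }
}

// The helper does two things: the `instanceof` check and, optionally,
// a comparison against a specific name.
function isName(v, name) {
  return v instanceof Name && (name === undefined || v.name === name);
}

const value = new Name("Page");

// When only the type matters, the direct check avoids a function call
// and the `undefined` comparison.
const isAnyName = value instanceof Name;

// When the actual name matters too, the helper still makes sense.
const isPageName = isName(value, "Page");
```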
Unless you actually need to check that something is both a `Dict` and also of the *correct* type, using `instanceof Dict` directly should be a tiny bit more efficient since it avoids one function call and an unnecessary `undefined` check.
This patch uses ESLint to enforce this, since we obviously still want to keep the `isDict` helper function for where it makes sense.
Unless you actually need to check that something is both a `Cmd` and also of the *correct* type, using `instanceof Cmd` directly should be a tiny bit more efficient since it avoids one function call and an unnecessary `undefined` check.
This patch uses ESLint to enforce this, since we obviously still want to keep the `isCmd` helper function for where it makes sense.
Given that we expose `PDFObjects`-instances, via the `commonObjs` and `objs` properties, on the `PDFPageProxy`-instances this ought to help provide slightly better TypeScript definitions.
The manually tracked `resolved`-property is no longer necessary, since the same information is now directly available on all `PromiseCapability`-instances.
Furthermore, since the `PDFObjects.resolve` method is not documented as accepting e.g. only Object-data, we probably shouldn't resolve the `PromiseCapability` with the `data` and instead only store it on the `PDFObjects`-instance.[1]
---
[1] While Objects are passed by reference in JavaScript, other primitives such as e.g. strings are passed by value and the current implementation *could* thus lead to increased memory usage. Given how we're using `PDFObjects` in the PDF.js code-base none of this should be an issue, but it still cannot hurt to change this.
This ensures that the underlying data cannot be accessed directly, from the outside, since that's definitely not intended here.
Note that we expose `PDFObjects`-instances, via the `commonObjs` and `objs` properties, on the `PDFPageProxy`-instances hence these changes really cannot hurt.
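One way to get this kind of encapsulation is a private class field; the sketch below assumes that approach and uses a made-up `ObjectStore` class rather than the real `PDFObjects` implementation.

```javascript
class ObjectStore {
  // Private field: not reachable as a property from outside the class.
  #objs = Object.create(null);

  set(id, data) {
    this.#objs[id] = data;
  }

  has(id) {
    return id in this.#objs;
  }

  get(id) {
    if (!(id in this.#objs)) {
      throw new Error(`Requesting object that isn't available: ${id}`);
    }
    return this.#objs[id];
  }
}
```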
This helper function is not really needed, since it's just a wrapper around a simple `instanceof` check, and it only adds unnecessary indirection in the code.
*Please note:* I'm completely fine with this patch being rejected, and the issue instead closed as WONTFIX, since this is unfortunately a case where the TypeScript definitions dictate how we can/cannot write JavaScript code.
Apparently the TypeScript definitions generation converts the existing `PixelsPerInch` code into a `namespace` and simply ignores the getter; please see a7fc0d33a1/types/src/display/display_utils.d.ts (L223-L226)
Initially I tried tagging `PixelsPerInch` as an `@enum`, see https://jsdoc.app/tags-enum.html, however that unfortunately didn't help.
Hence the only good/simple solution, as far as I'm concerned, is to convert `PixelsPerInch` into a class with `static` properties. This patch results in the following diff, for the `gulp types` build target:
```diff
@@ -195,9 +195,10 @@
*/
static toDateObject(input: string): Date | null;
}
-export namespace PixelsPerInch {
- const CSS: number;
- const PDF: number;
+export class PixelsPerInch {
+ static CSS: number;
+ static PDF: number;
+ static PDF_TO_CSS_UNITS: number;
}
declare const RenderingCancelledException_base: any;
export class RenderingCancelledException extends RenderingCancelledException_base {
```
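For reference, a class matching the generated definitions above could look roughly like this. The concrete initializers are my assumption (96 CSS pixels per inch, 72 PDF points per inch), since the diff only shows the property names and types.

```javascript
class PixelsPerInch {
  // Assumed values: 96 CSS pixels per inch, 72 PDF points per inch.
  static CSS = 96.0;

  static PDF = 72.0;

  // Plain static property instead of a getter, so that it shows up in
  // the generated TypeScript definitions.
  static PDF_TO_CSS_UNITS = this.CSS / this.PDF;
}
```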
Soft masks can be enabled/disabled at any time and at different
points in the save/restore stack. This can lead to
the number of save/restores becoming unbalanced across the
two canvases. Instead of save/restoring on the temporary canvas,
change it so we only track state on the main (suspended) canvas.
I was also getting an out-of-balance stack from patterns, so I've also
fixed that and added a warning that will at least show up in Chrome.
It would be nice to add this to Firefox at some point too.
Fixes #11328, #14297 and bug 1755507
At this point all the various Stream-classes extend an abstract base-class, hence this helper function is no longer necessary and only adds unnecessary indirection in the code.
Unfortunately I don't have a test-case that breaks without this change, however the `stringToPDFString` helper function will fail if anything other than a string is passed to it.
The changes in this patch thus make this code more-or-less identical to that found in the `Catalog.{_collectJavaScript, parseDestDictionary}` methods.
According to the MDN compatibility data, see https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream#browser_compatibility, all browsers that we support have native `ReadableStream` implementations (since quite some time too).
Hence only Node.js is now lagging behind w.r.t. `ReadableStream` support, and its experimental implementation doesn't really help us given the life-span of the LTS releases (see https://en.wikipedia.org/wiki/Node.js#Releases).
It seems quite unfortunate to bundle a `ReadableStream` polyfill in the `legacy` builds when it's unnecessary in browsers, given its overall size, but fortunately we can avoid that by simply listing `web-streams-polyfill` as a dependency for the `pdfjs-dist` library.
With recent changes, specifically PR 14515 *and* the previous patch, the `createObjectURL` helper function is now only used with the SVG back-end.
All other call-sites, throughout the code-base, are now using `URL.createObjectURL(...)` directly and it no longer seems necessary to keep exposing the helper function in the API.
Finally, the `createObjectURL` helper function is moved into the `src/display/svg.js` file to avoid unnecessarily duplicating this code on both the main- and worker-threads.
This is essentially a *continuation* of PR 7926, where we added support for rejecting the current `PDFDocumentLoadingTask`-promise by throwing inside of the `onPassword`-callback.
Hence the naive way to address [bug 1754421](https://bugzilla.mozilla.org/show_bug.cgi?id=1754421) would be to simply throw in the `onPassword`-callback used in the default viewer. However it unfortunately turns out to not work, since the password input/validation is asynchronous, and we thus need another approach.
The simplest solution that I can come up with here, is thus to *extend* the `onPassword`-callback to also reject the current `PDFDocumentLoadingTask`-instance if an `Error` is explicitly passed as the input to the callback function. (This doesn't feel great, but I cannot see a better solution that isn't really complicated.)
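A simplified sketch of the extended callback handling is given below. `createPasswordUpdater` is a hypothetical helper, and `capability` stands for any object with `resolve`/`reject` methods.

```javascript
// Returns the function that is handed to the `onPassword`-callback.
function createPasswordUpdater(capability) {
  return function updatePassword(password) {
    if (password instanceof Error) {
      // An explicit Error rejects the pending request, and thus the
      // loading task, even if the callback completes asynchronously.
      capability.reject(password);
      return;
    }
    capability.resolve({ password });
  };
}
```

In the viewer, this would allow e.g. a cancelled password dialog to call `updatePassword(new Error("..."))` long after the `onPassword`-callback itself has returned.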
This appears to be consistent with the behaviour in both Adobe Reader and PDFium (in Google Chrome); this is essentially the same approach as used for a single decimal point in PR 9827.
Please note that while we "support" some (by now) fairly old browsers, that essentially means that the library (and viewer) will load and that the basic functionality will work as intended.[1]
However, in older browsers, some functionality may not be available and generally we'll ask users to update to a modern browser when bugs (specific to old browsers) are reported.[2]
There's always a question of just how old browsers the PDF.js contributors can realistically support, and here I'm suggesting that we place the cut-off point at approximately *three* years.
With that in mind, this patch updates the *minimum* supported browsers (and environments) as follows:
- Chrome 73, which was released on 2019-03-12; see https://en.wikipedia.org/wiki/Google_Chrome_version_history
- Firefox ESR (as before); see https://wiki.mozilla.org/Release_Management/Calendar
- Safari 12.1, which was released on 2019-03-25; see https://en.wikipedia.org/wiki/Safari_version_history#Safari_12
- Node.js 12, which was released on 2019-04-23 (and will soon reach EOL); see https://en.wikipedia.org/wiki/Node.js#Releases
---
[1] Assuming a `legacy`-build is being used, of course.
[2] In general it's never a good idea to use an old/outdated browser, since those may contain *known* security vulnerabilities.
With these changes, we'll now *always* replace all whitespaces with standard spaces (0x20). This behaviour has already been the default, for many years, in both the viewer and the browser-tests.
- it aims to fix #14502 and bug 1721335;
- Acrobat and Pdfium do the same;
- it'll avoid having truncated data when printed;
- change the factor used to compute the font size from the field height: lineHeight = 1.35*fontSize
- this is the value used by Acrobat.
- in order to not have strings truncated at the bottom, add a few basic metrics for standard fonts.
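The relation above can be turned around to get a font size from a field height. The following is only a sketch of that arithmetic; `computeFontSize` and its `lineCount` parameter are hypothetical, and any rounding or min/max bounds are left out.

```javascript
// lineHeight = LINE_FACTOR * fontSize
const LINE_FACTOR = 1.35;

// Hypothetical helper: font size such that `lineCount` lines fit in a
// field of the given height.
function computeFontSize(fieldHeight, lineCount = 1) {
  const lineHeight = fieldHeight / lineCount;
  return lineHeight / LINE_FACTOR;
}
```

For example, a single-line field with a height of 27 units gives a font size of about 20.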
This allows us to remove the manually implemented `structuredClone` polyfill, thus reducing the maintenance burden for the `LoopbackPort` class; refer to https://github.com/zloirock/core-js#structuredclone
*Please note:* While `structuredClone` support landed already in Firefox 94, Google Chrome only added it in version 98 (currently in Beta). However, given that the `LoopbackPort` will only be used together with *fake workers* in browsers this shouldn't be too much of a problem.[1]
For Node.js environments, where *fake workers* are unfortunately necessary, using a `legacy/`-build is already required which thus guarantees that the `structuredClone` polyfill is available.
Also, the patch updates core-js to the latest version since that one includes `structuredClone` improvements; please see https://github.com/zloirock/core-js/releases/tag/v3.20.3
---
[1] Given that we only support browsers with proper worker support, if *fake workers* are being used that essentially indicates a configuration problem/error.
- it aims to fix #14497;
- previously, only rotations with an angle 0, 90, 180 or 270 were taken into account;
- so generalize to any angle but keep the fast path for 0, 90, ... because they're likely more common than anything else.
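The "fast path plus general case" structure could look something like the sketch below. It's a generic illustration, not the code from the patch, and `getRotationMatrix` as well as the `[a, b, c, d]` matrix layout are assumptions.

```javascript
// Returns the linear part [a, b, c, d] of a rotation by `angle` degrees.
function getRotationMatrix(angle) {
  const normalized = ((angle % 360) + 360) % 360;

  // Fast path: exact values, no trigonometry, for the common angles.
  switch (normalized) {
    case 0:
      return [1, 0, 0, 1];
    case 90:
      return [0, 1, -1, 0];
    case 180:
      return [-1, 0, 0, -1];
    case 270:
      return [0, -1, 1, 0];
  }

  // General case: any other angle.
  const rad = (normalized * Math.PI) / 180;
  const cos = Math.cos(rad);
  const sin = Math.sin(rad);
  return [cos, sin, -sin, cos];
}
```

Besides being cheaper, the fast path also avoids the tiny floating-point errors that `Math.cos`/`Math.sin` would otherwise introduce for right angles.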
This commit fixes Bug 1743245 (Grided PDF file lines rendered too thick), which was introduced by a fix for #12868.
The lineWidth was set to round(1 * this._combinedScaleFactor) when the pixel is drawn as a parallelogram with a height <1. This fix changes this to floor(1 * this._combinedScaleFactor).
This change shows a visual result comparable to Chrome and Acrobat.
Regarding the last PR: 3 statements in canvas.js are affected and will change with this commit (stroke and paintChar).
renaming the reference files to follow the naming convention
Given that the regular expression has already become more complex (after the initial patch adding it), it seems to me that it probably cannot hurt to add a global cache to reduce unnecessary re-parsing.
Obviously the `Glyph`-instances are being cached *per* font, however in most documents multiple fonts are being used and in practice there's very often a fair amount of overlap between the /ToUnicode-data in different fonts[1].
Consider for example loading and rendering the entire `tracemonkey.pdf` document (from the test-suite), which isn't a particularly large document. In that case the `getCharUnicodeCategory` function is being called a total of `601` times, however there's only `106` *unique* unicode-chars being checked.
*Please note:* In practice I suppose that this won't have a *huge* effect on overall performance, however given the relative simplicity of this patch I figured that it'd not hurt to submit it for review.
---
[1] Consider e.g. how there's usually different fonts used for regular, bold, respectively italic text.
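The caching itself is simple; here's a sketch of the idea. The regular expression and the shape of the returned object are simplified assumptions for this example, and only the function name is taken from the description above.

```javascript
// Simplified classifier: whitespace, non-spacing marks, format characters.
const SPECIAL_CHAR_REGEXP = /^(?:(\s)|(\p{Mn})|(\p{Cf}))$/u;

// Global cache, shared between all fonts.
const categoryCache = new Map();

function getCharUnicodeCategory(char) {
  const cached = categoryCache.get(char);
  if (cached) {
    return cached; // No need to run the regular expression again.
  }
  const groups = SPECIAL_CHAR_REGEXP.exec(char);
  const category = {
    isWhitespace: !!(groups && groups[1]),
    isZeroWidthDiacritic: !!(groups && groups[2]),
    isInvisibleFormatMark: !!(groups && groups[3]),
  };
  categoryCache.set(char, category);
  return category;
}
```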
- it aims to fix issue #14307;
- this event has been added recently in Firefox and we can now use it;
- fix a few bugs in aform.js or in annotation_layer.js;
- add some integration tests to test keystroke events (see `AFSpecial_Keystroke`);
- make dispatchEvent in the quickjs sandbox async.
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1749563;
- use some helper functions to get (u|i)int** values from the buffer: it makes the code clearer;
- in composite glyphs the translation values used with a transformation are signed, so consequently read int8 values instead of uint8 ones;
- add a few TODOs.
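Helpers of this kind could be sketched as follows. The names are hypothetical and big-endian byte order is assumed, as used by TrueType data.

```javascript
// Read an unsigned 8-bit value.
function readUint8(data, offset) {
  return data[offset];
}

// Read a signed 8-bit value: shift the byte to the top of a 32-bit
// integer and back down again, which sign-extends it.
function readInt8(data, offset) {
  return (data[offset] << 24) >> 24;
}

// Read an unsigned, big-endian, 16-bit value.
function readUint16(data, offset) {
  return (data[offset] << 8) | data[offset + 1];
}

// Read a signed, big-endian, 16-bit value.
function readInt16(data, offset) {
  return ((data[offset] << 24) | (data[offset + 1] << 16)) >> 16;
}
```

Reading a signed translation as unsigned would turn e.g. `-1` (byte `0xFF`) into `255`, which moves the glyph component to a completely wrong position.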
Please refer to https://www.pdfa.org/norm-refs/Type1Fonts.pdf#page=15 for the expected format for the /CharStrings entries.
In the referenced PDF document the /CharStrings are missing the expected end-token, which causes us to swallow the start of the next glyph name.
After the changes in PR 14428 we can *directly*, and more efficiently, handle whitespace conversion in `PartialEvaluator.getTextContent` when the `normalizeWhitespace` option is being used.
This way we no longer need a separate helper function for this, and can avoid having to (again) iterate through the text and checking each character. Finally, this also removes the need for using a regular expression on e.g. all non-ASCII text.
Inlining the checks should be a *tiny bit* more efficient, since it avoids having to make *unconditional* function calls in these fairly commonly used helper functions.
This patch implements this by looking for the UTF-8 BOM, i.e. `\xEF\xBB\xBF`, in order to determine the encoding.[1]
The actual conversion is done using the `TextDecoder` interface, which should be available in all environments/browsers that we support; please see https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder#browser_compatibility
---
[1] Assuming that everything lacking a UTF-16 BOM would have to be UTF-8 encoded really doesn't seem correct.
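The BOM detection and conversion could be sketched like this. `decodeUtf8String` is a hypothetical name, the input is assumed to be a "byte string" (one char per byte), and handling of other encodings is out of scope for the example.

```javascript
function decodeUtf8String(str) {
  // Only treat the string as UTF-8 when the BOM is actually present.
  if (!str.startsWith("\xEF\xBB\xBF")) {
    return str;
  }
  // Convert the byte string back into actual bytes.
  const bytes = Uint8Array.from(str, ch => ch.charCodeAt(0) & 0xff);
  // By default `TextDecoder` strips the BOM from the decoded output.
  return new TextDecoder("utf-8").decode(bytes);
}
```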
In corrupt PDF documents Type3 fonts may introduce circular dependencies, thus resulting in the affected font(s) never loading and parsing/rendering never completing.
Note that I've not seen any real-world examples of this kind of font corruption, but the attached PDF document was rather found in https://github.com/pdf-association/safedocs/tree/main/Miscellaneous%20Targeted%20Test%20PDFs
*Please note:* That repository contains a number of reduced test-cases that are specifically intended to test interoperability (between PDF viewers) and parsing/rendering for various kinds of strange/corrupt PDF documents.
Some of the test-cases found there may thus not make sense to try and "fix" upfront, in my opinion, unless the problems are also found in real-world PDF documents.
While `PageViewport` apparently makes sense in TypeScript environments, given that it's being returned by the `PDFPageProxy.getViewport`-method in the API, we really don't want to extend the *public* API by simply exporting the class directly in `src/pdf.js` since it should never be called/initialized manually.
Hence we follow the same pattern as in PR 14013, and also extend the API unit-tests to ensure that `PDFPageProxy.getViewport` always returns a `PageViewport`-instance as expected.
This prevents the `BaseSVGFactory.create`-method from throwing, and thus preventing any remaining Annotations (on the page) from rendering in corrupt documents.
This helper function has never been used in e.g. the worker-thread, hence its placement in `src/shared/util.js` led to a *small* amount of unnecessary duplication.
After the previous patches this helper function is now *only* used in the viewer, hence it no longer seems necessary to expose it through the official API.
*Please note:* It seems somewhat unlikely that third-party users were relying *directly* on this helper function, which is why it's not being exported as part of the viewer components. (If necessary, we can always change this later on.)
As part of the changes/improvement in PR 14092, we're no longer using the `addLinkAttributes` directly in e.g. the AnnotationLayer-code.
Given that the helper function is now *only* used in the viewer, it no longer seems necessary to expose it through the official API.
*Please note:* It seems somewhat unlikely that third-party users were relying *directly* on the helper function, which is why it's not being exported as part of the viewer components. (If necessary, we can always change this later on.)
The patch in PR 14335 *essentially* re-introduced the old code from before PR 3848, however looking at this code a bit closer it should be possible to simplify it by making the method asynchronous.
While this method is currently only used as a *fallback* in corrupt documents, the way that `MissingDataException`s are handled is less than ideal. Note that if a `MissingDataException` is thrown, we're forced to re-parse the *entire* /Pages tree[1].
With this method now being asynchronous, we're able to handle fetching of References in a *much* easier/nicer way than before without having to throw `MissingDataException`s and re-parse anything.
These changes also let us simplify the call-site slightly, by calling the method *directly* instead of using the `PDFManager`-instance (since again it will no longer throw `MissingDataException`s).
Furthermore, this patch contains the following other changes:
- Reduce unnecessary duplication in the various `catch` handlers throughout the method, by simply moving the `XRefEntryException` handling into the `addPageError` helper function instead.
- Move the "circular references"-check to occur slightly earlier, since there's obviously no point in asynchronously fetching data just to then throw an Error *immediately* afterwards.
---
[1] Imagine e.g. a thousand page document, where there's a `MissingDataException` thrown when fetching/parsing page 900.
This method is now being used a lot more, compared to when it was added, since it's now used together with scripting as part of the `PDFDocument.fieldObjects` parsing (called during viewer initialization).
For /Page Dictionaries that we've already parsed, the `pageIndex` corresponding to a particular Reference is already known and we're thus able to skip *all* parsing in the `Catalog.getPageIndex` method for those cases.
Besides converting `Catalog.getPageDict` to an `async` method, thus simplifying the code, this patch also allows us to pro-actively fix an existing issue.
Note how we're looking up References in such a way that `MissingDataException`s won't cause trouble, however it's *technically possible* that the entries (i.e. /Count, /Kids, and /Type) in a /Pages Dictionary could actually be indirect objects as well. In the existing code this could lead to *some*, or even all, pages failing to load/render as intended.
In practice that doesn't *appear* to happen in real-world PDF documents, but given all the weird things that PDF software does I'd prefer to fix this pro-actively (rather than waiting for a bug report).
With `Catalog.getPageDict` being `async` this is now really simple to address, however I didn't want to introduce a bunch more *unconditional* asynchronicity in this method if it could be avoided (since that could slow things down). Hence we'll *synchronously* lookup the *raw* data in a /Pages Dictionary, and only fallback to asynchronous data lookup when a Reference was encountered.
In addition to the above, this patch also makes the following notable changes:
- Let `Catalog.getPageDict` *consistently* reject with the actual error, regardless of what data we're fetching. Previously we'd "swallow" the actual errors except when looking up Dictionary entries, which is inconsistent and thus seems unfortunate. As can be seen from the updated unit-tests this change is API-observable, hence why the patch is tagged `[api-minor]`.
- Improve the consistency of the Dictionary /Type-checks in both the `Catalog.getPageDict` and `Catalog.getAllPageDicts` methods.
In `Catalog.getPageDict` there's a fallback code-path where we're *incorrectly* checking the /Page Dictionary for a /Contents-entry, which is wrong since a /Page Dictionary doesn't need to have a /Contents-entry in order to be valid.
For consistency the `Catalog.getAllPageDicts` method is also updated to handle errors in the /Type-lookup correctly.
- Reduce the `PagesCountLimit.PAUSE_EAGER_PAGE_INIT` viewer constant, to further improve loading/rendering performance of the *second* page during initialization of very long documents; PR 14359 follow-up.
This patch circumvents the issues seen when trying to update TypeScript to version `4.5`, by "simply" fixing the broken/missing JSDocs and `typedef`s such that `gulp typestest` now passes.
As always, given that I don't really know anything about TypeScript, I cannot tell if this is a "correct" and/or proper way of doing things; we'll need TypeScript users to help out with testing!
*Please note:* I'm sorry about the size of this patch, but given how intertwined all of this unfortunately is it just didn't seem easy to split this into smaller parts.
However, one good thing about this TypeScript update is that it helped uncover a number of pre-existing bugs in our JSDocs comments.
In PR 14114 this was only added to the default viewer, which means that in the viewer components the user would need to *manually* implement /Lang handling. This was (obviously) a bad choice, since the viewer components already support e.g. structTrees by default; sorry about overlooking this!
To avoid having to make *two* `getMetadata` API-calls[1] very early during initialization, in the default viewer, the API will now cache its result. This will also come in handy elsewhere in the default viewer, e.g. by reducing parsing when opening the "document properties" dialog.
---
[1] This not only includes a round-trip to the worker-thread, but also having to re-parse the /Metadata-entry when it exists.
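A minimal sketch of caching the result of such an API-call, assuming that it's enough to keep the first Promise around and hand it out again. The `DocumentProxySketch` class and its `fetchMetadata` callback are hypothetical and only stand in for the round-trip to the worker-thread.

```javascript
// Hypothetical sketch; `fetchMetadata` stands in for the worker round-trip.
class DocumentProxySketch {
  constructor(fetchMetadata) {
    this._fetchMetadata = fetchMetadata;
    this._metadataPromise = null;
  }

  getMetadata() {
    // Only the first call triggers the (expensive) fetch; later calls
    // re-use the same Promise and thus the same parsed result.
    if (!this._metadataPromise) {
      this._metadataPromise = this._fetchMetadata();
    }
    return this._metadataPromise;
  }
}

let roundTrips = 0;
const doc = new DocumentProxySketch(() => {
  roundTrips++;
  return Promise.resolve({ info: { Title: "demo" } });
});

const first = doc.getMetadata();
const second = doc.getMetadata();
```

Here `first` and `second` are the same Promise, and the callback has only run once.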
After the changes in PR 14338, specifically in the `XRef.parse`-method, the /Pages-entry will now always have been fetched/validated when the `Catalog`-instance is created.
Hence we can directly access the /Pages-entry in `Catalog.getPageDict` and thus avoid *one* asynchronous data-lookup per page in the document. (In practice this is unlikely to show up in e.g. benchmarks, but it really cannot hurt.)
Finally, make sure that the `getPageDict`/`getAllPageDicts`-methods track the /Pages-tree reference correctly to prevent circular references in corrupt documents.
Rather than "swallowing" the actual Errors, when data fetching fails, ensure that they're always being propagated as intended to the call-site instead.
Note that we purposely handle `XRefEntryException` specially, to make it possible to fallback to indexing all XRef objects.
Rather than trying, and failing, to fetch the entire /Pages-tree for documents with corrupt XRef tables, let's fallback to indexing all objects *before* trying to invoke the `Catalog.getAllPageDicts` method.
Given that not all pages necessarily are being accessed, or that the pages may be accessed out of order, using a `Map` seems like a more appropriate data-structure here.
Finally, also change the `pagePromises` to a *private* property since it's not supposed to be accessed from the "outside".
Given that not all pages necessarily are being accessed, or that the pages may be accessed out of order, using a `Map` seems like a more appropriate data-structure here.
For one thing, this simplifies iteration since we no longer have to worry about/check if `pageCache`-entries are undefined (which will happen for *sparse* `Array`s).
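To illustrate the difference, here's a small standalone comparison (not taken from the PDF.js code; the variable names are made up) of iterating a *sparse* `Array` versus a `Map` when only one page has been cached:

```javascript
// Only the sixth page (index 5) has been accessed so far.
const sparseCache = [];
sparseCache[5] = "page 6";

// Index-based iteration visits the holes too, hence every consumer
// has to check for `undefined` entries.
const seenInArray = [];
for (let i = 0; i < sparseCache.length; i++) {
  seenInArray.push(sparseCache[i]);
}

// A Map only contains the entries that were actually added.
const mapCache = new Map();
mapCache.set(5, "page 6");
const seenInMap = [...mapCache.values()];
```

The array loop produces six values, five of which are `undefined`, while the `Map` yields just the one cached entry.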
Of particular note is that we're no longer attempting to "null" the `pageCache`-entry from within the `PDFPageProxy._destroy`-method. Given that *synchronous* JavaScript will always run to completion[1] and that we're looping through all pages in `WorkerTransport.destroy` and immediately clear the cache afterwards, that code did/does not really make a lot of sense (as far as I can tell).
Finally, also change the `pageCache` to a *private* property since it's not supposed to be accessed from the "outside".
---
[1] Unless there are errors, of course.
PR 8207 added caching to improve the performance of `Catalog.getPageDict`, by not having to repeatedly fetch the same data and also reducing the asynchronicity of that method.
However, because of *another* oversight on my part, we're only caching /Page references once we've found the correct page. As long as all pages are loaded *in order* this doesn't really matter (happens by default in the viewer), but when `disableAutoFetch` is used the pages may be fetched in a more random order (this patch reduces the asynchronicity of `Catalog.getPageDict` slightly in that case).
PR 8207 added caching to improve the performance of `Catalog.getPageDict`, by not having to repeatedly fetch the same data and also reducing the asynchronicity of that method.
However, because of annoying off-by-one errors[1] the caching became less efficient than it could/should be.[2] Note here that the /Pages-tree is zero-indexed, and that e.g. `pageIndex = 5` thus corresponds to the *sixth* page of the document.
---
[1] In particular the `currentPageIndex + count < pageIndex` part.
[2] For example, even when loading a relatively small/simple document such as `tracemonkey.pdf` in the viewer, the number of `xref.fetchAsync(currentNode)` calls is reduced from `56` to `44` with this patch.
Trying to shadow a non-existent property is always an implementation mistake, since it leads to the `shadow`-call not having any effect.
In PR 14152 I overlooked the fact that it's fairly easy to enforce this during development/testing, since that can help catch e.g. simple spelling bugs.
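A rough sketch of what such a check could look like. `shadowSketch` and `CatalogSketch` are illustrative names only, and unlike the description above (development/testing only) this simplified version performs the check unconditionally.

```javascript
// Illustrative helper: replaces a (lazy) getter with its computed value.
function shadowSketch(obj, prop, value) {
  // Shadowing a property that doesn't exist has no useful effect,
  // and usually points to a misspelled property name.
  if (!(prop in obj)) {
    throw new Error(`shadow: Property "${String(prop)}" not found in object.`);
  }
  Object.defineProperty(obj, prop, {
    value,
    enumerable: true,
    configurable: true,
    writable: false,
  });
  return value;
}

class CatalogSketch {
  get numPages() {
    return shadowSketch(this, "numPages", 42);
  }

  get metadata() {
    // Misspelled name: the check above throws rather than letting
    // the getter silently re-compute its value on every access.
    return shadowSketch(this, "metadta", {});
  }
}
```

After the first `numPages` access, the instance holds an own data property with the cached value, whereas accessing `metadata` fails immediately.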
Currently the `Catalog.metadata` getter only handles errors during parsing, however in a *corrupt* PDF document fetching of the raw /Metadata can obviously fail as well.
Without this patch the `PDFDocumentProxy.getMetadata` method, in the API, can thus fail which it *never* should and this will cause the viewer to not initialize all state as expected.
Fixes one of the documents in issue 14305.
*This patch improves handling of a couple of PDF documents from issue 14303.*
- Update `XRef.indexObjects` to actually clear *all* XRef-caches. Invalid XRef tables *usually* cause issues early enough during parsing that we've not populated the XRef-cache, however to prevent any issues we obviously need to clear that one as well.
- Improve the /Root dictionary validation in `XRef.parse` (PR 9827 follow-up). In addition to checking that a /Pages entry exists, we'll now also check that it can be successfully fetched *and* that it's of the correct type. There's really no point trying to use a /Root dictionary that e.g. `Catalog.toplevelPagesDict` will reject, and this way we'll be able to fallback to indexing the objects in corrupt documents.
- Throw an `InvalidPDFException`, rather than a general `FormatError`, in `XRef.parse` when no usable /Root dictionary could be found. That really seems more appropriate overall, since all attempts at parsing/recovery have failed. (This part of the patch is API-observable, hence the tag.)
With these changes, two existing test-cases are improved and the unit-tests are updated/re-factored to highlight that. In particular `GHOSTSCRIPT-698804-1-fuzzed.pdf` will now both load and "render" correctly, whereas `poppler-395-0-fuzzed.pdf` will now fail immediately upon loading (rather than *appearing* to work).
*Please note:* This is similar to the method that existed prior to PR 3848, but the new method will *only* be used as a fallback when parsing of corrupt PDF documents.
The implementation in PR 14311 unfortunately turned out to be *way* too simplistic, as evidenced by the recently added test-files in issue 14303, since it may *cause* infinite loops in `PDFDocument.checkLastPage` for some corrupt PDF documents.[1]
To avoid this, the easiest solution that I could come up with was to fallback to eagerly parsing the *entire* /Pages-tree when the /Count-entry validation fails during document initialization.
Fixes *at least* two of the issues listed in issue 14303, namely the `poppler-395-0.pdf...` and `GHOSTSCRIPT-698804-1.pdf...` documents.
---
[1] The whole point of PR 14311 was obviously to *get rid of* infinite loops during document initialization, not to introduce any more of those.
This was added in PR 14311, but given that I completely forgot to update the `PDFDocument.getPage` signature accordingly it's completely unused.
Given that things work just as fine as-is, let's simply remove that optional parameter for now; sorry about the churn here!
This only applies to severely corrupt documents, where it's possible that the `Parser` throws when we try to access e.g. a /Kids-entry in the /Pages-tree.
Fixes two of the issues listed in issue 14303, namely the `poppler-742-0.pdf...` and `poppler-937-0.pdf...` documents.
*Please note:* This patch on its own is sufficient to prevent the worker-thread from hanging; however, in combination with PR 14311 these PDF documents will both load *and* render correctly.
Rather than focusing on the particular structure of these PDF documents, it seemed (at least to me) to make sense to try and prevent all circular references when fetching/looking-up data using the XRef table.
To avoid a solution that required tracking the references manually everywhere, the implementation settled on here instead handles that internally in the `XRef.fetch`-method. This should work, since that method *and* the `Parser`/`Lexer`-implementations are completely synchronous.
Note also that the existing `XRef`-caching, used for all data-types *except* Streams, should hopefully help to lessen the performance impact of these changes.
One *potential* problem with these changes could be certain *browser* exceptions, since those are generally not catchable in JavaScript code, however those would most likely "stop" worker-thread parsing anyway (at least I hope so).
Finally, note that I settled on returning dummy-data rather than throwing an exception. This was done to allow parsing, for the rest of the document, to continue such that *one* bad reference doesn't prevent an entire document from loading.
Fixes two of the issues listed in issue 14303, namely the `poppler-91414-0.zip-2.gz-53.pdf` and `poppler-91414-0.zip-2.gz-54.pdf` documents.
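The general idea of guarding a synchronous fetch against circular references, and returning dummy-data rather than throwing, could be sketched like this. `XRefSketch`, its `{ ref: n }` reference encoding, and the choice of `null` as dummy-data are all assumptions made for the example; the actual `XRef.fetch` code differs.

```javascript
// Hypothetical sketch: objects are stored by number, and a value of the
// form `{ ref: n }` stands for an indirect reference to object `n`.
class XRefSketch {
  constructor(objects) {
    this.objects = objects; // Map<number, any>
    this.pending = new Set(); // object numbers currently being fetched
  }

  fetch(num) {
    if (this.pending.has(num)) {
      // Circular reference: return dummy-data rather than throwing, so
      // that one bad reference doesn't break the rest of the document.
      return null;
    }
    this.pending.add(num);
    try {
      return this.resolve(this.objects.get(num));
    } finally {
      // Safe since everything here is synchronous: the entry is always
      // removed before any unrelated fetch can start.
      this.pending.delete(num);
    }
  }

  resolve(value) {
    if (value !== null && typeof value === "object" && "ref" in value) {
      return this.fetch(value.ref);
    }
    return value;
  }
}

const xref = new XRefSketch(
  new Map([
    [1, { ref: 2 }],
    [2, { ref: 1 }], // points back at object 1
    [3, { ref: 4 }],
    [4, "ok"],
  ])
);
```

With this data `xref.fetch(1)` terminates and yields the dummy value, while `xref.fetch(3)` resolves normally.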
*This patch basically extends the approach from PR 10392, by also checking the last page.*
Currently, in e.g. the `Catalog.numPages`-getter, we're simply assuming that if the /Pages-tree has an *integer* /Count entry it must also be correct/valid.
As can be seen in the referenced PDF documents, that entry may be completely bogus which causes general parsing to break down elsewhere in the worker-thread (and hangs the browser).
Rather than hoping that the /Count entry is correct, similar to all other data found in PDF documents, we obviously need to validate it. This turns out to be a little less straightforward than one would like, since the only way to do this (as far as I know) is to parse the *entire* /Pages-tree and essentially count the pages.
To avoid doing that for all documents, this patch tries to take a short-cut by checking if the last page (based on the /Count entry) can be successfully fetched. If so, we assume that the /Count entry is correct and use it as-is, otherwise we'll iterate through (potentially) the *entire* /Pages-tree to determine the number of pages.
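As a simplified sketch of that short-cut, assuming a made-up in-memory tree of `{ type: "Pages", count, kids }` and `{ type: "Page" }` nodes (not the PDF.js data structures), and walking the leaves linearly instead of using intermediate /Count entries to skip sub-trees:

```javascript
// Yields the leaf pages in document order; the `visited` set guards
// against circular /Kids entries in corrupt trees.
function* iteratePages(root) {
  const stack = [root];
  const visited = new Set();
  while (stack.length > 0) {
    const node = stack.pop();
    if (visited.has(node)) {
      continue;
    }
    visited.add(node);
    if (node.type === "Page") {
      yield node;
      continue;
    }
    const kids = Array.isArray(node.kids) ? node.kids : [];
    // Push in reverse, so that the kids are popped in document order.
    for (let i = kids.length - 1; i >= 0; i--) {
      stack.push(kids[i]);
    }
  }
}

// Stops walking as soon as the requested page has been reached.
function canFetchPage(root, pageIndex) {
  let index = 0;
  for (const _page of iteratePages(root)) {
    if (index++ === pageIndex) {
      return true;
    }
  }
  return false;
}

function resolveNumPages(root) {
  const claimed = root.count;
  // Short-cut: trust the claimed count if the last page can be fetched.
  if (Number.isInteger(claimed) && claimed > 0 && canFetchPage(root, claimed - 1)) {
    return claimed;
  }
  // Fallback: count the pages by walking the entire tree.
  let total = 0;
  for (const _page of iteratePages(root)) {
    total++;
  }
  return total;
}

// A three-page tree whose top-level count is supplied by the caller.
function makeTree(count) {
  return {
    type: "Pages",
    count,
    kids: [
      { type: "Pages", kids: [{ type: "Page" }, { type: "Page" }] },
      { type: "Page" },
    ],
  };
}
```

Note that, just as described above, a count that's too *small* is still accepted by the short-cut, since the "last" page according to that count can be fetched.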
Unfortunately these changes will have a number of *somewhat* negative side-effects, please see a possibly incomplete list below, however I cannot see a better way to address this bug.
- This will slow down initial loading/rendering of all documents, at least by some amount, since we now need to fetch/parse more of the /Pages-tree in order to be able to access the *last* page of the PDF documents.
- For poorly generated PDF documents, where the entire /Pages-tree only has *one* level, we'll unfortunately need to fetch/parse the *entire* /Pages-tree to get to the last page. While there's a cache to help reduce repeated data lookups, this will affect initial loading/rendering of *some* long PDF documents.
- This will affect the `disableAutoFetch = true` mode negatively, since we now need to fetch/parse more data during document initialization. While the `disableAutoFetch = true` mode should still be helpful in larger/longer PDF documents, for smaller ones the effect/usefulness may unfortunately be lost.
As one *small* additional bonus, we should now also be able to support opening PDF documents where the /Pages-tree /Count entry is completely invalid (e.g. contains a non-integer value).
Fixes two of the issues listed in issue 14303, namely the `poppler-67295-0.pdf` and `poppler-85140-0.pdf` documents.
Given that not all pages necessarily are being accessed, or that the pages may be accessed out of order, using a `Map` seems like a more appropriate data-structure here.
Furthermore, this patch also adds (currently missing) caching for XFA-documents. Loading a couple of such documents in the viewer, with logging added, shows that we're currently re-creating `Page`-instances unnecessarily for XFA-documents.
For this particular PDF document, we have `/W [1 2 166666666666666666666666666]` which obviously makes no sense.
While this patch makes no attempt at actually validating the entries in the /W-array, we'll now simply abort all processing when the end of the PDF document has been reached (thus preventing hanging the browser).
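To illustrate why a bogus width would otherwise be a problem, here's a hypothetical reader for fixed-width entries that stops once the end of the data has been reached. The function name, the entry layout, and the use of `null` as an end-of-data marker are assumptions for this sketch and don't mirror the actual XRef-stream parsing code.

```javascript
// Reads up to `entryCount` entries, each consisting of three big-endian
// fields whose byte widths are given by `widths` (compare the /W array).
function readXRefStreamEntries(bytes, widths, entryCount) {
  const [typeWidth, offsetWidth, generationWidth] = widths;
  const entries = [];
  let pos = 0;

  const readField = width => {
    let value = 0;
    for (let i = 0; i < width; i++) {
      if (pos >= bytes.length) {
        return null; // ran past the end of the data
      }
      value = value * 256 + bytes[pos++];
    }
    return value;
  };

  for (let n = 0; n < entryCount; n++) {
    const type = readField(typeWidth);
    const offset = readField(offsetWidth);
    const generation = readField(generationWidth);
    if (type === null || offset === null || generation === null) {
      break; // abort all processing rather than looping "forever"
    }
    entries.push({ type, offset, generation });
  }
  return entries;
}

const data = new Uint8Array([1, 0, 10, 0, 1, 0, 20, 0]);
const sane = readXRefStreamEntries(data, [1, 2, 1], 2);
const bogus = readXRefStreamEntries(data, [1, 2, 166666666666666666666666666], 2);
```

Without the end-of-data check, the inner loop would attempt to run for the full bogus width; with it, the second call returns right away with no entries.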
Please note that this patch doesn't enable the PDF document to be loaded/rendered, but at least it fails "correctly" now.
Fixes one of the issues listed in issue 14303, namely the `REDHAT-1531897-0.pdf` document.
This bug was surprisingly difficult to track down, since it didn't just depend on range-requests being used but also on how quickly the document was loaded. To even be able to reproduce this locally, I had to use a very small `rangeChunkSize`-value (note the unit-test).
The cause of this bug is a bogus entry in the XRef-table, causing us to attempt to request data from *beyond* the actual document size and thus getting into an infinite loop.
Fixes *one* of the issues listed in issue 14303, namely the `PDFBOX-4352-0.pdf` document.
*Please note:* These changes will primarily benefit longer documents, somewhat at the expense of e.g. one-page documents.
The existing `PDFDocumentProxy.getStats` function requires a round-trip to the worker-thread in order to obtain the current document stats. In the default viewer, we currently make one such API-call for *every rendered* page.
This patch proposes replacing that method with a *synchronous* `PDFDocumentProxy.stats` getter instead, combined with re-factoring the worker-thread code by adding a `DocStats`-class to track Stream/Font-types and *only send* them to the main-thread *the first time* that a type is encountered.
Note that in practice most PDF documents only use a fairly limited number of Stream/Font-types, which means that in longer documents most of the `PDFDocumentProxy.getStats`-calls will return the same data.[1]
This re-factoring will obviously benefit longer documents the most[2], and could actually be seen as a regression for one-page documents, since in practice there'll usually be a couple of "DocStats" messages sent during the parsing of the first page. However, if the user zooms/rotates the document (which causes re-rendering), note that even a one-page document would start to benefit from these changes.
Another benefit of having the data available/cached in the API is that unless the document stats change during parsing, repeated `PDFDocumentProxy.stats`-calls will return *the same identical* object.
This is something that we can easily take advantage of in the default viewer, by now *only* reporting "documentStats" telemetry[3] when the data actually have changed rather than once per rendered page (again beneficial in longer documents).
---
[1] Furthermore, the maximum number of `StreamType`/`FontType` are `10` and `12` respectively, which means that regardless of the complexity and page count in a PDF document there'll never be more than twenty-two "DocStats" messages sent; see 41ac3f0c07/src/shared/util.js (L206-L232)
[2] One example is the `pdf.pdf` document in the test-suite, where rendering all of its 1310 pages only results in a total of seven "DocStats" messages being sent from the worker-thread.
[3] Reporting telemetry, in Firefox, includes using `JSON.stringify` on the data and then sending an event to the `PdfStreamConverter.jsm`-code.
In that code the event is handled and `JSON.parse` is used to retrieve the data, and in the "documentStats"-case we'll then iterate through the data to avoid double-reporting telemetry; see https://searchfox.org/mozilla-central/rev/8f4c180b87e52f3345ef8a3432d6e54bd1eb18dc/toolkit/components/pdfjs/content/PdfStreamConverter.jsm#515-549
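The "only send a type the first time it's encountered" approach can be sketched roughly like this. `DocStatsSketch`, its method names, and the message payload shape are assumptions for illustration, and aren't copied from the actual `DocStats`-class.

```javascript
// Hypothetical sketch; `send` stands in for worker-to-main-thread messaging.
class DocStatsSketch {
  constructor(send) {
    this._send = send;
    this._streamTypes = new Set();
    this._fontTypes = new Set();
  }

  _update(set, type) {
    if (set.has(type)) {
      return; // already reported, hence no new message is needed
    }
    set.add(type);
    this._send("DocStats", {
      streamTypes: [...this._streamTypes],
      fontTypes: [...this._fontTypes],
    });
  }

  addStreamType(type) {
    this._update(this._streamTypes, type);
  }

  addFontType(type) {
    this._update(this._fontTypes, type);
  }
}

const messages = [];
const stats = new DocStatsSketch((name, data) => messages.push({ name, data }));

// Simulate a long document where every page uses the same types...
for (let page = 0; page < 1000; page++) {
  stats.addStreamType("FLATE");
  stats.addFontType("TRUETYPE");
}
// ...and where one additional Stream-type shows up late.
stats.addStreamType("DCT");
```

Despite the two thousand and one `add*` calls only three messages are sent, one per newly encountered type.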
Given that all modern browsers now support `postMessage` transfers, and have for years, it no longer seems necessary for the PDF.js library to support using Workers unless the `postMessage` transfers functionality is available.
This patch is a follow-up to PR 11123, which made it impossible to *manually* disable `postMessage` transfers for performance reasons (since it increases memory usage), which hasn't caused any bug reports as far as I know.[1]
Hence we'll now only support *proper* Worker implementations, with fully working `postMessage` transfers, and fallback to using "fake" Workers otherwise.
---
[1] At the time of that PR we still "supported" IE, which is why this code was left intact.
There's obviously no guarantee that this will work in general, if the document is sufficiently corrupt, but it should hopefully be better than just throwing `InvalidPDFException` as currently happens.
Please note that, as is often the case with corrupt documents, it's somewhat difficult to know if we're rendering the document "correctly" with this patch[1]. In this case even Adobe Reader cannot open the document, which is always a good sign that it's *really* corrupt, however we're at least able to render *something* with this patch.
---
[1] Whatever "correct" even means when dealing with corrupt PDF documents, where often times different PDF viewers won't agree completely.
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=931481;
- real space chars are pushed in the chunk but when there is an extra spacing, the next char position must be compared with the previous one;
- for example, an extra spacing can cancel a space so visually there is no space.
- First step to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1737260;
- several interactive pdfs use the possibility to hide/show buttons to show different icons;
- render pushbuttons on their own canvas and then insert it in the annotation_layer;
- update test/driver.js in order to convert canvases for pushbuttons into images.
Previously, when we created a shading pattern canvas we created it
as the same size as the page. This was good for caching if the same
pattern was used over and over again, but when lots of different
shadings are created that caused us to create many full page
canvases.
Instead of creating the full page canvases, create the canvas
as the same size as the current path bounding box. This reduces memory
consumption by a lot since most paths are pretty small. Also, in real world
PDFs it's rare for a shading (non shading fill) to be reused over and over again.
Bug 1721949 is an example where the same pattern is reused and it will be slightly
slower than before.
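As a back-of-the-envelope illustration of the memory difference, assuming 4 bytes per pixel (RGBA) for the canvas backing store and made-up page/path dimensions; the helper names are hypothetical and not part of the canvas code:

```javascript
// Approximate backing-store size of a canvas, assuming RGBA pixels.
function canvasBytes(width, height) {
  return width * height * 4;
}

// Size and offset of a canvas that only covers a [x0, y0, x1, y1] box.
function bboxCanvasSize([x0, y0, x1, y1]) {
  return {
    width: Math.max(1, Math.ceil(x1 - x0)),
    height: Math.max(1, Math.ceil(y1 - y0)),
    offsetX: x0,
    offsetY: y0,
  };
}

const pageSize = { width: 816, height: 1056 }; // made-up page, in pixels
const pathBBox = [100, 200, 180, 240]; // made-up path bounding box

const patternSize = bboxCanvasSize(pathBBox);
const fullPageBytes = canvasBytes(pageSize.width, pageSize.height);
const bboxBytes = canvasBytes(patternSize.width, patternSize.height);
```

For these numbers the full page canvas needs 3,446,784 bytes whereas the bounding box canvas needs 12,800 bytes, and that saving applies to every distinct shading.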
In `beginGroup` we create a new canvas that is the size of the
bounding box and we translate it to the offset. This means we don't need to
also apply the bounding box during `paintFormXObjectBegin`.
This improves #6961 quite a bit, but it still is missing the indentation
in the ruler.
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1739502;
- when the target area was the current content area, everything was pushed in it instead of creating a new one (and consequently a new pageArea is created).
- the pdf shows an alignment issue on page 4:
- the hAlign is "center" but the subform was the width of its parent, so compute the real width of the subform with tb layout;
- there is an extra empty page at the end of the pdf:
- there is a subform with some hidden elements which are not rendered for now (since there is no plugged JS engine it isn't possible to draw them by changing their visibility).
- so in case a subform is empty and has no real dimensions (at least one is 0), we just consider it as empty.
We were incorrectly using the transform in the pattern before it had been
adjusted causing the pattern to be misplaced relative to the page.
Fixes: ShowText-ShadingPattern.pdf (already in corpus)
Fixes: #8111
Fixes: #9243
Subform nomin displays even though its subform is set to <occur max=-1 min=0>
If we look through the XFA 3.3 specs: https://www.pdfa.org/norm-refs/XFA-3_3.pdf
- The min attribute is used when processing a form that contains data. Regardless of the data at least this number of instances is included. It is permissible to set this value to zero, in which case the container is entirely excluded if there is no data for it.
However, that doesn't happen in our case, because we let our empty dataNode get through. By adding a clause:
- eliminate unmatched data with occur min=0
we now check for empty data and send it to the uselessNode array, from which it gets removed at the end.
Very short strings can narrowly miss the existing Bidi-detection threshold, leading to incorrect text-selection and copying behaviour.
In my testing, neither Adobe Reader nor PDFium seems to handle copying "correctly" for this document. Hence it's not entirely clear to me that we actually want to fix this, since tweaking these heuristics can *obviously* cause regressions elsewhere (and our test coverage for RTL-text isn't exactly great).
Starting a new path will wipe out any of the current subpaths in the
current graphics state, so we should reset the min/maxes.
This makes a number of the bounding boxes smaller and reduces the number
of composed pixels. For the smask tests in the corpus, the number of
composed pixels goes from 19,872,109 to 19,676,905. The difference is much
larger on other PDFs though.
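A small sketch of the min/max tracking, assuming a hypothetical `PathBoundsSketch` helper (the real canvas code tracks these values as part of its graphics state):

```javascript
// Hypothetical helper that tracks the bounding box of the current path.
class PathBoundsSketch {
  constructor() {
    this.reset();
  }

  reset() {
    this.minX = Infinity;
    this.minY = Infinity;
    this.maxX = -Infinity;
    this.maxY = -Infinity;
  }

  // A new path wipes out the current subpaths, hence the previously
  // tracked min/max values no longer apply.
  beginPath() {
    this.reset();
  }

  addPoint(x, y) {
    this.minX = Math.min(this.minX, x);
    this.minY = Math.min(this.minY, y);
    this.maxX = Math.max(this.maxX, x);
    this.maxY = Math.max(this.maxY, y);
  }

  get bbox() {
    return [this.minX, this.minY, this.maxX, this.maxY];
  }
}

const bounds = new PathBoundsSketch();
bounds.addPoint(0, 0);
bounds.addPoint(100, 100); // first, large path

bounds.beginPath();
bounds.addPoint(10, 10);
bounds.addPoint(20, 30); // second, small path
```

Thanks to the reset, the bounding box afterwards only covers the second path, i.e. `[10, 10, 20, 30]` rather than `[0, 0, 100, 100]`.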