Note first of all how the `PDFDocumentProxy.getJSActions` method in the API caches the result, which makes repeated lookups cheap enough to not really be an issue.
Secondly, with the previous patch, we're now only dispatching "pageopen"/"pageclose"-events when there's actually a sandbox that listens for them.
All-in-all, with these changes we can thus simplify the default-viewer "pageopen"-event handler a fair bit.
There's built-in ESLint rule, see `sort-imports`, to ensure that all `import`-statements are sorted alphabetically, since that often helps with readability.
Unfortunately there's no corresponding rule to sort `export`-statements alphabetically, however there's an ESLint plugin which does this; please see https://www.npmjs.com/package/eslint-plugin-sort-exports
The only downside here is that it's not automatically fixable, but the re-ordering is a one-time "cost" and the plugin will help maintain a *consistent* ordering of `export`-statements in the future.
*Note:* To reduce the possibility of introducing any errors here, the re-ordering was done by simply selecting the relevant lines and then using the built-in sort-functionality of my editor.
Given that the API will now, after PR 12039, automatically pick the correct factories to use depending on the environment (browser vs. Node.js), we can utilize that in the unit-tests as well. This way we don't have to manually repeat the same initialization code in *multiple* unit-tests.
*Note:* The *official* PDF.js API is defined in `src/pdf.js`, hence the new exports in `src/display/api.js` will not affect that.
Also, updates the unit-test `FileReaderFactory` helpers similarily.
*Drive-by change:* Fix the `CMapReaderFactory` usage in the annotation unit-tests, since the cache should only contain raw data and not a Promise. While this obviously works as-is, having unit-tests that "abuse" the intended data format can easily lead to unnecessary failures if changes are made to the relevant `src/core/` code.
This will, in a very simple way using the existing events, thus allow the viewer to remove the "beforeunload" `window` event listener when the document is closed.
Generally speaking we want to avoid having *global* event listeners for the PDF document instance, which is why the `EventBus` exists, and instead reserve global events for the viewer itself. However, the `AnnotationStorage` "beforeunload" event unfortunately needs to be document-specific and we should thus ensure that it's correctly removed when the document is destroyed.
* the goal is to execute actions like Open or OpenAction
* can be tested with issue6106.pdf (auto-print)
* once #12701 is merged, we can add page actions
Given that we already include the "Content-Disposition"-header filename, when it exists, it shouldn't hurt to also include the information from the "Content-Length"-header.
For PDF documents opened via a URL, which should be a very common way for the PDF.js library to be used, this will[1] thus provide a way of getting the PDF filesize without having to wait for the `getDownloadInfo`-promise to resolve[2].
With these API improvements, we can also simplify the filesize handling in the `PDFDocumentProperties` class.
---
[1] Assuming that the server is correctly configured, of course.
[2] Since that's not *guaranteed* to happen in general, with e.g. `disableAutoFetch = true` set.
- Add support for logical assignment operators, i.e. `&&=`, `||=`, and `??=`, with a Babel-plugin. Given that these required incrementing the ECMAScript version in the ESLint and Acorn configurations, and that platform/browser support is still fairly limited, always transpiling them seems appropriate for now.
- Cache the `hasJSActions` promise in the API, similar to the existing `getAnnotations` caching. With this implemented, the lookup should now be cheap enough that it can be called unconditionally in the viewer.
- Slightly improve cleanup of resources when destroying the `WorkerTransport`.
- Remove the `annotationStorage`-property from the `PDFPageView` constructor, since it's not necessary and also brings it more inline with the `BaseViewer`.
- Update the `BaseViewer.createAnnotationLayerBuilder` method to actaually agree with the `IPDFAnnotationLayerFactory` interface.[1]
- Slightly tweak a couple of JSDoc comments.
---
[1] We probably ought to re-factor both the `IPDFTextLayerFactory` and `IPDFAnnotationLayerFactory` interfaces to take parameter objects instead, since especially the `IPDFAnnotationLayerFactory` one is becoming quite unwieldy. Given that that would likely be a breaking change for any custom viewer-components implementation, this probably requires careful deprecation.
* When no actions then set it to null instead of empty object
* Even if a field has no actions, it needs to listen to events from the sandbox in order to be updated if an action changes something in it.
By using optional chaining, see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Optional_chaining, it's possible to reduce unnecessary code-repetition in many cases.
Note that these changes also reduce the size of the *built* `pdf.js` file, when `SKIP_BABEL == true` is set, and for the `MOZCENTRAL` build-target that result in a `0.1%` filesize reduction from a simple and mostly mechanical code change.
Since we no longer use SystemJS to load the unit-tests, there's now nothing that prevents us from using optional chaining and nullish coalescing in the `src/display/` directory.
*This patch is based on a couple of smaller things that I noticed when working on PR 12479.*
- Don't store the /Fields on the `formInfo` getter, since that feels like overloading it with unintended (and too complex) data, and utilize a `hasFields` boolean instead.
This functionality was originally added in PR 12271, to help determine what kind of form data a PDF document contains, and I think that we should ensure that the return value of `formInfo` only consists of "simple" data.
With these changes the `fieldObjects` getter instead has to look-up the /Fields manually, however that shouldn't be a problem since the access is guarded by a `formInfo.hasFields` check which ensures that the data both exists and is valid. Furthermore, most documents doesn't even have any /AcroForm data anyway.
- Determine the `hasFields` property *first*, to ensure that it's always correct even if there's errors when checking e.g. the /XFA or /SigFlags entires, since the `fieldObjects` getter depends on it.
- Simplify a loop in `fieldObjects`, since the object being accessed is a `Map` and those have built-in iteration support.
- Use a higher logging level for errors in the `formInfo` getter, and include the actual error message, since that'd have helped with fixing PR 12479 a lot quicker.
- Update the JSDoc comment in `src/display/api.js` to list the return values correctly, and also slightly extend/improve the description.
Previously this rule has been enabled in the `web/` folder, and in select files in the `src/` sub-folders.
Note that a number of the files in the `src/display/` folder were already enforcing the `no-var` rule, and thanks to Prettier the necessary re-writing will be (mostly) handled automatically.
Please find additional details about the ESLint rule at https://eslint.org/docs/rules/no-var
The `/Order` array is used to improve the display of Optional Content groups in PDF viewers, and it allows a PDF document to e.g. specify that Optional Content groups should be displayed as a (collapsable) tree-structure rather than as just a list.
Note that not all available Optional Content groups must be present in the `/Order` array, and PDF viewers will often (by default) hide those toggles in the UI.
To allow us to improve the UX around toggling of Optional Content groups, in the default viewer, these hidden-by-default groups are thus appended to the parsed `/Order` array under a *custom* nesting level (with `name == null`).
Finally, the patch also slightly tweaks an `OptionalContentConfig` related JSDoc-comment in the API.
Obviously it doesn't make sense to call that method without providing an `AnnotationStorage`-instance, however we should ensure that doing so won't cause errors.
Hence we need to check that `annotationStorage` is actually defined, before attempting to call its `resetModified` method.
Related to https://bugzilla.mozilla.org/show_bug.cgi?id=1659753
This allows Firefox trigger a "save" event from ctrl/cmd+s or the "Save
Page As" context menu, which in turn lets pdf.js generate a new PDF if
there is form data to save.
I also now use `sourceEventType` on downloads so Firefox can determine if
it should launch the "open with" dialog or "save as" dialog.
- Initialize the `AnnotationStorage`-instance, on `PDFDocumentProxy`, lazily.
- Change the `AnnotationStorage` to use a `Map` internally, rather than a regular Object (simplifies the following points).
- Let `AnnotationStorage.getAll` return `null` when there's no data stored, to avoid unnecessary parsing on the worker-thread. This ought to "just work", since the worker-thread code *should* already handle the `!annotationStorage` case everywhere.
- Add a new `AnnotationStorage.size` getter, to be able to easily tell if there's any data stored.
While the parameter name (clearly) suggests that an `AnnotationStorage`-instance is expected, looking at the only call-sites that include the parameter (i.e. the `PDFPrintServiceFactory` instances) it actually contains just a normal Object.
Hence it seems much more reasonable to actually pass a valid `AnnotationStorage`-instance, as the name suggests, and simply have `PDFPageProxy.render` do the `annotationStorage.getAll()` call. (Since we cannot send an `AnnotationStorage`-instance as-is to the worker-thread, given the "structured clone algorithm".)
Over time we used multiple different formats for JSDoc comments. This
commit standardizes those formats to the one we used most often.
Moreover, this removes the example in the outline endpoint documentation
since it now has a proper type definition and it didn't render correctly
in JSDoc.
This commit:
- formats the documentation block according to the standards;
- replaces the callback definitions with the `function` type (we have
that for other definitions already and the callback type was not
rendered correctly by JSDoc);
- synchronizes the type documentation and the class documentation;
- fixes the documentation by making it easier to read and making sure
that all optional properties are indicated as such;
- uses the `@link` tag to indicate links to other code.
The `typestest` still passes and JSDoc now renders this class correctly.
Add a new method to the API to get the optional content configuration. Add
a new render task param that accepts the above configuration.
For now, the optional content is not controllable by the user in
the viewer, but renders with the default configuration in the PDF.
All of the test files added exhibit different uses of optional content.
Fixes#269.
Fix test to work with optional content.
- Change the stopAtErrors test to ensure the operator list has something,
instead of asserting the exact number of operators.
These errors can/will occur if data is still loading when the document is destroyed, which is the case in the API unit-tests that load the `tracemonkey.pdf` file.
While this patch prevents these kind of problems, and thus allows us to update Jasmine again, I cannot help but thinking that it's slightly "hacky". Basically, we'll simply catch and ignore (some) rejected promises once the document is destroyed and/or its data loading is aborted. However, I don't *think* that these changes should cause issues in general, since we don't really care about errors once document destruction has started (note e.g. the fair number of `catch` handlers ignoring `AbortException`s already).
This PR adds typescript definitions from the JSDoc already present.
It adds a new gulp-target 'types' that calls 'tsc', the typescript
compiler, to create the definitions.
To use the definitions, users can simply do the following:
```
import {getDocument, GlobalWorkerOptions} from "pdfjs-dist";
import pdfjsWorker from "pdfjs-dist/build/pdf.worker.entry";
GlobalWorkerOptions.workerSrc = pdfjsWorker;
const pdf = await getDocument("file:///some.pdf").promise;
```
Co-authored-by: @oBusk
Co-authored-by: @tamuratak
There's quite frankly no particular reason to special-case Type3-fonts with image resources, which are very rare anyway, now that we have a general mechanism for sending/receiving images globally.
This moves, and slightly simplifies, code that's currently residing in the unit-test utils into the actual library, such that it's bundled with `GENERIC`-builds and used in e.g. the API-code.
As an added bonus, this also brings out-of-the-box support for CMaps in e.g. the Node.js examples.
As can be seen in the code there's a handful of places where this structure needs to be iterated, something that becomes cumbersome when dealing with `Object`s. Hence, by changing this to a `Map` instead we can both simplify the code and avoid creating unnecessary closures.
Particularily the `PDFPageProxy._tryCleanup` method becomes a lot more readable, at least in my opinion.
Finally, since this property is intended to be "private" the name is adjusted to reflect that.
This patch aims to simplify the `PDFPageProxy._destroy` method, by:
- Replacing the unnecessary `forEach` with a "regular" `for`-loop instead.
- Use a more appropriate variable name, since `intentState.renderTasks` contain instances of `InternalRenderTask`.
- Move the "is rendering completed"-handling to a new `InternalRenderTask.completed` getter, to abstract away some (mostly) internal `InternalRenderTask` state.
With the changes in previous patches, the `disableCreateObjectURL` option/functionality is no longer used for anything in the API and/or in the Worker code.
Note however that there's some functionality, mainly related to file loading/downloading, in the GENERIC version of the default viewer which still depends on this option.
Hence the `disableCreateObjectURL` option (and related compatibility code) is moved into the viewer, see e.g. `web/app_options.js`, such that it's still available in the default viewer.
Currently some JPEG images are decoded by the built-in PDF.js decoder in `src/core/jpg.js`, while others attempt to use the browser JPEG decoder. This inconsistency seem unfortunate for a number of reasons:
- It adds, compared to the other image formats supported in the PDF specification, a fair amount of code/complexity to the image handling in the PDF.js library.
- The PDF specification support JPEG images with features, e.g. certain ColorSpaces, that browsers are unable to decode natively. Hence, determining if a JPEG image is possible to decode natively in the browser require a non-trivial amount of parsing. In particular, we're parsing (part of) the raw JPEG data to extract certain marker data and we also need to parse the ColorSpace for the JPEG image.
- While some JPEG images may, for all intents and purposes, appear to be natively supported there's still cases where the browser may fail to decode some JPEG images. In order to support those cases, we've had to implement a fallback to the PDF.js JPEG decoder if there's any issues during the native decoding. This also means that it's no longer possible to simply send the JPEG image to the main-thread and continue parsing, but you now need to actually wait for the main-thread to indicate success/failure first.
In practice this means that there's a code-path where the worker-thread is forced to wait for the main-thread, while the reverse should *always* be the case.
- The native decoding, for anything except the *simplest* of JPEG images, result in increased peak memory usage because there's a handful of short-lived copies of the JPEG data (see PR 11707).
Furthermore this also leads to data being *parsed* on the main-thread, rather than the worker-thread, which you usually want to avoid for e.g. performance and UI-reponsiveness reasons.
- Not all environments, e.g. Node.js, fully support native JPEG decoding. This has, historically, lead to some issues and support requests.
- Different browsers may use different JPEG decoders, possibly leading to images being rendered slightly differently depending on the platform/browser where the PDF.js library is used.
Originally the implementation in `src/core/jpg.js` were unable to handle all of the JPEG images in the test-suite, but over the last couple of years I've fixed (hopefully) all of those issues.
At this point in time, there's two kinds of failure with this patch:
- Changes which are basically imperceivable to the naked eye, where some pixels in the images are essentially off-by-one (in all components), which could probably be attributed to things such as different rounding behaviour in the browser/PDF.js JPEG decoder.
This type of "failure" accounts for the *vast* majority of the total number of changes in the reference tests.
- Changes where the JPEG images now looks *ever so slightly* blurrier than with the native browser decoder. For quite some time I've just assumed that this pointed to a general deficiency in the `src/core/jpg.js` implementation, however I've discovered when comparing two viewers side-by-side that the differences vanish at higher zoom levels (usually around 200% is enough).
Basically if you disable [this downscaling in canvas.js](8fb82e939c/src/display/canvas.js (L2356-L2395)), which is what happens when zooming in, the differences simply vanish!
Hence I'm pretty satisfied that there's no significant problems with the `src/core/jpg.js` implementation, and the problems are rather tied to the general quality of the downscaling algorithm used. It could even be seen as a positive that *all* images now share the same downscaling behaviour, since this actually fixes one old bug; see issue 7041.
Currently image resources, as opposed to e.g. font resources, are handled exclusively on a page-specific basis. Generally speaking this makes sense, since pages are separate from each other, however there's PDF documents where many (or even all) pages actually references exactly the same image resources (through the XRef table). Hence, in some cases, we're decoding the *same* images over and over for every page which is obviously slow and wasting both CPU and memory resources better used elsewhere.[1]
Obviously we cannot simply treat all image resources as-if they're used throughout the entire PDF document, since that would end up increasing memory usage too much.[2]
However, by introducing a `GlobalImageCache` in the worker we can track image resources that appear on more than one page. Hence we can switch image resources from being page-specific to being document-specific, once the image resource has been seen on more than a certain number of pages.
In many cases, such as e.g. the referenced issue, this patch will thus lead to reduced memory usage for image resources. Scrolling through all pages of the document, there's now only a few main-thread copies of the same image data, as opposed to one for each rendered page (i.e. there could theoretically be *twenty* copies of the image data).
While this obviously benefit both CPU and memory usage in this case, for *very* large image data this patch *may* possibly increase persistent main-thread memory usage a tiny bit. Thus to avoid negatively affecting memory usage too much in general, particularly on the main-thread, the `GlobalImageCache` will *only* cache a certain number of image resources at the document level and simply fallback to the default behaviour.
Unfortunately the asynchronous nature of the code, with ranged/streamed loading of data, actually makes all of this much more complicated than if all data could be assumed to be immediately available.[3]
*Please note:* The patch will lead to *small* movement in some existing test-cases, since we're now using the built-in PDF.js JPEG decoder more. This was done in order to simplify the overall implementation, especially on the main-thread, by limiting it to only the `OPS.paintImageXObject` operator.
---
[1] There's e.g. PDF documents that use the same image as background on all pages.
[2] Given that data stored in the `commonObjs`, on the main-thread, are only cleared manually through `PDFDocumentProxy.cleanup`. This as opposed to data stored in the `objs` of each page, which is automatically removed when the page is cleaned-up e.g. by being evicted from the cache in the default viewer.
[3] If the latter case were true, we could simply check for repeat images *before* parsing started and thus avoid handling *any* duplicate image resources.