A number of methods have their Promises cached, to avoid repeated worker round-trips, since they're expected to be called more than once from the default viewer. The way that the caching is currently implemented means that we need to remember to manually clear these Promises on document cleanup/destruction, and it'd be nice to avoid that.
With this patch the relevant Promises are now instead placed in just one `Map`, which is easy to clear, and a new helper method is also introduced to reduce duplication for *simple* `WorkerTransport` methods.
The only parameter that we actually need here is the `PDFDataRangeTransport`-instance, since the others are not necessary.
- The `url` parameter, as passed to the `getDocument` function in the API, is simply being ignored; see 2d87a2eb1c/src/display/api.js (L447-L458)
- The `length` parameter, as passed to the `getDocument` function in the API, is always being overwritten; see 2d87a2eb1c/src/display/api.js (L519-L525)
Until PR 12563 is deemed safe to land, I'd still like to be able to use worker-modules in the viewer during local development.
Hence this patch which *temporarily* adds a new `workerModules` hash-parameter, only available in non-PRODUCTION mode, that allows using worker-modules in the development viewer.
To enable this functionality, simply use http://localhost:8888/web/viewer.html#workerModules=true
The initial CMap support was added in PR 4259 using the "raw" Adobe files, however they were quickly deemed to be unnecessarily large. As a result PR 4470 introduced the more compact "binary" CMap format, with both of those PRs being included in the very same release (version `0.8.1334`) .
Please note that we've thus never shipped anything *except* the "binary" CMap files with the PDF library, and furthermore note that we've not even once updated the CMap files since they were originally added almost nine years ago.
Requiring users to remember that `cMapPacked = true` is necessary, in addition to setting the `cMapUrl` parameter, in order for CMap loading to work feels like a less than ideal API.
Hence this patch, which suggests that we simply let `cMapPacked` default to `true` now.
Until just recently the only existing `Path2D` polyfill didn't have support for Node.js and/or the `node-canvas` package. Given that this was just fixed, in the latest version, we can now finally remove our inline-checks at the relevant call-sites; please also see https://github.com/nilzona/path2d-polyfill#usage-with-node-canvas
In PR #15757, a value is automatically converted into a number when it's possible
but the case of numbers like "000123" has been overlooked and their format must
be preserved.
When a script is doing something like "foo.value + bar.value" and the values are
numbers then "foo.value" must return a number but the displayed value must be what
the user entered or what a script set, so this patch is just adding a a field
_orginalValue in order to track the value has it has defined.
Some people are used to use a comma as decimal separator, hence it must be considered
when a value is parsed into a number.
This patch is fixing a regression introduced by #15757.
There's really no need for these "complicated" default value assignments, since `GlobalWorkerOptions` is a local variable at this point, and this is rather a case of too much copy-and-paste.
Note that years ago, when all options were set using a global `PDFJS` object, it's possible that options had been set (from the outside) *before* the object had been properly initialized; see e.g. a89071bdef/src/display/global.js
In general it's always recommended to pass a *parameter object* when calling the `getDocument`-function in the API, since that's the only way to provide additional options, and the fact that it also accepts a URL or TypedArray directly is now mostly for backwards compatibility reasons.
Unfortunately we cannot really remove this, since that code has existed since "forever", however we can limit it to only the GENERIC build to avoid completely unnecessary checks in e.g. the Firefox PDF Viewer.
Finally, note that the default-viewer always provides a *parameter object* when calling the `getDocument`-function and it's thus completely unaffected by these changes.
- Use a `URL`-instance directly, since it's by definition an absolute URL.
- Actually limit the "raw" url-string handling to Node.js environments, as intended.
- Skip the warning, since we're already throwing an Error if the `url`-parameter is invalid.
*Please note:* I cannot reproduce the problem reported in bug 1811668, regarding the context menu, and in any case it's not clear that that part is even a PDF Viewer bug.
Looking at bug 1811668 I couldn't help but noticing that the textLayer isn't correct, and it's unfortunately once again a problem with the `adjustType1ToUnicode` function. That's intended to help improve text-selection for fonts without a /ToUnicode-entry, and in many cases it does help (the original PR fixed lots of issues) however it's also caused some problems.
In order to improve text-selection in bug 1811668, we'll now properly ignore fonts that have a predefined *named* encoding specified since that's really the intention with PR 14050.
The JBIG2 images in this PDF document are corrupt enough that even Adobe Reader warns about it when opening the file.
*Please note:* I don't really know the JBIG2 image format at all, however from a very brief look at the specification it seems that integers should be 32-bit.
In general it's recommended to pass a *parameter object* when calling the `getDocument`-function in the API, since that's the only way to provide additional options, and the fact that it also accepts a URL or TypedArray directly is now mostly for backwards compatibility reasons.
However, the `getDocument`-function also accepts a direct `PDFDataRangeTransport`-instance which just seems unnecessary.
*Please note:* The `PDFDataRangeTransport`-implementation was added specifically for the *built-in* Firefox PDF Viewer, however it's most likely not commonly used by any third-party (given that it requires manual PDF-data loading).
Furthermore, the default-viewer always provides a *parameter object* when calling the `getDocument`-function and it's thus completely unaffected by these changes.
The relevant TrueType font is missing both /ToUnicode *and* /Encoding entires, either of which would have prevented the (current) broken textLayer rendering.
My first idea was that we could use the `post` table in the TrueType font, see https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6post.html, to get the actual glyphNames and amend the fallback ToUnicode-map that way. Unfortunately that didn't work, since the `post` table only contained ".notdef" and "" (i.e. empty string) entries.
Instead we try to use the `name` table in the TrueType font, see https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6name.html, to determine if the platform is Windows and thus fallback to generate a ToUnicode-map from the `WinAnsiEncoding`.
Note how all over the `src/core/annotation.js`-code we're assuming that if an `appearance`-entry exists it's also a Stream. However, we're not actually checking that thoroughly enough which causes issues in some badly generated PDF documents.
This patch removes the recently introduced `transferPdfData` API-option, and simply enables transferring of TypedArray data *by default* instead of copying it. This will help reduce main-thread memory usage, however it will take ownership of the TypedArrays. Currently this only applies to the following cases:
- TypedArrays passed to the `getDocument`-function in the API, in order to open PDF documents from binary data.
- TypedArrays passed to a `PDFDataRangeTransport`-instance, used to support custom PDF document fetching/loading (see e.g. the Firefox PDF Viewer).
*PLEASE NOTE:* To avoid being affected by this, please simply *copy* any TypedArray data before passing it to either of the functions/methods mentioned above.
Now that we transfer TypedArray data that we previously only copied, we need to be more careful with input validation. Given how the `{IPDFStreamReader, IPDFStreamRangeReader}.read` methods will always return ArrayBuffer data, which is then transferred to the worker-thread[1], the actual TypedArray data passed to the API thus need to have the same exact size as its underlying ArrayBuffer to prevent issues.
Hence we'll check for this and only allow transferring of *safe* TypedArray data, and fallback to simply copying the data just as before. This obviously shouldn't be an issue in the Firefox PDF Viewer, but for the general PDF.js library we need to be more careful here.
---
[1] See e09ad99973/src/display/api.js (L2492-L2506) respectively e09ad99973/src/display/api.js (L2578-L2590)
Note how in the API we're transferring the PDF data that's fetched over the network[1]:
- f28bf23a31/src/display/api.js (L2467-L2480)
- f28bf23a31/src/display/api.js (L2553-L2564)
To support that functionality we have the `PDFDataTransportStream`, `PDFFetchStream`, `PDFNetworkStream`, and `PDFNodeStream` implementations. Here these stream-implementations vary slightly in how they handle `ArrayBuffer`s internally, w.r.t. transferring or copying the data:
- In `PDFDataTransportStream` we optionally, after PR 15908, allow transferring of the PDF data as provided externally (used e.g. in the Firefox PDF Viewer).
- In `PDFFetchStream` we're currenly always copying the PDF data returned by the Fetch API, which seems unnecessary. As discussed in PR 15908, it'd seem very weird if this sort of browser API didn't allow transferring of the returned data.
- In `PDFNetworkStream` we're already, since many years, transferring the PDF data returned by the `XMLHttpRequest` functionality. Note how the `getArrayBuffer` helper function simply returns an `ArrayBuffer` response as-is.
- In `PDFNodeStream` we're currently copying the PDF data, however this is unfortunately necessary since Node.js returns data as a `Buffer` object[2].
Given that the `PDFNetworkStream` has been, indirectly, supporting transferring of PDF data for years it would seem really strange if this didn't also apply to the `PDFFetchStream`-implementation.
Hence this patch simply enables transferring of PDF data, when accessed using the Fetch API, unconditionally to help reduced main-thread memory usage since the `PDFFetchStream`-implementation is used *by default* in browsers (for the GENERIC build).
---
[1] As opposed to PDF data being provided as e.g. a TypedArray when calling `getDocument` in the API.
[2] This is a "special" Node.js object, see https://nodejs.org/api/buffer.html#buffer, which doesn't exist in browsers.
Also, removes the `initialData`-parameter JSDocs for the `getDocument`-function given that this parameter has been completely unused since PR 8982 (over five years ago). Note that the `initialData`-parameter is, and always was, intended to be provided when initializing a `PDFDataRangeTransport`-instance.
Given that this is internal functionality, not exposed in the official API, it's not entirely clear (at least to me) why we can't just initialize this directly in `src/display/api.js` instead.
When testing both the development viewer and all the ways in which we run tests, everthing still appears to work just fine with this patch.
*Please note:* The reduced test-case is *not* a perfect reproduction of the original PDF document, since this one fails to open in e.g. Adobe Reader, but I do believe that it captures the most important points here.
For corrupt *and* encrypted PDF documents, it's possible that only some trailer dictionaries actually contain an /Encrypt-entry. Previously we'd could easily miss that, since we generally pick the first not obviously corrupt trailer dictionary, and the solution implemented here is to simply pre-parse all trailer dictionaries to see if there's any /Encrypt-entries.
This fixes a warning reported by CodeQL, and should also make general sense given that we parse the font-data to determine the *actual* `type`/`subtype` rather than trusting the PDF document.
This was deprecated in PR 15758 but it's unfortunately quite difficult to tell if third-party users are depending on this, e.g. to implement custom error reporting, and if so to what extent.
However, thanks to the pre-processor we can limit *most* of this code to GENERIC builds which still seem like a worthwhile change.
These changes reduce the bundle size of the Firefox PDF Viewer by 3.8 kB in total.
This was deprecated in PR 15758 and given that it's quite unlikely that any third-party users are relying on this functionality, since it was only ever added to support telemetry reporting in the Firefox PDF Viewer, it should hopefully be fine to remove this fairly quickly.
These changes reduce the bundle size of the Firefox PDF Viewer by 4.5 kB in total.
Given that the Fetch API only supports the http/https protocols, worker-thread fetching of CMaps and Standard-fonts may thus fail in certain cases. To improve the default behaviour we'll now also check that the `cMapUrl` and `standardFontDataUrl` options are appropriate, except in Firefox where this should always work.
Note how, in the scripting initialization in the viewer, we only ever invoke `PDFPageProxy.getJSActions` *once* per page in order to improve overall performance; see a575aa13b9/web/pdf_scripting_manager.js (L372-L375)
Hence it really shouldn't be necessary to cache its result in the API, especially when that is done *manually* rather than using something like `shadow`.
When we're destroying a `PDFPageProxy`-instance, during full document destruction, we'll force-abort any worker-thread parsing of operatorLists. Hence we should make sure that any pending cancel timeout is always aborted, since a later `PDFPageProxy._abortOperatorList` call should always "replace" a previous one.
*Please note:* Technically this was always wrong, but with the changes in PR 15825 it became *ever so slightly* easier to trigger this thanks to the potentially longer timeout.
When trying to find incomplete objects, i.e. those missing the "endobj"-string at the end, there's unfortunately a number of possible operators that we need to check for. Otherwise we could miss e.g. the "trailer" at the end of a corrupt PDF document, which is why the referenced document didn't work.
Currently we do all searching on the "raw" bytes of the PDF document, for efficiency, however this doesn't really work when we need to check for *multiple* potential command-strings. To keep the complexity manageable we'll instead use regular expressions here, but we can at least avoid creating lots of substrings thanks to the `RegExp.lastIndex` property; which is well supported across browsers according to https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/lastIndex#browser_compatibility
Note that this repeated regular expression usage could perhaps be slightly less efficient than the old code, however this method is only invoked for corrupt PDF documents.
By moving this code the "pageviewer"-component example will become slightly more usable on its own, it may simplify a future addition of XFA Foreground document support, and finally also serves as preparation for the following patches.
Previously we'd abort all parsing if an Error was encountered, despite the fact that multiple `startXRefQueue`-entries may be available and that continued parsing could thus eventually be able to find usable data.
Note that in the referenced PDF document the `startxref`-operator, at the end of the file, points to a position in the middle of an arbitrary `stream` which is why things break.
This is done to support upcoming viewer-changes, and in order to prevent third-party users from outright breaking things we'll simply ignore too large values.
It's a follow-up of #14950: some format actions are ran when the document is open
but we must be sure we've everything ready for that, hence we have to run some
named actions before runnig the global format.
In playing with the form, I discovered that the blur event wasn't triggered when
JS called `setFocus` (because in such a case the mouse was never down). So I removed
the mouseState thing to just use the correct commitKey when blur is triggered by a
TAB key.
In order to move the annotations in the DOM to have something which corresponds
to the visual order, we need to have their dimensions/positions which means that
the parent must have some dimensions.
While reviewing recent patches, I couldn't help but noticing that we now have a lot of call-sites that manually access the `PageViewport.viewBox`-property.
Rather than repeating that verbatim all over the code-base, this patch adds a lazily computed and cached getter for this data instead.
This was essentially done only to compensate for the viewer calling `PDFPageProxy.getAnnotations` unconditionally on every annotationLayer-rendering invocation. With the previous patch that's no longer happening, and this API-caching should thus no longer be necessary.
An annotation editor layer can be destroyed when it's invisible, hence some
annotations can have a null parent but when printing/saving or when changing
font size, color, ... of all added annotations (when selected with ctrl+a) we
still need to have some parent properties especially the page dimensions, global
scale factor and global rotation angle.
This patch aims to remove all the references to the parent in the editor instances
except in some cases where an editor should obviously have one.
It fixes#15780.
The main issue is due to the fact that an editor's parent can be null when
we want to serialize it and that lead to an exception which break all the
saving/printing process.
So this incomplete patch fixes only the saving/printing issue but not the
underlying problem (i.e. having a null parent) and doesn't bring that much
complexity, so it should help to uplift it the next Firefox release.
Rather than handling these parameters separately, which is a left-over from back when streaming of textContent was originally added, we can simply pass either data directly to the `TextLayer` and let it handle things accordingly.
Also, improves a few JSDoc comments and `typedef`-imports.
Compared to the recent PR 15722 for the `textLayer` this one should be a (comparatively) much a smaller win overall, since most documents don't have any structTree-data and the required parsing should be cheaper. However, it seems to me that it cannot hurt to improve this nonetheless.
Note that by moving the `structTreeLayer` initialization we remove the need for the "textlayerrendered" event listener, which thus simplifies the code a little bit.
Also, removes the API-caching of the structTree-data since this was basically done to offset the lack of caching in the viewer.
*Please note:* I don't really expect that this is will be an observable change, since virtually all PDF documents already order e.g. /MediaBox and /CropBox entries correctly.
By normalizing boundingBoxes already on the worker-thread, we can be sure that even a corrupt document won't cause issues.
Note how we're passing the `view`-getter to the `PartialEvaluator.getTextContent` method, in order to detect textContent which is outside of the page, hence it makes sense to ensure that it's formatted as expected.
Furthermore, by normalizing this once on the worker-tread we should no longer have to worry about a possibly negative width/height in the `PageViewport` constructor.
Finally, the patch also simplifies the `view`-getter a little bit.
The idea is just to resuse what we got on the first draw.
Now, we only update the scaleX of the different spans and the other values
are dependant of --scale-factor.
Move some properties in the CSS in order to avoid any updates in JS.
The deprecation is included in the current release, i.e. version `3.1.81`, and given the edge-case nature of this option I really don't think that we need to keep it deprecated for multiple releases.
This patch has been successfully tested in a local, artifact, Firefox build.
*Please note:* The only thing that'll no longer work for PDF documents opened using "data:"-URLs is middle-clicking on internal/outline links, in order to open the destination in a new tab. This is however an extremely small loss of functionality, and as can be seen in the bug the alternative (i.e. doing nothing) is surely much worse.
Currently both of the `AnnotationElement` and `KeyboardManager` classes contain *identical* `platform` getters, which seems like unnecessary duplication.
With the pre-processor we can also limit the feature-testing to only GENERIC builds, since `navigator` should always be available in browsers.
Add a deprecation notification for PDFDocumentLoadingTask.onUnsupportedFeature and PDFDocumentProxy.stats
which are likely useless.
The unsupported feature stuff have initially been added in (#4048) in order to be able to display a
warning bar and to help to have some numbers to know how a feature was used.
Those data are no more used in Firefox.
This is very old code, which is unused (by default) in browsers nowadays since the Font Loading API will always be preferred.
For Node.js environments we use the same constant as elsewhere throughout the code-base, and we can also simplify the Firefox-specific check given that the lowest supported version is `102` (as of this writing).
Finally the old TODO is removed, since the general availability of the Font Loading API has made it redundant.
The use of `Array.prototype.reduce()` is, in my opinion, hurting overall readability since it's not particularly easy to look at the relevant code and immediately understand what's going on here. Furthermore this code leads to strictly speaking unnecessary allocations and parsing, since we could just track the min/max values directly in the relevant loop instead.
This has never really been used anywhere within the PDF.js library[1], and when streaming of textContent was introduced this parameter was effectively made redundant.
Note that when streaming of textContent is used, all text-layout has already happened by the time that this `timeout`-functionality is actually invoked (thus making it pointless).
While the `timeout`-functionality may still "work" when the textContent is provided upfront, although it's never been used/tested, streaming will generally perform better (in e.g. a viewer setting).
*Please note:* While unrelated here, also removes a now unused property that I forgot in PR 15259.
---
[1] At least not since the code was moved into its current file, which happened in PR 6619 and landed seven years ago.
This can't be a particularly common feature, since we've supported Optional Content for over two years and this is the very first TilingPattern-case we've seen.
The reason for the issue is that we use the generic `getFilenameFromUrl` helper function, which was originally intended for regular URLs.
For the filenames we're dealing with in FileAttachments, we really only want to strip the path when one exists[1].
---
[1] See [bug 1230933](https://bugzilla.mozilla.org/show_bug.cgi?id=1230933) for an example of such a case.
Currently *some* functions in this file have names while others don't, and in a few cases the names are no longer entirely accurate.
For the relevant functions there should really be no need to name them, and if memory serves this was originally done since browsers (many years ago) didn't always handle anonymous functions correctly in stack traces.
Given that this helper function is only used on the worker-thread, there's no reason to duplicate it in both of the `pdf.js` and `pdf.worker.js` files.
Given that this helper function is only used on the worker-thread, there's no reason to duplicate it in both of the `pdf.js` and `pdf.worker.js` files.
Given that these functions are virtually identical, with the latter only adding a BOM, we can combine the two. Furthermore, since both functions were only used on the worker-thread, there's no reason to duplicate this functionality in both of the `pdf.js` and `pdf.worker.js` files.
Having just played around with adding FreeText-annotations and then trying to print, there were `FreeTextAnnotation: OffscreenCanvas is not supported, annotation may not render correctly.` messages printed in the console.
The reason for this is that `FreeTextAnnotation` inherits from `MarkupAnnotation`, however only `WidgetAnnotation` actually defines the `_isOffscreenCanvasSupported` property.
Given that this `assert` is only intended to catch any implementation bugs in our code, and not actually to validate the PDF data directly[1], we can avoid making this function call unconditionally.
---
[1] In those cases, for example a `FormatError` should have been thrown instead.
With modern EcmaScript features, we can define these fields directly instead. Please note that for backwards compatibility purposes they are still public as before, however note that this functionality is *disabled* by default (see the `pdfBug` API option).
Also, we can (slightly) simplify the two loops used in the `toString` method.
These fields were never intended to be public, since modifying them manually would lead to inconsistent state, and with modern EcmaScript features we can now enforce this.
Also, this patch removes a couple of JSDoc comments that we generally don't use.
- For text fields
* when printing, we generate a fake font which contains some widths computed thanks to
an OffscreenCanvas and its method measureText.
In order to avoid to have to layout the glyphs ourselves, we just render all of them
in one call in the showText method in using the system sans-serif/monospace fonts.
* when saving, we continue to create the appearance streams if the fonts contain the char
but when a char is missing, we just set, in the AcroForm dict, the flag /NeedAppearances
to true and remove the appearance stream. This way, we let the different readers handle
the rendering of the strings.
- For FreeText annotations
* when printing, we use the same trick as for text fields.
* there is no need to save an appearance since Acrobat is able to infer one from the
Content entry.
These variables are left-over from the initial implementation, back when `String.prototype.padStart` didn't exist and we thus had to pad manually (using a loop).
This helps improve performance for some PDF documents with a huge number of inline images, e.g. the PDF document from issue 2618.
Given that we no longer create `Stream`-instances unconditionally, we also don't need `Dict`-instances for cached inline images (since we only access the filter).
*Please note:* This only fixes the "wrong letter" part of bug 1799927.
It appears that the simple `computeAdler32` function, used when caching inline images, generates hash collisions for some (very short) TypedArrays. In this case that leads to some of the "letters", which are actually inline images, being rendered incorrectly.
Rather than switching to another hashing algorithm, e.g. the `MurmurHash3_64` class, we simply cache using a stringified version of the inline image data as the cacheKey to prevent any future collisions. While this will (naturally) lead to slightly higher peak memory usage, it'll however be limited to the current `Parser`-instance which means that it's not persistent.
One small benefit of these changes is that we can avoid creating lots of `Stream`-instances for already cached inline images.
The purpose of this patch is twofold:
- Initialize the unicode-category data *lazily* during text-extraction, since this is completely unused during general parsing/rendering.
- Stop exposing this data in the API, since it's unused on the main-thread and it seems like it was *accidentally* included.
Obviously these changes are API-observable, but hopefully no user is depending on this. Furthermore, it's trivial for a user to re-create this unicode-category data manually with a regular expression (from the exposed `unicode` property).
Currently, during text-extraction, we're repeatedly normalizing and (when necessary) reversing the unicode-values every time. This seems a little unnecessary, since the result won't change, hence this patch moves that into the `Glyph`-instance and makes it *lazily* initialized.
Taking the `tracemonkey.pdf` document as an example: When extracting the text-content there's a total of 69236 characters but only 595 unique `Glyph`-instances, which mean a 99.1 percent cache hit-rate. Generally speaking, the longer a PDF document is the more beneficial this should be.
*Please note:* The old code is fast enough that it unfortunately seems difficult to measure a (clear) performance improvement with this patch, so I completely understand if it's deemed an unnecessary change.
In order to support opening certain corrupt PDF documents, particularly hand-edited ones, this patch adds support for letting the `Catalog.getAllPageDicts` method fallback to returning an *empty* dictionary to replace (only) the first /Page of the document.
Given that the viewer cannot initialize/load without access to the first page, this will thus allow e.g. document-level scripting to run as expected. Note that by effectively replacing a corrupt or missing first /Page in this way[1], we'll now render nothing but a *blank* page for certain cases of broken/corrupt PDF documents which may look weird.
*Please note:* This functionality is controlled via the existing `stopAtErrors` option, that can be passed to `getDocument`, since it's easy to imagine use-cases where this sort of fallback behaviour isn't desirable.
---
[1] Currently we still require that a /Pages-dictionary is found though, however it *may* be possible to relax even that assumption if that becomes absolutely necessary in future corrupt documents.
Note that the "trailer"-case is already a fallback, since normally we're able to use the "xref"-operator even in corrupt documents. However, when a "trailer"-operator is found we still expect "startxref" to exist and be usable in order to advance the stream position. When that's not the case, as happens in the referenced issue, we use a simple fallback to find the first "obj" occurrence instead.
This *partially* fixes issue 15590, since without this patch we fail to find any objects at all during `XRef.indexObjects`. However, note that the PDF document is still corrupt and won't render since there's no actual /Pages-dictionary and the /Root-entry simply points to the /OpenAction-dictionary instead.
After the clean-up in PR 15616, the `PdfManager.onLoadedStream` method now only has a single call-site.
Hence why this patch suggests that we remove this method and replace it with an *optional* parameter in `PdfManager.requestLoadedStream` instead. By making the new behaviour opt-in, we'll thus not change any existing call-site.
- Some events, which require a user interaction, will allow those functions to be called.
But after few seconds, if there are no more user interaction, it won't be possible
anymore.
The idea is to give an opportunity to the user to leave the pdf.
- Disable print function when we're printing, the same with saving and disallow to save
on open events.
When a form isn't changed, we used the appearances we had in the file, but when
/NeedAppearances is true, all the appearances have to be regenerated whatever they're.
*This is very old code, and it could thus do with some simplification.*
Note how in the `src/core/worker.js` file we're combining both the `PdfManager.requestLoadedStream` and `PdfManager.onLoadedStream` methods in order to access the stream-data. This seems unnecessary, and it's simple enough to always let the `PdfManager.requestLoadedStream` method return the stream-data as well.
PR 13725 was only intended as a temporary work-around, and it seems that we can now revert that.
- Firefox 102 is the currently maintained ESR-branch, and the PDF.js project only supports the active one.
- Node.js now works, thanks to the `node-canvas` package, and I've confirmed locally that following the STR in issue 13724 generates a correct image.
In the referenced PDF document there are "numbers" which consist only of `-.`, and while that's obviously not valid Adobe Reader seems to handle it just fine.
Letting this method ignore more invalid "numbers" was suggested during the review of PR 14543, so let's simply relax our the validation here.
It appears that PR 15593 broke `issue12402`, and we thus need to partially restore the /Count check.
I completely missed this when looking at the test-results for PR 15593, both locally and on the bots, since the `Driver._getLastPageNumber` method would "swallow" an unavailable page number.
- When we're editing some annotations, keeping the role="text-box" make them visible
as editable and VoiceOver (Mac) is able to read the contents when they're focused;
- Add an attribute "aria-activedescendant" in order to make the content discoverable
by NVDA on Windows.
After PR 14311, and follow-up patches, we no longer require that the /Count entry (in the /Pages dictionary) is either present or even valid in order to parse/render a PDF document.
Hence it seems strange to keep this requirement for *corrupt* PDF documents, when trying to find a usable `trailer` in the `XRef.indexObjects` method.
With the changes in the previous patch we can move the glyph-cache lookup to the top of the method and thus avoid a bunch of, in *almost* every case, completely unnecessary re-parsing for every `charCode`.
This method, and its class, was originally added in PR 4453 to reduce memory usage when parsing text. Then PR 13494 extended the `Glyph`-representation slightly to also include the `charCode`, which made the `matchesForCache` method *effectively* redundant since most properties on a `Glyph`-instance indirectly depends on that one. The only exception is potentially `isSpace` in multi-byte strings.
Also, something that I noticed when testing this code: The `matchesForCache` method never worked correctly for `Glyph`s containing `accent`-data, since Objects are passed by reference in JavaScript. For affected fonts, of which there's only a handful of examples in our test-suite, we'd fail to find an already existing `Glyph` because of this.
When we fail to find a usable PDF document `trailer` *and* there were errors during parsing, try and fallback to a *previous* generation as a last resort during fetching of uncompressed references.
*Please note:* This will not affect "normal" PDF documents, with valid /XRef data, and even most *corrupt* documents should be completely unaffected by these changes.
Part of this is very old code, and back when support for parsing the catalog-version was added things became less clear (in my opinion).
Hence this patch tries to improve things, by e.g. validating the header- and catalog-version separately.
Note how we're currently skipping all main-thread cleanup when document destruction has started, but for some reason we're still dispatching the "Cleanup" message.
This seems like a simple oversight, since destruction will already invoke the `BasePdfManager.cleanup` method (on the worker-thread) to fully clear-out all caches.
Given the sheer number of heuristics added to this method over the years, moving the *valid* unicode found case to the top should improve readability of the code.
- Fix Field::getArray in order to collect only the fields which have a value;
- Fix AFSimple_Calculate:
* allow to have a string with a list of field names as argument;
* since a field can be non-terminal, use Field::getArray to collect
the field under it and then apply the calculation on all the descendants.
This code was added all the way back in PR 6698, almost seven years ago, for backwards compatibility reasons. At this point in time, it seems that we can remove that since:
- We have more fine-grained "UnsupportedFeature" reporting elsewhere in the worker-thread code nowadays.
- The GetOperatorList-handling is now using `ReadableStream`s, which means that errors are being forwarded to the main-thread anyway.
- We're also no longer displaying a notification-bar, in the *built-in* Firefox PDF Viewer, for any of these "UnsupportedFeature" messages.
*Please note:* I don't really know what I'm doing here, however the patch appears to fix the referenced issue when comparing the rendering with Adobe Reader (with the caveat that I don't speak the language in question).
When a new PDF document is opened in the GENERIC viewer we (obviously) create a new `AnnotationEditorUIManager`-instance, since those are document-specific, and thus we need to ensure that we actually register the `editorTypes` for each one.
Note how after having found the "%PDF-" prefix we then read both the prefix and the version in the loop, only to then remove the prefix at the end.
It seems better to instead advance the stream position past the "%PDF-" prefix, and then read only the version data.
Finally the loop-condition can also be simplified slightly, to further clean-up some very old code.
*Fixes a regression from PR 15246, sorry about that!*
The return value of all `Annotation.getOperatorList` methods was changed in PR 15246, however I missed updating the error code-path in `Page.getOperatorList` which thus breaks all operatorList-parsing for pages with corrupt Annotations.
Looking at the code on the worker-thread, there doesn't appear to be any particular reason for placing *some* of the properties in a `source`-object when sending them with "GetDocRequest".
As is often the case the explanation for this structure is rather "for historical reasons", since originally we simply sent the `source`-object as-is. Doing that was obviously a bad idea, for a couple of reasons:
- It makes it less clear what is/isn't actually needed on the worker-thread.
- Sending unused properties will unnecessarily increase memory usage.
- The `source`-object may contain unclonable data, which would break the library.
Rather than sending all of these parameters individually and then grouping them together on the worker-thread, we can simply handle that in the API instead.
All of the these constants have been deprecated for a while, and with the upcoming *major* version this seems like a good time to remove them.
For the string-constants we can simply remove them, but the number-constants are left commented out since we don't want to re-number the list to prevent third-party breakage.
I noticed the 256 % 3 (which is equal to 1) so I slighty simplify the code.
The sum of the 16 Uint8 doesn't exceed 2^12, hence we can just take the
sum modulo 3.
This method was originally added in PR 1157 (back in 2012), however its only call-site was then removed in PR 2423 (also in 2012).
Hence this method has been completely unused for nearly a decade, and it should thus be safe to remove it.
This patch first of all makes `isOffscreenCanvasSupported` configurable, defaulting to `true` in browsers and `false` in Node.js environments, with a new `getDocument` parameter. While you normally want to use this, in order to improve performance, it should still be possible for users to control it (similar to e.g. `isEvalSupported`).
The specific problem, as reported in issue 14952, is that the SVG back-end doesn't support the new ImageMask data-format that's introduced in PR 14754. In particular:
- When the SVG back-end is used in Node.js environments, this patch will "just work" without the user needing to make any code changes.
- If the SVG back-end is used in browsers, this patch will require that `isOffscreenCanvasSupported: false` is added to the `getDocument`-call.
*Please note:* The referenced issue is the only mention that I can find, in either GitHub or Bugzilla, of "GoToE" actions.
Hence why I've purposely settled for a very simple, and partial, "GoToE" implementation to avoid complicating things initially.[1] In particular, this patch only supports "GoToE" actions that references the /EmbeddedFiles-dict in the PDF document.
See https://web.archive.org/web/20220309040754if_/https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G11.2048909
---
[1] Usually I always prefer having *real-world* test-cases to work with, whenever I'm implementing new features.
This is yet another small piece of clean-up of the `FontLoader`-code, since we've not used this `id`-property for anything ever since PR 6571 (which landed almost seven years ago). Furthermore, by default we're also not even using that code-path now since the Font Loading API will always be used when available.
*Please note:* This is tagged `[api-minor]` since it's technically observable from the outside, however no user ought to be directly interacting with these CSS font rules.
Currently the compatibility-file is loaded using a standard `import`-statement and while its code is enclosed in a pre-processor block, and thus is excluded in e.g. the MOZCENTRAL build-target, it still results in the *built* `pdf.js`/`pdf.worker.js` files having an effectively empty closure as a result.
By moving the checks from `src/shared/compatibility.js` and into `src/shared/util.js` instead, we can load the file using a build-time `require`-statement and thus avoid that closure.
Note that with these changes the compatibility-file will no longer be loaded in development mode, i.e. when `gulp server` is used. However, this shouldn't be a big issue given that none of its included polyfills could be loaded then anyway (since `require`-statements are being used) and that it's really only intended for the `legacy`-builds of the library.
Note that this PR only adds the "underscore"-variant of *actually existing* ligatures, however the referenced PDF document also uses a couple of non-standard ones (e.g. `ft`, `Th`, and `fh`) that we cannot easily support without larger changes (since they don't have official Unicode-entries).
Given that it's clearly the PDF document, and its fonts, that's the culprit here it's not entirely clear to me that we actually want to attempt a larger refactoring/rewriting of the `glyphlist.js` code, assuming it's even generally possible. Especially when this patch alone already improves our copy-paste behaviour when compared to both Adobe Reader and PDFium, and that this is only the *second* time this sort of bug has been reported.
Fewer dependencies shouldn't be a bad idea in general, and given that the `node-canvas` package already include a `DOMMatrix` polyfill we can simply use that one instead.
Given that Firefox supports *synchronous* font loading, when the Font Loading API isn't being used, there's really no point including code which if called would just throw in the MOZCENTRAL build. (This is safe, since the `FontLoader.isSyncFontLoadingSupported`-getter always return `true` there.)
After the changes in PR 10539 (which landed over three years ago) the `FontLoader.bind` method can only be called with *a single* font at a time, hence the `_prepareFontLoadEvent` method obviously don't need to support multiple fonts any more.
By having just *one* class, and using pre-processor blocks directly in the relevant methods, we reduce the size of this code in the *built* `pdf.js` file.
Originally, when the `BaseFontLoader` abstraction was added in PR 9982, the idea was probably that additional build-targets would get their own implementations. Given that this hasn't happened in the four years since that landed, it doesn't seem meaningful to keep it around.
The existing `loadingContext` class-property can be simplified slightly, since we've not been using the `id`-property on the requests ever since PR 3477 (which landed nine years ago).
Furthermore, by default we're also not even using that code-path now since the Font Loading API will always be used when available.
This was done all the way back in PR 8361, for a mozilla-central test that's since been removed. As can be seen in the following search results, there's no `LoopbackPort` invocation outside of the PDF.js code itself: https://searchfox.org/mozilla-central/search?q=LoopbackPort&path=
Given that the `LoopbackPort` is only used in connection with "fake workers", which is something that we don't officially recommend/support, this doesn't seem like functionality that we want to keep exposing in the public API.
OperatorList.addOp can trigger a flush if it's required, hence the values passed to it must
be correctly initialized in order to avoid some wrong values in the renderer.
Because of that a clip path was considered as empty, nothing was clipped, hence the wrong
rendering in bug 1791583.
*This effectively replaces PR 15465.*
As outlined in https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Map/forEach, the argument order when iterating through a `Map` is actually `value, key`.
Ignoring the incorrect Array used in the old code, I cannot imagine that this would've worked anyway since we didn't use the actual `setTimeout`-functionRefs to clear the timeouts; please refer to the `setTimeout`/`setInterval` methods in the `SandboxSupportBase.createSandboxExternals` method.
Since there are no script engine with XFA, the FormCalc parser is not used irl.
The bug @nmtigor noticed was hidden by another one (the wrong check on `match`).
Most of the `String.prototype.search` call-sites found throughout the code-base is actually not necessary, since we usually only want a *boolean*, and those can be replaced with `RegExp.prototype.test` instead.
This patch updates a bunch of older code, that makes conditional function calls, to use optional chaining rather than `if`-blocks.
These mostly mechanical changes reduce the size of the `gulp mozcentral` build by a little over 1 kB.
*Please note:* This is only a, hopefully generally helpful, work-around rather than a proper solution to issue 15292.
There's something that's "special" about the Type1 fonts in the referenced PDF document, since we don't manage to find any actual font programs and thus cannot render anything.
Given that it shouldn't make sense for a Type1 font program to ever be empty, since that means that there's no glyph-data to render, we simply fallback to a standard font to at least try and render *something* in these rare cases.
Given that the change in PR 13393 was slightly speculative, given the lack of test-cases, let's just revert part of that to fix the referenced issue.
Based on a quick look at old issues and existing test-cases, it seems that most (if not all) PDF documents that benefit from using the font-data in this way lack any /ToUnicode maps which should mean that they're unaffected by these changes.
Note that this patch implements the `SetOCGState`-handling in `PDFLinkService`, rather than as a new method in `OptionalContentConfig`[1], since this action is nothing but a series of `setVisibility`-calls and that it seems quite uncommon in real-world PDF documents.
The new functionality also required some tweaks in the `PDFLayerViewer`, to ensure that the `layersView` in the sidebar is updated correctly when the optional-content visibility changes from "outside" of `PDFLayerViewer`.
---
[1] We can obviously move this code into `OptionalContentConfig` instead, if deemed necessary, but for an initial implementation I figured that doing it this way might be acceptable.
It slightly helps to reduce the code size and its complexity.
But the cool thing is that it allows to copy/paste some anntations from a pdf
to an other.
Apparently this is implemented in e.g. Adobe Reader, and the specification does support it, however it cannot be commonly used in real-world PDF documents since it took over ten years for this feature to be requested.
A number of Annotation-types are currently creating their own PopupAnnotations, since they need to use a custom `trigger`-element. However, because of where that check is currently implemented[1] we end up attaching empty/unused containers for those PopupAnnotations to the DOM[2]; see e.g. the `annotation-line.pdf` file in the test-suite for one example.
By instead moving the types-check into the `PopupAnnotationElement` constructor, we can completely skip those PopupAnnotations that are being explicitly handled elsewhere.
Note that I don't *believe* that this is a new issue, although I've not tried to bisect it, but this likely goes back quite some time (possibly even as far as PR 8228).
---
[1] In the `PopupAnnotationElement.render` method.
[2] Please note that the actual Popup-element *itself* isn't being attached/rendered here, just its container which by itself serves no purpose as far as I can tell.
There's three notable exceptions here:
- The `saveDocument` one is converted into a permanent `warn`, since it still works when the `annotationStorage` is empty although it's (obviously) less efficient than `getData`.
- The `fallbackWorkerSrc` functionality (for browsers), since just removing it would risk too much third-party breakage.
- The SVG back-end, since a final decision is yet to be made. (It might be completely removed, or left as-is in an essentially "frozen" state.)
Note that this patch prepends the document title with "* ", rather than only "*" as suggested in the bug, since there's nothing that says that a PDF document cannot specify a title[1] beginning with an asterisk. To reduce possible confusion, having a space between the "editing marker" and the actual document title thus cannot hurt as far as I'm concerned.
In order to notify the viewer when all `AnnotationEditor`s have been removed, we utilize the existing `onAnnotationEditor`-callback to allow the document title to be updated as necessary.
Finally, this patch makes the following (slightly unrelated) changes:
- Rename the `AnnotationStorage.removeKey` method to just `AnnotationStorage.remove` instead. This is consistent with e.g. the `has`-method and should suffice to explain what it does.
- Remove the `AnnotationStorage.hasAnnotationEditors` getter, since the viewer now tracks the necessary state internally. This avoids unnecessarily having to iterate through the `AnnotationStorage`-instance when saving/printing the document.
---
[1] Using either an /Info dictionary or a /Metadata stream.
This functionality has never been used anywhere in the PDF.js library/viewer itself, since it was added in 2013.
Furthermore this functionality is, and has always been, *completely untested* and also unmaintained.
Finally, there's (at least) one old issue about `appendImage` not returning the correct position; see issue 4182.
All-in-all, it seems that keeping very old, untested, unmaintained, and partially broken code around probably isn't what we want here.
(On the off-chance that any future a11y-work requires getting access to image-positions, it'd likely be much better to re-implement the necessary functionality from scratch and also make sure that it's properly tested from the beginning.)
This old method, which is only used with the `imageLayer` functionality, is essentially just a re-implementation of the existing `Util.applyTransform` method.
*This is a follow-up to PR 14869.*
In the old code we're accidentally "swallowing" part of the event-details, which explains why the annotationLayer didn't render.
One thing that made debugging a lot harder was the lack of error messages, from the viewer, and a few `PDFPageView`-methods were updated to improve this situation.
- Remove the `typeof Worker` check, since all browsers have had `Worker` support for many years now; see https://developer.mozilla.org/en-US/docs/Web/API/Worker#browser_compatibility
Furthermore the `new Worker(...)` call is wrapped in try-catch, which means that we'll still fallback to "fake workers" if necessary.
- Limit the `fallbackWorkerSrc` handling, in the `PDFWorker.workerSrc` getter, to only GENERIC builds since that's the only place where it's defined anyway.
Given that the code is written with JavaScript module-syntax, none of this functionality will "leak" outside of this file with these change.
By removing this closure the file-size is decreased, even for the *built* `pdf.worker.js` file, since there's now less overall indentation in the code.
This was moved into the `src/display/`-folder in PR 15110, for the initial editor-a11y patch. However, with the changes in PR 15237 we're again only using `binarySearchFirstItem` in the `web/`-folder and it thus seem reasonable to move it back there.
The primary reason for moving it back is that `binarySearchFirstItem` is currently exposed in the public API, and we always want to avoid that unless it's either PDF-related functionality or code that simply must be shared between the `src/`- and `web/`-folders. In this case, `binarySearchFirstItem` is a general helper function that doesn't really satisfy either of those alternatives.
Given that the code is written with JavaScript module-syntax, none of this functionality will "leak" outside of this file with these changes.
For e.g. the `gulp mozcentral` command the *built* `pdf.worker.js` file-size decreases `~2 kB` with this patch, and most of the improvement comes from having less overall indentation in the code.
Given that the code is written with JavaScript module-syntax, none of this functionality will "leak" outside of this file with these changes.
By removing this closure the file-size is decreased, even for the *built* `pdf.worker.js` file, since there's now less overall indentation in the code.
This patch doesn't structurally change the text layer: it just adds some aria-owns
attributes to some spans.
The aria-owns attribute expect to have an element id, hence it's why it adds back an
id on the element rendering an annotation, but this id is built in using crypto.randomUUID
to avoid any potential issues with the hash in the url.
The elements in the annotation layer are moved into the DOM in order to have them in the
same "order" as they visually are.
The overall goal is to help screen readers to present to the user the annotations as
they visually are and as they come in the text flow.
It is clearly not perfect, but it should improve readability for some people with visual
disabilities.
According to MDN `Path2D` is available in all browsers that we currently support, see https://developer.mozilla.org/en-US/docs/Web/API/Path2D#browser_compatibility
Hence only Node.js is currently lagging behind here, and requires that we keep the old code as a fallback in the `compileType3Glyph` function. However, there's an open PR in the `node-canvas` repository for adding `Path2D` support.
As far as I'm concerned, there's two possible solutions here:
- We land this patch now, since it removes unnecessary code in e.g. the Firefox PDF Viewer, which means that compilation of Type3 glyphs will be disabled in Node.js until that PR is landed.[1]
If users report bugs about Type3 glyphs looking "inconsistent" in Node.js and/or being slow to render, we could perhaps encourage them to upvote and otherwise help out getting that PR landed?
- We wait for the mentioned PR to land *first*, before moving forward with this patch. Given that there's been no updates on that PR for almost two months, this alternative may possibly take a while.
---
[1] Note that Type3 fonts are first of all not very common in PDF documents, and secondly that compilation only applies specifically to Type3 glyphs that contain /ImageMask-data (i.e. not all Type3 fonts are affected).
While this has always worked, as a consequence of the implementation, it's never been officially supported.
In addition to adding basic unit-tests, this patch also introduces a couple of new JSDoc `@typedef`s in the API to avoid overly long lines.
By doing this in the worker-thread this code will only need to run *once*, whereas currently re-rendering of a page forces this to be repeated (e.g. after it's been scrolled out-of-view and then back into view again).
When a FreeText editor is pasted then it hasn't an editorDiv yet when added
to the layer, hence it's empty.
So this patch just move the call to addToAnnotationStorage to ensure we've
what we need.
An annotation doesn't have to be in the text flow, hence it's likely a bad idea
to insert its text in the text layer. But the text must be visible from a screen
reader point of view so it must somewhere in the DOM.
So with this patch, the text from a FreeText annotation is extracted and added in
a div in its HTML counterpart, and with the patch #15237 the text should be visible
and positioned relatively to the text flow.
To improve performance of the sidebar we use the page-canvases to generate the thumbnails whenever possible, since that avoids unnecessary re-rendering when the sidebar is open. This works generally well, however there's an old problem in PDF documents that contain interactive forms (when those are enabled): Note how the thumbnails become partially (or fully) blank, since those Annotations are not included in the OperatorList.[1]
We obviously want to keep using the `PDFThumbnailView.setImage`-method for most documents, however we need a way to skip it only for those pages that contain interactive forms.
As it turns out it's unfortunately not all that simple to tell, after the fact, from looking only at the OperatorList that some Annotations were skipped. While it might have been possible to try and infer that in the viewer, it'd not have been pretty considering that at the time when rendering finishes the annotationLayer has not yet been built.
The overall simplest solution that I could come up with, was instead to include a *summary* of the interactive form-state when doing the final "flushing" of the OperatorList and expose that information in the API.
---
[1] Some examples from our test-suite: `annotation-tx2.pdf` where the thumbnail is completely blank, and `bug1737260.pdf` where the thumbnail is missing the "buttons" found on the page.
Currently some `OPS.beginAnnotation` arguments will contain a `Number` value for the `isUsingOwnCanvas`-parameter, or in some cases an `undefined` value, which is inconsistent from an API perspective.
We can undo/redo a command which will at some point add a command in the queue: typically
it can happening when redoing an addition.
So the idea is to lock the queue when undoing/redoing.
This will allow us to improve the `PDFThumbnailView.setImage` handling in the viewer, and thanks to the added caching this should be reasonbly efficient.
Given that Optional Content visibility is only intended/supported to be updated via the `OptionalContentConfig.setVisibility`-method, this patch actually enforces that now.
Note that this will be used by the next patch in the series, and will help prevent inconsistent state in the `OptionalContentConfig`-class.
*Please note:* This patch also uncovered a pre-existing bug, related to iterating through the visibility groups in the constructor, for the `baseState === "OFF"` case.