Commit Graph

2918 Commits

Author SHA1 Message Date
Calixte Denizet
fd03cd5493 [api-minor] Generate images in the worker instead of the main thread.
We introduced the use of OffscreenCanvas in #14754 and this patch aims
to use them for all kind of images.
It'll slightly improve performances (and maybe slightly decrease memory use).
Since an image can be rendered in using some transfer maps but because of
OffscreenCanvas we don't have the underlying pixels array the transfer maps
stuff is re-implemented in using the SVG filter feComponentTransfer.
2023-03-01 17:40:12 +01:00
Jonas Jenwald
45c332110e
Check OffscreenCanvas support once on the worker-thread
Currently we repeat the `FeatureTest.isOffscreenCanvasSupported` checks all over the worker-thread code, and with upcoming changes this will become even "worse".

Hence this patch, which changes the *worker-thread* default value for the `isOffscreenCanvasSupported`-parameter to `false` and moves the feature-testing into the `BasePdfManager`-constructor.

*Please note:* This patch is written using the GitHub UI, since I'm currently without a dev machine, so hopefully it works correctly.
2023-02-27 12:27:28 +01:00
Calixte Denizet
3a21423386 [Acroform] Use the full path to find the node in the XFA datasets where to store the value
I noticed several 'Path not found' errors because of a field called #subform[2].
From the XFA specs, the hash is used for a class of elements in the template tree.
When we're looking for a node in the datasets tree, it doesn't make sense to search
for a class. Hence the path element starting with a hash are just skipped.
2023-02-23 12:09:39 +01:00
Tim van der Meij
22618213c7
Merge pull request #16040 from Snuffleupagus/arrayBuffersToBytes
Re-factor the `arraysToBytes` helper function (PR 16032 follow-up)
2023-02-12 11:47:57 +01:00
Jonas Jenwald
6d4d402a78 Move the arrayBuffersToBytes helper function into the worker-thread
Given that this helper function is only used on the worker-thread, there's no reason to duplicate it in both of the *built* `pdf.js` and `pdf.worker.js` files.
2023-02-11 21:34:37 +01:00
Jonas Jenwald
18042163ce Improve the consistency between the LocalPdfManager/NetworkPdfManager constructor
Currently these classes take a bunch of parameters (somewhat randomly ordered), probably because this is very old code that's been extended over the years.
Hence this patch changes the constructors to use parameter-objects instead, which improves consistency and (slightly) reduces the amount of code as well.

*Please note:* Also removes the `msgHandler`-property on these classes, since I cannot find a single call-site that accesses it.
2023-02-11 13:39:52 +01:00
Jonas Jenwald
14b0e8c0b6 Ensure that "GetAnnotations" errors are propagated to the main-thread (PR 15267 follow-up)
With the changes in PR 15267 we're now accidentally swallowing "GetAnnotations" errors, rather than propagating them to the main-thread as intended.
2023-02-10 12:18:35 +01:00
Jonas Jenwald
c56f25409d Re-factor the arraysToBytes helper function (PR 16032 follow-up)
Currently this helper function only has two call-sites, and both of them only pass in `ArrayBuffer` data. Given how it's implemented there's a couple of code-paths that are completely unused (e.g. the "string" one), and in particular the intended fast-paths don't actually work.
This patch re-factors and simplifies the helper function, and it'll no longer accept anything except `ArrayBuffer` data (hence why it's also re-named).

Note that at the time when `arraysToBytes` was added we still supported browsers without TypedArray functionality, and we'd then simulate them using regular Arrays.
2023-02-10 10:26:35 +01:00
Jonas Jenwald
5ba596786c Change WorkerTasks, in WorkerMessageHandler.createDocumentHandler, to a use a Set
This is a tiny bit more compact, thanks to the `Set.prototype.delete` method.
2023-02-09 22:01:16 +01:00
calixteman
0fca6e187c
Merge pull request #16035 from calixteman/fix_combo_value
[Annotation] A combo can have a value other than one in the options
2023-02-09 19:56:16 +01:00
Jonas Jenwald
1fc8350795
Merge pull request #16032 from Snuffleupagus/less-arrayByteLength
Reduce usage of the `arrayByteLength` helper function
2023-02-09 18:56:20 +01:00
Calixte Denizet
cb1638530d [Annotation] A combo can have a value other than one in the options
When printing the pdf in #12233 in Acrobat, we can see that the combo for country
is empty: it's because the V entry doesn't have to be one of the options.
2023-02-09 18:50:57 +01:00
calixteman
972744a68f
Merge pull request #16033 from calixteman/bug1640217
Ignore position of combining diacritics when getting text (bug 1640217)
2023-02-09 18:23:59 +01:00
Calixte Denizet
58e4d92884 [Annotation] For choice widget, use the I entry instead of the V one (bug 1770750)
It isn't really conform to the specifications but Acrobat is working like that...
2023-02-09 17:26:13 +01:00
Calixte Denizet
4e9f26afa3 Ignore position of combining diacritics when getting text (bug 1640217) 2023-02-09 17:13:57 +01:00
Jonas Jenwald
96d338e437 Reduce usage of the arrayByteLength helper function
We're using this helper function when reading data from the [`PDFWorkerStreamReader.read`](a49d1d1615/src/core/worker_stream.js (L90-L98)) and [`PDFWorkerStreamRangeReader.read`](a49d1d1615/src/core/worker_stream.js (L122-L128)) methods, and as can be seen they always return `ArrayBuffer` data. Hence we can simply get the `byteLength` directly, and don't need to use the helper function.

Note that at the time when `arrayByteLength` was added we still supported browsers without TypedArray functionality, and we'd then simulate them using regular Arrays.
2023-02-09 15:50:38 +01:00
Jonas Jenwald
323d3d246a Re-factor the readChunk function in ChunkedStreamManager.sendRequest
Move the `done` branch to the top of the function, similar to how we usually format things when `ReadableStream`s are used.
2023-02-09 15:33:06 +01:00
Calixte Denizet
a25895bf72 [Annotation] Take into account the stroke alpha for a FreeText without appearance 2023-02-07 22:15:27 +01:00
Calixte Denizet
ea7b4b4d6c [Annotation] Avoid to encrypt the appearance stream two times (bug 1815476) 2023-02-07 19:26:46 +01:00
Jonas Jenwald
3a7fce49a3 A tiny improvement of the MetadataParser._repair method
We can just insert the initial greater-than sign at the start of the buffer, rather than doing that manually at the end.
2023-02-04 12:43:55 +01:00
Jonas Jenwald
808ca828f1 Extend getGlyphMapForStandardFonts with additional entries (issue 15977) 2023-01-30 12:13:21 +01:00
Jonas Jenwald
40a46e4397 Tweak adjustType1ToUnicode for fonts with a predefined *named* encoding (bug 1811668, PR 14050 follow-up)
*Please note:* I cannot reproduce the problem reported in bug 1811668, regarding the context menu, and in any case it's not clear that that part is even a PDF Viewer bug.

Looking at bug 1811668 I couldn't help but noticing that the textLayer isn't correct, and it's unfortunately once again a problem with the `adjustType1ToUnicode` function. That's intended to help improve text-selection for fonts without a /ToUnicode-entry, and in many cases it does help (the original PR fixed lots of issues) however it's also caused some problems.

In order to improve text-selection in bug 1811668, we'll now properly ignore fonts that have a predefined *named* encoding specified since that's really the intention with PR 14050.
2023-01-21 12:21:21 +01:00
Jonas Jenwald
f2fce93826 [JBIG2] Ensure that the decodeInteger function returns valid integers (issue 15942)
The JBIG2 images in this PDF document are corrupt enough that even Adobe Reader warns about it when opening the file.
*Please note:* I don't really know the JBIG2 image format at all, however from a very brief look at the specification it seems that integers should be 32-bit.
2023-01-19 17:14:17 +01:00
Jonas Jenwald
d6be5141e9 Fallback to using the name table to infer the encoding for TrueType fonts missing such data (issue 15910)
The relevant TrueType font is missing both /ToUnicode *and* /Encoding entires, either of which would have prevented the (current) broken textLayer rendering.
My first idea was that we could use the `post` table in the TrueType font, see https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6post.html, to get the actual glyphNames and amend the fallback ToUnicode-map that way. Unfortunately that didn't work, since the `post` table only contained ".notdef" and "" (i.e. empty string) entries.

Instead we try to use the `name` table in the TrueType font, see https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6name.html, to determine if the platform is Windows and thus fallback to generate a ToUnicode-map from the `WinAnsiEncoding`.
2023-01-17 16:04:51 +01:00
Jonas Jenwald
cefaecc2e8 Ensure that Annotation appearance-entries are actually Streams
Note how all over the `src/core/annotation.js`-code we're assuming that if an `appearance`-entry exists it's also a Stream. However, we're not actually checking that thoroughly enough which causes issues in some badly generated PDF documents.
2023-01-16 13:02:53 +01:00
calixteman
fcaeb5db88
Merge pull request #15901 from calixteman/15289_followup
Avoid null ExpansionFactor in type1 fonts (follow-up of #15289)
2023-01-07 18:20:31 +01:00
Jonas Jenwald
74e4b515c5
Merge pull request #15897 from Snuffleupagus/issue-15893
Support parsing encrypted documents in `XRef.indexObjects` (issue 15893)
2023-01-07 16:55:41 +01:00
Calixte Denizet
c170245fc0 Avoid null ExpansionFactor in type1 fonts (follow-up of #15289) 2023-01-07 16:25:24 +01:00
Calixte Denizet
e565e455e2 Set ExpansionFactor to 0.06 when it's equals to 0 in the private dict of CFF fonts 2023-01-07 14:53:13 +01:00
Jonas Jenwald
7d94fdeb48 Support parsing encrypted documents in XRef.indexObjects (issue 15893)
*Please note:* The reduced test-case is *not* a perfect reproduction of the original PDF document, since this one fails to open in e.g. Adobe Reader, but I do believe that it captures the most important points here.

For corrupt *and* encrypted PDF documents, it's possible that only some trailer dictionaries actually contain an /Encrypt-entry. Previously we'd could easily miss that, since we generally pick the first not obviously corrupt trailer dictionary, and the solution implemented here is to simply pre-parse all trailer dictionaries to see if there's any /Encrypt-entries.
2023-01-06 13:09:37 +01:00
Jonas Jenwald
6bdbb5c5ca Update the type/subtype at the end of font parsing
This fixes a warning reported by CodeQL, and should also make general sense given that we parse the font-data to determine the *actual* `type`/`subtype` rather than trusting the PDF document.
2023-01-02 16:21:48 +01:00
Jonas Jenwald
1a69d537c1 [api-minor] Limit the PDFDocumentLoadingTask.onUnsupportedFeature functionality to GENERIC builds (PR 15758 follow-up)
This was deprecated in PR 15758 but it's unfortunately quite difficult to tell if third-party users are depending on this, e.g. to implement custom error reporting, and if so to what extent.
However, thanks to the pre-processor we can limit *most* of this code to GENERIC builds which still seem like a worthwhile change.

These changes reduce the bundle size of the Firefox PDF Viewer by 3.8 kB in total.
2023-01-01 17:53:12 +01:00
Jonas Jenwald
0c1fb4e740 [api-minor] Remove the PDFDocumentProxy.stats getter (PR 15758 follow-up)
This was deprecated in PR 15758 and given that it's quite unlikely that any third-party users are relying on this functionality, since it was only ever added to support telemetry reporting in the Firefox PDF Viewer, it should hopefully be fine to remove this fairly quickly.

These changes reduce the bundle size of the Firefox PDF Viewer by 4.5 kB in total.
2023-01-01 17:06:47 +01:00
Jonas Jenwald
2fcf8bb5be Re-factor searching for incomplete objects in XRef.indexObjects (issue 15803)
When trying to find incomplete objects, i.e. those missing the "endobj"-string at the end, there's unfortunately a number of possible operators that we need to check for. Otherwise we could miss e.g. the "trailer" at the end of a corrupt PDF document, which is why the referenced document didn't work.

Currently we do all searching on the "raw" bytes of the PDF document, for efficiency, however this doesn't really work when we need to check for *multiple* potential command-strings. To keep the complexity manageable we'll instead use regular expressions here, but we can at least avoid creating lots of substrings thanks to the `RegExp.lastIndex` property; which is well supported across browsers according to https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/lastIndex#browser_compatibility

Note that this repeated regular expression usage could perhaps be slightly less efficient than the old code, however this method is only invoked for corrupt PDF documents.
2022-12-19 23:01:09 +01:00
Calixte Denizet
f80880ccaa Strip out a reserved operator (9) from CFF char strings (fixes issue #15784) 2022-12-16 15:17:46 +01:00
Jonas Jenwald
26135b0313 Always parse the entire startXRefQueue in XRef.readXRef (issue 15833)
Previously we'd abort all parsing if an Error was encountered, despite the fact that multiple `startXRefQueue`-entries may be available and that continued parsing could thus eventually be able to find usable data.

Note that in the referenced PDF document the `startxref`-operator, at the end of the file, points to a position in the middle of an arbitrary `stream` which is why things break.
2022-12-15 13:46:28 +01:00
Calixte Denizet
0c1ec946aa [JS] Handle correctly choice widgets where the display and the export values are different (issue #15815) 2022-12-13 19:08:26 +01:00
Jonas Jenwald
5f8598abb7 [api-minor] Normalize the view-getter on the worker-thread
*Please note:* I don't really expect that this is will be an observable change, since virtually all PDF documents already order e.g. /MediaBox and /CropBox entries correctly.

By normalizing boundingBoxes already on the worker-thread, we can be sure that even a corrupt document won't cause issues.
Note how we're passing the `view`-getter to the `PartialEvaluator.getTextContent` method, in order to detect textContent which is outside of the page, hence it makes sense to ensure that it's formatted as expected.
Furthermore, by normalizing this once on the worker-tread we should no longer have to worry about a possibly negative width/height in the `PageViewport` constructor.

Finally, the patch also simplifies the `view`-getter a little bit.
2022-12-02 15:46:39 +01:00
Jonas Jenwald
aa5b678f94 Add default icons for FileAttachment annotations (bug 1230933)
*Please note:* This "borrows" the icons from Thunderbird.

According to the PDF specification, see https://web.archive.org/web/20220309040754if_/https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G11.2096626, we should be providing default icons for FileAttachment annotations without appearances.
2022-11-26 11:24:59 +01:00
Jonas Jenwald
4b02610e8c Re-factor and simplify the getQuadPoints helper function
The use of `Array.prototype.reduce()` is, in my opinion, hurting overall readability since it's not particularly easy to look at the relevant code and immediately understand what's going on here. Furthermore this code leads to strictly speaking unnecessary allocations and parsing, since we could just track the min/max values directly in the relevant loop instead.
2022-11-25 10:40:16 +01:00
Jonas Jenwald
d1c01b3164 Add a fallback for non-embedded *composite* Tahoma fonts (issue 15719) 2022-11-23 15:51:18 +01:00
Jonas Jenwald
2ff9799e7a Tweak assignment of common parameters in the Annotation classes
This is slightly more compact, and also unifies the format across the various classes.
2022-11-20 12:29:59 +01:00
Jonas Jenwald
c92de947b6 Reduce duplication when creating a fallback appearance for MarkupAnnotations
Currently we repeat the same color-conversion code verbatim in lots of classes, which seems completely unnecessary.
2022-11-20 12:05:25 +01:00
Tim van der Meij
d6908ee145
Merge pull request #15701 from Snuffleupagus/move-string-helpers
Move some string helper functions to the worker-thread
2022-11-19 11:20:07 +01:00
Jonas Jenwald
70d362f22c Remove an unnecessary variable in getPdfManager, in the src/core/worker.js file
Another tiny piece of clean-up, since adding a `catch`-handler to a Promise shouldn't require an intermediate variable.
2022-11-17 15:31:41 +01:00
Jonas Jenwald
a2a200175f Remove unnecessary function names in the src/core/worker.js file
Currently *some* functions in this file have names while others don't, and in a few cases the names are no longer entirely accurate.
For the relevant functions there should really be no need to name them, and if memory serves this was originally done since browsers (many years ago) didn't always handle anonymous functions correctly in stack traces.
2022-11-17 15:12:48 +01:00
Jonas Jenwald
9adc7859c8 Move the escapeString helper function into the worker-thread
Given that this helper function is only used on the worker-thread, there's no reason to duplicate it in both of the `pdf.js` and `pdf.worker.js` files.
2022-11-16 12:35:48 +01:00
Jonas Jenwald
e5859e145d Move the isAscii helper function into the worker-thread
Given that this helper function is only used on the worker-thread, there's no reason to duplicate it in both of the `pdf.js` and `pdf.worker.js` files.
2022-11-16 12:35:48 +01:00
Jonas Jenwald
2eaa708e3a Combine the stringToUTF16String and stringToUTF16BEString helper functions
Given that these functions are virtually identical, with the latter only adding a BOM, we can combine the two. Furthermore, since both functions were only used on the worker-thread, there's no reason to duplicate this functionality in both of the `pdf.js` and `pdf.worker.js` files.
2022-11-16 12:35:44 +01:00
Jonas Jenwald
f358e76f5b Move the _isOffscreenCanvasSupported property to the base Annotation class
Having just played around with adding FreeText-annotations and then trying to print, there were `FreeTextAnnotation: OffscreenCanvas is not supported, annotation may not render correctly.` messages printed in the console.
The reason for this is that `FreeTextAnnotation` inherits from `MarkupAnnotation`, however only `WidgetAnnotation` actually defines the `_isOffscreenCanvasSupported` property.
2022-11-15 16:30:53 +01:00
Jonas Jenwald
d22eb3591e Change the assert in Parser.findDefaultInlineStreamEnd to a non-PRODUCTION one
Given that this `assert` is only intended to catch any implementation bugs in our code, and not actually to validate the PDF data directly[1], we can avoid making this function call unconditionally.

---
[1] In those cases, for example a `FormatError` should have been thrown instead.
2022-11-12 16:30:58 +01:00
Jonas Jenwald
595711bd7c
Merge pull request #15679 from Snuffleupagus/bug-1799927-2
Use the *full* inline image as the cacheKey in `Parser.makeInlineImage` (bug 1799927)
2022-11-10 22:54:48 +01:00
Calixte Denizet
3ca03603c2 [Annotation] Fix printing/saving for annotations containing some non-ascii chars and with no fonts to handle them (bug 1666824)
- For text fields
 * when printing, we generate a fake font which contains some widths computed thanks to
   an OffscreenCanvas and its method measureText.
   In order to avoid to have to layout the glyphs ourselves, we just render all of them
   in one call in the showText method in using the system sans-serif/monospace fonts.
 * when saving, we continue to create the appearance streams if the fonts contain the char
   but when a char is missing, we just set, in the AcroForm dict, the flag /NeedAppearances
   to true and remove the appearance stream. This way, we let the different readers handle
   the rendering of the strings.
- For FreeText annotations
  * when printing, we use the same trick as for text fields.
  * there is no need to save an appearance since Acrobat is able to infer one from the
    Content entry.
2022-11-10 19:05:39 +01:00
Jonas Jenwald
7abb6429b0 Initialize the dictionary *lazily* when parsing inline images
This helps improve performance for some PDF documents with a huge number of inline images, e.g. the PDF document from issue 2618.
Given that we no longer create `Stream`-instances unconditionally, we also don't need `Dict`-instances for cached inline images (since we only access the filter).
2022-11-10 18:27:26 +01:00
Jonas Jenwald
b46e0d61cf Use the *full* inline image as the cacheKey in Parser.makeInlineImage (bug 1799927)
*Please note:* This only fixes the "wrong letter" part of bug 1799927.

It appears that the simple `computeAdler32` function, used when caching inline images, generates hash collisions for some (very short) TypedArrays. In this case that leads to some of the "letters", which are actually inline images, being rendered incorrectly.
Rather than switching to another hashing algorithm, e.g. the `MurmurHash3_64` class, we simply cache using a stringified version of the inline image data as the cacheKey to prevent any future collisions. While this will (naturally) lead to slightly higher peak memory usage, it'll however be limited to the current `Parser`-instance which means that it's not persistent.

One small benefit of these changes is that we can avoid creating lots of `Stream`-instances for already cached inline images.
2022-11-10 18:27:26 +01:00
Jonas Jenwald
f7449563ef
Merge pull request #15659 from sxyuan/system-font-name-fix
[api-minor] Propagate the translated font name to TextContentItem for system fonts
2022-11-08 21:56:49 +01:00
Samuel Yuan
36fb5c1e2b Propagate the translated font name to TextContentItems.
This allows font data for system fonts to be looked up in the
PDFObjects.
2022-11-08 11:16:21 -08:00
Jonas Jenwald
c8868a1c7a [api-minor] Initialize the unicode-category *lazily* on the Glyph-instance
The purpose of this patch is twofold:
 - Initialize the unicode-category data *lazily* during text-extraction, since this is completely unused during general parsing/rendering.
 - Stop exposing this data in the API, since it's unused on the main-thread and it seems like it was *accidentally* included.

Obviously these changes are API-observable, but hopefully no user is depending on this. Furthermore, it's trivial for a user to re-create this unicode-category data manually with a regular expression (from the exposed `unicode` property).
2022-11-05 10:12:17 +01:00
Jonas Jenwald
c33b8d7692 Cache the normalized unicode-value on the Glyph-instance
Currently, during text-extraction, we're repeatedly normalizing and (when necessary) reversing the unicode-values every time. This seems a little unnecessary, since the result won't change, hence this patch moves that into the `Glyph`-instance and makes it *lazily* initialized.

Taking the `tracemonkey.pdf` document as an example: When extracting the text-content there's a total of 69236 characters but only 595 unique `Glyph`-instances, which mean a 99.1 percent cache hit-rate. Generally speaking, the longer a PDF document is the more beneficial this should be.

*Please note:* The old code is fast enough that it unfortunately seems difficult to measure a (clear) performance improvement with this patch, so I completely understand if it's deemed an unnecessary change.
2022-11-03 22:36:53 +01:00
Jonas Jenwald
23930a249e [api-minor] Let Catalog.getAllPageDicts return an *empty* dictionary when loading the first /Page fails (issue 15590)
In order to support opening certain corrupt PDF documents, particularly hand-edited ones, this patch adds support for letting the `Catalog.getAllPageDicts` method fallback to returning an *empty* dictionary to replace (only) the first /Page of the document.
Given that the viewer cannot initialize/load without access to the first page, this will thus allow e.g. document-level scripting to run as expected. Note that by effectively replacing a corrupt or missing first /Page in this way[1], we'll now render nothing but a *blank* page for certain cases of broken/corrupt PDF documents which may look weird.

*Please note:* This functionality is controlled via the existing `stopAtErrors` option, that can be passed to `getDocument`, since it's easy to imagine use-cases where this sort of fallback behaviour isn't desirable.

---
[1] Currently we still require that a /Pages-dictionary is found though, however it *may* be possible to relax even that assumption if that becomes absolutely necessary in future corrupt documents.
2022-11-03 12:51:48 +01:00
Jonas Jenwald
2516ffa78e Fallback to finding the first "obj" occurrence, when the trailer-dictionary is incomplete (issue 15590)
Note that the "trailer"-case is already a fallback, since normally we're able to use the "xref"-operator even in corrupt documents. However, when a "trailer"-operator is found we still expect "startxref" to exist and be usable in order to advance the stream position. When that's not the case, as happens in the referenced issue, we use a simple fallback to find the first "obj" occurrence instead.

This *partially* fixes issue 15590, since without this patch we fail to find any objects at all during `XRef.indexObjects`. However, note that the PDF document is still corrupt and won't render since there's no actual /Pages-dictionary and the /Root-entry simply points to the /OpenAction-dictionary instead.
2022-11-03 12:46:30 +01:00
calixteman
e42e1cde61
Merge pull request #15615 from calixteman/bug1796741
[Form] Don't use field appearances when /NeedAppearances is set to true (bug 1796741)
2022-10-31 09:58:27 +01:00
Jonas Jenwald
caef47a0cf Remove the PdfManager.onLoadedStream method (PR 15616 follow-up)
After the clean-up in PR 15616, the `PdfManager.onLoadedStream` method now only has a single call-site.
Hence why this patch suggests that we remove this method and replace it with an *optional* parameter in `PdfManager.requestLoadedStream` instead. By making the new behaviour opt-in, we'll thus not change any existing call-site.
2022-10-29 14:42:17 +02:00
Jonas Jenwald
8b970109ea
Merge pull request #15632 from Snuffleupagus/issue-15629-2
[api-minor] Move the handling of unbalanced markedContent to the worker-thread (PR 15630 follow-up)
2022-10-29 09:37:07 +02:00
Jonas Jenwald
ba05e47b3e Combine Array.from and Array.prototype.map calls
This isn't just a tiny bit more compact, but it also avoids an intermediate allocation; please see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/from#description
2022-10-28 13:46:30 +02:00
Jonas Jenwald
1e7274e9c6 [api-minor] Move the handling of unbalanced markedContent to the worker-thread (PR 15630 follow-up) 2022-10-27 11:14:54 +02:00
Calixte Denizet
9f95a14e91 [Form] Don't use field appearances when /NeedAppearances is set to true (bug 1796741)
When a form isn't changed, we used the appearances we had in the file, but when
/NeedAppearances is true, all the appearances have to be regenerated whatever they're.
2022-10-26 12:10:51 +02:00
Jonas Jenwald
bcffbf74f3 Let the PdfManager.requestLoadedStream method return the stream
*This is very old code, and it could thus do with some simplification.*

Note how in the `src/core/worker.js` file we're combining both the `PdfManager.requestLoadedStream` and `PdfManager.onLoadedStream` methods in order to access the stream-data. This seems unnecessary, and it's simple enough to always let the `PdfManager.requestLoadedStream` method return the stream-data as well.
2022-10-24 17:00:48 +02:00
Jonas Jenwald
71bd8b4de9 Let Lexer.getNumber treat more invalid "numbers" as zero (issue 15604)
In the referenced PDF document there are "numbers" which consist only of `-.`, and while that's obviously not valid Adobe Reader seems to handle it just fine.
Letting this method ignore more invalid "numbers" was suggested during the review of PR 14543, so let's simply relax our the validation here.
2022-10-20 22:36:15 +02:00
Jonas Jenwald
e591378ff1 Restore a weaker version of the /Pages dictionary /Count check for corrupt documents (PR 15593 follow-up)
It appears that PR 15593 broke `issue12402`, and we thus need to partially restore the /Count check.
 I completely missed this when looking at the test-results for PR 15593, both locally and on the bots, since the `Driver._getLastPageNumber` method would "swallow" an unavailable page number.
2022-10-20 14:22:29 +02:00
Jonas Jenwald
36967fcedb
Merge pull request #15586 from Snuffleupagus/rm-matchesForCache
Remove the `Glyph.matchesForCache` method (PR 13494 follow-up)
2022-10-20 10:35:00 +02:00
Jonas Jenwald
3c046c0a21 Extend getSupplementalGlyphMapForCalibri with some umlauts (issue 15594) 2022-10-19 17:49:40 +02:00
Jonas Jenwald
bc13a277ce Relax the /Pages dictionary /Count check for corrupt documents (issue 9105)
After PR 14311, and follow-up patches, we no longer require that the /Count entry (in the /Pages dictionary) is either present or even valid in order to parse/render a PDF document.
Hence it seems strange to keep this requirement for *corrupt* PDF documents, when trying to find a usable `trailer` in the `XRef.indexObjects` method.
2022-10-19 12:28:25 +02:00
Jonas Jenwald
fd35cda8bc Re-factor the glyph-cache lookup in the Font._charToGlyph method
With the changes in the previous patch we can move the glyph-cache lookup to the top of the method and thus avoid a bunch of, in *almost* every case, completely unnecessary re-parsing for every `charCode`.
2022-10-19 09:55:09 +02:00
Jonas Jenwald
3e391aaed9 Remove the Glyph.matchesForCache method (PR 13494 follow-up)
This method, and its class, was originally added in PR 4453 to reduce memory usage when parsing text. Then PR 13494 extended the `Glyph`-representation slightly to also include the `charCode`, which made the `matchesForCache` method *effectively* redundant since most properties on a `Glyph`-instance indirectly depends on that one. The only exception is potentially `isSpace` in multi-byte strings.

Also, something that I noticed when testing this code: The `matchesForCache` method never worked correctly for `Glyph`s containing `accent`-data, since Objects are passed by reference in JavaScript. For affected fonts, of which there's only a handful of examples in our test-suite, we'd fail to find an already existing `Glyph` because of this.
2022-10-19 09:54:35 +02:00
Jonas Jenwald
de99f99a01 Fallback and try a *previous* generation if all else fails in XRef.indexObjects (issue 15577)
When we fail to find a usable PDF document `trailer` *and* there were errors during parsing, try and fallback to a *previous* generation as a last resort during fetching of uncompressed references.
*Please note:* This will not affect "normal" PDF documents, with valid /XRef data, and even most *corrupt* documents should be completely unaffected by these changes.
2022-10-18 20:24:01 +02:00
Tim van der Meij
06599f487f
Merge pull request #15576 from Snuffleupagus/version
Re-factor the PDF version parsing in the worker-thread
2022-10-15 13:03:43 +02:00
Tim van der Meij
2508792f29
Merge pull request #15572 from Snuffleupagus/simpleFontToUnicode-refactor
Slightly re-factor `PartialEvaluator._simpleFontToUnicode`
2022-10-15 12:31:27 +02:00
Jonas Jenwald
d470010293 Re-factor the PDF version parsing in the worker-thread
Part of this is very old code, and back when support for parsing the catalog-version was added things became less clear (in my opinion).
Hence this patch tries to improve things, by e.g. validating the header- and catalog-version separately.
2022-10-15 12:06:39 +02:00
Jonas Jenwald
15d4d80d45
Merge pull request #15563 from Snuffleupagus/issue-15559
Take the /CIDToGIDMap into account when getting the glyph mapping for CFF fonts (issue 15559)
2022-10-14 09:13:41 +02:00
Jonas Jenwald
fa47d4b9b1 Slightly re-factor PartialEvaluator._simpleFontToUnicode
Given the sheer number of heuristics added to this method over the years, moving the *valid* unicode found case to the top should improve readability of the code.
2022-10-13 21:42:57 +02:00
Jonas Jenwald
f2f0a1e871 [api-minor] Stop sending "UnsupportedFeature" from the worker-thread GetOperatorList-handling
This code was added all the way back in PR 6698, almost seven years ago, for backwards compatibility reasons. At this point in time, it seems that we can remove that since:
 - We have more fine-grained "UnsupportedFeature" reporting elsewhere in the worker-thread code nowadays.
 - The GetOperatorList-handling is now using `ReadableStream`s, which means that errors are being forwarded to the main-thread anyway.
 - We're also no longer displaying a notification-bar, in the *built-in* Firefox PDF Viewer, for any of these "UnsupportedFeature" messages.
2022-10-13 11:46:17 +02:00
Jonas Jenwald
858d941ff8 Take the /CIDToGIDMap into account when getting the glyph mapping for CFF fonts (issue 15559)
*Please note:* I don't really know what I'm doing here, however the patch appears to fix the referenced issue when comparing the rendering with Adobe Reader (with the caveat that I don't speak the language in question).
2022-10-13 10:02:25 +02:00
Jonas Jenwald
5bc6f964db Slightly re-factor the version fetching in PDFDocument.checkHeader
Note how after having found the "%PDF-" prefix we then read both the prefix and the version in the loop, only to then remove the prefix at the end.
It seems better to instead advance the stream position past the "%PDF-" prefix, and then read only the version data.

Finally the loop-condition can also be simplified slightly, to further clean-up some very old code.
2022-10-11 13:15:01 +02:00
Jonas Jenwald
081e897588 Ensure that Page.getOperatorList handles Annotation parsing errors correctly (issue 15557)
*Fixes a regression from PR 15246, sorry about that!*

The return value of all `Annotation.getOperatorList` methods was changed in PR 15246, however I missed updating the error code-path in `Page.getOperatorList` which thus breaks all operatorList-parsing for pages with corrupt Annotations.
2022-10-10 09:48:01 +02:00
Tim van der Meij
dff444d441
Merge pull request #15555 from Snuffleupagus/improve-GetDocRequest
Clean-up the data that we're sending with "GetDocRequest"
2022-10-09 14:10:44 +02:00
Jonas Jenwald
8a4f6aca97 Stop using the source-object when sending "GetDocRequest"
Looking at the code on the worker-thread, there doesn't appear to be any particular reason for placing *some* of the properties in a `source`-object when sending them with "GetDocRequest".
As is often the case the explanation for this structure is rather "for historical reasons", since originally we simply sent the `source`-object as-is. Doing that was obviously a bad idea, for a couple of reasons:
 - It makes it less clear what is/isn't actually needed on the worker-thread.
 - Sending unused properties will unnecessarily increase memory usage.
 - The `source`-object may contain unclonable data, which would break the library.
2022-10-09 12:45:24 +02:00
Jonas Jenwald
c84b717773 Group the evaluatorOptions on the main-thread, when sending "GetDocRequest"
Rather than sending all of these parameters individually and then grouping them together on the worker-thread, we can simply handle that in the API instead.
2022-10-09 12:31:03 +02:00
Jonas Jenwald
4cc98de6d7 Remove the unused CMapCompressionType.STREAM value
This was added in PR 8064, over five years ago, for a possible future CMap file-format that was never implemented.
2022-10-08 17:10:05 +02:00
Calixte Denizet
c0e165bf97 Simplify the way to compute the remainder modulo 3 in PDF20Hash function
I noticed the 256 % 3 (which is equal to 1) so I slighty simplify the code.
The sum of the 16 Uint8 doesn't exceed 2^12, hence we can just take the
sum modulo 3.
2022-10-07 14:43:31 +02:00
Jonas Jenwald
3cb119cb32
Merge pull request #15539 from Snuffleupagus/DecryptStream-set
Replace loop with `TypedArray.prototype.set` in the `DecryptStream.readBlock` method
2022-10-07 11:14:28 +02:00
Jonas Jenwald
1ea4c4b519 [api-minor] Make isOffscreenCanvasSupported configurable via the API (issue 14952)
This patch first of all makes `isOffscreenCanvasSupported` configurable, defaulting to `true` in browsers and `false` in Node.js environments, with a new `getDocument` parameter. While you normally want to use this, in order to improve performance, it should still be possible for users to control it (similar to e.g. `isEvalSupported`).

The specific problem, as reported in issue 14952, is that the SVG back-end doesn't support the new ImageMask data-format that's introduced in PR 14754. In particular:
 - When the SVG back-end is used in Node.js environments, this patch will "just work" without the user needing to make any code changes.
 - If the SVG back-end is used in browsers, this patch will require that `isOffscreenCanvasSupported: false` is added to the `getDocument`-call.
2022-10-07 00:10:46 +02:00
Jonas Jenwald
6877d8b9e2 Replace loop with TypedArray.prototype.set in the DecryptStream.readBlock method
There's no reason to use a manual loop, when a native method exists.
2022-10-06 14:43:24 +02:00
Jonas Jenwald
ce66fefbff [api-minor] Add partial support for the "GoToE" action (issue 8844)
*Please note:* The referenced issue is the only mention that I can find, in either GitHub or Bugzilla, of "GoToE" actions.
Hence why I've purposely settled for a very simple, and partial, "GoToE" implementation to avoid complicating things initially.[1] In particular, this patch only supports "GoToE" actions that references the /EmbeddedFiles-dict in the PDF document.

See https://web.archive.org/web/20220309040754if_/https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G11.2048909

---
[1] Usually I always prefer having *real-world* test-cases to work with, whenever I'm implementing new features.
2022-10-06 10:33:07 +02:00
Jonas Jenwald
60f6272ed9 Use more for...of loops in the code-base
Most, if not all, of this code is old enough to predate the general availability of `for...of` iteration.
2022-10-03 13:08:38 +02:00
Jonas Jenwald
c87f90102c Add more non-standard ligatures in the glyphlist.js file (issue 15516)
Note that this PR only adds the "underscore"-variant of *actually existing* ligatures, however the referenced PDF document also uses a couple of non-standard ones (e.g. `ft`, `Th`, and `fh`) that we cannot easily support without larger changes (since they don't have official Unicode-entries).
Given that it's clearly the PDF document, and its fonts, that's the culprit here it's not entirely clear to me that we actually want to attempt a larger refactoring/rewriting of the `glyphlist.js` code, assuming it's even generally possible. Especially when this patch alone already improves our copy-paste behaviour when compared to both Adobe Reader and PDFium, and that this is only the *second* time this sort of bug has been reported.
2022-09-27 16:31:51 +02:00
calixteman
da1780f826
Merge pull request #15486 from nmtigor/fix_orders_of_prop
Fix property chain orders of Operators in isDotExpression
2022-09-25 04:13:25 -10:00
Jonas Jenwald
6538409282 Replace some Array.prototype-usage with spread syntax
We have a few, quite old, call-sites that use the `Array.prototype`-format and which can now be replaced with spread syntax instead.
2022-09-23 09:35:30 +02:00
Jonas Jenwald
f1b0dc6f04 Tweak the heuristic that handles JPEG images with a wildly incorrect SOF (Start of Frame) scanLines parameter (issue 15492) 2022-09-22 14:09:04 +02:00
nmtigor
22cc9b7dc7 Fix property chain orders of Operators in isDotExpression and isSomPredicate 2022-09-21 17:20:23 +02:00
Calixte Denizet
198e9a3db1 Initialize values in the path bounding box before flushing the operator list (bug 1791583)
OperatorList.addOp can trigger a flush if it's required, hence the values passed to it must
be correctly initialized in order to avoid some wrong values in the renderer.
Because of that a clip path was considered as empty, nothing was clipped, hence the wrong
rendering in bug 1791583.
2022-09-20 20:01:54 +02:00
Calixte Denizet
f5b835157b [XFA] Fix an hidden issue in the FormCalc lexer
Since there are no script engine with XFA, the FormCalc parser is not used irl.
The bug @nmtigor noticed was hidden by another one (the wrong check on `match`).
2022-09-20 13:53:55 +02:00
Jonas Jenwald
20b9887476 Enable the unicorn/prefer-regexp-test ESLint plugin rule
Please see https://github.com/sindresorhus/eslint-plugin-unicorn/blob/main/docs/rules/prefer-regexp-test.md
2022-09-19 16:34:01 +02:00
Jonas Jenwald
bb75b36b77 Replace some unnecessary String.prototype.search usage
Most of the `String.prototype.search` call-sites found throughout the code-base is actually not necessary, since we usually only want a *boolean*, and those can be replaced with `RegExp.prototype.test` instead.
2022-09-19 12:51:46 +02:00
Jonas Jenwald
7a19def34c Extend getSupplementalGlyphMapForCalibri with more entries (issue 15443) 2022-09-15 22:19:16 +02:00
Jonas Jenwald
2f2ecad8fd Extend getGlyphMapForStandardFonts with some quote-entries (issue 15441) 2022-09-15 11:37:20 +02:00
Jonas Jenwald
947d390421 Fallback to a standard font when a Type1 font program is empty (issue 15292)
*Please note:* This is only a, hopefully generally helpful, work-around rather than a proper solution to issue 15292.

There's something that's "special" about the Type1 fonts in the referenced PDF document, since we don't manage to find any actual font programs and thus cannot render anything.
Given that it shouldn't make sense for a Type1 font program to ever be empty, since that means that there's no glyph-data to render, we simply fallback to a standard font to at least try and render *something* in these rare cases.
2022-09-05 12:07:19 +02:00
Jonas Jenwald
12d60e0acf Don't allow adjustToUnicode to extend a built-in /ToUnicode map (issue 15352)
Given that the change in PR 13393 was slightly speculative, given the lack of test-cases, let's just revert part of that to fix the referenced issue.
Based on a quick look at old issues and existing test-cases, it seems that most (if not all) PDF documents that benefit from using the font-data in this way lack any /ToUnicode maps which should mean that they're unaffected by these changes.
2022-09-03 23:11:42 +02:00
Jonas Jenwald
cc4baa2fe9 [api-minor] Add basic support for the SetOCGState action (issue 15372)
Note that this patch implements the `SetOCGState`-handling in `PDFLinkService`, rather than as a new method in `OptionalContentConfig`[1], since this action is nothing but a series of `setVisibility`-calls and that it seems quite uncommon in real-world PDF documents.

The new functionality also required some tweaks in the `PDFLayerViewer`, to ensure that the `layersView` in the sidebar is updated correctly when the optional-content visibility changes from "outside" of `PDFLayerViewer`.

---
[1] We can obviously move this code into `OptionalContentConfig` instead, if deemed necessary, but for an initial implementation I figured that doing it this way might be acceptable.
2022-09-01 17:34:24 +02:00
Jonas Jenwald
216b86a082 [api-minor] Support Named-actions in the outline (issue 15367)
Apparently this is implemented in e.g. Adobe Reader, and the specification does support it, however it cannot be commonly used in real-world PDF documents since it took over ten years for this feature to be requested.
2022-08-30 18:47:45 +02:00
Calixte Denizet
c06c5f7cbd [Annotations] charLimit === 0 means unlimited (bug 1782564)
Changing the charLimit in JS had no impact, so this patch aims to fix
that and add an integration test for it.
2022-08-19 11:28:28 +02:00
Jonas Jenwald
6a2c2a646f Remove the remaining closures in the src/core/type1_parser.js file
Given that the code is written with JavaScript module-syntax, none of this functionality will "leak" outside of this file with these change.
By removing this closure the file-size is decreased, even for the *built* `pdf.worker.js` file, since there's now less overall indentation in the code.
2022-08-14 12:50:26 +02:00
Jonas Jenwald
e5e756c0b4 Remove the remaining closures in the src/core/cff_parser.js file
Given that the code is written with JavaScript module-syntax, none of this functionality will "leak" outside of this file with these changes.
For e.g. the `gulp mozcentral` command the *built* `pdf.worker.js` file-size decreases `~2 kB` with this patch, and most of the improvement comes from having less overall indentation in the code.
2022-08-13 19:48:17 +02:00
Jonas Jenwald
9dcfdb9578 Remove the remaining closure in the src/core/function.js file
Given that the code is written with JavaScript module-syntax, none of this functionality will "leak" outside of this file with these changes.
By removing this closure the file-size is decreased, even for the *built* `pdf.worker.js` file, since there's now less overall indentation in the code.
2022-08-13 12:52:36 +02:00
Calixte Denizet
04f78c935c Fix OTS issue with empty index (#15289) 2022-08-08 22:56:26 +02:00
Tim van der Meij
2a84a3078b
Merge pull request #15283 from Snuffleupagus/sort-PopupAnnotation
[api-minor] Sort PopupAnnotations already on the worker-thread (PR 11535 follow-up)
2022-08-06 15:07:09 +02:00
Jonas Jenwald
876a02a504 [api-minor] Sort PopupAnnotations already on the worker-thread (PR 11535 follow-up)
By doing this in the worker-thread this code will only need to run *once*, whereas currently re-rendering of a page forces this to be repeated (e.g. after it's been scrolled out-of-view and then back into view again).
2022-08-06 11:42:45 +02:00
Jonas Jenwald
f6db7975c5 Enable the ESLint prefer-spread rule
Note that in a couple of spots the argument could be `undefined` and there we simply disable the rule instead.

Please refer to https://eslint.org/docs/latest/rules/prefer-spread
2022-08-06 10:17:00 +02:00
Calixte Denizet
31155740c3 [Annotation] Add a div containing the text of a FreeText annotation (bug 1780375)
An annotation doesn't have to be in the text flow, hence it's likely a bad idea
to insert its text in the text layer. But the text must be visible from a screen
reader point of view so it must somewhere in the DOM.
So with this patch, the text from a FreeText annotation is extracted and added in
a div in its HTML counterpart, and with the patch #15237 the text should be visible
and positioned relatively to the text flow.
2022-08-04 11:14:05 +02:00
Jonas Jenwald
0c31320c12 [api-minor] Improve thumbnail handling in documents that contain interactive forms
To improve performance of the sidebar we use the page-canvases to generate the thumbnails whenever possible, since that avoids unnecessary re-rendering when the sidebar is open. This works generally well, however there's an old problem in PDF documents that contain interactive forms (when those are enabled): Note how the thumbnails become partially (or fully) blank, since those Annotations are not included in the OperatorList.[1]

We obviously want to keep using the `PDFThumbnailView.setImage`-method for most documents, however we need a way to skip it only for those pages that contain interactive forms.
As it turns out it's unfortunately not all that simple to tell, after the fact, from looking only at the OperatorList that some Annotations were skipped. While it might have been possible to try and infer that in the viewer, it'd not have been pretty considering that at the time when rendering finishes the annotationLayer has not yet been built.
The overall simplest solution that I could come up with, was instead to include a *summary* of the interactive form-state when doing the final "flushing" of the OperatorList and expose that information in the API.

---
[1] Some examples from our test-suite: `annotation-tx2.pdf` where the thumbnail is completely blank, and `bug1737260.pdf` where the thumbnail is missing the "buttons" found on the page.
2022-07-30 16:53:32 +02:00
Calixte Denizet
d092a85b6c Fix wrong order of arguments when calling the CipherTransform ctor (bug 1782186) 2022-07-29 12:46:45 +02:00
Jonas Jenwald
2fb083f3e2 Ensure that the isUsingOwnCanvas-parameter is consistently included in operatorLists (PR 14247 follow-up)
Currently some `OPS.beginAnnotation` arguments will contain a `Number` value for the `isUsingOwnCanvas`-parameter, or in some cases an `undefined` value, which is inconsistent from an API perspective.
2022-07-28 13:37:37 +02:00
Calixte Denizet
7831a100b3 [Editor] Add the possibility to change line opacity in Ink editor 2022-07-27 18:46:25 +02:00
Jonas Jenwald
fc018ea9ea Support images with /Filter-entries that contain Arrays (issue 15220)
This patch "borrows" the code found in the `Parser.makeInlineImage`-method, to ensure that JBIG2 and JPX images can be rendered correctly.
2022-07-25 08:41:37 +02:00
Jonas Jenwald
60bd9580e2 Ignore invalid /CIDToGIDMap-entries when parsing fonts (issue 15139)
In the referenced PDF document the fonts have /CIDToGIDMap-entries that cannot be loaded. Hence, only when `ignoreErrors` is set, we'll now ignore these corrupt /CIDToGIDMap-entries and fallback to simply assume that no such data is available.

Given that this is *clearly* a case of a corrupt PDF document, there's no guarantee that this will "fix" things in the general case since a /CIDToGIDMap may be *required* in order for some composite fonts to render correctly. However, attempting to render *something* is surely better than skipping a font altogether.
2022-07-20 11:58:44 +02:00
Jonas Jenwald
37ebc28756 Use more for...of loops in the code-base
Note that these cases, which are all in older code, were found using the [`unicorn/no-for-loop`](https://github.com/sindresorhus/eslint-plugin-unicorn/blob/main/docs/rules/no-for-loop.md) ESLint plugin rule.
However, note that I've opted not to enable this rule by default since there's still *some* cases where I do think that it makes sense to allow "regular" for-loops.
2022-07-17 16:18:54 +02:00
Jonas Jenwald
de7d1d2167
Merge pull request #15170 from calixteman/js_rm_null
[JS] Embedded JS scripts can have some null chars
2022-07-15 17:11:29 +02:00
Jonas Jenwald
acd61a138e Handle errors in the "Loading by ref" code-path in PartialEvaluator.loadFont
Note how we currently throw a "raw" Error, which is problematical since all of the `PartialEvaluator.loadFont` call-sites expect a Promise to be returned. Furthermore, this also means that we don't benefit from the fallback code-path that now exists below.

*Please note:* Unfortunately I don't have a test-case that fails without this patch, since it's something I happened to notice when reading the code while working on another patch.
2022-07-15 16:33:36 +02:00
Calixte Denizet
5f0c95e70e [JS] Embedded JS scripts can have some null chars 2022-07-15 16:05:25 +02:00
calixteman
41b2f52f70
Merge pull request #15157 from calixteman/1778484
Add unicode mapping in the font cmap to have correct chars when printing in pdf (bug 1778484)
2022-07-13 14:45:12 +02:00
Calixte Denizet
680c293c34 Add unicode mapping in the font cmap to have correct chars when printing in pdf (bug 1778484)
It aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1778484.
2022-07-13 14:38:27 +02:00
Jonas Jenwald
dcc73423e5 Enable the unicorn/prefer-logical-operator-over-ternary ESLint plugin rule
This leads to ever so slightly more compact code, and can in some cases remove the need for a temporary variable.

Please find additional information here:
https://github.com/sindresorhus/eslint-plugin-unicorn/blob/main/docs/rules/prefer-logical-operator-over-ternary.md
2022-07-12 10:52:37 +02:00
Jonas Jenwald
c2f7942aea Ensure that the /Resources-entry is actually a dictionary (issue 15150)
Prevent issues in *corrupt* PDF documents, if the /Resources-entry is not of the correct and expected type.
2022-07-08 12:43:43 +02:00
Jonas Jenwald
79cfc548fc Improve text-selection for Type3 fonts with bogus /FontBBox-entries (issue 14999)
This extends PR 13461, by also building a fallback bounding box for Type3 fonts that contain a much too small /FontBBox-entry.

*Please note:* While this patch improves things overall, copy-and-pasting still doesn't work perfectly for this document. In particular the lowercase letter "c" cannot be selected/copied, however this can be reproduced in both Adobe Reader and PDFium (in Google Chrome) too, which is caused by a lack of proper /ToUnicode-data in the PDF document.
2022-07-05 14:27:14 +02:00
Calixte Denizet
1a3ef2a0aa [editor] Add some UI elements in order to set font size & color, and ink thickness & color 2022-06-28 12:05:04 +02:00
Calixte Denizet
3789dab307 Always flush the current item with MarkedContent stuff when getting text (#15094) 2022-06-25 17:19:57 +02:00
calixteman
23fcdabb37
Merge pull request #15088 from calixteman/editor_rotation
Support rotating editor layer
2022-06-25 16:18:07 +02:00
Calixte Denizet
0c420f5135 Support rotating editor layer
- As in the annotation layer, use percent instead of pixels as unit;
- handle the rotation of the editor layer in allowing editing when rotation
  angle is not zero;
- the different editors are rotated counterclockwise in order to be usable
  when the main page is itself rotated;
- add support for saving/printing rotated editors.
2022-06-24 20:02:32 +02:00
Jonas Jenwald
c48dc251e0 Add (basic) support for Optional Content in Annotations
Given that Annotations can also have an `OC`-entry, we need to take that into account when generating their operatorLists.

Note that in order to simplify the patch the `getOperatorList`-methods, for the Annotation-classes, were converted to be `async`.
2022-06-24 15:19:56 +02:00
Calixte Denizet
e49d039853 Correctly order added annotations when saving or printing
- the annotations must be rendered in the same order as the chronological one.
- fix a bug in document.js which avoids to read a saved pdf correctly in Acrobat:
  there is no need to reset the xref state: it's done in worker.js once everything
  has been saved.
2022-06-23 17:39:12 +02:00
Calixte Denizet
30c63eb0ec [Editor] Add support for printing newly added FreeText annotations 2022-06-22 13:26:09 +02:00
Jonas Jenwald
eca939d904
Merge pull request #15076 from Snuffleupagus/prefer-array-index-of
Enable the `prefer-array-index-of` ESLint plugin rule
2022-06-21 18:57:51 +02:00
Calixte Denizet
f27c8c4471 [Editor] Add support for printing newly added Ink annotations 2022-06-21 18:21:49 +02:00
calixteman
8d466f5dac
Merge pull request #15060 from calixteman/annotation_rotation
Rotate annotations based on the MK::R value (bug 1675139)
2022-06-21 18:03:09 +02:00
Calixte Denizet
cdc58b7a52 Rotate annotations based on the MK::R value (bug 1675139)
- it aims to fix: https://bugzilla.mozilla.org/show_bug.cgi?id=1675139;
- An annotation can be rotated (counterclockwise);
- the rotation can be set in using JS.
2022-06-21 17:57:26 +02:00
Jonas Jenwald
1c9a702f73 Enable the prefer-array-index-of ESLint plugin rule
https://github.com/sindresorhus/eslint-plugin-unicorn/blob/main/docs/rules/prefer-array-index-of.md
2022-06-21 16:54:32 +02:00
Jonas Jenwald
57c10ac213 Simplify the newRefs computation in the "SaveDocument"-handler in the worker-thread
- Let the `Page.save`-method filter out "empty" entries, similar to the `Page._parsedAnnotations`-getter, since that on its own already simplifies the "SaveDocument"-handler a tiny bit.

 - The existing `reduce` and `concat` construction isn't exactly a wonder of readability :-)
   Thanks to modern JavaScript features it should be possible to replace all of this with `Array.prototype.flat()` instead, which at least to me feels a lot easier to understand.
2022-06-19 18:21:51 +02:00
Jonas Jenwald
c21f4faaf8 Reduce unnecessary usage of Array.prototype.concat()
There are obviously cases where using `concat` makes perfect sense, since that method doesn't change any of the existing Arrays; see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/concat

However, in a few cases throughout the code-base that's not an issue and using `concat` only leads to unnecessary intermediate allocations. With modern JavaScript we can thus replace those with a combination of `push` and spread-syntax, which wasn't originally possible when the code was written.
2022-06-19 13:40:52 +02:00
Calixte Denizet
e2db9bacef Get rid of CSS transform on each annotation in the annotation layer
- each annotation has its coordinates/dimensions expressed in percentage,
  hence it's correctly positioned whatever the scale factor is;
- the font sizes are expressed in percentage too and the main font size
  is scaled thanks a css var (--scale-factor);
- the rotation is now applied on the div annotationLayer;
- this patch improve the rendering of some strings where the glyph spacing
  was not correct (it's a Firefox bug);
- it helps to simplify the code and it should slightly improve the update of
  page (on zoom or rotation).
2022-06-18 17:54:59 +02:00
Jonas Jenwald
64cce1269e Add basic support for non-embedded ArialUnicodeMS fonts (issue 15044)
This appears to be a Microsoft-specific version of the regular Arial font, hence we simply map this to Helvetica in the same way that we treat many other Arial-named fonts.
2022-06-15 10:37:20 +02:00
Jonas Jenwald
2dca14028d Extend getGlyphMapForStandardFonts with some Hebrew entries (issue 15033)
This only adds the minimum entries required in order to render the referenced document correctly, rather than trying to support "all" Hebrew glyphs, to ensure that all lines in `getGlyphMapForStandardFonts` are covered by tests.
2022-06-13 10:08:39 +02:00
Tim van der Meij
26ae50e449
Merge pull request #15023 from Snuffleupagus/prefer-array-flat
Enable the `unicorn/prefer-array-flat` and `unicorn/prefer-array-flat-map` ESLint plugin rules
2022-06-12 20:10:52 +02:00
Jonas Jenwald
bbf857d635 [api-minor] Stop using the beginAnnotations/endAnnotations operators (PR 14998 follow-up)
After the changes in PR 14998, these operators are now no-ops in the `src/display/canvas.js` code and should no longer be necessary.
Given that `beginAnnotations`/`endAnnotations` are not in the PDF specification, but are rather *custom* PDF.js operators, it seems reasonable to stop using them now that they've become no-ops.
2022-06-11 14:21:26 +02:00
Jonas Jenwald
010d996b74 Enable the unicorn/prefer-array-flat and unicorn/prefer-array-flat-map ESLint plugin rules
These rules will help enforce shorter and more readable code, and according to MDN these Array-methods are available in all browsers/environments that we currently support:
 - https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/flat#browser_compatibility
 - https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/flatMap#browser_compatibility

Please find additional information about these ESLint rules here:
 - https://github.com/sindresorhus/eslint-plugin-unicorn/blob/main/docs/rules/prefer-array-flat.md
 - https://github.com/sindresorhus/eslint-plugin-unicorn/blob/main/docs/rules/prefer-array-flat-map.md
2022-06-11 11:33:43 +02:00
Jonas Jenwald
9ac4536693 Enable the unicorn/prefer-at ESLint plugin rule (PR 15008 follow-up)
Please find additional information here:
 - https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/at
 - https://github.com/sindresorhus/eslint-plugin-unicorn/blob/main/docs/rules/prefer-at.md
2022-06-09 21:21:19 +02:00
calixteman
5d88233fbb
Merge pull request #15006 from calixteman/ink2
[editor] Add support for saving newly added Ink
2022-06-09 21:13:53 +02:00
Jonas Jenwald
66bbc0e7ee Call WidgetAnnotation._getTextWidth correctly from the ChoiceWidgetAnnotation-class (PR 14720 follow-up)
In the "no fontSize available" code-path, in the `ChoiceWidgetAnnotation._getAppearance` method, we don't provide the necessary second argument when calling the `_getTextWidth`-method which will cause errors to be thrown.
2022-06-09 10:11:01 +02:00
Calixte Denizet
36aae436bf [editor] Add support for saving newly added Ink 2022-06-08 22:16:01 +02:00
calixteman
2fbf14ace8
Merge pull request #14978 from calixteman/editor2
[editor] Add support for saving a newly added FreeText
2022-06-08 15:51:03 +02:00
Calixte Denizet
7773b3f5be [edition] Add support for saving a newly added FreeText 2022-06-08 14:34:09 +02:00
Calixte Denizet
2dd0c861bf Outline fields which are required (bug 1724918)
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1724918;

- it applies for both Acroform and XFA.
2022-06-07 17:02:11 +02:00
Jonas Jenwald
e82ad79eb9 Conditionally bundle gulp image_decoders-specific code in src/core/jbig2.js (PR 9729 follow-up)
This method/function was added only for the `gulp image_decoders`-builds, and is completely unused elsewhere (e.g. in the Firefox PDF Viewer).
While this only reduces the size of the *built* `pdf.worker.js` file by a little over 1 kB, it can't hurt to remove completely unused code from the "normal" builds.
2022-06-05 15:38:28 +02:00
Calixte Denizet
9d82106d20 Set the text fields font size based on their height
- right now we're using the font size from the pdf itself but we use an other font
  in the annotation layer. So this size doesn't really make sense and leads to bad
  rendering (see pdf in #14928);
- use a sans-serif font for the fields containing text (fix issue #14736);
- remove useless padding in text-based fields (fix issue #14301);
- text fields allow/disallow scrolling bars (see bit 24 in Ff entry), so use this
  value to hide/show scrollbars in annotation layer.
2022-05-28 18:00:39 +02:00
Jonas Jenwald
5a2899c57e Skip bogus d1 operators in Type3-glyphs (issue 14953)
In the `src/display/canvas.js` code the `d1` operator will be used to set the clipping region, and it obviously cannot be empty since that prevents the Type3-glyph from rendering.

Also, the patch removes an outdated comment; refer to PR 12718.
2022-05-24 12:20:31 +02:00
Calixte Denizet
60498c67e4 Display background when printing or saving a text widget (issue #14928) 2022-05-19 16:41:54 +02:00
Jonas Jenwald
5a774b7ed3 Adjust the heuristics for handling of incomplete path operators (issue 14917)
This limits the heuristics for handling of incomplete path operators, see PR 9838, to only apply to *sequences* of such operators. In practice a couple of invalid path operators are (hopefully) unlikely to completely break rendering, whereas a sequence of them will easily lead to fairly chaotic rendering artifacts.
2022-05-15 11:24:39 +02:00
Jonas Jenwald
d540df0582 Use TypedArray.prototype.fill() a bit more in the code-base
Please see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/TypedArray/fill, which is implemented in all browsers that we currently support.
2022-05-13 12:42:51 +02:00
Jonas Jenwald
6bcc5b615d [api-minor] Include line endings in Line/Polyline Annotation-data (issue 14896)
Please refer to:
 - https://web.archive.org/web/20220309040754if_/https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G11.2109792
 - https://web.archive.org/web/20220309040754if_/https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G11.2096489
 - https://web.archive.org/web/20220309040754if_/https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G11.2096447

Note that we still won't attempt to use the /LE-data when creating fallback appearance streams, as mentioned in PR 13448, since custom line endings aren't common enough to warrant the added complexity.
Finally, note that according to the PDF specification we should *potentially* also take the line endings into account for FreeText Annotations. However, in that case their use is conditional on other parameters that we currently don't support.
2022-05-12 11:08:30 +02:00
Jonas Jenwald
38c82357b2
Merge pull request #14890 from calixteman/14889
[JS] Formatted value has to be a string when neither null nor undefined
2022-05-08 17:25:29 +02:00
Calixte Denizet
ab3958d6e8 [JS] Formatted value has to be a string when neither null nor undefined 2022-05-08 16:43:57 +02:00
Jonas Jenwald
6e7e9d83d8 Add support for TrueType format 12 cmaps (issue 14881)
This is, as far as I can tell, the first case we've seen of a format 12 `cmap`.
Please see https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html
2022-05-06 11:11:38 +02:00
Jonas Jenwald
8267fd8a52 Replace the AnnotationStorage.lastModified-getter with a proper hash-method
The current `lastModified`-getter, which only contains a time-stamp, is a fairly crude way of detecting if the stored data has actually been changed. In particular, when the `getRawValue`-method is used, the `lastModified`-getter doesn't cope with data being modified from the "outside".

To fix these issues[1], and to prevent any future bugs in this code, this patch introduces a new `AnnotationStorage.hash`-getter which computes a hash of the currently stored data. To simplify things this re-uses the existing `MurmurHash3_64`-implementation, which required moving that file into the `src/shared/`-folder, since its performance should be good enough here.

---
[1] Given how the `AnnotationStorage.lastModified`-getter was used, this would have been limited to *printing* of forms.
2022-05-04 15:21:30 +02:00
Jonas Jenwald
8135d7ccf6
Merge pull request #14869 from calixteman/14862
[JS] Fix few bugs present in the pdf for issue #14862
2022-05-03 18:31:31 +02:00
Calixte Denizet
094ff38da0 [JS] Fix few bugs present in the pdf for issue #14862
- since resetForm function reset a field value a calculateNow is consequently triggered.
  But the calculate callback can itself call resetForm, hence an infinite recursive loop.
  So basically, prevent calculeNow to be triggered by itself.
- in Firefox, the letters entered in some fields were duplicated: "AaBb" instead of "AB".
  It was mainly because beforeInput was triggering a Keystroke which was itself triggering
  an input value update and then the input event was triggered.
  So in order to avoid that, beforeInput calls preventDefault and then it's up to the JS to
  handle the event.
- fields have a property valueAsString which returns the value as a string. In the
  implementation it was wrongly used to store the formatted value of a field (2€ when the user
  entered 2). So this patch implements correctly valueAsString.
- non-rendered fields can be updated in using JS but when they're, they must take some properties
  in the annotationStorage. It was implemented for field values, but it wasn't for
  display, colors, ...
- it fixes #14862 and #14705.
2022-05-03 15:48:44 +02:00
Jonas Jenwald
df5a4fd0a7 Support encoded dest-strings in /GoTo destination dictionaries (issue 14864)
Interestingly enough this appears to be the very first case of *encoded* dest-strings, in /GoTo destination dictionaries, that we've actually come across. What's really fascinating is that it's less than a week after issue 14847, given that these issues are *somewhat* similar.
2022-05-02 10:14:32 +02:00
Jonas Jenwald
fbf6dee8ee [api-minor] Remove the forceClamped-functionality in the Streams (issue 14849)
As it turns out, most of the code-paths in the `PDFImage`-class won't actually pass the TypedArray (containing the image-data) to the `ColorSpace`-code. Hence we *generally* don't need to force the image-data to be a `Uint8ClampedArray`, and can just as well directly use a `Uint8Array` instead.

In the following cases we're returning the data without any `ColorSpace`-parsing, and the exact TypedArray used shouldn't matter:
 - b72a448327/src/core/image.js (L714)
 - b72a448327/src/core/image.js (L751)

In the following cases the image-data is only used *internally*, and again the exact TypedArray used shouldn't matter:
 - b72a448327/src/core/image.js (L762) with the actual image-data being defined (as `Uint8ClampedArray`) further below
 - b72a448327/src/core/image.js (L837)

*Please note:* This is tagged `api-minor` because it's API-observable, given that *some* image/mask-data will now be returned as `Uint8Array` rather than using `Uint8ClampedArray` unconditionally. However, that seems like a small price to pay to (slightly) reduce memory usage during image-conversion.
2022-04-29 14:46:30 +02:00
Jonas Jenwald
71370d012b Support destinations in NameTrees with encoded keys (issue 14847)
Initially I considered updating the `NameOrNumberTree`-implementation to handle encoded keys, however that quickly became somewhat messy (especially in the `NameOrNumberTree.get`-method) since only NameTrees using string-keys.
Hence the easiest solution, as far as I'm concerned, was thus to just update the `Catalog.destinations`-getter instead. Please note that in the referenced PDF document the `Catalog.destination`-method will thus fallback to fetch all destinations, which should be fine since this is the very first case of encoded keys that we've seen.

Also changes the `NameOrNumberTree.getAll`-method to prevent a possible run-time error, although we've so far not seen such a case, for any non-Array Kids-entries found in a NameTree/NumberTree.

Finally, to improve overall consistency and to hopefully prevent future bugs, the patch also updates a couple of other `NameTree` call-sites to correctly handle encoded keys. (Note that the `Catalog.attachments`-getter was already doing this.)
2022-04-27 11:19:55 +02:00
Jonas Jenwald
e18edf38db Add a helper function for incrementing the count of cached ImageMasks
While working on PR 14825, I couldn't help noticing that the code to increment the `count` for cached ImageMasks was repeated multiple times. Hence it makes sense, as far as I'm concerned, to move this into a helper function instead.
2022-04-24 11:10:02 +02:00
Tim van der Meij
752dee5caa
Merge pull request #14825 from Snuffleupagus/issue-14824
Ensure that worker-thread image caching doesn't break optional content (issue 14824)
2022-04-23 13:19:56 +02:00
Tim van der Meij
f9e54d9226
Merge pull request #14823 from Snuffleupagus/issue-14821
Ignore invalid /Encoding-entries when parsing fonts (issue 14821)
2022-04-23 13:19:26 +02:00
Jonas Jenwald
6c229dffb1 Ensure that worker-thread image caching doesn't break optional content (issue 14824)
Currently we only insert optionalContent-data into the operatorList the first time that an image is parsed, which will (in hindsight) obviously cause problems for cached images.
Hence we also need to insert the optionalContent-data in the various worker-thread image caches, such that it can be accessed in the fast-paths that are used to skip re-parsing of images.

In order to reduce the amount of repeated code, this patch also adds a new `OperatorList`-method that takes care of inserting the necessary data in the operatorList.
2022-04-22 14:49:16 +02:00
Jonas Jenwald
e723da7261 Ignore invalid /Encoding-entries when parsing fonts (issue 14821)
In the referenced PDF document the fonts have /Encoding-entries that are Streams (containing completely bogus data), which are thus obviously not valid here.
Hence, only when `ignoreErrors` is set, we'll now ignore these corrupt /Encoding-entries and fallback to the existing code to try and infer a usable encoding.

Given that this is *clearly* a case of corrupt PDF documents, there's no guarantee that this will "fix" all such cases, however it's the best that we do here and shouldn't really be worse than ignoring an entire font.
2022-04-22 11:49:03 +02:00
Tim van der Meij
f39219cd45
Merge pull request #14815 from Snuffleupagus/issue-14814
Ignore non-Stream /SMask-entries when parsing images (issue 14814)
2022-04-22 11:39:13 +02:00
Sean Wei
6bf978404e Use correct case for JavaScript 2022-04-21 23:56:28 +08:00
Jonas Jenwald
39d1bdde09 Ignore non-Stream /SMask-entries when parsing images (issue 14814)
This is similar to the pre-existing check used in the /Mask-case below, to handle *corrupt* PDF documents that include non-Stream /SMask-entries in images; please refer to the PDF specification:
https://web.archive.org/web/20220309040754if_/https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#page=216

*Please note:* Adobe Reader also fails to render the image on the second page, and displays an error message.
2022-04-21 12:14:08 +02:00
Jonas Jenwald
5bc7339c1b Add support for the /Catalog Base-URI when resolving URLs (issue 14802)
As far as I can tell, this is actually the very first time that we've seen a PDF document with a Base-URI specified in the /Catalog; please refer to the specification:
https://web.archive.org/web/20220309040754if_/https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G11.2097122

To simplify the overall implementation, this new parameter is accessed via the existing `BasePdfManager.docBaseUrl`-getter and will thus override any user-specified `docBaseUrl` API-parameter.
2022-04-19 17:14:52 +02:00
Calixte Denizet
4b7691baf6 Simplify min/max computations in constructPath (bug 1135277)
- most of the time the current transform is a scaling one (modulo translation),
  hence it's possible to avoid to apply the transform on each bbox and then apply
  it a posteriori;
- compute the bbox when it's possible in the worker.
2022-04-17 17:25:54 +02:00
Calixte Denizet
f62d961dfe Improve performances with image masks (bug 857031)
- it's the second part of the fix for https://bugzilla.mozilla.org/show_bug.cgi?id=857031;
- some image masks can be used several times but at different positions;
- an image need to be pre-process before to be rendered:
  * rescale it;
  * use the fill color/pattern.
- the two operations above are time consuming so we can cache the generated canvas;
- the cache key is based on the current transform matrix (without the translation part)
  and the current fill color when it isn't a pattern.
- the rendering of the pdf in the above bug is really faster than without this patch.
2022-04-16 20:48:39 +02:00
Tim van der Meij
cdb3481d6c
Merge pull request #14764 from apeltop/correct-typos
Correct typos
2022-04-10 14:55:08 +02:00
Calixte Denizet
040fcae5ab Improve performance with image masks (bug 857031)
- it aims to partially fix performance issue reported: https://bugzilla.mozilla.org/show_bug.cgi?id=857031;
- the idea is too avoid to use byte arrays but use ImageBitmap which are a way faster to draw:
  * an ImageBitmap is Transferable which means that it can be built in the worker instead of in the main thread:
    - this is achieved in using an OffscreenCanvas when it's available, there is a bug to enable them
      for pdf.js: https://bugzilla.mozilla.org/show_bug.cgi?id=1763330;
    - or in using createImageBitmap: in Firefox a task is sent to the main thread to build the bitmap so
      it's slightly slower than using an OffscreenCanvas.
  * it's transfered from the worker to the main thread by "reference";
  * the byte buffers used to create the image data have a very short lifetime and ergo the memory used is globally
    less than before.
- Use the localImageCache for the mask;
- Fix the pdf issue4436r.pdf: it was expected to have a binary stream for the image;
- Move the singlePixel trick from operator_list to image: this way we can use this trick even if it isn't in a set
  as defined in operator_list.
2022-04-09 18:26:26 +02:00
apeltop
a97dd26389 Correct typos 2022-04-09 09:43:18 +09:00
Jonas Jenwald
a919959d83 Slightly simplify the Catalog._readMarkInfo method
We don't need to first check if the Dictionary contains the key, since trying to get a non-existent key simply returns `undefined` and we're already ensuring that the value is a boolean.
Furthermore, we shouldn't need to worry about the `Object.prototype` containing enumerable properties since the checks (in `src/core/worker.js`) done for `Array.prototype` *indirectly* also cover `Object`s. (Keep in mind that an `Array` is just a special kind of `Object` in JavaScript.)
2022-04-05 16:37:51 +02:00
Jonas Jenwald
1dc4713a0b Re-factor the isLittleEndian/isEvalSupported caching
This functionality is very old, hence we should be able to improve the caching a little bit with modern JavaScript features.
2022-04-05 16:01:01 +02:00
Calixte Denizet
f4fcb59a5e Refactor some xfa*** getters in document.js
- it's a follow-up of PR #14735.
2022-04-03 20:38:12 +02:00
Jonas Jenwald
f33ce5fc2d Decode non-ASCII values found in the xfa:datasets (PR 14735 follow-up)
*Please note:* This is possibly bad/wrong in general, but I figured that submitting it for review wouldn't hurt.

It seems that even Adobe Reader doesn't handle the non-ASCII characters that appear in some of the fields correctly, however it should be pretty easy to improve things on the PDF.js side.
2022-04-01 11:54:34 +02:00
Jonas Jenwald
36a289d747
Merge pull request #14735 from calixteman/14685
[Annotations] Some annotations can have their values stored in the xfa:datasets
2022-04-01 11:30:16 +02:00
Calixte Denizet
0b597304c1 [Annotations] Some annotations can have their values stored in the xfa:datasets
- it aims to fix #14685;
- add a basic object to get values from the parsed datasets;
- these annotations don't have an appearance so we must create one when printing or saving.
2022-04-01 10:28:04 +02:00
Jonas Jenwald
addb4cb12b Use String.prototype.repeat() in a couple of spots
Rather than using a temporary Array to manually create repeated strings, we can use `String.prototype.repeat()` instead.
The reason that we didn't use this from the start is most likely because some browsers, notably IE, didn't support this; note https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/repeat#browser_compatibility
2022-03-30 15:42:40 +02:00
Calixte Denizet
ad3fb71a02 [Annotations] Add support for printing/saving choice list with multiple selections
- it aims to fix issue #12189.
2022-03-29 18:59:44 +02:00
Calixte Denizet
18e79e3c0b [text selection] Add the whitespaces present in the pdf in the text chunk
- it aims to fix issue #14627;
- the basic idea of the recent text refactoring was to only consider the rendered visible whitespaces.
  But sometimes, the heuristics aren't correct and although some whitespaces are in the text stream
  they weren't in the text chunks because they were too small. Hence we added some exceptions, for example,
  we always add a whitespace when it is between two non-whitespace chars but only when in the same Tj.
  So basically, this patch removes the constraint to have the chars in the same Tj
  (in using a circular buffer to save the two last chars) but don't add a space when the visible space is really
  too small (hence `NOT_A_SPACE_FACTOR`).
2022-03-27 14:34:56 +02:00
Jonas Jenwald
73d2ddac0d Update npm packages
Note that the Prettier update made it possible to move a couple of comments after `default:`-cases back to their original/intended positions, please see https://prettier.io/blog/2022/03/16/2.6.0.html
2022-03-20 10:59:13 +01:00
Tim van der Meij
5de6af4e64
Merge pull request #14683 from Snuffleupagus/sendTest-cleanup
[src/display/api.js] Simplify the `sendTest` function, used with Worker initialization (PR 14291 follow-up)
2022-03-19 13:38:05 +01:00
Jonas Jenwald
c0736647f9 Add general iteration support in the RefSet and RefSetCache classes
This patch removes the existing `forEach` methods, in favor of making the classes properly iterable instead. Given that the classes are using a `Set` respectively a `Map` internally, implementing this is very easy/efficient and allows us to simplify some existing code.
2022-03-18 14:27:34 +01:00
Jonas Jenwald
be2b1d5d2a [src/display/api.js] Simplify the sendTest function, used with Worker initialization (PR 14291 follow-up)
Given that we now only use Workers when `postMessage` transfers are supported, there's really no point in trying to send a "test" message *without* transfers present.
Hence, if `postMessage` transfers are not supported by the browser, we'll now fallback to "fake" Workers immediately instead. The comment about Opera is also removed, since it was originally added back in PR 983 and mentions Opera `11.60` [which was released in 2011](https://en.wikipedia.org/wiki/History_of_the_Opera_web_browser#Version_11).
2022-03-16 13:25:41 +01:00
Jonas Jenwald
6a78f20b17 Simplify the PDFDocument constructor
Originally the code in the `src/`-folder was shared between the main/worker-threads, and back then it probably made sense that the `PDFDocument` constructor accepted different arguments.
However, for many years we've not been passing anything *except* Streams to `PDFDocument` and we should thus be able to slightly simplify that code. Note that for e.g. unit-tests of this code, using either a `NullStream` or a `StringStream` works just fine.
2022-03-08 17:13:47 +01:00
Tim van der Meij
5242c38af5
Merge pull request #14628 from Snuffleupagus/issue-14626
When `stopAtErrors` is set, throw rather than warn when exceeding `maxImageSize` (issue 14626)
2022-03-05 13:09:36 +01:00
Tim van der Meij
5d12ac576b
Merge pull request #14631 from Snuffleupagus/typedef-fixes
Fix a couple of small typos in JSDoc `typedef` comments
2022-03-05 13:06:53 +01:00
Jonas Jenwald
939e6f0c4c Fix a couple of small typos in JSDoc typedef comments
While this doesn't affect the official API documentation, these cases should nonetheless be fixed.
2022-03-04 12:11:52 +01:00
Jonas Jenwald
1a7921dbf0 Compute the loca table endOffset, of the "first" glyph, correctly (issue 14618)
When there are *multiple* empty glyphs at the start of the data, ensure that the "first" glyph gets a correct `endOffset` to avoid skipping it during parsing in the `sanitizeGlyph` function.
2022-03-03 14:22:45 +01:00
Jonas Jenwald
d0d5c596fb When stopAtErrors is set, throw rather than warn when exceeding maxImageSize (issue 14626)
The situation described in issue 14626 seems like a fairly special case, and it thus seem reasonable that we simply follow the same pattern as elsewhere in the `PartialEvaluator` when the `stopAtErrors` API-option is being used.
2022-03-03 13:11:29 +01:00
calixteman
046ff07ee3
Merge pull request #14610 from Snuffleupagus/jpx-resetContextProbabilities
[JPEG 2000] Add support for resetContextProbabilities (bug 1731483)
2022-02-26 18:26:39 +01:00
Jonas Jenwald
99cd24ce3e Remove the isString helper function
The call-sites are replaced by direct `typeof`-checks instead, which removes unnecessary function calls. Note that in the `src/`-folder we already had more `typeof`-cases than `isString`-calls.
2022-02-26 16:33:41 +01:00
Jonas Jenwald
6bd4e0f5af Re-factor the PDFDocument.documentInfo method
This removes the `DocumentInfoValidators` structure, and thus (slightly) simplifies the code overall. With these changes we only have to iterate through, and validate, the actually available Dictionary entries.
2022-02-26 16:33:21 +01:00
Tim van der Meij
cf7ce0aa7e
Merge pull request #14600 from Snuffleupagus/getPageIndex-more-validation
[api-minor] Add validation for the  `PDFDocumentProxy.getPageIndex` method
2022-02-26 15:30:00 +01:00
Jeff Muizelaar
9b9609a6d8 [JPEG 2000] Add support for resetContextProbabilities (bug 1731483) 2022-02-26 13:05:23 +01:00
Jonas Jenwald
172d007598 [api-minor] Add validation for the PDFDocumentProxy.getPageIndex method
Currently we'll happily attempt to send any argument passed to this method over to the worker-thread, without doing any sort of validation.
That could obviously be quite bad, since there's first of all no protection against sending unclonable data. Secondly, it's also possible to pass data that will cause the `Ref.get` call in the worker-thread to fail immediately.

In order to address all of these issues, we'll now properly validate the argument passed to `PDFDocumentProxy.getPageIndex` and when necessary reject already on the main-thread instead.
2022-02-24 12:01:51 +01:00
Jonas Jenwald
ec87995050 Ensure that Cmd/Name is only initialized with string arguments
Trying to use a non-string argument in either a `Cmd` or a `Name` is not intended, and would basically be an implementation error. Hence we can add a non-PRODUCTION check to enforce this, similar to the existing one used e.g. in the `Dict.set` method.
2022-02-23 22:39:12 +01:00
Tim van der Meij
2bb96a708c
Merge pull request #14598 from Snuffleupagus/rm-isBool
Re-factor the `Catalog.viewerPreferences` method and remove the `isBool` helper function
2022-02-23 20:36:56 +01:00
Jonas Jenwald
3704283f5b Remove the isBool helper function
The call-sites are replaced by direct `typeof`-checks instead, which removes unnecessary function calls.
2022-02-23 13:31:03 +01:00
Jonas Jenwald
82f1ee1755 Re-factor the Catalog.viewerPreferences method
This removes the `ViewerPreferencesValidators` structure, and thus (slightly) simplifies the code overall. With these changes we only have to iterate through, and validate, the actually available Dictionary entries.
2022-02-23 13:25:56 +01:00
Jonas Jenwald
a2f9031e9a Ensure that Dict.set only accepts string keys
Trying to use a non-string `key` in a `Dict` is not intended, and would basically be an implementation error. Hence we can add a non-PRODUCTION check to enforce this, complementing the existing `value` check added in PR 11672.
2022-02-22 16:35:20 +01:00
Jonas Jenwald
05edd91bdb Remove the isNum helper function
The call-sites are replaced by direct `typeof`-checks instead, which removes unnecessary function calls. Note that in the `src/`-folder we already had more `typeof`-cases than `isNum`-calls.

These changes were *mostly* done using regular expression search-and-replace, with two exceptions:
 - In `Font._charToGlyph` we no longer unconditionally update the `width`, since that seems completely unnecessary.
 - In `PDFDocument.documentInfo`, when parsing custom entries, we now do the `typeof`-check once.
2022-02-22 11:55:34 +01:00
Jonas Jenwald
b282814e38 Prefer instanceof Name rather than calling isName() with one argument
Unless you actually need to check that something is both a `Name` and also of the *correct* type, using `instanceof Name` directly should be a tiny bit more efficient since it avoids one function call and an unnecessary `undefined` check.

This patch uses ESLint to enforce this, since we obviously still want to keep the `isName` helper function for where it makes sense.
2022-02-21 12:45:00 +01:00
Jonas Jenwald
4df82ad31e Prefer instanceof Dict rather than calling isDict() with one argument
Unless you actually need to check that something is both a `Dict` and also of the *correct* type, using `instanceof Dict` directly should be a tiny bit more efficient since it avoids one function call and an unnecessary `undefined` check.

This patch uses ESLint to enforce this, since we obviously still want to keep the `isDict` helper function for where it makes sense.
2022-02-21 12:44:56 +01:00
Jonas Jenwald
67b658e8d5 Prefer instanceof Cmd rather than calling isCmd() with *one* argument
Unless you actually need to check that something is both a `Cmd` and also of the *correct* type, using `instanceof Cmd` directly should be a tiny bit more efficient since it avoids one function call and an unnecessary `undefined` check.

This patch uses ESLint to enforce this, since we obviously still want to keep the `isCmd` helper function for where it makes sense.
2022-02-21 12:44:51 +01:00
Jonas Jenwald
2cb2f633ac Remove the isRef helper function
This helper function is not really needed, since it's just a wrapper around a simple `instanceof` check, and it only adds unnecessary indirection in the code.
2022-02-19 15:33:42 +01:00
Jonas Jenwald
1a31855977 Remove the isStream helper function
At this point all the various Stream-classes extends an abstract base-class, hence this helper function is no longer necessary and only adds unnecessary indirection in the code.
2022-02-17 13:51:36 +01:00
Jonas Jenwald
fd319e94b3 Add a missing string-check in the _collectJS helper function
Unfortunately I don't have a test-case that breaks without this change, however the `stringToPDFString` helper function will fail if anything other than a string is passed to it.
The changes in this patch thus make this code more-or-less identical to that found in the `Catalog.{_collectJavaScript, parseDestDictionary}` methods.
2022-02-16 13:43:42 +01:00
Calixte Denizet
18e3a98c2b [api-minor] Don't add in the text content the chars which are out-of-page (bug 1755201)
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1755201;
- if the glyph position is not within the view then skip it.
2022-02-13 21:07:11 +01:00
Jonas Jenwald
b89595fd20 [api-minor] Remove the, in legacy builds, bundled ReadableStream polyfill
According to the MDN compatibility data, see https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream#browser_compatibility, all browsers that we support have native `ReadableStream` implementations (since quite some time too).

Hence only Node.js is now lagging behind w.r.t. `ReadableStream` support, and its experimental implementation doesn't really help us given the life-span of the LTS releases (see https://en.wikipedia.org/wiki/Node.js#Releases).
It seems quite unfortunate to bundle a `ReadableStream` polyfill in the `legacy` builds when it's unnecessary in browsers, given its overall size, but fortunately we can avoid that by simply listing `web-streams-polyfill` as a dependency for the `pdfjs-dist` library.
2022-02-13 10:15:58 +01:00
Jonas Jenwald
64f3dbeb48 Let Lexer.getNumber treat a single minus sign as zero (bug 1753983)
This appears to be consistent with the behaviour in both Adobe Reader and PDFium (in Google Chrome); this is essentially the same approach as used for a single decimal point in PR 9827.
2022-02-07 17:09:47 +01:00
Jonas Jenwald
403baa7bba [api-minor] Remove the normalizeWhitespace option in the PDFPageProxy.{getTextContent, streamTextContent} methods (issue 14519, PR 14428 follow-up)
With these changes, we'll now *always* replace all whitespaces with standard spaces (0x20). This behaviour is already, since many years, the default in both the viewer and the browser-tests.
2022-02-03 09:17:22 +01:00
Calixte Denizet
ae842e1c3a [api-minor] Annotations - Adjust the font size in text field in considering the total width (bug 1721335)
- it aims to fix #14502 and bug 1721335;
 - Acrobat and Pdfium do the same;
 - it'll avoid to have truncated data when printed;
 - change the factor to compute font size in using field height: lineHeight = 1.35*fontSize
  - this is the value used by Acrobat.
 - in order to not have truncated strings on the bottom, add few basic metrics for standard fonts.
2022-01-30 15:53:31 +01:00
Calixte Denizet
3a7004ca25 Take into account all rotations before comparing glyph positions
- it aims to fix #14497;
 - previously, only rotations with an angle 0, 90, 180 or 270 were taken into account;
 - so generalize to any angle but keep the fast path for 0, 90, ... because they're likely more common than anything else.
2022-01-26 17:19:00 +01:00
Jonas Jenwald
8836593b9e Add a (global) cache to the getCharUnicodeCategory function
Given that the regular expression has already become more complex (after the initial patch adding it), it seems to me that it probably cannot hurt to add a global cache to reduce unnecessary re-parsing.
Obviously the `Glyph`-instances are being cached *per* font, however in most documents multiple fonts are being used and in practice there's very often a fair amount of overlap between the /ToUnicode-data in different fonts[1].

Consider for example loading and rendering the entire `tracemonkey.pdf` document (from the test-suite), which isn't a particularily large document. In that case the `getCharUnicodeCategory` function is being called a total of `601` times, however there's only `106` *unique* unicode-chars being checked.

*Please note:* In practice I suppose that this won't have a *huge* effect on overall performance, however given the relative simplicity of this patch I figured that it'd not hurt to submit it for review.

---
[1] Consider e.g. how there's usually different fonts used for regular, bold, respectively italic text.
2022-01-25 09:59:34 +01:00
Calixte Denizet
e1d3a3b414 Remove the invisible format marks from the text chunks
- it aims to fix issue #9186.
2022-01-24 13:47:24 +01:00
Tim van der Meij
23b6fde9fc
Merge pull request #14464 from Snuffleupagus/issue-14462
Support Type1 font files with incomplete /CharStrings definitions (issue 14462)
2022-01-19 20:38:46 +01:00
calixteman
b0231cc887
Merge pull request #14456 from calixteman/1749563
Font renderer - get int8 instead of uint8 in composite glyphes (bug 1749563)
2022-01-19 01:20:49 -08:00
Calixte Denizet
74f25d2755 Font renderer - get int8 instead of uint8 in composite glyphes (bug 1749563)
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1749563;
 - use some helper functions to get (u|i)int** values in buffer: it helps to have a clearer code;
 - in composite glyphes the translations values with a transformations are signed so consequently get some int8 instead of uint8;
 - add few TODOs.
2022-01-18 22:06:23 +01:00
Jonas Jenwald
a13ae5d97d Support Type1 font files with incomplete /CharStrings definitions (issue 14462)
Please refer to https://www.pdfa.org/norm-refs/Type1Fonts.pdf#page=15 for the expected format for the /CharStrings entries.
In the referenced PDF document the /CharStrings are missing the expected end-token, which causes us to swallow the start of the next glyph name.
2022-01-17 18:55:22 +01:00
Jonas Jenwald
ba37d600d7 Make the normalizeWhitespace handling, in the PartialEvaluator, more efficient (PR 14428 follow-up)
After the changes in PR 14428 we can *directly*, and more efficiently, handle whitespace conversion in `PartialEvaluator.getTextContent` when the `normalizeWhitespace` option is being used.
This way we no longer need a separate helper function for this, and can avoid having to (again) iterate through the text and checking each character. Finally, this also removes the need for using a regular expression on e.g. all non-ASCII text.
2022-01-16 08:29:21 +01:00
calixteman
da953f4b64
Merge pull request #14428 from calixteman/typo
Use the correct dimension to know if we have to add an EOL in vertical mode
2022-01-15 12:47:10 -08:00
Calixte Denizet
9dae421a0d Handle all the whitespaces the same way when creating text chunks 2022-01-15 21:44:00 +01:00
Jonas Jenwald
53d4ee7990 Prevent circular references in Type3 fonts
In corrupt PDF documents Type3 fonts may introduce circular dependencies, thus resulting in the affected font(s) never loading and parsing/rendering never completing.
Note that I've not seen any real-world examples of this kind of font corruption, but the attached PDF document was rather found in https://github.com/pdf-association/safedocs/tree/main/Miscellaneous%20Targeted%20Test%20PDFs

*Please note:* That repository contains a number of reduced test-cases that are specifically intended to test interoperability (between PDF viewer) and parsing/rendering for various kinds of strange/corrupt PDF documents.
Some of the test-cases found there may thus not make sense to try and "fix" upfront, in my opinion, unless the problems are also found in real-world PDF documents.
2022-01-13 17:58:37 +01:00
Calixte Denizet
9bb636402a Use the correct dimension to know if we have to add an EOL in vertical mode 2022-01-07 15:19:03 +01:00
Calixte Denizet
6cdae5ac4d Use positive dimensions for text chunks in the text layer (issue #14415). 2022-01-05 10:49:56 +01:00
Jonas Jenwald
b0e774d9c5 Convert Catalog.getAllPageDicts to an async method
The patch in PR 14335 *essentially* re-introduced the old code from before PR 3848, however looking at this code a bit closer it should be possible to simplify it by making the method asynchronous.

While this method is currently only used as a *fallback* in corrupt documents, the way that `MissingDataException`s are handled is less than ideal. Note that if a `MissingDataException` is thrown, we're forced to re-parse the *entire* /Pages tree[1].
With this method now being asynchronous, we're able to handle fetching of References in a *much* easier/nicer way than before without having to throw `MissingDataException`s and re-parse anything.
These changes also let us simplify the call-site slightly, by calling the method *directly* instead of using the `PDFManager`-instance (since again it will no longer throw `MissingDataException`s).

Furthermore, this patch contains the following other changes:
 - Reduce unnecessary duplication in the various `catch` handlers throughout the method, by simply moving the `XRefEntryException` handling into the `addPageError` helper function instead.
 - Move the "circular references"-check to occur slightly earlier, since there's obviously no point in asynchronously fetching data just to then throw an Error *immediately* afterwards.

---
[1] Imagine e.g. a thousand page document, where there's a `MissingDataException` thrown when fetching/parsing page 900.
2021-12-31 22:03:10 +01:00
Jonas Jenwald
1491459dea Improve caching for the Catalog.getPageIndex method (PR 13319 follow-up)
This method is now being used a lot more, compared to when it's added, since it's now used together with scripting as part of the `PDFDocument.fieldObjects` parsing (called during viewer initialization).
For /Page Dictionaries that we've already parsed, the `pageIndex` corresponding to a particular Reference is already known and we're thus able to skip *all* parsing in the `Catalog.getPageIndex` method for those cases.
2021-12-29 20:29:14 +01:00
Jonas Jenwald
a20393e6e4 Update PDFDocument._getLinearizationPage to do the /Type-check correctly (PR 14400 follow-up)
I forgot about this in PR 14400, since we should obviously be consistent *and* given that the existing check is actually wrong; sorry about this!
2021-12-29 13:26:58 +01:00
Tim van der Meij
e42d54e1b5
Merge pull request #14400 from Snuffleupagus/getPageDict-async
[api-minor] Convert `Catalog.getPageDict` to an asynchronous method
2021-12-28 19:40:34 +01:00