Commit Graph

510 Commits

Author SHA1 Message Date
Calixte Denizet
a96f10e55d Create a new chunk when the char is too rised compared to the previouse one 2023-03-28 13:56:46 +02:00
Jonas Jenwald
1fc09f0235 Enable the unicorn/prefer-string-replace-all ESLint plugin rule
Note that the `replaceAll` method still requires that a *global* regular expression is used, however by using this method it's immediately obvious when looking at the code that all occurrences will be replaced; please see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replaceAll#parameters

Please find additional details at https://github.com/sindresorhus/eslint-plugin-unicorn/blob/main/docs/rules/prefer-string-replace-all.md
2023-03-23 12:57:10 +01:00
Jonas Jenwald
137a2d6e30 Add even more non-standard ligatures (PR 15517 follow-up)
Given that we already create multi-byte ToUnicode entries in other cases, see e.g. the `getNormalizedUnicodes` table, this is hopefully fine.
2023-03-22 10:42:52 +01:00
Jonas Jenwald
d4bcfe8c16 Support multi-byte ToUnicode entries, when using predefined CMaps (issue 16176)
Hopefully this makes sense, since we already "create" multi-byte ToUnicode entries in other cases (see e.g. the `getNormalizedUnicodes` table).
2023-03-21 21:35:57 +01:00
Jonas Jenwald
6839f15a32
Merge pull request #16128 from Snuffleupagus/issue-16127
Support (rare) Type3 fonts with Pattern resources (issue 16127)
2023-03-08 12:21:53 +01:00
Jonas Jenwald
e5427ab11b
Merge pull request #16122 from Snuffleupagus/rm-onUnsupportedFeature
[api-minor] Remove the deprecated `onUnsupportedFeature` functionality (PR 15758 follow-up)
2023-03-08 12:16:27 +01:00
Calixte Denizet
e9474f1c84 [api-minor] Add an option to set the max canvas area 2023-03-08 10:37:06 +01:00
Jonas Jenwald
471aef5fc6 Support (rare) Type3 fonts with Pattern resources (issue 16127)
This simply extends the approach in PR 10727 to also cover Patterns, which shouldn't be a common occurrence in Type3 fonts (since this is the first issue we've seen).
2023-03-08 09:20:52 +01:00
Calixte Denizet
b8dda089e2 Slightly modify the max width of a tracking space 2023-03-07 19:38:49 +01:00
Jonas Jenwald
2f3dcc2327 [api-minor] Remove the deprecated onUnsupportedFeature functionality (PR 15758 follow-up)
This was deprecated in PR 15758, which has now been included in three official PDF.js releases.
While PR 15880 did limit the bundle-size impact of this functionality on e.g. the Firefox PDF Viewer, it still leads to some unnecessary "bloat" that these changes remove.
Furthermore, with this being deprecated there'd also be no effort put into e.g. extending the `UNSUPPORTED_FEATURES` list when handling future error cases.
2023-03-07 10:18:43 +01:00
Calixte Denizet
05b0c9d7e6 Render large images even if they're larger than the canvas limits (bug 1720282)
The idea is to encode large image in BMP format (which is very simple and doesn't
require to compute any checksums) and then use createImageBitmap with a BMP blob
(which doesn't suffer of the Canvas/ImageData limits).
From a performance point of view, it isn't crazy (generating a large blob + decoding
it on the main thread is really not ideal) but at least we've something to display
which is a way better than a blank page (and one can notice that most of the time is
spent in decoding the image from the pdf stream).
2023-03-05 14:07:07 +01:00
Calixte Denizet
fd03cd5493 [api-minor] Generate images in the worker instead of the main thread.
We introduced the use of OffscreenCanvas in #14754 and this patch aims
to use them for all kind of images.
It'll slightly improve performances (and maybe slightly decrease memory use).
Since an image can be rendered in using some transfer maps but because of
OffscreenCanvas we don't have the underlying pixels array the transfer maps
stuff is re-implemented in using the SVG filter feComponentTransfer.
2023-03-01 17:40:12 +01:00
Jonas Jenwald
45c332110e
Check OffscreenCanvas support once on the worker-thread
Currently we repeat the `FeatureTest.isOffscreenCanvasSupported` checks all over the worker-thread code, and with upcoming changes this will become even "worse".

Hence this patch, which changes the *worker-thread* default value for the `isOffscreenCanvasSupported`-parameter to `false` and moves the feature-testing into the `BasePdfManager`-constructor.

*Please note:* This patch is written using the GitHub UI, since I'm currently without a dev machine, so hopefully it works correctly.
2023-02-27 12:27:28 +01:00
Calixte Denizet
4e9f26afa3 Ignore position of combining diacritics when getting text (bug 1640217) 2023-02-09 17:13:57 +01:00
Jonas Jenwald
1a69d537c1 [api-minor] Limit the PDFDocumentLoadingTask.onUnsupportedFeature functionality to GENERIC builds (PR 15758 follow-up)
This was deprecated in PR 15758 but it's unfortunately quite difficult to tell if third-party users are depending on this, e.g. to implement custom error reporting, and if so to what extent.
However, thanks to the pre-processor we can limit *most* of this code to GENERIC builds which still seem like a worthwhile change.

These changes reduce the bundle size of the Firefox PDF Viewer by 3.8 kB in total.
2023-01-01 17:53:12 +01:00
Jonas Jenwald
0c1fb4e740 [api-minor] Remove the PDFDocumentProxy.stats getter (PR 15758 follow-up)
This was deprecated in PR 15758 and given that it's quite unlikely that any third-party users are relying on this functionality, since it was only ever added to support telemetry reporting in the Firefox PDF Viewer, it should hopefully be fine to remove this fairly quickly.

These changes reduce the bundle size of the Firefox PDF Viewer by 4.5 kB in total.
2023-01-01 17:06:47 +01:00
Jonas Jenwald
f7449563ef
Merge pull request #15659 from sxyuan/system-font-name-fix
[api-minor] Propagate the translated font name to TextContentItem for system fonts
2022-11-08 21:56:49 +01:00
Samuel Yuan
36fb5c1e2b Propagate the translated font name to TextContentItems.
This allows font data for system fonts to be looked up in the
PDFObjects.
2022-11-08 11:16:21 -08:00
Jonas Jenwald
c8868a1c7a [api-minor] Initialize the unicode-category *lazily* on the Glyph-instance
The purpose of this patch is twofold:
 - Initialize the unicode-category data *lazily* during text-extraction, since this is completely unused during general parsing/rendering.
 - Stop exposing this data in the API, since it's unused on the main-thread and it seems like it was *accidentally* included.

Obviously these changes are API-observable, but hopefully no user is depending on this. Furthermore, it's trivial for a user to re-create this unicode-category data manually with a regular expression (from the exposed `unicode` property).
2022-11-05 10:12:17 +01:00
Jonas Jenwald
c33b8d7692 Cache the normalized unicode-value on the Glyph-instance
Currently, during text-extraction, we're repeatedly normalizing and (when necessary) reversing the unicode-values every time. This seems a little unnecessary, since the result won't change, hence this patch moves that into the `Glyph`-instance and makes it *lazily* initialized.

Taking the `tracemonkey.pdf` document as an example: When extracting the text-content there's a total of 69236 characters but only 595 unique `Glyph`-instances, which mean a 99.1 percent cache hit-rate. Generally speaking, the longer a PDF document is the more beneficial this should be.

*Please note:* The old code is fast enough that it unfortunately seems difficult to measure a (clear) performance improvement with this patch, so I completely understand if it's deemed an unnecessary change.
2022-11-03 22:36:53 +01:00
Jonas Jenwald
1e7274e9c6 [api-minor] Move the handling of unbalanced markedContent to the worker-thread (PR 15630 follow-up) 2022-10-27 11:14:54 +02:00
Jonas Jenwald
fa47d4b9b1 Slightly re-factor PartialEvaluator._simpleFontToUnicode
Given the sheer number of heuristics added to this method over the years, moving the *valid* unicode found case to the top should improve readability of the code.
2022-10-13 21:42:57 +02:00
Jonas Jenwald
1ea4c4b519 [api-minor] Make isOffscreenCanvasSupported configurable via the API (issue 14952)
This patch first of all makes `isOffscreenCanvasSupported` configurable, defaulting to `true` in browsers and `false` in Node.js environments, with a new `getDocument` parameter. While you normally want to use this, in order to improve performance, it should still be possible for users to control it (similar to e.g. `isEvalSupported`).

The specific problem, as reported in issue 14952, is that the SVG back-end doesn't support the new ImageMask data-format that's introduced in PR 14754. In particular:
 - When the SVG back-end is used in Node.js environments, this patch will "just work" without the user needing to make any code changes.
 - If the SVG back-end is used in browsers, this patch will require that `isOffscreenCanvasSupported: false` is added to the `getDocument`-call.
2022-10-07 00:10:46 +02:00
Jonas Jenwald
60f6272ed9 Use more for...of loops in the code-base
Most, if not all, of this code is old enough to predate the general availability of `for...of` iteration.
2022-10-03 13:08:38 +02:00
Jonas Jenwald
6538409282 Replace some Array.prototype-usage with spread syntax
We have a few, quite old, call-sites that use the `Array.prototype`-format and which can now be replaced with spread syntax instead.
2022-09-23 09:35:30 +02:00
Calixte Denizet
198e9a3db1 Initialize values in the path bounding box before flushing the operator list (bug 1791583)
OperatorList.addOp can trigger a flush if it's required, hence the values passed to it must
be correctly initialized in order to avoid some wrong values in the renderer.
Because of that a clip path was considered as empty, nothing was clipped, hence the wrong
rendering in bug 1791583.
2022-09-20 20:01:54 +02:00
Jonas Jenwald
bb75b36b77 Replace some unnecessary String.prototype.search usage
Most of the `String.prototype.search` call-sites found throughout the code-base is actually not necessary, since we usually only want a *boolean*, and those can be replaced with `RegExp.prototype.test` instead.
2022-09-19 12:51:46 +02:00
Jonas Jenwald
f6db7975c5 Enable the ESLint prefer-spread rule
Note that in a couple of spots the argument could be `undefined` and there we simply disable the rule instead.

Please refer to https://eslint.org/docs/latest/rules/prefer-spread
2022-08-06 10:17:00 +02:00
Jonas Jenwald
60bd9580e2 Ignore invalid /CIDToGIDMap-entries when parsing fonts (issue 15139)
In the referenced PDF document the fonts have /CIDToGIDMap-entries that cannot be loaded. Hence, only when `ignoreErrors` is set, we'll now ignore these corrupt /CIDToGIDMap-entries and fallback to simply assume that no such data is available.

Given that this is *clearly* a case of a corrupt PDF document, there's no guarantee that this will "fix" things in the general case since a /CIDToGIDMap may be *required* in order for some composite fonts to render correctly. However, attempting to render *something* is surely better than skipping a font altogether.
2022-07-20 11:58:44 +02:00
Jonas Jenwald
acd61a138e Handle errors in the "Loading by ref" code-path in PartialEvaluator.loadFont
Note how we currently throw a "raw" Error, which is problematical since all of the `PartialEvaluator.loadFont` call-sites expect a Promise to be returned. Furthermore, this also means that we don't benefit from the fallback code-path that now exists below.

*Please note:* Unfortunately I don't have a test-case that fails without this patch, since it's something I happened to notice when reading the code while working on another patch.
2022-07-15 16:33:36 +02:00
Jonas Jenwald
79cfc548fc Improve text-selection for Type3 fonts with bogus /FontBBox-entries (issue 14999)
This extends PR 13461, by also building a fallback bounding box for Type3 fonts that contain a much too small /FontBBox-entry.

*Please note:* While this patch improves things overall, copy-and-pasting still doesn't work perfectly for this document. In particular the lowercase letter "c" cannot be selected/copied, however this can be reproduced in both Adobe Reader and PDFium (in Google Chrome) too, which is caused by a lack of proper /ToUnicode-data in the PDF document.
2022-07-05 14:27:14 +02:00
Calixte Denizet
3789dab307 Always flush the current item with MarkedContent stuff when getting text (#15094) 2022-06-25 17:19:57 +02:00
Jonas Jenwald
9ac4536693 Enable the unicorn/prefer-at ESLint plugin rule (PR 15008 follow-up)
Please find additional information here:
 - https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/at
 - https://github.com/sindresorhus/eslint-plugin-unicorn/blob/main/docs/rules/prefer-at.md
2022-06-09 21:21:19 +02:00
Jonas Jenwald
5a2899c57e Skip bogus d1 operators in Type3-glyphs (issue 14953)
In the `src/display/canvas.js` code the `d1` operator will be used to set the clipping region, and it obviously cannot be empty since that prevents the Type3-glyph from rendering.

Also, the patch removes an outdated comment; refer to PR 12718.
2022-05-24 12:20:31 +02:00
Jonas Jenwald
5a774b7ed3 Adjust the heuristics for handling of incomplete path operators (issue 14917)
This limits the heuristics for handling of incomplete path operators, see PR 9838, to only apply to *sequences* of such operators. In practice a couple of invalid path operators are (hopefully) unlikely to completely break rendering, whereas a sequence of them will easily lead to fairly chaotic rendering artifacts.
2022-05-15 11:24:39 +02:00
Jonas Jenwald
8267fd8a52 Replace the AnnotationStorage.lastModified-getter with a proper hash-method
The current `lastModified`-getter, which only contains a time-stamp, is a fairly crude way of detecting if the stored data has actually been changed. In particular, when the `getRawValue`-method is used, the `lastModified`-getter doesn't cope with data being modified from the "outside".

To fix these issues[1], and to prevent any future bugs in this code, this patch introduces a new `AnnotationStorage.hash`-getter which computes a hash of the currently stored data. To simplify things this re-uses the existing `MurmurHash3_64`-implementation, which required moving that file into the `src/shared/`-folder, since its performance should be good enough here.

---
[1] Given how the `AnnotationStorage.lastModified`-getter was used, this would have been limited to *printing* of forms.
2022-05-04 15:21:30 +02:00
Jonas Jenwald
fbf6dee8ee [api-minor] Remove the forceClamped-functionality in the Streams (issue 14849)
As it turns out, most of the code-paths in the `PDFImage`-class won't actually pass the TypedArray (containing the image-data) to the `ColorSpace`-code. Hence we *generally* don't need to force the image-data to be a `Uint8ClampedArray`, and can just as well directly use a `Uint8Array` instead.

In the following cases we're returning the data without any `ColorSpace`-parsing, and the exact TypedArray used shouldn't matter:
 - b72a448327/src/core/image.js (L714)
 - b72a448327/src/core/image.js (L751)

In the following cases the image-data is only used *internally*, and again the exact TypedArray used shouldn't matter:
 - b72a448327/src/core/image.js (L762) with the actual image-data being defined (as `Uint8ClampedArray`) further below
 - b72a448327/src/core/image.js (L837)

*Please note:* This is tagged `api-minor` because it's API-observable, given that *some* image/mask-data will now be returned as `Uint8Array` rather than using `Uint8ClampedArray` unconditionally. However, that seems like a small price to pay to (slightly) reduce memory usage during image-conversion.
2022-04-29 14:46:30 +02:00
Jonas Jenwald
e18edf38db Add a helper function for incrementing the count of cached ImageMasks
While working on PR 14825, I couldn't help noticing that the code to increment the `count` for cached ImageMasks was repeated multiple times. Hence it makes sense, as far as I'm concerned, to move this into a helper function instead.
2022-04-24 11:10:02 +02:00
Tim van der Meij
752dee5caa
Merge pull request #14825 from Snuffleupagus/issue-14824
Ensure that worker-thread image caching doesn't break optional content (issue 14824)
2022-04-23 13:19:56 +02:00
Jonas Jenwald
6c229dffb1 Ensure that worker-thread image caching doesn't break optional content (issue 14824)
Currently we only insert optionalContent-data into the operatorList the first time that an image is parsed, which will (in hindsight) obviously cause problems for cached images.
Hence we also need to insert the optionalContent-data in the various worker-thread image caches, such that it can be accessed in the fast-paths that are used to skip re-parsing of images.

In order to reduce the amount of repeated code, this patch also adds a new `OperatorList`-method that takes care of inserting the necessary data in the operatorList.
2022-04-22 14:49:16 +02:00
Jonas Jenwald
e723da7261 Ignore invalid /Encoding-entries when parsing fonts (issue 14821)
In the referenced PDF document the fonts have /Encoding-entries that are Streams (containing completely bogus data), which are thus obviously not valid here.
Hence, only when `ignoreErrors` is set, we'll now ignore these corrupt /Encoding-entries and fallback to the existing code to try and infer a usable encoding.

Given that this is *clearly* a case of corrupt PDF documents, there's no guarantee that this will "fix" all such cases, however it's the best that we do here and shouldn't really be worse than ignoring an entire font.
2022-04-22 11:49:03 +02:00
Calixte Denizet
4b7691baf6 Simplify min/max computations in constructPath (bug 1135277)
- most of the time the current transform is a scaling one (modulo translation),
  hence it's possible to avoid to apply the transform on each bbox and then apply
  it a posteriori;
- compute the bbox when it's possible in the worker.
2022-04-17 17:25:54 +02:00
Calixte Denizet
f62d961dfe Improve performances with image masks (bug 857031)
- it's the second part of the fix for https://bugzilla.mozilla.org/show_bug.cgi?id=857031;
- some image masks can be used several times but at different positions;
- an image need to be pre-process before to be rendered:
  * rescale it;
  * use the fill color/pattern.
- the two operations above are time consuming so we can cache the generated canvas;
- the cache key is based on the current transform matrix (without the translation part)
  and the current fill color when it isn't a pattern.
- the rendering of the pdf in the above bug is really faster than without this patch.
2022-04-16 20:48:39 +02:00
Tim van der Meij
cdb3481d6c
Merge pull request #14764 from apeltop/correct-typos
Correct typos
2022-04-10 14:55:08 +02:00
Calixte Denizet
040fcae5ab Improve performance with image masks (bug 857031)
- it aims to partially fix performance issue reported: https://bugzilla.mozilla.org/show_bug.cgi?id=857031;
- the idea is too avoid to use byte arrays but use ImageBitmap which are a way faster to draw:
  * an ImageBitmap is Transferable which means that it can be built in the worker instead of in the main thread:
    - this is achieved in using an OffscreenCanvas when it's available, there is a bug to enable them
      for pdf.js: https://bugzilla.mozilla.org/show_bug.cgi?id=1763330;
    - or in using createImageBitmap: in Firefox a task is sent to the main thread to build the bitmap so
      it's slightly slower than using an OffscreenCanvas.
  * it's transfered from the worker to the main thread by "reference";
  * the byte buffers used to create the image data have a very short lifetime and ergo the memory used is globally
    less than before.
- Use the localImageCache for the mask;
- Fix the pdf issue4436r.pdf: it was expected to have a binary stream for the image;
- Move the singlePixel trick from operator_list to image: this way we can use this trick even if it isn't in a set
  as defined in operator_list.
2022-04-09 18:26:26 +02:00
apeltop
a97dd26389 Correct typos 2022-04-09 09:43:18 +09:00
Calixte Denizet
18e79e3c0b [text selection] Add the whitespaces present in the pdf in the text chunk
- it aims to fix issue #14627;
- the basic idea of the recent text refactoring was to only consider the rendered visible whitespaces.
  But sometimes, the heuristics aren't correct and although some whitespaces are in the text stream
  they weren't in the text chunks because they were too small. Hence we added some exceptions, for example,
  we always add a whitespace when it is between two non-whitespace chars but only when in the same Tj.
  So basically, this patch removes the constraint to have the chars in the same Tj
  (in using a circular buffer to save the two last chars) but don't add a space when the visible space is really
  too small (hence `NOT_A_SPACE_FACTOR`).
2022-03-27 14:34:56 +02:00
Jonas Jenwald
73d2ddac0d Update npm packages
Note that the Prettier update made it possible to move a couple of comments after `default:`-cases back to their original/intended positions, please see https://prettier.io/blog/2022/03/16/2.6.0.html
2022-03-20 10:59:13 +01:00
Jonas Jenwald
c0736647f9 Add general iteration support in the RefSet and RefSetCache classes
This patch removes the existing `forEach` methods, in favor of making the classes properly iterable instead. Given that the classes are using a `Set` respectively a `Map` internally, implementing this is very easy/efficient and allows us to simplify some existing code.
2022-03-18 14:27:34 +01:00
Jonas Jenwald
d0d5c596fb When stopAtErrors is set, throw rather than warn when exceeding maxImageSize (issue 14626)
The situation described in issue 14626 seems like a fairly special case, and it thus seem reasonable that we simply follow the same pattern as elsewhere in the `PartialEvaluator` when the `stopAtErrors` API-option is being used.
2022-03-03 13:11:29 +01:00