Rather than duplicating the lookup and caching in multiple files, it seems easier to simply move all of this functionality into `src/shared/util.js`.
This will also help avoid a bunch of ESLint errors once the `no-shadow` rule is eventually enabled.
*This patch fixes something that's annoyed me every now and then over the years, when debugging/fixing corrupt PDF documents.*
For corrupt PDF documents where `Lexer.getHexString` encounters invalid characters, there's very rarely just a handful of them. In practice it's not uncommon for many hundreds, or even many thousands, of invalid hex characters to be found.
Not only is the resulting console warning spam utterly useless in these cases, there's often enough of it that performance may even suffer; hence this patch, which limits the number of messages that any one `Lexer.getHexString` invocation may print.
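To illustrate the idea, a minimal sketch (the limit, the constant name, and the helper are assumptions, not the exact PDF.js code):
```
// Cap the number of warnings that a single invocation may print; any
// remaining invalid characters are summarized in one final message.
const MAX_HEX_WARNINGS = 5; // Hypothetical limit.

function warnAboutInvalidHexChars(chars) {
  let numInvalid = 0;
  for (const ch of chars) {
    if (!/^[0-9a-fA-F]$/.test(ch)) {
      if (++numInvalid <= MAX_HEX_WARNINGS) {
        console.warn(`getHexString - ignoring invalid character: "${ch}"`);
      }
    }
  }
  if (numInvalid > MAX_HEX_WARNINGS) {
    console.warn(
      `getHexString - ignored ${numInvalid - MAX_HEX_WARNINGS} ` +
        "additional invalid characters."
    );
  }
}
```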
The PDF document in question is *corrupt*, since it contains an XObject with a truncated dictionary and where the stream contents start without a "stream" operator.
The only reason for the `return undefined;` lines was to appease the ESLint `consistent-return` rule, but that's not actually necessary if you make use of the fact that the method is `async` and that we can thus await the Promise rather than returning it.
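A hypothetical before/after, to illustrate (the method and its body are made up):
```
// Stub, purely for illustration.
async function fetchData(id) {
  return { id };
}

// Before: the explicit `return undefined;` exists only so that every
// code path returns a value, as the `consistent-return` rule requires.
async function loadBefore(id) {
  if (!id) {
    return undefined;
  }
  return fetchData(id);
}

// After: since the method is `async`, awaiting the Promise means that
// *no* code path needs an explicit return value.
async function loadAfter(id) {
  if (!id) {
    return;
  }
  await fetchData(id);
}
```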
Note that `Dict.set` will only be called with values returned through `Parser.getObj`, and thus indirectly via `Lexer.getObj`. Since neither of those methods will ever return `undefined`, we can simply assert that that's the case when inserting data into the `Dict` and thus get rid of `in` checks when doing the data lookups.
In this case, since `Dict.set` is fairly hot, the patch utilizes an *inline check* and when necessary a direct call to `unreachable` to not affect performance of `gulp server/test` too much (rather than always just calling `assert`).
For very large and complex PDF files this will help performance *slightly*, since `Dict.{get, getAsync, has}` is called *a lot* during parsing in the worker.
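Roughly, the approach looks as follows (simplified; the import path assumes the `src/core/` location, and in the actual code the check is also limited to development/testing builds):
```
import { unreachable } from "../shared/util.js";

class Dict {
  constructor() {
    this._map = Object.create(null);
  }

  set(key, value) {
    // Inline check, with `unreachable` only invoked in the failure case,
    // to keep this hot method fast (rather than always calling `assert`).
    if (value === undefined) {
      unreachable('Dict.set: The "value" cannot be undefined.');
    }
    this._map[key] = value;
  }

  get(key) {
    // Since `undefined` can never be stored, a plain lookup suffices and
    // no `in` check is necessary.
    return this._map[key];
  }
}
```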
This patch was tested using the PDF file from issue 2618, i.e. http://bugzilla-attachments.gnome.org/attachment.cgi?id=226471, with the following manifest file:
```
[
{ "id": "issue2618",
"file": "../web/pdfs/issue2618.pdf",
"md5": "",
"rounds": 250,
"type": "eq"
}
]
```
which gave the following results when comparing this patch against the `master` branch:
```
-- Grouped By browser, stat --
browser | stat | Count | Baseline(ms) | Current(ms) | +/- | % | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | --- | ----- | -------------
Firefox | Overall | 250 | 2838 | 2820 | -18 | -0.65 | faster
Firefox | Page Request | 250 | 1 | 2 | 0 | 11.92 | slower
Firefox | Rendering | 250 | 2837 | 2818 | -19 | -0.65 | faster
```
Fixes #11477
The PDF draws many space characters but the embedded fonts don't have a glyph named `space`, so `.notdef` should be drawn instead. PDF.js assumed that Type1 fonts define `.notdef` as the first glyph (index 0). However, the fonts in this PDF have the glyph `A` at index 0 and `.notdef` as the last one, so `A` appears where spaces are expected.
Because the rest of the font machinery in `core/fonts.js` assumes `.notdef` is at index zero, it's easiest to modify `core/type1_parser.js` so that it "repairs" fonts and makes sure `.notdef` is at index 0.
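A simplified, hypothetical sketch of such a repair (the real parser also has to keep the related font data consistent):
```
// Ensure that `.notdef` sits at glyph index 0, by swapping it with
// whatever glyph the font originally placed there.
function repairNotdefGlyph(charstrings) {
  const index = charstrings.findIndex(cs => cs.glyphName === ".notdef");
  if (index > 0) {
    const first = charstrings[0];
    charstrings[0] = charstrings[index];
    charstrings[index] = first;
  }
  return charstrings;
}
```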
The PDF document in question is *corrupt*, since it contains multiple instances of incorrect operators.
We obviously don't want to slow down parsing of *all* documents (since most are valid), just to accommodate a particular bad PDF generator, hence the reason for the inline check before calling the `ensureStateFont` method.
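The shape of the check, roughly (heavily simplified from the operator-parsing code; the fallback details are assumptions):
```
// Warn, and fall back to a default font, so that rendering of the
// corrupt document can still continue.
function ensureStateFont(state) {
  console.warn("Missing setFont operator before text-showing operator.");
  state.font = { name: "Helvetica" }; // Hypothetical fallback font.
}

function handleShowTextOperator(state, args) {
  if (!state.font) {
    // Inline check before the method call, so that *valid* documents
    // never pay the cost of calling the helper.
    ensureStateFont(state);
  }
  // ... handle the text-showing operator using `state.font` ...
}
```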
It's no longer necessary to special-case this getter in the `GenericFontLoader` case, since the GENERIC build hasn't been using `mozPrintCallback` for years now (and Firefox 63 is really old as well).
This patch extends the existing heuristics, which are really the best that we can do in general for these kinds of non-embedded *and* non-standard fonts.
Furthermore, this patch also tries to improve the copy-and-paste behaviour for non-embedded Wingdings fonts by also using the `ZapfDingbatsEncoding` in this case.
*Note:* I'm not sure that adding additional tests for Wingdings fonts matters that much, given how limited our "support" for them really is.
Given that all of these primitives implement caching, to avoid unnecessarily duplicating those objects *a lot* during parsing, it would be good to actually enforce usage of `Cmd.get()`/`Name.get()`/`Ref.get()` in the code-base.
Luckily it turns out that there's an ESLint rule, which is fairly easy to use, that can disallow arbitrary JavaScript syntax.
Please find additional details about the ESLint rule at https://eslint.org/docs/rules/no-restricted-syntax
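As an illustration, a configuration along these lines (the exact selectors and messages in the PDF.js configuration may differ):
```
{
  "rules": {
    "no-restricted-syntax": [
      "error",
      {
        "selector": "NewExpression[callee.name='Cmd']",
        "message": "Use `Cmd.get()` rather than `new Cmd()`, to utilize caching."
      },
      {
        "selector": "NewExpression[callee.name='Name']",
        "message": "Use `Name.get()` rather than `new Name()`, to utilize caching."
      },
      {
        "selector": "NewExpression[callee.name='Ref']",
        "message": "Use `Ref.get()` rather than `new Ref()`, to utilize caching."
      }
    ]
  }
}
```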
*This whole patch feels somewhat arbitrary, and I'd be slightly worried about possibly breaking something else.*
To limit the impact of these changes, we only re-parse JPEG images using a reduced `scanLines` value when *both* of the following hold: an unexpected EOI (End of Image) marker was encountered during decoding of the Scan data, *and* the "actual" `scanLines` value is at least one order of magnitude smaller than expected.
In some cases PDF documents can contain JPEG images that the native browser decoder cannot handle, e.g. images with DNL (Define Number of Lines) markers or images where the SOF (Start of Frame) marker contains a wildly incorrect `scanLines` parameter.
Currently, for "simple" JPEG images, we're relying on native image decoding to *fail* before falling back to the implementation in `src/core/jpg.js`. In some cases, note e.g. issue 10880, the native image decoder doesn't outright fail and thus some images may not render.
In an attempt to improve the current situation, this patch adds additional validation of the JPEG image SOF data to force the use of `src/core/jpg.js` directly in cases where the native JPEG decoder cannot be trusted to do the right thing.
The only way to implement this is unfortunately to parse the *beginning* of the JPEG image data, looking for a SOF marker. To limit the impact of this extra parsing, the result is cached on the `JpegStream` instance and this code is only run for images which passed all of the pre-existing "can the JPEG image be natively rendered and/or decoded" checks.
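A rough sketch of that parsing (marker handling is simplified; the actual validation checks more than this):
```
// Scan the beginning of the JPEG data for a SOF (Start of Frame) marker
// and extract the frame parameters from its segment.
function scanForSofMarker(data) {
  let pos = 2; // Skip the SOI (Start of Image) marker, 0xFFD8.
  while (pos + 1 < data.length) {
    if (data[pos] !== 0xff) {
      pos++; // Tolerate "junk" bytes between marker segments.
      continue;
    }
    const marker = data[pos + 1];
    if (marker >= 0xc0 && marker <= 0xc3) {
      // SOFn segment layout: length(2), precision(1), scanLines(2),
      // samplesPerLine(2), ...
      return {
        progressive: marker === 0xc2,
        scanLines: (data[pos + 5] << 8) | data[pos + 6],
        samplesPerLine: (data[pos + 7] << 8) | data[pos + 8],
      };
    }
    // Skip the current segment using its 16-bit length field
    // (simplification: a few markers, e.g. RSTn, have no length field).
    pos += 2 + ((data[pos + 2] << 8) | data[pos + 3]);
  }
  return null; // No SOF marker found.
}
```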
---
*Slightly off-topic:* Working on this *really* makes me start questioning if native rendering/decoding of JPEG images is actually a good idea.
There are certain kinds of JPEG images not supported natively, and all of the validation which is now necessary isn't "free". At this point, in the `NativeImageDecoder`, we're having to check for certain properties in the image dictionary, parse the `ColorSpace`, and finally read the actual image data to find the SOF marker.
Furthermore, in the "JpegStream" case we cannot just send the image to the main-thread and be done with it; we also need to wait for rendering to complete (or fail) before continuing with other parsing.
In the "JpegDecode" case we're even having to parse part of the image on the main-thread, which seems completely at odds with the principle of doing all heavy parsing in the Worker, and there are also a couple of potentially large (temporary) allocations/copies of TypedArray data involved as well.
Given that this is completely unused, and that a "normal" function call may be a *tiny* bit more efficient, there's no good reason as far as I can tell to keep it.
Please note that these changes do *not* affect the *public* interface of the `Metadata` class, but only touches internal structures.[1]
These changes were prompted by looking at the `getAll` method, which simply returns the "private" metadata object to the consumer. This seems wrong conceptually, since it allows way too easy/accidental changes to the internal parsed metadata.
As part of fixing this, the internal metadata was changed to use a `Map` rather than a plain Object.
---
[1] Basically, we shouldn't need to worry about someone depending on internal implementation details.
- Remove the "capturing group" in the regular expression that removes leading "junk" from the raw metadata, since it's not necessary here (it's simply a case of too much copy-pasting in a prior patch).
According to [MDN](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Cheatsheet#Groups_and_ranges), "capturing groups" should be avoided, for performance reasons, unless actually needed.
- Add inline comments to document a bunch of magic values in the code.
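As a rough illustration of the `Map`-based internals mentioned above (simplified; the method bodies are assumptions, and the public interface is unchanged):
```
class Metadata {
  constructor(parsedData) {
    // e.g. [["dc:title", "..."], ...]
    this._metadataMap = new Map(parsedData);
  }

  get(name) {
    return this._metadataMap.has(name) ? this._metadataMap.get(name) : null;
  }

  getAll() {
    // Return a *copy*, so that consumers cannot accidentally modify the
    // internal parsed metadata.
    return Object.fromEntries(this._metadataMap);
  }

  has(name) {
    return this._metadataMap.has(name);
  }
}
```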
Given that all `TypedArray` polyfills were removed in PDF.js version `2.0`, since native support is now required, this branch has been dead code for a while.
In practice it's extremely rare[1] for the padding to be zero in *all* components, hence it seems better to just set it directly rather than creating a temporary variable and checking for the "no padding"-case.
---
[1] In the `tracemonkey.pdf` file, that only happens for `0.08%` of all text elements.
Over the years there's been a fair number of issues/PRs opened, where people have wanted to add `hasOwnProperty` checks in (hot) loops in the font parsing code. This has always been rejected, since we don't want to risk reducing performance in the Firefox PDF viewer simply because some users of the general PDF.js library are *incorrectly* extending the `Array.prototype` with enumerable properties.
With this patch the general PDF.js library will now fail immediately with a hopefully useful Error message, rather than having (some) fonts fail to render, when the `Array.prototype` is incorrectly extended.
Note that I did consider making this a warning, but ultimately decided against it since it's, first of all, possible to disable warnings (with the `verbosity` parameter). Secondly, even when printed, warnings can be easy to overlook; finally, a warning may also *seem* OK to ignore (as opposed to an actual Error).
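The check could look something like this (a sketch; the message wording is approximate):
```
function checkArrayPrototype() {
  const enumerableProperties = [];
  for (const key in []) {
    // Any key found here comes from an (incorrectly) extended prototype.
    enumerableProperties.push(key);
  }
  if (enumerableProperties.length > 0) {
    throw new Error(
      "The `Array.prototype` contains unexpected enumerable properties: " +
        enumerableProperties.join(", ") +
        "; thus breaking e.g. `for...in` iteration of `Array`s."
    );
  }
}
```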
This patch makes the following changes, to improve these API methods:
- Let `PDFPageProxy.cleanup` return a boolean indicating if clean-up actually happened, since ongoing rendering will block clean-up.
Besides being used in other parts of this patch, it seems that an API user may also be interested in the return value given that clean-up isn't *guaranteed* to happen.
- Let `PDFDocumentProxy.cleanup` return the promise indicating when clean-up is finished.
- Improve the JSDoc comment for `PDFDocumentProxy.cleanup` to mention that clean-up is triggered on *both* threads (without going into unnecessary specifics regarding what *exactly* said clean-up entails).
Add a note in the JSDoc comment about not calling this method when rendering is ongoing.
- Change `WorkerTransport.startCleanup` to throw an `Error` if it's called when rendering is ongoing, to prevent rendering from breaking.
Please note that this won't stop *worker-thread* clean-up from happening (since there's no general "something is rendering"-flag), however I'm not sure if that's really a problem; but please don't quote me on that :-)
All of the caches that are being cleared in `Catalog.cleanup`, on the worker-thread, *should* be re-filled automatically even if cleared *during* parsing/rendering, and the only likely consequence is that e.g. font data would have to be re-parsed.
On the main-thread, on the other hand, clearing the caches is more-or-less guaranteed to cause rendering errors, since the rendering code in `src/display/canvas.js` isn't able to re-request any image/font data that's suddenly being pulled out from under it.
- Last, but not least, add a couple of basic unit-tests for the clean-up functionality.
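A hedged usage sketch, assuming a `pdfDocument` obtained via `getDocument`, based on the semantics described above:
```
const page = await pdfDocument.getPage(1);

const cleanedUp = page.cleanup();
if (!cleanedUp) {
  // Ongoing rendering blocked the page clean-up.
}

// Resolves once clean-up has finished on both the main- and worker-threads.
await pdfDocument.cleanup();
```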
While it would be nice to change the `PDFFormatVersion` property, as returned through `PDFDocumentProxy.getMetadata`, to a number (rather than a string) that would unfortunately be a breaking API change.
However, it does seem like a good idea to at least *validate* the PDF header version on the worker-thread, rather than potentially returning an arbitrary string.
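Such validation might look as follows (the regular expression is an assumption):
```
const PDF_HEADER_VERSION_REGEXP = /^[1-9]\.\d$/;

function validatePdfVersion(version) {
  // Only return the version when it looks like a sane "X.Y" PDF header
  // version; otherwise signal that it's unknown/invalid.
  return typeof version === "string" && PDF_HEADER_VERSION_REGEXP.test(version)
    ? version
    : null;
}
```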
This should hopefully be useful in environments where restrictive CSPs are in effect.
In most cases the replacement is entirely straightforward, and there are only a couple of special cases:
 - For the `src/display/font_loader.js` and `web/pdf_outline_viewer.js` cases, since the elements aren't appended to the document yet, it shouldn't matter if the style properties are set one-by-one rather than all at once.
- For the `web/debugger.js` case, there's really no need to set the `padding` inline at all and the definition was simply moved to `web/viewer.css` instead.
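For the common case the change looks roughly like this (hypothetical element; property assignment via the CSSOM, unlike an inline `style` attribute, isn't blocked by restrictive `style-src` directives):
```
const element = document.createElement("div");

// Before: an inline `style` attribute, which a restrictive CSP blocks.
element.setAttribute("style", "width: 100px; height: 50px;");

// After: set the properties individually via the CSSOM instead.
element.style.width = "100px";
element.style.height = "50px";
```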
*Please note:* There's still *a single* case left, in `web/toolbar.js` for setting the width of the zoom dropdown, which is left intact for now.
The reasons are that this particular case shouldn't matter for users of the general PDF.js library, and that it'd make a lot more sense to just try and re-factor that very old code anyway (thus fixing the `setAttribute` usage in the process).
Based on the PDF specification, with the `v` operator the current point should be used as the first control point of the curve.
Do not overwrite the current point before the SVG curve is built, so that it can actually be used as the first control point.
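A simplified sketch of the fix (the real code builds the SVG path incrementally):
```
// For the `v` operator, build the cubic Bézier *before* updating the
// current point, so that (x, y) really is used as the first control point.
function curveTo2(x, y, args) {
  // `args` holds [x2, y2, x3, y3]; the first control point is (x, y).
  const bezier = `C ${x} ${y} ${args[0]} ${args[1]} ${args[2]} ${args[3]}`;
  // Only now move the current point to the end point of the curve.
  return { bezier, x: args[2], y: args[3] };
}
```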
As can be seen in the code, the `xScaleBlockOffset` typed array doesn't depend on the actual image data but only on the width and x-scale. The width is obviously consistent for an image, and it turns out that in practice the `componentScaleX` is quite often identical between two (or more) adjacent image components.
All-in-all it's thus not necessary to *unconditionally* re-compute the `xScaleBlockOffset` when getting the JPEG image data.
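Sketched out, the caching amounts to something like this (the offset computation is reduced to its essence):
```
function fillScaleBlockOffsets(components, width) {
  const xScaleBlockOffset = new Uint32Array(width);
  let lastComponentScaleX = 0;

  for (const component of components) {
    const componentScaleX = component.scaleX;
    // Only re-compute the offsets when the x-scale actually changed,
    // which is quite often *not* the case for adjacent components.
    if (componentScaleX !== lastComponentScaleX) {
      for (let x = 0; x < width; x++) {
        xScaleBlockOffset[x] = (x * componentScaleX) | 0;
      }
      lastComponentScaleX = componentScaleX;
    }
    // ... copy the component data using `xScaleBlockOffset` ...
  }
}
```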
While avoiding, in many cases, one or more loops can never be a bad thing, these changes are unfortunately completely dominated by the rest of the JpegImage code and consequently don't really show up in benchmark results. *Hence I'd understand if this patch is ultimately deemed not necessary.*
*Note:* This is inspired by PR 5473, which made similar changes for another kind of JPEG data.
Since the implementation in `src/core/jpg.js` only supports 8-bit data, as opposed to the similar code in `src/core/colorspace.js`, the computations can be further simplified because the `scale` is always constant.
By updating the coefficients, effectively inlining the `scale`, we'll thus avoid *four* multiplications for each loop iteration.
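A simplified illustration, with made-up coefficients (*not* the real CMYK to RGB math):
```
// Before: one multiplication by `scale` per component, per iteration.
function convertBefore(c, m, y, k) {
  const scale = 1 / 255;
  return 255 +
    c * scale * -122.67 + m * scale * -10.57 +
    y * scale * -15.51 + k * scale * -88.24;
}

// After: fold the constant scale into the coefficients up front, thus
// avoiding four multiplications for each loop iteration.
const C1 = -122.67 / 255, C2 = -10.57 / 255,
      C3 = -15.51 / 255, C4 = -88.24 / 255;
function convertAfter(c, m, y, k) {
  return 255 + c * C1 + m * C2 + y * C3 + k * C4;
}
```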
Unfortunately I wasn't able, based on a quick look through the test-files, to find a sufficiently *large* CMYK JPEG image for these changes to really show up in benchmark results. However, when testing the `cmykjpeg.pdf` file manually there's a total of `120 000` fewer multiplications with this patch.