Sakurai/pdf.js

Author	SHA1	Message	Date
Jonas Jenwald	18e0b10d3c	[api-minor] Remove the `disableCreateObjectURL` option from the `getDocument` parameters, since it's now unused in the API With the changes in previous patches, the `disableCreateObjectURL` option/functionality is no longer used for anything in the API and/or in the Worker code. Note however that there's some functionality, mainly related to file loading/downloading, in the GENERIC version of the default viewer which still depends on this option. Hence the `disableCreateObjectURL` option (and related compatibility code) is moved into the viewer, see e.g. `web/app_options.js`, such that it's still available in the default viewer.	2020-05-22 00:22:48 +02:00
Jonas Jenwald	0351852d74	[api-minor] Decode all JPEG images with the built-in PDF.js decoder in `src/core/jpg.js` Currently some JPEG images are decoded by the built-in PDF.js decoder in `src/core/jpg.js`, while others attempt to use the browser JPEG decoder. This inconsistency seems unfortunate for a number of reasons: - It adds, compared to the other image formats supported in the PDF specification, a fair amount of code/complexity to the image handling in the PDF.js library. - The PDF specification supports JPEG images with features, e.g. certain ColorSpaces, that browsers are unable to decode natively. Hence, determining if a JPEG image is possible to decode natively in the browser requires a non-trivial amount of parsing. In particular, we're parsing (part of) the raw JPEG data to extract certain marker data and we also need to parse the ColorSpace for the JPEG image. - While some JPEG images may, for all intents and purposes, appear to be natively supported there's still cases where the browser may fail to decode some JPEG images. In order to support those cases, we've had to implement a fallback to the PDF.js JPEG decoder if there's any issues during the native decoding. This also means that it's no longer possible to simply send the JPEG image to the main-thread and continue parsing, but you now need to actually wait for the main-thread to indicate success/failure first. In practice this means that there's a code-path where the worker-thread is forced to wait for the main-thread, while the reverse should always be the case. - The native decoding, for anything except the simplest of JPEG images, results in increased peak memory usage because there's a handful of short-lived copies of the JPEG data (see PR 11707). Furthermore this also leads to data being parsed on the main-thread, rather than the worker-thread, which you usually want to avoid for e.g. performance and UI-responsiveness reasons. - Not all environments, e.g. Node.js, fully support native JPEG decoding. This has, historically, led to some issues and support requests. - Different browsers may use different JPEG decoders, possibly leading to images being rendered slightly differently depending on the platform/browser where the PDF.js library is used. Originally the implementation in `src/core/jpg.js` was unable to handle all of the JPEG images in the test-suite, but over the last couple of years I've fixed (hopefully) all of those issues. At this point in time, there's two kinds of failure with this patch: - Changes which are basically imperceptible to the naked eye, where some pixels in the images are essentially off-by-one (in all components), which could probably be attributed to things such as different rounding behaviour in the browser/PDF.js JPEG decoder. This type of "failure" accounts for the vast majority of the total number of changes in the reference tests. - Changes where the JPEG images now look ever so slightly blurrier than with the native browser decoder. For quite some time I've just assumed that this pointed to a general deficiency in the `src/core/jpg.js` implementation, however I've discovered when comparing two viewers side-by-side that the differences vanish at higher zoom levels (usually around 200% is enough). Basically if you disable [this downscaling in canvas.js](`8fb82e939c/src/display/canvas.js (L2356-L2395)`), which is what happens when zooming in, the differences simply vanish! Hence I'm pretty satisfied that there's no significant problems with the `src/core/jpg.js` implementation, and the problems are rather tied to the general quality of the downscaling algorithm used. It could even be seen as a positive that all images now share the same downscaling behaviour, since this actually fixes one old bug; see issue 7041.	2020-05-22 00:22:48 +02:00
Jonas Jenwald	dda6626f40	Attempt to cache repeated images at the document, rather than the page, level (issue 11878) Currently image resources, as opposed to e.g. font resources, are handled exclusively on a page-specific basis. Generally speaking this makes sense, since pages are separate from each other, however there's PDF documents where many (or even all) pages actually reference exactly the same image resources (through the XRef table). Hence, in some cases, we're decoding the same images over and over for every page which is obviously slow and wasting both CPU and memory resources better used elsewhere.[1] Obviously we cannot simply treat all image resources as-if they're used throughout the entire PDF document, since that would end up increasing memory usage too much.[2] However, by introducing a `GlobalImageCache` in the worker we can track image resources that appear on more than one page. Hence we can switch image resources from being page-specific to being document-specific, once the image resource has been seen on more than a certain number of pages. In many cases, such as e.g. the referenced issue, this patch will thus lead to reduced memory usage for image resources. Scrolling through all pages of the document, there's now only a few main-thread copies of the same image data, as opposed to one for each rendered page (i.e. there could theoretically be twenty copies of the image data). While this obviously benefits both CPU and memory usage in this case, for very large image data this patch may possibly increase persistent main-thread memory usage a tiny bit. Thus to avoid negatively affecting memory usage too much in general, particularly on the main-thread, the `GlobalImageCache` will only cache a certain number of image resources at the document level and simply fall back to the default behaviour. Unfortunately the asynchronous nature of the code, with ranged/streamed loading of data, actually makes all of this much more complicated than if all data could be assumed to be immediately available.[3] Please note: The patch will lead to small movement in some existing test-cases, since we're now using the built-in PDF.js JPEG decoder more. This was done in order to simplify the overall implementation, especially on the main-thread, by limiting it to only the `OPS.paintImageXObject` operator. --- [1] There's e.g. PDF documents that use the same image as background on all pages. [2] Given that data stored in the `commonObjs`, on the main-thread, are only cleared manually through `PDFDocumentProxy.cleanup`. This as opposed to data stored in the `objs` of each page, which is automatically removed when the page is cleaned-up e.g. by being evicted from the cache in the default viewer. [3] If the latter case were true, we could simply check for repeat images before parsing started and thus avoid handling any duplicate image resources.	2020-05-21 18:13:45 +02:00
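The promotion idea in commit dda6626f40 can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the class name `GlobalImageCacheSketch`, its method names, and the `minPagesToCache`/`maxImagesToCache` thresholds are hypothetical and do not reproduce the actual `GlobalImageCache` in PDF.js.

```javascript
// Hypothetical sketch, not the PDF.js implementation: promote an image from
// page-level to document-level caching once enough distinct pages use it.
class GlobalImageCacheSketch {
  constructor({ minPagesToCache = 2, maxImagesToCache = 10 } = {}) {
    this.minPagesToCache = minPagesToCache; // assumed promotion threshold
    this.maxImagesToCache = maxImagesToCache; // assumed cap on cached images
    this.pagesByRef = new Map(); // image reference -> Set of page indexes
    this.dataByRef = new Map(); // image reference -> cached image data
  }

  // Record that the image `ref` was encountered while parsing `pageIndex`.
  addPageIndex(ref, pageIndex) {
    let pages = this.pagesByRef.get(ref);
    if (!pages) {
      pages = new Set();
      this.pagesByRef.set(ref, pages);
    }
    pages.add(pageIndex);
  }

  // True once the image was seen on enough pages and the cap allows it.
  shouldCache(ref) {
    const pages = this.pagesByRef.get(ref);
    if (!pages || pages.size < this.minPagesToCache) {
      return false;
    }
    return (
      this.dataByRef.has(ref) || this.dataByRef.size < this.maxImagesToCache
    );
  }

  getData(ref) {
    return this.dataByRef.get(ref) ?? null;
  }

  // Returns false when the caller should keep the page-level behaviour.
  setData(ref, data) {
    if (!this.shouldCache(ref)) {
      return false;
    }
    this.dataByRef.set(ref, data);
    return true;
  }
}
```

The cap is what keeps persistent memory bounded in this sketch: once it is reached, further images are simply handled per page, mirroring the fallback the commit message describes.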
Brendan Dahl	b1be33c96f	Add more categories of unsupported features. Fixes #11815	2020-05-04 11:02:16 -07:00
Jonas Jenwald	911c33f025	Move the `maybeValidDimensions` check, used with JPEG images, to occur earlier (PR 11523 follow-up) Given that the `NativeImageDecoder.{isSupported, isDecodable}` methods require both dictionary lookups and ColorSpace parsing, in hindsight it actually seems more reasonable to do the `JpegStream.maybeValidDimensions` checks first.	2020-04-26 12:07:46 +02:00
Jonas Jenwald	1cc3dbb694	Enable the `dot-notation` ESLint rule Please note: These changes were done automatically, using the `gulp lint --fix` command. This rule is already enabled in mozilla-central, see https://searchfox.org/mozilla-central/rev/567b68b8ff4b6d607ba34a6f1926873d21a7b4d7/tools/lint/eslint/eslint-plugin-mozilla/lib/configs/recommended.js#103-104 The main advantage, besides improved consistency, of this rule is that it reduces the size of the code (by 3 bytes for each case). In the PDF.js code-base there's close to 8000 instances being fixed by the `dot-notation` ESLint rule, which end up reducing the size of even the built files significantly; the total size of the `gulp mozcentral` build target changes from `3 247 456` to `3 224 278` bytes, which is a reduction of `23 178` bytes (or ~0.7%) for a completely mechanical change. A large number of these changes affect the (large) lookup tables used on the worker-thread, but given that they are still initialized lazily I don't think that the new formatting this patch introduces should undo any of the improvements from PR 6915. Please find additional details about the ESLint rule at https://eslint.org/docs/rules/dot-notation	2020-04-17 12:24:46 +02:00
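The size figures quoted in commit 1cc3dbb694 can be checked with a few lines of arithmetic. The property access `table["space"]` is a made-up example; only the byte counts come from the commit message.

```javascript
// Bracket notation versus dot notation for the same (made-up) property access.
const bracketForm = 'table["space"]';
const dotForm = "table.space";
const savedPerCase = bracketForm.length - dotForm.length; // [, ", ", ] vs .

// Build sizes of the `gulp mozcentral` target, as quoted in the commit message.
const sizeBefore = 3247456;
const sizeAfter = 3224278;
const reduction = sizeBefore - sizeAfter;
const reductionPercent = ((reduction / sizeBefore) * 100).toFixed(2);

console.log(savedPerCase, reduction, reductionPercent);
```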
Jonas Jenwald	44b4a74f48	A couple of small `String.fromCodePoint` improvements (PR 11698 and 11769 follow-up) - Add a reduced test-case for issue 11768, to prevent future regressions. (Given that PR 11769 is only a work-around, rather than a proper solution, it may not be entirely accurate for the issue to be closed as fixed.) - Add more validation of the charCode, as found by the heuristics, in `PartialEvaluator._buildSimpleFontToUnicode` to prevent future issues.	2020-04-15 13:45:08 +02:00
Jonas Jenwald	426945b480	Update Prettier to version 2.0 Please note that these changes were done automatically, using `gulp lint --fix`. Given that the major version number was increased, there's a fair number of (primarily whitespace) changes; please see https://prettier.io/blog/2020/03/21/2.0.0.html In order to reduce the size of these changes somewhat, this patch maintains the old "arrowParens" style for now (once mozilla-central updates Prettier we can simply choose the same formatting, assuming it will differ here).	2020-04-14 12:28:14 +02:00
Jonas Jenwald	2d46230d23	[api-minor] Change `Font.exportData` to, by default, stop exporting properties which are completely unused on the main-thread and/or in the API (PR 11773 follow-up) For years now, the `Font.exportData` method has (because of its previous implementation) been exporting many properties despite them being completely unused on the main-thread and/or in the API. This is unfortunate, since among those properties there's a number of potentially very large data-structures, containing e.g. Arrays and Objects, which thus have to be first structured cloned and then stored on the main-thread. With the changes in this patch, we'll thus by default save memory for every `Font` instance created (there can be a lot in longer documents). The memory savings obviously depend a lot on the actual font data, but some approximate figures are: For non-embedded fonts it can save a couple of kilobytes, for simple embedded fonts a handful of kilobytes, and for composite fonts the size of this auxiliary data can even be larger than the actual font program itself. All-in-all, there's no good reason to keep exporting these properties by default when they're unused. However, since we cannot be sure that every property is unused in custom implementations of the PDF.js library, this patch adds a new `getDocument` option (named `fontExtraProperties`) that still allows access to the following properties: - "cMap": An internal data structure, only used with composite fonts and never really intended to be exposed on the main-thread and/or in the API. Note also that the `CMap`/`IdentityCMap` classes are a lot more complex than simple Objects, but only their "internal" properties survive the structured cloning used to send data to the main-thread. Given that CMaps can often be very large, not exporting them can also save a fair bit of memory. - "defaultEncoding": An internal property used with simple fonts, and used when building the glyph mapping on the worker-thread. Considering how complex that topic is, and given that not all font types are handled identically, exposing this on the main-thread and/or in the API most likely isn't useful. - "differences": An internal property used with simple fonts, and used when building the glyph mapping on the worker-thread. Considering how complex that topic is, and given that not all font types are handled identically, exposing this on the main-thread and/or in the API most likely isn't useful. - "isSymbolicFont": An internal property, used during font parsing and building of the glyph mapping on the worker-thread. - "seacMap": An internal map, only potentially used with some Type1/CFF fonts and never intended to be exposed in the API. The existing `Font.{charToGlyph, charToGlyphs}` functionality already takes this data into account when handling text. - "toFontChar": The glyph map, necessary for mapping characters to glyphs in the font, which is built upon the various encoding information contained in the font dictionary and/or font program. This is not directly used on the main-thread and/or in the API. - "toUnicode": The unicode map, necessary for text-extraction to work correctly, which is built upon the ToUnicode/CMap information contained in the font dictionary, but not directly used on the main-thread and/or in the API. - "vmetrics": An array of width data used with fonts which are composite and vertical, but not directly used on the main-thread and/or in the API. - "widths": An array of width data used with most fonts, but not directly used on the main-thread and/or in the API.	2020-04-06 11:47:09 +02:00
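The filtering that commit 2d46230d23 describes could look roughly like the sketch below. The property names are the nine listed in the commit message; the function `exportFontDataSketch` and the shape of the `font` object are hypothetical and are not the actual `Font.exportData` code.

```javascript
// Properties named in the commit message as unused on the main-thread by default.
const EXTRA_FONT_PROPERTIES = new Set([
  "cMap",
  "defaultEncoding",
  "differences",
  "isSymbolicFont",
  "seacMap",
  "toFontChar",
  "toUnicode",
  "vmetrics",
  "widths",
]);

// Hypothetical helper: copy a font's own properties, skipping the "extra"
// ones unless the caller opted in (mirroring the `fontExtraProperties` idea).
function exportFontDataSketch(font, fontExtraProperties = false) {
  const data = Object.create(null);
  for (const key of Object.keys(font)) {
    if (!fontExtraProperties && EXTRA_FONT_PROPERTIES.has(key)) {
      continue; // not sent to the main-thread, which saves memory
    }
    data[key] = font[key];
  }
  return data;
}
```

With the default of `false`, large structures such as `widths` or `toUnicode` are never cloned; passing `true` restores the old behaviour for custom implementations that still rely on them.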
Jonas Jenwald	2619272d73	Change the signature of `TranslatedFont`, and convert it to a proper class In preparation for the next patch, this changes the signature of `TranslatedFont` to take an object rather than individual parameters. This also, in my opinion, makes the call-sites easier to read since it essentially provides a small bit of documentation of the arguments. Finally, since it was necessary to touch `TranslatedFont` anyway it seemed like a good idea to also convert it to a proper `class`.	2020-04-05 20:53:48 +02:00
Jonas Jenwald	59f54b946d	Ensure that all `Font` instances have the `vertical` property set to a boolean Given that the `vertical` property is always accessed on the main-thread, ensuring that the property is explicitly defined seems like the correct thing to do since it also avoids boolean casting elsewhere in the code-base.	2020-04-05 16:27:50 +02:00
Jonas Jenwald	dcb16af968	Whitelist closure related cases to address the remaining `no-shadow` linting errors Given the way that "classes" were previously implemented in PDF.js, using regular functions and closures, there's a fair number of false positives when the `no-shadow` ESLint rule is enabled. Note that while some of these `eslint-disable` statements can be removed if/when the relevant code is converted to proper `class`es, we'll probably never be able to get rid of all of them given our naming/coding conventions (however I don't really see this being a problem).	2020-03-25 11:57:12 +01:00
Jonas Jenwald	216cbca16c	Remove variable shadowing from the JavaScript files in the `src/core/` folder This is part of a series of patches that will try to split PR 11566 into smaller chunks, to make reviewing more feasible. Once all the code has been fixed, we'll be able to eventually enable the ESLint no-shadow rule; see https://eslint.org/docs/rules/no-shadow	2020-03-23 18:28:30 +01:00
Jonas Jenwald	1cd9d5a8fd	Remove the unused `wideChars` property on `Font` instances This property was added in PR 1599 (almost eight years ago), but has been unused ever since PR 3674 (six and a half years ago).	2020-03-20 10:37:32 +01:00
Jonas Jenwald	15e8692eff	Don't accidentally accept invalid glyphNames which appear to follow the Cdd{d}/cdd{d} format in `PartialEvaluator._buildSimpleFontToUnicode` (issue 11697) The /Differences array of the problematic font contains a `/c.1` entry, which is consequently detected as a possible Cdd{d}/cdd{d} glyphName by the existing heuristics. Because of how the base 10 conversion is implemented, which is necessary for the base 16 special case, the parsed charCode becomes `0.1` thus causing `String.fromCodePoint` to throw since that obviously isn't a valid code point. To fix the referenced issue, and to hopefully prevent similar ones in the future, the patch adds additional validation of the charCode found by the heuristics.	2020-03-13 23:35:47 +01:00
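The failure mode from issue 11697 and the kind of validation added in commit 15e8692eff can be reproduced in isolation. The helper names below are hypothetical, and the checks are a sketch of the idea rather than the code in `PartialEvaluator._buildSimpleFontToUnicode`.

```javascript
// Hypothetical helper: parse glyph names of the Cdd{d}/cdd{d} form and
// return -1 for anything that is not a usable code point.
function charCodeFromCddGlyphName(glyphName) {
  if (glyphName.length < 3 || glyphName.length > 4) {
    return -1;
  }
  if (glyphName[0] !== "C" && glyphName[0] !== "c") {
    return -1;
  }
  const code = +glyphName.substring(1); // "c.1" becomes 0.1 here
  // Reject NaN, fractions, negatives and values outside the Unicode range,
  // since `String.fromCodePoint` throws a RangeError for all of them.
  if (!Number.isInteger(code) || code < 0 || code > 0x10ffff) {
    return -1;
  }
  return code;
}

// Only call `String.fromCodePoint` with a validated value.
function unicodeFromCddGlyphName(glyphName) {
  const code = charCodeFromCddGlyphName(glyphName);
  return code >= 0 ? String.fromCodePoint(code) : "";
}
```

Without the `Number.isInteger` check, the `/c.1` entry from the problematic font would reach `String.fromCodePoint(0.1)` and abort the whole conversion.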
Jonas Jenwald	65e6ea2cb2	Prevent lookup errors in `PartialEvaluator.hasBlendModes` from breaking all parsing/rendering of a page (issue 11678) The PDF document in question is corrupt, since it contains an XObject with a truncated dictionary and where the stream contents start without a "stream" operator.	2020-03-09 12:00:12 +01:00
Tim van der Meij	1a97c142b3	Merge pull request #11523 from Snuffleupagus/issue-10880 Add a heuristic, in `src/core/jpg.js`, to handle JPEG images with a wildly incorrect SOF (Start of Frame) `scanLines` parameter (issue 10880)	2020-03-06 23:03:09 +01:00
Jonas Jenwald	160cfc4084	Slightly simplify the lookup of data in `Dict.{get, getAsync, has}` Note that `Dict.set` will only be called with values returned through `Parser.getObj`, and thus indirectly via `Lexer.getObj`. Since neither of those methods will ever return `undefined`, we can simply assert that that's the case when inserting data into the `Dict` and thus get rid of `in` checks when doing the data lookups. In this case, since `Dict.set` is fairly hot, the patch utilizes an inline check and when necessary a direct call to `unreachable` to not affect performance of `gulp server/test` too much (rather than always just calling `assert`). For very large and complex PDF files this will help performance slightly, since `Dict.{get, getAsync, has}` is called a lot during parsing in the worker. This patch was tested using the PDF file from issue 2618, i.e. http://bugzilla-attachments.gnome.org/attachment.cgi?id=226471, with the following manifest file: ``` [ { "id": "issue2618", "file": "../web/pdfs/issue2618.pdf", "md5": "", "rounds": 250, "type": "eq" } ] ``` which gave the following results when comparing this patch against the `master` branch: ``` -- Grouped By browser, stat -- browser \| stat \| Count \| Baseline(ms) \| Current(ms) \| +/- \| % \| Result(P<.05) ------- \| ------------ \| ----- \| ------------ \| ----------- \| --- \| ----- \| ------------- Firefox \| Overall \| 250 \| 2838 \| 2820 \| -18 \| -0.65 \| faster Firefox \| Page Request \| 250 \| 1 \| 2 \| 0 \| 11.92 \| slower Firefox \| Rendering \| 250 \| 2837 \| 2818 \| -19 \| -0.65 \| faster ```	2020-03-06 14:12:14 +01:00
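The reasoning in commit 160cfc4084 is that rejecting `undefined` on insertion lets the lookups drop their `in` checks. A minimal sketch of that trade-off follows; `DictSketch` is a hypothetical stand-in, not the real `Dict` class, and it throws directly where PDF.js would call `unreachable`.

```javascript
// Hypothetical stand-in for `Dict`: validate once in `set`, so that
// `get`/`has` can use plain property reads without `key in map` checks.
class DictSketch {
  constructor() {
    this._map = Object.create(null); // no prototype, so no inherited keys
  }

  set(key, value) {
    // Inline check on the hot path; only the failure case pays for the throw.
    if (value === undefined) {
      throw new Error('DictSketch.set: The "value" cannot be undefined.');
    }
    this._map[key] = value;
  }

  get(key) {
    return this._map[key]; // `undefined` can now only mean "missing"
  }

  has(key) {
    return this._map[key] !== undefined;
  }
}
```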
Jonas Jenwald	65e514e063	Ensure that there's always a setFont (Tf) operator before text rendering operators (issue 11651) The PDF document in question is corrupt, since it contains multiple instances of incorrect operators. We obviously don't want to slow down parsing of all documents (since most are valid), just to accommodate a particular bad PDF generator, hence the reason for the inline check before calling the `ensureStateFont` method.	2020-03-03 10:05:18 +01:00
Jonas Jenwald	c55d30a715	Use the same non-embedded Wingdings fallback for fonts named "Wingdings-Regular" too (PR 5463 follow-up, issue 11451) This patch extends the existing heuristics, which are really the best that we can do in general for these kinds of non-embedded and non-standard fonts. Furthermore, this patch also tries to improve the copy-and-paste behaviour for non-embedded Wingdings fonts by also using the `ZapfDingbatsEncoding` in this case. Note: I'm not sure that adding additional tests for Wingdings fonts matters that much, given how limited our "support" for them really is.	2020-02-24 17:40:06 +01:00
Jonas Jenwald	5494f7d5bc	Add basic validation of the `scanLines` parameter in JPEG images, before delegating decoding to the browser In some cases PDF documents can contain JPEG images that the native browser decoder cannot handle, e.g. images with DNL (Define Number of Lines) markers or images where the SOF (Start of Frame) marker contains a wildly incorrect `scanLines` parameter. Currently, for "simple" JPEG images, we're relying on native image decoding to fail before falling back to the implementation in `src/core/jpg.js`. In some cases, note e.g. issue 10880, the native image decoder doesn't outright fail and thus some images may not render. In an attempt to improve the current situation, this patch adds additional validation of the JPEG image SOF data to force the use of `src/core/jpg.js` directly in cases where the native JPEG decoder cannot be trusted to do the right thing. The only way to implement this is unfortunately to parse the beginning of the JPEG image data, looking for a SOF marker. To limit the impact of this extra parsing, the result is cached on the `JpegStream` instance and this code is only run for images which passed all of the pre-existing "can the JPEG image be natively rendered and/or decoded" checks. --- Slightly off-topic: Working on this really makes me start questioning if native rendering/decoding of JPEG images is actually a good idea. There's certain kinds of JPEG images not supported natively, and all of the validation which is now necessary isn't "free". At this point, in the `NativeImageDecoder`, we're having to check for certain properties in the image dictionary, parse the `ColorSpace`, and finally read the actual image data to find the SOF marker. Furthermore, we cannot just send the image to the main-thread and be done in the "JpegStream" case, but we also need to wait for rendering to complete (or fail) before continuing with other parsing. In the "JpegDecode" case we're even having to parse part of the image on the main-thread, which seems completely at odds with the principle of doing all heavy parsing in the Worker, and there's also a couple of potentially large (temporary) allocations/copies of TypedArray data involved as well.	2020-02-22 14:16:07 +01:00
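Parsing the beginning of a JPEG stream for a SOF marker, as commit 5494f7d5bc describes, can be sketched as follows. The function name `readSofDimensions` is hypothetical, only SOF0 to SOF2 are recognised, and the segment walk is deliberately simplified; it is not the `JpegStream.maybeValidDimensions` implementation.

```javascript
// Hypothetical helper: walk the marker segments at the start of a JPEG
// stream and return the dimensions stored in the first SOF0-SOF2 header.
function readSofDimensions(bytes) {
  // A JPEG stream starts with the SOI marker (0xFFD8).
  if (bytes.length < 4 || bytes[0] !== 0xff || bytes[1] !== 0xd8) {
    return null;
  }
  let pos = 2;
  while (pos + 9 <= bytes.length) {
    if (bytes[pos] !== 0xff) {
      return null; // not positioned at a marker; give up in this sketch
    }
    const marker = bytes[pos + 1];
    const segmentLength = (bytes[pos + 2] << 8) | bytes[pos + 3];
    if (marker >= 0xc0 && marker <= 0xc2) {
      // After the marker: length (2 bytes), sample precision (1 byte),
      // number of lines (2 bytes), samples per line (2 bytes).
      return {
        scanLines: (bytes[pos + 5] << 8) | bytes[pos + 6],
        samplesPerLine: (bytes[pos + 7] << 8) | bytes[pos + 8],
      };
    }
    if (segmentLength < 2) {
      return null; // the length field includes its own two bytes
    }
    pos += 2 + segmentLength; // skip the marker and the whole segment
  }
  return null;
}
```

A caller could then compare `scanLines` with the height declared in the image dictionary and skip native decoding when the two disagree badly; the exact threshold for "wildly incorrect" is left open here.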
Tim van der Meij	61056a9238	Merge pull request #11551 from Snuffleupagus/issue-11549 Allow skipping of errors when reading broken/corrupt ToUnicode data (issue 11549)	2020-02-09 17:32:35 +01:00
Brendan Dahl	09a6e17d22	Merge pull request #11528 from janpe2/type1-nonemb-notdef Hide .notdef glyphs in non-embedded Type1 fonts and don't ignore Widths	2020-02-06 13:30:07 -08:00
Jonas Jenwald	4c54395ff6	Allow skipping of errors when reading broken/corrupt ToUnicode data (issue 11549) This will allow font loading/parsing to continue, rather than immediately failing, when broken/corrupt CMap data is encountered.	2020-01-30 13:19:05 +01:00
Tim van der Meij	cbbda9d883	Merge pull request #11515 from Snuffleupagus/cache-fallback-font Cache the fallback font dictionary on the `PartialEvaluator` (PR 11218 follow-up)	2020-01-25 21:32:28 +01:00
Jonas Jenwald	83bdb525a4	Fix remaining linting errors, from enabling the `prefer-const` ESLint rule globally This covers cases that the `--fix` command couldn't deal with, and in a few cases (notably `src/core/jbig2.js`) the code was changed to use block-scoped variables instead.	2020-01-25 00:20:23 +01:00
Jonas Jenwald	9e262ae7fa	Enable the ESLint `prefer-const` rule globally (PR 11450 follow-up) Please find additional details about the ESLint rule at https://eslint.org/docs/rules/prefer-const With the recent introduction of Prettier this sort of mass enabling of ESLint rules becomes a lot easier, since the code will be automatically reformatted as necessary to account for e.g. changed line lengths. Note that this patch is generated automatically, by using the ESLint `--fix` argument, and will thus require some additional clean-up (which is done separately).	2020-01-25 00:20:22 +01:00
Jani Pehkonen	809b96b40c	Hide .notdef glyphs in non-embedded Type1 fonts and don't ignore Widths Fixes #11403 The PDF uses the non-embedded Type1 font Helvetica. Character codes 194 and 160 (`Â` and `NBSP`) are encoded as `.notdef`. We shouldn't show those glyphs because it seems that Acrobat Reader doesn't draw glyphs that are named `.notdef` in fonts like this. In addition to testing `glyphName === ".notdef"`, we must test also `glyphName === ""` because the name `""` is used in `core/encodings.js` for undefined glyphs in encodings like `WinAnsiEncoding`. The solution above hides the `Â` characters but now the replacement character (space) appears to be too wide. I found out that PDF.js ignores font's `Widths` array if the font has no `FontDescriptor` entry. That happens in #11403, so the default widths of Helvetica were used as specified in `core/metrics.js` and `.notdef` got a width of 333. The correct width is 0 as specified by the `Widths` array in the PDF. Thus we must never ignore `Widths`.	2020-01-21 21:35:25 +02:00
Jonas Jenwald	9ab7c280aa	Cache the fallback font dictionary on the `PartialEvaluator` (PR 11218 follow-up) This way we'll benefit from the existing font caching, and can thus avoid re-creating a fallback font over and over again during parsing. (These changes necessitated the previous patch, since otherwise breakage could occur e.g. with fake workers.)	2020-01-16 15:12:05 +01:00
Jonas Jenwald	36881e3770	Ensure that all `import` and `require` statements, in the entire code-base, have a `.js` file extension In order to eventually get rid of SystemJS and start using native `import`s instead, we'll need to provide "complete" file identifiers since otherwise there'll be MIME type errors when attempting to use `import`.	2020-01-04 13:01:43 +01:00
Jonas Jenwald	a63f7ad486	Fix the linting errors, from the Prettier auto-formatting, that ESLint `--fix` couldn't handle This patch makes the following changes: - Remove no longer necessary inline `// eslint-disable-...` comments. - Fix `// eslint-disable-...` comments that Prettier moved down, thus causing new linting errors. - Concatenate strings which now fit on just one line. - Fix comments that are now too long. - Finally, and most importantly, adjust comments that Prettier moved down, since the new positions are often confusing or outright wrong.	2019-12-26 12:35:12 +01:00
Jonas Jenwald	de36b2aaba	Enable auto-formatting of the entire code-base using Prettier (issue 11444) Note that Prettier, purposely, has only limited [configuration options](https://prettier.io/docs/en/options.html). The configuration file is based on [the one in `mozilla central`](https://searchfox.org/mozilla-central/source/.prettierrc) with just a few additions (to avoid future breakage if the defaults ever change). Prettier is being used for a couple of reasons: - To be consistent with `mozilla-central`, where Prettier is already in use across the tree. - To ensure a consistent coding style everywhere, which is automatically enforced during linting (since Prettier is used as an ESLint plugin). This thus ends "all" formatting discussions once and for all, removing the need for review comments on most stylistic matters. Many ESLint options are now redundant, and I've tried my best to remove all the now unnecessary options (but I may have missed some). Note also that since Prettier considers the `printWidth` option as a guide, rather than a hard rule, this patch resorts to a small hack in the ESLint config to ensure that comments won't become too long. Please note: This patch is generated automatically, by appending the `--fix` argument to the ESLint call used in the `gulp lint` task. It will thus require some additional clean-up, which will be done in a separate commit. (On a more personal note, I'll readily admit that some of the changes Prettier makes are extremely ugly. However, in the name of consistency we'll probably have to live with that.)	2019-12-26 12:34:24 +01:00
Jonas Jenwald	835d8c2be5	Allow skipping of errors when parsing broken/unsupported ColorSpaces (issue 6707, issue 11287) This will allow us to attempt to recover as much as possible of a page, rather than immediately failing, when a broken/unsupported ColorSpace is encountered. This patch thus extends the framework added in PRs such as e.g. 8240 and 8922, to also cover parsing of ColorSpaces.	2019-11-01 09:01:24 +01:00
Jonas Jenwald	0496ea61f5	Ensure that `PartialEvaluator.hasBlendModes` handles Blend Modes in Arrays (PR 11281 follow-up) I completely overlooked this in PR 11281, but you obviously need to make similar changes in `PartialEvaluator.hasBlendModes` since it will otherwise ignore valid Blend Modes.	2019-10-28 11:37:05 +01:00
Jonas Jenwald	5c266f0e8c	Support Blend Modes which are specified in an Array of Names (issue 11279) According to the specification, the first supported Blend Mode should be chosen in this case; please see https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G10.4848607	2019-10-26 14:24:31 +02:00
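The "first supported Blend Mode" rule from commit 5c266f0e8c amounts to a short loop. In this sketch Blend Mode names are plain strings and the fallback to "Normal" is an assumption; PDF.js represents names as `Name` objects and its handling may differ. The function `pickBlendMode` is hypothetical.

```javascript
// The standard blend mode names defined by the PDF specification.
const KNOWN_BLEND_MODES = new Set([
  "Normal", "Compatible", "Multiply", "Screen", "Overlay", "Darken",
  "Lighten", "ColorDodge", "ColorBurn", "HardLight", "SoftLight",
  "Difference", "Exclusion", "Hue", "Saturation", "Color", "Luminosity",
]);

// Hypothetical helper: accept either a single name or an array of names,
// and return the first one that is supported.
function pickBlendMode(value, supported = KNOWN_BLEND_MODES) {
  const candidates = Array.isArray(value) ? value : [value];
  for (const name of candidates) {
    if (supported.has(name)) {
      return name;
    }
  }
  return "Normal"; // assumed fallback when nothing in the list is supported
}
```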
Tim van der Meij	ca3a58f93a	Consistently use `@returns` for returned data types in JSDoc comments Sometimes we also used `@return`, but `@returns` is what the JSDoc documentation recommends. Even though `@return` works as an alias, it's good to use the recommended syntax and to be consistent within the project.	2019-10-13 13:58:17 +02:00
Jonas Jenwald	bfcbf2d78d	Cache processed 'ExtGState's in `PartialEvaluator.hasBlendModes` to avoid unnecessary parsing/lookups This simply extends the already existing caching of processed resources to avoid duplicated parsing of 'ExtGState's, which should help with badly generated PDF documents. This patch was tested using the PDF file from issue 6961, i.e. https://github.com/mozilla/pdf.js/files/121712/test.pdf, with the following manifest file: ``` [ { "id": "issue6961", "file": "../web/pdfs/issue6961.pdf", "md5": "", "rounds": 200, "type": "eq" } ] ``` which gave the following overall results when comparing this patch against the `master` branch: ``` -- Grouped By browser, stat -- browser \| stat \| Count \| Baseline(ms) \| Current(ms) \| +/- \| % \| Result(P<.05) ------- \| ------------ \| ----- \| ------------ \| ----------- \| --- \| ----- \| ------------- Firefox \| Overall \| 400 \| 1063 \| 1051 \| -12 \| -1.17 \| faster Firefox \| Page Request \| 400 \| 552 \| 543 \| -9 \| -1.69 \| faster Firefox \| Rendering \| 400 \| 511 \| 508 \| -3 \| -0.61 \| ``` and the following page-specific results: ``` -- Grouped By page, stat -- page \| stat \| Count \| Baseline(ms) \| Current(ms) \| +/- \| % \| Result(P<.05) ---- \| ------------ \| ----- \| ------------ \| ----------- \| --- \| ----- \| ------------- 0 \| Overall \| 200 \| 1122 \| 1110 \| -12 \| -1.03 \| 0 \| Page Request \| 200 \| 552 \| 544 \| -8 \| -1.48 \| faster 0 \| Rendering \| 200 \| 570 \| 566 \| -4 \| -0.62 \| 1 \| Overall \| 200 \| 1005 \| 992 \| -13 \| -1.33 \| faster 1 \| Page Request \| 200 \| 552 \| 542 \| -11 \| -1.91 \| faster 1 \| Rendering \| 200 \| 452 \| 450 \| -3 \| -0.61 \| ```	2019-10-12 12:35:42 +02:00
Jonas Jenwald	af71f9b40a	Inline all the possible type checks in `PartialEvaluator.hasBlendModes` to avoid unnecessary function calls For badly generated PDF documents, with issue 6961 being one example, there's well over one hundred thousand function calls being made in total for just the two pages.	2019-10-12 11:24:37 +02:00
huzjakd	94171d9d72	Attempt to fallback to a default font, for non-available ones, in `PartialEvaluator.loadFont` This handles the two different ways that fonts can be loaded, either by Name (which is the common case) or by Reference. Furthermore, this also takes the `ignoreErrors` option into account when deciding whether to fallback or Error. Finally, by creating a minimal but valid Font dictionary, there's no special-cases necessary in any of the font parsing code. Co-authored-by: huzjakd <huzjakd@gmail.com> Co-Authored-By: Jonas Jenwald <jonas.jenwald@gmail.com>	2019-10-10 16:49:46 +02:00
Jonas Jenwald	f5be2d62a3	Improve the heuristics, in `PartialEvaluator._buildSimpleFontToUnicode`, for glyphNames of the Cdd{d}/cdd{d} format (issue 9655) Please note: I've been thinking about possible ways of addressing this issue for a while now, but all of the solutions I came up with became too complicated and thus hurt readability of the code. However, it occurred to me that we're essentially trying to add a heuristic on top of another heuristic, and that it shouldn't matter how efficient the code is as long as it works. In the PDF file in the issue the Encoding contains glyphNames of the `Cdd` format, which our existing heuristics will treat as base 10 values. However, in this particular file they actually contain base 16 values, which we thus attempt to detect and fix such that text-selection works.	2019-10-06 10:47:29 +02:00
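To make the base 10 versus base 16 ambiguity concrete, here is a simplified sketch of one way such a detection could work. It is an assumption-laden illustration rather than the code in `PartialEvaluator._buildSimpleFontToUnicode`: the helper `parseCddCodes` is hypothetical, and it simply switches every `Cdd{d}`/`cdd{d}` name to base 16 as soon as one of them contains a hexadecimal letter.

```javascript
// Hypothetical helper: decode glyph names of the Cdd{d}/cdd{d} form.
// Names are read as base 10 unless at least one of them can only be
// a base 16 value, in which case all of them are read as base 16.
function parseCddCodes(glyphNames) {
  const pattern = /^[Cc]([0-9A-Fa-f]{2,3})$/;
  const digits = glyphNames.map((name) => {
    const match = pattern.exec(name);
    return match ? match[1] : null; // null: not a Cdd{d} style name
  });
  const needsHex = digits.some((d) => d !== null && /[A-Fa-f]/.test(d));
  const radix = needsHex ? 16 : 10;
  return digits.map((d) => (d === null ? null : parseInt(d, radix)));
}

const decimalCodes = parseCddCodes(["C20", "C41"]); // [20, 41]
const hexCodes = parseCddCodes(["C20", "C4A"]); // [32, 74]
```

Note how `C20` decodes differently in the two calls: a single name cannot be classified on its own, which is why the whole Encoding has to be considered.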
Jonas Jenwald	f11a4ba750	Transfer, rather than copy, CMap data to the worker-thread It recently occurred to me that the CMap data should be an excellent candidate for transferring. This will help reduce peak memory usage for PDF documents using CMaps, since transferring of data avoids duplicating it on both the main- and worker-threads. Unfortunately it's not possible to actually transfer data when returning data through `sendWithPromise`, and another solution had to be used. Initially I looked at using one message for requesting the data, and another message for returning the actual CMap data. While that should have worked, it would have meant adding a lot more complexity particularly on the worker-thread. Hence the simplest solution, at least in my opinion, is to utilize `sendWithStream` since that makes it really easy to transfer the CMap data. (This required PR 11115 to land first, since otherwise CMap fetch errors won't propagate correctly to the worker-thread.) Please note that the patch purposely only changes the API to Worker communication, and not the API itself since changing the interface of `CMapReaderFactory` would be a breaking change. Furthermore, given the relatively small size of the `.bcmap` files (the largest one is smaller than the default range-request size) streaming doesn't really seem necessary either.	2019-09-04 11:46:04 +02:00
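The difference between copying and transferring can be shown without any pdf.js code. The sketch below does not use `sendWithStream` or the pdf.js message handler; it assumes a runtime that provides the standard `structuredClone` function (modern browsers, Node.js 17+), and the helper names `sendCopy` and `sendTransfer` are made up for illustration.

```javascript
// Copying: the receiver gets its own buffer and the sender keeps the
// original, so the data briefly exists twice.
function sendCopy(data) {
  return structuredClone(data);
}

// Transferring: ownership of the underlying ArrayBuffer moves to the
// receiver and the sender's view is detached (length drops to zero).
function sendTransfer(data) {
  return structuredClone(data, { transfer: [data.buffer] });
}

const cMapData = new Uint8Array([1, 2, 3, 4]);

const copied = sendCopy(cMapData);
const lengthAfterCopy = cMapData.byteLength; // still 4

const transferred = sendTransfer(cMapData);
const lengthAfterTransfer = cMapData.byteLength; // 0, the buffer is detached
```

The same transfer-list semantics apply to `postMessage` between the main thread and a worker, which is where the peak memory saving described above comes from.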
Yury Delendik	66e0dd1b06	Use streams for OperatorList chunking (issue 10023) Please note: The majority of this patch was written by Yury, and it's simply been rebased and slightly extended to prevent issues when dealing with `RenderingCancelledException`. By leveraging streams this (finally) provides a simple way in which parsing can be aborted on the worker-thread, which will ultimately help save resources. With this patch worker-thread parsing will only be aborted when the document is destroyed, and not when rendering is cancelled. There are a couple of reasons for this: - The API currently expects the entire OperatorList to be extracted, or an Error to occur, once it's been started. Hence additional re-factoring/re-writing of the API code will be necessary to properly support cancelling and re-starting of OperatorList parsing in cases where the `lastChunk` hasn't yet been seen. - Even with the above addressed, immediately cancelling when encountering a `RenderingCancelledException` will lead to worse performance in e.g. the default viewer. When zooming and/or rotation of the document occurs it's very likely that `cancel` will be (almost) immediately followed by a new `render` call. In that case you'd obviously not want to abort parsing on the worker-thread, since then you'd risk throwing away a partially parsed Page and thus be forced to re-parse it, which will regress perceived performance. - This patch is already somewhat risky, given that it touches fundamentally important/critical code, and trying to keep it somewhat small should hopefully reduce the risk of regressions (and simplify reviewing as well). Time permitting, once this has landed and been in Nightly for a while, I'll try to work on the remaining points outlined above. Co-Authored-By: Yury Delendik <ydelendik@mozilla.com> Co-Authored-By: Jonas Jenwald <jonas.jenwald@gmail.com>	2019-08-24 15:56:40 +02:00
Jonas Jenwald	5ac9c7c384	Support corrupt PDF files with invalid/non-existent Group /CS entries (issue 11045) The PDF file in question tries to reference a non-existent ColorSpace, which should be quite rare in practice.	2019-08-06 14:33:05 +02:00
Jonas Jenwald	38ccb43436	Reduce the number of function calls in `EvaluatorPreprocessor.read`

For very large and complex PDF files this will help performance slightly, since `EvaluatorPreprocessor.read` is called a lot during parsing in the worker.

This patch was tested using the PDF file from issue 2618, i.e. http://bugzilla-attachments.gnome.org/attachment.cgi?id=226471, using the following manifest file:
```
[
    {
       "id": "issue2618",
       "file": "../web/pdfs/issue2618.pdf",
       "md5": "",
       "rounds": 200,
       "type": "eq"
    }
]
```

This gave the following results when comparing this patch against the `master` branch:
```
-- Grouped By browser, stat --
browser | stat         | Count | Baseline(ms) | Current(ms) | +/- |     % | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | --- | ----- | -------------
Firefox | Overall      |   200 |         3402 |        3358 | -43 | -1.28 |        faster
Firefox | Page Request |   200 |            1 |           1 |   0 | 26.71 |
Firefox | Rendering    |   200 |         3401 |        3357 | -44 | -1.28 |        faster
```
	2019-07-29 08:43:36 +02:00
Jonas Jenwald	f710eb56e4	Change the signature of the `Parser` constructor to take a parameter object A lot of the `new Parser()` call-sites look quite unwieldy/ugly as-is, with a bunch of somewhat randomly ordered arguments, which we can avoid by changing the constructor to accept an object instead. As an added bonus, this provides better documentation without having to add inline argument comments in the code.	2019-06-23 16:01:45 +02:00
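The readability argument for a parameter object can be sketched as follows. The class `SketchParser` is a hypothetical stand-in, not the real pdf.js `Parser`, and the property names and defaults shown here are assumptions chosen for illustration.

```javascript
// Hypothetical stand-in for a parser whose constructor takes a single
// parameter object instead of a list of positional arguments.
class SketchParser {
  constructor({ lexer, xref = null, allowStreams = false, recoveryMode = false }) {
    this.lexer = lexer;
    this.xref = xref;
    this.allowStreams = allowStreams;
    this.recoveryMode = recoveryMode;
  }
}

// With positional arguments a call-site would read something like
// `new SketchParser(lexer, null, true, false)`, which says little about
// what each value means. The object form documents itself and lets the
// caller omit anything that should keep its default.
const parser = new SketchParser({
  lexer: { name: "demo-lexer" }, // placeholder lexer object
  allowStreams: true,
});
```

Destructuring in the constructor signature also means the argument order no longer matters, so new options can be added later without touching existing call-sites.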
Tim van der Meij	c8c937c257	Merge pull request #10794 from janpe2/cidtogidmap-zero Fix glyph at index zero in CIDFontType2 that has a CIDToGIDMap stream	2019-05-15 00:04:39 +02:00
Jonas Jenwald	173fbef05b	Enable the `consistent-return` ESLint rule This rule is already enabled in mozilla-central, and helps ensure more consistent functions/methods, see https://searchfox.org/mozilla-central/rev/b9da45f63cb567244933c77b2c7e827a057d3f9b/tools/lint/eslint/eslint-plugin-mozilla/lib/configs/recommended.js#119-120 Please see https://eslint.org/docs/rules/consistent-return for additional information.	2019-05-11 14:27:21 +02:00
Jani Pehkonen	05c527f035	Fix glyph 0 in CIDFontType2 that has a CIDToGIDMap stream	2019-05-07 18:44:37 +03:00
Jonas Jenwald	007fab6ab5	Change `PartialEvaluator.handleColorN` to throw when no valid pattern is found Currently `handleColorN` will fall back to adding a completely unparsed/unvalidated operator when no valid pattern was found. This is unfortunate, since it could very easily lead to a couple of different errors: - `DataCloneError`s when attempting to send the data to the main-thread, e.g. when `args` is `Dict`/`Stream`. - Errors in `getShadingPatternFromIR` on the main-thread, unless `args` just happens to have the expected format. - Errors when actually attempting to render the pattern on the main-thread, since the `args` will most likely not have the expected format. Hence it probably makes sense to error in `PartialEvaluator.handleColorN`, and have invalid patterns fail gracefully via the existing `ignoreErrors` code-paths instead.	2019-05-04 12:53:18 +02:00
Jonas Jenwald	5335285cda	Attempt to handle corrupt PDF documents that contain path operators inside of text objects (issue 10542) First of all, while this simple approach appears to work OK in practice, I'm not sure if it's the best way of addressing the problem (assuming that you even want to). Second of all, while the solution implemented here only requires tracking/checking one new boolean in order for this to work, I'm nonetheless not entirely happy about this since it will add additional overhead (albeit very small) to the parsing of path operators in PDF documents just for a handful of corrupt ones.	2019-04-30 23:35:33 +02:00
