I'm slightly surprised that this hasn't actually caused any (known) bugs, but that may be more luck than anything else since it fortunately doesn't seem common for Streams to be defined inside of an 'ObjStm'.[1]
Note that in the `XRef.fetchUncompressed` method we're *not* caching Streams, and for very good reasons too.
- Streams, especially the `DecodeStream` ones, can become *very* large once read. Hence caching them really isn't a good idea simply because of the (potential) memory impact of doing so.
- Attempting to read from the *same* Stream more than once won't work, unless it's `reset` in between, since any method such as e.g. `getBytes` always starts reading at the current data position (see the sketch after this list).
- Given that even the `src/core/` code is now fairly asynchronous, see e.g. the `PartialEvaluator`, it's generally impossible to assert that any one Stream isn't being accessed "concurrently" by e.g. different `getOperatorList` calls. Hence `reset`-ing a cached Stream isn't going to work in the general case.
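To illustrate the second point, consider this minimal sketch (a hypothetical stand-in class, not the actual PDF.js Stream implementation):
```
class SimpleStream {
  constructor(bytes) {
    this.bytes = bytes;
    this.pos = 0; // the current data position
  }
  getBytes() {
    // Always starts reading at the *current* position.
    const result = this.bytes.subarray(this.pos);
    this.pos = this.bytes.length;
    return result;
  }
  reset() {
    this.pos = 0;
  }
}

const stream = new SimpleStream(new Uint8Array([1, 2, 3]));
stream.getBytes(); // Uint8Array [1, 2, 3]
stream.getBytes(); // Uint8Array [] -- empty, unless `reset()` is called first
```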
All in all, I cannot understand why it'd ever be correct to cache Streams in the `XRef.fetchCompressed` method.
---
[1] One example where that happens is the `issue3115r.pdf` file in the test-suite, where the streams in question are not actually used for anything within the PDF.js code.
- Change all occurrences of `var` to `let`/`const`.
- Initialize the (temporary) Arrays with the correct sizes upfront (see the sketch after this list).
- Inline the `isCmd` check. Obviously this won't make a huge difference, but given that the check is only relevant for corrupt documents it cannot hurt.
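A rough sketch of the second point (the names and values here are illustrative, not the actual code):
```
// Before: `var entries = [];` followed by repeated `entries.push(...)`.
// After: pre-size the Arrays, since the entry count is already known.
const count = 10; // e.g. read from the /N entry of the stream dictionary
const nums = new Array(count);
const offsets = new Array(count);
for (let i = 0; i < count; i++) {
  nums[i] = i;        // stand-in for the actual parsed object number
  offsets[i] = i * 2; // stand-in for the actual parsed offset
}
```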
Having run the entire test-suite locally with these `Number.isInteger` checks removed, there wasn't a single test failure anywhere; see also PR 8857.
Hence everything points to this being completely unnecessary now, and by removing this code there are thus fewer function calls being made in `XRef.fetchUncompressed`.
The contents of this comment haven't been correct for *years*, ever since the library was properly split into main/worker-threads, so it's probably high time for this to be updated.
For documents with a Linearization dictionary the computed `startXRef` position will be relative to the raw file, rather than the actual PDF document itself (which begins with `%PDF-`).
Hence it's necessary to subtract `stream.start` in this case, since otherwise the `XRef.readXRef` method will increment the position too far resulting in parsing errors.
*Please note:* A similar change was attempted in PR 5005, but it was subsequently backed out (in PR 5069) since other parts of the patch caused issues.
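A hedged, self-contained illustration of the adjustment (the numbers and names here are hypothetical):
```
const streamStart = 512;    // bytes of junk preceding the "%PDF-" header
const isLinearized = true;  // i.e. a valid Linearization dictionary exists
let startXRef = 10000;      // position computed relative to the raw file

if (isLinearized) {
  startXRef -= streamStart; // now relative to the PDF document itself
}
console.log(startXRef);     // 9488
```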
With these changes, it's possible to replace repeated function calls within a loop with just a single function call and subsequent assignment instead.
As we've seen in numerous other cases, avoiding unnecessary function calls is never a bad thing (even if the effect is probably tiny here).
With these changes we also avoid potentially two back-to-back `isDict` checks when evaluating possible Page nodes, and can also no longer accidentally pick a dictionary with an incorrect /Type.
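A hedged sketch of the pattern (with a simplified stand-in for the actual `isDict` helper):
```
function isDict(obj, type) {
  return typeof obj === "object" && obj !== null &&
         (type === undefined || obj.Type === type);
}

// Before: repeated calls, and two back-to-back `isDict` checks, roughly:
//   if (isDict(fetch(ref)) && isDict(fetch(ref), "Page")) { ... }
// After: a single call/assignment, followed by one type-aware check.
const obj = { Type: "Page" }; // stand-in for the result of a single fetch
if (isDict(obj, "Page")) {
  // A dictionary with an incorrect /Type can no longer be picked here.
}
```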
Rather than having to store a `PromiseCapability` on the `ObjectLoader` instances, we can simply convert `_walk` to be `async` and thus have the same functionality with native JavaScript instead.
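A hedged sketch of the difference (heavily simplified, not the actual `ObjectLoader` code):
```
// Before: a manually managed capability object, roughly:
//   const capability = createPromiseCapability();
//   this._walk(nodesToVisit); // resolves/rejects `capability` internally
//   return capability.promise;
//
// After: a native `async` method, which produces the Promise automatically.
async function _walk(nodesToVisit) {
  while (nodesToVisit.length) {
    const currentNode = nodesToVisit.pop();
    // ... await any missing data for `currentNode`, queue child nodes ...
  }
}
```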
When the end of data has already been reached for the various Streams, the `getByte` methods will return `-1` to signal that to the caller. Note however that the current position obviously won't be incremented in this case, meaning that the `peekByte` methods will then *incorrectly* decrement the position.
Thankfully the corresponding `peekBytes` methods shouldn't be affected by this bug, since they decrement the current position by the number of bytes that were *actually* returned.
I'm not aware of any bugs caused by this blatant oversight, but that doesn't mean this shouldn't be fixed :-)
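A hedged sketch of the fix, using a simplified stand-in class:
```
class SimpleStream {
  constructor(bytes) {
    this.bytes = bytes;
    this.pos = 0;
    this.end = bytes.length;
  }
  getByte() {
    if (this.pos >= this.end) {
      return -1; // end of data; note that `pos` is *not* incremented here
    }
    return this.bytes[this.pos++];
  }
  peekByte() {
    const peekedByte = this.getByte();
    if (peekedByte !== -1) {
      this.pos--; // only rewind when `getByte` actually advanced `pos`
    }
    return peekedByte;
  }
}
```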
This will allow us to attempt to recover as much as possible of a page, rather than immediately failing, when a broken/unsupported ColorSpace is encountered. This patch thus extends the framework added in PRs such as e.g. 8240 and 8922, to also cover parsing of ColorSpaces.
Currently, for data in `ChunkedStream` instances, the `getMissingChunks` method is used in a couple of places to determine if data is already available or if it needs to be loaded.
When looking at how `ChunkedStream.getMissingChunks` is being used in the `ObjectLoader` you'll notice that we don't actually care about which *specific* chunks are missing, but rather only want essentially a yes/no answer to the "Is the data available?" question.
Furthermore, when looking at how `ChunkedStream.getMissingChunks` itself is implemented you'll notice that it (somewhat expectedly) always iterates over *all* chunks.
All in all, using `ChunkedStream.getMissingChunks` in the `ObjectLoader` seems like an unnecessary "heavy" and roundabout way to obtain a boolean value. However, it turns out there already exists a `ChunkedStream.allChunksLoaded` method, consisting of a *single* simple check, which seems like a perfect fit for the `ObjectLoader` use cases.
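A rough stand-in showing why `allChunksLoaded` is so much cheaper (hypothetical class, not the actual implementation):
```
class ChunkedStreamSketch {
  constructor(numChunks) {
    this.numChunks = numChunks;
    this.loadedChunks = new Set();
  }
  // Before: iterate over *all* chunks and build an Array, only for the
  // caller to test whether that Array is empty.
  getMissingChunks() {
    const missing = [];
    for (let chunk = 0; chunk < this.numChunks; chunk++) {
      if (!this.loadedChunks.has(chunk)) {
        missing.push(chunk);
      }
    }
    return missing;
  }
  // After: a *single* simple check answers the yes/no question directly.
  allChunksLoaded() {
    return this.loadedChunks.size === this.numChunks;
  }
}
```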
In particular, once the *entire* PDF document has been loaded (which is usually fairly quick with streaming enabled), you'd really want the `ObjectLoader` to be as simple/quick as possible (similar to e.g. loading a local file) which this patch should help with.
Note that I wouldn't expect this patch to have a huge effect on performance, but it will nonetheless save some CPU/memory resources when the `ObjectLoader` is used. (As usual this should help larger PDF documents, w.r.t. both file size and number of pages, the most.)
I completely overlooked this in PR 11281, but you obviously need to make similar changes in `PartialEvaluator.hasBlendModes` since it will otherwise ignore valid Blend Modes.
As can be seen in the API, there are a number of document loading Exception handlers which are both really simple and highly similar. Hence these are changed such that all the relevant Exceptions are sent via *one* message instead.
Furthermore, the patch also avoids unnecessarily re-creating `UnknownErrorException`s at the worker side and removes an unnecessary `bind` call.
Obviously this won't look exactly right, but considering that the PDF file doesn't bother embedding non-standard fonts this is the best that we can do here.
This patch is making me somewhat worried about future regressions, since it's certainly easy to imagine this completely breaking certain kinds of corrupt/edited PDF documents while fixing others.[1]
Obviously it passes all existing reference tests (and even improves one), however compared to many other patches there's no telling how much it could break.
The only reason that I'm even submitting this patch, is because of the number of open issues that it would address.
Generally speaking though, the best course of action would probably be if `XRef.indexObjects` was re-written to be much more robust (since it currently feels somewhat hand-wavy in parts). E.g. by actually checking/validating more of the objects before committing to them.
---
[1] Especially given that it's reverting part of PR 5910, however in the case of issue 5909 it seems that other (more recent) changes have actually made that PR redundant.
Sometimes we also used `@return`, but `@returns` is what the JSDoc
documentation recommends. Even though `@return` works as an alias, it's
good to use the recommended syntax and to be consistent within the
project.
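For example (with hypothetical comment text):
```
// Before:
/** @return {number} The total number of pages. */
// After (recommended):
/** @returns {number} The total number of pages. */
```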
For getters we sometimes used `@return` or `@returns`, but `@type` is what
the JSDoc documentation recommends. This also improves the documentation
because before this commit the types were not shown and now they are.
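A hypothetical getter, illustrating the change:
```
class DocumentInfoSketch {
  // Using `@return {boolean}` here meant that the type wasn't rendered in
  // the generated documentation; `@type` makes it show up as expected.
  /** @type {boolean} */
  get isReady() {
    return true; // stand-in value
  }
}
```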
For badly generated PDF documents, with issue 6961 being one example, there are well over one hundred thousand function calls being made in total for just the *two* pages.
This handles the two different ways that fonts can be loaded, either by Name (which is the common case) or by Reference.
Furthermore, this also takes the `ignoreErrors` option into account when deciding whether to fall back or to throw an Error.
Finally, by creating a minimal but valid Font dictionary, there's no special-cases necessary in any of the font parsing code.
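A hedged sketch of what such a minimal dictionary could contain (a plain object stand-in for the actual `Dict` primitive; per the PDF specification these entries suffice for a simple Type1 font):
```
const fallbackFontDict = {
  Type: "Font",
  Subtype: "Type1",
  BaseFont: "Helvetica",
};
```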
Co-authored-by: huzjakd <huzjakd@gmail.com>
Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>
*Please note:* I've been thinking about possible ways of addressing this issue for a while now, but all of the solutions I came up with became too complicated and thus hurt readability of the code.
However, it occurred to me that we're essentially trying to add a heuristic *on top* of another heuristic, and that it shouldn't matter how efficient the code is as long as it works.
In the PDF file in the issue the Encoding contains glyphNames of the `Cdd` format, which our existing heuristics will treat as base 10 values. However, in this particular file they actually contain base 16 values, which we thus attempt to detect and fix such that text-selection works.
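To illustrate the ambiguity, using a hypothetical glyph name:
```
const glyphName = "C30"; // a glyph name in the `Cdd` format
parseInt(glyphName.substring(1), 10); // 30 -- the existing heuristics' value
parseInt(glyphName.substring(1), 16); // 48 -- the kind of base 16 value that
                                      //       the new detection handles
```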
By utilizing a base "class", things become significantly simpler. Unfortunately the new `BaseException` cannot be a proper ES6 class and just extend `Error`, since the SystemJS dependency doesn't seem to play well with that.
Note also that we (generally) need to keep the `name` property on the actual `...Exception` object, rather than on its prototype, since the property will otherwise be dropped during the structured cloning used with `postMessage`.
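Roughly, the pattern looks as follows (a simplified sketch of the closure-based base "class"):
```
const BaseException = (function BaseExceptionClosure() {
  function BaseException(message) {
    if (this.constructor === BaseException) {
      throw new Error("Cannot initialize BaseException.");
    }
    this.message = message;
  }
  BaseException.prototype = new Error();
  BaseException.prototype.constructor = BaseException;
  return BaseException;
})();

class InvalidPDFExceptionSketch extends BaseException {
  constructor(message) {
    super(message);
    // Keep `name` on the instance itself, so that it survives the
    // structured cloning used with `postMessage`.
    this.name = "InvalidPDFException";
  }
}
```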
The following changes were made:
- Remove unnecessary `typeof` checks in the `get`/`getAsync` methods.
- Reduce unnecessary code duplication in the `get`/`getAsync` methods.
- Inline the `Ref` checks in the `get`/`getAsync`/`getArray` methods, since it helps avoid many unnecessary function calls. I.e. this way it's possible to directly call `XRef.{fetch, fetchAsync}` only when necessary, rather than always having to call `XRef.{fetchIfRef, fetchIfRefAsync}` (see the sketch below).
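A hedged sketch of the third point (heavily simplified, not the actual `Dict` implementation):
```
class Ref {} // stand-in for the actual PDF.js `Ref` primitive

class DictSketch {
  constructor(xref) {
    this.xref = xref;
    this._map = Object.create(null);
  }
  get(key) {
    const value = this._map[key];
    if (value instanceof Ref) {
      // Call `XRef.fetch` directly, and only when actually necessary,
      // rather than unconditionally going through `XRef.fetchIfRef`.
      return this.xref.fetch(value);
    }
    return value;
  }
}
```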
This patch was tested using the PDF file from issue 2618, i.e. http://bugzilla-attachments.gnome.org/attachment.cgi?id=226471, using the following manifest file:
```
[
  { "id": "issue2618",
    "file": "../web/pdfs/issue2618.pdf",
    "md5": "",
    "rounds": 250,
    "type": "eq"
  }
]
```
This gave the following results when comparing this patch against the `master` branch:
```
-- Grouped By browser, stat --
browser | stat | Count | Baseline(ms) | Current(ms) | +/- | % | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | --- | ----- | -------------
Firefox | Overall | 250 | 2821 | 2790 | -32 | -1.12 | faster
Firefox | Page Request | 250 | 2 | 2 | 0 | 6.68 |
Firefox | Rendering | 250 | 2820 | 2788 | -32 | -1.13 | faster
```
Having these methods fall back to returning `null` in only *one* particular case seems outright wrong, since a "falsy" value will thus be handled incorrectly.
The only reason that this hasn't caused issues in practice is that there's only one call-site passing in three keys, and in that case we're trying to read a font file where falling back to `null` isn't a problem.
Hopefully this patch makes sense, and in order to reduce the regression risk the implementation ensures that only completely missing widths are being replaced.
With the changes made in PR 11069, it's no longer necessary to include the `pageIndex`/`intent` parameters when sending 'GetOperatorList' data. In the previous implementation these properties were used to associate the `OperatorList` with the correct `RenderTask`, however now that `ReadableStream`s are used that's handled automatically and it's thus dead code at this point.
Note how the sent values have inconsistent types, with a boolean in one case and an object in the other (normal) case.
Furthermore, explicitly sending a `supportTypedArray: true` property seems superfluous at least to me.
This check was added in PR 2445, however it's no longer necessary since all data[1] is now loaded on the main-thread (and then transferred to the worker-thread).
Furthermore, by default the Fetch API is now (usually) used rather than `XMLHttpRequest`.
All in all, while these checks *were* necessary at one point that's no longer the case and they can thus be removed.
---
[1] This includes both the actual PDF data, as well as the CMap data.
It recently occurred to me that the CMap data should be an excellent candidate for transferring.
This will help reduce peak memory usage for PDF documents using CMaps, since transferring the data avoids duplicating it on both the main- and worker-threads.
Unfortunately it's not possible to actually transfer data when *returning* data through `sendWithPromise`, and another solution had to be used.
Initially I looked at using one message for requesting the data, and another message for returning the actual CMap data. While that should have worked, it would have meant adding a lot more complexity particularly on the worker-thread.
Hence the simplest solution, at least in my opinion, is to utilize `sendWithStream` since that makes it *really* easy to transfer the CMap data. (This required PR 11115 to land first, since otherwise CMap fetch errors won't propagate correctly to the worker-thread.)
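A hedged sketch of the main-thread side (the `handler` and `cMapReaderFactory` objects are hypothetical stand-ins for the actual MessageHandler and CMapReaderFactory instances):
```
const cMapReaderFactory = {
  async fetch({ name }) {
    return { cMapData: new Uint8Array(8), compressionType: 0 };
  },
};
const handler = {
  on(actionName, callback) {
    this._callback = callback;
  },
};

handler.on("FetchBuiltInCMap", (data, sink) => {
  cMapReaderFactory.fetch({ name: data.name }).then(result => {
    // Enqueue the data with a transfer list, so that the underlying
    // ArrayBuffer is *moved* to the worker-thread rather than copied.
    sink.enqueue(result, /* size = */ 1, [result.cMapData.buffer]);
  });
});
```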
Please note that the patch *purposely* only changes the API to Worker communication, and not the API *itself* since changing the interface of `CMapReaderFactory` would be a breaking change.
Furthermore, given the relatively small size of the `.bcmap` files (the largest one is smaller than the default range-request size) streaming doesn't really seem necessary either.