pdf.js

Author	SHA1	Message	Date
Jonas Jenwald	1d66fce781	Tweak the heuristic, in `src/core/jpg.js`, that handles JPEG images with a wildly incorrect SOF (Start of Frame) `scanLines` parameter (issue 10989)	2020-07-06 13:06:49 +02:00
Jonas Jenwald	fef24658e7	Adjust the heuristics used when dealing with rectangles, i.e. `re` operators, with zero width/height (issue 12010)	2020-07-02 00:02:49 +02:00
Tim van der Meij	75fed02630	Merge pull request #12043 from Snuffleupagus/issue-4260-test Add a reduced test-case for issue 4260 (PR 4521 follow-up)	2020-07-01 23:51:21 +02:00
Jonas Jenwald	e451cabe37	Add a reduced test-case for issue 4260 (PR 4521 follow-up)	2020-06-30 09:26:41 +02:00
Jonas Jenwald	4a5b68e077	Add at least some test-coverage for the `RenderTask.onContinue` functionality The default viewer, and thus Firefox, depends on the `RenderTask.onContinue` functionality to pause/continue rendering (such that the most visible page always renders first). Despite this functionality thus being very important, it has however never actually been tested at all as far as I can tell. Hence this patch which adds a new boolean `renderTaskOnContinue` parameter (`false` by default), that can be used to force a reference-test to use the `RenderTask.onContinue` code-path in the `InternalRenderTask` class. Note that I purposely made this new reference-test behaviour optional, since I didn't want to negatively affect the general runtime of the tests (given that there's a slight delay added to the rendering). Also, for e.g. benchmarking you'd most likely want to stay away from the `RenderTask.onContinue` functionality for similar reasons.	2020-06-29 00:38:34 +02:00
Jonas Jenwald	28d2ada59c	Attempt to detect inline images which contain "EI" sequence in the actual image data (issue 11124) This should reduce the possibility of accidentally truncating some inline images, while not causing the "EI" detection to become significantly slower.[1] There's obviously a possibility that these added checks are not sufficient to catch every single case of "EI" sequences within the actual inline image data, but without specific test-cases I decided against over-engineering the solution here. Please note: The interpolation issues are somewhat orthogonal to the main issue here, which is the truncated image, and it's already tracked elsewhere. --- [1] I've looked at the issue a few times, and this is the first approach that I was able to come up with that didn't cause unacceptable performance regressions in e.g. issue 2618.	2020-06-26 13:15:06 +02:00
Jonas Jenwald	e18fa3fc45	Tweak the `QueueOptimizer` to recognize `OPS.paintImageMaskXObject` operators as repeated when the "skew" transformation matrix elements are non-zero (issue 8078) First of all, I should mention that my understanding of the finer details of the `QueueOptimizer` (and its related `CanvasGraphics` methods) is somewhat limited. Hence I'm not sure if there's actually a very good reason for only considering ImageMasks where the "skew" transformation matrix elements are zero as repeated, however simply looking at the code I just don't see why these elements cannot be non-zero as long as they are all identical for the ImageMasks. Furthermore, looking at the group case (which is what we're currently falling back to), there's no particular limitation placed upon the transformation matrix elements. While this patch obviously isn't enough to completely fix the issue, since there should be a visible Pattern rendered as well[1], it seems (at least to me) like enough of an improvement that submitting this is justified. With these changes the referenced PDF document will no longer hang the entire browser, and rendering also finishes in a reasonable time (< 10 seconds for me) which seems fine given the huge number of identical inline images present.[2] --- [1] Temporarily changing the Pattern to a solid color does render the correct/expected area, which suggests that the remaining problem is a pre-existing issue related to the Pattern-handling itself rather than the `QueueOptimizer` functionality. [2] The document isn't exactly rendered immediately in e.g. Adobe Reader either.	2020-06-20 12:18:48 +02:00
Jonas Jenwald	4b51bcc733	Ensure that `PDFImage.buildImage` won't accidentally swallow errors, e.g. from ColorSpace parsing (issue 6707, PR 11601 follow-up) Because of a really stupid `Promise`-related mistake on my part, when re-factoring `PDFImage.buildImage` during the `NativeImageDecoder` removal, we're no longer re-throwing errors occurring during image parsing/decoding as intended. The result is that some (fairly) corrupt documents will never finish loading, and unfortunately there were apparently no sufficiently corrupt images in the test-suite to catch this.	2020-06-13 15:02:37 +02:00
Carlos Rodríguez	802aa14a99	Jpeg encoded with RGB -instead of YCbCr- write the components index as "RGB" in ASCII to say it so On ISO/IEC 10918-6:2013 (E), section 6.1: (http://www.itu.int/rec/T-REC-T.872-201206-I/en) "Images encoded with three components are assumed to be RGB data encoded as YCbCr unless the image contains an APP14 marker segment as specified in 6.5.3, in which case the colour encoding is considered either RGB or YCbCr according to the application data of the APP14 marker segment" But common jpeg libraries consider RGB too if components index are ASCII R (0x52), G (0x47) and B (0x42): https://stackoverflow.com/questions/50798014/determining-color-space-for-jpeg/50861048 Issue #11931	2020-06-04 15:08:47 +02:00
Tim van der Meij	3b615e4ca3	Merge pull request #11601 from Snuffleupagus/rm-nativeImageDecoderSupport [api-minor] Decode all JPEG images with the built-in PDF.js decoder in `src/core/jpg.js`	2020-05-23 15:33:46 +02:00
Jonas Jenwald	56ebf01ae0	Avoid hanging the worker-thread for CMap data with ridiculously large ranges (issue 11922) This patch was inspired by `ad2b64f124/xpdf/CharCodeToUnicode.cc (L480-L484)`	2020-05-22 15:23:17 +02:00
Jonas Jenwald	0351852d74	[api-minor] Decode all JPEG images with the built-in PDF.js decoder in `src/core/jpg.js` Currently some JPEG images are decoded by the built-in PDF.js decoder in `src/core/jpg.js`, while others attempt to use the browser JPEG decoder. This inconsistency seems unfortunate for a number of reasons: - It adds, compared to the other image formats supported in the PDF specification, a fair amount of code/complexity to the image handling in the PDF.js library. - The PDF specification supports JPEG images with features, e.g. certain ColorSpaces, that browsers are unable to decode natively. Hence, determining if a JPEG image is possible to decode natively in the browser requires a non-trivial amount of parsing. In particular, we're parsing (part of) the raw JPEG data to extract certain marker data and we also need to parse the ColorSpace for the JPEG image. - While some JPEG images may, for all intents and purposes, appear to be natively supported there are still cases where the browser may fail to decode some JPEG images. In order to support those cases, we've had to implement a fallback to the PDF.js JPEG decoder if there are any issues during the native decoding. This also means that it's no longer possible to simply send the JPEG image to the main-thread and continue parsing, but you now need to actually wait for the main-thread to indicate success/failure first. In practice this means that there's a code-path where the worker-thread is forced to wait for the main-thread, while the reverse should always be the case. - The native decoding, for anything except the simplest of JPEG images, results in increased peak memory usage because there's a handful of short-lived copies of the JPEG data (see PR 11707). Furthermore this also leads to data being parsed on the main-thread, rather than the worker-thread, which you usually want to avoid for e.g. performance and UI-responsiveness reasons. - Not all environments, e.g. Node.js, fully support native JPEG decoding. This has, historically, led to some issues and support requests. - Different browsers may use different JPEG decoders, possibly leading to images being rendered slightly differently depending on the platform/browser where the PDF.js library is used. Originally the implementation in `src/core/jpg.js` was unable to handle all of the JPEG images in the test-suite, but over the last couple of years I've fixed (hopefully) all of those issues. At this point in time, there are two kinds of failure with this patch: - Changes which are basically imperceivable to the naked eye, where some pixels in the images are essentially off-by-one (in all components), which could probably be attributed to things such as different rounding behaviour in the browser/PDF.js JPEG decoder. This type of "failure" accounts for the vast majority of the total number of changes in the reference tests. - Changes where the JPEG images now look ever so slightly blurrier than with the native browser decoder. For quite some time I've just assumed that this pointed to a general deficiency in the `src/core/jpg.js` implementation, however I've discovered when comparing two viewers side-by-side that the differences vanish at higher zoom levels (usually around 200% is enough). Basically if you disable [this downscaling in canvas.js](`8fb82e939c/src/display/canvas.js (L2356-L2395)`), which is what happens when zooming in, the differences simply vanish! Hence I'm pretty satisfied that there are no significant problems with the `src/core/jpg.js` implementation, and the problems are rather tied to the general quality of the downscaling algorithm used. It could even be seen as a positive that all images now share the same downscaling behaviour, since this actually fixes one old bug; see issue 7041.	2020-05-22 00:22:48 +02:00
Jonas Jenwald	4aabd063fc	Gracefully handle annotation parsing errors in `Page.getOperatorList` (issue 11871) This should ensure that a page will always render successfully, even if there's errors during the Annotation fetching/parsing. Additionally the `OperatorList.addOpList` method is also adjusted to ignore invalid data, to make it slightly more robust.	2020-05-04 17:09:48 +02:00
Tim van der Meij	96923eb2a6	Merge pull request #11805 from Snuffleupagus/issue-11794 Always skip over any additional, unexpected, RSTx (restart) markers in corrupt JPEG images (issue 11794)	2020-04-16 00:08:58 +02:00
Jonas Jenwald	44b4a74f48	A couple of small `String.fromCodePoint` improvements (PR 11698 and 11769 follow-up) - Add a reduced test-case for issue 11768, to prevent future regressions. (Given that PR 11769 is only a work-around, rather than a proper solution, it may not be entirely accurate for the issue to be closed as fixed.) - Add more validation of the charCode, as found by the heuristics, in `PartialEvaluator._buildSimpleFontToUnicode` to prevent future issues.	2020-04-15 13:45:08 +02:00
Jonas Jenwald	06f6f8719f	Always skip over any additional, unexpected, RSTx (restart) markers in corrupt JPEG images (issue 11794)	2020-04-14 23:27:08 +02:00
Jonas Jenwald	91efde5246	Add a heuristic to scale even single-char text, when the horizontal/vertical scaling differs significantly (issue 11713) At this point in time, compared to when the "ignore single-char" code was added, we should generally be doing a much better job of combining text into as few chunks as possible. However, there's still bad cases where we're not able to combine text as much as one would like, which is why I'm not proposing to simply measure/scale all text. Instead this patch will only measure/scale single-char text in cases where the horizontal/vertical scale is off significantly, since that's where you'd expect bad text-selection behaviour otherwise. Note that most of the movement caused by this patch is with Type3 fonts, which is a somewhat special font type and one where our current text-selection behaviour is probably the least good.	2020-04-07 00:36:23 +02:00
Jonas Jenwald	938d519192	Create the glyph mapping correctly for composite Type1, i.e. CIDFontType0, fonts (issue 11740) This updates `Type1Font.getGlyphMapping` with a code-path "borrowed" from `CFFFont.getGlyphMapping`.	2020-04-06 11:21:02 +02:00
Jani Pehkonen	a22c0eab48	The first glyph in CFF CIDFonts must be named 0 instead of ".notdef" Fixes #11718 in which the `ff` ligature glyph is at index zero in a CFF font. Because this is a CIDFont, glyph names are CIDs, which are integers. Thus the string `".notdef"` is not correct. The rest of the charset data is already parsed correctly as integers when the boolean argument `cid` is true.	2020-03-24 15:56:50 +02:00
Jonas Jenwald	15e8692eff	Don't accidentally accept invalid glyphNames which appear to follow the Cdd{d}/cdd{d} format in `PartialEvaluator._buildSimpleFontToUnicode` (issue 11697) The /Differences array of the problematic font contains a `/c.1` entry, which is consequently detected as a possible Cdd{d}/cdd{d} glyphName by the existing heuristics. Because of how the base 10 conversion is implemented, which is necessary for the base 16 special case, the parsed charCode becomes `0.1` thus causing `String.fromCodePoint` to throw since that obviously isn't a valid code point. To fix the referenced issue, and to hopefully prevent similar ones in the future, the patch adds additional validation of the charCode found by the heuristics.	2020-03-13 23:35:47 +01:00
Jonas Jenwald	65e6ea2cb2	Prevent lookup errors in `PartialEvaluator.hasBlendModes` from breaking all parsing/rendering of a page (issue 11678) The PDF document in question is corrupt, since it contains an XObject with a truncated dictionary and where the stream contents start without a "stream" operator.	2020-03-09 12:00:12 +01:00
Tim van der Meij	1a97c142b3	Merge pull request #11523 from Snuffleupagus/issue-10880 Add a heuristic, in `src/core/jpg.js`, to handle JPEG images with a wildly incorrect SOF (Start of Frame) `scanLines` parameter (issue 10880)	2020-03-06 23:03:09 +01:00
Tim van der Meij	c95b9b1e17	Merge pull request #11653 from Snuffleupagus/ensureStateFont Ensure that there's always a setFont (Tf) operator before text rendering operators (issue 11651)	2020-03-03 23:33:13 +01:00
Jani Pehkonen	71e7686950	Fix Type1 font parsing when .notdef is not at index zero Fixes #11477 The PDF draws many space characters but the embedded fonts don't have a glyph named `space`, so `.notdef` should be drawn instead. PDF.js assumed that Type1 fonts define `.notdef` as the first glyph (index 0). However, now the fonts have the glyph `A` at index 0 and `.notdef` is the last one, so `A` appears where spaces are expected. Because the rest of the font machinery in `core/fonts.js` assumes `.notdef` is at index zero, it's easiest to modify `core/type1_parser.js` so that it "repairs" fonts and makes sure `.notdef` is at index 0.	2020-03-03 21:55:51 +02:00
Jonas Jenwald	65e514e063	Ensure that there's always a setFont (Tf) operator before text rendering operators (issue 11651) The PDF document in question is corrupt, since it contains multiple instances of incorrect operators. We obviously don't want to slow down parsing of all documents (since most are valid), just to accommodate a particular bad PDF generator, hence the reason for the inline check before calling the `ensureStateFont` method.	2020-03-03 10:05:18 +01:00
Takashi Tamura	d8c9f119b0	Fix the vertical writing mode with horizontal scaling. #11555 . It is not valid to multiply textHScale when the writing mode is vertical. See 9.4.4 Text Space Details, https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1694762	2020-02-29 07:48:29 +09:00
Jonas Jenwald	c3c3b8cd81	Add a heuristic, in `src/core/jpg.js`, to handle JPEG images with a wildly incorrect SOF (Start of Frame) `scanLines` parameter (issue 10880) This whole patch feels somewhat arbitrary, and I'd be slightly worried about possibly breaking something else. To limit the impact of these changes, we only re-parse JPEG images using a reduced `scanLines` value if and only if: An unexpected EOI (End of Image) marker was encountered during decoding of Scan data and the "actual" `scanLines` value is at least one order of magnitude smaller than expected.	2020-02-22 14:16:07 +01:00
Takashi Tamura	512dbe3060	Fix text spacing with vertical fonts. #7687 and #11526 . When the writing mode is vertical, we have to reverse the sign of spacing since we are subtracting it from current.y. We have to add it to current.y. See 9.4.4 Text Space Details, https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1694762	2020-02-11 08:49:23 +09:00
Tim van der Meij	dced0a3821	Merge pull request #11579 from Snuffleupagus/issue-11578 Ignore spaces when normalizing the font name in `Font.fallbackToSystemFont` (issue 11578)	2020-02-09 17:33:09 +01:00
Tim van der Meij	61056a9238	Merge pull request #11551 from Snuffleupagus/issue-11549 Allow skipping of errors when reading broken/corrupt ToUnicode data (issue 11549)	2020-02-09 17:32:35 +01:00
Jonas Jenwald	7937165537	Ignore spaces when normalizing the font name in `Font.fallbackToSystemFont` (issue 11578)	2020-02-08 19:59:04 +01:00
Brendan Dahl	09a6e17d22	Merge pull request #11528 from janpe2/type1-nonemb-notdef Hide .notdef glyphs in non-embedded Type1 fonts and don't ignore Widths	2020-02-06 13:30:07 -08:00
Jonas Jenwald	4c54395ff6	Allow skipping of errors when reading broken/corrupt ToUnicode data (issue 11549) This will allow font loading/parsing to continue, rather than immediately failing, when broken/corrupt CMap data is encountered.	2020-01-30 13:19:05 +01:00
Tim van der Meij	474fe1757e	Merge pull request #11508 from Snuffleupagus/jpg-default-marker Simplify the handling of unsupported/incorrect markers in `src/core/jpg.js`	2020-01-26 21:32:13 +01:00
Jonas Jenwald	62b2b984cc	Render Popup annotations last, once all other annotations have been rendered (issue 11362) In the current `AnnotationLayer` implementation, Popup annotations require that the parent annotation have already been rendered (otherwise they're simply ignored). Usually the annotations are ordered, in the `/Annots` array, in such a way that this isn't a problem, however there's obviously no guarantee that all PDF generators actually do so. Hence we simply ensure, when rendering the `AnnotationLayer`, that the Popup annotations are handled last.	2020-01-26 15:49:55 +01:00
Jonas Jenwald	13930e5202	Simplify the handling of unsupported/incorrect markers in `src/core/jpg.js` - Re-factor the "incorrect encoding" check, since this can be easily achieved using the general `findNextFileMarker` helper function (with a suitable `startPos` argument). - Tweak a condition, to make it easier to see that the end of the data has been reached. - Add a reference test for issue 1877, since it's what prompted the "incorrect encoding" check.	2020-01-25 22:52:24 +01:00
Jani Pehkonen	809b96b40c	Hide .notdef glyphs in non-embedded Type1 fonts and don't ignore Widths Fixes #11403 The PDF uses the non-embedded Type1 font Helvetica. Character codes 194 and 160 (`Â` and `NBSP`) are encoded as `.notdef`. We shouldn't show those glyphs because it seems that Acrobat Reader doesn't draw glyphs that are named `.notdef` in fonts like this. In addition to testing `glyphName === ".notdef"`, we must test also `glyphName === ""` because the name `""` is used in `core/encodings.js` for undefined glyphs in encodings like `WinAnsiEncoding`. The solution above hides the `Â` characters but now the replacement character (space) appears to be too wide. I found out that PDF.js ignores font's `Widths` array if the font has no `FontDescriptor` entry. That happens in #11403, so the default widths of Helvetica were used as specified in `core/metrics.js` and `.notdef` got a width of 333. The correct width is 0 as specified by the `Widths` array in the PDF. Thus we must never ignore `Widths`.	2020-01-21 21:35:25 +02:00
Jonas Jenwald	5c0336872e	Handle corrupt ASCII85Decode inline images with truncated EOD markers (issue 11385) In the PDF document in question, there's an ASCII85Decode inline image where the '>' part of EOD (end-of-data) marker is missing; hence the PDF document is corrupt.	2019-12-05 15:53:18 +01:00
Jonas Jenwald	9199b02a42	Subtract `stream.start` when getting the `startXRef` property for documents with a Linearization dictionary (issue 11330) For documents with a Linearization dictionary the computed `startXRef` position will be relative to the raw file, rather than the actual PDF document itself (which begins with `%PDF-`). Hence it's necessary to subtract `stream.start` in this case, since otherwise the `XRef.readXRef` method will increment the position too far resulting in parsing errors.	2019-11-16 09:29:10 +01:00
Tim van der Meij	6972bbea74	Include a reduced test case for annotations without a `Border`/`BS` entry (PR 6180 follow-up)	2019-11-10 14:37:42 +01:00
Jonas Jenwald	835d8c2be5	Allow skipping of errors when parsing broken/unsupported ColorSpaces (issue 6707, issue 11287) This will allow us to attempt to recover as much as possible of a page, rather than immediately failing, when a broken/unsupported ColorSpace is encountered. This patch thus extends the framework added in PRs such as e.g. 8240 and 8922, to also cover parsing of ColorSpaces.	2019-11-01 09:01:24 +01:00
Jonas Jenwald	5c266f0e8c	Support Blend Modes which are specified in an Array of Names (issue 11279) According to the specification, the first supported Blend Mode should be chosen in this case; please see https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G10.4848607	2019-10-26 14:24:31 +02:00
Tim van der Meij	11f3851a97	Merge pull request #11243 from Snuffleupagus/issue-11242 Add a fallback for non-embedded composite Verdana fonts (issue 11242)	2019-10-18 23:56:46 +02:00
Tim van der Meij	c54bb222ca	Merge pull request #11231 from Snuffleupagus/indexObjects-entries-gen Allow over-writing entries, in `XRef.indexObjects`, only when the generation number matches (issues 11230, 11139, 9552, 9129, 7303)	2019-10-17 23:56:26 +02:00
Jonas Jenwald	2fcb5afc7b	Add a fallback for non-embedded composite Verdana fonts (issue 11242) Obviously this won't look exactly right, but considering that the PDF file doesn't bother embedding non-standard fonts this is the best that we can do here.	2019-10-17 17:00:55 +02:00
Jonas Jenwald	17a3af3fc0	Replace a couple of `skipPages` annotations with `firstPage` in `test/test_manifest.json` Originally only `skipPages` existed, but given that `firstPage`/`lastPage` has existed for a long time now using them whenever possible looks simpler overall.	2019-10-17 13:20:56 +02:00
Jonas Jenwald	f3c5a690fc	Remove unnecessary `skipPages` annotations from `test/test_manifest.json` - In the `ibwa-bad` case the sixteenth page contains corrupt/incomplete commands, but given that we're suppressing `Error`s by default now skipping hardly seems warranted any more. - In the `geothermal.pdf` case the first page contains an unsupported ColourSpace, but again we're suppressing `Error`s by default now and skipping hardly seems warranted any more.	2019-10-17 13:20:49 +02:00
Jonas Jenwald	ffc847eaa5	Allow over-writing entries, in `XRef.indexObjects`, only when the generation number matches (issues 11230, 11139, 9552, 9129, 7303) This patch is making me somewhat worried about future regressions, since it's certainly easy to imagine this completely breaking certain kinds of corrupt/edited PDF documents while fixing others.[1] Obviously it passes all existing reference tests (and even improves one), however compared to many other patches there's no telling how much it could break. The only reason that I'm even submitting this patch, is because of the number of open issues that it would address. Generally speaking though, the best course of action would probably be if `XRef.indexObjects` was re-written to be much more robust (since it currently feels somewhat hand-wavy in parts). E.g. by actually checking/validating more of the objects before committing to them. --- [1] Especially given that it's reverting part of PR 5910, however in the case of issue 5909 it seems that other (more recent) changes have actually made that PR redundant.	2019-10-14 22:10:04 +02:00
Jonas Jenwald	259551d144	Convert a number of reference tests, for documents with corrupt XRef tables, from `load` to `eq` As part of attempting to fix a number of issues containing PDF documents with corrupt XRef tables, I'd like to improve the reference test-coverage slightly first. Obviously this will increase the runtime of the tests a bit, however I'd rather "waste" resources on the bots instead of developer time fixing regressions which could have been avoided.	2019-10-12 18:49:59 +02:00
Jonas Jenwald	f5be2d62a3	Improve the heuristics, in `PartialEvaluator._buildSimpleFontToUnicode`, for glyphNames of the Cdd{d}/cdd{d} format (issue 9655) Please note: I've been thinking about possible ways of addressing this issue for a while now, but all of the solutions I came up with became too complicated and thus hurt readability of the code. However, it occurred to me that we're essentially trying to add a heuristic on top of another heuristic, and that it shouldn't matter how efficient the code is as long as it works. In the PDF file in the issue the Encoding contains glyphNames of the `Cdd` format, which our existing heuristics will treat as base 10 values. However, in this particular file they actually contain base 16 values, which we thus attempt to detect and fix such that text-selection works.	2019-10-06 10:47:29 +02:00
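
Note on commit 15e8692eff (issue 11697): the failure it describes can be reproduced in a few lines. The sketch below is not the code from `PartialEvaluator._buildSimpleFontToUnicode`; the helper names `isValidCodePoint` and `unicodeFromCddGlyphName` are hypothetical, and the length check on the glyph name is an assumption made for illustration. It only shows why a `/c.1` entry parses to `0.1` and how validating the charCode avoids the `String.fromCodePoint` exception.

```javascript
// Hypothetical helpers for illustration only; not the actual PDF.js code.

// A value is usable with String.fromCodePoint only if it is an integer
// within the Unicode range; anything else makes it throw a RangeError.
function isValidCodePoint(code) {
  return Number.isInteger(code) && code >= 0 && code <= 0x10ffff;
}

// Treat names such as "C65" or "c065" as candidate Cdd{d}/cdd{d} glyph names
// (assumed shape: "C" or "c" followed by two or three characters).
function unicodeFromCddGlyphName(glyphName) {
  if (!/^[Cc].{2,3}$/.test(glyphName)) {
    return null;
  }
  // Numeric conversion of the part after the prefix: "65" -> 65, ".1" -> 0.1.
  const code = +glyphName.substring(1);
  if (!isValidCodePoint(code)) {
    return null; // Reject instead of letting String.fromCodePoint throw.
  }
  return String.fromCodePoint(code);
}

console.log(unicodeFromCddGlyphName("C65")); // "A"
console.log(unicodeFromCddGlyphName("c.1")); // null
```

Returning `null` for a rejected name is a design choice of this sketch, so a caller can fall back to other lookups rather than abort parsing of the whole font.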

887 Commits