pdf.js

Author	SHA1	Message	Date
Tim van der Meij	ed3954fc7a	Merge pull request #10851 from brendandahl/shading-bbox Apply bounding box before using shading patterns.	2019-07-12 22:52:07 +02:00
Tim van der Meij	87f36e3520	Merge pull request #10850 from brendandahl/scale-line-width Scale stroking line width when using a tiling pattern.	2019-07-12 22:50:32 +02:00
Brendan Dahl	6fab0a0dac	Apply bounding box before using shading patterns. Fixes #8092	2019-07-08 14:05:48 -07:00
Brendan Dahl	446efab707	Scale stroking line width when using a tiling pattern.	2019-07-08 13:47:54 -07:00
Jonas Jenwald	876c962235	Ignore Annotations with too large border `width`s, to prevent the `annotationLayer` from rendering it over the surrounding document (bug 1552113) The border `width` will instead fallback to the default value of `1`, rather than ignoring it altogether, to also ensure that e.g. `LinkAnnotation`s become clickable as intended. Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1552113	2019-06-01 15:51:22 +02:00
Jani Pehkonen	05c527f035	Fix glyph 0 in CIDFontType2 that has a CIDToGIDMap stream	2019-05-07 18:44:37 +03:00
Jonas Jenwald	5335285cda	Attempt to handle corrupt PDF documents that contains path operators inside of text object (issue 10542) First of all, while this simple approach appears to work OK in practice I'm not sure if it's the best way of addressing the problem (assuming that you even want to). Second of all, while the solution implemented here only requires tracking/checking one new boolean in order for this to work, I'm nonetheless not entirely happy about this since it will add additional overhead (albeit very small) to the parsing of path operators in PDF documents just for a handful of corrupt ones.	2019-04-30 23:35:33 +02:00
Tim van der Meij	55d9b35d37	Merge pull request #10727 from Snuffleupagus/type3-image-resources Support (rare) Type3 fonts which contains image resources (issue 10717)	2019-04-18 23:07:26 +02:00
Tim van der Meij	ae2a4dc3dd	Implement free text annotations	2019-04-13 18:45:22 +02:00
Jonas Jenwald	be604bd195	Support (rare) Type3 fonts which contains image resources (issue 10717) The Type3 font type is not commonly used in PDF documents, as can be seen from telemetry data such as: https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2019-04-09&include_spill=0&keys=__none__!__none__!__none__&max_channel_version=nightly%252F68&measure=PDF_VIEWER_FONT_TYPES&min_channel_version=nightly%252F57&processType=&product=Firefox&sanitize=1&sort_by_value=0&sort_keys=submissions&start_date=2019-03-18&table=0&trim=1&use_submission_date=0 (see also https://github.com/mozilla/pdf.js/wiki/Enumeration-Assignments-for-the-Telemetry-Histograms#pdf_viewer_font_types). Type3 fonts containing image resources are *very* rare in practice, usually they only contain path rendering operators, but as the issue shows they unfortunately do exist. Currently these Type3-related image resources are not handled in any special way, and given that fonts are document rather than page specific rendering breaks since the image resources are thus not available to the entire document. Fortunately fixing this isn't too difficult, but it does require adding a couple of Type3-specific code-paths to the `PartialEvaluator`. In order to keep the implementation simple, particularly on the main-thread, these Type3 image resources are completely decoded on the worker-thread to avoid adding too many special cases. This should not cause any issues, only marginally less efficient code, but given how rare this kind of Type3 font is adding premature optimizations didn't seem at all warranted at this point.	2019-04-13 18:27:50 +02:00
Tim van der Meij	b4c3b94592	Merge pull request #6606 from Rob--W/pattern-scaling Improve performance and correctness of Tiling Patterns	2019-03-29 00:01:38 +01:00
Tim van der Meij	f9c58115fc	Merge pull request #10683 from janpe2/type0-noncid-cmap Use CMap in Type0 fonts when CFF is not a CID font	2019-03-28 00:07:08 +01:00
Rob Wu	d3dc8f16b5	TilingPattern: Reverse transform after painting This transform resulted in an incorrectly positioned object when the bounding box's upper-left corner did not start at (0,0), because the translation was not reverted. This patch adds the missing transform. The test file (tiling-pattern-box.pdf) is based on the PDF from #2825. All but the first cube (including the PDF data) have been removed. To trigger the bug that is fixed by this commit, I changed the BBox of the first pattern from "[ 0 0 596 842]" to "[90 0 596 842]". Without this patch, the dashed vertical line that intersects the corners at A and E would disappear.	2019-03-27 17:50:35 +01:00
Rob Wu	a72a8e921f	Avoid extreme sizing / scaling in tiling pattern The new test file (tiling-pattern-large-steps.pdf) was manually created, to have the following characteristics: - Large xstep and ystep (90000) - Page width is 4000 (which is larger than MAX_PATTERN_SIZE) - Visually, the page consists of a red rectangle with a black border, surrounded by a 50 unit white padding. - Before patch: blurry; After patch: sharp Fixes #6496 Fixes #5698 Fixes #1434 Fixes #2825	2019-03-27 17:44:04 +01:00
Jonas Jenwald	9077abc263	Take the `FirstChar`/`LastChar` properties into account when computing the hash in `PartialEvaluator.preEvaluateFont` (issue 10665) Without this some fonts may incorrectly end up with matching `hash`es, thus breaking rendering since we'll not actually try to load/parse some of the fonts.	2019-03-27 16:27:10 +01:00
Jani Pehkonen	49c6233fbc	Use CMap in Type0 fonts when CFF is not a CID font	2019-03-26 19:38:44 +02:00
Jonas Jenwald	88f9e633dd	Try to improve text-selection for Type3 fonts that utilize a non-default /FontMatrix (bug 1513120) For Type3 fonts text-selection is often not that great, and there's a couple of heuristics used to try and improve things. This patch simply extends those heuristics a bit, and fixes a pre-existing "naive" array comparison, but this all feels a bit brittle to say the least. The existing Type3 test-coverage isn't that great in general, and in particular Type3 `text` tests are few and far between, hence why this patch adds two different new `text` tests.	2019-03-12 10:32:08 +01:00
Jonas Jenwald	3ce8fe7927	Handle corrupt ASCII85Decode inline images with whitespace "inside" of the EOD marker (issue 10614) There's a number of things wrong with the PDF document, since its inline images are first of all a lot larger than the 4 KB limit (as mandated by the specification, see https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G7.1852045). Furthermore the actual ASCII85Decode data is interspersed with a lot of needless whitespace, in particular also "inside" of the EOD (end-of-data) marker which thus completely breaks the detection. Note that according to the specification, see https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G6.1940130, this patch should be safe since it explicitly mentions that all whitespace should be ignored.	2019-03-04 23:41:36 +01:00
Jonas Jenwald	fb774a65b0	Avoid truncating/breaking some Type3 glyphs in `compileType3Glyph` (bug 1245391, issue 10568) Hopefully this patch makes sense, since I cannot claim to fully understand this function. With the changes made in PR 3354 some Type3 glyph outlines are no longer rendering correctly, since the final paths were being accidentally ignored. The fact that Type3 fonts are not very common in PDF documents, and that most Type3 glyphs are unaffected by this regression, probably explains why this has gone unnoticed since 2013.	2019-02-21 23:29:43 +01:00
Tsukasa OI	96ba6afd47	Fix copying on supplementary plane characters pdf.js had a problem when copying characters on supplementary planes (0xPPXXXX where PP is nonzero). This is because certain methods of PartialEvaluator use classic String.fromCharCode instead of ES6's String.fromCodePoint. Despite the fact that readToUnicode method tried to parse out-of-UCS2 code points by parsing UTF-16BE, it was inadequate because String.fromCharCode only supports UCS-2 range of Unicode.	2019-02-10 18:14:53 +09:00
Tim van der Meij	e2701d5422	Merge pull request #10482 from janpe2/indexed-decode Implement Decode entry in Indexed images	2019-01-24 23:46:55 +01:00
Jonas Jenwald	41fbc71ef9	Ensure that `XRef.indexObjects` can handle object numbers with zero-padding (issue 10491) All objects in the PDF document follow this pattern: ``` 0000000001 0 obj << % Some content here... >> endobj 0000000002 0 obj << % More content here... endobj ```	2019-01-24 22:37:18 +01:00
Jani Pehkonen	26121177ab	Implement Decode entry in Indexed images	2019-01-22 22:51:04 +02:00
Jonas Jenwald	b531fc4106	Avoid truncating inline images, where the data and the "EI" marker is glued together (issue 10388) (#10436 ) Thanks to the excellent debugging done by @janpe2, this was easy to fix!	2019-01-12 20:31:23 +01:00
Jonas Jenwald	d4a3858ed5	Handle more cases of corrupt PDF files with missing 'endobj' operators, where the "obj" string is immediately followed by the dictionary (PR 9288 follow-up)	2019-01-10 17:55:28 +01:00
Brendan Dahl	32eace043b	Fix reading number of HTMX metrics. The length of the HHEA table can be incorrect, so it is better to read the number of metrics offset from beginning of table instead.	2019-01-04 15:13:13 -08:00
Brendan Dahl	e2686db49b	Merge pull request #10277 from janpe2/cff-stems Repair CFF fonts if stem hints are in wrong order	2019-01-03 10:30:43 -08:00
Jonas Jenwald	60bcce184e	Check that the first page can be successfully loaded, to try and ascertain the validity of the XRef table (issue 7496, issue 10326) For PDF documents with sufficiently broken XRef tables, it's usually quite obvious when you need to fallback to indexing the entire file. However, for certain kinds of corrupted PDF documents the XRef table will, for all intents and purposes, appear to be valid. It's not until you actually try to fetch various objects that things will start to break, which is the case in the referenced issues[1]. Since there's generally a real effort being made in PDF.js to load even corrupt PDF documents, this patch contains a suggested approach to attempt to do a bit more validation of the XRef table during the initial document loading phase. Here the choice is made to attempt to load the first page, as a basic sanity check of the validity of the XRef table. Please note that attempting to load a more-or-less arbitrarily chosen object without any context of what it's supposed to be isn't very useful, which is why this particular choice was made. Obviously, just because the first page can be loaded successfully that doesn't guarantee that the entire XRef table is valid, however if even the first page fails to load you can be reasonably sure that the document is not valid[2]. Even though this patch won't cause any significant increase in the amount of parsing required during initial loading of the document[3], it will require loading of more data upfront which thus delays the initial `getDocument` call. Whether or not this is a problem depends very much on what you actually measure, please consider the following examples:
```javascript
console.time('first');
getDocument(...).promise.then((pdfDocument) => {
  console.timeEnd('first');
});

console.time('second');
getDocument(...).promise.then((pdfDocument) => {
  pdfDocument.getPage(1).then((pdfPage) => { // Note: the API uses `pageNumber >= 1`, the Worker uses `pageIndex >= 0`.
    console.timeEnd('second');
  });
});
```
The first case is pretty much guaranteed to show a small regression, however the second case won't be affected at all since the Worker caches the result of `getPage` calls. Again, please remember that the second case is what matters for the standard PDF.js use-case which is why I'm hoping that this patch is deemed acceptable. --- [1] In issue 7496, the problem is that the document is edited without the XRef table being correctly updated. In issue 10326, the generator was sorting the XRef table according to the offsets rather than the objects. [2] The idea of checking the first page in particular came from the "standard" use-case for the PDF.js library, i.e. the default viewer, where a failure to load the first page basically means that nothing will work; note how `{BaseViewer, PDFThumbnailViewer}.setDocument` depends completely on being able to fetch the first page. [3] The only extra parsing is caused by, potentially, having to traverse part of the `Pages` tree to find the first page.	2018-12-29 12:47:25 +01:00
Jani Pehkonen	9e990f6f3e	Repair CFF fonts if stem hints are in wrong order	2018-11-20 18:50:37 +02:00
Simon Leblanc	b5806735d8	Add support of Ink annotation	2018-10-03 00:28:49 +02:00
Tim van der Meij	66422eb83e	Merge pull request #9340 from brendandahl/private-use Map all glyphs to the private use area and duplicate the first glyph.	2018-09-08 17:51:04 +02:00
Brendan Dahl	b76cf665ec	Map all glyphs to the private use area and duplicate the first glyph. There have been lots of problems with trying to map glyphs to their unicode values. It's more reliable to just use the private use areas so the browser's font renderer doesn't mess with the glyphs. Using the private use area for all glyphs did highlight other issues that this patch also had to fix: * small private use area - Previously, only the BMP private use area was used which can't map many glyphs. Now, the (much bigger) PUP 16 area can also be used. * glyph zero not shown - Browsers will not use the glyph from a font if it is glyph id = 0. This issue was less prevalent when we mapped to unicode values since the fallback font would be used. However, when using the private use area, the glyph would not be drawn at all. This is illustrated in one of the current test cases (issue #8234) where there's an "ä" glyph at position zero. The PDF looked like it rendered correctly, but it was actually not using the glyph from the font. To properly show the first glyph it is always duplicated and appended to the glyphs and the maps are adjusted. * supplementary characters - The private use area PUP 16 is 4 bytes, so String.fromCodePoint must be used where we previously used String.fromCharCode. This is actually an issue that should have been fixed regardless of this patch. * charset - Freetype fails to load fonts when the charset size doesn't match number of glyphs in the font. We now write out a fake charset with the correct length. This also brought up the issue that glyphs with seac/endchar should only ever write a standard charset, but we now write a custom one. To get around this the seac analysis is permanently enabled so those glyphs are instead always drawn as two glyphs.	2018-09-05 14:04:54 -07:00
Jonas Jenwald	e5a6d892b4	Revert "Attempt to combine separate beginText/endText sequences in `getTextContent` (issue 9984)"	2018-09-05 18:01:33 +02:00
Tim van der Meij	c94df0fef3	Merge pull request #9986 from Snuffleupagus/issue-9984 Attempt to combine separate beginText/endText sequences in `getTextContent` (issue 9984)	2018-09-01 21:21:29 +02:00
Jonas Jenwald	95e5bad4c4	Attempt to find truncated endstream commands, in the fallback code-path, in `Parser.makeStream` (issue 10004) Apparently there's some PDF generators, in this case the culprit is "Nooog Pdf Library / Nooog PStoPDF v1.5", that manage to mess up PDF creation enough that endstream[1] commands actually become truncated. Please note: The solution implemented here isn't perfect, since it won't be able to cope with PDF files that contain a mixture of correct and truncated endstream commands. However, considering that this particular mode of corruption fortunately doesn't seem very common[2], a slightly less complex solution ought to suffice for now. Fixes 10004. --- [1] Scanning through the PDF data to find endstream commands becomes necessary, in order to determine the stream length in cases where the `Length` entry of the (stream) dictionary is missing/incorrect. [2] I cannot recall having seen any (previous) issues/bugs with "Missing endstream" errors.	2018-08-26 11:51:11 +02:00
Jonas Jenwald	497b765ede	Attempt to combine separate beginText/endText sequences in `getTextContent` (issue 9984) Please note that while this improves issue 9984 slightly (and likely others too), it's not a complete solution. The remaining issues are related to the, more general, problems with the existing heuristics related to attempting to combine separate text items.	2018-08-18 13:45:32 +02:00
Brendan Dahl	5f67a6a237	Always fallback to system font on font failure. The font in the PDF is marked as a CIDFontType0, but the font file is actually a true type font. To fully address this issue we should really peek into the font file and try to determine what it is. However, this is the first case of this issue, so I think this solution is acceptable for now.	2018-08-03 16:49:22 -07:00
Brian	2a665ebad4	Removed Extraneous Matrix Check in CalRGB Conversion	2018-08-02 10:16:42 -07:00
Tim van der Meij	716acf63d4	Merge pull request #9938 from Snuffleupagus/issue-9915 Ensure that Type0, i.e. composite, OpenType fonts with `CFF ` tables are not treated as CFF fonts if their glyph mapping is non-default (issue 9915)	2018-08-02 00:11:18 +02:00
Jonas Jenwald	3ce420131f	Prefer the Width/Height of the image data, rather than the image dictionary, for JPEG 2000 images (issue 9650) According to the PDF specification, see https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#page=45 > When using the JPXDecode filter with image XObjects, the following changes to and constraints on some entries in the image dictionary shall apply (see 8.9.5, "Image Dictionaries" for details on these entries): > > - Width and Height shall match the corresponding width and height values in the JPEG2000 data. > > - . . . Hence it seems reasonable to use the Width/Height of the image data itself, rather than the image dictionary when there's a mismatch. Given that JPEG 2000 images are already being parsed, in order to obtain basic parameters, the actual Width/Height is readily available in the `PDFImage` constructor.	2018-08-01 16:42:26 +02:00
Jonas Jenwald	690bcc8c8a	Add a reduced, `eq`, test-case for issue 9915	2018-07-29 23:06:15 +02:00
Jonas Jenwald	2b25deb84c	Prevent errors in `sanitizeTTProgram`, during parsing of CALL functions, when encountering invalid functions stack deltas (bug 1473809) I was feeling bored; so this is a very quick, and somewhat naive, attempt at fixing the bug. The breaking error, i.e. `Error during font loading: invalid array length`, was thrown when attempting to re-size the `stack` to a negative length when parsing the CALL functions. Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1473809.	2018-07-10 09:45:55 +02:00
Jonas Jenwald	7f21e38787	Error, rather than warn, once a number of invalid path operators are encountered in `EvaluatorPreprocessor.read` (bug 1443140) Incomplete path operators, in particular, can result in fairly chaotic rendering artifacts, as can be observed on page four of the referenced PDF file. The initial (naive) solution that was attempted, was to simply throw a `FormatError` as soon as any invalid (i.e. too short) operator was found and rely on the existing `ignoreErrors` code-paths. However, doing so would have caused regressions in some files; see the existing `issue2391-1` test-case, which was promoted to an `eq` test to help prevent future bugs. Hence this patch, which adds special handling for invalid path operators since those may cause quite bad rendering artifacts. You could, in all fairness, argue that the patch is a handwavy solution and I wouldn't object. However, given that this only concerns corrupt PDF files, the way that PDF viewers (PDF.js included) try to gracefully deal with those could probably be described as a best-effort solution anyway. This patch also adjusts the existing `warn`/`info` messages to print the command name according to the PDF specification, rather than an internal PDF.js enumeration value. The former should be much more useful for debugging purposes. Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1443140.	2018-06-24 16:05:08 +02:00
Jonas Jenwald	56e3648b65	Add basic validation of the 'trailer' dictionary candidates in `XRef.indexObjects` (issue 9418) This patch avoids choosing a (possible) 'trailer' dictionary that `XRef.parse` and/or the `Catalog` constructor/methods will reject anyway. Since `XRef.indexObjects` is already parsing the entire PDF file, the extra dictionary look-ups added here shouldn't matter much. Besides, this is a fallback code-path that only applies to corrupt PDF files anyway.	2018-06-20 13:41:22 +02:00
Jonas Jenwald	6bbcafcd26	Let `Lexer.getNumber` treat a single decimal point as zero (issue 9252) This is consistent with the behaviour in Adobe Reader.	2018-06-20 13:41:21 +02:00
Jonas Jenwald	bf0db0fb72	Pass the `ignoreErrors` API option to the `FontFaceObject` constructor, and utilize it in `getPathGenerator` to ignore missing glyphs Obviously it's still not possible to render non-embedded fonts as paths, but in this way the rest of the page will at least be allowed to continue rendering. Please note: Including the 14 standard fonts in PDF.js probably wouldn't be that difficult to implement. (I'm not a lawyer, but the fonts from PDFium could probably be used given their BSD license.) However, the main blocker ought to be the total size of the necessary font data, since I cannot imagine people being OK with shipping ~5 MB of (additional) font data with Firefox. (Based on the reactions when the CMap files were added, and those are only ~1 MB in size.)	2018-06-13 11:02:06 +02:00
Jonas Jenwald	620f65488b	Ignore the rest of the image when encountering an EOI (End of Image) marker while parsing Scan data (issue 9679)	2018-05-30 22:40:11 +02:00
Jani Pehkonen	8ea505545a	Use FDSelect and FDArray when converting CFF CID font to paths	2018-04-10 16:44:42 +03:00
Jonas Jenwald	d431ae069d	Attempt to handle corrupt PDF documents that inline Page dictionaries in a Kids array (issue 9540) According to the specification, see https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G6.1942297, the contents of a Kids array should be indirect objects.	2018-03-12 14:13:23 +01:00
Jonas Jenwald	f05e5c5460	Take the dictionary, and not just the image data, into account when caching inline images (issue 9398) The reason for the bug is that we're only computing a checksum of the image data itself, but completely ignore the inline dictionary. The latter is important, since in practice it's not uncommon for inline images to be identical but use e.g. different ColourSpaces. There's obviously a couple of different ways that we could compute a hash/checksum of the dictionary. Initially I tried using `MurmurHash3_64` to compute a hash of the keys/values in the dictionary. Unfortunately this approach turned out to be way too slow in practice, especially for PDF files with a huge number of inline images; in particular issue 2618 would regress quite badly with this solution. The solution that is instead implemented in this patch, is to compute a checksum of the dictionary contents. While this is a much simpler, not to mention a lot more efficient, solution there's one drawback associated with it: If the contents of inline image dictionaries are ordered differently, they will not be considered equal with this approach which could thus lead to failures to cache repeated inline images. In practice this doesn't seem to be a problem in any of the PDF files I've tested, and generally I'd rather err on the side of not caching given that too aggressive caching can easily lead to rendering bugs. One small, but somewhat annoying, complication is that by the time `Parser.makeInlineImage` is called, we no longer know the exact stream position where the inline image dictionary starts. Having access to that information is crucial here, and the easiest solution I could come up with is to track this in the current `Lexer` instance.[1] With the patch, we're thus able to fix the referenced issues without incurring large regressions in problematic cases such as issue 2618. Fixes 9398; also improves/fixes the `issue8823` reference test. 
--- [1] Obviously I'd have preferred if this patch could be limited to `Parser.makeInlineImage`, without the need for this "hack", but I'm not sure what that'd look like here.	2018-02-12 16:43:47 +01:00
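
Commits 96ba6afd47 and b76cf665ec above both rely on the difference between `String.fromCharCode` and `String.fromCodePoint` for characters outside the Basic Multilingual Plane. The following is a minimal sketch of that difference, not code taken from PDF.js: `decodeUtf16BE` is a hypothetical helper written for illustration, and the code point `0x1f600` is an arbitrary supplementary-plane example.

```javascript
// Illustrative sketch only; `decodeUtf16BE` is a hypothetical helper, not PDF.js code.
// It turns big-endian UTF-16 bytes into a string, joining surrogate pairs into
// a single code point before calling String.fromCodePoint.
function decodeUtf16BE(bytes) {
  let result = "";
  for (let i = 0; i + 1 < bytes.length; i += 2) {
    let unit = (bytes[i] << 8) | bytes[i + 1];
    // A high surrogate followed by a low surrogate encodes one code point.
    if (unit >= 0xd800 && unit <= 0xdbff && i + 3 < bytes.length) {
      const low = (bytes[i + 2] << 8) | bytes[i + 3];
      if (low >= 0xdc00 && low <= 0xdfff) {
        unit = 0x10000 + ((unit - 0xd800) << 10) + (low - 0xdc00);
        i += 2; // The low surrogate has been consumed as well.
      }
    }
    result += String.fromCodePoint(unit);
  }
  return result;
}

const codePoint = 0x1f600; // arbitrary example outside the BMP

// fromCharCode keeps only the low 16 bits, so the character is silently changed.
const truncated = String.fromCharCode(codePoint);
// fromCodePoint emits the correct surrogate pair.
const full = String.fromCodePoint(codePoint);
const decoded = decodeUtf16BE([0xd8, 0x3d, 0xde, 0x00]);

console.log(truncated.codePointAt(0).toString(16)); // f600
console.log(full.codePointAt(0).toString(16)); // 1f600
console.log(decoded === full); // true
```

The same truncation applies to the plane 16 private use area mentioned in b76cf665ec: `String.fromCharCode(0x100000)` yields U+0000, whereas `String.fromCodePoint(0x100000)` yields a two-unit string that round-trips through `codePointAt(0)`.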

829 Commits