pdf.js

Author	SHA1	Message	Date
Jonas Jenwald	49e8a270c4	Update `ChunkedStream.makeSubStream` to actually check if (some) data exists when the `length` parameter is undefined Note how `XRef.fetchUncompressed`, which is used a lot for most PDF documents, is calling the `makeSubStream` method without providing a `length` argument. In practice this results in the `makeSubStream` method, on the `ChunkedStream` instance, calling the `ensureRange` method with `NaN` as the end position, thus resulting in no data being requested despite it possibly being necessary. This may be quite bad, since in this particular case it will lead to a new `ChunkedStream` being created and also a new `Parser`/`Lexer` instance. Given that it's quite possible that even the very first `Parser.getObj` call could throw `MissingDataException`, this could thus lead to wasted time/resources (since re-parsing is necessary once the data finally arrives). You obviously need to be very careful to not have `ChunkedStream.makeSubStream` accidentally requesting the entire file, hence its `this.end` property is of no use here, but it should be possible to at least check that the `start` of the data is present before any potentially expensive parsing occurs.	2019-03-29 17:20:31 +01:00
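The `NaN` problem described in the commit above can be reproduced with a small stand-alone sketch. `SketchChunkedStream`, its `requested` log, and both `makeSubStream*` variants are hypothetical simplifications written for illustration; they are not the PDF.js implementation, and the "checked" variant only assumes that verifying the chunk containing `start` is enough.

```javascript
// Hypothetical, heavily simplified stand-in for a chunked stream.
class SketchChunkedStream {
  constructor(totalLength, chunkSize) {
    this.end = totalLength;
    this.chunkSize = chunkSize;
    this.loadedChunks = new Set();
    this.requested = []; // chunk numbers that would have been requested
  }

  ensureRange(begin, end) {
    const beginChunk = Math.floor(begin / this.chunkSize);
    const endChunk = Math.floor((end - 1) / this.chunkSize) + 1;
    // If `end` is NaN then `endChunk` is NaN, every comparison is false,
    // and the loop body never runs: no data is requested at all.
    for (let chunk = beginChunk; chunk < endChunk; chunk++) {
      if (!this.loadedChunks.has(chunk)) {
        this.requested.push(chunk);
      }
    }
  }

  // Mirrors the problem: `start + undefined` evaluates to NaN.
  makeSubStreamNaive(start, length) {
    this.ensureRange(start, start + length);
  }

  // Assumed fix: without a length, at least check the byte at `start`,
  // and deliberately avoid `this.end` so the whole file is never requested.
  makeSubStreamChecked(start, length) {
    if (length === undefined) {
      this.ensureRange(start, start + 1);
    } else {
      this.ensureRange(start, start + length);
    }
  }
}
```

With a chunk size of 10, calling `makeSubStreamNaive(25)` records no request, while `makeSubStreamChecked(25)` records a request for the single chunk that contains byte 25.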
Tim van der Meij	f9c58115fc	Merge pull request #10683 from janpe2/type0-noncid-cmap Use CMap in Type0 fonts when CFF is not a CID font	2019-03-28 00:07:08 +01:00
Jonas Jenwald	9077abc263	Take the `FirstChar`/`LastChar` properties into account when computing the hash in `PartialEvaluator.preEvaluateFont` (issue 10665) Without this some fonts may incorrectly end up with matching `hash`es, thus breaking rendering since we'll not actually try to load/parse some of the fonts.	2019-03-27 16:27:10 +01:00
Jonas Jenwald	a2a824ed01	Don't accidentally use an empty `hash` value when comparing `preEvaluatedFonts` in `PartialEvaluator.loadFont` Note that `PartialEvaluator.preEvaluateFont` will return an empty string when no hash was computed. This will completely short-circuit the `fontAlias` comparison in `PartialEvaluator.loadFont`, since fonts which are totally different will then match if their `hash`es are empty.	2019-03-27 00:54:39 +01:00
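A minimal sketch of the empty-`hash` pitfall from the commit above. The font objects and the two comparison helpers are hypothetical and only model the comparison; they do not reproduce the actual `PartialEvaluator.loadFont` logic.

```javascript
// Hypothetical pre-evaluated fonts for which no hash could be computed.
const fontA = { name: 'FontA', hash: '' };
const fontB = { name: 'FontB', hash: '' };

// Naive comparison: two unrelated fonts "match" because '' === ''.
function isSameFontNaive(a, b) {
  return a.hash === b.hash;
}

// Guarded comparison: an empty hash never counts as a match.
function isSameFontGuarded(a, b) {
  return a.hash !== '' && a.hash === b.hash;
}
```

Here `isSameFontNaive(fontA, fontB)` is `true` even though the fonts differ, whereas `isSameFontGuarded(fontA, fontB)` is `false`.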
Jani Pehkonen	49c6233fbc	Use CMap in Type0 fonts when CFF is not a CID font	2019-03-26 19:38:44 +02:00
Tim van der Meij	33bfbef6ba	Merge pull request #10635 from timvandermeij/lexer-parser Convert `src/core/parser.js` to ES6 syntax and write more unit tests for the lexer and the parser	2019-03-19 23:17:34 +01:00
Tim van der Meij	7d3cb19571	Convert the `Linearization` class in `src/core/parser.js` to ES6 syntax Moreover, disable `var` usage for this file.	2019-03-17 13:27:45 +01:00
Jonas Jenwald	56eeeea1dc	Re-factor the `getTransfers` helper function into a "private" getter method on the `OperatorList` This function is currently called with the `OperatorList` instance as its argument, hence I cannot think of any good reason for not just moving it into the `OperatorList` properly. (This will also help with other planned changes regarding the `ImageCache` functionality.)	2019-03-16 13:06:51 +01:00
Jonas Jenwald	7273795eb6	Actually transfer eligible ImageMask data, rather than always copying it By transferring `ArrayBuffer`s you can avoid having two copies of the same data, i.e. one copy on each of the worker/main-thread, for data that's used only once on the worker-thread. Note how the code in [`PDFImage.createMask`](`80135378ca/src/core/image.js (L284-L285)`) goes to great lengths to actually enable transferring of the image data. However in [`PartialEvaluator.buildPaintImageXObject`](`80135378ca/src/core/evaluator.js (L336)`) the `cached` property is always set to `true`, which disqualifies the image data from being transferred; see [`getTransfers`](`80135378ca/src/core/operator_list.js (L552-L554)`). For most ImageMask data this patch won't matter, since images found in the `/Resources -> /XObject` dictionary will always be indexed by name. However for inline images which contain ImageMask data, where only "small" images are cached (in both `parser.js` and `evaluator.js`), the current code will result in some unnecessary memory usage.	2019-03-16 13:06:32 +01:00
Jonas Jenwald	88f9e633dd	Try to improve text-selection for Type3 fonts that utilize a non-default /FontMatrix (bug 1513120) For Type3 fonts text-selection is often not that great, and there's a couple of heuristics used to try and improve things. This patch simply extends those heuristics a bit, and fixes a pre-existing "naive" array comparison, but this all feels a bit brittle to say the least. The existing Type3 test-coverage isn't that great in general, and in particular Type3 `text` tests are few and far between, hence why this patch adds two different new `text` tests.	2019-03-12 10:32:08 +01:00
Tim van der Meij	8d4d7dbf58	Convert the `Lexer` class in `src/core/parser.js` to ES6 syntax	2019-03-10 19:04:36 +01:00
Tim van der Meij	7d0ecee771	Convert the `Parser` class in `src/core/parser.js` to ES6 syntax	2019-03-10 19:04:35 +01:00
Tim van der Meij	d587abbceb	Merge pull request #10633 from Snuffleupagus/murmurhash-class Convert `MurmurHash3_64` to an ES6 class	2019-03-09 21:07:12 +01:00
Jonas Jenwald	6b1ac44aea	Convert `MurmurHash3_64` to an ES6 class Notable changes: - Remove the `return this;` from the `MurmurHash3_64.update` method, since it's completely unused and doesn't make a lot of sense. - Remove the loop(s) from the `MurmurHash3_64.hexdigest` method, since creating a temporary array and then looping over it is wasteful given how simple this can be written with modern JavaScript.	2019-03-09 17:03:06 +01:00
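The loop-free `hexdigest` idea mentioned in the commit above could look roughly like this sketch. `hexdigest64` and its two 32-bit inputs are hypothetical names chosen for illustration, not the actual `MurmurHash3_64` internals; the point is only that `String.prototype.padStart` replaces a manual zero-padding loop.

```javascript
// Hypothetical helper: format two 32-bit halves as one 16-character hex string.
function hexdigest64(h1, h2) {
  // `>>> 0` coerces each half to an unsigned 32-bit integer first.
  const hex1 = (h1 >>> 0).toString(16);
  const hex2 = (h2 >>> 0).toString(16);
  // padStart left-pads each half with zeros up to 8 hex digits.
  return hex1.padStart(8, '0') + hex2.padStart(8, '0');
}
```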
Jonas Jenwald	2665502055	Move `NativeImageDecoder` into a separate file, and convert it to a `class` Given the size of the `src/core/evaluator.js` file, it cannot hurt to move some of its (image related) helper functionality into a separate file.	2019-03-09 15:59:04 +01:00
Tim van der Meij	8b149b818e	Merge pull request #10615 from Snuffleupagus/corrupt-inline-ASCII85Decode Handle corrupt ASCII85Decode inline images with whitespace "inside" of the EOD marker (issue 10614)	2019-03-08 23:06:01 +01:00
Jonas Jenwald	3ce8fe7927	Handle corrupt ASCII85Decode inline images with whitespace "inside" of the EOD marker (issue 10614) There's a number of things wrong with the PDF document, since its inline images are first of all a lot larger than the 4 KB limit (as mandated by the specification, see https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G7.1852045). Furthermore the actual ASCII85Decode data is interspersed with a lot of needless whitespace, in particular also "inside" of the EOD (end-of-data) marker which thus completely breaks the detection. Note that according to the specification, see https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G6.1940130, this patch should be safe since it explicitly mentions that all whitespace should be ignored.	2019-03-04 23:41:36 +01:00
Jonas Jenwald	4170c414fa	Reduce usage of `Date.now()` in `src/core/worker.js` Currently for every single parsed/rendered page there's no less than four `Date.now()` calls being made on the worker-side. This seems totally unnecessary, since the results of these calls are, by default, not used for anything unless the verbosity level is set to `INFO`.	2019-03-02 20:23:52 +01:00
Brendan Dahl	7d6ab081eb	Put the string name of the glyph in the charset array. Also, only warn once per font when missing a glyph name.	2019-03-01 18:03:51 -08:00
Brendan Dahl	34022d2fd1	Merge pull request #10591 from brendandahl/fix-charset Add unique glyph names for CFF fonts.	2019-02-28 17:22:29 -08:00
Brendan Dahl	8a596ef5d5	Add unique glyph names for CFF fonts. Printing on MacOS was broken with the previous approach of just mapping all the glyphs to notdef.	2019-02-27 15:00:29 -08:00
Jonas Jenwald	db5dc14158	Move worker-thread only functions from `src/shared/util.js` and into a new `src/core/core_utils.js` file The `src/shared/util.js` file is being bundled into both the `pdf.js` and `pdf.worker.js` files, meaning that its code is by definition duplicated. Some main-thread only utility functions have already been moved to a separate `src/display/display_utils.js` file, and this patch simply extends that concept to utility functions which are used only on the worker-thread. Note in particular the `getInheritableProperty` function, which expects a `Dict` as input and thus cannot possibly ever be used on the main-thread.	2019-02-24 00:35:39 +01:00
Jonas Jenwald	60f6d49ff7	[api-minor] Expose the existence of a `Collection` dictionary via the `getMetadata` API method (issue 10555) Given the complexity of this functionality, and the fact that it doesn't seem widely used, I highly doubt that it'd ever make sense to support Collections; see also https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#M11.9.39646.2Heading.824.Collections	2019-02-15 15:40:31 +01:00
Jonas Jenwald	b6d090cc14	Fallback to the built-in font renderer when font loading fails After PR 9340 all glyphs are now re-mapped to a Private Use Area (PUA) which means that if a font fails to load, for whatever reason[1], all glyphs in the font will now render as Unicode glyph outlines. This obviously doesn't look good, to say the least, and might be seen as a "regression" since previously many glyphs were left in their original positions which provided a slightly better fallback[2]. Hence this patch, which implements a general fallback to the PDF.js built-in font renderer for fonts that fail to load (i.e. are rejected by the sanitizer). One caveat here is that this only works for the Font Loading API, since it's easy to handle errors in that case[3]. The solution implemented in this patch does not in any way delay the loading of valid fonts, which was the problem with my previous attempt at a solution, and will only require a bit of extra work/waiting for those fonts that actually fail to load. Please note: This patch doesn't fix any of the underlying PDF.js font conversion bugs that are responsible for creating corrupt font files, however it does improve rendering in a number of cases; refer to this possibly incomplete list: [Bug 1524888](https://bugzilla.mozilla.org/show_bug.cgi?id=1524888) Issue 10175 Issue 10232 --- [1] Usually because the PDF.js font conversion code wasn't able to parse the font file correctly. [2] Glyphs fell back to some default font, which while not accurate was more useful than the current state. [3] Furthermore I'm not sure how to implement this generally, assuming that's even possible, and don't really have time/interest to look into it either.	2019-02-11 10:27:08 +01:00
Tsukasa OI	96ba6afd47	Fix copying on supplementary plane characters pdf.js had a problem when copying characters on supplementary planes (0xPPXXXX where PP is nonzero). This is because certain methods of PartialEvaluator use classic String.fromCharCode instead of ES6's String.fromCodePoint. Despite the fact that readToUnicode method tried to parse out-of-UCS2 code points by parsing UTF-16BE, it was inadequate because String.fromCharCode only supports UCS-2 range of Unicode.	2019-02-10 18:14:53 +09:00
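The difference between the two standard string methods named in the commit above can be seen with a single supplementary-plane code point. This is a stand-alone illustration, not code from `PartialEvaluator`; the chosen code point (U+1F600) is just an example.

```javascript
// A code point outside the Basic Multilingual Plane.
const codePoint = 0x1f600;

// String.fromCharCode keeps only the low 16 bits (0xF600 here),
// so the result is a single, unrelated UTF-16 code unit.
const viaCharCode = String.fromCharCode(codePoint);

// String.fromCodePoint emits the correct surrogate pair (two code units).
const viaCodePoint = String.fromCodePoint(codePoint);

// Reading the first code point back shows which variant round-trips.
const roundTripCharCode = viaCharCode.codePointAt(0);
const roundTripCodePoint = viaCodePoint.codePointAt(0);
```

Only `roundTripCodePoint` equals the original `codePoint`, which is why copying text from such a document needs `String.fromCodePoint`.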
Jonas Jenwald	6f94a05a29	Do the final text scaling correctly in `flushTextContentItem` (issue 8276) It's necessary to take into account whether or not the text is vertical, to avoid either the textContent `width` or `height` becoming incorrect.	2019-01-29 15:24:04 +01:00
Jonas Jenwald	29f36d7a1b	Reduce unnecessary duplication of the `isDefaultDecode` methods on `ColorSpace` instances The recent PR 10482 made me realize that I missed an opportunity for simplification when doing the class conversion of this code in PR 10007.	2019-01-25 08:53:08 +01:00
Tim van der Meij	e2701d5422	Merge pull request #10482 from janpe2/indexed-decode Implement Decode entry in Indexed images	2019-01-24 23:46:55 +01:00
Jonas Jenwald	41fbc71ef9	Ensure that `XRef.indexObjects` can handle object numbers with zero-padding (issue 10491) All objects in the PDF document follow this pattern: ``` 0000000001 0 obj << % Some content here... >> endobj 0000000002 0 obj << % More content here... endobj ```	2019-01-24 22:37:18 +01:00
Jani Pehkonen	26121177ab	Implement Decode entry in Indexed images	2019-01-22 22:51:04 +02:00
Jonas Jenwald	24a688d6c6	Convert some usage of `indexOf` to `startsWith`/`includes` where applicable In many cases in the code you don't actually care about the index itself, but rather just want to know if something exists in a String/Array or if a String starts in a particular way. With modern JavaScript functionality, it's thus possible to remove a number of existing `indexOf` cases.	2019-01-18 17:57:41 +01:00
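The kind of rewrite described in the commit above, shown on made-up values. The strings and the array are illustrative only and are not taken from the PDF.js source.

```javascript
// Hypothetical inputs.
const fontName = 'Symbol-Bold';
const filters = ['FlateDecode', 'DCTDecode'];

// Before: the index is computed but only compared against 0 or -1.
const startsOld = fontName.indexOf('Symbol') === 0;
const hasOld = filters.indexOf('DCTDecode') !== -1;

// After: the intent is stated directly.
const startsNew = fontName.startsWith('Symbol');
const hasNew = filters.includes('DCTDecode');
```

Both forms give the same answers here; the newer methods simply make it obvious that the index itself is never used.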
Jonas Jenwald	b531fc4106	Avoid truncating inline images, where the data and the "EI" marker is glued together (issue 10388) (#10436) Thanks to the excellent debugging done by @janpe2, this was easy to fix!	2019-01-12 20:31:23 +01:00
Jonas Jenwald	d4a3858ed5	Handle more cases of corrupt PDF files with missing 'endobj' operators, where the "obj" string is immediately followed by the dictionary (PR 9288 follow-up)	2019-01-10 17:55:28 +01:00
Tim van der Meij	f162fed6b9	Convert `src/core/charsets.js` and `src/core/standard_fonts.js` to ES6 syntax Moreover, include the "no var" ESLint comment to `src/core/annotation.js` and `src/core/ps_parser.js` since they are already converted.	2019-01-06 15:04:01 +01:00
Tim van der Meij	3b637e71d4	Convert `src/core/arithmetic_decoder.js` to ES6 syntax	2019-01-06 15:04:01 +01:00
Brendan Dahl	32eace043b	Fix reading number of HMTX metrics. The length of the HHEA table can be incorrect, so it is better to read the number of metrics offset from beginning of table instead.	2019-01-04 15:13:13 -08:00
Jonas Jenwald	66fccd860b	Adjust how `AnnotationBorderStyle.setWidth` handles the input being a `Name` (issue 10385) In order to be consistent with the behaviour in Adobe Reader, the width will now always be set to zero when the input is a `Name`.	2019-01-04 10:38:10 +01:00
Tim van der Meij	2d00bb098b	Merge pull request #10404 from Snuffleupagus/issue-10401 Remove the `for ... of` loop from the `PDFDocument.fingerprint` getter (issue 10401)	2019-01-03 22:46:51 +01:00
Brendan Dahl	e2686db49b	Merge pull request #10277 from janpe2/cff-stems Repair CFF fonts if stem hints are in wrong order	2019-01-03 10:30:43 -08:00
Jonas Jenwald	8c278530dd	Remove the `for ... of` loop from the `PDFDocument.fingerprint` getter (issue 10401) It appears that the `Symbol` polyfill doesn't work well in conjunction with `TypedArray`s, and that part of PR 10393 is thus reverted.	2019-01-03 11:17:45 +01:00
Tim van der Meij	d8f201ea2a	Merge pull request #10397 from Snuffleupagus/issue-10385 Ensure that `AnnotationBorderStyle.setWidth` is able to handle the input being a `Name`, to correctly deal with corrupt PDF documents (issue 10385)	2018-12-31 12:58:28 +01:00
Jonas Jenwald	76a9580aeb	Ensure that `AnnotationBorderStyle.setWidth` is able to handle the input being a `Name`, to correctly deal with corrupt PDF documents (issue 10385)	2018-12-31 12:21:28 +01:00
Jonas Jenwald	15b3806937	Actually validate the input in `AnnotationBorderStyle.setStyle`	2018-12-31 12:15:15 +01:00
Tim van der Meij	d5e5d18430	Convert the `PDFDocument` class in `src/core/document.js` to ES6 syntax	2018-12-30 13:54:43 +01:00
Tim van der Meij	612fc9fcc2	Convert the `Page` class in `src/core/document.js` to ES6 syntax	2018-12-30 13:54:43 +01:00
Tim van der Meij	aad27ff9a0	Optimize the `Ref` class in `src/core/primitives.js` The `toString` method always creates two string objects (for the 'R' character and for the `num` concatenation) and in the worst case creates three string objects (one more for the `gen` concatenation). For the Tracemonkey paper alone, this resulted in 12000 string objects when scrolling from the top to the bottom of the document. Since this is a hot function, it's worth minimizing the number of string objects, especially for large documents, to reduce peak memory usage. This commit refactors the `toString` method to always create only one string object.	2018-12-29 17:48:41 +01:00
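A sketch of the string-building difference described in the commit above. `SketchRef` is a hypothetical stand-in for the `Ref` class, and both method bodies are assumptions reconstructed from the commit message rather than copies of `src/core/primitives.js`.

```javascript
// Hypothetical stand-in for an indirect object reference (num, gen).
class SketchRef {
  constructor(num, gen) {
    this.num = num;
    this.gen = gen;
  }

  // Step-wise concatenation: an intermediate string for `num + 'R'`,
  // plus one more when `gen` is appended.
  toStringConcat() {
    let str = this.num + 'R';
    if (this.gen !== 0) {
      str += this.gen;
    }
    return str;
  }

  // Single-expression variant: each branch builds the result in one go.
  toString() {
    if (this.gen === 0) {
      return `${this.num}R`;
    }
    return `${this.num}R${this.gen}`;
  }
}
```

Both methods return identical strings, e.g. `12R` for generation 0 and `12R3` otherwise; how many temporary strings an engine actually allocates is implementation-dependent, so any saving should be confirmed by profiling.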
Jonas Jenwald	60bcce184e	Check that the first page can be successfully loaded, to try and ascertain the validity of the XRef table (issue 7496, issue 10326) For PDF documents with sufficiently broken XRef tables, it's usually quite obvious when you need to fallback to indexing the entire file. However, for certain kinds of corrupted PDF documents the XRef table will, for all intents and purposes, appear to be valid. It's not until you actually try to fetch various objects that things will start to break, which is the case in the referenced issues[1]. Since there's generally a real effort being made in PDF.js to load even corrupt PDF documents, this patch contains a suggested approach to attempt to do a bit more validation of the XRef table during the initial document loading phase. Here the choice is made to attempt to load the first page, as a basic sanity check of the validity of the XRef table. Please note that attempting to load a more-or-less arbitrarily chosen object without any context of what it's supposed to be isn't very useful, which is why this particular choice was made. Obviously, just because the first page can be loaded successfully that doesn't guarantee that the entire XRef table is valid, however if even the first page fails to load you can be reasonably sure that the document is not valid[2]. Even though this patch won't cause any significant increase in the amount of parsing required during initial loading of the document[3], it will require loading of more data upfront which thus delays the initial `getDocument` call. Whether or not this is a problem depends very much on what you actually measure, please consider the following examples:
```javascript
console.time('first');
getDocument(...).promise.then((pdfDocument) => {
  console.timeEnd('first');
});

console.time('second');
getDocument(...).promise.then((pdfDocument) => {
  pdfDocument.getPage(1).then((pdfPage) => { // Note: the API uses `pageNumber >= 1`, the Worker uses `pageIndex >= 0`.
    console.timeEnd('second');
  });
});
```
The first case is pretty much guaranteed to show a small regression, however the second case won't be affected at all since the Worker caches the result of `getPage` calls. Again, please remember that the second case is what matters for the standard PDF.js use-case which is why I'm hoping that this patch is deemed acceptable. --- [1] In issue 7496, the problem is that the document is edited without the XRef table being correctly updated. In issue 10326, the generator was sorting the XRef table according to the offsets rather than the objects. [2] The idea of checking the first page in particular came from the "standard" use-case for the PDF.js library, i.e. the default viewer, where a failure to load the first page basically means that nothing will work; note how `{BaseViewer, PDFThumbnailViewer}.setDocument` depends completely on being able to fetch the first page. [3] The only extra parsing is caused by, potentially, having to traverse part of the `Pages` tree to find the first page.	2018-12-29 12:47:25 +01:00
Tim van der Meij	360c3d3813	Remove the unused `url` argument for the `ChunkedStreamManager` class	2018-12-24 13:14:42 +01:00
Tim van der Meij	47344197f4	Convert `src/core/chunked_stream.js` to ES6 syntax	2018-12-24 13:14:42 +01:00
Jonas Jenwald	b05f053287	[api-minor] Add support for OpenAction destinations (issue 10332) Note that the OpenAction dictionary may contain other information besides just a destination array, e.g. instructions for auto-printing[1]. Given first of all that an arbitrary `Dict` cannot be sent from the Worker (since cloning would fail), and second of all that the data obviously needs to be validated, this patch purposely only adds support for fetching a destination from the OpenAction entry[2]. --- [1] This information is, currently in PDF.js, being included through the `getJavaScript` API method. [2] This significantly reduces the complexity of the implementation, which seems fine for now. If there's ever need for other kinds of OpenAction to be fetched, additional API methods could/should be implemented as necessary (could e.g. follow the `getOpenActionWhatever` naming scheme).	2018-12-19 11:45:16 +01:00

1448 Commits