Sakurai/pdf.js - pdf.js - Gitea on kemo

Sakurai/pdf.js

Author	SHA1	Message	Date
Jonas Jenwald	4b51bcc733	Ensure that `PDFImage.buildImage` won't accidentally swallow errors, e.g. from ColorSpace parsing (issue 6707, PR 11601 follow-up) Because of a really stupid `Promise`-related mistake on my part, when re-factoring `PDFImage.buildImage` during the `NativeImageDecoder` removal, we're no longer re-throwing errors occuring during image parsing/decoding as intended. The result is that some (fairly) corrupt documents will never finish loading, and unfortunately there were apparently no sufficiently corrupt images in the test-suite to catch this.	2020-06-13 15:02:37 +02:00
Jonas Jenwald	df7d8c74ca	Extract the actual sending of image data from the `PartialEvaluator.buildPaintImageXObject` method After PRs 10727 and 11912, the code responsible for sending the decoded image data to the main-thread has now become a fair bit more involved the previously. To reduce the amount of duplication here, the actual code responsible for sending the data is thus extracted into a new helper method instead.	2020-06-07 12:01:51 +02:00
Jonas Jenwald	af815e417d	Ensure that that we don't attempt to cache inline images in the `GlobalImageCache` (PR 11912 follow-up) Since inline images, i.e. those defined inside of `/Contents` streams, are by their very definition page-specific it thus seem like a good idea to actually enforce that they won't accidentally end up in the `GlobalImageCache`.	2020-06-01 01:00:30 +02:00
Jonas Jenwald	4ef547f400	Improve caching of empty `/XObject`s in the `PartialEvaluator.getTextContent` method It turns out that `getTextContent` suffers from similar problems with repeated images as `getOperatorList`; please see the previous patch. While only `/XObject` resources of the `Form`-type will actually be parsed in `PartialEvaluator.getTextContent`, since those are the only ones that may contain text, we're still forced to fetch repeated image resources where the name differs (but not the reference). Obviously it's less bad in this case, since we're not actually parsing `/XObject`s of e.g. the `Image`-type. However, you still want to avoid even fetching the data whenever possible, since `Stream`s are not cached on the `XRef` instance (given their potential size) and the lookup can thus be somewhat expensive in general. To address these issues, we can simply replace the exiting name-only caching in `PartialEvaluator.getTextContent` with a new cache backed by `LocalImageCache` instead.	2020-05-26 09:49:01 +02:00
Jonas Jenwald	d62c9181bd	Improve the local image caching in `PartialEvaluator.getOperatorList` Currently the local `imageCache`, as used in `PartialEvaluator.getOperatorList`, will miss certain cases of repeated images because the caching is only done by name (usually using a format such as e.g. "Im0", "Im1", ...). However, in some PDF documents the `/XObject` dictionaries many contain hundreds (or even thousands) of distinctly named images, despite them referring to only a handful of actual image objects (via the XRef table). With these changes we'll now cache local images using both name and (where applicable) reference, thus improving re-usage of images resources even further. This patch was tested using the PDF file from [bug 857031](https://bugzilla.mozilla.org/show_bug.cgi?id=857031), i.e. https://bug857031.bmoattachments.org/attachment.cgi?id=732270, with the following manifest file: ``` [ { "id": "bug857031", "file": "../web/pdfs/bug857031.pdf", "md5": "", "rounds": 250, "lastPage": 1, "type": "eq" } ] ``` which gave the following results when comparing this patch against the `master` branch: ``` -- Grouped By browser, page, stat -- browser \| page \| stat \| Count \| Baseline(ms) \| Current(ms) \| +/- \| % \| Result(P<.05) ------- \| ---- \| ------------ \| ----- \| ------------ \| ----------- \| --- \| ----- \| ------------- firefox \| 0 \| Overall \| 250 \| 2749 \| 2656 \| -93 \| -3.38 \| faster firefox \| 0 \| Page Request \| 250 \| 3 \| 4 \| 1 \| 50.14 \| slower firefox \| 0 \| Rendering \| 250 \| 2746 \| 2652 \| -94 \| -3.44 \| faster ``` While this is certainly an improvement, since we now avoid re-parsing ~1000 images on the first page, all of the image resources are small enough that the total rendering time doesn't improve that much in this particular case. In pathological cases, such as e.g. the PDF document in issue 4958, the improvements with this patch can be very significant. Looking for example at page 2, from issue 4958, the rendering time drops from ~60 seconds with `master` to ~30 seconds with this patch (obviously still slow, but it really showcases the potential of this patch nicely). Finally, note that there's also potential for additional improvements by re-using `LocalImageCache` instances for e.g. /XObject data of the `Form`-type. However, given that recent changes in this area I purposely didn't want to complicate this patch more than necessary.	2020-05-25 15:14:14 +02:00
Jonas Jenwald	18e0b10d3c	[api-minor] Remove the `disableCreateObjectURL` option from the `getDocument` parameters, since it's now unused in the API With the changes in previous patches, the `disableCreateObjectURL` option/functionality is no longer used for anything in the API and/or in the Worker code. Note however that there's some functionality, mainly related to file loading/downloading, in the GENERIC version of the default viewer which still depends on this option. Hence the `disableCreateObjectURL` option (and related compatibility code) is moved into the viewer, see e.g. `web/app_options.js`, such that it's still available in the default viewer.	2020-05-22 00:22:48 +02:00
Jonas Jenwald	0351852d74	[api-minor] Decode all JPEG images with the built-in PDF.js decoder in `src/core/jpg.js` Currently some JPEG images are decoded by the built-in PDF.js decoder in `src/core/jpg.js`, while others attempt to use the browser JPEG decoder. This inconsistency seem unfortunate for a number of reasons: - It adds, compared to the other image formats supported in the PDF specification, a fair amount of code/complexity to the image handling in the PDF.js library. - The PDF specification support JPEG images with features, e.g. certain ColorSpaces, that browsers are unable to decode natively. Hence, determining if a JPEG image is possible to decode natively in the browser require a non-trivial amount of parsing. In particular, we're parsing (part of) the raw JPEG data to extract certain marker data and we also need to parse the ColorSpace for the JPEG image. - While some JPEG images may, for all intents and purposes, appear to be natively supported there's still cases where the browser may fail to decode some JPEG images. In order to support those cases, we've had to implement a fallback to the PDF.js JPEG decoder if there's any issues during the native decoding. This also means that it's no longer possible to simply send the JPEG image to the main-thread and continue parsing, but you now need to actually wait for the main-thread to indicate success/failure first. In practice this means that there's a code-path where the worker-thread is forced to wait for the main-thread, while the reverse should always be the case. - The native decoding, for anything except the simplest of JPEG images, result in increased peak memory usage because there's a handful of short-lived copies of the JPEG data (see PR 11707). Furthermore this also leads to data being parsed on the main-thread, rather than the worker-thread, which you usually want to avoid for e.g. performance and UI-reponsiveness reasons. - Not all environments, e.g. Node.js, fully support native JPEG decoding. This has, historically, lead to some issues and support requests. - Different browsers may use different JPEG decoders, possibly leading to images being rendered slightly differently depending on the platform/browser where the PDF.js library is used. Originally the implementation in `src/core/jpg.js` were unable to handle all of the JPEG images in the test-suite, but over the last couple of years I've fixed (hopefully) all of those issues. At this point in time, there's two kinds of failure with this patch: - Changes which are basically imperceivable to the naked eye, where some pixels in the images are essentially off-by-one (in all components), which could probably be attributed to things such as different rounding behaviour in the browser/PDF.js JPEG decoder. This type of "failure" accounts for the vast majority of the total number of changes in the reference tests. - Changes where the JPEG images now looks ever so slightly blurrier than with the native browser decoder. For quite some time I've just assumed that this pointed to a general deficiency in the `src/core/jpg.js` implementation, however I've discovered when comparing two viewers side-by-side that the differences vanish at higher zoom levels (usually around 200% is enough). Basically if you disable [this downscaling in canvas.js](`8fb82e939c/src/display/canvas.js (L2356-L2395)`), which is what happens when zooming in, the differences simply vanish! Hence I'm pretty satisfied that there's no significant problems with the `src/core/jpg.js` implementation, and the problems are rather tied to the general quality of the downscaling algorithm used. It could even be seen as a positive that all images now share the same downscaling behaviour, since this actually fixes one old bug; see issue 7041.	2020-05-22 00:22:48 +02:00
Jonas Jenwald	dda6626f40	Attempt to cache repeated images at the document, rather than the page, level (issue 11878) Currently image resources, as opposed to e.g. font resources, are handled exclusively on a page-specific basis. Generally speaking this makes sense, since pages are separate from each other, however there's PDF documents where many (or even all) pages actually references exactly the same image resources (through the XRef table). Hence, in some cases, we're decoding the same images over and over for every page which is obviously slow and wasting both CPU and memory resources better used elsewhere.[1] Obviously we cannot simply treat all image resources as-if they're used throughout the entire PDF document, since that would end up increasing memory usage too much.[2] However, by introducing a `GlobalImageCache` in the worker we can track image resources that appear on more than one page. Hence we can switch image resources from being page-specific to being document-specific, once the image resource has been seen on more than a certain number of pages. In many cases, such as e.g. the referenced issue, this patch will thus lead to reduced memory usage for image resources. Scrolling through all pages of the document, there's now only a few main-thread copies of the same image data, as opposed to one for each rendered page (i.e. there could theoretically be twenty copies of the image data). While this obviously benefit both CPU and memory usage in this case, for very large image data this patch may possibly increase persistent main-thread memory usage a tiny bit. Thus to avoid negatively affecting memory usage too much in general, particularly on the main-thread, the `GlobalImageCache` will only cache a certain number of image resources at the document level and simply fallback to the default behaviour. Unfortunately the asynchronous nature of the code, with ranged/streamed loading of data, actually makes all of this much more complicated than if all data could be assumed to be immediately available.[3] Please note: The patch will lead to small movement in some existing test-cases, since we're now using the built-in PDF.js JPEG decoder more. This was done in order to simplify the overall implementation, especially on the main-thread, by limiting it to only the `OPS.paintImageXObject` operator. --- [1] There's e.g. PDF documents that use the same image as background on all pages. [2] Given that data stored in the `commonObjs`, on the main-thread, are only cleared manually through `PDFDocumentProxy.cleanup`. This as opposed to data stored in the `objs` of each page, which is automatically removed when the page is cleaned-up e.g. by being evicted from the cache in the default viewer. [3] If the latter case were true, we could simply check for repeat images before parsing started and thus avoid handling any duplicate image resources.	2020-05-21 18:13:45 +02:00
Brendan Dahl	b1be33c96f	Add more categories of unsupported features. Fixes #11815	2020-05-04 11:02:16 -07:00
Jonas Jenwald	911c33f025	Move the `maybeValidDimensions` check, used with JPEG images, to occur earlier (PR 11523 follow-up) Given that the `NativeImageDecoder.{isSupported, isDecodable}` methods require both dictionary lookups and ColorSpace parsing, in hindsight it actually seems more reasonable to the `JpegStream.maybeValidDimensions` checks first.	2020-04-26 12:07:46 +02:00
Jonas Jenwald	1cc3dbb694	Enable the `dot-notation` ESLint rule Please note: These changes were done automatically, using the `gulp lint --fix` command. This rule is already enabled in mozilla-central, see https://searchfox.org/mozilla-central/rev/567b68b8ff4b6d607ba34a6f1926873d21a7b4d7/tools/lint/eslint/eslint-plugin-mozilla/lib/configs/recommended.js#103-104 The main advantage, besides improved consistency, of this rule is that it reduces the size of the code (by 3 bytes for each case). In the PDF.js code-base there's close to 8000 instances being fixed by the `dot-notation` ESLint rule, which end up reducing the size of even the built files significantly; the total size of the `gulp mozcentral` build target changes from `3 247 456` to `3 224 278` bytes, which is a reduction of `23 178` bytes (or ~0.7%) for a completely mechanical change. A large number of these changes affect the (large) lookup tables used on the worker-thread, but given that they are still initialized lazily I don't think that the new formatting this patch introduces should undo any of the improvements from PR 6915. Please find additional details about the ESLint rule at https://eslint.org/docs/rules/dot-notation	2020-04-17 12:24:46 +02:00
Jonas Jenwald	44b4a74f48	A couple of small `String.fromCodePoint` improvements (PR 11698 and 11769 follow-up) - Add a reduced test-case for issue 11768, to prevent future regressions. (Given that PR 11769 is only a work-around, rather than a proper solution, it may not be entirely accurate for the issue to be closed as fixed.) - Add more validation of the charCode, as found by the heuristics, in `PartialEvaluator._buildSimpleFontToUnicode` to prevent future issues.	2020-04-15 13:45:08 +02:00
Jonas Jenwald	426945b480	Update Prettier to version 2.0 Please note that these changes were done automatically, using `gulp lint --fix`. Given that the major version number was increased, there's a fair number of (primarily whitespace) changes; please see https://prettier.io/blog/2020/03/21/2.0.0.html In order to reduce the size of these changes somewhat, this patch maintains the old "arrowParens" style for now (once mozilla-central updates Prettier we can simply choose the same formatting, assuming it will differ here).	2020-04-14 12:28:14 +02:00
Jonas Jenwald	2d46230d23	[api-minor] Change `Font.exportData` to, by default, stop exporting properties which are completely unused on the main-thread and/or in the API (PR 11773 follow-up) For years now, the `Font.exportData` method has (because of its previous implementation) been exporting many properties despite them being completely unused on the main-thread and/or in the API. This is unfortunate, since among those properties there's a number of potentially very large data-structures, containing e.g. Arrays and Objects, which thus have to be first structured cloned and then stored on the main-thread. With the changes in this patch, we'll thus by default save memory for every `Font` instance created (there can be a lot in longer documents). The memory savings obviously depends a lot on the actual font data, but some approximate figures are: For non-embedded fonts it can save a couple of kilobytes, for simple embedded fonts a handful of kilobytes, and for composite fonts the size of this auxiliary can even be larger than the actual font program itself. All-in-all, there's no good reason to keep exporting these properties by default when they're unused. However, since we cannot be sure that every property is unused in custom implementations of the PDF.js library, this patch adds a new `getDocument` option (named `fontExtraProperties`) that still allows access to the following properties: - "cMap": An internal data structure, only used with composite fonts and never really intended to be exposed on the main-thread and/or in the API. Note also that the `CMap`/`IdentityCMap` classes are a lot more complex than simple Objects, but only their "internal" properties survive the structured cloning used to send data to the main-thread. Given that CMaps can often be very large, not exporting them can also save a fair bit of memory. - "defaultEncoding": An internal property used with simple fonts, and used when building the glyph mapping on the worker-thread. Considering how complex that topic is, and given that not all font types are handled identically, exposing this on the main-thread and/or in the API most likely isn't useful. - "differences": An internal property used with simple fonts, and used when building the glyph mapping on the worker-thread. Considering how complex that topic is, and given that not all font types are handled identically, exposing this on the main-thread and/or in the API most likely isn't useful. - "isSymbolicFont": An internal property, used during font parsing and building of the glyph mapping on the worker-thread. - "seacMap": An internal map, only potentially used with some Type1/CFF fonts and never intended to be exposed in the API. The existing `Font.{charToGlyph, charToGlyphs}` functionality already takes this data into account when handling text. - "toFontChar": The glyph map, necessary for mapping characters to glyphs in the font, which is built upon the various encoding information contained in the font dictionary and/or font program. This is not directly used on the main-thread and/or in the API. - "toUnicode": The unicode map, necessary for text-extraction to work correctly, which is built upon the ToUnicode/CMap information contained in the font dictionary, but not directly used on the main-thread and/or in the API. - "vmetrics": An array of width data used with fonts which are composite and vertical, but not directly used on the main-thread and/or in the API. - "widths": An array of width data used with most fonts, but not directly used on the main-thread and/or in the API.	2020-04-06 11:47:09 +02:00
Jonas Jenwald	2619272d73	Change the signature of `TranslatedFont`, and convert it to a proper class In preparation for the next patch, this changes the signature of `TranslatedFont` to take an object rather than individual parameters. This also, in my opinion, makes the call-sites easier to read since it essentially provides a small bit of documentation of the arguments. Finally, since it was necessary to touch `TranslatedFont` anyway it seemed like a good idea to also convert it to a proper `class`.	2020-04-05 20:53:48 +02:00
Jonas Jenwald	59f54b946d	Ensure that all `Font` instances have the `vertical` property set to a boolean Given that the `vertical` property is always accessed on the main-thread, ensuring that the property is explicitly defined seems like the correct thing to do since it also avoids boolean casting elsewhere in the code-base.	2020-04-05 16:27:50 +02:00
Jonas Jenwald	dcb16af968	Whitelist closure related cases to address the remaining `no-shadow` linting errors Given the way that "classes" were previously implemented in PDF.js, using regular functions and closures, there's a fair number of false positives when the `no-shadow` ESLint rule was enabled. Note that while some of these `eslint-disable` statements can be removed if/when the relevant code is converted to proper `class`es, we'll probably never be able to get rid of all of them given our naming/coding conventions (however I don't really see this being a problem).	2020-03-25 11:57:12 +01:00
Jonas Jenwald	216cbca16c	Remove variable shadowing from the JavaScript files in the `src/core/` folder This is part of a series of patches that will try to split PR 11566 into smaller chunks, to make reviewing more feasible. Once all the code has been fixed, we'll be able to eventually enable the ESLint no-shadow rule; see https://eslint.org/docs/rules/no-shadow	2020-03-23 18:28:30 +01:00
Jonas Jenwald	1cd9d5a8fd	Remove the unused `wideChars` property on `Font` instances This property was added in PR 1599 (almost eight years ago), but has been unused ever since PR 3674 (six and a half years ago).	2020-03-20 10:37:32 +01:00
Jonas Jenwald	15e8692eff	Don't accidentally accept invalid glyphNames which appear to follow the Cdd{d}/cdd{d} format in `PartialEvaluator._buildSimpleFontToUnicode` (issue 11697) The /Differences array of the problematic font contains a `/c.1` entry, which is consequently detected as a possible Cdd{d}/cdd{d} glyphName by the existing heuristics. Because of how the base 10 conversion is implemented, which is necessary for the base 16 special case, the parsed charCode becomes `0.1` thus causing `String.fromCodePoint` to throw since that obviously isn't a valid code point. To fix the referenced issue, and to hopefully prevent similar ones in the future, the patch adds additional validation of the charCode found by the heuristics.	2020-03-13 23:35:47 +01:00
Jonas Jenwald	65e6ea2cb2	Prevent lookup errors in `PartialEvaluator.hasBlendModes` from breaking all parsing/rendering of a page (issue 11678) The PDF document in question is corrupt, since it contains an XObject with a truncated dictionary and where the stream contents start without a "stream" operator.	2020-03-09 12:00:12 +01:00
Tim van der Meij	1a97c142b3	Merge pull request #11523 from Snuffleupagus/issue-10880 Add a heuristic, in `src/core/jpg.js`, to handle JPEG images with a wildly incorrect SOF (Start of Frame) `scanLines` parameter (issue 10880)	2020-03-06 23:03:09 +01:00
Jonas Jenwald	160cfc4084	Slightly simplify the lookup of data in `Dict.{get, getAsync, has}` Note that `Dict.set` will only be called with values returned through `Parser.getObj`, and thus indirectly via `Lexer.getObj`. Since neither of those methods will ever return `undefined`, we can simply assert that that's the case when inserting data into the `Dict` and thus get rid of `in` checks when doing the data lookups. In this case, since `Dict.set` is fairly hot, the patch utilizes an inline check and when necessary a direct call to `unreachable` to not affect performance of `gulp server/test` too much (rather than always just calling `assert`). For very large and complex PDF files this will help performance slightly, since `Dict.{get, getAsync, has}` is called a lot during parsing in the worker. This patch was tested using the PDF file from issue 2618, i.e. http://bugzilla-attachments.gnome.org/attachment.cgi?id=226471, with the following manifest file: ``` [ { "id": "issue2618", "file": "../web/pdfs/issue2618.pdf", "md5": "", "rounds": 250, "type": "eq" } ] ``` which gave the following results when comparing this patch against the `master` branch: ``` -- Grouped By browser, stat -- browser \| stat \| Count \| Baseline(ms) \| Current(ms) \| +/- \| % \| Result(P<.05) ------- \| ------------ \| ----- \| ------------ \| ----------- \| --- \| ----- \| ------------- Firefox \| Overall \| 250 \| 2838 \| 2820 \| -18 \| -0.65 \| faster Firefox \| Page Request \| 250 \| 1 \| 2 \| 0 \| 11.92 \| slower Firefox \| Rendering \| 250 \| 2837 \| 2818 \| -19 \| -0.65 \| faster ```	2020-03-06 14:12:14 +01:00
Jonas Jenwald	65e514e063	Ensure that there's always a setFont (Tf) operator before text rendering operators (issue 11651) The PDF document in question is corrupt, since it contains multiple instances of incorrect operators. We obviously don't want to slow down parsing of all documents (since most are valid), just to accommodate a particular bad PDF generator, hence the reason for the inline check before calling the `ensureStateFont` method.	2020-03-03 10:05:18 +01:00
Jonas Jenwald	c55d30a715	Use the same non-embedded Wingdings fallback for fonts named "Wingdings-Regular" too (PR 5463 follow-up, issue 11451) This patch extends the existing heuristics, which are really the best that we can do in general for these kinds of non-embedded and non-standard fonts. Furthermore, this patch also tries to improve the copy-and-paste behaviour for non-embedded Wingdings fonts by also using the `ZapfDingbatsEncoding` in this case. Note: I'm not sure that adding additional tests for Wingdings fonts matters that much, given how limited our "support" for them really is.	2020-02-24 17:40:06 +01:00
Jonas Jenwald	5494f7d5bc	Add basic validation of the `scanLines` parameter in JPEG images, before delegating decoding to the browser In some cases PDF documents can contain JPEG images that the native browser decoder cannot handle, e.g. images with DNL (Define Number of Lines) markers or images where the SOF (Start of Frame) marker contains a wildly incorrect `scanLines` parameter. Currently, for "simple" JPEG images, we're relying on native image decoding to fail before falling back to the implementation in `src/core/jpg.js`. In some cases, note e.g. issue 10880, the native image decoder doesn't outright fail and thus some images may not render. In an attempt to improve the current situation, this patch adds additional validation of the JPEG image SOF data to force the use of `src/core/jpg.js` directly in cases where the native JPEG decoder cannot be trusted to do the right thing. The only way to implement this is unfortunately to parse the beginning of the JPEG image data, looking for a SOF marker. To limit the impact of this extra parsing, the result is cached on the `JpegStream` instance and this code is only run for images which passed all of the pre-existing "can the JPEG image be natively rendered and/or decoded" checks. --- Slightly off-topic: Working on this really makes me start questioning if native rendering/decoding of JPEG images is actually a good idea. There's certain kinds of JPEG images not supported natively, and all of the validation which is now necessary isn't "free". At this point, in the `NativeImageDecoder`, we're having to check for certain properties in the image dictionary, parse the `ColorSpace`, and finally read the actual image data to find the SOF marker. Furthermore, we cannot just send the image to the main-thread and be done in the "JpegStream" case, but we also need to wait for rendering to complete (or fail) before continuing with other parsing. In the "JpegDecode" case we're even having to parse part of the image on the main-thread, which seems completely at odds with the principle of doing all heavy parsing in the Worker, and there's also a couple of potentially large (temporary) allocations/copies of TypedArray data involved as well.	2020-02-22 14:16:07 +01:00
Tim van der Meij	61056a9238	Merge pull request #11551 from Snuffleupagus/issue-11549 Allow skipping of errors when reading broken/corrupt ToUnicode data (issue 11549)	2020-02-09 17:32:35 +01:00
Brendan Dahl	09a6e17d22	Merge pull request #11528 from janpe2/type1-nonemb-notdef Hide .notdef glyphs in non-embedded Type1 fonts and don't ignore Widths	2020-02-06 13:30:07 -08:00
Jonas Jenwald	4c54395ff6	Allow skipping of errors when reading broken/corrupt ToUnicode data (issue 11549) This will allow font loading/parsing to continue, rather than immediately failing, when broken/corrupt CMap data is encountered.	2020-01-30 13:19:05 +01:00
Tim van der Meij	cbbda9d883	Merge pull request #11515 from Snuffleupagus/cache-fallback-font Cache the fallback font dictionary on the `PartialEvaluator` (PR 11218 follow-up)	2020-01-25 21:32:28 +01:00
Jonas Jenwald	83bdb525a4	Fix remaining linting errors, from enabling the `prefer-const` ESLint rule globally This covers cases that the `--fix` command couldn't deal with, and in a few cases (notably `src/core/jbig2.js`) the code was changed to use block-scoped variables instead.	2020-01-25 00:20:23 +01:00
Jonas Jenwald	9e262ae7fa	Enable the ESLint `prefer-const` rule globally (PR 11450 follow-up) Please find additional details about the ESLint rule at https://eslint.org/docs/rules/prefer-const With the recent introduction of Prettier this sort of mass enabling of ESLint rules becomes a lot easier, since the code will be automatically reformatted as necessary to account for e.g. changed line lengths. Note that this patch is generated automatically, by using the ESLint `--fix` argument, and will thus require some additional clean-up (which is done separately).	2020-01-25 00:20:22 +01:00
Jani Pehkonen	809b96b40c	Hide .notdef glyphs in non-embedded Type1 fonts and don't ignore Widths Fixes #11403 The PDF uses the non-embedded Type1 font Helvetica. Character codes 194 and 160 (`Â` and `NBSP`) are encoded as `.notdef`. We shouldn't show those glyphs because it seems that Acrobat Reader doesn't draw glyphs that are named `.notdef` in fonts like this. In addition to testing `glyphName === ".notdef"`, we must test also `glyphName === ""` because the name `""` is used in `core/encodings.js` for undefined glyphs in encodings like `WinAnsiEncoding`. The solution above hides the `Â` characters but now the replacement character (space) appears to be too wide. I found out that PDF.js ignores font's `Widths` array if the font has no `FontDescriptor` entry. That happens in #11403, so the default widths of Helvetica were used as specified in `core/metrics.js` and `.nodef` got a width of 333. The correct width is 0 as specified by the `Widths` array in the PDF. Thus we must never ignore `Widths`.	2020-01-21 21:35:25 +02:00
Jonas Jenwald	9ab7c280aa	Cache the fallback font dictionary on the `PartialEvaluator` (PR 11218 follow-up) This way we'll benefit from the existing font caching, and can thus avoid re-creating a fallback font over and over again during parsing. (Thece changes necessitated the previous patch, since otherwise breakage could occur e.g. with fake workers.)	2020-01-16 15:12:05 +01:00
Jonas Jenwald	36881e3770	Ensure that all `import` and `require` statements, in the entire code-base, have a `.js` file extension In order to eventually get rid of SystemJS and start using native `import`s instead, we'll need to provide "complete" file identifiers since otherwise there'll be MIME type errors when attempting to use `import`.	2020-01-04 13:01:43 +01:00
Jonas Jenwald	a63f7ad486	Fix the linting errors, from the Prettier auto-formatting, that ESLint `--fix` couldn't handle This patch makes the follow changes: - Remove no longer necessary inline `// eslint-disable-...` comments. - Fix `// eslint-disable-...` comments that Prettier moved down, thus causing new linting errors. - Concatenate strings which now fit on just one line. - Fix comments that are now too long. - Finally, and most importantly, adjust comments that Prettier moved down, since the new positions often is confusing or outright wrong.	2019-12-26 12:35:12 +01:00
Jonas Jenwald	de36b2aaba	Enable auto-formatting of the entire code-base using Prettier (issue 11444) Note that Prettier, purposely, has only limited [configuration options](https://prettier.io/docs/en/options.html). The configuration file is based on [the one in `mozilla central`](https://searchfox.org/mozilla-central/source/.prettierrc) with just a few additions (to avoid future breakage if the defaults ever changes). Prettier is being used for a couple of reasons: - To be consistent with `mozilla-central`, where Prettier is already in use across the tree. - To ensure a consistent coding style everywhere, which is automatically enforced during linting (since Prettier is used as an ESLint plugin). This thus ends "all" formatting disussions once and for all, removing the need for review comments on most stylistic matters. Many ESLint options are now redundant, and I've tried my best to remove all the now unnecessary options (but I may have missed some). Note also that since Prettier considers the `printWidth` option as a guide, rather than a hard rule, this patch resorts to a small hack in the ESLint config to ensure that comments won't become too long. Please note: This patch is generated automatically, by appending the `--fix` argument to the ESLint call used in the `gulp lint` task. It will thus require some additional clean-up, which will be done in a separate commit. (On a more personal note, I'll readily admit that some of the changes Prettier makes are extremely ugly. However, in the name of consistency we'll probably have to live with that.)	2019-12-26 12:34:24 +01:00
Jonas Jenwald	835d8c2be5	Allow skipping of errors when parsing broken/unsupported ColorSpaces (issue 6707, issue 11287) This will allow us to attempt to recover as much as possible of a page, rather than immediately failing, when a broken/unsupported ColorSpace is encountered. This patch thus extends the framework added in PRs such as e.g. 8240 and 8922, to also cover parsing of ColorSpaces.	2019-11-01 09:01:24 +01:00
Jonas Jenwald	0496ea61f5	Ensure that `PartialEvaluator.hasBlendModes` handles Blend Modes in Arrays (PR 11281 follow-up) I completely overlooked this in PR 11281, but you obviously need to make similar changes in `PartialEvaluator.hasBlendModes` since it will otherwise ignore valid Blend Modes.	2019-10-28 11:37:05 +01:00
Jonas Jenwald	5c266f0e8c	Support Blend Modes which are specified in an Array of Names (issue 11279) According to the specification, the first supported Blend Mode should be choosen in this case; please see https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G10.4848607	2019-10-26 14:24:31 +02:00
Tim van der Meij	ca3a58f93a	Consistently use `@returns` for returned data types in JSDoc comments Sometimes we also used `@return`, but `@returns` is what the JSDoc documentation recommends. Even though `@return` works as an alias, it's good to use the recommended syntax and to be consistent within the project.	2019-10-13 13:58:17 +02:00
Jonas Jenwald	bfcbf2d78d	Cache processed 'ExtGState's in `PartialEvaluator.hasBlendModes` to avoid unnecessary parsing/lookups This simply extends the already existing caching of processed resources to avoid duplicated parsing of 'ExtGState's, which should help with badly generated PDF documents. This patch was tested using the PDF file from issue 6961, i.e. https://github.com/mozilla/pdf.js/files/121712/test.pdf, with the following manifest file: ``` [ { "id": "issue6961", "file": "../web/pdfs/issue6961.pdf", "md5": "", "rounds": 200, "type": "eq" } ] ``` which gave the following overall results when comparing this patch against the `master` branch: ``` -- Grouped By browser, stat -- browser \| stat \| Count \| Baseline(ms) \| Current(ms) \| +/- \| % \| Result(P<.05) ------- \| ------------ \| ----- \| ------------ \| ----------- \| --- \| ----- \| ------------- Firefox \| Overall \| 400 \| 1063 \| 1051 \| -12 \| -1.17 \| faster Firefox \| Page Request \| 400 \| 552 \| 543 \| -9 \| -1.69 \| faster Firefox \| Rendering \| 400 \| 511 \| 508 \| -3 \| -0.61 \| ``` and the following page-specific results: ``` -- Grouped By page, stat -- page \| stat \| Count \| Baseline(ms) \| Current(ms) \| +/- \| % \| Result(P<.05) ---- \| ------------ \| ----- \| ------------ \| ----------- \| --- \| ----- \| ------------- 0 \| Overall \| 200 \| 1122 \| 1110 \| -12 \| -1.03 \| 0 \| Page Request \| 200 \| 552 \| 544 \| -8 \| -1.48 \| faster 0 \| Rendering \| 200 \| 570 \| 566 \| -4 \| -0.62 \| 1 \| Overall \| 200 \| 1005 \| 992 \| -13 \| -1.33 \| faster 1 \| Page Request \| 200 \| 552 \| 542 \| -11 \| -1.91 \| faster 1 \| Rendering \| 200 \| 452 \| 450 \| -3 \| -0.61 \| ```	2019-10-12 12:35:42 +02:00
Jonas Jenwald	af71f9b40a	Inline all the possible type checks in `PartialEvaluator.hasBlendModes` to avoid unnecessary function calls For badly generated PDF documents, with issue 6961 being one example, there's well over one hundred thousand function calls being made in total for just the two pages.	2019-10-12 11:24:37 +02:00
huzjakd	94171d9d72	Attempt to fallback to a default font, for non-available ones, in `PartialEvaluator.loadFont` This handles the two different ways that fonts can be loaded, either by Name (which is the common case) or by Reference. Furthermore, this also takes the `ignoreErrors` option into account when deciding whether to fallback or Error. Finally, by creating a minimal but valid Font dictionary, there's no special-cases necessary in any of the font parsing code. Co-authored-by: huzjakd <huzjakd@gmail.com> Co-Authored-By: Jonas Jenwald <jonas.jenwald@gmail.com>	2019-10-10 16:49:46 +02:00
Jonas Jenwald	f5be2d62a3	Improve the heuristics, in `PartialEvaluator._buildSimpleFontToUnicode`, for glyphNames of the Cdd{d}/cdd{d} format (issue 9655) Please note: I've been thinking about possible ways of addressing this issue for a while now, but all of the solutions I came up with became too complicated and thus hurt readability of the code. However, it occured to me that we're essentially trying to add a heuristic on top of another heuristic, and that it shouldn't matter how efficient the code is as long as it works. In the PDF file in the issue the Encoding contains glyphNames of the `Cdd` format, which our existing heuristics will treat as base 10 values. However, in this particular file they actually contain base 16 values, which we thus attempt to detect and fix such that text-selection works.	2019-10-06 10:47:29 +02:00
Jonas Jenwald	f11a4ba750	Transfer, rather than copy, CMap data to the worker-thread It recently occurred to me that the CMap data should be an excellent candidate for transfering. This will help reduce peak memory usage for PDF documents using CMaps, since transfering of data avoids duplicating it on both the main- and worker-threads. Unfortunately it's not possible to actually transfer data when returning data through `sendWithPromise`, and another solution had to be used. Initially I looked at using one message for requesting the data, and another message for returning the actual CMap data. While that should have worked, it would have meant adding a lot more complexity particularly on the worker-thread. Hence the simplest solution, at least in my opinion, is to utilize `sendWithStream` since that makes it really easy to transfer the CMap data. (This required PR 11115 to land first, since otherwise CMap fetch errors won't propagate correctly to the worker-thread.) Please note that the patch purposely only changes the API to Worker communication, and not the API itself since changing the interface of `CMapReaderFactory` would be a breaking change. Furthermore, given the relatively small size of the `.bcmap` files (the largest one is smaller than the default range-request size) streaming doesn't really seem necessary either.	2019-09-04 11:46:04 +02:00
Yury Delendik	66e0dd1b06	Use streams for OperatorList chunking (issue 10023) Please note: The majority of this patch was written by Yury, and it's simply been rebased and slightly extended to prevent issues when dealing with `RenderingCancelledException`. By leveraging streams this (finally) provides a simple way in which parsing can be aborted on the worker-thread, which will ultimately help save resources. With this patch worker-thread parsing will only be aborted when the document is destroyed, and not when rendering is cancelled. There's a couple of reasons for this: - The API currently expects the entire OperatorList to be extracted, or an Error to occur, once it's been started. Hence additional re-factoring/re-writing of the API code will be necessary to properly support cancelling and re-starting of OperatorList parsing in cases where the `lastChunk` hasn't yet been seen. - Even with the above addressed, immediately cancelling when encountering a `RenderingCancelledException` will lead to worse performance in e.g. the default viewer. When zooming and/or rotation of the document occurs it's very likely that `cancel` will be (almost) immediately followed by a new `render` call. In that case you'd obviously not want to abort parsing on the worker-thread, since then you'd risk throwing away a partially parsed Page and thus be forced to re-parse it again which will regress perceived performance. - This patch is already somewhat risky, given that it touches fundamentally important/critical code, and trying to keep it somewhat small should hopefully reduce the risk of regressions (and simplify reviewing as well). Time permitting, once this has landed and been in Nightly for awhile, I'll try to work on the remaining points outlined above. Co-Authored-By: Yury Delendik <ydelendik@mozilla.com> Co-Authored-By: Jonas Jenwald <jonas.jenwald@gmail.com>	2019-08-24 15:56:40 +02:00
Jonas Jenwald	5ac9c7c384	Support corrupt PDF files with invalid/non-existent Group /CS entries (issue 11045) The PDF file in question tries to reference a non-existent ColorSpace, which should be quite rare in practice.	2019-08-06 14:33:05 +02:00
Jonas Jenwald	38ccb43436	Reduce the number of function calls in `EvaluatorPreprocessor.read` For very large and complex PDF files this will help performance slightly, since `EvaluatorPreprocessor.read` is called a lot during parsing in the worker. This patch was tested using the PDF file from issue 2618, i.e. http://bugzilla-attachments.gnome.org/attachment.cgi?id=226471, using the following manifest file: ``` [ { "id": "issue2618", "file": "../web/pdfs/issue2618.pdf", "md5": "", "rounds": 200, "type": "eq" } ] ``` This gave the following results when comparing this patch against the `master` branch: ``` -- Grouped By browser, stat -- browser \| stat \| Count \| Baseline(ms) \| Current(ms) \| +/- \| % \| Result(P<.05) ------- \| ------------ \| ----- \| ------------ \| ----------- \| --- \| ----- \| ------------- Firefox \| Overall \| 200 \| 3402 \| 3358 \| -43 \| -1.28 \| faster Firefox \| Page Request \| 200 \| 1 \| 1 \| 0 \| 26.71 \| Firefox \| Rendering \| 200 \| 3401 \| 3357 \| -44 \| -1.28 \| faster ```	2019-07-29 08:43:36 +02:00
Jonas Jenwald	f710eb56e4	Change the signature of the `Parser` constructor to take a parameter object A lot of the `new Parser()` call-sites look quite unwieldy/ugly as-is, with a bunch of somewhat randomly ordered arguments, which we can avoid by changing the constructor to accept an object instead. As an added bonus, this provides better documentation without having to add inline argument comments in the code.	2019-06-23 16:01:45 +02:00
Tim van der Meij	c8c937c257	Merge pull request #10794 from janpe2/cidtogidmap-zero Fix glyph at index zero in CIDFontType2 that has a CIDToGIDMap stream	2019-05-15 00:04:39 +02:00
Jonas Jenwald	173fbef05b	Enable the `consistent-return` ESLint rule This rule is already enabled in mozilla-central, and helps ensure more consistent functions/methods, see https://searchfox.org/mozilla-central/rev/b9da45f63cb567244933c77b2c7e827a057d3f9b/tools/lint/eslint/eslint-plugin-mozilla/lib/configs/recommended.js#119-120 Please see https://eslint.org/docs/rules/consistent-return for additional information.	2019-05-11 14:27:21 +02:00
Jani Pehkonen	05c527f035	Fix glyph 0 in CIDFontType2 that has a CIDToGIDMap stream	2019-05-07 18:44:37 +03:00
Jonas Jenwald	007fab6ab5	Change `PartialEvaluator.handleColorN` to throw when no valid pattern is found Currently `handleColorN` will fallback to add a completely unparsed/unvalidated operator when no valid pattern was found. This is unfortunate, since it could very easily lead to a couple of different errors: - `DataCloneError`s when attempting to send the data to the main-thread, e.g. when `args` is `Dict`/`Stream`. - Errors in `getShadingPatternFromIR` on the main-thread, unless `args` just happens to have the expected format. - Errors when actually attempting to render the pattern on the main-thread, since the `args` will most likely not have the expected format. Hence it probably makes sense to error in `PartialEvaluator.handleColorN`, and having invalid patterns fail gracefully via the existing `ignoreErrors` code-paths instead.	2019-05-04 12:53:18 +02:00
Jonas Jenwald	5335285cda	Attempt to handle corrupt PDF documents that contains path operators inside of text object (issue 10542) First of all, while this simple approach appears to work OK in practice I'm not sure if it's the best way of addressing the problem (assuming that you even want to). Second of all, while the solution implemented here only requires tracking/checking one new boolean in order for this to work, I'm nonetheless not entirely happy about this since it will add additional overhead (albeit very small) to the parsing of path operators in PDF documents just for a handful of corrupt ones.	2019-04-30 23:35:33 +02:00
Jonas Jenwald	34952b732e	Add a `getDocId` method to the `idFactory`, in `Page` instances, to avoid passing around `PDFManager` instances unnecessarily (PR 7941 follow-up) This way we can avoid manually building a "document id" in multiple places in `evaluator.js`, and it also let's us avoid passing in an otherwise unnecessary `PDFManager` instance when creating a `PartialEvaluator`.	2019-04-20 13:11:17 +02:00
Jonas Jenwald	be604bd195	Support (rare) Type3 fonts which contains image resources (issue 10717) The Type3 font type is not commonly used in PDF documents, as can be seen from telemetry data such as: https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2019-04-09&include_spill=0&keys=__none__!__none__!__none__&max_channel_version=nightly%252F68&measure=PDF_VIEWER_FONT_TYPES&min_channel_version=nightly%252F57&processType=&product=Firefox&sanitize=1&sort_by_value=0&sort_keys=submissions&start_date=2019-03-18&table=0&trim=1&use_submission_date=0 (see also https://github.com/mozilla/pdf.js/wiki/Enumeration-Assignments-for-the-Telemetry-Histograms#pdf_viewer_font_types). Type3 fonts containing image resources are very* rare in practice, usually they only contain path rendering operators, but as the issue shows they unfortunately do exist. Currently these Type3-related image resources are not handled in any special way, and given that fonts are document rather than page specific rendering breaks since the image resources are thus not available to the entire document. Fortunately fixing this isn't too difficult, but it does require adding a couple of Type3-specific code-paths to the `PartialEvaluator`. In order to keep the implementation simple, particularily on the main-thread, these Type3 image resources are completely decoded on the worker-thread to avoid adding too many special cases. This should not cause any issues, only marginally less efficient code, but given how rare this kind of Type3 font is adding premature optimizations didn't seem at all warranted at this point.	2019-04-13 18:27:50 +02:00
Jonas Jenwald	9077abc263	Take the `FirstChar`/`LastChar` properties into account when computing the hash in `PartialEvaluator.preEvaluateFont` (issue 10665) Without this some fonts may incorrectly end up with matching `hash`es, thus breaking rendering since we'll not actually try to load/parse some of the fonts.	2019-03-27 16:27:10 +01:00
Jonas Jenwald	a2a824ed01	Don't accidentally use an empty `hash` value when comparing `preEvaluatedFonts` in `PartialEvaluator.loadFont` Note that `PartialEvaluator.preEvaluateFont` will return an empty string when no hash was computed. This will complete short-circuit the `fontAlias` comparison in `PartialEvaluator.loadFont`, since fonts which are totally different will then match if their `hash`es are empty.	2019-03-27 00:54:39 +01:00
Jonas Jenwald	7273795eb6	Actually transfer eligible ImageMask data, rather than always copying it By transfering `ArrayBuffer`s you can avoid having two copies of the same data, i.e. one copy on each of the worker/main-thread, for data that's used only once on the worker-thread. Note how the code in [`PDFImage.createMask`](`80135378ca/src/core/image.js (L284-L285)`) goes to great lengths to actually enable tranfering of the image data. However in [`PartialEvaluator.buildPaintImageXObject`](`80135378ca/src/core/evaluator.js (L336)`) the `cached` property is always set to `true`, which disqualifies the image data from being transfered; see [`getTransfers`](`80135378ca/src/core/operator_list.js (L552-L554)`). For most ImageMask data this patch won't matter, since images found in the `/Resources -> /XObject` dictionary will always be indexed by name. However for inline images which contains ImageMask data, where only "small" images are cached (in both `parser.js` and `evaluator.js`), the current code will result in some unnecessary memory usage.	2019-03-16 13:06:32 +01:00
Jonas Jenwald	88f9e633dd	Try to improve text-selection for Type3 fonts that utilize a non-default /FontMatrix (bug 1513120) For Type3 fonts text-selection is often not that great, and there's a couple of heuristics used to try and improve things. This patch simple extends those heuristics a bit, and fixes a pre-existing "naive" array comparison, but this all feels a bit brittle to say the least. The existing Type3 test-coverage isn't that great in general, and in particular Type3 `text` tests are few and far between, hence why this patch adds two different new `text` tests.	2019-03-12 10:32:08 +01:00
Jonas Jenwald	2665502055	Move `NativeImageDecoder` into a separate file, and convert it to a `class` Given the size of the `src/core/evaluator.js` file, it cannot hurt to move some of its (image related) helper functionality into a separate file.	2019-03-09 15:59:04 +01:00
Jonas Jenwald	db5dc14158	Move worker-thread only functions from `src/shared/util.js` and into a new `src/core/core_utils.js` file The `src/shared/util.js` file is being bundled into both the `pdf.js` and `pdf.worker.js` files, meaning that its code is by definition duplicated. Some main-thread only utility functions have already been moved to a separate `src/display/display_utils.js` file, and this patch simply extends that concept to utility functions which are used only on the worker-thread. Note in particular the `getInheritableProperty` function, which expects a `Dict` as input and thus cannot possibly ever be used on the main-thread.	2019-02-24 00:35:39 +01:00
Jonas Jenwald	b6d090cc14	Fallback to the built-in font renderer when font loading fails After PR 9340 all glyphs are now re-mapped to a Private Use Area (PUA) which means that if a font fails to load, for whatever reason[1], all glyphs in the font will now render as Unicode glyph outlines. This obviously doesn't look good, to say the least, and might be seen as a "regression" since previously many glyphs were left in their original positions which provided a slightly better fallback[2]. Hence this patch, which implements a general fallback to the PDF.js built-in font renderer for fonts that fail to load (i.e. are rejected by the sanitizer). One caveat here is that this only works for the Font Loading API, since it's easy to handle errors in that case[3]. The solution implemented in this patch does not in any way delay the loading of valid fonts, which was the problem with my previous attempt at a solution, and will only require a bit of extra work/waiting for those fonts that actually fail to load. Please note: This patch doesn't fix any of the underlying PDF.js font conversion bugs that's responsible for creating corrupt font files, however it does improve rendering in a number of cases; refer to this possibly incomplete list: [Bug 1524888](https://bugzilla.mozilla.org/show_bug.cgi?id=1524888) Issue 10175 Issue 10232 --- [1] Usually because the PDF.js font conversion code wasn't able to parse the font file correctly. [2] Glyphs fell back to some default font, which while not accurate was more useful than the current state. [3] Furthermore I'm not sure how to implement this generally, assuming that's even possible, and don't really have time/interest to look into it either.	2019-02-11 10:27:08 +01:00
Tsukasa OI	96ba6afd47	Fix copying on supplementary plane characters pdf.js had a problem when copying characters on supplementary planes (0xPPXXXX where PP is nonzero). This is because certain methods of PartialEvaluator use classic String.fromCharCode instead of ES6's String.fromCodePoint. Despite the fact that readToUnicode method tried to parse out-of-UCS2 code points by parsing UTF-16BE, it was inadequate because String.fromCharCode only supports UCS-2 range of Unicode.	2019-02-10 18:14:53 +09:00
Jonas Jenwald	6f94a05a29	Do the final text scaling correctly in `flushTextContentItem` (issue 8276) It's necessary to take into account whether or not the text is vertical, to avoid either the textContent `width` or `height` becoming incorrect.	2019-01-29 15:24:04 +01:00
Jani Pehkonen	26121177ab	Implement Decode entry in Indexed images	2019-01-22 22:51:04 +02:00
Jonas Jenwald	24a688d6c6	Convert some usage of `indexOf` to `startsWith`/`includes` where applicable In many cases in the code you don't actually care about the index itself, but rather just want to know if something exists in a String/Array or if a String starts in a particular way. With modern JavaScript functionality, it's thus possible to remove a number of existing `indexOf` cases.	2019-01-18 17:57:41 +01:00
Jani Pehkonen	9cd5f94f03	Normalize the BBox of form XObjects on the /core side	2018-10-22 14:17:05 +03:00
Jonas Jenwald	842e9206c0	Replace `String.prototype.substr()` occurrences with `String.prototype.substring()` As outlined in https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/substr, which refers to the ECMA-262 specification, using the `substr` function is advised against. Hence this PR, which replaces all remaining `substr` occurrences with `substring` instead. Please refer to https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/substr#Syntax respectively https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/substring#Syntax for the differences between the two functions. Note that in most cases in the code-base there's only one argument passed to `substr`, and those require no other changes except replacing "substr" with "substring". For the other cases, the `substr(start, length)` calls are changed to `substring(start, start + length)` instead.	2018-09-28 11:41:07 +02:00
Jonas Jenwald	e5a6d892b4	Revert "Attempt to combine separate beginText/endText sequences in `getTextContent` (issue 9984)"	2018-09-05 18:01:33 +02:00
Tim van der Meij	c94df0fef3	Merge pull request #9986 from Snuffleupagus/issue-9984 Attempt to combine separate beginText/endText sequences in `getTextContent` (issue 9984)	2018-09-01 21:21:29 +02:00
Jonas Jenwald	099ed08852	Add support for `async`/`await` using Babel For proof-of-concept, this patch converts a couple of `Promise` returning methods to use `async` instead. Please note that the `generic` build, based on this patch, has been successfully testing in IE11 (i.e. the viewer loads and nothing is obviously broken). Being able to use modern JavaScript features like `async`/`await` is a huge plus, but there's one (obvious) side-effect: The size of the built files will increase slightly (unless `SKIP_BABEL == true`). That's unavoidable, but seems like a small price to pay in the grand scheme of things. Finally, note that the `chromium` build target was changed to no longer skip Babel translation, since the Chrome extension still supports version `49` of the browser (where native `async` support isn't available).	2018-08-19 16:54:11 +02:00
Jonas Jenwald	497b765ede	Attempt to combine separate beginText/endText sequences in `getTextContent` (issue 9984) Please note that while this improves issue 9984 slightly (and likely others too), it's not a complete solution. The remaining issues are related to the, more general, problems with the existing heuristics related to attempting to combine separate text items.	2018-08-18 13:45:32 +02:00
Jonas Jenwald	cfdb597e4a	Ensure that the `CIDSystemInfo` strings, in Type0 fonts, are correctly decoded This isn't directly related to the subsequent patch, but just something that I happened to notice while poking around in the font code.	2018-07-29 23:06:15 +02:00
Jonas Jenwald	81b471c781	[Regression] Convert `Catalog.builtInCMapCache` into a `Map`, instead of an Object, to ensure that it's correctly reset (PR 8064 follow-up) With the `builtInCMapCache` being a simple Object, it unfortunately means that the `Catalog.cleanup` method isn't resetting it as intended. By just replacing the `builtInCMapCache` with an empty Object, existing references to it will not actually be updated. The result is that e.g. `Page` instances still keeps references to, what should have been removed, CMap data. To fix these problems, the `builtInCMapCache` is converted into a `Map` instead (since it can be easily reset).	2018-07-28 22:20:43 +02:00
Jonas Jenwald	7f21e38787	Error, rather than warn, once a number of invalid path operators are encountered in `EvaluatorPreprocessor.read` (bug 1443140) Incomplete path operators, in particular, can result in fairly chaotic rendering artifacts, as can be observed on page four of the referenced PDF file. The initial (naive) solution that was attempted, was to simply throw a `FormatError` as soon as any invalid (i.e. too short) operator was found and rely on the existing `ignoreErrors` code-paths. However, doing so would have caused regressions in some files; see the existing `issue2391-1` test-case, which was promoted to an `eq` test to help prevent future bugs. Hence this patch, which adds special handling for invalid path operators since those may cause quite bad rendering artifacts. You could, in all fairness, argue that the patch is a handwavy solution and I wouldn't object. However, given that this only concerns corrupt PDF files, the way that PDF viewers (PDF.js included) try to gracefully deal with those could probably be described as a best-effort solution anyway. This patch also adjusts the existing `warn`/`info` messages to print the command name according to the PDF specification, rather than an internal PDF.js enumeration value. The former should be much more useful for debugging purposes. Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1443140.	2018-06-24 16:05:08 +02:00
Jonas Jenwald	f01e54eae1	Improve the warning messages printed by `PartialEvaluator.{getOperatorList, getTextContent} when errors are being ignored Currently the actual errors aren't printed, which can make debugging harder than necessary.	2018-06-12 11:01:32 +02:00
Jonas Jenwald	731f2e6dfc	Remove manual clamping/rounding from `ColorSpace` and `PDFImage`, by having their methods use `Uint8ClampedArray`s The built-in image decoders are already using `Uint8ClampedArray` when returning data, and this patch simply extends that to the rest of the image/colorspace code. As far as I can tell, the only reason for using manual clamping/rounding in the first place was because TypedArrays used to be polyfilled (using regular arrays). And trying to polyfill the native clamping/rounding would probably have been had too much overhead, but given that TypedArray support is required in PDF.js version `2.0` that's no longer a concern. Please note: Because of different rounding behaviour, basically `Math.round` in `Uint8ClampedArray` respectively `Math.floor` in the old code, there will be very slight movement in quite a few existing test-cases. However, the changes should be imperceivable to the naked eye, given that the absolute difference is at most `1` for each RGB component when comparing `master` and this patch (see also the updated expectation values in the unit-tests).	2018-06-12 11:01:32 +02:00
Jonas Jenwald	80441346a3	Fallback to the built-in JPEG decoder if 'JpegStream', in `src/display/api.js`, fails to load the image This works by making `PartialEvaluator.buildPaintImageXObject` wait for the success/failure of `loadJpegStream` on the API side before parsing continues. Please note that in practice, it should be quite rare for the browser to fail loading/decoding of a JPEG image. In the general case, it should thus not be completely surprising if even `src/core/jpg.js` will fail to decode the image.	2018-02-05 21:05:31 +01:00
Jonas Jenwald	76afe1018b	Fallback to built-in image decoding if the `NativeImageDecoder` fails In particular this means that if 'JpegDecode', in `src/display/api.js`, fails we'll fallback to the built-in JPEG decoder.	2018-02-05 17:01:35 +01:00
Jonas Jenwald	7f73fc9ace	Re-factor `PartialEvaluator.buildPaintImageXObject` to make it asynchronous This is necessary for upcoming changes, which will add fallback code-paths to allow graceful handling of native image decoding failures.	2018-02-05 17:01:35 +01:00
Jonas Jenwald	ec85d5c625	Change the signature of `PartialEvaluator.buildPaintImageXObject` to take a parameter object This method currently requires a fair number of parameters, which creates quite unwieldy call-sites. When invoking `buildPaintImageXObject`, you have to remember not only which arguments to supply, but also the correct order, to prevent run-time errors.	2018-02-05 17:01:35 +01:00
Jonas Jenwald	d0c8992e8a	Attempt to actually resolve ColourSpace names in accordance with the specification (issue 9285) Please refer to the PDF specification, in particular http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G7.3801570 > A colour space shall be specified in one of two ways: > - Within a content stream, the CS or cs operator establishes the current colour space parameter in the graphics state. The operand shall always be name object, which either identifies one of the colour spaces that need no additional parameters (DeviceGray, DeviceRGB, DeviceCMYK, or some cases of Pattern) or shall be used as a key in the ColorSpace subdictionary of the current resource dictionary (see 7.8.3, "Resource Dictionaries"). In the latter case, the value of the dictionary entry in turn shall be a colour space array or name. A colour space array shall never be inline within a content stream. > > - Outside a content stream, certain objects, such as image XObjects, shall specify a colour space as an explicit parameter, often associated with the key ColorSpace. In this case, the colour space array or name shall always be defined directly as a PDF object, not by an entry in the ColorSpace resource subdictionary. This convention also applies when colour spaces are defined in terms of other colour spaces.	2018-01-10 20:20:43 +01:00
Jonas Jenwald	2db75a2a3a	Update the ESLint dependencies, and also tweak the `no-multiple-empty-lines` rules Since multiple empty lines is virtually unused in the code-base, and the few cases that do exist look like "typos", let's enforce greater consistency here; please see https://eslint.org/docs/rules/no-multiple-empty-lines.	2018-01-03 13:32:57 +01:00
Jonas Jenwald	f3c50fe2f9	Merge pull request #9192 from Snuffleupagus/issue-8229 Build a fallback `ToUnicode` map for simple fonts (issue 8229)	2017-11-30 10:27:32 +01:00
Jani Pehkonen	06d083b04b	Fix pattern-filled text	2017-11-28 19:40:22 +02:00
Jonas Jenwald	61e19bee43	Build a fallback `ToUnicode` map for simple fonts (issue 8229) In some fonts, the included `ToUnicode` data is incomplete causing text-selection to not work properly. For simple fonts that contain encoding data, we can manually build a `ToUnicode` map to attempt to improve things. Please note that since we're currently using the `ToUnicode` data during glyph mapping, in an attempt to avoid rendering regressions, I purposely didn't want to amend to original `ToUnicode` data for this text-selection edge-case. Instead, I opted for the current solution, which will (hopefully) give slightly better text-extraction results in PDF file with incomplete `ToUnicode` data. According to the PDF specification, see [section 9.10.2](http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1873172): > A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. > ... Reading that paragraph literally, it doesn't seem too unreasonable to use different methods for different charcodes. Fixes 8229.	2017-11-26 14:45:15 +01:00
Jonas Jenwald	ffbfc3c2a7	Refactor the building of `ToUnicode` maps for simple fonts a helper method	2017-11-26 13:30:29 +01:00
Tim van der Meij	ae07adf143	Merge pull request #9073 from Snuffleupagus/image-streams-fixes Fix the interface of `JpegStream`/`JpxStream`/`Jbig2Stream` to agree with the other `DecodeStream`s	2017-11-17 23:26:36 +01:00
Jonas Jenwald	36593d6bbc	Move `JpegStream` and `JpxStream` to their own files	2017-11-11 11:22:16 +01:00
Yury Delendik	85f544f55a	Moves OperatorList and QueueOptimizer into separate file.	2017-10-30 13:29:58 -05:00
Jonas Jenwald	b1472cddbb	Allow `getOperatorList`/`getTextContent` to skip errors when parsing broken XObjects (issue 8702, issue 8704) This patch makes use of the existing `ignoreErrors` property in `src/core/evaluator.js`, see PRs 8240 and 8441, thus allowing us to attempt to recovery as much as possible of a page even when it contains broken XObjects. Fixes 8702. Fixes 8704.	2017-09-29 17:14:21 +02:00
Jonas Jenwald	b8ec518a1e	Split the existing `PDFFunction` in two classes, a private `PDFFunction` and a public `PDFFunctionFactory, and utilize the latter in` PDFDocument `to allow various code to access the methods of` PDFFunction` Follow-up to PR 8909. This requires us to pass around `pdfFunctionFactory` to quite a lot of existing code, however I don't see another way of handling this while still guaranteeing that we can access `PDFFunction` as freely as in the old code. Please note that the patch passes all tests locally (unit, font, reference), and I very much hope that we have sufficient test-coverage for the code in question to catch any typos/mistakes in the re-factoring.	2017-09-29 15:30:53 +02:00
Jonas Jenwald	5c961c76bb	Remove the unused `inline` parameter from various methods/functions in `PDFImage`, and change a couple of methods to use Objects rather than plain parameters The `inline` parameter is passed to a number of methods/functions in `PDFImage`, despite not actually being used. Its value is never checked, nor is it ever assigned to the current `PDFImage` instance (i.e. no `this.inline = inline` exists). Looking briefly at the history of this code, I was also unable to find a point in time where `inline` was being used. As far as I'm concerned, `inline` does nothing more than add clutter to already very unwieldy method/function signatures, hence why I'm proposing that we just remove it. To further simplify call-sites using `PDFImage`/`NativeImageDecoder`, a number of methods/functions are changed to take Objects rather than a bunch of (somewhat) randomly ordered parameters.	2017-09-29 15:30:40 +02:00
Brendan Dahl	10ba292b46	Use font's default width even when 0. Bug 1392647 has a PDF where the default width of the font is 0. It draws some charcodes that don't have glyphs, but we were wrongly using the 1000 default width for these charcodes causing some text to be overlapping.	2017-09-20 11:38:30 -07:00
Jonas Jenwald	dc926ffc0f	Check `isEvalSupported`, and test that `eval` is actually supported, before attempting to use the `PostScriptCompiler` (issue 5573) Currently `PDFFunction` is implemented (basically) like a class with only `static` methods. Since it's used directly in a number of different `src/core/` files, attempting to pass in `isEvalSupported` would result in code that's very messy, not to mention difficult to maintain (since every single `PDFFunction` method call would need to include a `isEvalSupported` argument). Rather than having to wait for a possible re-factoring of `PDFFunction` that would avoid the above problems by design, it probably makes sense to at least set `isEvalSupported` globally for `PDFFunction`. Please note that there's one caveat with this solution: If `PDFJS.getDocument` is used to open multiple files simultaneously, with different `PDFJS.isEvalSupported` values set before each call, then the last one will always win. However, that seems like enough of an edge-case that we shouldn't have to worry about it. Besides, since we'll also test that `eval` is actually supported, it should be fine. Fixes 5573.	2017-09-15 12:02:45 +02:00
Jonas Jenwald	cfb4955a92	Replace the `isArray` helper function with the native `Array.isArray` function Follow-up to PR 8813.	2017-09-01 20:27:13 +02:00
Jonas Jenwald	093afd1212	Replace the `coded` property with `isType3Font` when building the font `properties` object in `PartialEvaluator.translateFont` This appears to simply have been forgotten in the re-factoring in PR 4815, where the `coded` property was renamed to the much more descriptive `isType3Font` property.	2017-08-08 14:03:02 +02:00
Jonas Jenwald	4729e96fb7	Remove leftover `args[0].code` checks from the `OPS.paintXObject` cases in evaluator.js From looking at blame, it seems that these checks became obsolete with PR 692 (which landed close to six years ago). Note how, after that PR, there's no longer anything being assigned to the `code` property of an Object.	2017-08-07 10:48:37 +02:00
Yury Delendik	a1dfbec532	Properly cancel streams and guard at getTextContent.	2017-08-03 16:36:46 -05:00
Jonas Jenwald	814fa1dee3	Remove most `assert()` calls (issue 8506) This replaces `assert` calls with `throw new FormatError()`/`throw new Error()`. In a few places, throwing an `Error` (which is what `assert` meant) isn't correct since the enclosing function is supposed to return a `Promise`, hence some cases were changed to `Promise.reject(...)` and similarily for `createPromiseCapability` instances.	2017-07-21 18:51:02 +02:00
Yury Delendik	d028c26210	Removes error()	2017-07-07 09:40:24 -05:00
Mukul Mishra	0c13d0ff46	Adds Streams API in getTextContent to stream data. This patch adds Streams API support in getTextContent so that we can stream data in chunks instead of fetching whole data from worker thread to main thread. This patch supports Streams API without changing the core functionality of getTextContent. Enqueue textContent directly at getTextContent in partialEvaluator. Adds desiredSize and ready property in streamSink.	2017-06-17 20:03:27 +05:30
Jonas Jenwald	e589834f13	Ensure that `TilingPattern`s have valid (non-zero) /BBox arrays (issue 8330) Fixes 8330.	2017-06-09 21:41:48 +02:00
Jonas Jenwald	a8c87f8019	Fix inconsistent spacing and trailing commas in objects in `src/core/` files, so we can enable the `comma-dangle` and `object-curly-spacing` ESLint rules later on Unfortunately this patch is fairly big, even though it only covers the `src/core` folder, but splitting it even further seemed difficult. http://eslint.org/docs/rules/comma-dangle http://eslint.org/docs/rules/object-curly-spacing Given that we currently have quite inconsistent object formatting, fixing this in one big patch probably wouldn't be feasible (since I cannot imagine anyone wanting to review that); hence I've opted to try and do this piecewise instead. Please note: This patch was created automatically, using the ESLint --fix command line option. In a couple of places this caused lines to become too long, and I've fixed those manually; please refer to the interdiff below for the only hand-edits in this patch. ```diff diff --git a/src/core/evaluator.js b/src/core/evaluator.js index abab9027..dcd3594b 100644 --- a/src/core/evaluator.js +++ b/src/core/evaluator.js @@ -2785,7 +2785,8 @@ var EvaluatorPreprocessor = (function EvaluatorPreprocessorClosure() { t['Tz'] = { id: OPS.setHScale, numArgs: 1, variableArgs: false, }; t['TL'] = { id: OPS.setLeading, numArgs: 1, variableArgs: false, }; t['Tf'] = { id: OPS.setFont, numArgs: 2, variableArgs: false, }; - t['Tr'] = { id: OPS.setTextRenderingMode, numArgs: 1, variableArgs: false, }; + t['Tr'] = { id: OPS.setTextRenderingMode, numArgs: 1, + variableArgs: false, }; t['Ts'] = { id: OPS.setTextRise, numArgs: 1, variableArgs: false, }; t['Td'] = { id: OPS.moveText, numArgs: 2, variableArgs: false, }; t['TD'] = { id: OPS.setLeadingMoveText, numArgs: 2, variableArgs: false, }; diff --git a/src/core/jbig2.js b/src/core/jbig2.js index 5a17d482..71671541 100644 --- a/src/core/jbig2.js +++ b/src/core/jbig2.js @@ -123,19 +123,22 @@ var Jbig2Image = (function Jbig2ImageClosure() { { x: -1, y: -1, }, { x: 0, y: -1, }, { x: 1, y: -1, }, { x: -2, y: 0, }, { x: -1, y: 0, }], [{ x: -3, y: -1, }, { x: -2, y: -1, }, { x: -1, y: -1, }, { x: 0, y: -1, }, - { x: 1, y: -1, }, { x: -4, y: 0, }, { x: -3, y: 0, }, { x: -2, y: 0, }, { x: -1, y: 0, }] + { x: 1, y: -1, }, { x: -4, y: 0, }, { x: -3, y: 0, }, { x: -2, y: 0, }, + { x: -1, y: 0, }] ]; var RefinementTemplates = [ { coding: [{ x: 0, y: -1, }, { x: 1, y: -1, }, { x: -1, y: 0, }], - reference: [{ x: 0, y: -1, }, { x: 1, y: -1, }, { x: -1, y: 0, }, { x: 0, y: 0, }, - { x: 1, y: 0, }, { x: -1, y: 1, }, { x: 0, y: 1, }, { x: 1, y: 1, }], + reference: [{ x: 0, y: -1, }, { x: 1, y: -1, }, { x: -1, y: 0, }, + { x: 0, y: 0, }, { x: 1, y: 0, }, { x: -1, y: 1, }, + { x: 0, y: 1, }, { x: 1, y: 1, }], }, { - coding: [{ x: -1, y: -1, }, { x: 0, y: -1, }, { x: 1, y: -1, }, { x: -1, y: 0, }], - reference: [{ x: 0, y: -1, }, { x: -1, y: 0, }, { x: 0, y: 0, }, { x: 1, y: 0, }, - { x: 0, y: 1, }, { x: 1, y: 1, }], + coding: [{ x: -1, y: -1, }, { x: 0, y: -1, }, { x: 1, y: -1, }, + { x: -1, y: 0, }], + reference: [{ x: 0, y: -1, }, { x: -1, y: 0, }, { x: 0, y: 0, }, + { x: 1, y: 0, }, { x: 0, y: 1, }, { x: 1, y: 1, }], } ]; ```	2017-06-02 11:20:19 +02:00
Jonas Jenwald	982b6aa65b	Convert the files in the `/src/core` folder to ES6 modules Please note that the `glyphlist.js` and `unicode.js` files are converted to CommonJS modules instead, since Babel cannot handle files that large and they are thus excluded from transpilation.	2017-05-30 22:06:21 +02:00
Yury Delendik	5dc8dcdc0f	Merge pull request #8388 from Snuffleupagus/issue-8380 Cache JPEG images, just as we do for other image formats, in `evaluator.js` (issue 8380)	2017-05-17 17:25:51 -05:00
巴里切罗	8d5d97264e	fix(svg) adjust strategy for decoding JPEG images	2017-05-08 11:32:44 +08:00
Jonas Jenwald	0c2ebda31c	Cache JPEG images, just as we do for other image formats, in `evaluator.js` (issue 8380) For some reason, we're putting all kind of images except JPEG into the `imageCache` in `evaluator.js`.[1] This means that in the PDF file in issue 8380, we'll keep sending the same two small images[2] to the main-thread and decoding them over and over. This is obviously hugely inefficient! As can be seen from the discussion in the issue, the performance becomes extremely bad if the user has the addon "Adblock Plus" installed. However, even in a clean Firefox profile, the performance isn't that great. This patch not only addresses the performance implications of the "Adblock Plus" addon together with that particular PDF file, but it also improves the rendering times considerably for all users. Locally, with a clean profile, the rendering times are reduced from `~2000 ms` to `~500 ms` for my setup! Obviously, the general structure of the PDF file and its operator sequence is still hugely inefficient, however I'd say that the performance with this patch is good enough to consider the issue (as it stands) resolved.[3] Fixes 8380. --- [1] Not technically true, since inline images are cached from `parser.js`, but whatever :-) [2] The two JPEG images have dimensions 1x2, respectively 4x2. [3] To make this even more efficient, a new state would have to be added to the `QueueOptimizer`. Given that PDF files this stupid fortunately aren't too common, I'm not convinced that it's worth doing.	2017-05-07 13:07:41 +02:00
Jonas Jenwald	3e20d30afc	Change the signatures of the `PartialEvaluator` "constructor" and its `getOperatorList`/`getTextContent` methods to take parameter objects Currently these methods accept a large number of parameters, which creates quite unwieldy call-sites. When invoking them, you have to remember not only what arguments to supply, but also the correct order, to avoid runtime errors. Furthermore, since some of the parameters are optional, you also have to remember to pass e.g. `null` or `undefined` for those ones. Also, adding new parameters to these methods (which happens occasionally), often becomes unnecessarily tedious (based on personal experience). Please note that I do not think that we need/should convert every single method in `evaluator.js` (or elsewhere in `/core` files) to take parameter objects. However, in my opinion, once a method starts relying on approximately five parameter (or even more), passing them in individually becomes quite cumbersome. With these changes, I obviously needed to update the `evaluator_spec.js` unit-tests. The main change there, except the new method signatures[1], is that it's now re-using one `PartialEvalutor` instance, since I couldn't see any compelling reason for creating a new one in every single test. Note: If this patch is accepted, my intention is to (time permitting) see if it makes sense to convert additional methods in `evaluator.js` (and other `/core` files) in a similar fashion, but I figured that it'd be a good idea to limit the initial scope somewhat. --- [1] A fun fact here, note how the `PartialEvaluator` signature used in `evaluator_spec.js` wasn't even correct in the current `master`.	2017-05-03 12:10:20 +02:00
Jonas Jenwald	95bbc8101c	Replace unnecessary `bind(this)` and `var self = this` statements with arrow functions in `src/core/evaluator.js` Note that by using `let` instead of `var` in `PartialEvaluator.setGState` and `TranslatedFont.loadType3Data`, we can get rid of further `bind` usages since `let` is block-scoped. Also, the fact that `bind` wasn't used in the `Font` case inside of `setGState` is actually a bug which has been present ever since PR 5205, where a closure was replaced by a standard loop.[1] --- [1] I'm not aware of any bugs caused by this, but that is probably more a happy accident than anything else, since e.g. just removing the `bind` from the `SMask` case without using block-scoped variables causes test failures.	2017-05-01 20:29:44 +02:00
Jonas Jenwald	afc74b0178	Enable the `object-shorthand` ESLint rule in `src/shared` Please see http://eslint.org/docs/rules/object-shorthand. For the most part, these changes are of the search-and-replace kind, and the previously enabled `no-undef` rule should complement the tests in helping ensure that no stupid errors crept into to the patch.	2017-04-27 17:29:40 +02:00
Jonas Jenwald	fbe7b2eee7	Always ignore Type3 glyphs if their `OperatorList`s contain errors, regardless of the value of the `stopAtErrors` option Compared to the parsing of e.g. an entire page, it doesn't really make sense to only be able to render a Type3 glyph partially.	2017-04-11 08:59:22 +02:00
Jonas Jenwald	a39d636eb8	[api-minor] Always allow e.g. rendering to continue even if there are errors, and add a `stopAtErrors` parameter to `getDocument` to opt-out of this behaviour (issue 6342, issue 3795, bug 1130815) Other PDF readers, e.g. Adobe Reader and PDFium (in Chrome), will attempt to render as much of a page as possible even if there are errors present. Currently we just bail as soon the first error is hit, which means that we'll usually not render anything in these cases and just display a blank page instead. NOTE: This patch changes the default behaviour of the PDF.js API to always attempt to recover as much data as possible, even when encountering errors during e.g. `getOperatorList`/`getTextContent`, which thus improve our handling of corrupt PDF files and allow the default viewer to handle errors slightly more gracefully. In the event that an API consumer wishes to use the old behaviour, where we stop parsing as soon as an error is encountered, the `stopAtErrors` parameter can be set at `getDocument`. Fixes, inasmuch it's possible since the PDF files are corrupt, e.g. issue 6342, issue 3795, and [bug 1130815](https://bugzilla.mozilla.org/show_bug.cgi?id=1130815) (and probably others too).	2017-04-11 08:59:22 +02:00
Jonas Jenwald	10e5f766a2	Merge pull request #8266 from brendandahl/issue6652 Normalize blend mode names.	2017-04-11 08:54:42 +02:00
Brendan Dahl	4969b2ad97	Normalize blend mode names.	2017-04-10 16:18:08 -07:00
Jonas Jenwald	f41d80bdd3	Enable the `prefer-promise-reject-errors` ESLint rule See http://eslint.org/docs/rules/prefer-promise-reject-errors, note that this is similar to the already used `no-throw-literal` rule.	2017-04-08 11:47:22 +02:00
Jonas Jenwald	e229c21ce1	Remove unnecessary `xref` parameters from various method signatures in `PartialEvaluator`, since `this.xref` is already available in the relevant scope For reasons I don't pretend to understand, we're passing around `xref` arguments to a bunch of methods despite `this.xref` being available in `PartialEvaluator`. This patch is a small first small step towards cleaning up the, often unwieldy, signatures of methods in `PartialEvaluator`.	2017-03-26 14:12:53 +02:00
Jonas Jenwald	e40fd63bd3	In `src/core/evaluator.js`, convert a couple of `if (!someVariable) { error(...); }` instances to `assert(someVariable);` instead Rather than, in a number of places, basically duplicating the logic of `assert` we can simply utilize the function directly instead.	2017-03-26 13:53:13 +02:00
Jonas Jenwald	a7c19d9cbb	Adjust the `yoda` ESLint rule to apply to inequalities as well I happened to notice that some inequalities had the wrong order, and was surprised since I thought that the `yoda` rule should have caught that. However, reading http://eslint.org/docs/rules/yoda#options a bit more closely than previously, it's quite obvious that the `onlyEquality` option does exactly what its name suggests. Hence I think that it makes sense to adjust the options such that only ranges are allowed instead.	2017-03-19 13:27:14 +01:00
Jonas Jenwald	111419a64a	Cache built-in binary CMap files in the worker (issue 4794)	2017-02-16 10:55:39 +01:00
Jonas Jenwald	769c1450b7	[api-minor] Refactor fetching of built-in CMaps to utilize a factory on the `display` side instead, to allow users of the API to provide a custom CMap loading factory (e.g. for use with Node.js) Currently the built-in CMap files are loaded in `src/core/cmap.js` using `XMLHttpRequest` directly. For some environments that might be a problem, hence this patch refactors that to instead use a factory to load built-in CMaps on the main thread and message the data to the worker thread. This is inspired by other recent work, e.g. the addition of the `CanvasFactory`, and to a large extent on the IRC discussion starting at http://logs.glob.uno/?c=mozilla%23pdfjs&s=12+Oct+2016&e=12+Oct+2016#c53010.	2017-02-16 10:55:35 +01:00
Jonas Jenwald	9c34d0aa8c	[api-minor] Add a `getDocument` parameter that allows disabling of the `NativeImageDecoder` (e.g. for use with Node.js) Note that I initially tried to add this as a parameter to the `PDFPageProxy.render` method, such that it could be passed to `PartialEvaluator.getOperatorList`. However, given all the different code-paths that call `getOperatorList` (there's a bunch only in `annotation.js`), this seemed to very quickly become unwieldy and thus difficult to maintain compared to simply using the existing `evaluatorOptions`.	2017-02-06 22:21:34 +01:00
Jonas Jenwald	52e0f51917	Enable the `no-unused-vars` ESLint rule Please see http://eslint.org/docs/rules/no-unused-vars; note that this patch purposely uses the same rule options as in `mozilla-central`, such that it fixes part of issue 7957. It wasn't, in my opinion, entirely straightforward to enable this rule compared to the already existing rules. In many cases a `var descriptiveName = ...` format was used (more or less) to document the code, and I choose to place the old variable name in a trailing comment to not lose that information. I welcome feedback on these changes, since it wasn't always entirely easy to know what changes made the most sense in every situation.	2017-01-29 23:23:17 +01:00
Jonas Jenwald	50c2856097	Move `EOF`/`isEOF` from core/parser.js to core/primitives.js Given the nature of `EOF` and `isEOF`, it seems to me that they really ought to be placed in `core/primitives.js` instead. In general, it doesn't seem great to have to depend on the entire `core/parser.js` file for such simple primitives/helper functions. In particular, while `core/ps_parser.js` is completely separate from `core/parser.js` with regards to its function, it still depends on the latter for just one primitive. Note that compared to e.g. PR 7389, this will not reduce the number of dependencies for `core/ps_parser`, however the new dependency IMHO makes more sense.	2017-01-27 13:37:48 +01:00
Jonas Jenwald	642d8621ef	Replace direct lookup of `uniquePrefix`/`idCounters`, in `Page` instances, with an `idFactory` containing an `createObjId` method instead We're currently making use of `uniquePrefix`/`idCounters` in multiple files, to create unique object id's, and adding a new occurrence of them requires some care to ensure that an object id isn't accidentally reused. Furthermore, having to pass around multiple parameters as we currently do seem like something you want to avoid. Instead, this patch adds a factory which means that there's only one thing that needs to be passed around. And since it's now only necessary to call a method in order to obtain a unique object id, the details are thus abstracted away at the call-sites which avoids accidental reuse of object id's. To test that this works as expected a very simple `Page` unit-test is added, and the existing `Annotation layer` tests are also adjusted slightly.	2017-01-09 23:16:25 +01:00
Jonas Jenwald	4046d67fde	Enable the `no-else-return` ESLint rule Using `else` after `return` is not necessary, and can often lead to unnecessarily cluttered code. By using the `no-else-return` rule in ESLint we can avoid this pattern, see http://eslint.org/docs/rules/no-else-return.	2017-01-09 20:27:39 +01:00
Jonas Jenwald	ddea9a6b04	Improve the handling of `Encoding` dictionary, with `Differences` array, in `PartialEvaluator_preEvaluateFont` I recently happened to look at the code I wrote for PR 5964, which fixed [bug 1157493](https://bugzilla.mozilla.org/show_bug.cgi?id=1157493), and I quickly realized that the solution is way too simplistic. The fact that only using the `length` of a `Differences` array worked seems more like a happy accident for a particular set of font data, but could just as easily be incorrect for other PDF files. Note that in practice, the case where the `Encoding` entry is a regular `Dict` (and not a `Ref` or `Name`) is very rare, hence I don't think that we really need to worry about having to reparse this data. Also, the performance of this code-block is quite a bit better by updating the `hash` with the data from the entire `Differences` array, instead of at every loop iteration.	2016-12-28 21:32:54 +01:00
Ross Johnson	4537590033	Consitently apply textAdvanceScale during building of textContentItems for improved highlighting. Fixes #7878 .	2016-12-14 21:02:19 -06:00
Jonas Jenwald	c5b06cb40d	Ensure that `PartialEvaluator_extractWidths` is able to handle indirect objects in all kinds of "width" data (issue 7855) Fixes 7855.	2016-11-29 20:49:07 +01:00
Jonas Jenwald	451956c0b1	Merge pull request #7628 from Snuffleupagus/issue-7580 Fallback to the `StandardEncoding` for Nonsymbolic fonts without `/Encoding` entry (issue 7580)	2016-11-29 12:37:36 +01:00
Jonas Jenwald	a930f9af15	For commands with with too few arguments, clear out `args` if it's an Array instead of replacing it with `null` in `EvaluatorPreprocessor_read` (issue 7804) For `PartialEvaluator_getTextContent`, the same `args` Array should be re-used for every `EvaluatorPreprocessor_read` call. Hence we want to ensure that it's not accidentally replaced with `null` in `EvaluatorPreprocessor_read`, since otherwise corrupt PDF files (with too few arguments for certain commands) will cause errors in `PartialEvaluator_getTextContent`. Perhaps a micro-optimization, but this patch also changes two `!args` comparisons to `args === null`, since that should be a tiny bit more efficient.	2016-11-16 10:20:29 +01:00
Chas Emerick	85c52f1fd6	Fix getTextContent evaluation to only apply TJ horizontal offsets using numeric items/args While the array argument to TJ should only contain strings and numbers, other unfortunate items are found in PDFs in the wild, e.g.: [(Grandes) 0.0 Tc -250.0 (Client\350les,) 0.0 Tc -250.0 (Financements) 0.0 Tc -250.0 (et) 0.0 Tc -250.0 (March\351s) ] TJ getOperatorList already properly ignores any non-string, non-numeric values in TJ arrays; without this patch to getTextContent, returned text items can have NaN widths due to calculations being applied to those non-numeric values.	2016-10-13 08:08:31 -04:00
Jonas Jenwald	116ba19dd9	Respect the 'ColorTransform' entry in the image dictionary when decoding JPEG images (bug 956965, issue 6574) Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=956965. Fixes 6574.	2016-09-26 21:55:43 +02:00
Jonas Jenwald	356b321f6d	Fallback to the `StandardEncoding` for Nonsymbolic fonts without `/Encoding` entry (issue 7580) Even though this patch passes all tests (unit/font/reference) locally, including the new ones that I added in PR 7621, I'm still a bit nervous about modifying the code that choose the fallback encoding for fonts without an `/Encoding` entry. Note that over the years this code has been changed on a number of occasions, see a possibly incomplete [list here], to deal with various cases of incorrect font data. According to the PDF specification, see http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1904184, it seems that we should fallback to the `StandardEncoding` for Nonsymbolic fonts. There's obviously a risk that fixing this particular issue could break other PDF files for which we don't have tests. However I've tried to change the logic as little as possible in this patch, to hopefully reduce possible breakage. Based on debugging numerous font issue, it seems that a lot of fonts actually set the Symbolic flag, even when they are in fact not Symbolic. Fonts actually marked as Nonsymbolic seem to be somewhat less common, which I hope should reduce the risk of the patch somewhat. Fixes 7580.	2016-09-13 14:07:16 +02:00
Jonas Jenwald	325f7afcca	For embedded Type1 fonts without included `ToUnicode`/`Encoding` data, attempt to improve text selection by using the `builtInEncoding` to amend the `toUnicode` map (issue 6901, issue 7182, issue 7217, bug 917796, bug 1242142) Note that in order to prevent any possible issues, this patch does not try to amend the `toUnicode` data for Type1 fonts that contain either `ToUnicode` or `Encoding` entries in the font dictionary. Fixes, or at least improves, issues/bugs such as e.g. 6658, 6901, 7182, 7217, bug 917796, bug 1242142.	2016-09-11 20:54:10 +02:00
Yury Delendik	ffa99397ad	Merge pull request #7387 from Snuffleupagus/issue-5808 Attempt to ignore multiple identical Tf (setFont) commands in `PartialEvaluator_getTextContent` (issue 5808)	2016-08-30 15:21:41 -05:00
Jonas Jenwald	83ce6f0b6d	Adjust the (applicable) existing `isName` callsites to use the new `isName(v, name)` version of the function	2016-08-10 11:15:08 +02:00
Jonas Jenwald	77c6ed5389	Attempt to ignore multiple identical Tf (setFont) commands in `PartialEvaluator_getTextContent` (issue 5808) This patch improves the performance of issue 5808, but I'm not sure if it's enough to call it fixed. On average, this patch reduces the number of textLayer div's by a factor of 3, and it also reduces the time spend in `getTextContent` by a factor of ~2. The PDF file is generated by `Scribus PDF`, which for reasons I cannot understand is placing redundant `Tf` commands before every showText command. Note how the PDF file also contains lots of (basically) identical fonts, but with slightly different names, which causes unnecessary font-switching. This causes some unnecessary breaking of textLayer div's, but this issue cannot be easily worked around.	2016-07-27 21:37:52 +02:00
Yury Delendik	a02e2686b9	Merge pull request #7475 from Snuffleupagus/api-getTextContent-combineTextItems [api-minor] Add a parameter to `PDFPageProxy_getTextContent` that controls whether `PartialEvaluator_getTextContent` will attempt to combine same line text items	2016-07-27 08:34:24 -05:00
Brendan Dahl	5678486802	Merge pull request #7347 from Snuffleupagus/evaluator-more-Ref_toString Slightly refactor the `fontRef` handling in `PartialEvaluator_loadFont` (issue 7403 and issue 7402)	2016-07-22 17:21:47 -07:00
Brendan Dahl	50d6e4f147	Merge pull request #7447 from Snuffleupagus/buildToUnicode-notdef Ignore .notdef in the `differences` array when building a fallback `toUnicode` map in `PartialEvaluator_buildToUnicode` (issue 5256)	2016-07-22 14:33:32 -07:00
Jonas Jenwald	390c02a3e9	Attempt to cache fonts that are direct objects (i.e. `Dict`s), as opposed to `Ref`s, to prevent re-rendering after `cleanup` from breaking (issue 7403 and issue 7402) Fonts that are not referenced by `Ref`s are very uncommon in practice, but it can unfortunately happen. In this case, we're currently not caching them in the usual way, i.e. by `Ref`, which leads to failures when a page is rendered after `cleanup` has run. The simplest solution would have been to remove the `font.translated` workaround, but since this would have meant loading these kind of fonts over and over, the patch attempts to be a bit clever about this situation. Note that if we instead loaded fonts per page, instead of per document, this issue wouldn't have existed.	2016-07-21 16:04:07 +02:00
Jonas Jenwald	2e9cd3ea64	Slightly refactor the `fontRef` handling in `PartialEvaluator_loadFont` (issue 7403 and issue 7402) Originally, I was just going to change this code to use `Ref_toString` in a couple more places. When I started reading the code, I figured that it wouldn't hurt to clean up a couple of comments. While doing this, I noticed that the logic for the (rare) `isDict(fontRef)` case could do with a few improvements. There should be no functional changes with this patch, but given the added reference checks, we will now avoid bogus `Ref`s when resolving font aliases. In practice, as issue 7403 shows, the current code can break certain PDF files even if it's very rare. Note that the only thing that this patch will change, is the `font.loadedName` in the case where a `fontRef` is a reference and the font doesn't have a descriptor. Previously for `fontRef = Ref(4, 0)` we'd get `font.loadedName = 'g_d0_f4_0'`, and with this patch `font.loadedName = g_d0_f4R`, which is actually one character shorted in most cases. (Given that `Ref_toString` contains an optimization for the `gen === 0` case, which is by far the most common `gen` value.) In the already existing fallback case, where the `fontName` is used to when creating the `font.loadedName`, we allow any alphanumeric character. Hence I don't see how (as mentioned above) e.g. `font.loadedName = g_d0_f4R` would be an issue here.	2016-07-21 16:03:33 +02:00
Jonas Jenwald	f297e4d17c	[api-minor] Add a parameter to `PDFPageProxy_getTextContent` that controls whether `PartialEvaluator_getTextContent` will attempt to combine same line text items From the discussion in issue 7445, it seems that there may be cases where an API consumer would want to get the text content as is, without combined text items.	2016-07-19 13:38:57 +02:00
klemens	6f03f62327	trivial spelling fixes	2016-07-17 14:33:41 +02:00
Jonas Jenwald	bdd58ab1d2	Ignore .notdef in the `differences` array when building a fallback `toUnicode` map in `PartialEvaluator_buildToUnicode` (issue 5256) Fixes 5256.	2016-06-27 16:20:23 +02:00
Jonas Jenwald	b02d560ae0	Fix errors in `setGState` in `PartialEvaluator_getTextContent` that prevents text-selection from working properly Currently `setGState` is completely broken, and looking through the history of that code, it seems to me that this may never have worked correctly. This patch fixes the text-selection in `extgstate.pdf` in the test-suite, which is also added as a `text` test.	2016-06-01 22:58:49 +02:00
Jonas Jenwald	7ddb0bc718	Attempt to combine text runs positioned with `setTextMatrix`	2016-05-18 17:21:58 +02:00
Jonas Jenwald	6111c17c8a	Use `Dict_getArray` in more places in `src/core/` to avoid issues when Arrays contain indirect objects As evident from e.g. PRs 6485 and 7118, some bad PDF generators unfortunately create Arrays where some elements are indirect objects (i.e. `Ref`s). This seems to mostly affect Arrays that contain numbers, such as e.g. `Matrix/FontMatrix/BBox/FontBBox/Rect/Color/...`, and has manifested itself in PDF files that fail to render correctly (some elements are missing). The problem in both the cases above, besides broken rendering, was that there were no errors/warnings that indicated what the problem was, making it difficult to pinpoint the issue. Hence this patch, where I've audited all usages of `Dict_get` in `src/core/` files, and replaced it with `Dict_getArray` where appropriate to try and prevent unnecessary future bugs.	2016-05-05 19:42:57 +02:00
Jonas Jenwald	f59c3a0644	Remove the remaining usages of `new {Name,Cmd}` in favor of `{Name,Cmd}.get` Using `new {Name,Cmd}` should be avoided, since it creates a new object on every call, whereas `{Name,Cmd}.get` uses caches to only create one object regardless of how many times they are called. Most of these are found in the unit-tests, where increased memory usage probably doesn't matter very much. But it still seems good to get rid of those cases, since no part of the codebase ought to advertise that usage. Given the small size of the patch, I'm also tweaking a few comments and class names.	2016-04-08 12:14:05 +02:00
Yury Delendik	a250c150ab	Merge pull request #7134 from yurydelendik/circ-stream-colorspace Refactors to remove stream.js dependency on colorspace.js	2016-04-01 08:23:24 -05:00
Yury Delendik	35cbf74b12	Refactors to remove stream.js dependency on colorspace.js	2016-04-01 07:36:16 -05:00
Jonas Jenwald	05cf709f8e	Parse Type1 font files to determine the various `Length{n}` properties, instead of trusting the PDF file (issue 5686, issue 3928) Fixes 5686. Fixes 3928.	2016-03-31 11:08:12 +02:00
Brendan Dahl	4e2f70440f	Merge pull request #6711 from yurydelendik/errors Better errors capturing at the core and stop rendering on error.	2016-03-29 09:19:28 -07:00
Yury Delendik	bda5e6235e	Removes global PDFJS usage from the src/core/.	2016-03-23 19:24:37 -05:00
Manas	f6d28ca323	Refactors CMapFactory.create to make it async	2016-03-21 23:08:19 +05:30
Yury Delendik	8ba413e761	Better errors capturing at the core and stop rendering on error.	2016-03-11 07:59:09 -06:00
Jonas Jenwald	93ea866f01	Remove `getAll` from `EvaluatorPreprocessor_read` For the operators that we currently support, the arguments are not `Dict`s, which means that it's not really necessary to use `Dict_getAll` in `EvaluatorPreprocessor_read`. Also, I do think that if/when we support operators that use `Dict`s as arguments, that should be dealt with in the corresponding `case` in `PartialEvaluator_getOperatorList` which handles the operator. The only reason that I can find for using `Dict_getAll` like that, is that prior to PR 6550 we would just append certain (currently unsupported) operators without doing any further processing/checking. But as issue 6549 showed, that can lead to issues in practice, which is why it was changed. In an effort to prevent possible issue with unsupported operators, this patch simply ignores operators with `Dict` arguments in `PartialEvaluator_getOperatorList`.	2016-02-12 22:31:50 +01:00
Jonas Jenwald	f7f60197ce	Replace `getAll` with `getKeys` in `loadType3Data` Not only is `getAll` less efficient, but given that we actually need the keys here, using `getKeys` seems much more suitable.	2016-02-10 20:19:14 +01:00
Jonas Jenwald	07e1ad40a2	Replace `getAll` with `getKeys` in `PartialEvaluator_hasBlendModes` to speed up loading of badly generated PDF files (issue 6961) Some bad PDF generators, in particular "Scribus PDF", duplicates resources a lot at various levels of the PDF files. This can lead to `PartialEvaluator_hasBlendModes` taking an unreasonable amount of time to complete. The reason is that the current code is using `Dict_getAll`, which recursively dereferences all indirect objects, which can be really slow. This patch instead uses `Dict_getKeys`, and then manually looks up only the necessary indirect objects. I've added the PDF file as a `load` test. The most important thing here is probably to ensure that the file remains available in the repo, and the comment should help reduced the chance of regressions. (Note that locally, the `load` test times out without this patch, but we cannot really assume that that always happens.) Fixes 6961.	2016-02-10 17:21:38 +01:00
Jonas Jenwald	a1fe2cb443	Don't directly access the private `map` in `setGState`, and ensure that we avoid indirect objects This patch is based on something I noticed while debugging some of the PDF files in issue 6931. In a number of the cases in `setGState`, we're implicitly assuming that we're not dealing with indirect objects (i.e. `Ref`s). See e.g. the 'Font' case, or the various cases where we simply do `gStateObj.push([key, value]);` (since the code in `canvas.js` won't be able to deal with a `Ref` for those cases). The reason that I didn't use `Dict_forEach` instead, is that it would re-introduce the unncessary closures that PR 5205 removed.	2016-02-03 17:13:42 +01:00
Jonas Jenwald	2d4a1aa0af	Actually ignore no-op `setGState` (PR 5192 followup) The intention of PR 5192 was to avoid adding empty `setGState` ops to the operatorList. But the patch accidentally used `>=`, which means that it's not actually working as intended, since empty arrays always have `length === 0`.	2016-02-03 17:13:02 +01:00
Jonas Jenwald	4770b516fe	Correct the upper bound used when building the `transferMap` for SMasks (PR 6723 followup) Even though the currently known test-cases render correctly without this patch, that seems more like a lucky coincidence, given that there's no guarantee that `transferMap[255] === 0` for every possible transfer function.	2016-02-03 13:41:10 +01:00
Jonas Jenwald	992472fd38	Ensure that we don't modify the `Dict` data when the `Differences` array of a font contains indirect objects This patch fixes an issue that I inadvertently introduced in PR 5815, where we accidentally modify the `Differences` array in the encoding dictionary for indirect objects. Instead of this change, we could also have used the now existing `Dict_getArray`. However in this case I don't think that would have been a good idea, since it would mean iterating through the array twice.	2016-01-30 13:31:24 +01:00
Yury Delendik	2edf2792dc	Replaces literal {} created lookup tables with Object.create	2016-01-28 12:18:38 -06:00
Yury Delendik	d6adf84159	Lazify OP_MAP.	2016-01-28 12:18:37 -06:00
Yury Delendik	1de90454b7	Lazify Metrics	2016-01-28 12:11:46 -06:00
Yury Delendik	55a201d92d	Lazify NormalizedUnicodes	2016-01-28 11:56:42 -06:00
Yury Delendik	d0738d7e24	Lazify stdFontMap, serifFonts, GlyphMapForStandardFonts	2016-01-28 11:51:54 -06:00
Yury Delendik	1a9a665adf	Refactor Encodings	2016-01-28 11:32:59 -06:00
Yury Delendik	6b60c8f4db	Adds UMD headers to core, display and shared files.	2015-12-15 13:24:39 -06:00
Yury Delendik	15c9969abe	Adds transfer function support for SMask.	2015-12-04 12:52:45 -06:00
Yury Delendik	c9cb6a3025	Replaces UnsupportedManager with callback.	2015-11-30 14:42:47 -06:00
Yury Delendik	e4e69e2f05	Set error font for Type3 if its loading failed.	2015-11-27 13:05:51 -06:00
Jonas Jenwald	6dfe53b976	[api-minor] Add a parameter to `PDFPageProxy_getTextContent` that enables replacing of all whitespace with standard spaces in the textLayer (issue 6612) This patch goes a bit further than issue 6612 requires, and replaces all kinds of whitespace with standard spaces. When testing this locally, it actually seemed to slightly improve two existing test-cases (`tracemonkey-text` and `taro-text`). Fixes 6612.	2015-11-25 17:28:40 +01:00
Yury Delendik	06c1904675	Refactors FontLoader to group fonts per document.	2015-11-24 13:27:22 -06:00
Manas	a2ba1b8189	Uses editorconfig to maintain consistent coding styles Removes the following as they unnecessary /* -- Mode: Java; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -- / / vim: set shiftwidth=2 tabstop=2 autoindent cindent expandtab: */	2015-11-14 07:32:18 +05:30
Yury Delendik	fa423cfab0	Refactors fake space heuristics for speed.	2015-11-06 10:55:43 -06:00
Yury Delendik	376f8bde14	Combines standalone divs into text groups.	2015-11-06 10:20:49 -06:00
Yury Delendik	fa46b73c47	Better spacing in text layer.	2015-11-02 08:54:15 -06:00
Jonas Jenwald	1c66d4a106	Add a `totalLength` getter to `OperatorList`, since the `length` is zero after flushing In the `RenderPageRequest` handler in `worker.js`, we attempt to print an `info` message containing the rendering time and the length of the operator list. The latter is currently broken (and has been for quite some time), since the `length` of an `OperatorList` is reset when flushing occurs. This patch attempts to rectify this, by adding a getter which keeps track of the total length.	2015-10-26 18:12:14 +01:00
Yury Delendik	58c3ea0820	Adds thread abort capabilities.	2015-10-23 09:06:32 -05:00
Jonas Jenwald	2e751199fb	Prevent getOperatorList from failing to correctly parse OPS.paintXObject for TilingPatterns that are missing some /Resources entries (issue 6541) Fixes 6541.	2015-10-21 21:30:56 +02:00
Rob Wu	50ff2d4c2a	Ignore operators that are known to be unsupported `operatorList.addOp` adds the arguments to the list which is then passed as-is by postMessage to the main thread. But since we don't parse these operations, they are raw PDF objects and may therefore cause a serialization error. This is a conservative patch, and only affects operators which are known to be unsupported. We should ignore all unknown operators, but I haven't really looked into the consequences of doing that. Fixes #6549	2015-10-21 15:39:25 +02:00
Brendan Dahl	3eaeacfe19	Merge pull request #6476 from Snuffleupagus/PartialEvaluator_readToUnicode-cmap-length Right-size the `map` array in PartialEvaluator_readToUnicode	2015-10-09 10:31:28 -07:00
Jonas Jenwald	1b8cb52555	Prevent `PartialEvaluator_buildFormXObject` from failing if the `Matrix` or `BBox` contains indirect objects This patch fixes yet another instance of bad PDF data, specifically a case where the `BBox` array contains indirect objects (i.e. `Ref`s). Fixes the missing image in http://www.int.washington.edu/talks/WorkShops/int_08_37W/People/Franz_M/Franz.pdf#page=24. Note: There are missing images on a number of the pages in that file.	2015-09-29 10:11:49 +02:00
Jonas Jenwald	8d831449ab	Right-size the `map` array in PartialEvaluator_readToUnicode We can avoid a lot of intermediate resizings, by directly allocating the required number of elements for the `map` array.	2015-09-24 13:08:53 +02:00
Fabian Lange	2564827503	Fix text spacing with vertical fonts (#6387 ) According to the PDF spec 5.3.2, a positive value means in horizontal, that the next glyph is further to the left (so narrower), and in vertical that it is further down (so wider). This change fixes the way PDF.js has interpreted the value.	2015-09-15 09:28:45 +02:00
Rob Wu	b0a8c0fa40	cmaps: Use cmap.forEach instead of Array.forEach CMaps may be sparse. Array.prototype.forEach is terribly slow in Chrome (and also in Firefox) when the sparse array contains a key with a high value. E.g. console.time('forEach sparse') var a = []; a[0xFFFFFF] = 1; a.forEach(function(){}); console.timeEnd('forEach sparse'); // Chrome: 2890ms // Firefox: 1345ms Switching to CMap.prototype.forEach, which is optimized for such scenarios fixes the problem.	2015-08-08 13:30:30 +02:00
Jonas Jenwald	46a8485db4	Ignore paint form XObject when the name is missing (issue 4558) Fixes 4558 (since the font issues already appear to be fixed).	2015-06-22 22:10:26 +02:00
Tim van der Meij	90982332bf	Merge pull request #5995 from CodingFabian/tweak-char-spacing-text-selection Apply char spacing only when there are chars.	2015-05-14 20:06:22 +02:00
Fabian Lange	c2013094e7	Apply char spacing only when there are chars.	2015-05-13 23:45:20 +02:00
Tim van der Meij	b34366d2fc	Merge pull request #5898 from stri8ed/master Extract more accurate glyph heights from type3 fonts	2015-05-13 21:07:17 +02:00
Jonas Jenwald	6d2d854f65	Merge pull request #5815 from Snuffleupagus/type1-diff-refs Ensure that entries in the Differences array of Type1 fonts are either numbers or names	2015-05-07 22:33:23 +02:00
Brendan Dahl	cd53cbe7d4	Merge pull request #5964 from Snuffleupagus/bug-1157493 Handle the Encoding being a dictionary in PartialEvaluator_preEvaluateFont (bug 1157493)	2015-05-05 14:41:32 -07:00
Jonas Jenwald	760222cf0b	Handle the Encoding being a dictionary in PartialEvaluator_preEvaluateFont (bug 1157493) This is a regression from PR 4423. Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1157493.	2015-04-25 16:48:14 +02:00
Jonas Jenwald	7c7d05e7a3	Attempt to infer if a CMap file actually contains just a standard `Identity-H`/`Identity-V` map	2015-04-25 11:28:33 +02:00
Jonas Jenwald	4c2ad3bc7b	Ensure that entries in the Differences array of Type1 fonts are either numbers or names This patch is yet another installment in the (never ending) series of bugs in PDF files with non-embedded fonts. Fixes http://www.int.washington.edu/talks/WorkShops/int_08_37W/People/Franz_M/Franz.pdf.	2015-04-17 20:32:27 +02:00
Levi Melamed	a5159a7942	extract more accurate glpyh heights from type-3 fonts	2015-04-03 08:49:06 -05:00
Brendan Dahl	3a8d4a7d72	Merge pull request #5713 from Snuffleupagus/evaluator-IdentityToUnicodeMap Create a IdentityToUnicodeMap in evaluator.js when toUnicode contains IdentityH/IdentityV	2015-03-25 10:33:29 -07:00
Hengjie	109d67691c	Lower threshold Fixes text selection formatting with https://github.com/vortext/vortext/blob/master/resources/public/examples/TestDocument3.pdf	2015-02-13 22:27:49 -08:00
Jonas Jenwald	f19a1db414	Create a IdentityToUnicodeMap in evaluator.js when toUnicode contains IdentityH/IdentityV Currently if a font contains a `toUnicode` entry, we always create a new `ToUnicodeMap` in evaluator.js. This is done even for `IdentityV/IdentityH`, despite to possibility to use the much more compact `IdentityToUnicodeMap` representation. This patch refactors the `IdentityH/IdentityV` cases, to: - Avoid calling `IdentityCMap.getMap`, since this prevents allocating and iterating through an array with 65536 elements. - Ensure that the handling of `toUnicode` is actually correct in fonts.js. We rely on `toUnicode instanceof IdentityToUnicodeMap` in a few places, and currently this does not work correctly for `IdentityH/IdentityV`.	2015-02-09 16:52:31 +01:00
Yury Delendik	f5df30f967	Merge pull request #5445 from CodingFabian/fixImageCachingInParser Fixes caching of inline images during parsing.	2014-12-15 10:51:23 -06:00
palkan	4764c52b5b	fix passing null as Promise's onFullfilled (which is broken in Chrome 32)	2014-11-25 16:40:27 +04:00
Fabian Lange	970c048d50	fixes caching of inline images during parsing. As described in #5444, the evaluator will perform identity checking of paintImageMaskXObjects to decide if it can use paintImageMaskXObjectRepeat instead of paintImageMaskXObjectGroup. This can only ever work if the entry is a cache hit. However the previous caching implementation was doing a lazy caching, which would only consider a image cache worthy if it is repeated. Only then the repeated instance would be cached. As a result of this the sequence of identical images A1 A2 A3 A4 would be seen as A1 A2 A2 A2 by the evaluator, which prevents using the "repeat" optimization. Also only the last encountered image is cached, so A1 B1 A2 B2, would stay A1 B1 A2 B2. The new implementation drops the "lazy" init of the cache. The threshold for enabling an image to be cached is rather small, so the potential waste in storage and adler32 calculation is rather low. It also caches any eligible image by its adler32. The two example from above would now be A1 A1 A1 A1 and A1 B1 A1 B1 which not only saves temporary storage, but also prevents computing identical masks over and over again (which is the main performance impact of #2618)	2014-10-28 15:39:41 +01:00
Jonas Jenwald	4bda6ba1b8	Add basic support for ZapfDingbats	2014-09-03 21:54:04 +02:00
Nicholas Nethercote	96b9af68dd	Remove setGStateForKey() closure. setGStateForKey() is a closure that serves no particularly useful purpose. This change inlines it at the single call site. This avoids 1.7 MiB of allocations (because closures are objects) for the MTA map mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=835380#c17.	2014-08-17 22:21:45 -07:00
Yury Delendik	e53a28c996	Merge pull request #5192 from nnethercote/empty-setGState Ignore setGState no-ops.	2014-08-15 10:20:14 -05:00
Nicholas Nethercote	9674abc542	Ignore setGState no-ops. For the document in #2504, 11% of the ops are `setGState` with a `gStateObj` that is an empty array, which is a no-op. This is possible because we ignore various setGState keys (OP, OPM, BG, etc). This change prevents these ops from being inserted into the operator list.	2014-08-14 20:46:28 -07:00
Nicholas Nethercote	7cbd057deb	Avoid unnecessary array allocations in EvaluatorPreprocessor_read(). EvaluatorPreprocessor_read() is called in two cases. For the normal layer, the args array it produces is used beyond the bounds of the loop in which EvaluatorPreprocessor_read() is called. But for the text layer, the args array is used in a very short-term fashion. This change reworks things so that a single array is repeatedly used for the text layer. This reduces total JS allocations for the Spoorkaart map by 11%, and has similar effects on many other PDFs.	2014-08-11 16:57:40 -07:00
Yury Delendik	4ce1b1e987	Merge pull request #5150 from nnethercote/toUnicode Fix #4935	2014-08-10 14:07:26 -05:00
Yury Delendik	99b08ed223	Merge pull request #5162 from yurydelendik/pramodhkp-fixupgstate2 [SVG] Reduces amount of used memory during PNG creation.	2014-08-09 15:56:11 -05:00
pramodhkp	458b69b649	Adds image and mask features, fixes clippath	2014-08-10 01:06:43 +05:30
Jonas Jenwald	66c56ac546	Fixes a regression from PR 4982 After PR 4982, the rendering of the first two pages of http://www.openmagazin.cz/pdf/2011/openMagazin-2011-04.pdf (from issue 215) no longer completes. The issue is that we cannot have `args === null` in `PartialEvaluator_buildPath`, but must use an empty array instead. In this patch I've also moved the `argsLength` variable definition in `EvaluatorPreprocessor_read`, to make sure that it's always defined.	2014-08-08 13:19:18 +02:00
Nicholas Nethercote	9576047f0d	Add ToUnicodeMap class.	2014-08-07 20:05:24 -07:00
Yury Delendik	2b87ff9286	Merge pull request #5008 from nnethercote/better-QueueOpt Make QueueOptimizer easier to read.	2014-08-05 16:59:26 -05:00
Jonas Jenwald	87038e44cd	Add strict equalities in src/core/evaluator.js	2014-08-01 18:40:10 +02:00
Nicholas Nethercote	b86daed29d	Make CMap.map quasi-private. This makes it easier for the representation to be improved.	2014-07-30 06:26:35 -07:00
Jonas Jenwald	2485f11829	Fix loading of PDF files with invalid or missing Type3 characters (issue 5039)	2014-07-24 15:03:22 +02:00
Nicholas Nethercote	a483c80fc3	Make QueueOptimizer easier to read. QueueOptimizer is really hard to read. Enough so that it's blocking my efforts to streamline the representation used for operator lists. This patch improves its readability in the following ways. - More descriptive variable names make the sequence checking much clearer, as do additional comments. - The addState() functions now return the index of the first op past the sequence, instead of setting context.currentOperation to the last op of the sequence. - The loop in optimize() is clearer. - The array modification in the fourth addState() function is much clearer -- we're just removing trios of ops. - All four \|addState\| functions are now more consistent with each other. I used some debug printfs to find documents where these optimizations are used and then checked that the number of optimized ops was the same before and after my changes.	2014-07-03 19:16:31 -07:00
Jonas Jenwald	04975acceb	Prevent CMapFactory.create from failing by passing the necessary parameters from PartialEvaluator_readToUnicode (issue 5010)	2014-06-27 00:46:16 +02:00
Yury Delendik	6d5a04149b	Merge pull request #4993 from pramodhkp/rectelmnt Combine re element into constructPath	2014-06-24 09:27:21 -05:00
pramodhkp	8407d28c9e	Combine re element into constructPath	2014-06-25 00:27:42 +05:30
Fabian Lange	60f67c3961	Restructured EvaluatorPreprocessor_read to be more natural.	2014-06-23 23:35:25 +02:00
Nicholas Nethercote	081866a184	Use null instead of [] for ops with no args. This reduces peak RSS on one test file from ~600 to ~560 MiB.	2014-06-22 16:03:48 -07:00
Yury Delendik	b557b87fc9	Merge pull request #4972 from nnethercote/preprocessor-read Avoid allocating return object in EvaluatorPreprocessor_read().	2014-06-18 22:00:31 -05:00
Nicholas Nethercote	17170af3c7	Avoid allocating return object in EvaluatorPreprocessor_read(). This function can be called 100s of 1000s or even millions of times, and the allocated return object accounts for 10% of all GC thing allocations for some documents. It's easy to avoid, which reduces stress on the garbage collector, and this patch does that.	2014-06-18 16:41:29 -07:00
Nicholas Nethercote	bce7601480	Build up textChunk.str more efficiently. PartialEvaluator.getTextContent() builds up textChunk strings 1 char at a time, creating many 100s of 1000s of intermediate strings along the way. This patch make it instead push chars to an array and then join them at the end, as we have done in numerous other places.	2014-06-18 07:48:22 -07:00
Yury Delendik	5a2e511cbd	Merge pull request #4955 from timvandermeij/rename-concatenate Renames concatenateToArray to appendToArray	2014-06-17 08:21:47 -05:00
Yury Delendik	0cd28ebfa3	Telemetry for used stream and font types	2014-06-16 16:41:04 -05:00
Tim van der Meij	9c072a5d4b	Renames concatenateToArray to appendToArray	2014-06-16 22:10:10 +02:00
Nicholas Nethercote	7923eb7edb	Fix mishandling of incomplete, inverted masks.	2014-06-13 06:14:52 -07:00
Jonas Jenwald	c0250e16e3	Return ErrorFont in loadFont when the fontRef is undefined	2014-06-12 12:46:39 +02:00
Jonas Jenwald	7802a7ab97	Handle cases where the fontName contains non-alphanumeric characters (issue 4909)	2014-06-10 17:25:49 +02:00
Yury Delendik	b2d8e73d54	Merge pull request #4895 from p01/Small_optimizations_1 Small optimizations 1	2014-06-10 10:09:12 -05:00
p01	6731de6829	Minor refactoring of EvaluatorPreprocessor_read	2014-06-10 12:37:40 +02:00
p01	d4a01f6034	evaluator.js minor optimizations	2014-06-10 12:37:37 +02:00
Fabian Lange	532d7246ea	add object id to streams to prevent infinite loops. fixes http://bugzil.la/1020858	2014-06-10 11:29:25 +02:00
Yury Delendik	844bc644fb	Merge pull request #4861 from timvandermeij/xobject Fixes unhandled XObject subtype PS error	2014-05-29 08:40:57 -05:00
Jonas Jenwald	7e6cdc74af	Merge pull request #4832 from yurydelendik/showtext Refactors showText: split type3, remove showSpacedText	2014-05-29 12:58:09 +02:00
Tim van der Meij	e128bdc397	Fixes unhandled XObject subtype PS error	2014-05-29 11:53:13 +02:00
Yury Delendik	542c9c4c7a	Moves ColorSpace logic into evaluator	2014-05-23 14:11:47 -05:00
Yury Delendik	d53dc2e7d6	Refactors showText: split type3, remove showSpacedText	2014-05-23 13:36:54 -05:00
Yury Delendik	e5a0d89da9	Refactors loadFont for translateFont be async; fixes type3 dup data	2014-05-19 16:27:54 -05:00
Yury Delendik	88aa396aca	Terminate getOperationList and getTextContent every 20 ms	2014-05-19 16:19:54 -05:00
Yury Delendik	d8eb8b1de1	Adds Promise to the getOperatorList	2014-05-19 16:19:54 -05:00
Christian Krebs	3e7bcaa892	Handle nested post script arguments in the preprocessor Fix for issue #4785	2014-05-15 19:49:43 +02:00
Jonas Jenwald	b907e15a90	Build paths for glyph accents when drawing text as curves	2014-05-14 00:04:44 +02:00

... 3 4 5 6 7 ...