Sakurai/pdf.js - pdf.js - Gitea on kemo

Sakurai/pdf.js

Author	SHA1	Message	Date
Jonas Jenwald	769c1450b7	[api-minor] Refactor fetching of built-in CMaps to utilize a factory on the `display` side instead, to allow users of the API to provide a custom CMap loading factory (e.g. for use with Node.js) Currently the built-in CMap files are loaded in `src/core/cmap.js` using `XMLHttpRequest` directly. For some environments that might be a problem, hence this patch refactors that to instead use a factory to load built-in CMaps on the main thread and message the data to the worker thread. This is inspired by other recent work, e.g. the addition of the `CanvasFactory`, and to a large extent on the IRC discussion starting at http://logs.glob.uno/?c=mozilla%23pdfjs&s=12+Oct+2016&e=12+Oct+2016#c53010.	2017-02-16 10:55:35 +01:00
Jonas Jenwald	9c34d0aa8c	[api-minor] Add a `getDocument` parameter that allows disabling of the `NativeImageDecoder` (e.g. for use with Node.js) Note that I initially tried to add this as a parameter to the `PDFPageProxy.render` method, such that it could be passed to `PartialEvaluator.getOperatorList`. However, given all the different code-paths that call `getOperatorList` (there's a bunch only in `annotation.js`), this seemed to very quickly become unwieldy and thus difficult to maintain compared to simply using the existing `evaluatorOptions`.	2017-02-06 22:21:34 +01:00
Jonas Jenwald	52e0f51917	Enable the `no-unused-vars` ESLint rule Please see http://eslint.org/docs/rules/no-unused-vars; note that this patch purposely uses the same rule options as in `mozilla-central`, such that it fixes part of issue 7957. It wasn't, in my opinion, entirely straightforward to enable this rule compared to the already existing rules. In many cases a `var descriptiveName = ...` format was used (more or less) to document the code, and I choose to place the old variable name in a trailing comment to not lose that information. I welcome feedback on these changes, since it wasn't always entirely easy to know what changes made the most sense in every situation.	2017-01-29 23:23:17 +01:00
Jonas Jenwald	50c2856097	Move `EOF`/`isEOF` from core/parser.js to core/primitives.js Given the nature of `EOF` and `isEOF`, it seems to me that they really ought to be placed in `core/primitives.js` instead. In general, it doesn't seem great to have to depend on the entire `core/parser.js` file for such simple primitives/helper functions. In particular, while `core/ps_parser.js` is completely separate from `core/parser.js` with regards to its function, it still depends on the latter for just one primitive. Note that compared to e.g. PR 7389, this will not reduce the number of dependencies for `core/ps_parser`, however the new dependency IMHO makes more sense.	2017-01-27 13:37:48 +01:00
Jonas Jenwald	642d8621ef	Replace direct lookup of `uniquePrefix`/`idCounters`, in `Page` instances, with an `idFactory` containing an `createObjId` method instead We're currently making use of `uniquePrefix`/`idCounters` in multiple files, to create unique object id's, and adding a new occurrence of them requires some care to ensure that an object id isn't accidentally reused. Furthermore, having to pass around multiple parameters as we currently do seem like something you want to avoid. Instead, this patch adds a factory which means that there's only one thing that needs to be passed around. And since it's now only necessary to call a method in order to obtain a unique object id, the details are thus abstracted away at the call-sites which avoids accidental reuse of object id's. To test that this works as expected a very simple `Page` unit-test is added, and the existing `Annotation layer` tests are also adjusted slightly.	2017-01-09 23:16:25 +01:00
Jonas Jenwald	4046d67fde	Enable the `no-else-return` ESLint rule Using `else` after `return` is not necessary, and can often lead to unnecessarily cluttered code. By using the `no-else-return` rule in ESLint we can avoid this pattern, see http://eslint.org/docs/rules/no-else-return.	2017-01-09 20:27:39 +01:00
Jonas Jenwald	ddea9a6b04	Improve the handling of `Encoding` dictionary, with `Differences` array, in `PartialEvaluator_preEvaluateFont` I recently happened to look at the code I wrote for PR 5964, which fixed [bug 1157493](https://bugzilla.mozilla.org/show_bug.cgi?id=1157493), and I quickly realized that the solution is way too simplistic. The fact that only using the `length` of a `Differences` array worked seems more like a happy accident for a particular set of font data, but could just as easily be incorrect for other PDF files. Note that in practice, the case where the `Encoding` entry is a regular `Dict` (and not a `Ref` or `Name`) is very rare, hence I don't think that we really need to worry about having to reparse this data. Also, the performance of this code-block is quite a bit better by updating the `hash` with the data from the entire `Differences` array, instead of at every loop iteration.	2016-12-28 21:32:54 +01:00
Ross Johnson	4537590033	Consitently apply textAdvanceScale during building of textContentItems for improved highlighting. Fixes #7878 .	2016-12-14 21:02:19 -06:00
Jonas Jenwald	c5b06cb40d	Ensure that `PartialEvaluator_extractWidths` is able to handle indirect objects in all kinds of "width" data (issue 7855) Fixes 7855.	2016-11-29 20:49:07 +01:00
Jonas Jenwald	451956c0b1	Merge pull request #7628 from Snuffleupagus/issue-7580 Fallback to the `StandardEncoding` for Nonsymbolic fonts without `/Encoding` entry (issue 7580)	2016-11-29 12:37:36 +01:00
Jonas Jenwald	a930f9af15	For commands with with too few arguments, clear out `args` if it's an Array instead of replacing it with `null` in `EvaluatorPreprocessor_read` (issue 7804) For `PartialEvaluator_getTextContent`, the same `args` Array should be re-used for every `EvaluatorPreprocessor_read` call. Hence we want to ensure that it's not accidentally replaced with `null` in `EvaluatorPreprocessor_read`, since otherwise corrupt PDF files (with too few arguments for certain commands) will cause errors in `PartialEvaluator_getTextContent`. Perhaps a micro-optimization, but this patch also changes two `!args` comparisons to `args === null`, since that should be a tiny bit more efficient.	2016-11-16 10:20:29 +01:00
Chas Emerick	85c52f1fd6	Fix getTextContent evaluation to only apply TJ horizontal offsets using numeric items/args While the array argument to TJ should only contain strings and numbers, other unfortunate items are found in PDFs in the wild, e.g.: [(Grandes) 0.0 Tc -250.0 (Client\350les,) 0.0 Tc -250.0 (Financements) 0.0 Tc -250.0 (et) 0.0 Tc -250.0 (March\351s) ] TJ getOperatorList already properly ignores any non-string, non-numeric values in TJ arrays; without this patch to getTextContent, returned text items can have NaN widths due to calculations being applied to those non-numeric values.	2016-10-13 08:08:31 -04:00
Jonas Jenwald	116ba19dd9	Respect the 'ColorTransform' entry in the image dictionary when decoding JPEG images (bug 956965, issue 6574) Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=956965. Fixes 6574.	2016-09-26 21:55:43 +02:00
Jonas Jenwald	356b321f6d	Fallback to the `StandardEncoding` for Nonsymbolic fonts without `/Encoding` entry (issue 7580) Even though this patch passes all tests (unit/font/reference) locally, including the new ones that I added in PR 7621, I'm still a bit nervous about modifying the code that choose the fallback encoding for fonts without an `/Encoding` entry. Note that over the years this code has been changed on a number of occasions, see a possibly incomplete [list here], to deal with various cases of incorrect font data. According to the PDF specification, see http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1904184, it seems that we should fallback to the `StandardEncoding` for Nonsymbolic fonts. There's obviously a risk that fixing this particular issue could break other PDF files for which we don't have tests. However I've tried to change the logic as little as possible in this patch, to hopefully reduce possible breakage. Based on debugging numerous font issue, it seems that a lot of fonts actually set the Symbolic flag, even when they are in fact not Symbolic. Fonts actually marked as Nonsymbolic seem to be somewhat less common, which I hope should reduce the risk of the patch somewhat. Fixes 7580.	2016-09-13 14:07:16 +02:00
Jonas Jenwald	325f7afcca	For embedded Type1 fonts without included `ToUnicode`/`Encoding` data, attempt to improve text selection by using the `builtInEncoding` to amend the `toUnicode` map (issue 6901, issue 7182, issue 7217, bug 917796, bug 1242142) Note that in order to prevent any possible issues, this patch does not try to amend the `toUnicode` data for Type1 fonts that contain either `ToUnicode` or `Encoding` entries in the font dictionary. Fixes, or at least improves, issues/bugs such as e.g. 6658, 6901, 7182, 7217, bug 917796, bug 1242142.	2016-09-11 20:54:10 +02:00
Yury Delendik	ffa99397ad	Merge pull request #7387 from Snuffleupagus/issue-5808 Attempt to ignore multiple identical Tf (setFont) commands in `PartialEvaluator_getTextContent` (issue 5808)	2016-08-30 15:21:41 -05:00
Jonas Jenwald	83ce6f0b6d	Adjust the (applicable) existing `isName` callsites to use the new `isName(v, name)` version of the function	2016-08-10 11:15:08 +02:00
Jonas Jenwald	77c6ed5389	Attempt to ignore multiple identical Tf (setFont) commands in `PartialEvaluator_getTextContent` (issue 5808) This patch improves the performance of issue 5808, but I'm not sure if it's enough to call it fixed. On average, this patch reduces the number of textLayer div's by a factor of 3, and it also reduces the time spend in `getTextContent` by a factor of ~2. The PDF file is generated by `Scribus PDF`, which for reasons I cannot understand is placing redundant `Tf` commands before every showText command. Note how the PDF file also contains lots of (basically) identical fonts, but with slightly different names, which causes unnecessary font-switching. This causes some unnecessary breaking of textLayer div's, but this issue cannot be easily worked around.	2016-07-27 21:37:52 +02:00
Yury Delendik	a02e2686b9	Merge pull request #7475 from Snuffleupagus/api-getTextContent-combineTextItems [api-minor] Add a parameter to `PDFPageProxy_getTextContent` that controls whether `PartialEvaluator_getTextContent` will attempt to combine same line text items	2016-07-27 08:34:24 -05:00
Brendan Dahl	5678486802	Merge pull request #7347 from Snuffleupagus/evaluator-more-Ref_toString Slightly refactor the `fontRef` handling in `PartialEvaluator_loadFont` (issue 7403 and issue 7402)	2016-07-22 17:21:47 -07:00
Brendan Dahl	50d6e4f147	Merge pull request #7447 from Snuffleupagus/buildToUnicode-notdef Ignore .notdef in the `differences` array when building a fallback `toUnicode` map in `PartialEvaluator_buildToUnicode` (issue 5256)	2016-07-22 14:33:32 -07:00
Jonas Jenwald	390c02a3e9	Attempt to cache fonts that are direct objects (i.e. `Dict`s), as opposed to `Ref`s, to prevent re-rendering after `cleanup` from breaking (issue 7403 and issue 7402) Fonts that are not referenced by `Ref`s are very uncommon in practice, but it can unfortunately happen. In this case, we're currently not caching them in the usual way, i.e. by `Ref`, which leads to failures when a page is rendered after `cleanup` has run. The simplest solution would have been to remove the `font.translated` workaround, but since this would have meant loading these kind of fonts over and over, the patch attempts to be a bit clever about this situation. Note that if we instead loaded fonts per page, instead of per document, this issue wouldn't have existed.	2016-07-21 16:04:07 +02:00
Jonas Jenwald	2e9cd3ea64	Slightly refactor the `fontRef` handling in `PartialEvaluator_loadFont` (issue 7403 and issue 7402) Originally, I was just going to change this code to use `Ref_toString` in a couple more places. When I started reading the code, I figured that it wouldn't hurt to clean up a couple of comments. While doing this, I noticed that the logic for the (rare) `isDict(fontRef)` case could do with a few improvements. There should be no functional changes with this patch, but given the added reference checks, we will now avoid bogus `Ref`s when resolving font aliases. In practice, as issue 7403 shows, the current code can break certain PDF files even if it's very rare. Note that the only thing that this patch will change, is the `font.loadedName` in the case where a `fontRef` is a reference and the font doesn't have a descriptor. Previously for `fontRef = Ref(4, 0)` we'd get `font.loadedName = 'g_d0_f4_0'`, and with this patch `font.loadedName = g_d0_f4R`, which is actually one character shorted in most cases. (Given that `Ref_toString` contains an optimization for the `gen === 0` case, which is by far the most common `gen` value.) In the already existing fallback case, where the `fontName` is used to when creating the `font.loadedName`, we allow any alphanumeric character. Hence I don't see how (as mentioned above) e.g. `font.loadedName = g_d0_f4R` would be an issue here.	2016-07-21 16:03:33 +02:00
Jonas Jenwald	f297e4d17c	[api-minor] Add a parameter to `PDFPageProxy_getTextContent` that controls whether `PartialEvaluator_getTextContent` will attempt to combine same line text items From the discussion in issue 7445, it seems that there may be cases where an API consumer would want to get the text content as is, without combined text items.	2016-07-19 13:38:57 +02:00
klemens	6f03f62327	trivial spelling fixes	2016-07-17 14:33:41 +02:00
Jonas Jenwald	bdd58ab1d2	Ignore .notdef in the `differences` array when building a fallback `toUnicode` map in `PartialEvaluator_buildToUnicode` (issue 5256) Fixes 5256.	2016-06-27 16:20:23 +02:00
Jonas Jenwald	b02d560ae0	Fix errors in `setGState` in `PartialEvaluator_getTextContent` that prevents text-selection from working properly Currently `setGState` is completely broken, and looking through the history of that code, it seems to me that this may never have worked correctly. This patch fixes the text-selection in `extgstate.pdf` in the test-suite, which is also added as a `text` test.	2016-06-01 22:58:49 +02:00
Jonas Jenwald	7ddb0bc718	Attempt to combine text runs positioned with `setTextMatrix`	2016-05-18 17:21:58 +02:00
Jonas Jenwald	6111c17c8a	Use `Dict_getArray` in more places in `src/core/` to avoid issues when Arrays contain indirect objects As evident from e.g. PRs 6485 and 7118, some bad PDF generators unfortunately create Arrays where some elements are indirect objects (i.e. `Ref`s). This seems to mostly affect Arrays that contain numbers, such as e.g. `Matrix/FontMatrix/BBox/FontBBox/Rect/Color/...`, and has manifested itself in PDF files that fail to render correctly (some elements are missing). The problem in both the cases above, besides broken rendering, was that there were no errors/warnings that indicated what the problem was, making it difficult to pinpoint the issue. Hence this patch, where I've audited all usages of `Dict_get` in `src/core/` files, and replaced it with `Dict_getArray` where appropriate to try and prevent unnecessary future bugs.	2016-05-05 19:42:57 +02:00
Jonas Jenwald	f59c3a0644	Remove the remaining usages of `new {Name,Cmd}` in favor of `{Name,Cmd}.get` Using `new {Name,Cmd}` should be avoided, since it creates a new object on every call, whereas `{Name,Cmd}.get` uses caches to only create one object regardless of how many times they are called. Most of these are found in the unit-tests, where increased memory usage probably doesn't matter very much. But it still seems good to get rid of those cases, since no part of the codebase ought to advertise that usage. Given the small size of the patch, I'm also tweaking a few comments and class names.	2016-04-08 12:14:05 +02:00
Yury Delendik	a250c150ab	Merge pull request #7134 from yurydelendik/circ-stream-colorspace Refactors to remove stream.js dependency on colorspace.js	2016-04-01 08:23:24 -05:00
Yury Delendik	35cbf74b12	Refactors to remove stream.js dependency on colorspace.js	2016-04-01 07:36:16 -05:00
Jonas Jenwald	05cf709f8e	Parse Type1 font files to determine the various `Length{n}` properties, instead of trusting the PDF file (issue 5686, issue 3928) Fixes 5686. Fixes 3928.	2016-03-31 11:08:12 +02:00
Brendan Dahl	4e2f70440f	Merge pull request #6711 from yurydelendik/errors Better errors capturing at the core and stop rendering on error.	2016-03-29 09:19:28 -07:00
Yury Delendik	bda5e6235e	Removes global PDFJS usage from the src/core/.	2016-03-23 19:24:37 -05:00
Manas	f6d28ca323	Refactors CMapFactory.create to make it async	2016-03-21 23:08:19 +05:30
Yury Delendik	8ba413e761	Better errors capturing at the core and stop rendering on error.	2016-03-11 07:59:09 -06:00
Jonas Jenwald	93ea866f01	Remove `getAll` from `EvaluatorPreprocessor_read` For the operators that we currently support, the arguments are not `Dict`s, which means that it's not really necessary to use `Dict_getAll` in `EvaluatorPreprocessor_read`. Also, I do think that if/when we support operators that use `Dict`s as arguments, that should be dealt with in the corresponding `case` in `PartialEvaluator_getOperatorList` which handles the operator. The only reason that I can find for using `Dict_getAll` like that, is that prior to PR 6550 we would just append certain (currently unsupported) operators without doing any further processing/checking. But as issue 6549 showed, that can lead to issues in practice, which is why it was changed. In an effort to prevent possible issue with unsupported operators, this patch simply ignores operators with `Dict` arguments in `PartialEvaluator_getOperatorList`.	2016-02-12 22:31:50 +01:00
Jonas Jenwald	f7f60197ce	Replace `getAll` with `getKeys` in `loadType3Data` Not only is `getAll` less efficient, but given that we actually need the keys here, using `getKeys` seems much more suitable.	2016-02-10 20:19:14 +01:00
Jonas Jenwald	07e1ad40a2	Replace `getAll` with `getKeys` in `PartialEvaluator_hasBlendModes` to speed up loading of badly generated PDF files (issue 6961) Some bad PDF generators, in particular "Scribus PDF", duplicates resources a lot at various levels of the PDF files. This can lead to `PartialEvaluator_hasBlendModes` taking an unreasonable amount of time to complete. The reason is that the current code is using `Dict_getAll`, which recursively dereferences all indirect objects, which can be really slow. This patch instead uses `Dict_getKeys`, and then manually looks up only the necessary indirect objects. I've added the PDF file as a `load` test. The most important thing here is probably to ensure that the file remains available in the repo, and the comment should help reduced the chance of regressions. (Note that locally, the `load` test times out without this patch, but we cannot really assume that that always happens.) Fixes 6961.	2016-02-10 17:21:38 +01:00
Jonas Jenwald	a1fe2cb443	Don't directly access the private `map` in `setGState`, and ensure that we avoid indirect objects This patch is based on something I noticed while debugging some of the PDF files in issue 6931. In a number of the cases in `setGState`, we're implicitly assuming that we're not dealing with indirect objects (i.e. `Ref`s). See e.g. the 'Font' case, or the various cases where we simply do `gStateObj.push([key, value]);` (since the code in `canvas.js` won't be able to deal with a `Ref` for those cases). The reason that I didn't use `Dict_forEach` instead, is that it would re-introduce the unncessary closures that PR 5205 removed.	2016-02-03 17:13:42 +01:00
Jonas Jenwald	2d4a1aa0af	Actually ignore no-op `setGState` (PR 5192 followup) The intention of PR 5192 was to avoid adding empty `setGState` ops to the operatorList. But the patch accidentally used `>=`, which means that it's not actually working as intended, since empty arrays always have `length === 0`.	2016-02-03 17:13:02 +01:00
Jonas Jenwald	4770b516fe	Correct the upper bound used when building the `transferMap` for SMasks (PR 6723 followup) Even though the currently known test-cases render correctly without this patch, that seems more like a lucky coincidence, given that there's no guarantee that `transferMap[255] === 0` for every possible transfer function.	2016-02-03 13:41:10 +01:00
Jonas Jenwald	992472fd38	Ensure that we don't modify the `Dict` data when the `Differences` array of a font contains indirect objects This patch fixes an issue that I inadvertently introduced in PR 5815, where we accidentally modify the `Differences` array in the encoding dictionary for indirect objects. Instead of this change, we could also have used the now existing `Dict_getArray`. However in this case I don't think that would have been a good idea, since it would mean iterating through the array twice.	2016-01-30 13:31:24 +01:00
Yury Delendik	2edf2792dc	Replaces literal {} created lookup tables with Object.create	2016-01-28 12:18:38 -06:00
Yury Delendik	d6adf84159	Lazify OP_MAP.	2016-01-28 12:18:37 -06:00
Yury Delendik	1de90454b7	Lazify Metrics	2016-01-28 12:11:46 -06:00
Yury Delendik	55a201d92d	Lazify NormalizedUnicodes	2016-01-28 11:56:42 -06:00
Yury Delendik	d0738d7e24	Lazify stdFontMap, serifFonts, GlyphMapForStandardFonts	2016-01-28 11:51:54 -06:00
Yury Delendik	1a9a665adf	Refactor Encodings	2016-01-28 11:32:59 -06:00

1 2 3 4 5