pdf.js

Author	SHA1	Message	Date
Jonas Jenwald	4acd31f51e	Merge pull request #7550 from Snuffleupagus/Type1-toUnicode-builtInEncoding-fallback For embedded Type1 fonts without included `ToUnicode`/`Encoding` data, attempt to improve text selection by using the `builtInEncoding` to amend the `toUnicode` map (issue 6901, issue 7182, issue 7217, bug 917796, bug 1242142)	2016-09-16 17:51:55 +02:00
Tim van der Meij	323e86c442	Text widget annotations: implement unit testing and sanitize data values	2016-09-13 14:57:11 +02:00
Jonas Jenwald	356b321f6d	Fallback to the `StandardEncoding` for Nonsymbolic fonts without `/Encoding` entry (issue 7580) Even though this patch passes all tests (unit/font/reference) locally, including the new ones that I added in PR 7621, I'm still a bit nervous about modifying the code that chooses the fallback encoding for fonts without an `/Encoding` entry. Note that over the years this code has been changed on a number of occasions, see a possibly incomplete [list here], to deal with various cases of incorrect font data. According to the PDF specification, see http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1904184, it seems that we should fall back to the `StandardEncoding` for Nonsymbolic fonts. There's obviously a risk that fixing this particular issue could break other PDF files for which we don't have tests. However I've tried to change the logic as little as possible in this patch, to hopefully reduce possible breakage. Based on debugging numerous font issues, it seems that a lot of fonts actually set the Symbolic flag, even when they are in fact not Symbolic. Fonts actually marked as Nonsymbolic seem to be somewhat less common, which I hope should reduce the risk of the patch somewhat. Fixes 7580.	2016-09-13 14:07:16 +02:00
Jonas Jenwald	f620f61887	Change `src/core/jpx.js` to use the `error` utility function instead of using `throw new Error` Note that in `parseCodestream` I purposely left the `throw new Error` instances inside of the `try` block, since we don't want to throw any `Errors` while in recovery mode. Finally, somewhat unrelated to the rest of the patch, I moved the `doNotRecover` variable declaration outside of the `try` block to avoid variable hoisting given that it's accessed inside the `catch` block.	2016-09-12 11:05:43 +02:00
Jonas Jenwald	325f7afcca	For embedded Type1 fonts without included `ToUnicode`/`Encoding` data, attempt to improve text selection by using the `builtInEncoding` to amend the `toUnicode` map (issue 6901, issue 7182, issue 7217, bug 917796, bug 1242142) Note that in order to prevent any possible issues, this patch does not try to amend the `toUnicode` data for Type1 fonts that contain either `ToUnicode` or `Encoding` entries in the font dictionary. Fixes, or at least improves, issues/bugs such as e.g. 6658, 6901, 7182, 7217, bug 917796, bug 1242142.	2016-09-11 20:54:10 +02:00
Jonas Jenwald	0b75f63c03	Don't duplicate the first entry in the `charCodeToGlyphId` map for CIDFontType2 fonts with a `CIDToGIDMap` that already mapped the first entry to a non-zero `glyphId` (issue 7544) Fixes 7544.	2016-09-09 22:33:41 +02:00
Tim van der Meij	b112f9f9f4	Merge pull request #7600 from Snuffleupagus/issue-7598 Check that Type1C fonts does not actually contain OpenType font files (issue 7598)	2016-09-09 22:02:58 +02:00
Jonas Jenwald	44b75c01a1	Check that Type1C fonts does not actually contain OpenType font files (issue 7598) This patch is yet another instalment in the (never ending) series of patches for PDF files that specify completely incorrect Type/Subtype for its fonts. In this case Type1/Type1C, when in fact OpenType would have been correct. Fixes 7598.	2016-09-06 10:13:11 +02:00
Tim van der Meij	576f742047	Improve the structure for widget annotations Currently, we only support text widget annotations (field type 'Tx') partially. However, the current code does not make this entirely clear and does not provide a warning when an unsupported field type is encountered, making it harder to determine why rendering fails. Moreover, in the display layer we make no distinction between the various types of widget annotations, causing the code for text widget annotations to also be executed for other types of widget annotations in a fallback situation. This patch improves the structure of the widget annotation code. In the core layer, we use the same structure we use for non-widget annotations in the factory and provide a clear warning when an unsupported type is encountered. In the display layer, we do the same and split the `WidgetAnnotationElement` class into two classes, namely `TextWidgetAnnotationElement` for text widget annotations and `WidgetAnnotationElement` for other unsupported annotations as a fallback. From this it is clear that we only support text widget annotations and nothing else.	2016-09-06 00:26:05 +02:00
Jonas Jenwald	a35773ec8c	Change `src/core/jpg.js` to use the `error` utility function instead of `throw`ing This allows us to remove the `try/catch` statements used in `src/core/stream.js` when parsing JPEG images. As far as I can tell, the only reason for the current usage of plain `throw` is that `jpg.js` originally was external code. Given that this code now lives in our repo, this patch brings the JPEG code more in line with e.g. `src/core/jpx.js` and `src/core/jbig2.js`.	2016-09-04 16:28:23 +02:00
Jonas Jenwald	1bbc694ac3	Assign the `quantizationTables` after parsing the entire JPEG image, to prevent issues when the DQT (Define Quantization Tables) marker is encountered after SOF{n} (Start of Frame) markers (issue 7406) This is a tentative patch that fixes 7406.	2016-08-31 18:42:05 +02:00
Yury Delendik	ffa99397ad	Merge pull request #7387 from Snuffleupagus/issue-5808 Attempt to ignore multiple identical Tf (setFont) commands in `PartialEvaluator_getTextContent` (issue 5808)	2016-08-30 15:21:41 -05:00
Tim van der Meij	f520616e00	Merge pull request #7570 from Snuffleupagus/issue-7569 Create a fallback annotation `id` for entries in `Annots` dictionaries that are not indirect objects (issue 7569)	2016-08-28 00:23:59 +02:00
Jonas Jenwald	088ce6c009	Add a unit-test to check that `ProblematicCharRanges` contains valid entries When adding new entries to `ProblematicCharRanges`, you have to be careful to not make any mistakes since that could cause glyph mapping issues. Currently the existing reference tests should probably help catch any errors, but based on experience I think that having a unit-test which specifically checks `ProblematicCharRanges` would be both helpful and timesaving when modifying/reviewing changes to this code. Hence this patch which adds a function (and unit-test) that is used to validate the entries in `ProblematicCharRanges`, and also checks that we don't accidentally add more character ranges than the Private Use Area can actually contain. The way that the validation code, and thus the unit-test, is implemented also means that we have an easy way to tell how much of the Private Use Area is potentially utilized by re-mapped characters.	2016-08-27 11:56:00 +02:00
Jonas Jenwald	78889646c8	Create a fallback annotation `id` for entries in `Annots` dictionaries that are not indirect objects (issue 7569) According to the PDF specification, see http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#page=86, entries in `Annots` dictionaries should be indirect objects, but obviously there're PDF generators that ignore this. Fixes 7569.	2016-08-27 10:56:16 +02:00
Tim van der Meij	b4c8814fc9	Merge pull request #7534 from Snuffleupagus/isName-name-check Add a parameter to the `isName` function that enables checking not just that something is a `Name`, but also that the actual `name` properties matches	2016-08-17 15:48:42 +02:00
Jonas Jenwald	544d29f5cb	Add a `recoveryMode` that suppresses errors from the `Parser`, and utilize it when searching for the main trailer in `XRef_indexObjects` (bug 1250079) Instead of having `Parser_getObj` fail unconditionally for the referenced PDF file, this patch attempts to let searching for the main trailer continue even if there are errors. Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1250079.	2016-08-17 12:37:35 +02:00
Jonas Jenwald	83ce6f0b6d	Adjust the (applicable) existing `isName` callsites to use the new `isName(v, name)` version of the function	2016-08-10 11:15:08 +02:00
Jonas Jenwald	af636aae96	Add a parameter to the `isName` function that enables checking not just that something is a `Name`, but also that the actual `name` properties matches This is similar to the existing `isCmd` and `isDict` functions, which already support similar kind of checks. With the updated `isName` function, we'll be able to simplify many callsites from: `isName(someVariable) && someVariable.name === 'someName'` to: `isName(someVariable, 'someName')`.	2016-08-10 11:15:03 +02:00
Jonas Jenwald	77c6ed5389	Attempt to ignore multiple identical Tf (setFont) commands in `PartialEvaluator_getTextContent` (issue 5808) This patch improves the performance of issue 5808, but I'm not sure if it's enough to call it fixed. On average, this patch reduces the number of textLayer div's by a factor of 3, and it also reduces the time spent in `getTextContent` by a factor of ~2. The PDF file is generated by `Scribus PDF`, which for reasons I cannot understand is placing redundant `Tf` commands before every showText command. Note how the PDF file also contains lots of (basically) identical fonts, but with slightly different names, which causes unnecessary font-switching. This causes some unnecessary breaking of textLayer div's, but this issue cannot be easily worked around.	2016-07-27 21:37:52 +02:00
Yury Delendik	a02e2686b9	Merge pull request #7475 from Snuffleupagus/api-getTextContent-combineTextItems [api-minor] Add a parameter to `PDFPageProxy_getTextContent` that controls whether `PartialEvaluator_getTextContent` will attempt to combine same line text items	2016-07-27 08:34:24 -05:00
Jonas Jenwald	558a22cd02	Prevent errors when parsing Annotations with missing (or invalid) /Subtype entries (issue 7446) Note that I used a separate warning message for this case, instead of utilizing the same one as in the unsupported subtype case, to more clearly indicate that the PDF file itself is to blame rather than PDF.js. Fixes 7446.	2016-07-25 13:59:26 +02:00
Brendan Dahl	5678486802	Merge pull request #7347 from Snuffleupagus/evaluator-more-Ref_toString Slightly refactor the `fontRef` handling in `PartialEvaluator_loadFont` (issue 7403 and issue 7402)	2016-07-22 17:21:47 -07:00
Brendan Dahl	50d6e4f147	Merge pull request #7447 from Snuffleupagus/buildToUnicode-notdef Ignore .notdef in the `differences` array when building a fallback `toUnicode` map in `PartialEvaluator_buildToUnicode` (issue 5256)	2016-07-22 14:33:32 -07:00
Jonas Jenwald	390c02a3e9	Attempt to cache fonts that are direct objects (i.e. `Dict`s), as opposed to `Ref`s, to prevent re-rendering after `cleanup` from breaking (issue 7403 and issue 7402) Fonts that are not referenced by `Ref`s are very uncommon in practice, but it can unfortunately happen. In this case, we're currently not caching them in the usual way, i.e. by `Ref`, which leads to failures when a page is rendered after `cleanup` has run. The simplest solution would have been to remove the `font.translated` workaround, but since this would have meant loading these kind of fonts over and over, the patch attempts to be a bit clever about this situation. Note that if we instead loaded fonts per page, instead of per document, this issue wouldn't have existed.	2016-07-21 16:04:07 +02:00
Jonas Jenwald	2e9cd3ea64	Slightly refactor the `fontRef` handling in `PartialEvaluator_loadFont` (issue 7403 and issue 7402) Originally, I was just going to change this code to use `Ref_toString` in a couple more places. When I started reading the code, I figured that it wouldn't hurt to clean up a couple of comments. While doing this, I noticed that the logic for the (rare) `isDict(fontRef)` case could do with a few improvements. There should be no functional changes with this patch, but given the added reference checks, we will now avoid bogus `Ref`s when resolving font aliases. In practice, as issue 7403 shows, the current code can break certain PDF files even if it's very rare. Note that the only thing that this patch will change, is the `font.loadedName` in the case where a `fontRef` is a reference and the font doesn't have a descriptor. Previously for `fontRef = Ref(4, 0)` we'd get `font.loadedName = 'g_d0_f4_0'`, and with this patch `font.loadedName = 'g_d0_f4R'`, which is actually one character shorter in most cases. (Given that `Ref_toString` contains an optimization for the `gen === 0` case, which is by far the most common `gen` value.) In the already existing fallback case, where the `fontName` is used when creating the `font.loadedName`, we allow any alphanumeric character. Hence I don't see how (as mentioned above) e.g. `font.loadedName = 'g_d0_f4R'` would be an issue here.	2016-07-21 16:03:33 +02:00
Tim van der Meij	10f9f11ec4	Merge pull request #7490 from Snuffleupagus/issue-7426 Don't map glyphs to the Lepcha Unicode block (issue 7426)	2016-07-21 14:39:19 +02:00
Jonas Jenwald	f297e4d17c	[api-minor] Add a parameter to `PDFPageProxy_getTextContent` that controls whether `PartialEvaluator_getTextContent` will attempt to combine same line text items From the discussion in issue 7445, it seems that there may be cases where an API consumer would want to get the text content as is, without combined text items.	2016-07-19 13:38:57 +02:00
Jonas Jenwald	90d19de935	Catch errors and continue parsing in `parseCMap` (issue 7492) After PR 7039, the PDF file in issue 7492 no longer renders at all, but note that text selection wasn't working correctly previously. The problem with the PDF file in issue 7492 is that the `cMap`, in the `toUnicode` entry in the font, contains an invalid name: ``` /CMapName /-usr-share-fonts-truetype-Panton-Panton Family-Fontfabric - Panton.otf,000-UTF16 def ``` When we parse that line, things obviously break because there are spaces present in the wrong places. To avoid that issue, the patch simply lets `parseCMap` continue when errors are encountered, to try and recover usable data. Note that by not aborting immediately when an error is encountered, we are also able to fix the text selection. Obviously, it could be argued that we should just immediately reject a corrupt `cMap`. But given that they usually are correct, it seems that trying to recover as much data as possible from a corrupt one can only be a good thing for both glyph mapping and text selection. Fixes 7492.	2016-07-18 16:39:56 +02:00
Jonas Jenwald	64783c8b6e	Don't map glyphs to the Lepcha Unicode block (issue 7426) In the PDF file in the issue, some of the glyphs end up being mapped to the Lepcha Unicode block; see https://en.wikipedia.org/wiki/Lepcha_(Unicode_block). This didn't use to matter, but after HarfBuzz updates that improved support for Lepcha fonts, in particular https://bugzilla.mozilla.org/show_bug.cgi?id=1249861, some glyphs are now moved horizontally. To avoid that, this patch adds the Lepcha block to the list of Unicode ranges that we skip when building the glyph mapping. Fixes 7426.	2016-07-17 16:53:36 +02:00
klemens	6f03f62327	trivial spelling fixes	2016-07-17 14:33:41 +02:00
Jonas Jenwald	51e46fa1a7	Change the `warn` to `info` in `recoverGlyphName` to reduce the console spam After PR 7441, where `recoverGlyphName` is used a lot more than before, many PDF files will generate a lot of warnings in the console. For normal usage, compared to debugging/development, this is probably more annoying than helpful.	2016-07-09 12:08:41 +02:00
Tim van der Meij	b6826a46a8	Merge pull request #7453 from simoncpu/master Expose the text widget's maximum length.	2016-07-07 01:03:40 +02:00
Brendan Dahl	1f3f4a8dd7	Merge pull request #7441 from Snuffleupagus/issue-7439 Fallback to attempt to recover standard glyph names when amending the `charCodeToGlyphId` with entries from the `differences` array in `type1FontGlyphMapping` (issue 7439)	2016-07-06 13:02:21 -07:00
Brendan Dahl	e2e657e44f	Merge pull request #7390 from Snuffleupagus/issue-7180 Add upper-case `I` as a possible space replacement fallback in `Font.spaceWidth` to improve text-selection (issue 7180)	2016-06-29 15:11:19 -07:00
Simon Cornelius P. Umacob	d872fc90b9	Expose the text widget's maximum length.	2016-06-29 17:04:33 +08:00
Jonas Jenwald	bdd58ab1d2	Ignore .notdef in the `differences` array when building a fallback `toUnicode` map in `PartialEvaluator_buildToUnicode` (issue 5256) Fixes 5256.	2016-06-27 16:20:23 +02:00
Jonas Jenwald	7866109af9	Fallback to attempt to recover standard glyph names when amending the `charCodeToGlyphId` with entries from the `differences` array in `type1FontGlyphMapping` (issue 7439) Fixes 7439.	2016-06-25 14:54:34 +02:00
Jonas Jenwald	c1ca268ef3	Skip mapping of glyphs to Unicode "Ideographic space" (issue 7416) Fixes 7416, which is an IE specific issue.	2016-06-22 08:58:00 +02:00
Tim van der Meij	f97d52182a	Merge pull request #7341 from Snuffleupagus/getDestinationHash-Array [api-minor] Improve handling of links that are using explicit destination arrays	2016-06-09 00:29:10 +02:00
Tim van der Meij	70b3eea4a3	Merge pull request #7389 from Snuffleupagus/move-isSpace Move the `isSpace` utility function from core/parser.js to shared/util.js	2016-06-08 22:57:45 +02:00
Jonas Jenwald	6a0b047bfa	Add upper-case `I` as a possible space replacement fallback in `Font.spaceWidth` to improve text-selection (issue 7180) In fonts with only upper-case glyphs, that are also missing a space glyph, `get spaceWidth` won't be able to return anything useful. By adding upper-case `I` as a fallback, we can thus improve text-selection in some PDF files. Note that locally, the patch causes slight movement in a few existing `text` tests, but in my opinion this actually looks like slight improvements. Fixes 7180.	2016-06-07 22:55:25 +02:00
Jonas Jenwald	6260fc09a3	Attempt to recover valid `format 3` FDSelect data from broken CFF fonts (bug 1146106) According to the CFF specification, see http://partners.adobe.com/public/developer/en/font/5176.CFF.pdf#G3.46884, for `format 3` FDSelect data: "The first range must have a ‘first’ GID of 0". Since the PDF file (attached in the bug) violates that part of the specification, this patch tries to recover valid FDSelect data to prevent OTS from rejecting the font. Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1146106.	2016-06-06 18:20:52 +02:00
Jonas Jenwald	a36a946976	Move the `isSpace` utility function from core/parser.js to shared/util.js Currently the `isSpace` utility function is a member of `Lexer`, which seems suboptimal, given that it's placed in `core/parser.js`. In practice, this means that in a number of `core/*.js` files we thus have an otherwise completely unnecessary dependency on `core/parser.js` for a one-line function. Instead, this patch moves `isSpace` into `shared/util.js` which seems more appropriate for this kind of utility function. Not to mention that since all the affected `core/*.js` files already depend on `shared/util.js`, this doesn't incur any more file dependencies.	2016-06-06 09:11:33 +02:00
Jonas Jenwald	b02d560ae0	Fix errors in `setGState` in `PartialEvaluator_getTextContent` that prevents text-selection from working properly Currently `setGState` is completely broken, and looking through the history of that code, it seems to me that this may never have worked correctly. This patch fixes the text-selection in `extgstate.pdf` in the test-suite, which is also added as a `text` test.	2016-06-01 22:58:49 +02:00
Jonas Jenwald	98fe094d18	Let non-viewable Popup Annotations inherit the parent's Annotation Flags if the parent is viewable Fixes http://www.pdf-archive.com/2013/09/30/file2/file2.pdf. Note how it's not possible to show the various Popup Annotations in the above document. To fix that, this patch lets the Popup inherit the flags of the parent, in the special case where the parent is `viewable` and the Popup is not. In general, I don't think that a Popup must have the same flags set as the parent. However, it seems very strange to have a `viewable` parent annotation, and then not being able to view the Popup. Annoyingly the PDF specification doesn't, as far as I can find, mention anything about how this case should be handled, but this patch seems consistent with the actual behaviour in Adobe Reader.	2016-05-25 23:00:26 +02:00
Brendan Dahl	b86610ffdb	Merge pull request #7300 from Snuffleupagus/bug-1068432 Prevent adding invalid values in `CFFDict_setByKey` (bug 1068432)	2016-05-24 12:12:38 -07:00
Jonas Jenwald	b354682dd6	[api-minor] Let `LinkAnnotation`/`PDFLinkService_getDestinationHash` return a stringified version of the destination array for explicit destinations Currently for explicit destinations, compared to named destinations, we manually try to build a hash that often times is a quite poor representation of the actual destination. (Currently this only, kind of, works for `/XYZ` destinations.) For PDF files using explicit destinations, this can make it difficult/impossible to obtain a link to a specific section of the document through the URL. Note that in practice most PDF files, especially newer ones, use named destinations and these are thus unaffected by this patch. This patch also fixes an existing issue in `PDFLinkService_getDestinationHash`, where a named destination consisting of only a number would not be handled correctly. With the added, and already existing, type checks in place for destinations, I really don't think that this patch exposes any "sensitive" internal destination code not already accessible through normal hash parameters. Please note: Just trying to improve the algorithm that generates the hash is unfortunately not possible in general, since there are a number of cases where it will simply never work well. - First of all, note that `getDestinationHash` currently relies on the `_pagesRefCache`, hence it's possible that the hash returned is empty during e.g. ranged/streamed loading of a PDF file. - Second of all, the currently computed hash is actually dependent on the document rotation. With named destinations, the fetched internal destination array is rotational invariant (as it should be), but this will not hold in general for the hash. We can easily avoid this issue by using a stringified destination array. - Third of all, note that according to the PDF specification[1], `GoToR` destinations may actually contain explicit destination arrays. Since we cannot really construct a hash in `annotation.js`, we currently have no good way to support those. Even though this case seems very rare in practice (I've not actually seen such a PDF file), it's in the specification, and this patch allows us to support that for "free". --- [1] http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G11.1951685	2016-05-21 14:14:07 +02:00
Jonas Jenwald	01ab15a6f1	[api-minor] Let `Catalog_getPageIndex` check that the `Ref` actually points to a /Page dictionary Currently the `getPageIndex` method will happily return `0`, even if the `Ref` parameter doesn't actually point to a proper /Page dictionary. Having the API trust that the consumer is doing the right thing seems error-prone, hence this patch which adds a check for this case. Given that the `Catalog_getPageIndex` method isn't used in any hot part of the codebase, this extra check shouldn't be a problem. (Note: in the standard viewer, it is only ever used from `PDFLinkService_navigateTo` if a destination needs to be resolved during document loading, which isn't common enough to be an issue IMHO.)	2016-05-21 14:13:41 +02:00
Tim van der Meij	db46829ef7	Merge pull request #7316 from timvandermeij/remove-unused Remove unused variables	2016-05-21 14:07:33 +02:00
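
The `isName(v, name)` change described in commit af636aae96 can be illustrated with a minimal sketch. This is not the PDF.js implementation: the `Name` class below is a hypothetical stand-in for the library's primitive, and the body of `isName` is an assumption based only on the behaviour the commit message describes (an optional second argument that also checks the `name` property).

```javascript
// Hypothetical stand-in for the PDF.js `Name` primitive (assumption).
class Name {
  constructor(name) {
    this.name = name;
  }
}

// Sketch of the two-argument form: with `name` omitted it only checks the
// type; with `name` given it also compares the `name` property.
function isName(v, name) {
  return v instanceof Name && (name === undefined || v.name === name);
}

const subtype = new Name('Type1C');

// Old call-site pattern quoted in the commit message:
const oldStyle = isName(subtype) && subtype.name === 'Type1C';
// Simplified pattern enabled by the extra parameter:
const newStyle = isName(subtype, 'Type1C');

console.log(oldStyle, newStyle);
```

Both forms should agree for any input; the two-argument form simply avoids repeating the variable at each call site.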

1055 Commits