For Arabic characters, the Unicode character codes are mapped to Unicode
character types using the character codes for indexing. However, the
character code 0x061D is undefined (and therefore invalid) in the
Unicode standard. The imported list does not contain this entry, but not
having it in the list breaks indexing for items after it. Therefore, put
an empty string on its position to make indexing work properly and issue
a warning in the unlikely event that we encounter this character.
*This patch fixes something that I noticed while debugging https://bugzilla.mozilla.org/show_bug.cgi?id=1308536.*
The PDF file contains a font called "NuptialScript", which unfortunately is not embedded. Since that is a non-standard font we will not be able to render it entirely correct. However, by adding "NuptialScript" to the `getNonStdFontMap`, we can at least improve the rendering slightly by using an italic (serif) fallback font.
This patch adds support for non-embedded Arial Black fonts, that use a `Arial-Black...` format for the font names.
Also, this patch changes `canvas.js` such that we always render Arial Black fonts with the maximum weight, which actually improves a number of existing test-cases. This should thus explain the test "failures", which are clear improvements compared with e.g. Adobe Reader.
Fixes 7835.
For `PartialEvaluator_getTextContent`, the same `args` Array should be re-used for every `EvaluatorPreprocessor_read` call. Hence we want to ensure that it's not accidentally replaced with `null` in `EvaluatorPreprocessor_read`, since otherwise corrupt PDF files (with too few arguments for certain commands) will cause errors in `PartialEvaluator_getTextContent`.
Perhaps a micro-optimization, but this patch also changes two `!args` comparisons to `args === null`, since that should be a tiny bit more efficient.
By only allowing very specific type of `JavaScript` actions, and also utilizing the existing `URL` validation, this patch shouldn't pose too much risk.
Fixes one of the points in issue 3897 (with the PDF file taken from issue 3438).
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=843699 (probably, since that bug doesn't contain a test-case).
It seems that certain bad PDF generators can create badly encoded "Prefix" entries for Page Labels, one example being http://ukjewishfilm.org/wp-content/uploads/2015/09/Jewish-Film-Festival-Programme-ONLINE.pdf.
Unfortunately I didn't come across such a PDF file while adding the API support for Page Labels, but with them now being used in the viewer I just found this issue. With this patch, we now display the Page Labels in the same way as Adobe Reader.
The original code is difficult to read and, more importantly, performs
actions that are not described in the specification. It replaces empty
names with a backtick and an index, but this behavior is not described
in the specification. While the specification is not entirely clear
about what should happen in this case, it does specify that the `T`
field is optional and that multiple field dictionaries may have the same
fully qualified name, so to achieve this it makes the most sense to
ignore missing `T` fields during construction of the field name. This is
the most specification-compliant solution and, judging by opened issue #6623, also the required and expected behavior.
In general we neither want, nor can, support arbitrary `Launch` actions. But in practice, all the cases we've seen so far just contains relative URLs to other PDF files. Building on PR 7689, we can thus at least support basic `Launch` actions.
Note that in `FIREFOX/MOZCENTRAL/CHROME` builds of the standard viewer the `docBaseUrl` parameter will be set by default, since in that case it makes sense to use the current URL as a base.
For the `GENERIC` viewer, or the API itself, it doesn't make sense to try and set the `docBaseUrl` by default. However, custom deployments/implementations may still find the parameter useful.
Note that this will automatically reject any relative URL.
To make the API more useful to consumers, URLs that are rejected will be available via the `unsafeUrl` property in the data object returned by `PDFPageProxy_getAnnotations`.
The patch also adds a bit more validation of the data for `Named` actions.
This not only reduces code duplication, but it also allow us to easily support the same kind of URLs we currently do for Link annotations in the Outline as well.
Let `Parser_makeFilter` pass in the `DecodeParms` data to various image `Stream`s, instead of re-fetching it in various `[...]Stream.prototype.ensureBuffer` methods
In `Parser_filter` the `DecodeParms` data is fetched and passed to `Parser_makeFilter`, where we also make sure that a `Ref` is resolved to a direct object.
We can thus pass this along to the various image `Stream` constructors, to avoid the current situation where we lookup/resolve data that is already available.
Note also that we currently do *not* handle the case where `DecodeParms` is an Array entirely correct in the various image `Stream`s, and this patch fixes that for free.
While the array argument to TJ should only contain strings and numbers, other
unfortunate items are found in PDFs in the wild, e.g.:
[(Grandes) 0.0 Tc
-250.0 (Client\350les,) 0.0 Tc
-250.0 (Financements) 0.0 Tc
-250.0 (et) 0.0 Tc
-250.0 (March\351s) ] TJ
getOperatorList already properly ignores any non-string, non-numeric values in
TJ arrays; without this patch to getTextContent, returned text items can have
NaN widths due to calculations being applied to those non-numeric values.
For PDF files with multiple `/Filter`s, where the `/Length` entry is zero, we fail to render the file correctly. The reason is that `maybeLength` is `null` for the every filter except the first, and `!maybeLength` is thus truthy.
Hence it seems that we should completely ignore the `/Length` entry and also explicitly check `maybeLength === 0`.
Note that I've not (yet) come across a PDF file with this issue in the wild, but given all the stupid things PDF generators do I wouldn't be surprised if such a file actually exists. In order to prevent a possible future bug, I'm submitting this patch which includes a hand-edited PDF file that we currently cannot render correctly (but e.g. Adobe Reader can).
Directly use the hexadecimal representation, just like the
`AnnotationFlags`, to avoid calculations and to improve readability.
This allows us to simplify the unit tests for text widget annotations as
well.
When debugging issue 7643, I noticed that the `forms` tests currently doesn't look like the rendering in the viewer (with `renderInteractiveForms = true` set).
After scratching my head for a little while, I realized that PR 7633 make the implicit assumption that `Page_getOperatorList` (in `core/document.js`) is called *before* fetching the annotation with `PDFPageProxy_getAnnotations` (in `display/api.js`).
Hence this patch, that changes it so that we instead pass in the `renderInteractiveForms` parameter to `Annotation_appendToOperatorList` to ensure that it's always correctly set.
If interactive forms are enabled, then the display layer takes care of
rendering the form elements. There is no need to draw them on the canvas
as well. This also leads to issues when values are prefilled, because
the text fields are transparent, so the contents that have been rendered
onto the canvas will be visible too.
We address this issue by passing the `renderInteractiveForms` parameter
to the render task and handling it when the page is rendered (i.e., when
the canvas is rendered).
This patch improves the unit tests by testing the support for read-only
and multiline fields. Moreover, we add a reference test to ensure that
the text widgets are not only rendered, but also that their contents are
styled properly.
Finally, we perform minor improvements in `src/core/annotation.js`, for
example adding missing comments.
For embedded Type1 fonts without included `ToUnicode`/`Encoding` data, attempt to improve text selection by using the `builtInEncoding` to amend the `toUnicode` map (issue 6901, issue 7182, issue 7217, bug 917796, bug 1242142)
Even though this patch passes all tests (unit/font/reference) locally, including the new ones that I added in PR 7621, I'm still a bit nervous about modifying the code that choose the fallback encoding for fonts without an `/Encoding` entry.
Note that over the years this code has been changed on a number of occasions, see a possibly incomplete [list here], to deal with various cases of incorrect font data.
According to the PDF specification, see http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1904184, it seems that we should fallback to the `StandardEncoding` for Nonsymbolic fonts.
There's obviously a risk that fixing this particular issue *could* break other PDF files for which we don't have tests. However I've tried to change the logic as little as possible in this patch, to hopefully reduce possible breakage.
Based on debugging numerous font issue, it seems that a lot of fonts actually set the Symbolic flag, even when they are in fact *not* Symbolic. Fonts actually marked as Nonsymbolic seem to be somewhat less common, which I hope should reduce the risk of the patch somewhat.
Fixes 7580.
Note that in `parseCodestream` I purposly left the `throw new Error` instances inside of the `try` block, since we don't want to throw any `Errors` while in recovery mode.
Finally somewhat unrelated to the rest of the patch, but I moved the `doNotRecover` variable declaration outside of the `try` block to avoid variable hoisting given that it's accessed inside the `catch` block.
Note that in order to prevent any possible issues, this patch does *not* try to amend the `toUnicode` data for Type1 fonts that contain either `ToUnicode` or `Encoding` entries in the font dictionary.
Fixes, or at least improves, issues/bugs such as e.g. 6658, 6901, 7182, 7217, bug 917796, bug 1242142.
This patch is yet another instalment in the (never ending) series of patches for PDF files that specify completely incorrect Type/Subtype for its fonts. In this case Type1/Type1C, when in fact OpenType would have been correct.
Fixes 7598.
Currently, we only support text widget annotations (field type 'Tx')
partially. However, the current code does not make this entirely clear
and does not provide a warning when an unsupported field type is
encountered, making it harder to determine why rendering fails.
Moreover, in the display layer we make no distinction between the
various types of widget annotations, causing the code for text widget
annotations to also be executed for other types of widget annotations in
a fallback situation.
This patch improves the structure of the widget annotation code. In the
core layer, we use the same structure we use for non-widget annotations
in the factory and provide a clear warning when an unsupported type is
encountered. In the display layer, we do the same and split the
`WidgetAnnotationElement` class into two classes, namely
`TextWidgetAnnotationElement` for text widget annotations and
`WidgetAnnotationElement` for other unsupported annotations as a
fallback. From this it clear that we only support text widget
annotations and nothing else.
This allows us to remove the `try/catch` statements used in `src/core/stream.js` when parsing JPEG images.
As far as I can tell, the only reason for the current usage of plain `throw` is that `jpg.js` originally was external code. Given that this code now lives in our repo, this patch brings the JPEG code more in line with e.g. `src/core/jpx.js` and `src/core/jbig2.js`.
When adding new entries to `ProblematicCharRanges`, you have to be careful to not make any mistakes since that could cause glyph mapping issues.
Currently the existing reference tests should probably help catch any errors, but based on experience I think that having a unit-test which specifically checks `ProblematicCharRanges` would be both helpful and timesaving when modifying/reviewing changes to this code.
Hence this patch which adds a function (and unit-test) that is used to validate the entries in `ProblematicCharRanges`, and also checks that we don't accidentally add more character ranges than the Private Use Area can actually contain.
The way that the validation code, and thus the unit-test, is implemented also means that we have an easy way to tell how much of the Private Use Area is potentially utilized by re-mapped characters.
Instead of having `Parser_getObj` fail unconditionally for the referenced PDF file, this patch attempts to let searching for the main trailer continue even if there are errors.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1250079.
This is similar to the existing `isCmd` and `isDict` functions, which already support similar kind of checks.
With the updated `isName` function, we'll be able to simplify many callsites from: `isName(someVariable) && someVariable.name === 'someName'` to: `isName(someVariable, 'someName')`.
This patch improves the performance of issue 5808, but I'm not sure if it's enough to call it fixed. On average, this patch reduces the number of textLayer div's by a factor of 3, and it also reduces the time spend in `getTextContent` by a factor of ~2.
The PDF file is generated by `Scribus PDF`, which for reasons I cannot understand is placing redundant `Tf` commands before *every* showText command.
Note how the PDF file also contains lots of (basically) identical fonts, but with slightly different names, which causes unnecessary font-switching. This causes some unnecessary breaking of textLayer div's, but this issue cannot be easily worked around.
[api-minor] Add a parameter to `PDFPageProxy_getTextContent` that controls whether `PartialEvaluator_getTextContent` will attempt to combine same line text items
Note that I used a separate warning message for this case, instead of utilizing the same one as in the unsupported subtype case, to more clearly indicate that the PDF file itself is to blame rather than PDF.js.
Fixes 7446.
Fonts that are not referenced by `Ref`s are very uncommon in practice, but it can unfortunately happen. In this case, we're currently not caching them in the usual way, i.e. by `Ref`, which leads to failures when a page is rendered after `cleanup` has run.
The simplest solution would have been to remove the `font.translated` workaround, but since this would have meant loading these kind of fonts over and over, the patch attempts to be a bit clever about this situation.
Note that if we instead loaded fonts per *page*, instead of per document, this issue wouldn't have existed.
Originally, I was just going to change this code to use `Ref_toString` in a couple more places. When I started reading the code, I figured that it wouldn't hurt to clean up a couple of comments. While doing this, I noticed that the logic for the (rare) `isDict(fontRef)` case could do with a few improvements.
There should be no functional changes with this patch, but given the added reference checks, we will now avoid bogus `Ref`s when resolving font aliases. In practice, as issue 7403 shows, the current code can break certain PDF files even if it's very rare.
Note that the only thing that this patch will change, is the `font.loadedName` in the case where a `fontRef` is a reference *and* the font doesn't have a descriptor. Previously for `fontRef = Ref(4, 0)` we'd get `font.loadedName = 'g_d0_f4_0'`, and with this patch `font.loadedName = g_d0_f4R`, which is actually one character shorted in most cases. (Given that `Ref_toString` contains an optimization for the `gen === 0` case, which is by far the most common `gen` value.)
In the already existing fallback case, where the `fontName` is used to when creating the `font.loadedName`, we allow any alphanumeric character. Hence I don't see how (as mentioned above) e.g. `font.loadedName = g_d0_f4R` would be an issue here.
From the discussion in issue 7445, it seems that there may be cases where an API consumer would want to get the text content as is, without combined text items.
After PR 7039, the PDF file in issue 7492 no longer renders at all, but note that text selection wasn't working correctly previously.
The problem with the PDF file in issue 7492 is that the `cMap`, in the `toUnicode` entry in the font, contains an invalid name:
```
/CMapName /-usr-share-fonts-truetype-Panton-Panton Family-Fontfabric - Panton.otf,000-UTF16 def
```
When we parse that line, things obviously break because there are spaces present in the wrong places.
To avoid that issue, the patch simply lets `parseCMap` continue when errors are encountered, to try and recover usable data. Note that by not aborting immediatly when an error is encountered, we are also able to fix the text selection.
Obviously, it could be argued that we should just immediatly reject a corrupt `cMap`. But given that they usually are correct, it seems that trying to recover as much data as possible from corrupt one can only be a good thing for both glyph mapping and text selection.
Fixes 7492.
In the PDF file in the issue, some of the glyphs end up being mapped to the Lepcha Unicode block; see https://en.wikipedia.org/wiki/Lepcha_(Unicode_block).
This didn't use to matter, but after HarfBuzz updates that improved support for Lepcha fonts, in particular https://bugzilla.mozilla.org/show_bug.cgi?id=1249861, some glyphs are now moved horizontally.
To avoid that, this patch adds the Lepcha block to the list of Unicode ranges that we skip when building the glyph mapping.
Fixes 7426.
After PR 7441, where `recoverGlyphName` is used a lot more than before, many PDF files will generate a lot of warnings the console. For normal usage, compared to debugging/development, this is probably more annoying than helpful.
Fallback to attempt to recover standard glyph names when amending the `charCodeToGlyphId` with entries from the `differences` array in `type1FontGlyphMapping` (issue 7439)
In fonts with only upper-case glyphs, that are also missing a space glyph, `get spaceWidth` won't be able to return anything useful.
By adding upper-case `I` as a fallback, we can thus improve text-selection in some PDF files.
Note that locally, the patch causes slight movement in a few existing `text` tests, but in my opinion this actually looks like slight improvements.
Fixes 7180.
Currently the `isSpace` utility function is a member of `Lexer`, which seems suboptimal, given that it's placed in `core/parser.js`. In practice, this means that in a number of `core/*.js` files we thus have an *otherwise* completely unnecessary dependency on `core/parser.js` for a one-line function.
Instead, this patch moves `isSpace` into `shared/util.js` which seems more appropriate for this kind of utility function. Not to mention that since all the affected `core/*.js` files already depends on `shared/util.js`, this doesn't incur any more file dependencies.
Currently `setGState` is completely broken, and looking through the history of that code, it seems to me that this may never have worked correctly.
This patch fixes the text-selection in `extgstate.pdf` in the test-suite, which is also added as a `text` test.
Fixes http://www.pdf-archive.com/2013/09/30/file2/file2.pdf.
Note how it's not possible to show the various Popup Annotations in the above document.
To fix that, this patch lets the Popup inherit the flags of the parent, in the special case where the parent is `viewable` *and* the Popup is not.
In general, I don't think that a Popup must have the same flags set as the parent. However, it seems very strange to have a `viewable` parent annotation, and then not being able to view the Popup.
Annoyingly the PDF specification doesn't, as far as I can find, mention anything about how this case should be handled, but this patch seem consistent with the actual behaviour in Adobe Reader.
Currently for explicit destinations, compared to named destinations, we manually try to build a hash that often times is a quite poor representation of the *actual* destination. (Currently this only, kind of, works for `\XYZ` destinations.)
For PDF files using explicit destinations, this can make it difficult/impossible to obtain a link to a specific section of the document through the URL.
Note that in practice most PDF files, especially newer ones, use named destinations and these are thus unnaffected by this patch.
This patch also fixes an existing issue in `PDFLinkService_getDestinationHash`, where a named destination consisting of only a number would not be handled correctly.
With the added, and already existing, type checks in place for destinations, I really don't think that this patch exposes any "sensitive" internal destination code not already accessible through normal hash parameters.
*Please note:* Just trying to improve the algorithm that generates the hash is unfortunately not possible in general, since there are a number of cases where it will simply never work well.
- First of all, note that `getDestinationHash` currently relies on the `_pagesRefCache`, hence it's possible that the hash returned is empty during e.g. ranged/streamed loading of a PDF file.
- Second of all, the currently computed hash is actually dependent on the document rotation. With named destinations, the fetched internal destination array is rotational invariant (as it should be), but this will not hold in general for the hash. We can easily avoid this issue by using a stringified destination array.
- Third of all, note that according to the PDF specification[1], `GoToR` destinations may actually contain explicit destination arrays. Since we cannot really construct a hash in `annotation.js`, we currently have no good way to support those. Even though this case seems *very* rare in practice (I've not actually seen such a PDF file), it's in the specification, and this patch allows us to support that for "free".
---
[1] http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G11.1951685
Currently the `getPageIndex` method will happily return `0`, even if the `Ref` parameter doesn't actually point to a proper /Page dictionary.
Having the API trust that the consumer is doing the right thing seems error-prone, hence this patch which adds a check for this case.
Given that the `Catalog_getPageIndex` method isn't used in any hot part of the codebase, this extra check shouldn't be a problem.
(Note: in the standard viewer, it is only ever used from `PDFLinkService_navigateTo` if a destination needs to be resolved during document loading, which isn't common enough to be an issue IMHO.)