I just realized that none of our current unit-tests cover this particular part of the Page Label parsing code, hence this patch adjusts an existing test PDF to include a "St" entry in the Page Label dictionary.
*This patch fixes something that I noticed while debugging https://bugzilla.mozilla.org/show_bug.cgi?id=1308536.*
The PDF file contains a font called "NuptialScript", which unfortunately is not embedded. Since that is a non-standard font, we will not be able to render it entirely correctly. However, by adding "NuptialScript" to the `getNonStdFontMap`, we can at least improve the rendering slightly by using an italic (serif) fallback font.
This patch adds support for non-embedded Arial Black fonts that use an `Arial-Black...` format for the font names.
Also, this patch changes `canvas.js` such that we always render Arial Black fonts with the maximum weight, which actually improves a number of existing test-cases. This explains the test "failures", which are in fact clear improvements compared with e.g. Adobe Reader.
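To illustrate the idea (the `fontObj` properties and the exact weight value are placeholders here, not the actual `canvas.js` code), forcing the maximum weight when composing the CSS font string could look roughly like this:

```
// Sketch: when building the CSS font string for the canvas, force the
// maximum weight for Arial Black, since it has no lighter weights and
// otherwise renders too thin with substituted fonts.
function buildFontStyle(fontObj, size) {
  var isArialBlack = /Arial-?Black/i.test(fontObj.name);
  var bold = isArialBlack ? '900' : (fontObj.bold ? 'bold' : 'normal');
  var italic = fontObj.italic ? 'italic' : 'normal';
  return italic + ' ' + bold + ' ' + size + 'px ' + fontObj.loadedName;
}
```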
Fixes 7835.
It seems that certain bad PDF generators can create badly encoded "Prefix" entries for Page Labels, one example being http://ukjewishfilm.org/wp-content/uploads/2015/09/Jewish-Film-Festival-Programme-ONLINE.pdf.
Unfortunately I didn't come across such a PDF file while adding the API support for Page Labels, but with them now being used in the viewer I just found this issue. With this patch, we now display the Page Labels in the same way as Adobe Reader.
Note that in `FIREFOX/MOZCENTRAL/CHROME` builds of the standard viewer the `docBaseUrl` parameter will be set by default, since in that case it makes sense to use the current URL as a base.
For the `GENERIC` viewer, or the API itself, it doesn't make sense to try and set the `docBaseUrl` by default. However, custom deployments/implementations may still find the parameter useful.
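For reference, a custom deployment that *does* want the parameter could pass it explicitly along these lines (a minimal sketch; the URLs are of course just examples):

```
// Sketch: explicitly providing a base URL, so that relative links in the
// document resolve against the location of the PDF file itself.
PDFJS.getDocument({
  url: 'https://example.org/files/some-document.pdf',
  docBaseUrl: 'https://example.org/files/',
}).promise.then(function (pdfDocument) {
  // Use the document as usual...
});
```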
While the array argument to TJ should only contain strings and numbers, other
unfortunate items are found in PDFs in the wild, e.g.:
```
[(Grandes) 0.0 Tc
-250.0 (Client\350les,) 0.0 Tc
-250.0 (Financements) 0.0 Tc
-250.0 (et) 0.0 Tc
-250.0 (March\351s) ] TJ
```
getOperatorList already properly ignores any non-string, non-numeric values in
TJ arrays; without this patch to getTextContent, returned text items can have
NaN widths due to calculations being applied to those non-numeric values.
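The guard this implies can be sketched as follows (a standalone illustration, not the actual `getTextContent` code): skip anything in a TJ array that is neither a string nor a number before doing any width calculations.

```
// Sketch: ignore invalid items in a TJ array, so that width calculations
// are only applied to the values the specification actually allows.
function processTJArray(items, showText, applyAdjustment) {
  for (var i = 0, ii = items.length; i < ii; i++) {
    var item = items[i];
    if (typeof item === 'string') {
      showText(item);
    } else if (typeof item === 'number' && !isNaN(item)) {
      applyAdjustment(item); // e.g. adjust the current text position/width.
    }
    // Anything else (operator residue such as `Tc`, null, etc.) is ignored.
  }
}
```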
For PDF files with multiple `/Filter`s, where the `/Length` entry is zero, we fail to render the file correctly. The reason is that `maybeLength` is `null` for every filter except the first, and `!maybeLength` is thus truthy.
Hence it seems that we should completely ignore the `/Length` entry and also explicitly check `maybeLength === 0`.
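To make the distinction concrete (a standalone illustration only, not the actual stream-creation code): the old `!maybeLength` check treats an explicit zero and a missing length the same way, whereas an explicit comparison keeps them apart.

```
// Sketch: `!maybeLength` conflates two very different cases, whereas an
// explicit comparison keeps them apart.
function describeLength(maybeLength) {
  if (maybeLength === 0) {
    return 'explicitly empty'; // The /Length entry is (claimed to be) zero.
  }
  if (maybeLength === null) {
    return 'unknown'; // e.g. the second/third filter in a /Filter array.
  }
  return 'known: ' + maybeLength + ' bytes';
}
```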
Note that I've not (yet) come across a PDF file with this issue in the wild, but given all the stupid things PDF generators do I wouldn't be surprised if such a file actually exists. In order to prevent a possible future bug, I'm submitting this patch which includes a hand-edited PDF file that we currently cannot render correctly (but e.g. Adobe Reader can).
This patch improves the unit tests by testing the support for read-only
and multiline fields. Moreover, we add a reference test to ensure that
the text widgets are not only rendered, but also that their contents are
styled properly.
Finally, we perform minor improvements in `src/core/annotation.js`, for
example adding missing comments.
Even though this patch passes all tests (unit/font/reference) locally, including the new ones that I added in PR 7621, I'm still a bit nervous about modifying the code that chooses the fallback encoding for fonts without an `/Encoding` entry.
Note that over the years this code has been changed on a number of occasions, see a possibly incomplete [list here], to deal with various cases of incorrect font data.
According to the PDF specification, see http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1904184, it seems that we should fall back to `StandardEncoding` for Nonsymbolic fonts.
There's obviously a risk that fixing this particular issue *could* break other PDF files for which we don't have tests. However, I've tried to change the logic as little as possible in this patch, to hopefully reduce possible breakage.
Based on debugging numerous font issues, it seems that a lot of fonts actually set the Symbolic flag, even when they are in fact *not* Symbolic. Fonts actually marked as Nonsymbolic seem to be somewhat less common, which I hope should reduce the risk of this patch somewhat.
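In terms of the FontDescriptor flags, the intended behaviour boils down to something like this sketch (only the flag bit values come from the specification; the surrounding structure and names are illustrative):

```
// Flag bits from the PDF specification ("Font flags").
var FontFlags = {
  Symbolic: 0x04,    // bit 3
  Nonsymbolic: 0x20, // bit 6
};

// Sketch: pick a fallback encoding for a font without an /Encoding entry.
function chooseFallbackEncoding(flags) {
  if (flags & FontFlags.Nonsymbolic) {
    // Nonsymbolic fonts should fall back to StandardEncoding.
    return 'StandardEncoding';
  }
  // Symbolic fonts (including fonts that incorrectly claim to be Symbolic)
  // keep the font's built-in encoding instead.
  return 'BuiltInEncoding';
}
```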
Fixes 7580.
Note that in order to prevent any possible issues, this patch does *not* try to amend the `toUnicode` data for Type1 fonts that contain either `ToUnicode` or `Encoding` entries in the font dictionary.
Fixes, or at least improves, issues/bugs such as e.g. 6658, 6901, 7182, 7217, bug 917796, bug 1242142.
I've started to look into how we can fix issue 7580, but quickly became worried that fixing it could easily mean that we'd trade one fixed PDF file for a multitude of broken ones.
Hence I started going through the history of the code that chooses the fallback encoding, and noticed that it has been changed a number of times over the years to deal with various cases of weirdness/errors in non-embedded fonts.
To my relief it turned out that almost all the PRs, please see a possibly incomplete [list here], that changed this code actually included `eq` test-cases.
However, in one case it appears that a PR failed to add a test-case. Furthermore, since the fallback encoding may also be the only source for creating a `toUnicode` map, changing the encoding could possibly regress only the text-selection, despite a PDF file still rendering correctly.
Therefore, this PR adds one new `eq` test, and also a number of additional `text` tests for PDF files already present in the test-suite.
Note that it's obviously possible that there's a certain overlap between the added tests, but I'd be *a whole lot* more concerned about causing regressions than about some redundant test coverage.
This patch is yet another instalment in the (never ending) series of patches for PDF files that specify completely incorrect Type/Subtype for their fonts. In this case Type1/Type1C, when in fact OpenType would have been correct.
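The tell-tale sign is the sfnt signature at the start of the font data; a minimal standalone check (not the actual pdf.js detection code) might look like:

```
// Sketch: a font claimed to be Type1/Type1C, but whose data starts with
// the 'OTTO' sfnt tag, is really a CFF-based OpenType font.
function looksLikeOpenType(data) {
  return data.length >= 4 &&
         String.fromCharCode(data[0], data[1], data[2], data[3]) === 'OTTO';
}
```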
Fixes 7598.
The PDF file contains an image that we're allowed to use, since it's just the PDF.js logo.
The logo image was simply inverted (so that it requires a /Decode entry in the image dictionary that triggers the use of `jpg.js` instead of the browser), converted to JPEG, and finally edited by hand to change the order of the DQT/SOF{n} markers.
Instead of having `Parser_getObj` fail unconditionally for the referenced PDF file, this patch attempts to let searching for the main trailer continue even if there are errors.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1250079.
This patch improves the performance of issue 5808, but I'm not sure if it's enough to call it fixed. On average, this patch reduces the number of textLayer divs by a factor of 3, and it also reduces the time spent in `getTextContent` by a factor of ~2.
The PDF file is generated by `Scribus PDF`, which for reasons I cannot understand is placing redundant `Tf` commands before *every* showText command.
Note how the PDF file also contains lots of (basically) identical fonts, but with slightly different names, which causes unnecessary font-switching. This in turn breaks the textLayer divs more often than necessary, but that issue cannot easily be worked around.
Note that I used a separate warning message for this case, instead of utilizing the same one as in the unsupported subtype case, to more clearly indicate that the PDF file itself is to blame rather than PDF.js.
Fixes 7446.
After PR 7039, the PDF file in issue 7492 no longer renders at all, but note that text selection wasn't working correctly previously.
The problem with the PDF file in issue 7492 is that the `cMap`, in the `toUnicode` entry in the font, contains an invalid name:
```
/CMapName /-usr-share-fonts-truetype-Panton-Panton Family-Fontfabric - Panton.otf,000-UTF16 def
```
When we parse that line, things obviously break because there are spaces present in the wrong places.
To avoid that issue, the patch simply lets `parseCMap` continue when errors are encountered, to try and recover usable data. Note that by not aborting immediately when an error is encountered, we are also able to fix the text selection.
Obviously, it could be argued that we should just immediately reject a corrupt `cMap`. But given that they usually are correct, it seems that trying to recover as much data as possible from a corrupt one can only be a good thing, for both glyph mapping and text selection.
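The recovery strategy is essentially "warn and keep going" rather than "throw and abort"; schematically (just a sketch, the real `parseCMap` works on lexer tokens):

```
// Sketch: instead of letting one malformed entry abort the entire cMap,
// log the problem and continue with the next entry, keeping whatever
// valid mappings have been parsed so far.
function parseEntries(entries, parseEntry, cMap) {
  for (var i = 0, ii = entries.length; i < ii; i++) {
    try {
      parseEntry(entries[i], cMap);
    } catch (ex) {
      console.warn('Invalid cMap entry ignored: ' + ex);
      continue;
    }
  }
  return cMap;
}
```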
Fixes 7492.
In the PDF file in the issue, some of the glyphs end up being mapped to the Lepcha Unicode block; see https://en.wikipedia.org/wiki/Lepcha_(Unicode_block).
This didn't use to matter, but after HarfBuzz updates that improved support for Lepcha fonts, in particular https://bugzilla.mozilla.org/show_bug.cgi?id=1249861, some glyphs are now moved horizontally.
To avoid that, this patch adds the Lepcha block to the list of Unicode ranges that we skip when building the glyph mapping.
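Concretely, one can think of this as a list of `[start, end]` code point pairs that the glyph mapping must not use; a simplified sketch (the actual list in the code is longer, and the helper name is made up):

```
// Sketch: Unicode ranges that should not receive "random" glyph mappings,
// because fonts/shapers assign special behaviour to them. The Lepcha
// block (U+1C00-U+1C4F) is added to such a list by this patch.
var skippedUnicodeRanges = [
  [0x1C00, 0x1C4F], // Lepcha
  // ... other ranges, e.g. control characters, combining marks, etc.
];

function isSkippedUnicode(codePoint) {
  return skippedUnicodeRanges.some(function (range) {
    return codePoint >= range[0] && codePoint <= range[1];
  });
}
```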
Fixes 7426.
Fallback to attempt to recover standard glyph names when amending the `charCodeToGlyphId` with entries from the `differences` array in `type1FontGlyphMapping` (issue 7439)
In fonts that only contain upper-case glyphs, and that are also missing a space glyph, `get spaceWidth` won't be able to return anything useful.
By adding upper-case `I` as a fallback, we can thus improve text-selection in some PDF files.
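The fallback chain can be pictured like this (a standalone sketch; the candidate list and helper name are assumptions rather than the exact implementation):

```
// Sketch: estimate a usable space width by trying the space glyph first,
// and progressively falling back to other glyphs that are likely to exist,
// ending with upper-case 'I' for fonts that only contain upper-case glyphs.
function estimateSpaceWidth(getGlyphWidth) {
  var candidates = ['space', 'minus', 'i', 'I'];
  for (var k = 0; k < candidates.length; k++) {
    var width = getGlyphWidth(candidates[k]);
    if (width > 0) {
      return width;
    }
  }
  return 0; // Nothing usable found.
}
```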
Note that locally, the patch causes slight movement in a few existing `text` tests, but in my opinion these actually look like slight improvements.
Fixes 7180.
Fixes http://www.pdf-archive.com/2013/09/30/file2/file2.pdf.
Note how it's not possible to show the various Popup Annotations in the above document.
To fix that, this patch lets the Popup inherit the flags of the parent, in the special case where the parent is `viewable` *and* the Popup is not.
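In (pseudo-)JavaScript the special case amounts to roughly the following (property names are illustrative, not the actual annotation code):

```
// Sketch: if the parent annotation is viewable but the Popup itself is
// not, let the Popup inherit the parent's flags so it can be displayed.
function maybeInheritParentFlags(popup, parent) {
  if (parent.viewable && !popup.viewable) {
    popup.flags = parent.flags;
    popup.viewable = true;
  }
}
```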
In general, I don't think that a Popup must have the same flags set as its parent. However, it seems very strange to have a `viewable` parent annotation and then not be able to view its Popup.
Annoyingly, the PDF specification doesn't, as far as I can find, mention anything about how this case should be handled, but this patch seems consistent with the actual behaviour in Adobe Reader.
In the font in question, there are a couple of `topDict` entries that have invalid values (`0xF 0xF`, i.e. just eof markers without any actual numbers).
This causes the `parseFloatOperand` function, inside `CFFParser_parseDict`, to return `NaN`. Currently we pass this broken font on to the browser, where OTS unsurprisingly rejects it.
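One way to picture the guard (a sketch, not the actual `CFFParser` code) is to refuse to store dictionary entries whose operands failed to parse:

```
// Sketch: skip topDict entries whose operand parsing produced NaN, so that
// a couple of stray eof markers do not poison the whole font program.
function addDictEntry(dict, key, operands) {
  for (var i = 0; i < operands.length; i++) {
    if (isNaN(operands[i])) {
      console.warn('Invalid CFF dict value for "' + key + '"; entry skipped.');
      return;
    }
  }
  dict[key] = operands;
}
```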
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1068432.
Note that in the PDF files provided by the reporter, this issue was limited to `Rect` arrays in AcroForm entries (which we currently don't support).
However, since a bad PDF generator could create this problem in *any* kind of annotation, the reduced test-case included here uses a simple LinkAnnotation instead.
Fixes 7115.
This patch adds a `getUnicodeForGlyph` helper function, which is used to recover Unicode values for non-standard glyph names.
Some PDF generators, e.g. Scribus PDF, use improper `uniXXXX` glyph names which breaks the glyph mapping. We can avoid this by converting them to "standard" glyph names instead.
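The helper basically recognizes the `uniXXXX` (and `uXXXX`) naming conventions, and falls back to the standard glyph list otherwise; roughly (a simplified sketch of the idea, not the exact implementation):

```
// Sketch: recover a Unicode code point from a (possibly non-standard)
// glyph name such as "uni00A0" or "u1F4A9".
function getUnicodeForGlyphSketch(name, standardGlyphMap) {
  var unicode = standardGlyphMap[name];
  if (unicode !== undefined) {
    return unicode;
  }
  var match = /^(?:uni([0-9A-Fa-f]{4})|u([0-9A-Fa-f]{4,6}))$/.exec(name);
  if (match) {
    return parseInt(match[1] || match[2], 16);
  }
  return -1; // No Unicode value could be recovered.
}
```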
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1132849.
Fixes 6893.
Fixes 6894.
In the PDF file in question, some of the 'name' table entries have `record.length === 0`. This becomes problematic in the non-unicode case, since `font.getBytes(0)` will fetch the *entire* stream.
Given that OTS rejects 'name' entries larger than `2^16`, this explains the sanitizer errors.
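The guard can be sketched as follows (standalone; the stream API shown is an assumption): only read a record's string when its length is sane.

```
// Sketch: skip 'name' table records with a zero (or unreasonably large)
// length, instead of accidentally reading the rest of the stream.
function readNameRecordString(font, offset, length) {
  if (length <= 0 || length > 0xFFFF) {
    return ''; // Nothing useful to read for this record.
  }
  font.pos = offset;
  return font.getBytes(length);
}
```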
Fixes 7020.
*A more robust solution for issue 6066.*
As a temporary work-around for (the upstream) [bug 1164199](https://bugzilla.mozilla.org/show_bug.cgi?id=1164199), we parsed *all* images in the Firefox addon for a short time.
Doing so uncovered an issue with our image handling (see 6066), for JPEG images with a `DeviceGray` ColorSpace *and* `bpc !== 1` (bits per component).
As long as we let the browser handle image decoding in this case, this isn't going to be an issue, but I do think that we should proactively fix this to avoid future issues if we change where the images are decoded (in `jpg.js` vs in browser).
Also, we currently don't seem to have a test-case for that kind of image data.
Some bad PDF generators, in particular "Scribus PDF", duplicate resources *a lot* at various levels of the PDF files. This can lead to `PartialEvaluator_hasBlendModes` taking an unreasonable amount of time to complete.
The reason is that the current code is using `Dict_getAll`, which recursively dereferences *all* indirect objects, which can be really slow. This patch instead uses `Dict_getKeys`, and then manually looks up only the necessary indirect objects.
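Schematically, the change is from "dereference everything" to "list the keys, then fetch only what's needed"; a sketch (simplified, with the `Dict`/`XRef` helper methods used here being assumptions about the internal API, not the actual implementation):

```
// Sketch: instead of recursively resolving every indirect object in the
// Resources dictionary (expensive), list the keys and fetch only the
// entries that are actually relevant for blend-mode detection.
function hasBlendModesSketch(resources, xref) {
  var graphicStates = resources.get('ExtGState');
  if (!graphicStates) {
    return false;
  }
  var keys = graphicStates.getKeys();
  for (var i = 0; i < keys.length; i++) {
    var gState = xref.fetchIfRef(graphicStates.getRaw(keys[i]));
    var blendMode = gState && gState.get('BM');
    if (blendMode && blendMode.name !== 'Normal') {
      return true;
    }
  }
  return false;
}
```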
I've added the PDF file as a `load` test. The most important thing here is probably to ensure that the file remains available in the repo, and the comment should help reduce the chance of regressions. (Note that locally, the `load` test times out without this patch, but we cannot really assume that that always happens.)
Fixes 6961.
Re: issue 5089.
(Note that since there are other outline features that we currently don't support, e.g. bold/italic text and custom colours, I thus think we can keep the referenced issue open.)
issue5726
We remove yo01. The file is not available; it was introduced in 3529658ac5 along with another test file (so we should not need it), and it was replaced in 6b2c6fc223, so it was only in the repository for a very short time.
As part of the link cleanup in issue 6854, obtaining this file through the Internet Archive didn't work.
However, given that the file was added in order to test an issue with `CropBox/MediaBox`, a reduced test-case should do just fine instead.
Please refer to issue 1155, and PR 1212.
It seems to be fairly common for OCR software to include incomplete TrueType fonts, notably missing the "glyf" table, in PDF files. Since we currently reject such fonts, the result is that text-selection/copying is broken.
This patch contains a suggested approach to try and use these kinds of broken fonts, by using existing code in `sanitizeGlyphLocations` to replace a missing "glyf" table with dummy data.
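The idea, roughly, is that a TrueType font without outlines is still useful for text extraction as long as the table structure is internally consistent; a much simplified sketch (not the actual `sanitizeGlyphLocations` code):

```
// Sketch: when the "glyf" table is missing, substitute empty glyph data
// and a matching "loca" table, so the font still passes sanitizing and
// can be used for text selection/copying.
function addDummyGlyfTable(tables, numGlyphs) {
  if (tables.glyf) {
    return;
  }
  // All loca offsets point at 0, i.e. every glyph is empty.
  tables.loca = { tag: 'loca', data: new Uint8Array((numGlyphs + 1) * 4) };
  tables.glyf = { tag: 'glyf', data: new Uint8Array(0) };
}
```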
Fixes 4684.
Fixes 6007.
Fixes 6829.