There have been lots of problems with trying to map glyphs to their Unicode
values. It's more reliable to just use the private use areas, so that the
browser's font renderer doesn't mess with the glyphs.
Using the private use area for all glyphs did highlight other issues that this
patch also had to fix:
* small private use area - Previously, only the BMP private use area was used,
which can't map many glyphs. Now, the (much bigger) PUP 16 area can also be
used.
* glyph zero not shown - Browsers will not use a glyph from a font if it has
glyph id = 0. This issue was less prevalent when we mapped to Unicode values,
since the fallback font would be used instead. However, when using the private
use area, the glyph would not be drawn at all. This is illustrated in one of
the current test cases (issue #8234), where there's an "ä" glyph at position
zero. The PDF looked like it rendered correctly, but it was actually not
using the glyph from the font. To properly show the first glyph, it is now
always duplicated and appended to the glyphs, and the maps are adjusted
accordingly.
* supplementary characters - The PUP 16 private use area lies outside the
BMP, so its code points take four bytes in UTF-16 and String.fromCodePoint
must be used where we previously used String.fromCharCode (see the sketch
after this list). This is actually an issue that should have been fixed
regardless of this patch.
* charset - FreeType fails to load fonts when the charset size doesn't match
the number of glyphs in the font. We now write out a fake charset with the
correct length. This also brought up the issue that fonts using seac/endchar
should only ever have a standard charset written, but we now write a custom
one. To get around this, seac analysis is permanently enabled, so those
glyphs are instead always drawn as two glyphs.
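As a minimal illustration of the supplementary-characters point above (the exact code points used by the patch may differ), values in Plane 16 don't fit in a single UTF-16 code unit, so String.fromCharCode silently truncates them:
```
// U+10F000 lies in the PUP 16 private use area.
String.fromCharCode(0x10f000);  // "\uF000" - the high bits are dropped
String.fromCodePoint(0x10f000); // "\uDBFC\uDC00" - correct surrogate pair
```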
Apparently there are some PDF generators (in this case the culprit is "Nooog Pdf Library / Nooog PStoPDF v1.5") that manage to mess up PDF creation enough that endstream[1] commands actually become truncated.
*Please note:* The solution implemented here isn't perfect, since it won't be able to cope with PDF files that contain a *mixture* of correct and truncated endstream commands.
However, considering that this particular mode of corruption *fortunately* doesn't seem very common[2], a slightly less complex solution ought to suffice for now.
Fixes 10004.
---
[1] Scanning through the PDF data to find endstream commands becomes necessary, in order to determine the stream length in cases where the `Length` entry of the (stream) dictionary is missing/incorrect.
[2] I cannot recall having seen any (previous) issues/bugs with "Missing endstream" errors.
Please note that while this *improves* issue 9984 slightly (and likely others too), it's not a complete solution.
The remaining issues are related to more general problems with the existing heuristics for combining separate text items.
The font in the PDF is marked as a CIDFontType0, but the font file is
actually a TrueType font. To fully address this issue we should really
peek into the font file and try to determine what it actually is. However,
this is the first case of this issue, so I think this solution is
acceptable for now.
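A hypothetical sketch of what such sniffing could look like (not part of this patch), based on the fact that TrueType files start with the sfnt version 0x00010000 or the tags 'true'/'ttcf':
```
// Peek at the first four bytes of the font file, regardless of what the
// font dictionary claims about the Type/Subtype.
function looksLikeTrueType(bytes) {
  const tag = String.fromCharCode(bytes[0], bytes[1], bytes[2], bytes[3]);
  return tag === '\x00\x01\x00\x00' || tag === 'true' || tag === 'ttcf';
}
```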
According to the PDF specification, see https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#page=45
> When using the JPXDecode filter with image XObjects, the following changes to and constraints on some entries in the image dictionary shall apply (see 8.9.5, "Image Dictionaries" for details on these entries):
>
> - Width and Height shall match the corresponding width and height values in the JPEG2000 data.
>
> - . . .
Hence it seems reasonable to use the Width/Height of the image data *itself*, rather than the image dictionary when there's a mismatch. Given that JPEG 2000 images are already being parsed, in order to obtain basic parameters, the actual Width/Height is readily available in the `PDFImage` constructor.
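A hedged sketch of the idea (property and helper names assumed):
```
// Prefer the dimensions parsed from the JPEG 2000 codestream when the
// image dictionary disagrees with them.
if (jpxImage.width !== dictWidth || jpxImage.height !== dictHeight) {
  warn('PDFImage: Width/Height in the image dictionary do not match ' +
       'the JPX data; using the values from the image data itself.');
  dictWidth = jpxImage.width;
  dictHeight = jpxImage.height;
}
```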
*I was feeling bored; so this is a very quick, and somewhat naive, attempt at fixing the bug.*
The breaking error, i.e. `Error during font loading: invalid array length`, was thrown when attempting to resize the `stack` to a *negative* length when parsing the CALL functions.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1473809.
Incomplete path operators, in particular, can result in fairly chaotic rendering artifacts, as can be observed on page four of the referenced PDF file.
The initial (naive) solution that was attempted, was to simply throw a `FormatError` as soon as any invalid (i.e. too short) operator was found and rely on the existing `ignoreErrors` code-paths. However, doing so would have caused regressions in some files; see the existing `issue2391-1` test-case, which was promoted to an `eq` test to help prevent future bugs.
Hence this patch, which adds special handling for invalid path operators since those may cause quite bad rendering artifacts.
You could, in all fairness, argue that the patch is a handwavy solution and I wouldn't object. However, given that this only concerns *corrupt* PDF files, the way that PDF viewers (PDF.js included) try to gracefully deal with those could probably be described as a best-effort solution anyway.
This patch also adjusts the existing `warn`/`info` messages to print the command name according to the PDF specification, rather than an internal PDF.js enumeration value. The former should be much more useful for debugging purposes.
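A rough sketch of the special handling (variable names assumed; `isPathOperator` in particular is a hypothetical helper):
```
// When a path operator arrives with too few operands, drop it with a
// warning, using the PDF specification's command name, instead of
// executing it and producing chaotic rendering artifacts.
if (args.length < numArgs && isPathOperator(fn)) {
  warn(`Skipping command "${opName}": expected ${numArgs} operands, ` +
       `but received only ${args.length}.`);
  args.length = 0; // Discard the incomplete operands.
  continue;        // Keep processing the remaining operators.
}
```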
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1443140.
This patch avoids choosing a (possible) 'trailer' dictionary that `XRef.parse` and/or the `Catalog` constructor/methods will reject anyway.
Since `XRef.indexObjects` is already parsing the entire PDF file, the extra dictionary look-ups added here shouldn't matter much. Besides, this is a fallback code-path that only applies to corrupt PDF files anyway.
Obviously it's still not possible to render non-embedded fonts as paths, but in this way the rest of the page will at least be allowed to continue rendering.
*Please note:* Including the 14 standard fonts in PDF.js probably wouldn't be *that* difficult to implement. (I'm not a lawyer, but the fonts from PDFium could probably be used given their BSD license.)
However, the main blocker ought to be the total size of the necessary font data, since I cannot imagine people being OK with shipping ~5 MB of (additional) font data with Firefox. (Based on the reactions when the CMap files were added, and those are only ~1 MB in size.)
The reason for the bug is that we're only computing a checksum of the image data itself, while completely ignoring the inline dictionary. The latter is important, since in practice it's not uncommon for inline images to be identical but use e.g. different ColourSpaces.
There's obviously a couple of different ways that we could compute a hash/checksum of the dictionary.
Initially I tried using `MurmurHash3_64` to compute a hash of the keys/values in the dictionary. Unfortunately this approach turned out to be *way* too slow in practice, especially for PDF files with a huge number of inline images; in particular issue 2618 would regress quite badly with this solution.
The solution that is instead implemented in this patch, is to compute a checksum of the dictionary contents. While this is a much simpler, not to mention a lot more efficient, solution there's one drawback associated with it:
If the contents of inline image dictionaries are ordered differently, they will not be considered equal with this approach, which could thus lead to failures to cache repeated inline images. In practice this doesn't seem to be a problem in any of the PDF files I've tested, and generally I'd rather err on the side of *not* caching, given that too aggressive caching can easily lead to rendering bugs.
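A hedged sketch of the checksum approach (helper names such as `getByteRange` are assumed):
```
// Fold the raw bytes of the inline image dictionary into the cache key,
// so that identical image data with e.g. different ColourSpaces is not
// treated as the same image.
function computeDictChecksum(bytes) {
  let checksum = 0;
  for (let i = 0; i < bytes.length; i++) {
    checksum = (checksum * 31 + bytes[i]) >>> 0; // Simple rolling checksum.
  }
  return checksum;
}
// `dictStart` is the stream position tracked by the Lexer (see below), and
// `dataStart` is where the actual image data begins.
const dictBytes = stream.getByteRange(dictStart, dataStart);
const cacheKey = `${imageDataChecksum}_${computeDictChecksum(dictBytes)}`;
```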
One small, but somewhat annoying, complication is that by the time `Parser.makeInlineImage` is called, we no longer know the *exact* stream position where the inline image dictionary starts. Having access to that information is crucial here, and the easiest solution I could come up with is to track this in the current `Lexer` instance.[1]
With the patch, we're thus able to fix the referenced issues without incurring large regressions in problematic cases such as issue 2618.
Fixes 9398; also improves/fixes the `issue8823` reference test.
---
[1] Obviously I'd have preferred if this patch could be limited to `Parser.makeInlineImage`, without the need for this "hack", but I'm not sure what that'd look like here.
Fallback to the built-in JPEG decoder when browser decoding fails, and attempt to handle JPEG images with DNL (Define Number of Lines) markers (issue 8614)
Please refer to the specification, found at https://www.w3.org/Graphics/JPEG/itu-t81.pdf#page=49
Given how the JPEG decoder is currently implemented, we need to know the value of the scanLines parameter (among others) *before* parsing of the SOS (Start of Scan) data begins.
Hence the best solution I could come up with here is to re-parse the image in the *hopefully* rare case of JPEG images that include a DNL (Define Number of Lines) marker.
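One way the re-parse trigger might look (all names assumed); the DNL segment, marker 0xFFDC, carries the real number of scan lines:
```
class DNLMarkerError extends Error {
  constructor(message, scanLines) {
    super(message);
    this.scanLines = scanLines;
  }
}

// Called when a DNL segment is encountered while parsing the scan data;
// `frame.scanLines` holds the (possibly incorrect) SOF value, and `offset`
// points at the two-byte number-of-lines field.
function checkDNLMarker(data, offset, frame) {
  const scanLines = (data[offset] << 8) | data[offset + 1];
  if (scanLines > 0 && scanLines !== frame.scanLines) {
    // Caught at the decoder's entry point, which restarts parsing with
    // the corrected height.
    throw new DNLMarkerError('Found DNL marker in scan data', scanLines);
  }
}
```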
Fixes 8614.
In the JPEG images in the referenced PDF file, the DHT (Define Huffman Tables) segments contain more data than expected based on the length parameter.
Fixes 9425.
Please refer to the PDF specification, in particular http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G7.3801570
> A colour space shall be specified in one of two ways:
> - Within a content stream, the CS or cs operator establishes the current colour space parameter in the graphics state. The operand shall always be name object, which either identifies one of the colour spaces that need no additional parameters (DeviceGray, DeviceRGB, DeviceCMYK, or some cases of Pattern) or shall be used as a key in the ColorSpace subdictionary of the current resource dictionary (see 7.8.3, "Resource Dictionaries"). In the latter case, the value of the dictionary entry in turn shall be a colour space array or name. A colour space array shall never be inline within a content stream.
>
> - Outside a content stream, certain objects, such as image XObjects, shall specify a colour space as an explicit parameter, often associated with the key ColorSpace. In this case, the colour space array or name shall always be defined directly as a PDF object, not by an entry in the ColorSpace resource subdictionary. This convention also applies when colour spaces are defined in terms of other colour spaces.
I've been looking into the remaining point in 8637 about blurry images, to see if we could perhaps improve the rendering quality slightly there. After quite a bit of debugging, it seems that the issue is limited to certain progressive JPEG images.
As mentioned previously, I've got no detailed knowledge of the JPEG format, but this patch does seem to improve things quite a bit for the images in question.
Squinting at https://searchfox.org/mozilla-central/rev/6c33dde6ca02b389c52e8db3d22494df8b916f33/media/libjpeg/jdphuff.c#492-639, it seems reasonable that we should take the sign of the data into account. Furthermore, looking at the specification in https://www.w3.org/Graphics/JPEG/itu-t81.pdf#page=118, the "F.2.4.3 Decoding the binary decision sequence for non-zero DC differences and AC coefficients" section even contains a description of this (even though I cannot claim to really understand the details).
The bug that this patch fixes is limited to the built-in JPEG decoder, and was unearthed by PR 9260. The underlying issue has existed since PR 6984, where the contents of this patch ought to have been included (if it weren't for the fact that we had no *easy* way to test `src/core/jpg.js` back then).
*Please note:* The slight movement in the reference test is a result of using the `src/core/jpg.js` decoder, rather than the native browser one.
This was an oversight in PR 9095, which unfortunately breaks rendering in some PDF files (e.g. the one from issue 6737).
It thus appears that we don't have any test-coverage for this code-path, and given the relative complexity of the PDF files affected by this bug I wasn't able to easily create a reduced test-case.
*Please note:* The linked test-case included in this patch is currently *not* rendered correctly (that'd be PR 6606), but it at least gives us some test-coverage here.
The PDF file in the issue uses a number of *embedded* versions of Lucida fonts, but for some reason does *not* embed the LucidaSans-Demi font. According to https://en.wikipedia.org/wiki/Lucida#Usages that one should be bold, so we can at least improve rendering here (even though it won't look perfect).
Fixes 9291.
This patch refactors the searching for 'endobj': we try to find the next occurrence of "obj" and then check whether it is in fact an 'endobj', continuing the search otherwise.
This approach avoids having to first find 'endobj' and then re-check the entire contents of the object, which meant running (potentially expensive) regular expressions on arbitrarily long strings.
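A sketch of the idea, with the buffer handling simplified to a plain string for illustration:
```
// Find the next 'endobj' by searching for "obj" occurrences and checking
// whether each one is preceded by "end".
function findEndobj(data, startPos) {
  let pos = startPos;
  let index;
  while ((index = data.indexOf('obj', pos)) >= 0) {
    if (data.startsWith('end', index - 3)) {
      return index - 3; // Found an 'endobj' command.
    }
    pos = index + 3; // A plain 'obj' command; keep searching.
  }
  return -1;
}
```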
Fixes 9105.
There's a number of issues with the fonts in the referenced PDF file. First of all, they contain broken `ToUnicode` data (`NUL` bytes all over the place). However, even if you skip those, the `ToUnicode` data appears to contain nothing but an `IdentityH` CMap, which won't help provide a proper glyph mapping.
The real issue actually turns out to be that the PDF file uses the "Calibri" font[1], but doesn't include any font files. Since that one isn't a standard font, and uses a fairly different CID to GID map compared to the standard fonts, we're not able to render the file even remotely correctly.
To work around this, I'm thus proposing that we include an (incomplete) glyph map for Calibri, and fall back to the standard Helvetica font. Obviously this isn't going to look perfect, but it's really the best that we can hope to achieve given that the PDF file is missing the necessary font data.
Finally, please note that none of the PDF readers I've tried (Adobe Reader, PDFium in Chrome) were able to extract the text (which isn't very surprising, given the broken `ToUnicode` data).
Fixes 9195.
---
[1] According to Wikipedia, see https://en.wikipedia.org/wiki/Calibri, Calibri is (primarily) a Windows font.
In some fonts, the included `ToUnicode` data is incomplete, causing text-selection to not work properly. For simple fonts that contain encoding data, we can manually build a `ToUnicode` map to attempt to improve things.
Please note that since we're currently using the `ToUnicode` data during glyph mapping, in an attempt to avoid rendering regressions, I purposely didn't want to amend the original `ToUnicode` data for this text-selection edge-case.
Instead, I opted for the current solution, which will (hopefully) give slightly better text-extraction results in PDF files with incomplete `ToUnicode` data.
According to the PDF specification, see [section 9.10.2](http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1873172):
> A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value.
> ...
Reading that paragraph literally, it doesn't seem too unreasonable to use *different* methods for different charcodes.
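A hedged sketch of how such a fallback map might be built (using the `getUnicodeForGlyph` helper described elsewhere in this log; the encoding shape is assumed):
```
// Map charcodes to Unicode values via the encoding's glyph names, leaving
// the original ToUnicode data, used during glyph mapping, untouched.
function buildFallbackToUnicode(encoding, glyphsUnicodeMap) {
  const map = [];
  for (let charCode = 0; charCode < encoding.length; charCode++) {
    const glyphName = encoding[charCode];
    if (!glyphName) {
      continue;
    }
    const unicode = getUnicodeForGlyph(glyphName, glyphsUnicodeMap);
    if (unicode !== -1) {
      map[charCode] = String.fromCharCode(unicode);
    }
  }
  return map;
}
```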
Fixes 8229.
In the PDF file, the `ToUnicode` data first maps the hyphen correctly, and then *overwrites* it to point to the soft hyphen instead. That one cannot be rendered in browsers, and an empty space thus appears instead.
Fixes 9084.
This patch makes use of the existing `ignoreErrors` property in `src/core/evaluator.js`, see PRs 8240 and 8441, thus allowing us to attempt to recover as much as possible of a page even when it contains broken XObjects.
Fixes 8702.
Fixes 8704.
Bug 1392647 has a PDF where the default width of the font
is 0. It draws some charcodes that don't have glyphs, but
we were wrongly using the 1000 default width for these
charcodes, causing some text to overlap.
(for issue #6289)
This does the same for 16-bit as the existing 8-bit TIFF predictor code: an addition of the last word to the current word.
The last two "& 0xFF" may or may not be needed; I see this isn't done in the 8-bit code, but I'm not a JS developer.
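A sketch of the 16-bit case (variable names assumed). For what it's worth, when the destination is a `Uint8Array` the "& 0xFF" masks are redundant, since stores truncate automatically, but they are harmless:
```
// Apply TIFF predictor 2 to one row of 16-bit samples: add the sample
// word `stride` bytes back (i.e. one pixel earlier) to the current word.
function applyTiffPredictor16(buffer, stride) {
  for (let i = stride; i + 1 < buffer.length; i += 2) {
    const prev = (buffer[i - stride] << 8) | buffer[i - stride + 1];
    const curr = (buffer[i] << 8) | buffer[i + 1];
    const sum = prev + curr;
    buffer[i] = (sum >> 8) & 0xff;
    buffer[i + 1] = sum & 0xff;
  }
}
```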
Rather than displaying links that do *nothing* when clicked, it probably makes more sense to simply not render them instead. Especially since it turns out that, at least at this point in time, this is *very* easy to both implement and test.
Fixes 3897.
Since this patch will now treat (some) `NUL` bytes as "ASCII", the number of `followingBytes` checked are thus increased to (hopefully) reduce the risk of introducing new false positives.
Fixes 8823.
In issue #8707, there's a char code mapped to a non-
existing glyph which shouldn't be drawn. However, we
saw it was missing, tried to use the post table, and
ended up mapping it incorrectly.
This illuminated a problem with issue #5704 and bug
893730, where glyphs disappeared after the above fix. This
was caused by the cmap returning the wrong glyph id, which
in turn happened because the font had multiple cmap tables
of the same type and we were choosing the last one. Now, we
instead default to the first one. I'm unsure if we should
instead be merging the multiple cmaps, but using only the
first one works.
The PDF file uses a non-embedded SegoeUISymbol font, which is *not* a standard font (and is mainly used by Microsoft, see https://en.wikipedia.org/wiki/Segoe).
Fixes 8697.
Issue 7696 was one of the issues fixed by PR 8580. The other ones were all cases of missing glyphs; however, in this particular one the glyphs did render, but every single one was incorrect.
Hence it probably cannot hurt to have a small, reduced, reference test for that PDF file as well.
This is a trivial follow-up to PR 5383, and it's a bit strange that this has been wrong since late 2014 without anyone noticing (maybe because inline images aren't too common).
So, apparently code works better if you actually spell correctly, who knew ;-)
Fixes 8613.
In the PDF from issue 8527, the clip operator (W) shows up before a path
is defined. The current SVG backend however expects a path to exist
before generating a `<svg:clipPath>` element.
In the example, the path was defined after the clip, followed by an
endPath operator (n).
So this commit fixes the bug by moving the path generation logic from
clip to endPath.
Our canvas backend appears to use similar logic:
`CanvasGraphics_endPath` calls `consumePath`, which in turn draws the
clip and resets the `pendingClip` state. The canvas backend calls
`consumePath` from multiple other places, so we probably need to check
whether doing so is also necessary for the SVG backend.
I scanned our corpus of PDF files in test/pdfs, and found that in every
instance (except for one), the "W" PDF operator (clip) is immediately
followed by "n" (endPath). The new test from this commit (clippath.pdf)
starts with "W", followed by a path definition and then "n".
# Commands used to find some of the clipping commands:
grep -ra '^W$' -C7 | less -S
grep -ra '^W ' -C7 | less -S
grep -ra ' W$' -C7 | less -S
test/pdfs/issue6413.pdf is the only file where "W" (at line 55) is not
followed by "n". In fact, the "W" is the last operation of a series of
XObject painting operations, and removing it does not have any effect
on the rendered PDF (confirmed by looking at the output of PDF.js's
canvas backend, and ImageMagick's convert command).
*As mentioned the last time that I touched this particular part of the font code, I sincerely hope that this doesn't cause any regressions!*
However, the patch passes all tests added in PRs 5770, 6270, and 7904 (and obviously all other tests as well). Furthermore, I've manually checked all the issues/bugs referenced in those PRs without finding any issues.
Fixes 8480.
For some reason, we're putting all kinds of images *except* JPEG into the `imageCache` in `evaluator.js`.[1]
This means that in the PDF file in issue 8380, we'll keep sending the *same* two small images[2] to the main-thread and decoding them over and over. This is obviously hugely inefficient!
As can be seen from the discussion in the issue, the performance becomes *extremely* bad if the user has the addon "Adblock Plus" installed. However, even in a clean Firefox profile, the performance isn't that great.
This patch not only addresses the performance implications of the "Adblock Plus" addon together with that particular PDF file, but it *also* improves the rendering times considerably for *all* users.
Locally, with a clean profile, the rendering times are reduced from `~2000 ms` to `~500 ms` for my setup!
Obviously, the general structure of the PDF file and its operator sequence is still hugely inefficient, however I'd say that the performance with this patch is good enough to consider the issue (as it stands) resolved.[3]
Fixes 8380.
---
[1] Not technically true, since inline images are cached from `parser.js`, but whatever :-)
[2] The two JPEG images have dimensions 1x2 and 4x2, respectively.
[3] To make this even more efficient, a new state would have to be added to the `QueueOptimizer`. Given that PDF files this stupid fortunately aren't too common, I'm not convinced that it's worth doing.
Please refer to the JBIG2 standard, see https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.88-200002-I!!PDF-E&type=items.
In particular, section "6.3.5.3 Fixed templates and adaptive templates" mentions that the offsets should be *subtracted*; where the offsets are defined according to "Table 6" under section "6.3.2 Input parameters".
Fixes 7145.
Fixes 7308.
Fixes 7401.
Fixes 7850.
Fixes 8270.
[api-minor] Always allow e.g. rendering to continue even if there are errors, and add a `stopAtErrors` parameter to `getDocument` to opt-out of this behaviour (issue 6342, issue 3795, bug 1130815)
This patch implements support for line annotations. Other viewers only
show the popup annotation when hovering over the line, which may have
any orientation. To make this possible, we render an invisible line (SVG
element) over the line on the canvas that acts as the trigger for the
popup annotation. This invisible line has the same starting coordinates,
ending coordinates and width of the line on the canvas.
Other PDF readers, e.g. Adobe Reader and PDFium (in Chrome), will attempt to render as much of a page as possible even if there are errors present.
Currently we just bail as soon as the first error is hit, which means that we'll usually not render anything in these cases and just display a blank page instead.
NOTE: This patch changes the default behaviour of the PDF.js API to always attempt to recover as much data as possible, even when encountering errors during e.g. `getOperatorList`/`getTextContent`, which thus improve our handling of corrupt PDF files and allow the default viewer to handle errors slightly more gracefully.
In the event that an API consumer wishes to use the old behaviour, where we stop parsing as soon as an error is encountered, the `stopAtErrors` parameter can be set at `getDocument`.
Fixes, inasmuch it's possible since the PDF files are corrupt, e.g. issue 6342, issue 3795, and [bug 1130815](https://bugzilla.mozilla.org/show_bug.cgi?id=1130815) (and probably others too).
*My apologies for inadvertently breaking this in PR 8064; apparently we don't have any tests that cover this use-case :(*
Without this patch `getTextContent` will fail if called before `getOperatorList`, since loading of fonts during text-extraction may require fetching of built-in CMap files.
*Please note:* The `text` test added here, which uses an already existing PDF file, fails without this patch.
*After browsing through (a version of) the JPEG specification, see https://www.w3.org/Graphics/JPEG/itu-t81.pdf, I hope that this patch makes sense.*
Note that while issue 7828 became a problem after PR 7661, it isn't really a regression from that PR. The explanation is rather that we're now relying on `core/jpg.js` instead of the Native Image decoder in more situations than before, which thus exposed an *existing* issue in our JPEG decoder.
Another factor also seems to be that in many JPEG images, the DRI (Define Restart Interval) marker isn't present, in which case this bug won't manifest either.
According to https://www.w3.org/Graphics/JPEG/itu-t81.pdf#page=89 (at the bottom of the page):
"NOTE – The final restart interval may be smaller than the size specified by the DRI marker segment, as it includes only the number of MCUs remaining in the scan."
Furthermore, according to https://www.w3.org/Graphics/JPEG/itu-t81.pdf#page=39 (in the middle of the page):
"[...] If restart is enabled and the restart interval is defined to be Ri, each entropy-coded segment except the last one shall contain Ri MCUs. The last one shall contain whatever number of MCUs completes the scan."
Based on the above, it thus seems to me that we should simply ensure that we're not attempting to continue to parse Scan data once we've found all MCUs (Minimum Coded Units) of the image.
Fixes 7828.
It appears that I accidentally broke this in PR 6065, sorry about that!
The issue in this particular PDF file is that there's `/Rotate` entries on different levels of the `/Pages` tree. We're supposed to use the `/Rotate` entry in the `/Page` dict (which is `0`), but because of an incorrect condition we instead ended up with the one from the `/Pages` dict (which is `180`).
Fixes 8125.
As discussed on IRC, we need to check all nodes at the *bottom* of the tree to ensure that we find the correct `Page` dict.
Furthermore, this patch also gets rid of the caching present in a previous version, since it's not clear if that really helps.
Note that this patch purposely adds an `eq` test, using a reduced test-case, so that we can be sure that the algorithm actually finds the correct `Page` dict for each `pageIndex`.
Fixes 8088.
This patch basically reverts one aspect of TrueType (3, 1) cmap parsing to the state prior to PR 4259. After that PR, a number of regressions occurred in this particular code-path, which necessitated a number of follow-ups such as PRs 5703, 5743, and 6425.
The empirical data suggests, at least to me, that we should always prefer a (3, 1) cmap for TrueType fonts when they have an encoding, regardless of the Symbolic font flag.
Obviously this patch passes all unit/font/reference tests locally, and I made sure that all the PRs mentioned above landed with test-cases included.
However, in my opinion, there's still a very real possibility that this patch could potentially cause new regressions.
Given that the PDF file in bug 1337429 has been broken for almost *three* years before anyone noticed, and considering that the code-path in question has been the source of numerous regressions, I do *not* intend to request uplift of this patch to previous Firefox versions (assuming that it's even accepted).
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1337429.
*Please note:* The rendering of the PDF file in issue 8061 first regressed in PR 7039, and then PR 7493 exacerbated the problem even further by causing an infinite loop.
In this particular case, when errors were encountered inside of the `Lexer.getObject` method *itself*, we didn't advance the stream position. This thus caused an infinite loop in `parseCMap`, since the exact same character was then parsed over and over again.
Fixes 8061.
Further adjust the heuristics used to detect OpenType font files with CFF data, to ensure that all Type0 fonts are handled the same way regardless of font Subtype (issue 7901)
Changing this particular code makes me somewhat nervous about regressions, since PR 5770 necessitated the follow-up PR 6270.
However, the patch passes all tests added in those PRs (and obviously all other tests). Furthermore, I've manually checked all the issues/bugs referenced in PRs 5770 and 6270 without finding any issues.
**Please note:** This patch fixes *only* the font bug, not the SVG conversion, present on pages two and three of the PDF file in issue 7901.
*This patch fixes something that I noticed while debugging https://bugzilla.mozilla.org/show_bug.cgi?id=1308536.*
The PDF file contains a font called "NuptialScript", which unfortunately is not embedded. Since that is a non-standard font we will not be able to render it entirely correct. However, by adding "NuptialScript" to the `getNonStdFontMap`, we can at least improve the rendering slightly by using an italic (serif) fallback font.
This patch adds support for non-embedded Arial Black fonts, that use an `Arial-Black...` format for the font names.
Also, this patch changes `canvas.js` such that we always render Arial Black fonts with the maximum weight, which actually improves a number of existing test-cases. This should thus explain the test "failures", which are clear improvements compared with e.g. Adobe Reader.
Fixes 7835.
While the array argument to TJ should only contain strings and numbers, other
unfortunate items are found in PDFs in the wild, e.g.:
[(Grandes) 0.0 Tc
-250.0 (Client\350les,) 0.0 Tc
-250.0 (Financements) 0.0 Tc
-250.0 (et) 0.0 Tc
-250.0 (March\351s) ] TJ
getOperatorList already properly ignores any non-string, non-numeric values in
TJ arrays; without this patch to getTextContent, returned text items can have
NaN widths due to calculations being applied to those non-numeric values.
For PDF files with multiple `/Filter`s, where the `/Length` entry is zero, we fail to render the file correctly. The reason is that `maybeLength` is `null` for every filter except the first, and `!maybeLength` is thus truthy.
Hence it seems that we should completely ignore the `/Length` entry and also explicitly check `maybeLength === 0`.
Note that I've not (yet) come across a PDF file with this issue in the wild, but given all the stupid things PDF generators do I wouldn't be surprised if such a file actually exists. In order to prevent a possible future bug, I'm submitting this patch which includes a hand-edited PDF file that we currently cannot render correctly (but e.g. Adobe Reader can).
This patch improves the unit tests by testing the support for read-only
and multiline fields. Moreover, we add a reference test to ensure that
the text widgets are not only rendered, but also that their contents are
styled properly.
Finally, we perform minor improvements in `src/core/annotation.js`, for
example adding missing comments.
Even though this patch passes all tests (unit/font/reference) locally, including the new ones that I added in PR 7621, I'm still a bit nervous about modifying the code that choose the fallback encoding for fonts without an `/Encoding` entry.
Note that over the years this code has been changed on a number of occasions, see a possibly incomplete [list here], to deal with various cases of incorrect font data.
According to the PDF specification, see http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1904184, it seems that we should fallback to the `StandardEncoding` for Nonsymbolic fonts.
There's obviously a risk that fixing this particular issue *could* break other PDF files for which we don't have tests. However I've tried to change the logic as little as possible in this patch, to hopefully reduce possible breakage.
Based on debugging numerous font issues, it seems that a lot of fonts actually set the Symbolic flag, even when they are in fact *not* Symbolic. Fonts actually marked as Nonsymbolic seem to be somewhat less common, which I hope should reduce the risk of the patch somewhat.
Fixes 7580.
Note that in order to prevent any possible issues, this patch does *not* try to amend the `toUnicode` data for Type1 fonts that contain either `ToUnicode` or `Encoding` entries in the font dictionary.
Fixes, or at least improves, issues/bugs such as e.g. 6658, 6901, 7182, 7217, bug 917796, bug 1242142.
I've started to look into how we can fix issue 7580, but quickly became worried that fixing it could easily mean that we'd trade one fixed PDF file for a multitude of broken ones.
Hence I started going through the history of the code that choose the fallback encoding, and noticed that it has been changed a number of times over the years to deal with various cases of weirdness/errors in non-embedded fonts.
To my relief it turned out that almost all the PRs, please see a possibly incomplete [list here], that changed this code actually included `eq` test-cases.
However, in one case it appears that a PR neglected to add a test-case. Furthermore, since the fallback encoding may also be the only source for creating a `toUnicode` map, changing the encoding could possibly regress only the text-selection despite a PDF file still rendering correctly.
Therefore, this PR adds one new `eq` test, and also a number of additional `text` tests for PDF files already present in the test-suite.
Note that it's obviously possible that there's a certain overlap between the added tests, but I'd be *a whole lot* more concerned with causing regressions.
This patch is yet another instalment in the (never ending) series of patches for PDF files that specify a completely incorrect Type/Subtype for their fonts. In this case Type1/Type1C, when in fact OpenType would have been correct.
Fixes 7598.
Assign the `quantizationTables` after parsing the entire JPEG image, to prevent issues when the DQT (Define Quantization Tables) marker is encountered after SOF{n} (Start of Frame) markers (issue 7406)
The PDF file contains an image that we're allowed to use, since it's just the PDF.js logo.
The logo image was simply inverted (so that it requires a /Decode entry in the image dictionary that triggers the use of `jpg.js` instead of the browser), converted to JPEG, and finally edited by hand to change the order of the DQT/SOF{n} markers.
Instead of having `Parser_getObj` fail unconditionally for the referenced PDF file, this patch attempts to let searching for the main trailer continue even if there are errors.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1250079.
This patch improves the performance of issue 5808, but I'm not sure if it's enough to call it fixed. On average, this patch reduces the number of textLayer div's by a factor of 3, and it also reduces the time spent in `getTextContent` by a factor of ~2.
The PDF file is generated by `Scribus PDF`, which for reasons I cannot understand is placing redundant `Tf` commands before *every* showText command.
Note how the PDF file also contains lots of (basically) identical fonts, but with slightly different names, which causes unnecessary font-switching. This causes some unnecessary breaking of textLayer div's, but this issue cannot be easily worked around.
Note that I used a separate warning message for this case, instead of utilizing the same one as in the unsupported subtype case, to more clearly indicate that the PDF file itself is to blame rather than PDF.js.
Fixes 7446.
After PR 7039, the PDF file in issue 7492 no longer renders at all, but note that text selection wasn't working correctly previously.
The problem with the PDF file in issue 7492 is that the `cMap`, in the `toUnicode` entry in the font, contains an invalid name:
```
/CMapName /-usr-share-fonts-truetype-Panton-Panton Family-Fontfabric - Panton.otf,000-UTF16 def
```
When we parse that line, things obviously break because there are spaces present in the wrong places.
To avoid that issue, the patch simply lets `parseCMap` continue when errors are encountered, to try and recover usable data. Note that by not aborting immediately when an error is encountered, we are also able to fix the text selection.
Obviously, it could be argued that we should just immediately reject a corrupt `cMap`. But given that they usually are correct, it seems that trying to recover as much data as possible from a corrupt one can only be a good thing for both glyph mapping and text selection.
Fixes 7492.
In the PDF file in the issue, some of the glyphs end up being mapped to the Lepcha Unicode block; see https://en.wikipedia.org/wiki/Lepcha_(Unicode_block).
This didn't use to matter, but after HarfBuzz updates that improved support for Lepcha fonts, in particular https://bugzilla.mozilla.org/show_bug.cgi?id=1249861, some glyphs are now moved horizontally.
To avoid that, this patch adds the Lepcha block to the list of Unicode ranges that we skip when building the glyph mapping.
Fixes 7426.
Fallback to attempt to recover standard glyph names when amending the `charCodeToGlyphId` with entries from the `differences` array in `type1FontGlyphMapping` (issue 7439)
In fonts that contain only upper-case glyphs, and are also missing a space glyph, `get spaceWidth` won't be able to return anything useful.
By adding upper-case `I` as a fallback, we can thus improve text-selection in some PDF files.
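A hedged sketch of the fallback chain (the exact list and the `charCodeOfGlyphName` lookup are hypothetical):
```
// Estimate spaceWidth from the first fallback glyph that actually exists
// in the font; the upper-case 'I' is the last resort for fonts that only
// contain upper-case glyphs.
function estimateSpaceWidth(font) {
  for (const name of ['space', 'minus', 'one', 'i', 'I']) {
    const charCode = font.charCodeOfGlyphName(name);
    if (charCode !== -1 && font.widths[charCode]) {
      return font.widths[charCode];
    }
  }
  return font.defaultWidth;
}
```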
Note that locally, the patch causes slight movement in a few existing `text` tests, but in my opinion this actually looks like slight improvements.
Fixes 7180.
Currently `setGState` is completely broken, and looking through the history of that code, it seems to me that this may never have worked correctly.
This patch fixes the text-selection in `extgstate.pdf` in the test-suite, which is also added as a `text` test.
Fixes http://www.pdf-archive.com/2013/09/30/file2/file2.pdf.
Note how it's not possible to show the various Popup Annotations in the above document.
To fix that, this patch lets the Popup inherit the flags of the parent, in the special case where the parent is `viewable` *and* the Popup is not.
In general, I don't think that a Popup must have the same flags set as the parent. However, it seems very strange to have a `viewable` parent annotation, and then not being able to view the Popup.
Annoyingly the PDF specification doesn't, as far as I can find, mention anything about how this case should be handled, but this patch seems consistent with the actual behaviour in Adobe Reader.
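A hedged sketch of the special case (names assumed; `isViewable` in particular is a hypothetical helper):
```
// If the parent annotation is viewable but the Popup itself is not, let
// the Popup inherit the parent's flags so it can actually be shown.
const parentFlags = parentDict.get('F');
if (!this.viewable && isViewable(parentFlags)) {
  this.setFlags(parentFlags);
}
```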
In the font in question, there are a couple of `topDict` entries that have invalid values (`0xF 0xF`, i.e. just eof markers without any actual numbers).
This causes the `parseFloatOperand` function, inside `CFFParser_parseDict`, to return `NaN`. Currently we pass this broken font onto the browser, which OTS unsurprisingly rejects.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1068432.
Note that in the PDF files provided by the reporter, this issue was limited to `Rect` arrays in AcroForm entries (which we currently don't support).
However, since a bad PDF generator could create this problem in *any* kind of annotation, the reduced test-case included here uses a simple LinkAnnotation instead.
Fixes 7115.
This patch adds a `getUnicodeForGlyph` helper function, which is used to recover Unicode values for non-standard glyph names.
Some PDF generators, e.g. Scribus PDF, use improper `uniXXXX` glyph names which breaks the glyph mapping. We can avoid this by converting them to "standard" glyph names instead.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1132849.
Fixes 6893.
Fixes 6894.
In the PDF file in question, some of the 'name' table entries have `record.length === 0`. This becomes problematic in the non-unicode case, since `font.getBytes(0)` will fetch the *entire* stream.
Given that OTS rejects 'name' entries larger than `2^16`, this thus explains the sanitizer errors.
Fixes 7020.
*A more robust solution for issue 6066.*
As a temporary work-around for (the upstream) [bug 1164199](https://bugzilla.mozilla.org/show_bug.cgi?id=1164199), we parsed *all* images in the Firefox addon during a short time.
Doing so uncovered an issue with our image handling (see 6066), for JPEG images with a `DeviceGray` ColorSpace *and* `bpc !== 1` (bits per component).
As long as we let the browser handle image decoding in this case, this isn't going to be an issue, but I do think that we should proactively fix this to avoid future issues if we change where the images are decoded (in `jpg.js` vs in browser).
Also, we currently don't seem to have a test-case for that kind of image data.
Some bad PDF generators, in particular "Scribus PDF", duplicate resources *a lot* at various levels of the PDF files. This can lead to `PartialEvaluator_hasBlendModes` taking an unreasonable amount of time to complete.
The reason is that the current code is using `Dict_getAll`, which recursively dereferences *all* indirect objects, which can be really slow. This patch instead uses `Dict_getKeys`, and then manually looks up only the necessary indirect objects.
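A hedged sketch of the manual look-up (API names assumed):
```
// Check the ExtGState entries for non-Normal blend modes by iterating the
// keys and fetching only the indirect objects we actually need, instead of
// Dict_getAll's recursive dereferencing of *all* indirect objects.
function extGStateHasBlendModes(resources, xref) {
  const graphicStates = resources.get('ExtGState');
  if (!graphicStates) {
    return false;
  }
  for (const key of graphicStates.getKeys()) {
    let graphicState = graphicStates.getRaw(key);
    if (isRef(graphicState)) {
      graphicState = xref.fetch(graphicState); // Resolve just this object.
    }
    const blendMode = graphicState.get('BM');
    if (blendMode && blendMode.name !== 'Normal') {
      return true;
    }
  }
  return false;
}
```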
I've added the PDF file as a `load` test. The most important thing here is probably to ensure that the file remains available in the repo, and the comment should help reduce the chance of regressions. (Note that locally, the `load` test times out without this patch, but we cannot really assume that that always happens.)
Fixes 6961.
As part of the link cleanup in issue 6854, obtaining this file through the Internet Archive didn't work.
However, given that the file was added in order to test an issue with `CropBox/MediaBox`, a reduced test-case should do just fine instead.
Please refer to issue 1155, and PR 1212.
It seems to be fairly common for OCR software to include incomplete TrueType fonts, notably missing the "glyf" table, in PDF files. Since we currently reject such fonts, the result is that text-selection/copying is broken.
This patch contains a suggested approach to try and use these kind of broken fonts, by using existing code in `sanitizeGlyphLocations` to replace a missing "glyf" table with dummy data.
Fixes 4684.
Fixes 6007.
Fixes 6829.
The test case was changed in 1faca19021 because the original file was not available anymore. However, its hash was also changed, meaning that we do not test the intended version anymore.
This patch makes sure that we test the intended version by reverting to
the original hash and using a link, also pointing to the Internet
Archive, with the original file.
This test was disabled in PR 4732, because the file was no longer available. The motivation being that there were two other files which should be good replacements. However, one of those has since been replaced with a reduced test-case (which doesn't exercise the same code-path), and in the other one the error does not appear to be entirely identical.
Hence it seems reasonable to re-add the 'aboutstacks.pdf' test, since it was possible to find it on the Internet Archive (by searching using a different URL, compared to the current one).
Note that despite the new file having a different hash than the current one, it does render *identically* and most importantly it uses *the same* JBIG2 functionality.
For reference, please see issue 3666 and PR 3738.
*This patch follows a similar idea as PR 5756.*
The patch is based on the nice debugging done by Brendan in the referenced issue 6782.
A better way to handle this, and similar issues, would probably be to completely ignore what the PDF file claims about font type/subtype, and just check the actual data. But until that kind of rewrite happens, this patch should help.
Fixes 6782.
When generating new references locally on Windows, after PR 6724, I get the following output:
```
WARNING: Unable to open file for reading "Error: ENOENT, open 'c:\Users\Jonas\Git\pdfjs\test\pdfs\issue_3694_reduced.pdf'".
Unable to verify the checksum for the files that are used for testing.
Please re-download the files, or adjust the MD5 checksum in the manifest for the files listed above.
```
Compared to the name of the file (`issue3694_reduced.pdf`), you see that the manifest entry has a superfluous underscore in the "file" entry.
Currently we're not applying Patterns for text, but only for graphics.
This patch is unfortunately not a complete solution, but rather a step on the way, since there are still some PDF files where the Patterns look more like a solid colour, rather than the intended gradient.
I've been unable to fix these issues completely, and I've not managed to determine if the remaining issues are caused either by the pattern code, the canvas code, or perhaps both.
However, given that even this simple patch improves the current situation quite a bit, I figured that it couldn't hurt to submit it as-is.
- Fixes 5804.
- Fixes 6130.
- Improves 3988 a lot, since the text is now visible. However, it looks like the text is *one* solid colour, instead of the correct gradient.
- Improves 5432, since the text is no longer gray. (This file also suffers from the same problem as the previous one.)
Most code for Popup annotations is already present for Text annotations.
This patch extracts the popup creation logic from the Text annotation
code so it can be reused for Popup annotations.
Not only does this add support for Popup annotations, the Text
annotation code is also considerably easier. If a `Popup` entry is
available for a Text annotation, it will not be more than an image. The
popup will be handled by the Popup annotation. However, it is also
possible for Text annotations to not have a separate Popup annotation,
in which case the Text annotation handles the popup creation itself.
This PDF file (see issue 4914) originally regressed in PR 4318, and was subsequently fixed in PR 4915.
I added the PDF file as a (linked) test-case in PR 6481, in an effort to prevent regressions. Since we at that time didn't have the necessary framework in place, in order to correctly test annotations, this almost regressed *again* in PR https://github.com/mozilla/pdf.js/pull/6672#issuecomment-158689392.
In that PDF file, some of the annotations are both printable and hidden, and should definitely *not* be visible on normal display. Hence this patch, which adds the `annotations` flag to the manifest in order to ensure that those annotations won't be rendered when `intent === 'display'`.
In `Font_checkAndRepair` we can decide that a font isn't TrueType, and instead parse it as CFF. In that case it's quite possible that the `fontMatrix` will be changed, and without calling `adjustWidths` we're failing to update the glyph widths correctly.
Fixes 5027.
Fixes 5084.
Fixes 6556.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1204903.
After PR 6590, `font.spaceWidth` is now called in more cases than before (in `PartialEvaluator_getTextContent`), which exposed an underlying issue with `IdentityToUnicodeMap_charCodeOf` throwing an error.
This breaks text-selection in some PDF files found in the wild, hence this patch replaces the `error` with an actual function instead (modelled after `IdentityCMap_charCodeOf`).
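The replacement might look along these lines (modelled after `IdentityCMap_charCodeOf`; the exact shape is assumed):
```
IdentityToUnicodeMap.prototype.charCodeOf =
    function IdentityToUnicodeMap_charCodeOf(v) {
  // For an identity map the Unicode value *is* the charcode, provided it
  // lies within the map's range; return -1 otherwise, instead of throwing.
  return (v >= this.firstChar && v <= this.lastChar) ? v : -1;
};
```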
*This is a regression from PR 3424.*
The PDF file in the referenced issue is using `Type3` fonts. In one of those, the `/CharProcs` dictionary contains an entry with the name `/#`. Before the changes to `Lexer_getName` in PR 3424, we were allowing certain invalid `Name` patterns containing the NUMBER SIGN (#).
It's unfortunate that this has been broken for close to two and a half years before the bug surfaced, but it should at least indicate that this is not a widespread issue.
Fixes 6692.
This patch goes a bit further than issue 6612 requires, and replaces all kinds of whitespace with standard spaces.
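A sketch of the normalization (the exact regular expression is assumed); in JavaScript, `\s` covers tabs, line breaks, and the various Unicode space characters:
```
const WhitespaceRegexp = /\s/g;

// Replace any kind of whitespace with a standard space.
function replaceWhitespace(str) {
  return str.replace(WhitespaceRegexp, ' ');
}
```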
When testing this locally, it actually seemed to slightly improve two existing test-cases (`tracemonkey-text` and `taro-text`).
Fixes 6612.