As part of attempting to fix a number issues containing PDF documents with corrupt XRef tables, I'd like to improve the reference test-coverage slightly *first*.
Obviously this will increase the runtime of the tests a bit, however I'd rather "waste" resources on the bots instead of developer time fixing regressions which could have been avoided.
*Please note:* I've been thinking about possible ways of addressing this issue for a while now, but all of the solutions I came up with became too complicated and thus hurt readability of the code.
However, it occured to me that we're essentially trying to add a heuristic *on top* of another heuristic, and that it shouldn't matter how efficient the code is as long as it works.
In the PDF file in the issue the Encoding contains glyphNames of the `Cdd` format, which our existing heuristics will treat as base 10 values. However, in this particular file they actually contain base 16 values, which we thus attempt to detect and fix such that text-selection works.
Hopefully this patch makes sense, and in order to reduce the regression risk the implementation ensures that only completely missing widths are being replaced.
This is based on a real-world PDF file I encountered very recently[1], although I'm currently unable to recall where I saw it.
Note that different PDF viewers handle these sort of errors differently, with Adobe Reader outright failing to render the attached PDF file whereas PDFium mostly handles it "correctly".
The patch makes the following notable changes:
- Refactor the `cropBox` and `mediaBox` getters, on the `Page`, to reduce unnecessary duplication. (This will also help in the future, if support for extracting additional page bounding boxes are added to the API.)
- Ensure that the page bounding boxes, i.e. `cropBox` and `mediaBox`, are never empty to prevent issues/weirdness in the viewer.
- Ensure that the `view` getter on the `Page` will never return an empty intersection of the `cropBox` and `mediaBox`.
- Add an *optional* parameter to `Util.intersect`, to allow checking that the computed intersection isn't actually empty.
- Change `Util.intersect` to have consistent return types, since Arrays are of type `Object` and falling back to returning a `Boolean` thus seem strange.
---
[1] In that case I believe that only the `cropBox` was empty, but it seemed like a good idea to attempt to fix a bunch of related cases all at once.
This patch will not incur any (measurable) overhead, since the glyphlist is already quite long and one more entry won't really matter, which is important given that this sort of PDF corruption ought to be very rare.
Furthermore, this patch purposely does *not* add a bunch of similarly modified ligature names on pure speculation. Any similar additions, for other ligatures, should only be made if there's real-world examples of PDF files where that's actually necessary.
The border `width` will instead fallback to the default value of `1`, rather than ignoring it altoghether, to also ensure that e.g. `LinkAnnotation`s become clickable as intended.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1552113
First of all, while this simple approach appears to work OK in practice I'm not sure if it's the best way of addressing the problem (assuming that you even want to).
Second of all, while the solution implemented here only requires tracking/checking one new boolean in order for this to work, I'm nonetheless not entirely happy about this since it will add additional overhead (albeit *very* small) to the parsing of path operators in PDF documents just for a handful of *corrupt* ones.
This transform resulted in an incorrectly positioned object when the
bounding box's upper-left corner did not start at (0,0), because
the translation was not reverted. This patch adds the missing transform.
The test file (tiling-pattern-box.pdf) is based on the PDF from #2825.
All but the first cube (including the PDF data) have been removed.
To trigger the bug that is fixed by this commit, I changed the BBox of
the first pattern from "[ 0 0 596 842]" to "[90 0 596 842]". Without
this patch, the dashed vertical line that intersects the corners at A
and E would disappear.
The new test file (tiling-pattern-large-steps.pdf) was manually created,
to have the following characteristics:
- Large xstep and ystep (90000)
- Page width is 4000 (which is larger than MAX_PATTERN_SIZE)
- Visually, the page consists of a red rectangle with a black border,
surrounded by a 50 unit white padding.
- Before patch: blurry; After patch: sharp
Fixes#6496Fixes#5698Fixes#1434Fixes#2825
Without this some fonts may incorrectly end up with matching `hash`es, thus breaking rendering since we'll not actually try to load/parse some of the fonts.
For Type3 fonts text-selection is often not that great, and there's a couple of heuristics used to try and improve things. This patch simple extends those heuristics a bit, and fixes a pre-existing "naive" array comparison, but this all feels a bit brittle to say the least.
The existing Type3 test-coverage isn't that great in general, and in particular Type3 `text` tests are few and far between, hence why this patch adds *two* different new `text` tests.
*Hopefully this patch makes sense, since I cannot claim to fully understand this function.*
With the changes made in PR 3354 *some* Type3 glyph outlines are no longer rendering correctly, since the final paths were being accidentally ignored.
The fact that Type3 fonts are not very common in PDF documents, and that most Type3 glyphs are unaffected by this regression, probably explains why this has gone unnoticed since 2013.
pdf.js had a problem when copying characters on supplementary planes
(0xPPXXXX where PP is nonzero). This is because certain methods of
PartialEvaluator use classic String.fromCharCode instead of ES6's
String.fromCodePoint.
Despite the fact that readToUnicode method *tried* to parse out-of-UCS2
code points by parsing UTF-16BE, it was inadequate because
String.fromCharCode only supports UCS-2 range of Unicode.
There have been lots of problems with trying to map glyphs to their unicode
values. It's more reliable to just use the private use areas so the browser's
font renderer doesn't mess with the glyphs.
Using the private use area for all glyphs did highlight other issues that this
patch also had to fix:
* small private use area - Previously, only the BMP private use area was used
which can't map many glyphs. Now, the (much bigger) PUP 16 area can also be
used.
* glyph zero not shown - Browsers will not use the glyph from a font if it is
glyph id = 0. This issue was less prevalent when we mapped to unicode values
since the fallback font would be used. However, when using the private use
area, the glyph would not be drawn at all. This is illustrated in one of the
current test cases (issue #8234) where there's an "ä" glyph at position
zero. The PDF looked like it rendered correctly, but it was actually not
using the glyph from the font. To properly show the first glyph it is always
duplicated and appended to the glyphs and the maps are adjusted.
* supplementary characters - The private use area PUP 16 is 4 bytes, so
String.fromCodePoint must be used where we previously used
String.fromCharCode. This is actually an issue that should have been fixed
regardless of this patch.
* charset - Freetype fails to load fonts when the charset size doesn't match
number of glyphs in the font. We now write out a fake charset with the
correct length. This also brought up the issue that glyphs with seac/endchar
should only ever write a standard charset, but we now write a custom one.
To get around this the seac analysis is permanently enabled so those glyphs
are instead always drawn as two glyphs.
Please note that while this *improves* issue 9984 slightly (and likely others too), it's not a complete solution.
The remaining issues are related to the, more general, problems with the existing heuristics related to attempting to combine separate text items.
The font in the PDF is marked as a CIDFontType0, but the font file is
actually a true type font. To fully address this issue we should really
peek into the font file and try to determine what it is. However, this
is the first case of this issue, so I think this solution is acceptable for
now.
*I was feeling bored; so this is a very quick, and somewhat naive, attempt at fixing the bug.*
The breaking error, i.e. `Error during font loading: invalid array length`, was thrown when attempting to re-size the `stack` to a *negative* length when parsing the CALL functions.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1473809.
This patch avoids choosing a (possible) 'trailer' dictionary that `XRef.parse` and/or the `Catalog` constructor/methods will reject anyway.
Since `XRef.indexObjects` is already parsing the entire PDF file, the extra dictionary look-ups added here shouldn't matter much. Besides, this is a fallback code-path that only applies to corrupt PDF files anyway.
I've been looking into the remaining point in 8637 about blurry images, to see if we could perhaps improve the rendering quality slightly there. After quite a bit of debugging, it seems that the issue is limited to certain progressive JPEG images.
As mentioned previously, I've got no detailed knowledge of the JPEG format, but this patch does seem to improve things quite a bit for the images in question.
Squinting at https://searchfox.org/mozilla-central/rev/6c33dde6ca02b389c52e8db3d22494df8b916f33/media/libjpeg/jdphuff.c#492-639, it seems reasonable that we should take the sign of the data into account. Furthermore, looking at the specification in https://www.w3.org/Graphics/JPEG/itu-t81.pdf#page=118, the "F.2.4.3 Decoding the binary decision sequence for non-zero DC differences and AC coefficients" section even contains a description of this (even though I cannot claim to really understand the details).
The PDF file in the issue uses a number of *embedded* versions of Lucida fonts, but for some reason does *not* embed the LucidaSans-Demi font. According to https://en.wikipedia.org/wiki/Lucida#Usages that one should be bold, so we can at least improve rendering here (even though it won't look perfect).
Fixes 9291.
This patch refactors the searching for 'endobj', to try and find the next occurance of "obj" and then check if it was in fact an 'endobj' and continue searching otherwise.
This approach is used to avoid having to first find 'endobj', and then re-check the entire contents of the object and having to run (potentially expensive) regular expressions on arbitrary long strings.
Fixes 9105.
In some fonts, the included `ToUnicode` data is incomplete causing text-selection to not work properly. For simple fonts that contain encoding data, we can manually build a `ToUnicode` map to attempt to improve things.
Please note that since we're currently using the `ToUnicode` data during glyph mapping, in an attempt to avoid rendering regressions, I purposely didn't want to amend to original `ToUnicode` data for this text-selection edge-case.
Instead, I opted for the current solution, which will (hopefully) give slightly better text-extraction results in PDF file with incomplete `ToUnicode` data.
According to the PDF specification, see [section 9.10.2](http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1873172):
> A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value.
> ...
Reading that paragraph literally, it doesn't seem too unreasonable to use *different* methods for different charcodes.
Fixes 8229.
In the PDF file, the `ToUnicode` data first maps the hyphen correctly, and then *overwrites* it to point to the softhyphen instead. That one cannot be rendered in browsers, and an empty space thus appear instead.
Fixes 9084.
This patch makes use of the existing `ignoreErrors` property in `src/core/evaluator.js`, see PRs 8240 and 8441, thus allowing us to attempt to recovery as much as possible of a page even when it contains broken XObjects.
Fixes 8702.
Fixes 8704.
Bug 1392647 has a PDF where the default width of the font
is 0. It draws some charcodes that don't have glyphs, but
we were wrongly using the 1000 default width for these
charcodes causing some text to be overlapping.
Rather than displaying links that does *nothing* when clicked, it probably makes more sense to simply not render them instead. Especially since it turns out that, at least at this point in time, this is *very* easy to both implement and test.
Fixes 3897.
Since this patch will now treat (some) `NUL` bytes as "ASCII", the number of `followingBytes` checked are thus increased to (hopefully) reduce the risk of introducing new false positives.
Fixes 8823.
In issue #8707, there's a char code mapped to a non-
existing glyph which shouldn't be drawn. However, we
saw it was missing and tried to then use the post table and
end up mapping it incorrectly.
This illuminated a problem with issue #5704 and bug
893730 where glyphs disappeared after above fix. This was
from the cmap returning the wrong glyph id. Which in turn was
caused because the font had multiple of the same type of cmap
table and we were choosing the last one. Now, we instead
default to the first one. I'm unsure if we should instead be
merging the multiple cmaps, but using only the first one works.
The PDF file uses a non-embedded SegoeUISymbol font, which is *not* a standard font (and is mainly used by Microsoft, see https://en.wikipedia.org/wiki/Segoe).
Fixes 8697.
Issue 7696 was one of the issues fixed by PR 8580. The other ones were all cases of missing glyphs, however in this particular one glyphs did render but every single one was incorrect.
Hence it probably cannot hurt to have a small, reduced, reference test for that PDF file as well.
Use the environment's zlib implementation if available to get
reasonably-sized SVG files when an XObject image is converted to PNG.
The generated PNG is not optimal because we do not use a PNG predictor.
Futher, when our SVG backend is run in a browser, the generated PNG
images will still be unnecessarily large (though the use of blob:-URLs
when available should reduce the impact on memory usage). If we want to
optimize PNG images in browsers too, we can either try to use a DEFLATE
library such as pako, or re-use our XObject image painting logic in
src/display/canvas.js. This potential improvement is not implemented by
this commit
Tested with:
- Node.js 8.1.3 (uses zlib)
- Node.js 0.11.12 (uses zlib)
- Node.js 0.10.48 (falls back to inferior existing implementation).
- Chrome 59.0.3071.86
- Firefox 54.0
Tests:
Unit test on Node.js:
```
$ gulp lib
$ JASMINE_CONFIG_PATH=test/unit/clitests.json node ./node_modules/.bin/jasmine --filter=SVG
```
Unit test in browser: Run `gulp server` and open
http://localhost:8888/test/unit/unit_test.html?spec=SVGGraphics
To verify that the patch works as desired,
```
$ node examples/node/pdf2svg.js test/pdfs/xobject-image.pdf
$ du -b svgdump/xobject-image-1.svg
# ^ Calculates the file size. Confirm that the size is small
# (784 instead of 80664 bytes).
```
In the PDF from issue 8527, the clip operator (W) shows up before a path
is defined. The current SVG backend however expects a path to exist
before generating a `<svg:clipPath>` element.
In the example, the path was defined after the clip, followed by a
endPath operator (n).
So this commit fixes the bug by moving the path generation logic from
clip to endPath.
Our canvas backend appears to use similar logic:
`CanvasGraphics_endPath` calls `consumePath`, which in turn draws the
clip and resets the `pendingClip` state. The canvas backend calls
`consumePath` from multiple other places, so we probably need to check
whether doing so is also necessary for the SVG backend.
I scanned our corpus of PDF files in test/pdfs, and found that in every
instance (except for one), the "W" PDF operator (clip) is immediately
followed by "n" (endPath). The new test from this commit (clippath.pdf)
starts with "W", followed by a path definition and then "n".
# Commands used to find some of the clipping commands:
grep -ra '^W$' -C7 | less -S
grep -ra '^W ' -C7 | less -S
grep -ra ' W$' -C7 | less -S
test/pdfs/issue6413.pdf is the only file where "W" (a tline 55) is not
followed by "n". In fact, the "W" is the last operation of a series of
XObject painting operations, and removing it does not have any effect
on the rendered PDF (confirmed by looking at the output of PDF.js's
canvas backend, and ImageMagick's convert command).
*As mentioned the last time that I touched this particular part of the font code, I'm sincerely hope that this doesn't cause any regressions!*
However, the patch passes all tests added in PRs 5770, 6270, and 7904 (and obviously all other tests as well). Furthermore, I've manually checked all the issues/bugs referenced in those PRs without finding any issues.
Fixes 8480.
[api-minor] Always allow e.g. rendering to continue even if there are errors, and add a `stopAtErrors` parameter to `getDocument` to opt-out of this behaviour (issue 6342, issue 3795, bug 1130815)
This patch implements support for line annotations. Other viewers only
show the popup annotation when hovering over the line, which may have
any orientation. To make this possible, we render an invisible line (SVG
element) over the line on the canvas that acts as the trigger for the
popup annotation. This invisible line has the same starting coordinates,
ending coordinates and width of the line on the canvas.
Other PDF readers, e.g. Adobe Reader and PDFium (in Chrome), will attempt to render as much of a page as possible even if there are errors present.
Currently we just bail as soon the first error is hit, which means that we'll usually not render anything in these cases and just display a blank page instead.
NOTE: This patch changes the default behaviour of the PDF.js API to always attempt to recover as much data as possible, even when encountering errors during e.g. `getOperatorList`/`getTextContent`, which thus improve our handling of corrupt PDF files and allow the default viewer to handle errors slightly more gracefully.
In the event that an API consumer wishes to use the old behaviour, where we stop parsing as soon as an error is encountered, the `stopAtErrors` parameter can be set at `getDocument`.
Fixes, inasmuch it's possible since the PDF files are corrupt, e.g. issue 6342, issue 3795, and [bug 1130815](https://bugzilla.mozilla.org/show_bug.cgi?id=1130815) (and probably others too).
It appears that I accidentally broke this in PR 6065, sorry about that!
The issue in this particular PDF file is that there's `/Rotate` entries on different levels of the `/Pages` tree. We're supposed to use the `/Rotate` entry in the `/Page` dict (which is `0`), but because of an incorrect condition we instead ended up with the one from the `/Pages` dict (which is `180`).
Fixes 8125.