With PDF.js version `2.0` we'll only support browsers with built-in `TypedArray` functionality, hence there doesn't seem to be any good reason not to implement this now.
Fixes 4888.
The reason for the bug is that we're only computing a checksum of the image data itself, but completely ignore the inline dictionary. The latter is important, since in practice it's not uncommon for inline images to be identical but use e.g. different ColourSpaces.
There's obviously a couple of different ways that we could compute a hash/checksum of the dictionary.
Initially I tried using `MurmurHash3_64` to compute a hash of the keys/values in the dictionary. Unfortunately this approach turned out to be *way* too slow in practice, especially for PDF files with a huge number of inline images; in particular issue 2618 would regresses quite badly with this solution.
The solution that is instead implemented in this patch, is to compute a checksum of the dictionary contents. While this is a much simpler, not to mention a lot more efficient, solution there's one drawback associated with it:
If the contents of inline image dictionaries are ordered differently, they will not be considered equal with this approach which could thus lead to failures to cache repeated inline images. In practice this doesn't seem to be a problem in any of the PDF files I've tested, and generally I'd rather err on the side of *not* caching given that too aggressive caching can easily lead to rendering bugs.
One small, but somewhat annoying, complication is that by the time `Parser.makeInlineImage` is called, we no longer know the *exact* stream position where the inline image dictionary starts. Having access to that information is crucial here, and the easiest solution I could come up with is to track this in the current `Lexer` instance.[1]
With the patch, we're thus able to fix the referenced issues without incurring large regressions in problematic cases such as issue 2618.
Fixes 9398; also improves/fixes the `issue8823` reference test.
---
[1] Obviously I'd have preferred if this patch could be limited to `Parser.makeInlineImage`, without the need for this "hack", but I'm not sure what that'd look like here.
Fallback to the built-in JPEG decoder when browser decoding fails, and attempt to handle JPEG images with DNL (Define Number of Lines) markers (issue 8614)
Please refer to the specification, found at https://www.w3.org/Graphics/JPEG/itu-t81.pdf#page=49
Given how the JPEG decoder is currently implemented, we need to know the value of the scanLines parameter (among others) *before* parsing of the SOS (Start of Scan) data begins.
Hence the best solution I could come up with here, is to re-parse the image in the *hopefully* rare case of JPEG images that include a DNL (Define Number of Lines) marker.
Fixes 8614.
This works by making `PartialEvaluator.buildPaintImageXObject` wait for the success/failure of `loadJpegStream` on the API side *before* parsing continues.
Please note that in practice, it should be quite rare for the browser to fail loading/decoding of a JPEG image. In the general case, it should thus not be completely surprising if even `src/core/jpg.js` will fail to decode the image.
This method currently requires a fair number of parameters, which creates quite unwieldy call-sites. When invoking `buildPaintImageXObject`, you have to remember not only which arguments to supply, but also the correct order, to prevent run-time errors.
This commit is the first step for extracting a base class for the
`AES128Cipher` and the `AES256Cipher` classes. The objective here is to
make code changes (not altering the logic) to make the implementations
as similar as possible as found by creating a diff of both classes.
In particular, we extract the key size and cycles of repetitions
constants since they are different for AES-128 and AES-256. Moreover, we
rename functions to be similar.
In the `AES256Cipher` class, there was an additional assignment to
`this` in the decryption function. However, this was unnecessary because
the assignment would also be done when the loop was exited.
In the JPEG images in the referenced PDF file, the DHT (Define Huffman Tables) segments contain more data than expected based on the length parameter.
Fixes 9425.
This patch updates the `IPDFStreamReader` interface and ensures that the interface/implementation of `network.js`, `fetch_stream.js`, `node_stream.js`, and `transport_stream.js` all match properly.
The unit-tests are also adjusted, to more closely replicate the actual behaviour of the various actual `IPDFStreamReader` implementations.
Finally, this patch adjusts the use of the Content-Disposition filename when setting the title in the viewer, and adds `PDFDocumentProperties` support as well.
These were removed in PR 9170, since they were unused in the browsers that we'll support in PDF.js version `2.0`.
However looking at the output of Travis, where a subset of the unit-tests are run using Node.js, there's warnings about `btoa` being undefined. This doesn't appear to cause any errors, which probably explains why we didn't notice this before (despite PR 9201).
Please refer to the PDF specification, in particular http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G7.3801570
> A colour space shall be specified in one of two ways:
> - Within a content stream, the CS or cs operator establishes the current colour space parameter in the graphics state. The operand shall always be name object, which either identifies one of the colour spaces that need no additional parameters (DeviceGray, DeviceRGB, DeviceCMYK, or some cases of Pattern) or shall be used as a key in the ColorSpace subdictionary of the current resource dictionary (see 7.8.3, "Resource Dictionaries"). In the latter case, the value of the dictionary entry in turn shall be a colour space array or name. A colour space array shall never be inline within a content stream.
>
> - Outside a content stream, certain objects, such as image XObjects, shall specify a colour space as an explicit parameter, often associated with the key ColorSpace. In this case, the colour space array or name shall always be defined directly as a PDF object, not by an entry in the ColorSpace resource subdictionary. This convention also applies when colour spaces are defined in terms of other colour spaces.
This is something that I noticed while attempting to debug https://bugzilla.mozilla.org/show_bug.cgi?id=1374945.
Just looking at the code, the `YRsiz` parameter seemed immediately wrong and the fact that every component used the *same* data also looked strange.
Comparing with the specification, see https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.800-200208-S!!PDF-E&type=items#page=37, confirmed that this is indeed incorrect.
Note that I haven't got any example of a PDF file that is fixed by this patch, but that might be more luck than anything else. Manually checking a couple of files with included JPEG 2000 images, the `Csiz`/`XRsiz`/`YRsiz` parameters were `1` which could explain why this hasn't been an issue before.
Obviously we shouldn't generally make changes to `core` code without adding tests, but in this case I'm simply not sure how to obtain/create one. However, since the existing code doesn't make sense this patch could hopefully be deemed acceptable anyway.
Since multiple empty lines is virtually unused in the code-base, and the few cases that do exist look like "typos", let's enforce greater consistency here; please see https://eslint.org/docs/rules/no-multiple-empty-lines.
I've been looking into the remaining point in 8637 about blurry images, to see if we could perhaps improve the rendering quality slightly there. After quite a bit of debugging, it seems that the issue is limited to certain progressive JPEG images.
As mentioned previously, I've got no detailed knowledge of the JPEG format, but this patch does seem to improve things quite a bit for the images in question.
Squinting at https://searchfox.org/mozilla-central/rev/6c33dde6ca02b389c52e8db3d22494df8b916f33/media/libjpeg/jdphuff.c#492-639, it seems reasonable that we should take the sign of the data into account. Furthermore, looking at the specification in https://www.w3.org/Graphics/JPEG/itu-t81.pdf#page=118, the "F.2.4.3 Decoding the binary decision sequence for non-zero DC differences and AC coefficients" section even contains a description of this (even though I cannot claim to really understand the details).
The bug that this patch fixes is limited to the built-in JPEG decoder, and was unearthed by PR 9260. The underlying issue has existed since PR 6984, where the contents of this patch ought to have been included (if it weren't for the fact that we had no *easy* way to test `src/core/jpg.js` back then).
*Please note:* The slight movement in the reference test is a result of using the `src/core/jpg.js` decoder, rather than the native browser one.
Initially I just implemented the unit tests, but quickly found that they
were failing my expectation of having a size of 256 items. Some of them
did contain 256 items and some did not. I looked up various resources
and figured that they indeed all need to have 256 items. One of the good
resources is https://github.com/davidben/poppler/blob/master/poppler/FontEncodingTables.cc
Aside from some missing `notdef` (empty string) entries at the end of
the arrays, which I assume causes issues since it may cause
out-of-bounds array access which in JavaScript gives `undefined`, there
was a `notdef` entry missing in the `MacExpertEncoding`, causing the
entries after that to be shifted. This fix for this is similar to the
one in #8589.
The unit tests verify that, for known encoding names, the return value
is not only an array, but that it is also of the right length and
contains only strings.
The PDF file in the issue uses a number of *embedded* versions of Lucida fonts, but for some reason does *not* embed the LucidaSans-Demi font. According to https://en.wikipedia.org/wiki/Lucida#Usages that one should be bold, so we can at least improve rendering here (even though it won't look perfect).
Fixes 9291.
I recall being confused as to the purpose of the `encrypted` property all the way back when working on PR 4750.
Looking at the history, this property was added in PR 1698 when password support was added to the API/viewer. However, its only purpose seem to have been to facilitate the addition of a `isEncrypted` function in the API. That function never, as far as I can tell, saw any use and was unceremoniously removed in PR 4144.
Since we want to avoid sending all non-essential data early during initial document loading (e.g. PR 4750), it seems correct to get rid of the `encrypted` property. Especially since it hasn't even been exposed in the API for over three years, with no complaints that I'm aware of.
Finally note that the `encrypt` property on the `XRef` instance isn't tied to the code that's being removed here. Given that we're calling `PDFDocument.parse` during `createDocumentHandler` in the worker which, via `PDFDocument.setup`, calls `XRef.parse` where the `Encrypt` data (if it exists) is always parsed.
This patch refactors the searching for 'endobj', to try and find the next occurance of "obj" and then check if it was in fact an 'endobj' and continue searching otherwise.
This approach is used to avoid having to first find 'endobj', and then re-check the entire contents of the object and having to run (potentially expensive) regular expressions on arbitrary long strings.
Fixes 9105.
Note that no other image stream implements a special `getBytes` method, which makes `JpegStream` look somewhat odd.
I'm actually not sure what purpose this methods serves, since I successfully ran all tests locally with it commented out. Furhermore, I also ran tests with an added `if (length && length !== this.bufferLength) { throw new Error('length mismatch'); }` check, and didn't get a single test failure in that case either.
Looking at the history, it seems that this code originated back in PR 4528, but as far as I can tell there's no mention in either commit messages nor PR comments of why it was necessary to add a "special" `getBytes` function for the `JpegStream`.
My assumption is that there's a good reason why this method was added, e.g. to address a *specific* regression in one of the reference tests. However, I did check out commit 58f697f977 locally and ran tests with this method commented out, and there didn't seem to be any image-related failures in that case either!?
Hence I'm suggesting that we attempt to simplify this code slightly be removing this special `getBytes` method. However, please note that there's perhaps a *small* risk of regressions in an edge-case where we currently have insufficient test-coverage.
There's a number of issues with the fonts in the referenced PDF file. First of all, they contain broken `ToUnicode` data (`NUL` bytes all over the place). However even if you skip those, the `ToUnicode` data appears to contain nothing but a `IdentityH` CMap which won't help provide a proper glyph mapping.
The real issue actually turns out to be that the PDF file uses the "Calibri" font[1], but doesn't include any font files. Since that one isn't a standard font, and uses a fairly different CID to GID map compared to the standard fonts, we're not able to render the file even remotely correct.
To work around this, I'm thus proposing that we include a (incomplete) glyph map for Calibri, and fallback to the standard Helvetica font. Obviously this isn't going to look perfect, but it's really the best that we can hope to achieve given that the PDF file is missing the necessary font data.
Finally, please note that none of the PDF readers I've tried (Adobe Reader, PDFium in Chrome) were able to extract the text (which isn't very surprising, given the broken `ToUnicode` data).
Fixes 9195.
---
[1] According to Wikipedia, see https://en.wikipedia.org/wiki/Calibri, Calibri is (primarily) a Windows font.
In some fonts, the included `ToUnicode` data is incomplete causing text-selection to not work properly. For simple fonts that contain encoding data, we can manually build a `ToUnicode` map to attempt to improve things.
Please note that since we're currently using the `ToUnicode` data during glyph mapping, in an attempt to avoid rendering regressions, I purposely didn't want to amend to original `ToUnicode` data for this text-selection edge-case.
Instead, I opted for the current solution, which will (hopefully) give slightly better text-extraction results in PDF file with incomplete `ToUnicode` data.
According to the PDF specification, see [section 9.10.2](http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1873172):
> A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value.
> ...
Reading that paragraph literally, it doesn't seem too unreasonable to use *different* methods for different charcodes.
Fixes 8229.
The interface of all of the "image" streams look kind of weird, and I'm actually a bit surprised that there hasn't been any errors because of it.
For example: None of them actually implement `readBlock` methods, and it seems more luck that anything else that we're not calling `getBytes()` (without providing a length) for those streams, since that would trigger a code-path in `getBytes` that assumes `readBlock` to exist.
To address this long-standing issue, the `ensureBuffer` methods are thus renamed to `readBlock`. Furthermore, the new `ensureBuffer` methods are now no-ops.
Finally, this patch also replaces `var` with `let` in a number of places.
In the PDF file, the `ToUnicode` data first maps the hyphen correctly, and then *overwrites* it to point to the softhyphen instead. That one cannot be rendered in browsers, and an empty space thus appear instead.
Fixes 9084.
This patch makes use of the existing `ignoreErrors` property in `src/core/evaluator.js`, see PRs 8240 and 8441, thus allowing us to attempt to recovery as much as possible of a page even when it contains broken XObjects.
Fixes 8702.
Fixes 8704.
*Follow-up to PR 8909.*
This requires us to pass around `pdfFunctionFactory` to quite a lot of existing code, however I don't see another way of handling this while still guaranteeing that we can access `PDFFunction` as freely as in the old code.
Please note that the patch passes all tests locally (unit, font, reference), and I *very* much hope that we have sufficient test-coverage for the code in question to catch any typos/mistakes in the re-factoring.
The `inline` parameter is passed to a number of methods/functions in `PDFImage`, despite not actually being used. Its value is never checked, nor is it ever assigned to the current `PDFImage` instance (i.e. no `this.inline = inline` exists).
Looking briefly at the history of this code, I was also unable to find a point in time where `inline` was being used.
As far as I'm concerned, `inline` does nothing more than add clutter to already very unwieldy method/function signatures, hence why I'm proposing that we just remove it.
To further simplify call-sites using `PDFImage`/`NativeImageDecoder`, a number of methods/functions are changed to take Objects rather than a bunch of (somewhat) randomly ordered parameters.
I don't have a good example at hand right know, but I recall seeing custom deployments of PDF.js that bundle a *specific* version of the `build/pdf.js` file and then set `PDFJS.workerSrc` to point to https://mozilla.github.io/pdf.js/build/pdf.worker.js.
That practice seems really bad since, besides (obviously) causing unnecessary server load, it will very quickly result in a version mismatch between the `pdf.js` and `pdf.worker.js` files in those PDF.js deployments.
Such a version mismatch could easily lead to either breaking errors, or even worse slightly inconsistent behaviour for an API call (if the API -> Worker interface changes, which does happen from time to time).
To avoid the problems described above, I'm thus proposing that we enforce that the versions of the `pdf.js` and `pdf.worker.js` files must always match.
Looking at `ColorSpace.parseToIR`, it will do one of the following things when called:
1. Return a String.
2. Return an Array.
3. Throw a `FormatError`.
4. In one case, return the result of *another* `ColorSpace.parseToIR` call.
However, under no circumstances will it ever return an `AlternateCS` instance.
Since it's often useful to understand why code, which has become unused, existed in the first place, let's grab a hard hat and a shovel and start digging through the history of this code :-)
The current condition was introduced in commit c198ec4323, in PR 794, but it was actually already obsolete by that time.
The preceeding `instanceof SeparationCS` condition predates commit a7278b7fbc, in PR 700.
That condition was originally introduced all the way back in commit 4e3f87b60c, in PR 692. However, it was made obsolete by commit 9dcefe1efc, which is included in the very same PR!
Hence we're left with the conclusion that not only has this code be unused for *almost* six years, it was basically never used at all save for a few refactoring commits that're part of PR 692.
Bug 1392647 has a PDF where the default width of the font
is 0. It draws some charcodes that don't have glyphs, but
we were wrongly using the 1000 default width for these
charcodes causing some text to be overlapping.
(for issue #6289)
This does the same for 16 bit as the existing 8 bit tiff predictor code, an addition of the last word to this word.
The last two "& 0xFF" may or may not be needed, I see this isn't done in the 8 bit code, but I'm not a JS developer.
Currently `PDFFunction` is implemented (basically) like a class with only `static` methods. Since it's used directly in a number of different `src/core/` files, attempting to pass in `isEvalSupported` would result in code that's *very* messy, not to mention difficult to maintain (since *every* single `PDFFunction` method call would need to include a `isEvalSupported` argument).
Rather than having to wait for a possible re-factoring of `PDFFunction` that would avoid the above problems by design, it probably makes sense to at least set `isEvalSupported` globally for `PDFFunction`.
Please note that there's one caveat with this solution: If `PDFJS.getDocument` is used to open multiple files simultaneously, with *different* `PDFJS.isEvalSupported` values set before each call, then the last one will always win.
However, that seems like enough of an edge-case that we shouldn't have to worry about it. Besides, since we'll also test that `eval` is actually supported, it should be fine.
Fixes 5573.
When looking briefly at using `Number.isInteger`/`Number.isNan` rather than `isInt`/`isNaN`, I noticed that there's a couple of not entirely straightforward cases to consider.
At first I really couldn't understand why `parseInt` is being used like it is in `XRef.fetchUncompressed`, since the `num` and `gen` properties of an object reference should *always* be integers.
However, doing a bit of code archaeology pointed to PR 4348, and it thus seem that this was a very deliberate change. Since I didn't want to inadvertently introduce any regressions, I've kept the `parseInt` calls intact but moved them to occur *only* when actually necessary.[1]
Secondly, I noticed that there's a redundant `isCmd` check for an edge-case of broken operators. Since we're throwing a `FormatError` if `obj3` isn't a command, we don't need to repeat that check.
In practice, this patch could perhaps be considered as a micro-optimization, but considering that `XRef.fetchUncompressed` can be called *many* thousand times when loading larger PDF documents these changes at least cannot hurt.
---
[1] I even ran all tests locally, with an added `assert(Number.isInteger(obj1) && Number.isInteger(obj2));` check, and everything passed with flying colours.
However, since it appears that this was in fact necessary at one point, one possible explanation is that the failing test-case(s) have now been replaced by reduced ones.
Since this patch will now treat (some) `NUL` bytes as "ASCII", the number of `followingBytes` checked are thus increased to (hopefully) reduce the risk of introducing new false positives.
Fixes 8823.