Commit Graph

1267 Commits

Author SHA1 Message Date
Shikhar Agnihotri
43e003cf5c
removed parseJbig2 function 2018-01-20 19:49:06 +05:30
Jonas Jenwald
0e1b5589e7 Restore the btoa/atob polyfills for Node.js
These were removed in PR 9170, since they were unused in the browsers that we'll support in PDF.js version `2.0`.
However looking at the output of Travis, where a subset of the unit-tests are run using Node.js, there's warnings about `btoa` being undefined. This doesn't appear to cause any errors, which probably explains why we didn't notice this before (despite PR 9201).
2018-01-13 01:31:05 +01:00
Jonas Jenwald
d0c8992e8a Attempt to actually resolve ColourSpace names in accordance with the specification (issue 9285)
Please refer to the PDF specification, in particular http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G7.3801570

> A colour space shall be specified in one of two ways:
>  - Within a content stream, the CS or cs operator establishes the current colour space parameter in the graphics state. The operand shall always be name object, which either identifies one of the colour spaces that need no additional parameters (DeviceGray, DeviceRGB, DeviceCMYK, or some cases of Pattern) or shall be used as a key in the ColorSpace subdictionary of the current resource dictionary (see 7.8.3, "Resource Dictionaries"). In the latter case, the value of the dictionary entry in turn shall be a colour space array or name. A colour space array shall never be inline within a content stream.
>
> - Outside a content stream, certain objects, such as image XObjects, shall specify a colour space as an explicit parameter, often associated with the key ColorSpace. In this case, the colour space array or name shall always be defined directly as a PDF object, not by an entry in the ColorSpace resource subdictionary. This convention also applies when colour spaces are defined in terms of other colour spaces.
2018-01-10 20:20:43 +01:00
Jonas Jenwald
d6c028b946 Add support for TrueType Collection fonts (issue 9262)
The specification can be found at https://www.microsoft.com/typography/otspec/otff.htm, under the "Font Collections" heading.

Fixes 9262.
2018-01-08 22:31:08 +01:00
Tim van der Meij
6b2ed504b7
Merge pull request #9336 from Snuffleupagus/jpx-SIZ
Correctly extract component data from "Image and tile size" (SIZ) markers in JPEG 2000 images
2018-01-03 23:34:34 +01:00
Jonas Jenwald
873556865b Correctly extract component data from "Image and tile size" (SIZ) markers in JPEG 2000 images
This is something that I noticed while attempting to debug https://bugzilla.mozilla.org/show_bug.cgi?id=1374945.
Just looking at the code, the `YRsiz` parameter seemed immediately wrong and the fact that every component used the *same* data also looked strange.
Comparing with the specification, see https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.800-200208-S!!PDF-E&type=items#page=37, confirmed that this is indeed incorrect.

Note that I haven't got any example of a PDF file that is fixed by this patch, but that might be more luck than anything else. Manually checking a couple of files with included JPEG 2000 images, the `Csiz`/`XRsiz`/`YRsiz` parameters were `1` which could explain why this hasn't been an issue before.

Obviously we shouldn't generally make changes to `core` code without adding tests, but in this case I'm simply not sure how to obtain/create one. However, since the existing code doesn't make sense this patch could hopefully be deemed acceptable anyway.
2018-01-03 16:26:28 +01:00
Jonas Jenwald
2db75a2a3a Update the ESLint dependencies, and also tweak the no-multiple-empty-lines rules
Since multiple empty lines is virtually unused in the code-base, and the few cases that do exist look like "typos", let's enforce greater consistency here; please see https://eslint.org/docs/rules/no-multiple-empty-lines.
2018-01-03 13:32:57 +01:00
Jonas Jenwald
c5700211d6 Adjust decodeACSuccessive in src/core/jpg.js to improve the rendering quality of (progressive) JPEG images
I've been looking into the remaining point in 8637 about blurry images, to see if we could perhaps improve the rendering quality slightly there. After quite a bit of debugging, it seems that the issue is limited to certain progressive JPEG images.

As mentioned previously, I've got no detailed knowledge of the JPEG format, but this patch does seem to improve things quite a bit for the images in question.
Squinting at https://searchfox.org/mozilla-central/rev/6c33dde6ca02b389c52e8db3d22494df8b916f33/media/libjpeg/jdphuff.c#492-639, it seems reasonable that we should take the sign of the data into account. Furthermore, looking at the specification in https://www.w3.org/Graphics/JPEG/itu-t81.pdf#page=118, the "F.2.4.3 Decoding the binary decision sequence for non-zero DC differences and AC coefficients" section even contains a description of this (even though I cannot claim to really understand the details).
2017-12-30 15:24:09 +01:00
Jonas Jenwald
d6eed132e5 Correct the indentation in the switch statement in decodeACSuccessive in src/core/jpg.js 2017-12-30 15:22:30 +01:00
Jonas Jenwald
8c4b7d0439 Avoid truncating JPEG images with DeviceGray ColourSpaces when using the src/core/jpg.js built-in decoder
The bug that this patch fixes is limited to the built-in JPEG decoder, and was unearthed by PR 9260. The underlying issue has existed since PR 6984, where the contents of this patch ought to have been included (if it weren't for the fact that we had no *easy* way to test `src/core/jpg.js` back then).

*Please note:* The slight movement in the reference test is a result of using the `src/core/jpg.js` decoder, rather than the native browser one.
2017-12-29 18:44:07 +01:00
Jonas Jenwald
ec21bd9626
Merge pull request #9314 from timvandermeij/encodings
Implement unit tests for the encodings and fix missing items
2017-12-27 22:02:38 +01:00
Tim van der Meij
c7af2db2ec
Implement unit tests for the encodings and fix missing items
Initially I just implemented the unit tests, but quickly found that they
were failing my expectation of having a size of 256 items. Some of them
did contain 256 items and some did not. I looked up various resources
and figured that they indeed all need to have 256 items. One of the good
resources is https://github.com/davidben/poppler/blob/master/poppler/FontEncodingTables.cc

Aside from some missing `notdef` (empty string) entries at the end of
the arrays, which I assume causes issues since it may cause
out-of-bounds array access which in JavaScript gives `undefined`, there
was a `notdef` entry missing in the `MacExpertEncoding`, causing the
entries after that to be shifted. This fix for this is similar to the
one in #8589.

The unit tests verify that, for known encoding names, the return value
is not only an array, but that it is also of the right length and
contains only strings.
2017-12-24 18:14:40 +01:00
Jonas Jenwald
d4cd44fd16 Add a fallback for non-embedded LucidaSans-Demi fonts (issue 9291)
The PDF file in the issue uses a number of *embedded* versions of Lucida fonts, but for some reason does *not* embed the LucidaSans-Demi font. According to https://en.wikipedia.org/wiki/Lucida#Usages that one should be bold, so we can at least improve rendering here (even though it won't look perfect).

Fixes 9291.
2017-12-24 17:36:58 +01:00
Jonas Jenwald
e58f2f513a [api-major] Remove the unused encrypted property from the pdfInfo object sent from the worker via the GetDoc message
I recall being confused as to the purpose of the `encrypted` property all the way back when working on PR 4750.

Looking at the history, this property was added in PR 1698 when password support was added to the API/viewer. However, its only purpose seem to have been to facilitate the addition of a `isEncrypted` function in the API. That function never, as far as I can tell, saw any use and was unceremoniously removed in PR 4144.

Since we want to avoid sending all non-essential data early during initial document loading (e.g. PR 4750), it seems correct to get rid of the `encrypted` property. Especially since it hasn't even been exposed in the API for over three years, with no complaints that I'm aware of.

Finally note that the `encrypt` property on the `XRef` instance isn't tied to the code that's being removed here. Given that we're calling `PDFDocument.parse` during `createDocumentHandler` in the worker which, via `PDFDocument.setup`, calls `XRef.parse` where the `Encrypt` data (if it exists) is always parsed.
2017-12-21 13:10:23 +01:00
Jonas Jenwald
1dc54ddb40 Handle PDF files with missing 'endobj' operators, by searching for the "obj" string rather than "endobj" in XRef.indexObjects (issue 9105)
This patch refactors the searching for 'endobj', to try and find the next occurance of "obj" and then check if it was in fact an 'endobj' and continue searching otherwise.
This approach is used to avoid having to first find 'endobj', and then re-check the entire contents of the object and having to run (potentially expensive) regular expressions on arbitrary long strings.

Fixes 9105.
2017-12-18 13:17:45 +01:00
Tim van der Meij
6bbe91079b
Merge pull request #9272 from nveenjain/fix/8846
Replaced occurence of `throw new Error` with `unreachable`
2017-12-15 22:11:32 +01:00
Brendan Dahl
9b51cea724 Fix loca table when offsets aren't in ascending order. 2017-12-15 11:20:28 -06:00
Naveen Jain
1135674647 Replaced occurence of throw new Error with unreachable where applicable 2017-12-14 12:58:50 +05:30
Jonas Jenwald
84de1e9a92 Attempt to remove the special JpegStream.getBytes method and utilize the regular DecodeStream one instead
Note that no other image stream implements a special `getBytes` method, which makes `JpegStream` look somewhat odd.

I'm actually not sure what purpose this methods serves, since I successfully ran all tests locally with it commented out. Furhermore, I also ran tests with an added `if (length && length !== this.bufferLength) { throw new Error('length mismatch'); }` check, and didn't get a single test failure in that case either.

Looking at the history, it seems that this code originated back in PR 4528, but as far as I can tell there's no mention in either commit messages nor PR comments of why it was necessary to add a "special" `getBytes` function for the `JpegStream`.
My assumption is that there's a good reason why this method was added, e.g. to address a *specific* regression in one of the reference tests. However, I did check out commit 58f697f977 locally and ran tests with this method commented out, and there didn't seem to be any image-related failures in that case either!?

Hence I'm suggesting that we attempt to simplify this code slightly be removing this special `getBytes` method. However, please note that there's perhaps a *small* risk of regressions in an edge-case where we currently have insufficient test-coverage.
2017-12-10 13:31:08 +01:00
Brendan Dahl
af1d80d45e
Merge pull request #9230 from Snuffleupagus/issue-9195
Add basic support for non-embedded Calibri fonts (issue 9195)
2017-12-08 10:15:43 -08:00
Jonas Jenwald
a5e3261b48
Merge pull request #9062 from mozilla/no_high
Move char codes from high surrogate pair range into private use.
2017-12-08 12:31:22 +01:00
Brendan Dahl
306999c325 Move char codes from high surrogate pair range into private use.
Fixes #2884
2017-12-07 10:35:50 -08:00
Jonas Jenwald
08de655177 Add basic support for non-embedded Calibri fonts (issue 9195)
There's a number of issues with the fonts in the referenced PDF file. First of all, they contain broken `ToUnicode` data (`NUL` bytes all over the place). However even if you skip those, the `ToUnicode` data appears to contain nothing but a `IdentityH` CMap which won't help provide a proper glyph mapping.

The real issue actually turns out to be that the PDF file uses the "Calibri" font[1], but doesn't include any font files. Since that one isn't a standard font, and uses a fairly different CID to GID map compared to the standard fonts, we're not able to render the file even remotely correct.
To work around this, I'm thus proposing that we include a (incomplete) glyph map for Calibri, and fallback to the standard Helvetica font. Obviously this isn't going to look perfect, but it's really the best that we can hope to achieve given that the PDF file is missing the necessary font data.

Finally, please note that none of the PDF readers I've tried (Adobe Reader, PDFium in Chrome) were able to extract the text (which isn't very surprising, given the broken `ToUnicode` data).

Fixes 9195.

---

[1] According to Wikipedia, see https://en.wikipedia.org/wiki/Calibri, Calibri is (primarily) a Windows font.
2017-12-03 17:23:33 +01:00
Jonas Jenwald
f3c50fe2f9
Merge pull request #9192 from Snuffleupagus/issue-8229
Build a fallback `ToUnicode` map for simple fonts (issue 8229)
2017-11-30 10:27:32 +01:00
Jani Pehkonen
06d083b04b Fix pattern-filled text 2017-11-28 19:40:22 +02:00
Jonas Jenwald
61e19bee43 Build a fallback ToUnicode map for simple fonts (issue 8229)
In some fonts, the included `ToUnicode` data is incomplete causing text-selection to not work properly. For simple fonts that contain encoding data, we can manually build a `ToUnicode` map to attempt to improve things.

Please note that since we're currently using the `ToUnicode` data during glyph mapping, in an attempt to avoid rendering regressions, I purposely didn't want to amend to original `ToUnicode` data for this text-selection edge-case.
Instead, I opted for the current solution, which will (hopefully) give slightly better text-extraction results in PDF file with incomplete `ToUnicode` data.

According to the PDF specification, see [section 9.10.2](http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1873172):

> A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value.
> ...

Reading that paragraph literally, it doesn't seem too unreasonable to use *different* methods for different charcodes.

Fixes 8229.
2017-11-26 14:45:15 +01:00
Tim van der Meij
0fe80df2a7
Button widget annotations: implement support for pushbuttons 2017-11-26 14:09:48 +01:00
Jonas Jenwald
ffbfc3c2a7 Refactor the building of ToUnicode maps for simple fonts a helper method 2017-11-26 13:30:29 +01:00
Tim van der Meij
25b07812b9
Sanitize the display value for choice widget annotations 2017-11-18 20:37:27 +01:00
Tim van der Meij
ae07adf143
Merge pull request #9073 from Snuffleupagus/image-streams-fixes
Fix the interface of `JpegStream`/`JpxStream`/`Jbig2Stream` to agree with the other `DecodeStream`s
2017-11-17 23:26:36 +01:00
Tim van der Meij
9686f6652c
Merge pull request #9089 from yurydelendik/rm-chunks
Extracts OperatorList class and prepares for streaming
2017-11-13 23:35:40 +01:00
Jonas Jenwald
de5297b9ea Fix the interface of JpegStream/JpxStream/Jbig2Stream to agree with the other DecodeStreams
The interface of all of the "image" streams look kind of weird, and I'm actually a bit surprised that there hasn't been any errors because of it.
For example: None of them actually implement `readBlock` methods, and it seems more luck that anything else that we're not calling `getBytes()` (without providing a length) for those streams, since that would trigger a code-path in `getBytes` that assumes `readBlock` to exist.

To address this long-standing issue, the `ensureBuffer` methods are thus renamed to `readBlock`. Furthermore, the new `ensureBuffer` methods are now no-ops.
Finally, this patch also replaces `var` with `let` in a number of places.
2017-11-11 11:22:16 +01:00
Jonas Jenwald
36593d6bbc Move JpegStream and JpxStream to their own files 2017-11-11 11:22:16 +01:00
Yury Delendik
5fa56f6a9d For backwards compatibility: use addOp amount instead of queue size. 2017-11-09 18:46:48 -06:00
Yury Delendik
877c2d7743 Changing QueueOptimizer to be more iterative. 2017-11-09 18:46:48 -06:00
Max Schaefer
3ae37d1b06 Remove a few useless assignments. 2017-11-03 11:36:48 +00:00
Max Schaefer
bc8f673522 Remove spurious arguments to NullStream constructor. 2017-11-03 10:14:32 +00:00
Max Schaefer
3ab1a9922a Rearrange a few declarations so that they precede their uses. 2017-11-03 10:14:32 +00:00
Jonas Jenwald
2dbd3f2603 [api-major] Remove the TypedArray polyfills 2017-11-01 10:31:28 +01:00
Brendan Dahl
b46443f0c1
Merge pull request #9077 from yurydelendik/v2
Version 2.0 merge
2017-10-31 14:24:20 -07:00
Jonas Jenwald
83e8398ff2 For non-embedded fonts, map softhyphen (0x00AD) to regular hyphen (0x002D) (issue 9084)
In the PDF file, the `ToUnicode` data first maps the hyphen correctly, and then *overwrites* it to point to the softhyphen instead. That one cannot be rendered in browsers, and an empty space thus appear instead.

Fixes 9084.
2017-10-31 13:26:04 +01:00
Jonas Jenwald
92fcfce685
Merge pull request #9082 from brendandahl/issue7562
Overwrite glyphs contour count if it's less than -1.
2017-10-30 20:44:01 +01:00
Yury Delendik
85f544f55a Moves OperatorList and QueueOptimizer into separate file. 2017-10-30 13:29:58 -05:00
Brendan Dahl
17037b5e51 Overwrite glyphs contour count if it's less than -1.
The test pdf has a contour count of -70, but OTS doesn't
like values less than -1.

Fixes issue #7562.
2017-10-30 09:16:51 -07:00
Yury Delendik
b4e25fb2e8 Merge remote-tracking branch 'mozilla/version-2.0' into v2 2017-10-27 14:01:45 -05:00
Jonas Jenwald
5e627810e4 Use stringToBytes in more places
Rather than having (basically) verbatim copies of `stringToBytes` in a few places, we can simply use the helper function directly instead.
2017-10-26 11:01:13 +02:00
Jonas Jenwald
e94a0fd4e7 Extract the actual decoding in CCITTFaxStream into a new CCITTFaxDecoder "class", which the new CCITTFaxStream depends on 2017-10-24 16:03:08 +02:00
Jonas Jenwald
bb35095083 Move CCITTFaxStream and Jbig2Stream, from src/core/stream.js, to separate files 2017-10-24 12:00:40 +02:00
Jonas Jenwald
d71a576b30 Merge pull request #9045 from brendandahl/sani-name
Sanitize name index in compile phase of CFF.
2017-10-24 11:48:03 +02:00
Brendan Dahl
6b12612a52 Sanitize name index in compile phase of CFF.
Fixes #8960
2017-10-23 17:13:49 -07:00