According to the PDF spec 5.3.2, a positive value means in horizontal,
that the next glyph is further to the left (so narrower), and in
vertical that it is further down (so wider).
This change fixes the way PDF.js has interpreted the value.
This patch adjusts `get fingerprint` to also check that the `/ID` entry contains (non-empty) strings, to prevent more possible failures when loading corrupt PDF files (follow-up to PR 5602).
Note that I've not actually encountered such a PDF file in the wild. However given that `stringToBytes` will assert that the input is a string, and that we'll thus fail to load a document unless `get fingerprint` succeeds, making this more robust seems like a good idea to me.
For (1, 0) cmaps, we have two different codepaths depending on whether the font has/hasn't got an encoding. But with (3, 1) cmaps we don't have a good fallback when the encoding is missing, hence this patch changes `readCmapTable` to only choose a (3, 1) cmap table if the font is non-symbolic *and* an encoding exists. Without this, we'll not be able to successfully create a working glyph map for some TrueType fonts with (3, 1) cmap tables.
Fixes 6410.
This code was added in PR 1214, but was made obsolete by PRs 1488/1493. Prior to the latter ones, `Dict_get` retured the raw objects. However, afterwards (and currently) `Dict_get` now resolves indirect objects, which makes `Parser_fetchIfRef` redundant.
*Potential risks with this patch:*
This patch passes all tests locally, but there's a *small* possibility that it could break some weird PDF files.
In the current code, wrapping `Dict_get` inside `Parser_fetchIfRef` will potentially mean two back-to-back call of `XRef_fetch`, if a reference points directly to another reference. I'm not sure if this can actually happen in practice, and I'd think that if that were the case we'd already have run into it elsewhere in the code-base, given that `Parser` is the only place where we try to "double" resolve references.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1200096.
The problematic font has a `format 2` cmap, which we've never supported properly. Prior to PR 2606, we were able to fallback to a working state, despite not having proper support for that cmap format.
Obviously the best/correct solution would be to implement actual support for more cmap formats[1]. However, I'm hoping that a simple patch will be OK for now, given that:
- `format 2` cmaps seem to be quite rare in practice, since this has been broken for 2.5 years before anyone noticed.
- Having a simple patch will make potential uplifts a lot easier.
[1] See the specification at https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html
Having a warning here would have meant that issue 6360 could have been solved in approximately five minutes, instead of an hour. To avoid that happening again, this patch adds a warning whenever we treat a stream as empty.
This patch improves the detection of `xref` in files where it is followed by an arbitrary whitespace character (not just a line-breaking char).
It also adds a check for missing whitespace, e.g. `1 0 obj<<`, to speed up `readToken` for the PDF file in the referenced issue.
Finally, the patch also replaces a bunch of magic numbers with suitably named constants.
Fixes 5752.
Also improves 6243, but there are still issues.
The problem with the PDF files in the issue, besides the obviously broken XRef tables which we're able to recover from, is that many/most of the streams have Dictionaries where the `Length` entry is set to `0`. This causes us to return `NullStream`, instead of the appropriate one in `Parser_makeFilter`.
Fixes 6360.
In some cases, such as in use with a CSP header, constructing a function with a
string of javascript is not allowed. However, compiling the various commands
that need to be done on the canvas element is faster than interpreting them.
This patch changes the font renderer to instead emit commands that are compiled
by the font loader. If, during compilation, we receive an EvalError, we instead
interpret them.
Serialize errors to make sure that the callback is still invoked when
an error is thrown.
Firefox:
"DataCloneError: The object could not be cloned."
Chrome:
"DataCloneError: Failed to execute 'postMessage' on 'WorkerGlobalScope': An object could not be cloned."
CMaps may be sparse. Array.prototype.forEach is terribly slow in Chrome
(and also in Firefox) when the sparse array contains a key with a high
value. E.g.
console.time('forEach sparse')
var a = [];
a[0xFFFFFF] = 1;
a.forEach(function(){});
console.timeEnd('forEach sparse');
// Chrome: 2890ms
// Firefox: 1345ms
Switching to CMap.prototype.forEach, which is optimized for such
scenarios fixes the problem.
pi is an index in the stream and is explained on page 201 of the 32000-spec (however 1-based there), and ps is an index to something in PDF.js. I used the code from flag 0 (which works) to understand which is which. It is also important to understand that for flags 1,2 and 3, the stream is always assigned to the same coordinates and colors. What changes is which "old" coordinates and colors are assigned to what is "missing" in the stream. This is why for these flags, the code is identical except for the assignments in the first "row". (Same principle as in #6304). Note that this change will not improve the lamp_cairo.pdf file, only the two files mentioned in #6305.
Short story: somebody got lost in two different indices. pi is an index in the stream and is explained on page 198 of the 32000-spec (however 1-based there), and ps is an index to something in PDF.js. I used the code from flag 0 (which works) to understand which is which. It is also important to understand that for flags 1,2 and 3, the stream is always assigned to the same coordinates and colors. What changes is which "old" coordinates and colors are assigned to what is "missing" in the stream. This is why for these flags, the code is identical except for the assignments in the first "row".
Currently, `src/core/core.js` uses the `fromRef` method on an `Annotation` object to obtain the right annotation type object (such as `LinkAnnotation` or `TextAnnotation`). That method in turn uses a method `getConstructor` to find out which annotation type object must be returned.
Aside from the fact that there is currently a lot of code to achieve this, these methods should not be part of the base `Annotation` class at all. Creation of annotation object should be done by a factory (as also recommended by @yurydelendik at https://github.com/mozilla/pdf.js/pull/5218#issuecomment-52779659) that handles finding out the correct annotation type object and returning it. This patch implements this separation of concerns.
Doing this allows us to also simplify the code quite a bit and to make it more readable. Additionally, we are now able to get rid of the hardcoded array of supported annotation types. The factory takes care of checking the annotation types and falls back to returning the base annotation type (and issuing a warning, which the current code also does not do well) when an annotation type is unsupported.
I have manually tested this commit with 20 test PDFs with different annotation types, such as /Link, /Text, /Widget, /FileAttachment and /FreeText. All render identically before and after the patch, and unsupported annotation types are now properly indicated with a warning in the console.
This patch refactors the code responsible for setting the annotation's rectangle. Its goal is to:
- Actually check that the input array is actually an array, and if so, that it contains exactly four elements.
- Only call `normalizeRect` if the input array is valid, i.e., we do not call it for the default rectangle anymore.
Unit tests are provided just like with the other patches in this series.
Fixes#6106
To avoid future regressions, two new unit tests were added:
1. A new PDF based on the report from #6106, which contains an
OpenAction of type JavaScript and a string "this.print({...}".
2. An existing PDF from https://bugzil.la/1001080 (from #4698).
Although it does not matter, since we don't execute the JavaScript code,
I have also changed "print(true)" to "print({})" since the print method
takes an object (not a boolean). See "Printing PDF documents", page 62:
http://adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/js_developer_guide.pdf
Basic mathematics would suggest that a double negative should always become positive, but it appears that Adobe Reader simply ignores that case. Hence I think that it makes sense for us to do the same.
Fixes 6218.
When the parser finds a stream, it retrieves the Length from the stream
dictionary and advances the lexer to the offset as specified in Length.
If this Length is incorrect, the lexer could end up anywhere.
When the lexer gets in an invalid state, it could throw errors. For
example, in issue 6108, the lexer ends up inside the stream data. This
stream has the ASCIIHexDecode filter, so all characters are made up from
ASCII characters, and the lexer interprets it as a command token. Tokens
cannot be longer than 127 bytes, so eventually 128 bytes are consumed
and the lexer throws "Command token too long" error.
Another possible error is "Illegal character: 41" when the lexer happens
to end up at a ')' due to the length mismatch.
These problems are solved by catching lexer errors and recovering the
parser via the existing stream length detection branch.
Xref offsets are relative to the start of the PDF data, not to the start
of the PDF file. This is clear if you look at the other code:
- In the XRef's readXRefTable and processXRefTable methods of XRef, the
offset of a xref entry is set to the bytes as given by a PDF file.
These values are always relative to the start of the PDF file (%PDF-).
- The XRef's readXRef method adds the start offset of the stream to
Xref entry's offset: "stream.pos = startXRef + stream.start".
Clearly, this line assumes that the entry offset excludes the start
offset.
However, when the PDF is parsed in recovery mode, the xref table is
filled with entries whose offset is relative to the start of the stream
rather than the PDF file. This is incorrect, and the fix is to subtract
the start offset of the stream from the entry's byte offset.
The manually created PDF file serves as a regression test. It is a valid
PDF, except:
- The integer to point to the start of the xref table and the %%EOF
trailer are missing. This will activate recovery mode in PDF.js
- Some junk was added before the start of the PDF file. This exposes the
bad offset bug.
The PDF specification (cited below) specifies a maximum length of a name
in bytes as a minimal architectural limit. This means that PDF *writers*
should not create names that exceed 127 bytes.
It does not forbid PDF *readers* to accept such names though. These
names are only used internally to link PDF objects to other objects. For
these use cases, the lengths of the names do not really matter. Hence I
have changed the implementation to not treat long names as errors, but
warnings.
> (7.3.5) The length of a name shall be subject to an implementation
> limit; see Annex C.
>
> (Annex C.2) Table C.1 describes the minimum architectural limits that
> should be accommodated by conforming readers running on 32-bit
> machines. Because conforming readers may be subject to these limits,
> conforming writers producing PDF files should remain within them.
>
> (Table C.1) name 127 "Maximum length of a name, in bytes."
http://adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
For named destinations that are contained in a `Dict`, as opposed to a `NameTree`, we currently iterate through the *entire* dictionary just to fetch *one* destination.
This code appears to simply have been copy-pasted from the `get destinations` method, but in its current form it's quite unnecessary/inefficient since can just get the required destination directly instead.
Doing this helped uncover an issue with the `getDestination` implementation.
Currently if a named destination doesn't exist, the method (in `obj.js`) may return `undefined` which leads to the promise being stuck in a pending state.
*Note:* returning `null` for this case is consistent with other methods, e.g. `getOutline` and `getAttachments`.
This became obsolete in bdeca30fbf. All it does is call the Annotation contructor and add hasHtml. This patch lets the Link and Text annotations directly extend the Annotation class and add hasHtml themselves.
This patch also removes an unused global.
Recently I've landed a number patches which fixed issues with ColorSpaces. In most of these cases the cause of the failures were, either partially or entirely, related to the fact that we didn't resolve indirect objects (i.e. the code was missing `xref.fetchIfRef(...)`).
The purpose of this patch is to fix the few remaining cases where indirect objects *could* potentially cause failures.
Given that we have seen how this causes failures in practice, I thus think that it makes sense to try and avoid further issues, instead of waiting for users to file even more bugs for this part of the code-base.
Currently non-embedded ArialBlack fonts are not rendered bold enough, compared to e.g. Adobe Reader.
The issue is that we set the font weight to `bolder`, but since that is actually relative to the font weight of the parent, the result is that there's no practical difference from just using `bold`.
This patch attempts to address that, by explicitly setting the font weight to the maximum value instead (see https://developer.mozilla.org/en-US/docs/Web/CSS/font-weight).
*Note:* I expect one test "failure" in `issue5801`, which in this case is an improvement, since that PDF file uses ArialBlack.
Fixes 6068.
The most notable issue with the font in question is that the `differences` array contains lots of strange entries (of the type `uniXXXX`, instead of proper glyph names).
The 'Version' field of the most recent document catalog, if present, is
intended to supersede the value in the file prologue.
This is significant for incrementally-built PDF documents and generators that
emit a low version in the prologue and later apply a format version based on
PDF features used, such as Apple's CoreGraphics/Quartz PDF backend.
Fixes the internal version variable, as well as the PDFFormatVersion reported
by the API and consumed by viewers.
For passwords where the encoding already is correct, the conversion is a no-op.
Also, since `encodeURIComponent` might throw, we need to make sure that we handle that case too.
Fixes 6010.