Currently `getAnnotations` will *only* fetch annotations that are either `viewable` or `printable`. This is "hidden" inside the `core.js` file, meaning that API consumers might be confused as to why they are not recieving *all* the annotations present for a page.
I thus think that the API should, by default, return *all* available annotations unless specifically told otherwise. In e.g. the default viewer, we obviously only want to display annotations that are `viewable`, hence this patch adds an `intent` parameter to `getAnnotations` that makes it possible to decide if only `viewable` or `printable` annotations should be fetched.
This patch makes it possible to set and get all possible flags that the PDF specification defines. Even though we do not support all possible annotation types and not all possible annotation flags yet, this general framework makes it easy to access all flags for each annotation such that annotation type implementations can use this information.
We add constants for all possible annotation flags such that we do not need to hardcode the flags in the code anymore. The `isViewable()` and `isPrintable()` methods are now easier to read. Additionally, unit tests have been added to ensure correct behavior.
This is another part of #5218.
I received multiple reports about the following cryptic error in the
Chrome extension when the user tried to open a local file:
> PDF.js v1.1.527 (build: 2096a2a)
> Message: Cannot read property 'Symbol(Symbol.iterator)' of null
This error most likely originated from core/stream.js:
function Stream(arrayBuffer, start, length, dict) {
this.bytes = (arrayBuffer instanceof Uint8Array ?
arrayBuffer : new Uint8Array(arrayBuffer));
^^^^^^^^^^^
`arrayBuffer` is `null`, and that in turn is caused by the fact that
for non-existing files, there is no data. I've applied two fixes:
1. Never call onDone with a void buffer, but call the error handler
instead.
2. Show a sensible error message for local files with status = 0.
In the `RenderPageRequest` handler in `worker.js`, we attempt to print an `info` message containing the rendering time and the length of the operator list. The latter is currently broken (and has been for quite some time), since the `length` of an `OperatorList` is reset when flushing occurs.
This patch attempts to rectify this, by adding a getter which keeps track of the total length.
*Follow-up to PR 6546.*
If rendering has already started when the document is destroyed, then `this.pageCache[data.pageIndex]` may already have been cleared when the `StartRenderingPage`/`RenderPageChunk` messages are recieved in `api.js`, which results in `TypeError`s being thrown.
We should never actually try to execute `paintXObject` in canvas.js, but in some cases where we fail to parse the PDF file correctly it can happen. Currently this will potentially cause an entire page to fail to render, which seems suboptimal.
With this patch, we will instead continue rendering with a warning that things might not work correctly.
`operatorList.addOp` adds the arguments to the list which is then
passed as-is by postMessage to the main thread. But since we don't
parse these operations, they are raw PDF objects and may therefore
cause a serialization error.
This is a conservative patch, and only affects operators which are
known to be unsupported. We should ignore all unknown operators,
but I haven't really looked into the consequences of doing that.
Fixes#6549
In PR 6485 I somehow missed to account for the case where `xref` is undefined. Since a dictonary can be initialized without providing a reference to an `xref` instance, `Dict_getArray` can thus fail without this added check.
Replaces `PDFJS.openExternalLinksInNewWindow` with a more generic configuration option.
*Note:* `PDFJS.openExternalLinksInNewWindow = true;` is equal to `PDFJS.externalLinkTarget = PDFJS.LinkTarget.BLANK;`.
According to the PDF spec 5.3.2, a positive value means in horizontal,
that the next glyph is further to the left (so narrower), and in
vertical that it is further down (so wider).
This change fixes the way PDF.js has interpreted the value.
This patch adjusts `get fingerprint` to also check that the `/ID` entry contains (non-empty) strings, to prevent more possible failures when loading corrupt PDF files (follow-up to PR 5602).
Note that I've not actually encountered such a PDF file in the wild. However given that `stringToBytes` will assert that the input is a string, and that we'll thus fail to load a document unless `get fingerprint` succeeds, making this more robust seems like a good idea to me.
For (1, 0) cmaps, we have two different codepaths depending on whether the font has/hasn't got an encoding. But with (3, 1) cmaps we don't have a good fallback when the encoding is missing, hence this patch changes `readCmapTable` to only choose a (3, 1) cmap table if the font is non-symbolic *and* an encoding exists. Without this, we'll not be able to successfully create a working glyph map for some TrueType fonts with (3, 1) cmap tables.
Fixes 6410.
This code was added in PR 1214, but was made obsolete by PRs 1488/1493. Prior to the latter ones, `Dict_get` retured the raw objects. However, afterwards (and currently) `Dict_get` now resolves indirect objects, which makes `Parser_fetchIfRef` redundant.
*Potential risks with this patch:*
This patch passes all tests locally, but there's a *small* possibility that it could break some weird PDF files.
In the current code, wrapping `Dict_get` inside `Parser_fetchIfRef` will potentially mean two back-to-back call of `XRef_fetch`, if a reference points directly to another reference. I'm not sure if this can actually happen in practice, and I'd think that if that were the case we'd already have run into it elsewhere in the code-base, given that `Parser` is the only place where we try to "double" resolve references.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1200096.
The problematic font has a `format 2` cmap, which we've never supported properly. Prior to PR 2606, we were able to fallback to a working state, despite not having proper support for that cmap format.
Obviously the best/correct solution would be to implement actual support for more cmap formats[1]. However, I'm hoping that a simple patch will be OK for now, given that:
- `format 2` cmaps seem to be quite rare in practice, since this has been broken for 2.5 years before anyone noticed.
- Having a simple patch will make potential uplifts a lot easier.
[1] See the specification at https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html
Having a warning here would have meant that issue 6360 could have been solved in approximately five minutes, instead of an hour. To avoid that happening again, this patch adds a warning whenever we treat a stream as empty.
This patch improves the detection of `xref` in files where it is followed by an arbitrary whitespace character (not just a line-breaking char).
It also adds a check for missing whitespace, e.g. `1 0 obj<<`, to speed up `readToken` for the PDF file in the referenced issue.
Finally, the patch also replaces a bunch of magic numbers with suitably named constants.
Fixes 5752.
Also improves 6243, but there are still issues.
The problem with the PDF files in the issue, besides the obviously broken XRef tables which we're able to recover from, is that many/most of the streams have Dictionaries where the `Length` entry is set to `0`. This causes us to return `NullStream`, instead of the appropriate one in `Parser_makeFilter`.
Fixes 6360.
In some cases, such as in use with a CSP header, constructing a function with a
string of javascript is not allowed. However, compiling the various commands
that need to be done on the canvas element is faster than interpreting them.
This patch changes the font renderer to instead emit commands that are compiled
by the font loader. If, during compilation, we receive an EvalError, we instead
interpret them.
Serialize errors to make sure that the callback is still invoked when
an error is thrown.
Firefox:
"DataCloneError: The object could not be cloned."
Chrome:
"DataCloneError: Failed to execute 'postMessage' on 'WorkerGlobalScope': An object could not be cloned."
CMaps may be sparse. Array.prototype.forEach is terribly slow in Chrome
(and also in Firefox) when the sparse array contains a key with a high
value. E.g.
console.time('forEach sparse')
var a = [];
a[0xFFFFFF] = 1;
a.forEach(function(){});
console.timeEnd('forEach sparse');
// Chrome: 2890ms
// Firefox: 1345ms
Switching to CMap.prototype.forEach, which is optimized for such
scenarios fixes the problem.
pi is an index in the stream and is explained on page 201 of the 32000-spec (however 1-based there), and ps is an index to something in PDF.js. I used the code from flag 0 (which works) to understand which is which. It is also important to understand that for flags 1,2 and 3, the stream is always assigned to the same coordinates and colors. What changes is which "old" coordinates and colors are assigned to what is "missing" in the stream. This is why for these flags, the code is identical except for the assignments in the first "row". (Same principle as in #6304). Note that this change will not improve the lamp_cairo.pdf file, only the two files mentioned in #6305.
Short story: somebody got lost in two different indices. pi is an index in the stream and is explained on page 198 of the 32000-spec (however 1-based there), and ps is an index to something in PDF.js. I used the code from flag 0 (which works) to understand which is which. It is also important to understand that for flags 1,2 and 3, the stream is always assigned to the same coordinates and colors. What changes is which "old" coordinates and colors are assigned to what is "missing" in the stream. This is why for these flags, the code is identical except for the assignments in the first "row".
Currently, `src/core/core.js` uses the `fromRef` method on an `Annotation` object to obtain the right annotation type object (such as `LinkAnnotation` or `TextAnnotation`). That method in turn uses a method `getConstructor` to find out which annotation type object must be returned.
Aside from the fact that there is currently a lot of code to achieve this, these methods should not be part of the base `Annotation` class at all. Creation of annotation object should be done by a factory (as also recommended by @yurydelendik at https://github.com/mozilla/pdf.js/pull/5218#issuecomment-52779659) that handles finding out the correct annotation type object and returning it. This patch implements this separation of concerns.
Doing this allows us to also simplify the code quite a bit and to make it more readable. Additionally, we are now able to get rid of the hardcoded array of supported annotation types. The factory takes care of checking the annotation types and falls back to returning the base annotation type (and issuing a warning, which the current code also does not do well) when an annotation type is unsupported.
I have manually tested this commit with 20 test PDFs with different annotation types, such as /Link, /Text, /Widget, /FileAttachment and /FreeText. All render identically before and after the patch, and unsupported annotation types are now properly indicated with a warning in the console.
This patch refactors the code responsible for setting the annotation's rectangle. Its goal is to:
- Actually check that the input array is actually an array, and if so, that it contains exactly four elements.
- Only call `normalizeRect` if the input array is valid, i.e., we do not call it for the default rectangle anymore.
Unit tests are provided just like with the other patches in this series.
Fixes#6106
To avoid future regressions, two new unit tests were added:
1. A new PDF based on the report from #6106, which contains an
OpenAction of type JavaScript and a string "this.print({...}".
2. An existing PDF from https://bugzil.la/1001080 (from #4698).
Although it does not matter, since we don't execute the JavaScript code,
I have also changed "print(true)" to "print({})" since the print method
takes an object (not a boolean). See "Printing PDF documents", page 62:
http://adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/js_developer_guide.pdf
Basic mathematics would suggest that a double negative should always become positive, but it appears that Adobe Reader simply ignores that case. Hence I think that it makes sense for us to do the same.
Fixes 6218.
When the parser finds a stream, it retrieves the Length from the stream
dictionary and advances the lexer to the offset as specified in Length.
If this Length is incorrect, the lexer could end up anywhere.
When the lexer gets in an invalid state, it could throw errors. For
example, in issue 6108, the lexer ends up inside the stream data. This
stream has the ASCIIHexDecode filter, so all characters are made up from
ASCII characters, and the lexer interprets it as a command token. Tokens
cannot be longer than 127 bytes, so eventually 128 bytes are consumed
and the lexer throws "Command token too long" error.
Another possible error is "Illegal character: 41" when the lexer happens
to end up at a ')' due to the length mismatch.
These problems are solved by catching lexer errors and recovering the
parser via the existing stream length detection branch.
Xref offsets are relative to the start of the PDF data, not to the start
of the PDF file. This is clear if you look at the other code:
- In the XRef's readXRefTable and processXRefTable methods of XRef, the
offset of a xref entry is set to the bytes as given by a PDF file.
These values are always relative to the start of the PDF file (%PDF-).
- The XRef's readXRef method adds the start offset of the stream to
Xref entry's offset: "stream.pos = startXRef + stream.start".
Clearly, this line assumes that the entry offset excludes the start
offset.
However, when the PDF is parsed in recovery mode, the xref table is
filled with entries whose offset is relative to the start of the stream
rather than the PDF file. This is incorrect, and the fix is to subtract
the start offset of the stream from the entry's byte offset.
The manually created PDF file serves as a regression test. It is a valid
PDF, except:
- The integer to point to the start of the xref table and the %%EOF
trailer are missing. This will activate recovery mode in PDF.js
- Some junk was added before the start of the PDF file. This exposes the
bad offset bug.
The PDF specification (cited below) specifies a maximum length of a name
in bytes as a minimal architectural limit. This means that PDF *writers*
should not create names that exceed 127 bytes.
It does not forbid PDF *readers* to accept such names though. These
names are only used internally to link PDF objects to other objects. For
these use cases, the lengths of the names do not really matter. Hence I
have changed the implementation to not treat long names as errors, but
warnings.
> (7.3.5) The length of a name shall be subject to an implementation
> limit; see Annex C.
>
> (Annex C.2) Table C.1 describes the minimum architectural limits that
> should be accommodated by conforming readers running on 32-bit
> machines. Because conforming readers may be subject to these limits,
> conforming writers producing PDF files should remain within them.
>
> (Table C.1) name 127 "Maximum length of a name, in bytes."
http://adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
For named destinations that are contained in a `Dict`, as opposed to a `NameTree`, we currently iterate through the *entire* dictionary just to fetch *one* destination.
This code appears to simply have been copy-pasted from the `get destinations` method, but in its current form it's quite unnecessary/inefficient since can just get the required destination directly instead.
Doing this helped uncover an issue with the `getDestination` implementation.
Currently if a named destination doesn't exist, the method (in `obj.js`) may return `undefined` which leads to the promise being stuck in a pending state.
*Note:* returning `null` for this case is consistent with other methods, e.g. `getOutline` and `getAttachments`.
This became obsolete in bdeca30fbf. All it does is call the Annotation contructor and add hasHtml. This patch lets the Link and Text annotations directly extend the Annotation class and add hasHtml themselves.
This patch also removes an unused global.
Recently I've landed a number patches which fixed issues with ColorSpaces. In most of these cases the cause of the failures were, either partially or entirely, related to the fact that we didn't resolve indirect objects (i.e. the code was missing `xref.fetchIfRef(...)`).
The purpose of this patch is to fix the few remaining cases where indirect objects *could* potentially cause failures.
Given that we have seen how this causes failures in practice, I thus think that it makes sense to try and avoid further issues, instead of waiting for users to file even more bugs for this part of the code-base.
Currently non-embedded ArialBlack fonts are not rendered bold enough, compared to e.g. Adobe Reader.
The issue is that we set the font weight to `bolder`, but since that is actually relative to the font weight of the parent, the result is that there's no practical difference from just using `bold`.
This patch attempts to address that, by explicitly setting the font weight to the maximum value instead (see https://developer.mozilla.org/en-US/docs/Web/CSS/font-weight).
*Note:* I expect one test "failure" in `issue5801`, which in this case is an improvement, since that PDF file uses ArialBlack.
Fixes 6068.
The most notable issue with the font in question is that the `differences` array contains lots of strange entries (of the type `uniXXXX`, instead of proper glyph names).
The 'Version' field of the most recent document catalog, if present, is
intended to supersede the value in the file prologue.
This is significant for incrementally-built PDF documents and generators that
emit a low version in the prologue and later apply a format version based on
PDF features used, such as Apple's CoreGraphics/Quartz PDF backend.
Fixes the internal version variable, as well as the PDFFormatVersion reported
by the API and consumed by viewers.
For passwords where the encoding already is correct, the conversion is a no-op.
Also, since `encodeURIComponent` might throw, we need to make sure that we handle that case too.
Fixes 6010.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1050040.
With this patch the file is completely readable, but given that the font is broken enough to be rejected by OTS the rendering differs slightly from Adobe Reader.
*Note:* the PDF file is sufficiently broken that even Adobe Reader complains about the font, *and* also about another more general issue.
Instead of trying to hack around various browser defects, let's just disable PresentationMode in the affected browsers. This patch:
- Disables PresentationMode in IE11+ when the viewer is embedded; fixes 4711.
Set transformation matrix in (polyfilled) mozPrintCallback when a scale
is applied. Removed _scaleX and _scaleY in favor of _transformMatrix to
emphasize that the caller MUST ensure that the state of the matrix is
correct before `addContextCurrentTransform` is called.
Currently if a font contains a `toUnicode` entry, we always create a new `ToUnicodeMap` in evaluator.js. This is done even for `IdentityV/IdentityH`, despite to possibility to use the much more compact `IdentityToUnicodeMap` representation.
This patch refactors the `IdentityH/IdentityV` cases, to:
- Avoid calling `IdentityCMap.getMap`, since this prevents allocating and iterating through an array with 65536 elements.
- Ensure that the handling of `toUnicode` is actually correct in fonts.js.
We rely on `toUnicode instanceof IdentityToUnicodeMap` in a few places, and currently this does not work correctly for `IdentityH/IdentityV`.
This is a very small follow-up to PR 5536, which sets `isStandardFont` to `false` instead of `undefined` (as currently happens for some font names).
Since the patch is so small, I hope it's OK to also fix an unrelated copy-and-paste error in a comment that was added in PR 5260.
This change does the following:
* Address TODO to remove getEmptyContainer helper.
* Not set container bg-color. The old code is incorrect, causing it to
not have any effect. It sets color to an array (item.color) rather
css string. Also in most cases it would set it to black background
which is incorrect.
* only add border instructions when there is actually a border
* reduce memory consumption by not creating new 3 element arrays for
annotation colors. In fact according to spec, this would be incorrect,
as the default should be "transparent" for an empty array. Adobe
Reader interprets a missing color array as black however.
Note that only Link annotations were actually setting a border style and
color. While Text annotations might have calculated a border they did
not color it. This behaviour is now controlled by the boolean flag.
According to practical experiments, falling back to "Helvetica" when we encounter a non-embedded "[Century Gothic](http://en.wikipedia.org/wiki/Century_Gothic)" `CIDFontType2` font seems to work well.
(Also, the section on Wikipedia about "Printer ink usage" *might* provide some anecdotal evidence that Century Gothic is a fairly standard sans-serif font.)
Obviously this patch doesn't make "Century Gothic" fonts render perfectly, as is often the case with non-embedded fonts, but all the text is now legible in the referenced issues.
Fixes 4722.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=879561.
Fixes two issues:
- #4456 : The first 100 bytes are often not unique as they can be
filled with standard PDF headers - so we use the first 200 KB instead.
(This may be overkill)
- Some documents we encountered have invalid xref ids, which were
always coming out as ‘0000000000000000’ - so we detect that and use the
MD5 instead.
This is a tentative patch that adds *very* basic support for non-embedded Wingdings fonts (a Windows version of Dingbats), by falling back to the ZapfDingbats encoding. Obviously this approach will not work perfectly, but in my opinion it seems to work reasonably well in pratice.
Instead of this very simple patch, another option would be to try and include more complete glyph data for Wingdings, e.g. a Unicode map and glyph widths, similar to what was done for ZapfDingbats.
However there is, in my opinion, one important difference between Wingdings and ZapfDingbats: ZapfDingbats is part of the 14 standard fonts, which in previous versions of the PDF specification was assumed to be available in PDF readers. To improve compatibility with older files, it thus makes sense for us to include data for ZapfDingbats.
However Wingdings has never been a standard font in PDF files, hence PDF files using it *should* thus contain all the necessary font data.
Given the above, I thus believe that it should be OK to fall back to ZapfDingbats for now. If non-embedded Wingdings fonts turns out to be *a lot* more common, then we can revisit this later.
Fixes 4301 completely.
Fixes 4837 almost completely. With this patch the bullets are displayed correctly, but the arrows are not of the correct type.
Fixes `artofwar.pdf`, pages 14 and 15.
maskData comes out of maskCtx.getImageData, so is 0..255 clamped, and
the used multiplications will not create fractions needing rounding,
neither would addition.
As described in #5444, the evaluator will perform identity checking of
paintImageMaskXObjects to decide if it can use
paintImageMaskXObjectRepeat instead of paintImageMaskXObjectGroup.
This can only ever work if the entry is a cache hit. However the
previous caching implementation was doing a lazy caching, which would
only consider a image cache worthy if it is repeated.
Only then the repeated instance would be cached.
As a result of this the sequence of identical images A1 A2 A3 A4 would
be seen as A1 A2 A2 A2 by the evaluator, which prevents using the
"repeat" optimization. Also only the last encountered image is cached,
so A1 B1 A2 B2, would stay A1 B1 A2 B2.
The new implementation drops the "lazy" init of the cache. The threshold
for enabling an image to be cached is rather small, so the potential waste
in storage and adler32 calculation is rather low. It also caches any
eligible image by its adler32.
The two example from above would now be A1 A1 A1 A1 and A1 B1 A1 B1
which not only saves temporary storage, but also prevents computing
identical masks over and over again (which is the main performance impact
of #2618)
This patch makes the image from #5349 appear correctly, the artefacts
for the last packet are fixed in #5426.
This patch also optimizes some "in-checks" and adds a few header parsings.
After PR 5263, setting `disableAutoFetch = true` in the generic viewer no longer works correctly, since the entire file loads even with `disableStream = true`.