In core/document.js: `PDFDocument.prototype.parse` accesses a dictionary
property, which could throw if the underlying data is not yet available.
In core/obj.js: `get Catalog.prototype.metadata` calls
`stream.getBytes`, which can throw MissingDataException too when the
stream is a ChunkedStream.
This fixes something that I noticed while working with the code in `Catalog.getPageDict` when debugging issue 8088.
Note that while I don't have an example where this patch really matters, given that e.g. `PartialEvaluator.hasBlendModes` depends on the `objId` to avoid cyclic references this patch could potentially help for some PDF files.
As discussed on IRC, we need to check all nodes at the *bottom* of the tree to ensure that we find the correct `Page` dict.
Furthermore, this patch also gets rid of the caching present in a previous version, since it's not clear if that really helps.
Note that this patch purposely adds an `eq` test, using a reduced test-case, so that we can be sure that the algorithm actually finds the correct `Page` dict for each `pageIndex`.
Fixes 8088.
Using `else` after `return` is not necessary, and can often lead to unnecessarily cluttered code. By using the `no-else-return` rule in ESLint we can avoid this pattern, see http://eslint.org/docs/rules/no-else-return.
We're already passing in a, currently unused, `PdfManager` instance when initializing the `XRef`. To avoid having to pass a single `password` parameter around, we could thus simply get the `password` through the `PdfManager` instance instead.
By only allowing very specific type of `JavaScript` actions, and also utilizing the existing `URL` validation, this patch shouldn't pose too much risk.
Fixes one of the points in issue 3897 (with the PDF file taken from issue 3438).
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=843699 (probably, since that bug doesn't contain a test-case).
It seems that certain bad PDF generators can create badly encoded "Prefix" entries for Page Labels, one example being http://ukjewishfilm.org/wp-content/uploads/2015/09/Jewish-Film-Festival-Programme-ONLINE.pdf.
Unfortunately I didn't come across such a PDF file while adding the API support for Page Labels, but with them now being used in the viewer I just found this issue. With this patch, we now display the Page Labels in the same way as Adobe Reader.
In general we neither want, nor can, support arbitrary `Launch` actions. But in practice, all the cases we've seen so far just contains relative URLs to other PDF files. Building on PR 7689, we can thus at least support basic `Launch` actions.
Note that in `FIREFOX/MOZCENTRAL/CHROME` builds of the standard viewer the `docBaseUrl` parameter will be set by default, since in that case it makes sense to use the current URL as a base.
For the `GENERIC` viewer, or the API itself, it doesn't make sense to try and set the `docBaseUrl` by default. However, custom deployments/implementations may still find the parameter useful.
Note that this will automatically reject any relative URL.
To make the API more useful to consumers, URLs that are rejected will be available via the `unsafeUrl` property in the data object returned by `PDFPageProxy_getAnnotations`.
The patch also adds a bit more validation of the data for `Named` actions.
This not only reduces code duplication, but it also allow us to easily support the same kind of URLs we currently do for Link annotations in the Outline as well.
Instead of having `Parser_getObj` fail unconditionally for the referenced PDF file, this patch attempts to let searching for the main trailer continue even if there are errors.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1250079.
Currently the `getPageIndex` method will happily return `0`, even if the `Ref` parameter doesn't actually point to a proper /Page dictionary.
Having the API trust that the consumer is doing the right thing seems error-prone, hence this patch which adds a check for this case.
Given that the `Catalog_getPageIndex` method isn't used in any hot part of the codebase, this extra check shouldn't be a problem.
(Note: in the standard viewer, it is only ever used from `PDFLinkService_navigateTo` if a destination needs to be resolved during document loading, which isn't common enough to be an issue IMHO.)
As evident from e.g. PRs 6485 and 7118, some bad PDF generators unfortunately create Arrays where *some* elements are indirect objects (i.e. `Ref`s). This seems to mostly affect Arrays that contain numbers, such as e.g. `Matrix/FontMatrix/BBox/FontBBox/Rect/Color/...`, and has manifested itself in PDF files that fail to render correctly (some elements are missing).
The problem in both the cases above, besides broken rendering, was that there were *no* errors/warnings that indicated what the problem was, making it difficult to pinpoint the issue.
Hence this patch, where I've audited all usages of `Dict_get` in `src/core/` files, and replaced it with `Dict_getArray` where appropriate to try and prevent unnecessary future bugs.
Currently the `C` entry in an outline item is returned as is, which is neither particularly useful nor what the API documentation claims.
This patch also adds unit-tests for both the color handling, and the `F` entry (bold/italic flags).
Re: issue 5089.
(Note that since there are other outline features that we currently don't support, e.g. bold/italic text and custom colours, I thus think we can keep the referenced issue open.)
In PR 6485 I somehow missed to account for the case where `xref` is undefined. Since a dictonary can be initialized without providing a reference to an `xref` instance, `Dict_getArray` can thus fail without this added check.
This patch improves the detection of `xref` in files where it is followed by an arbitrary whitespace character (not just a line-breaking char).
It also adds a check for missing whitespace, e.g. `1 0 obj<<`, to speed up `readToken` for the PDF file in the referenced issue.
Finally, the patch also replaces a bunch of magic numbers with suitably named constants.
Fixes 5752.
Also improves 6243, but there are still issues.
Fixes#6106
To avoid future regressions, two new unit tests were added:
1. A new PDF based on the report from #6106, which contains an
OpenAction of type JavaScript and a string "this.print({...}".
2. An existing PDF from https://bugzil.la/1001080 (from #4698).
Although it does not matter, since we don't execute the JavaScript code,
I have also changed "print(true)" to "print({})" since the print method
takes an object (not a boolean). See "Printing PDF documents", page 62:
http://adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/js_developer_guide.pdf
Xref offsets are relative to the start of the PDF data, not to the start
of the PDF file. This is clear if you look at the other code:
- In the XRef's readXRefTable and processXRefTable methods of XRef, the
offset of a xref entry is set to the bytes as given by a PDF file.
These values are always relative to the start of the PDF file (%PDF-).
- The XRef's readXRef method adds the start offset of the stream to
Xref entry's offset: "stream.pos = startXRef + stream.start".
Clearly, this line assumes that the entry offset excludes the start
offset.
However, when the PDF is parsed in recovery mode, the xref table is
filled with entries whose offset is relative to the start of the stream
rather than the PDF file. This is incorrect, and the fix is to subtract
the start offset of the stream from the entry's byte offset.
The manually created PDF file serves as a regression test. It is a valid
PDF, except:
- The integer to point to the start of the xref table and the %%EOF
trailer are missing. This will activate recovery mode in PDF.js
- Some junk was added before the start of the PDF file. This exposes the
bad offset bug.