Commit Graph

2659 Commits

Author SHA1 Message Date
Jonas Jenwald
e8562173b8 Prevent an infinite loop when parsing corrupt /CCITTFaxDecode data (issue 14305)
Fixes one of the documents in issue 14305.
2021-12-07 13:57:25 +01:00
Jonas Jenwald
909f012fb8 Add a (linked) test-case for issue 8022
Given that [bug 1336591](https://bugzilla.mozilla.org/show_bug.cgi?id=1336591) was just closed as fixed, thus fixing issue 8022 in Firefox, let's add a test-case to enable us to catch any future regressions either in PDF.js or in browsers themselves.
2021-12-06 15:27:40 +01:00
Tim van der Meij
911a9d34b1
Fix code duplication in the rasterization logic in test/driver.js
Now that the rasterization logic is encapsulated in a class, we can
easily move the container creation into a separate static method.
2021-12-05 19:29:39 +01:00
Tim van der Meij
03506f25c0
Move the rasterization logic into one single class
This refactoring ensures that we can get rid of the closures and
encapsulate the logic in a nicer way with e.g., getters for the style
promises.
2021-12-05 19:28:51 +01:00
Tim van der Meij
33dc0628a0
Enable the no-var linting rule in test/driver.js
This is done automatically with the `gulp lint --fix` command with the
only exception of the `annotationLayerContext` variable.
2021-12-05 15:41:36 +01:00
Tim van der Meij
5fd4276dcf
Use async/await in the rasterization classes in test/driver.js
This is achieved by letting the `writeSVG` function return a promise so
we don't need callback passing anymore.
2021-12-05 14:11:09 +01:00
Tim van der Meij
13786ef806
Use arrow functions instead of self variables in test/driver.js 2021-12-05 14:11:08 +01:00
Tim van der Meij
1d1f713bfc
Inline loadStyles calls in the rasterization classes in test/driver.js
The wrapper functions in this case only really added indirection, so
this commit simplifies the code a bit.
2021-12-05 13:49:04 +01:00
Tim van der Meij
a58700b0dc
Convert the Driver class to ES6 syntax in test/driver.js 2021-12-05 13:43:02 +01:00
Tim van der Meij
dc455c836e
Merge pull request #14339 from Snuffleupagus/issue-8019-reftest
Add a (linked) test-case for issue 8019
2021-12-04 13:26:47 +01:00
Tim van der Meij
335c4c8a43
Merge pull request #14338 from Snuffleupagus/XRef-more-Pages-validation
[api-minor] Clear all caches in `XRef.indexObjects`, and improve /Root dictionary validation in `XRef.parse` (issue 14303)
2021-12-04 13:23:40 +01:00
Jonas Jenwald
40291d1943 Handle errors when fetching the raw /Metadata (issue 14305)
Currently the `Catalog.metadata` getter only handles errors during parsing, however in a *corrupt* PDF document fetching of the raw /Metadata can obviously fail as well.
Without this patch the `PDFDocumentProxy.getMetadata` method, in the API, can thus fail which it *never* should and this will cause the viewer to not initialize all state as expected.

Fixes one of the documents in issue 14305.
2021-12-04 09:41:42 +01:00
Jonas Jenwald
ca82e1832f Add a (linked) test-case for issue 8019
Given that [bug 1336572](https://bugzilla.mozilla.org/show_bug.cgi?id=1336572) was just closed as fixed, thus fixing issue 8019 in Firefox[1], let's add a test-case to enable us to catch any future regressions either in PDF.js or in browsers themselves.

---
[1] It also seems to be working in Google Chrome, although I'm having a slightly difficult time deciphering *exactly* what configurations were affected when looking through issue 8019.
2021-12-04 08:56:04 +01:00
Jonas Jenwald
ad3a271fc4 [api-minor] Clear all caches in XRef.indexObjects, and improve /Root dictionary validation in XRef.parse (issue 14303)
*This patch improves handling of a couple of PDF documents from issue 14303.*

 - Update `XRef.indexObjects` to actually clear *all* XRef-caches. Invalid XRef tables *usually* cause issues early enough during parsing that we've not populated the XRef-cache, however to prevent any issues we obviously need to clear that one as well.

 - Improve the /Root dictionary validation in `XRef.parse` (PR 9827 follow-up). In addition to checking that a /Pages entry exists, we'll now also check that it can be successfully fetched *and* that it's of the correct type. There's really no point trying to use a /Root dictionary that e.g. `Catalog.toplevelPagesDict` will reject, and this way we'll be able to fallback to indexing the objects in corrupt documents.

 - Throw an `InvalidPDFException`, rather than a general `FormatError`, in `XRef.parse` when no usable /Root dictionary could be found. That really seems more appropriate overall, since all attempts at parsing/recovery have failed. (This part of the patch is API-observable, hence the tag.)

With these changes, two existing test-cases are improved and the unit-tests are updated/re-factored to highlight that. In particular `GHOSTSCRIPT-698804-1-fuzzed.pdf` will now both load and "render" correctly, whereas `poppler-395-0-fuzzed.pdf` will now fail immediately upon loading (rather than *appearing* to work).
2021-12-03 11:57:38 +01:00
Jonas Jenwald
1fac6371d3 [Regression] Eagerly fetch/parse the entire /Pages-tree in corrupt documents (issue 14303, PR 14311 follow-up)
*Please note:* This is similar to the method that existed prior to PR 3848, but the new method will *only* be used as a fallback when parsing of corrupt PDF documents.

The implementation in PR 14311 unfortunately turned out to be *way* too simplistic, as evident by the recently added test-files in issue 14303, since it may *cause* infinite loops in `PDFDocument.checkLastPage` for some corrupt PDF documents.[1]
To avoid this, the easiest solution that I could come up with was to fallback to eagerly parsing the *entire* /Pages-tree when the /Count-entry validation fails during document initialization.

Fixes *at least* two of the issues listed in issue 14303, namely the `poppler-395-0.pdf...` and `GHOSTSCRIPT-698804-1.pdf...` documents.

---
[1] The whole point of PR 14311 was obviously to *get rid of* infinte loops during document initialization, not to introduce any more of those.
2021-12-02 14:31:04 +01:00
Jonas Jenwald
8ea740c800 Slightly extend the "creates pdf doc from PDF file with bad XRef table" unit-test (PR 14304 follow-up)
Given that we're able to "render" this document, let's extend the unit-test to actually check that we're able to obtain the operatorList; although given the overall issues in the document it'll be empty.
2021-12-02 11:51:40 +01:00
Jonas Jenwald
63be23f05b Handle errors correctly when data lookup fails during /Pages-tree parsing (issue 14303)
This only applies to severely corrupt documents, where it's possible that the `Parser` throws when we try to access e.g. a /Kids-entry in the /Pages-tree.

Fixes two of the issues listed in issue 14303, namely the `poppler-742-0.pdf...` and `poppler-937-0.pdf...` documents.
2021-12-02 10:54:40 +01:00
Tim van der Meij
0d2cdff6c5
Fix browser page navigation for Puppeteer 11+ in test/test.js
In Puppeteer 11 we noticed that Firefox doesn't shut down once the tests
are done anymore. I tracked this down to the `page.goto` call, in
`startBrowser`, never resolving anymore. I can only assume that
something changed in Puppeteer, possibly in combination with recent
Firefox Nightly versions, that caused this, but haven't been able to
fully track it down.

However, I did find that the problem is that the `load` event no longer
triggers, so fortunately we can fix the problem by explicitly waiting
for the `domcontentloaded` event instead. In general this change might
even be better since we now wait until the test framework is fully
loaded before we continue. Note that this also still works for the
current Puppeteer version.

I did find two upstream references that appear to track this issue, both
on the Puppeteer side and on the Firefox side, making me further suspect
that the issue is partly on both sides:

- https://github.com/puppeteer/puppeteer/issues/5806
- https://bugzilla.mozilla.org/show_bug.cgi?id=1706353
2021-11-28 18:58:22 +01:00
Tim van der Meij
60ed3cd297
Fix compatibility with Node.js 17 in test/test.js
Node.js 17, which as of writing is the most recent version, contains a
breaking change in its DNS resolver, causing Firefox not to start
anymore in our test framework. The inline comment together with the
following resources provide more background:

- https://github.com/nodejs/node/issues/40702
- https://github.com/nodejs/node/pull/39987
- https://github.com/cyrus-and/chrome-remote-interface/issues/467
- https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V17.md#other-notable-changes
- https://github.com/DeviceFarmer/adbkit/issues/209
- https://nodejs.org/api/dns.html#dnssetdefaultresultorderorder

This commit ensures that versions both older and newer than Node.js 17
work as expected. This is mainly necessary since the bots as of writing
run Node.js 14.17.0 which is from before this API got introduced and for
example Node.js 12 LTS is only end-of-life in April 2022, so we have to
keep support for those older versions unfortunately.
2021-11-28 18:52:51 +01:00
Tim van der Meij
5309133a9d
Fix browser error logging in test/test.js
If a browser cannot be started, we currently get the following log:
`Error while starting firefox: [object Object]`. This is simply an
oversight from the initial Puppeteer integration work since we never got
into this code path before. With this fix the error log becomes more
useful: `Error while starting firefox: connect ECONNREFUSED ::1:45387`
2021-11-28 18:08:08 +01:00
Jonas Jenwald
a807ffe907 Prevent circular references in XRef tables from hanging the worker-thread (issue 14303)
*Please note:* While this patch on its own is sufficient to prevent the worker-thread from hanging, however in combination with PR 14311 these PDF documents will both load *and* render correctly.

Rather than focusing on the particular structure of these PDF documents, it seemed (at least to me) to make sense to try and prevent all circular references when fetching/looking-up data using the XRef table.
To avoid a solution that required tracking the references manually everywhere, the implementation settled on here instead handles that internally in the `XRef.fetch`-method. This should work, since that method *and* the `Parser`/`Lexer`-implementations are completely synchronous.

Note also that the existing `XRef`-caching, used for all data-types *except* Streams, should hopefully help to lessen the performance impact of these changes.
One *potential* problem with these changes could be certain *browser* exceptions, since those are generally not catchable in JavaScript code, however those would most likely "stop" worker-thread parsing anyway (at least I hope so).

Finally, note that I settled on returning dummy-data rather than throwing an exception. This was done to allow parsing, for the rest of the document, to continue such that *one* bad reference doesn't prevent an entire document from loading.

Fixes two of the issues listed in issue 14303, namely the `poppler-91414-0.zip-2.gz-53.pdf` and `poppler-91414-0.zip-2.gz-54.pdf` documents.
2021-11-27 23:50:26 +01:00
Jonas Jenwald
d0c4bbd828 [api-minor] Validate the /Pages-tree /Count entry during document initialization (issue 14303)
*This patch basically extends the approach from PR 10392, by also checking the last page.*

Currently, in e.g. the `Catalog.numPages`-getter, we're simply assuming that if the /Pages-tree has an *integer* /Count entry it must also be correct/valid.
As can be seen in the referenced PDF documents, that entry may be completely bogus which causes general parsing to breaking down elsewhere in the worker-thread (and hanging the browser).

Rather than hoping that the /Count entry is correct, similar to all other data found in PDF documents, we obviously need to validate it. This turns out to be a little less straightforward than one would like, since the only way to do this (as far as I know) is to parse the *entire* /Pages-tree and essentially counting the pages.
To avoid doing that for all documents, this patch tries to take a short-cut by checking if the last page (based on the /Count entry) can be successfully fetched. If so, we assume that the /Count entry is correct and use it as-is, otherwise we'll iterate through (potentially) the *entire* /Pages-tree to determine the number of pages.

Unfortunately these changes will have a number of *somewhat* negative side-effects, please see a possibly incomplete list below, however I cannot see a better way to address this bug.
 - This will slow down initial loading/rendering of all documents, at least by some amount, since we now need to fetch/parse more of the /Pages-tree in order to be able to access the *last* page of the PDF documents.
 - For poorly generated PDF documents, where the entire /Pages-tree only has *one* level, we'll unfortunately need to fetch/parse the *entire* /Pages-tree to get to the last page. While there's a cache to help reduce repeated data lookups, this will affect initial loading/rendering of *some* long PDF documents,
 - This will affect the `disableAutoFetch = true` mode negatively, since we now need to fetch/parse more data during document initialization. While the `disableAutoFetch = true` mode should still be helpful in larger/longer PDF documents, for smaller ones the effect/usefulness may unfortunately be lost.

As one *small* additional bonus, we should now also be able to support opening PDF documents where the /Pages-tree /Count entry is completely invalid (e.g. contains a non-integer value).

Fixes two of the issues listed in issue 14303, namely the `poppler-67295-0.pdf` and `poppler-85140-0.pdf` documents.
2021-11-27 21:57:35 +01:00
Calixte Denizet
31e13515f5 XFA - Draw arcs correctly
- it aims to fix #14315;
- take into account the startAngle to compute the coordinates of the final point.
2021-11-27 19:30:12 +01:00
Jonas Jenwald
ca8d2bdce4 Abort parsing when the XRef /W-array contain bogus entries (issue 14303)
For this particular PDF document, we have `/W [1 2 166666666666666666666666666]` which obviously makes no sense.

While this patch makes no attempt at actually validating the entries in the /W-array, we'll now simply abort all processing when the end of the PDF document has been reached (thus preventing hanging the browser).
Please note that this patch doesn't enable the PDF document to be loaded/rendered, but at least it fails "correctly" now.

Fixes one of the issues listed in issue 14303, namely the `REDHAT-1531897-0.pdf`document.
2021-11-25 18:35:08 +01:00
Jonas Jenwald
ae4f1ae3e7 Ensure that ChunkedStream won't attempt to request data *beyond* the document size (issue 14303)
This bug was surprisingly difficult to track down, since it didn't just depend on range-requests being used but also on how quickly the document was loaded. To even be able to reproduce this locally, I had to use a very small `rangeChunkSize`-value (note the unit-test).

The cause of this bug is a bogus entry in the XRef-table, causing us to attempt to request data from *beyond* the actual document size and thus getting into an infinite loop.

Fixes *one* of the issues listed in issue 14303, namely the `PDFBOX-4352-0.pdf` document.
2021-11-24 19:19:43 +01:00
Jonas Jenwald
0ebac67a9f Remove the {BaseViewer, PDFThumbnailViewer}._pagesRequests caches
In the `BaseViewer` this cache is mostly relevant in the `disableAutoFetch = true` mode, since the pages are being initialized *lazily* in that case.
In the `PDFThumbnailViewer` this cache is mostly used for thumbnails that are actually being rendered, as opposed to those created directly from the "regular" pages.

Please note that I'm not suggesting that we remove these caches because they're only used in some situations, but rather because they're for all intents and purposes actually *redundant*. In the API itself, we're already caching both the page-promises and the actual pages themselves on the `WorkerTransport`-instance.
Hence these viewer-caches aren't really necessary in practice, and adds what to me mostly seems like an unnecessary level of indirection.[1]

Given that the viewer now relies on caching in the API itself, this patch also adds a new unit-test to ensure that page-caching works (and keep working) as expected.

---
[1] In the `WorkerTransport.getPage`-method the parameter is being validated on every call, but that's hardly enough code to warrant keeping the "duplicate" caches in the viewer in my opinion.
2021-11-21 11:40:45 +01:00
Jonas Jenwald
6da0944fc7 [api-minor] Replace PDFDocumentProxy.getStats with a synchronous PDFDocumentProxy.stats getter
*Please note:* These changes will primarily benefit longer documents, somewhat at the expense of e.g. one-page documents.

The existing `PDFDocumentProxy.getStats` function, which in the default viewer is called for each rendered page, requires a round-trip to the worker-thread in order to obtain the current document stats. In the default viewer, we currently make one such API-call for *every rendered* page.
This patch proposes replacing that method with a *synchronous* `PDFDocumentProxy.stats` getter instead, combined with re-factoring the worker-thread code by adding a `DocStats`-class to track Stream/Font-types and *only send* them to the main-thread *the first time* that a type is encountered.

Note that in practice most PDF documents only use a fairly limited number of Stream/Font-types, which means that in longer documents most of the `PDFDocumentProxy.getStats`-calls will return the same data.[1]
This re-factoring will obviously benefit longer document the most[2], and could actually be seen as a regression for one-page documents, since in practice there'll usually be a couple of "DocStats" messages sent during the parsing of the first page. However, if the user zooms/rotates the document (which causes re-rendering), note that even a one-page document would start to benefit from these changes.

Another benefit of having the data available/cached in the API is that unless the document stats change during parsing, repeated `PDFDocumentProxy.stats`-calls will return *the same identical* object.
This is something that we can easily take advantage of in the default viewer, by now *only* reporting "documentStats" telemetry[3] when the data actually have changed rather than once per rendered page (again beneficial in longer documents).

---
[1] Furthermore, the maximium number of `StreamType`/`FontType` are `10` respectively `12`, which means that regardless of the complexity and page count in a PDF document there'll never be more than twenty-two "DocStats" messages sent; see 41ac3f0c07/src/shared/util.js (L206-L232)

[2] One example is the `pdf.pdf` document in the test-suite, where rendering all of its 1310 pages only result in a total of seven "DocStats" messages being sent from the worker-thread.

[3] Reporting telemetry, in Firefox, includes using `JSON.stringify` on the data and then sending an event to the `PdfStreamConverter.jsm`-code.
In that code the event is handled and `JSON.parse` is used to retrieve the data, and in the "documentStats"-case we'll then iterate through the data to avoid double-reporting telemetry; see https://searchfox.org/mozilla-central/rev/8f4c180b87e52f3345ef8a3432d6e54bd1eb18dc/toolkit/components/pdfjs/content/PdfStreamConverter.jsm#515-549
2021-11-20 12:20:55 +01:00
Tim van der Meij
b1e9e214bf
Merge pull request #14229 from brendandahl/term-log
Add an easy way to log to the terminal during browser tests.
2021-11-19 19:48:59 +01:00
Brendan Dahl
c6cb39ef30
Merge pull request #14262 from Snuffleupagus/issue-14261
Include the /Lang-property, when it exists, in the StructTree-data (issue 14261)
2021-11-19 07:51:21 -08:00
Brendan Dahl
052db56a2e Add an easy way to log to the terminal during browser tests.
On the main thread call `driver.log` and the message will output in the
terminal with the pdf id and the message.

I've been using this a lot when trying to find certain PDFs or logging
stats.
2021-11-18 15:38:56 -08:00
Brendan Dahl
9f4a2cf5ce
Merge pull request #14276 from Snuffleupagus/issue-14242-2
Only show the `loadingIcon`-spinner on visible pages (issue 14242)
2021-11-18 13:43:58 -08:00
Tim van der Meij
3dccaccbb4
Merge pull request #14278 from Snuffleupagus/rm-removeChild
Replace the remaining `Node.removeChild()` instances with `Element.remove()`
2021-11-17 20:17:55 +01:00
Jonas Jenwald
4ef1a129fa Replace the remaining Node.removeChild() instances with Element.remove()
Using `Element.remove()` is a slightly more compact way of removing an element, since you no longer need to explicitly find/use its parent element.
Furthermore, the patch also replaces a couple of loops that're used to delete all elements under a node with simply overwriting the contents directly (a pattern already used throughout the viewer).

See also:
 - https://developer.mozilla.org/en-US/docs/Web/API/Node/removeChild
 - https://developer.mozilla.org/en-US/docs/Web/API/Element/remove
2021-11-16 17:52:50 +01:00
Brendan Dahl
3209c013c4
Merge pull request #14247 from calixteman/button
[api-minor] Render pushbuttons on their own canvas (bug 1737260)
2021-11-16 08:10:40 -08:00
Jonas Jenwald
7d4c37e988 Use the new iterator in the PDFPageViewBuffer unit-tests
The previous patch introduced an iterator in the `PDFPageViewBuffer`-class, hence the test-only `_buffer`-getter is no longer necessary.
2021-11-15 14:06:17 +01:00
Jonas Jenwald
971ac8e993 Include the /Lang-property, when it exists, in the StructTree-data (issue 14261)
*Please note:* This is a tentative patch, since I don't have the necessary a11y-software to actually test it.
2021-11-14 12:37:41 +01:00
Calixte Denizet
fe95e100e4 Parse query string in using URLSearchParams
- I just noticed in reading the code that we parse that stuff when something exists in the web api;
 - see https://developer.mozilla.org/en-US/docs/Web/API/URLSearchParams/URLSearchParams.
2021-11-13 21:10:54 +01:00
calixteman
85c6dd59ce
Merge pull request #14268 from calixteman/outline
Remove non-displayable chars from outline title (#14267)
2021-11-13 08:12:56 -08:00
Calixte Denizet
7041c62ccf Remove non-displayable chars from outline title (#14267)
- it aims to fix #14267;
 - there is nothing about chars in range [0-1F] in the specs but acrobat doesn't display them in any way.
2021-11-13 16:56:08 +01:00
Jonas Jenwald
afcc99a86d When parsing corrupt documents without any trailer-dictionary, fallback to the "top"-dictionary (issue 14269)
There's obviously no guarantee that this will work in general, if the document is sufficiently corrupt, but it should hopefully be better than just throwing `InvalidPDFException` as currently happens.

Please note that, as is often the case with corrupt documents, it's somewhat difficult to know if we're rendering the document "correctly" with this patch[1]. In this case even Adobe Reader cannot open the document, which is always a good sign that it's *really* corrupt, however we're at least able to render *something* with this patch.

---
[1] Whatever "correct" even means when dealing with corrupt PDF documents, where often times different PDF viewers won't agree completely.
2021-11-13 13:21:38 +01:00
Jonas Jenwald
28fb3975eb
Merge pull request #14266 from calixteman/bug931481
Don't consider space as real space when there is an extra spacing (bug 931481)
2021-11-12 21:42:32 +01:00
Calixte Denizet
a88ff34eb7 Don't consider space as real space when there is an extra spacing (bug 931481)
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=931481;
 - real space chars are pushed in the chunk but when there is an extra spacing, the next char position must be compared with the previous one;
 - for example, an extra spacing can cancel a space so visually there are no space.
2021-11-12 18:53:48 +01:00
Calixte Denizet
5b7e1f5232 XFA - Avoid an exception when looking for a font in a parent node
- it aims to fix issue https://github.com/mozilla/pdf.js/issues/14150;
  - a parent can be null in case the root has been reached, so just add a check.
2021-11-12 16:27:08 +01:00
Calixte Denizet
33ea817b20 [api-minor] Render pushbuttons on their own canvas (bug 1737260)
- First step to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1737260;
 - several interactive pdfs use the possibility to hide/show buttons to show different icons;
 - render pushbuttons on their own canvas and then insert it the annotation_layer;
 - update test/driver.js in order to convert canvases for pushbuttons into images.
2021-11-12 15:37:33 +01:00
Jonas Jenwald
ea1c348c67 Always prefer abbreviated keys, over full ones, when doing any dictionary lookups (issue 14256)
Note that issue 14256 was specifically about *inline* images, please refer to:
 - https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G7.1852045
 - https://www.pdfa.org/safedocs-unearths-pdf-inline-image-issue/
 - https://pdf-issues.pdfa.org/32000-2-2020/clause08.html#H8.9.7

However, during review of the initial PR in https://github.com/mozilla/pdf.js/pull/14257#issuecomment-964469710, it was suggested that we instead do this *unconditionally for all* dictionary lookups.
In addition to re-ordering the existing call-sites in the `src/core`-code, and adding non-PRODUCTION/TESTING asserts to catch future errors, for consistency a number of existing `if`/`switch`-blocks were re-factored to also check the abbreviated keys first.
2021-11-10 11:56:18 +01:00
calixteman
4bb9de4b00
Merge pull request #14239 from calixteman/1739502
XFA - Fix a breakBefore issue when target is a contentArea and startNew is 1 (bug 1739502)
2021-11-08 03:14:42 -08:00
Calixte Denizet
13ae6d493a XFA - Encode tag names in UTF-8 when saving (fix #14249) 2021-11-07 21:41:37 +01:00
Tim van der Meij
891f21fba6
Merge pull request #14245 from Snuffleupagus/PDFPageViewBuffer-class
Convert `PDFPageViewBuffer` to a standard class, and use a `Set` internally
2021-11-07 14:37:33 +01:00
calixteman
efb4455749
Merge pull request #14240 from calixteman/14014
XFA - Get each page asynchronously in order to avoid blocking the event loop (#14014)
2021-11-06 13:21:43 -07:00
Calixte Denizet
1681e25008 XFA - Get each page asynchronously in order to avoid blocking the event loop (#14014) 2021-11-06 13:25:03 +01:00