Commit Graph

15191 Commits

Jonas Jenwald
d9e0de8515
Merge pull request #14333 from Snuffleupagus/Pages-tree-corruption
Handle errors correctly when data lookup fails during /Pages-tree parsing (issue 14303)
2021-12-02 11:49:11 +01:00
Jonas Jenwald
63be23f05b Handle errors correctly when data lookup fails during /Pages-tree parsing (issue 14303)
This only applies to severely corrupt documents, where it's possible that the `Parser` throws when we try to access e.g. a /Kids-entry in the /Pages-tree.

Fixes two of the issues listed in issue 14303, namely the `poppler-742-0.pdf...` and `poppler-937-0.pdf...` documents.
2021-12-02 10:54:40 +01:00
Jonas Jenwald
487a7ddc7d Update (primarily) the Node.js examples to release page resources
Given that Node.js doesn't support Workers, general PDF.js performance will be worse when compared to browsers. In an attempt to improve at least memory usage a little bit, update the Node.js examples to release page resources once parsing is done for that page.
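
The pattern, as a minimal sketch using the `pdfjs-dist` legacy build in Node.js (the import path and helper function are illustrative, not taken verbatim from the examples):

```js
const pdfjsLib = require("pdfjs-dist/legacy/build/pdf.js");

async function dumpText(pdfPath) {
  const doc = await pdfjsLib.getDocument(pdfPath).promise;
  for (let pageNum = 1; pageNum <= doc.numPages; pageNum++) {
    const page = await doc.getPage(pageNum);
    const textContent = await page.getTextContent();
    console.log(textContent.items.map(item => item.str).join(" "));
    // Release the page resources once parsing is done for this page, to keep
    // memory usage down in Node.js where Workers aren't available.
    page.cleanup();
  }
}
```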
2021-11-30 13:11:50 +01:00
Jonas Jenwald
6dfe4a9140 Enforce PAGE-scrolling for *very* large/long documents (bug 1588435, PR 11263 follow-up)
This patch is essentially a continuation of PR 11263, which tried to improve loading/initialization performance of *very* large/long documents.

Note that browsers, in general, don't handle a huge amount of DOM-elements very well, with really poor (e.g. sluggish scrolling) performance once the number gets "large". Furthermore, at least in Firefox, it seems that DOM-elements towards the bottom of an HTML-page can effectively be ignored; for the PDF.js viewer that means that pages at the end of the document can become impossible to access.

Hence, in order to improve things for these *very* large/long documents, this patch will now enforce usage of the (recently added) PAGE-scrolling mode for these documents. As implemented, this will only happen once the number of pages *exceeds* 15000 (which is hopefully rare in practice).
While this might feel a bit jarring to users being *forced* to use PAGE-scrolling, it seems all things considered like a better idea to ensure that the entire document actually remains accessible and with (hopefully) more acceptable performance.

Fixes [bug 1588435](https://bugzilla.mozilla.org/show_bug.cgi?id=1588435), to the extent that doing so is possible since the document contains 25560 pages (and is 197 MB large).
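
As a rough sketch of the behavior (the constant and function names are assumptions, not the actual `BaseViewer` code):

```js
// Stand-in for the viewer's ScrollMode enum (values are illustrative).
const ScrollMode = { VERTICAL: 0, PAGE: 3 };
const FORCE_PAGE_SCROLL_MODE_PAGE_COUNT = 15000;

function resolveScrollMode(pagesCount, requestedScrollMode) {
  // For *very* long documents, enforce PAGE-scrolling so that every page
  // stays accessible and scrolling performance remains acceptable.
  if (pagesCount > FORCE_PAGE_SCROLL_MODE_PAGE_COUNT) {
    return ScrollMode.PAGE;
  }
  return requestedScrollMode;
}
```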
2021-11-29 13:54:24 +01:00
Jonas Jenwald
f15eb63ed5 Remove the PDFSinglePageViewer-specific code from web/secondary_toolbar.js (PR 9877 follow-up)
This was added on the assumption that the viewer would (eventually) start using the `PDFSinglePageViewer` for e.g. PAGE-scrolling mode and PresentationMode. However, having both a `PDFViewer` and a `PDFSinglePageViewer` side-by-side in the viewer would've been tricky to implement well, which is why PR 14112 implemented PAGE-scrolling for the general `BaseViewer` instead.

Given that the default viewer is no longer (potentially) going to use `PDFSinglePageViewer`, there's code in the `SecondaryToolbar` (and related CSS rules) which is now unnecessary.
2021-11-29 13:13:17 +01:00
Jonas Jenwald
700eaecddd
Merge pull request #14321 from timvandermeij/puppeteer
Upgrade to Puppeteer 12
2021-11-29 11:38:24 +01:00
Tim van der Meij
d5b5b665e4
Upgrade to Puppeteer 12 2021-11-28 19:24:11 +01:00
Tim van der Meij
96bb3c6217
Convert package-lock.json to lock file version 2
Since NPM 7, which was released in October 2020 and is thus over a year
old now, NPM automatically transforms lock files from version 1 to
version 2. The NPM 7 release notes state:

"One change to take note of is the new lockfile format, which is
backwards compatible with npm 6 users. The lockfile v2 unlocks the
ability to do deterministic and reproducible builds to produce a
package tree."

Not only is this change backwards compatible (so older versions of NPM
will still be able to install everything as expected), reproducibility
is also a nice property to have and modern NPM versions will otherwise
constantly do the conversion anyway, causing contributors to explicitly
have to revert the change. Therefore, I believe we should do this now
since it doesn't break backwards compatibility for consumers of this
file. It only means that producers of this file (i.e., us contributors)
need to use NPM 7 or higher (as of writing NPM 8 is even
available). According to https://nodejs.org/en/download/releases/ this
means contributors should at least run Node.js 15.0.0, while 17.1.0 is
the most recent as of writing, so to me that sounds reasonable to ask.
2021-11-28 19:23:25 +01:00
Tim van der Meij
0d2cdff6c5
Fix browser page navigation for Puppeteer 11+ in test/test.js
In Puppeteer 11 we noticed that Firefox no longer shuts down once the
tests are done. I tracked this down to the `page.goto` call, in
`startBrowser`, no longer resolving. I can only assume that
something changed in Puppeteer, possibly in combination with recent
Firefox Nightly versions, that caused this, but haven't been able to
fully track it down.

However, I did find that the problem is that the `load` event no longer
triggers, so fortunately we can fix the problem by explicitly waiting
for the `domcontentloaded` event instead. In general this change might
even be better since we now wait until the test framework is fully
loaded before we continue. Note that this also still works for the
current Puppeteer version.

I did find two upstream references that appear to track this issue, one
on the Puppeteer side and one on the Firefox side, which makes me further
suspect that the issue is partly on both sides:

- https://github.com/puppeteer/puppeteer/issues/5806
- https://bugzilla.mozilla.org/show_bug.cgi?id=1706353
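
The essence of the change, as a sketch (the URL is a placeholder):

```js
// Wait for `domcontentloaded` instead of the default `load` event, which no
// longer fires reliably with Puppeteer 11+ and recent Firefox Nightly builds.
await page.goto("http://127.0.0.1:43484/test.html", {
  waitUntil: "domcontentloaded",
});
```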
2021-11-28 18:58:22 +01:00
Tim van der Meij
60ed3cd297
Fix compatibility with Node.js 17 in test/test.js
Node.js 17, which as of writing is the most recent version, contains a
breaking change in its DNS resolver, causing Firefox not to start
anymore in our test framework. The inline comment together with the
following resources provide more background:

- https://github.com/nodejs/node/issues/40702
- https://github.com/nodejs/node/pull/39987
- https://github.com/cyrus-and/chrome-remote-interface/issues/467
- https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V17.md#other-notable-changes
- https://github.com/DeviceFarmer/adbkit/issues/209
- https://nodejs.org/api/dns.html#dnssetdefaultresultorderorder

This commit ensures that versions both older and newer than Node.js 17
work as expected. This is mainly necessary because the bots, as of
writing, run Node.js 14.17.0, which predates the introduction of this
API, and because e.g. Node.js 12 LTS only reaches end-of-life in April
2022, so unfortunately we have to keep supporting those older versions.
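
The guard boils down to something like this (a sketch; `dns.setDefaultResultOrder` only exists in newer Node.js versions, hence the feature check):

```js
const dns = require("dns");

if (typeof dns.setDefaultResultOrder === "function") {
  // Restore the pre-Node.js 17 behavior of preferring IPv4 results, so that
  // the test framework can still connect to the locally started browsers.
  dns.setDefaultResultOrder("ipv4first");
}
```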
2021-11-28 18:52:51 +01:00
Tim van der Meij
5309133a9d
Fix browser error logging in test/test.js
If a browser cannot be started, we currently get the following log:
`Error while starting firefox: [object Object]`. This is simply an
oversight from the initial Puppeteer integration work since we never got
into this code path before. With this fix the error log becomes more
useful: `Error while starting firefox: connect ECONNREFUSED ::1:45387`
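
Illustratively, the fix amounts to logging the error message rather than the Error object itself (not the exact test/test.js code):

```js
try {
  await startBrowser("firefox");
} catch (error) {
  // Interpolating the Error object directly yields "[object Object]"; its
  // message contains the actually useful information.
  console.error(`Error while starting firefox: ${error.message}`);
}
```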
2021-11-28 18:08:08 +01:00
Tim van der Meij
c14552874b
Merge pull request #14312 from Snuffleupagus/XRef-circular-reference
Prevent circular references in XRef tables from hanging the worker-thread (issue 14303)
2021-11-28 14:07:02 +01:00
Tim van der Meij
7613bb5522
Merge pull request #14320 from Snuffleupagus/update-packages
Update packages and translations
2021-11-28 13:58:53 +01:00
Jonas Jenwald
d62f847db2 Update Stylelint to version 14 (along with related packages)
Based on https://github.com/stylelint/stylelint/blob/14.0.0/docs/migration-guide/to-14.md none of the changes look directly relevant for us.
2021-11-28 12:36:16 +01:00
Jonas Jenwald
37bed4ea45 Update l10n files 2021-11-28 11:14:03 +01:00
Jonas Jenwald
fdf3f03985 Update the eslint-plugin-unicorn package to the latest version
Please see https://github.com/sindresorhus/eslint-plugin-unicorn/releases/tag/v39.0.0
2021-11-28 11:14:02 +01:00
Jonas Jenwald
0c301dfa8b Update the dommatrix package to the latest version 2021-11-28 11:14:02 +01:00
Jonas Jenwald
cfc55b044e Update npm packages 2021-11-28 11:13:58 +01:00
Jonas Jenwald
a807ffe907 Prevent circular references in XRef tables from hanging the worker-thread (issue 14303)
*Please note:* While this patch on its own is sufficient to prevent the worker-thread from hanging, in combination with PR 14311 these PDF documents will both load *and* render correctly.

Rather than focusing on the particular structure of these PDF documents, it seemed (at least to me) to make sense to try and prevent all circular references when fetching/looking-up data using the XRef table.
To avoid a solution that required tracking the references manually everywhere, the implementation settled on here instead handles that internally in the `XRef.fetch`-method. This should work, since that method *and* the `Parser`/`Lexer`-implementations are completely synchronous.

Note also that the existing `XRef`-caching, used for all data-types *except* Streams, should hopefully help to lessen the performance impact of these changes.
One *potential* problem with these changes could be certain *browser* exceptions, since those are generally not catchable in JavaScript code; however, those would most likely "stop" worker-thread parsing anyway (at least I hope so).

Finally, note that I settled on returning dummy-data rather than throwing an exception. This was done to allow parsing, for the rest of the document, to continue such that *one* bad reference doesn't prevent an entire document from loading.

Fixes two of the issues listed in issue 14303, namely the `poppler-91414-0.zip-2.gz-53.pdf` and `poppler-91414-0.zip-2.gz-54.pdf` documents.
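
A very rough sketch of the approach (class and property names are hypothetical, not the actual `XRef` implementation):

```js
class XRefSketch {
  #pendingRefs = new Set();

  fetch(ref) {
    const key = ref.toString();
    if (this.#pendingRefs.has(key)) {
      // Circular reference detected: return dummy-data rather than throwing,
      // so that parsing of the rest of the document can continue.
      return null;
    }
    this.#pendingRefs.add(key);
    try {
      // Tracking references like this works since this method *and* the
      // `Parser`/`Lexer`-implementations are completely synchronous.
      return this.fetchUncached(ref);
    } finally {
      this.#pendingRefs.delete(key);
    }
  }

  fetchUncached(ref) {
    // Actual XRef-table lookup and parsing elided.
  }
}
```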
2021-11-27 23:50:26 +01:00
Jonas Jenwald
a669fce762 Inline the isDict, isRef, and isStream checks in the src/core/xref.js file 2021-11-27 23:49:17 +01:00
Jonas Jenwald
680e0efb9d Use Array-destructuring in the XRef.readXRefStream-method 2021-11-27 23:49:17 +01:00
Jonas Jenwald
a2a5376adf
Merge pull request #14311 from Snuffleupagus/validate-Pages-Count
[api-minor] Validate the /Pages-tree /Count entry during document initialization (issue 14303)
2021-11-27 23:47:05 +01:00
Jonas Jenwald
d0c4bbd828 [api-minor] Validate the /Pages-tree /Count entry during document initialization (issue 14303)
*This patch basically extends the approach from PR 10392, by also checking the last page.*

Currently, in e.g. the `Catalog.numPages`-getter, we're simply assuming that if the /Pages-tree has an *integer* /Count entry it must also be correct/valid.
As can be seen in the referenced PDF documents, that entry may be completely bogus, which causes general parsing to break down elsewhere in the worker-thread (and hangs the browser).

Rather than hoping that the /Count entry is correct, similar to all other data found in PDF documents, we obviously need to validate it. This turns out to be a little less straightforward than one would like, since the only way to do this (as far as I know) is to parse the *entire* /Pages-tree and essentially count the pages.
To avoid doing that for all documents, this patch tries to take a short-cut by checking if the last page (based on the /Count entry) can be successfully fetched. If so, we assume that the /Count entry is correct and use it as-is, otherwise we'll iterate through (potentially) the *entire* /Pages-tree to determine the number of pages.

Unfortunately these changes will have a number of *somewhat* negative side-effects (please see the, possibly incomplete, list below); however, I cannot see a better way to address this bug.
 - This will slow down initial loading/rendering of all documents, at least by some amount, since we now need to fetch/parse more of the /Pages-tree in order to be able to access the *last* page of the PDF documents.
 - For poorly generated PDF documents, where the entire /Pages-tree only has *one* level, we'll unfortunately need to fetch/parse the *entire* /Pages-tree to get to the last page. While there's a cache to help reduce repeated data lookups, this will affect initial loading/rendering of *some* long PDF documents.
 - This will affect the `disableAutoFetch = true` mode negatively, since we now need to fetch/parse more data during document initialization. While the `disableAutoFetch = true` mode should still be helpful in larger/longer PDF documents, for smaller ones the effect/usefulness may unfortunately be lost.

As one *small* additional bonus, we should now also be able to support opening PDF documents where the /Pages-tree /Count entry is completely invalid (e.g. contains a non-integer value).

Fixes two of the issues listed in issue 14303, namely the `poppler-67295-0.pdf` and `poppler-85140-0.pdf` documents.
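
A simplified sketch of the short-cut (method names are illustrative, not the actual `Catalog` code):

```js
async function getNumPages(catalog) {
  const count = catalog.rawPageCount; // the /Count entry, possibly bogus
  if (Number.isInteger(count) && count > 0) {
    try {
      // If the last page can be fetched, assume the /Count entry is correct.
      await catalog.getPageDict(count - 1);
      return count;
    } catch {
      // Fall through and determine the page count manually.
    }
  }
  // Otherwise iterate through (potentially) the *entire* /Pages-tree.
  return catalog.countPagesByTraversal();
}
```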
2021-11-27 21:57:35 +01:00
Tim van der Meij
9a1e27efc5
Merge pull request #14313 from Snuffleupagus/PDFDocument_pagePromises-map
Change the `_pagePromises` cache, in the worker, from an Array to a Map
2021-11-27 20:58:23 +01:00
calixteman
bbd8b5ce9f
Merge pull request #14319 from calixteman/xfa_arc
XFA - Draw arcs correctly
2021-11-27 11:32:32 -08:00
Calixte Denizet
31e13515f5 XFA - Draw arcs correctly
- it aims to fix #14315;
- take into account the startAngle to compute the coordinates of the final point.
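
Illustratively, the end point of the arc has to be computed from the startAngle as well (variable names are assumptions):

```js
const endAngle = startAngle + sweepAngle;
const endX = cx + rx * Math.cos(endAngle);
const endY = cy + ry * Math.sin(endAngle);
```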
2021-11-27 19:30:12 +01:00
Jonas Jenwald
b11091c0f9
Merge pull request #14318 from calixteman/14317
Handle sub/super-scripts in rich text
2021-11-27 18:49:52 +01:00
Calixte Denizet
cfdaa57353 Handle sub/super-scripts in rich text
- it aims to fix #14317;
- change the fontSize and the verticalAlign properties according to the position of the text.
2021-11-27 16:06:09 +01:00
Jonas Jenwald
e439f4d620
Merge pull request #14314 from Snuffleupagus/XFA-viewer-refs-cache-fix
[Regression] Prevent errors, during loading, in the viewer for XFA-documents (PR 14295 follow-up)
2021-11-26 20:47:45 +01:00
Jonas Jenwald
8fa5fcfe72 [Regression] Prevent errors, during loading, in the viewer for XFA-documents (PR 14295 follow-up)
In the second commit in PR 14295, I forgot that the pages in XFA-documents don't have references (like in regular PDF documents); sorry about that!
2021-11-26 20:21:12 +01:00
Jonas Jenwald
4c56214ab4 Convert PDFDocument._getLinearizationPage to an async method
This, ever so slightly, simplifies the code and reduces overall indentation.
2021-11-26 19:57:47 +01:00
Jonas Jenwald
080996ac68 Change the _pagePromises cache, in the worker, from an Array to a Map
Given that not all pages necessarily are being accessed, or that the pages may be accessed out of order, using a `Map` seems like a more appropriate data-structure here.
Furthermore, this patch also adds (currently missing) caching for XFA-documents. Loading a couple of such documents in the viewer, with logging added, shows that we're currently re-creating `Page`-instances unnecessarily for XFA-documents.
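
A minimal sketch of the Map-based cache (names are illustrative, not the actual worker code):

```js
class PDFDocumentSketch {
  #pagePromises = new Map();

  getPage(pageIndex) {
    const cachedPromise = this.#pagePromises.get(pageIndex);
    if (cachedPromise) {
      return cachedPromise;
    }
    // Only pages that are actually requested end up in the Map, which also
    // handles out-of-order access without leaving holes in an Array.
    const promise = this.#createPage(pageIndex);
    this.#pagePromises.set(pageIndex, promise);
    return promise;
  }

  async #createPage(pageIndex) {
    // Actual `Page`-instance creation elided.
  }
}
```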
2021-11-26 19:53:57 +01:00
Jonas Jenwald
2e2d049a9c
Merge pull request #14310 from Snuffleupagus/XRef-bogus-byteWidths
Abort parsing when the XRef /W-array contains bogus entries (issue 14303)
2021-11-25 19:57:40 +01:00
Jonas Jenwald
ca8d2bdce4 Abort parsing when the XRef /W-array contains bogus entries (issue 14303)
For this particular PDF document, we have `/W [1 2 166666666666666666666666666]` which obviously makes no sense.

While this patch makes no attempt at actually validating the entries in the /W-array, we'll now simply abort all processing when the end of the PDF document has been reached (thus preventing the browser from hanging).
Please note that this patch doesn't enable the PDF document to be loaded/rendered, but at least it fails "correctly" now.

Fixes one of the issues listed in issue 14303, namely the `REDHAT-1531897-0.pdf` document.
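
The bail-out, as a sketch (names are illustrative, not the actual `readXRefStream` code):

```js
if (stream.pos >= stream.end) {
  // A bogus /W entry such as [1 2 166666666666666666666666666] can push the
  // position past the end of the PDF document; abort instead of hanging.
  throw new Error("Invalid XRef entry: reached the end of the document.");
}
```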
2021-11-25 18:35:08 +01:00
Tim van der Meij
a2c380ccb3
Merge pull request #14298 from Snuffleupagus/issue-10906
Center pages vertically in PresentationMode (issue 10906)
2021-11-24 21:39:28 +01:00
Tim van der Meij
973932321e
Merge pull request #14304 from Snuffleupagus/huge-XRef-entry
Ensure that `ChunkedStream` won't attempt to request data *beyond* the document size (issue 14303)
2021-11-24 21:28:24 +01:00
Jonas Jenwald
ae4f1ae3e7 Ensure that ChunkedStream won't attempt to request data *beyond* the document size (issue 14303)
This bug was surprisingly difficult to track down, since it didn't just depend on range-requests being used but also on how quickly the document was loaded. To even be able to reproduce this locally, I had to use a very small `rangeChunkSize`-value (note the unit-test).

The cause of this bug is a bogus entry in the XRef-table, causing us to attempt to request data from *beyond* the actual document size and thus getting into an infinite loop.

Fixes *one* of the issues listed in issue 14303, namely the `PDFBOX-4352-0.pdf` document.
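
The gist of the guard, as a sketch (names are illustrative, not the actual `ChunkedStream` code):

```js
function clampRequestRange(begin, end, totalLength) {
  // Never request data beyond the document size; a bogus XRef entry could
  // otherwise trigger an endless series of range-requests.
  return [Math.min(begin, totalLength), Math.min(end, totalLength)];
}
```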
2021-11-24 19:19:43 +01:00
Jonas Jenwald
5e2aec7dd7
Merge pull request #14299 from SaiKiranMukka/pageviewer-to-async
Convert examples/components/pageviewer.js to await/async (issue 14127)
2021-11-24 14:52:31 +01:00
Jonas Jenwald
f7b1da418f Center pages vertically in PresentationMode (issue 10906)
*This patch can be tested e.g. with the `sizes.pdf` document in the test-suite.*

While this patch isn't necessarily the best solution, e.g. it might be possible to solve this with *only* CSS, it's what I was able to come up with to address an old issue.
The solution here re-uses the `spread`-class in PresentationMode, since that one already takes care of centering pages *vertically*, together with a dummy-page that takes up the entire height of the window.

Finally, some PresentationMode-related CSS-rules are also simplified slightly, since the changes in PR 14112 (using Page-scrolling) allows some clean-up here.
2021-11-24 14:09:34 +01:00
Sai Kiran Mukka
711fbe1376 Convert examples/components/pageviewer.js to await/async (issue 14127) 2021-11-24 15:22:21 +05:30
Tim van der Meij
70fc30d97c
Merge pull request #14295 from Snuffleupagus/rm-viewer-_pagesRequests
Remove the `{BaseViewer, PDFThumbnailViewer}._pagesRequests` caches
2021-11-21 14:30:37 +01:00
Jonas Jenwald
58a2728647 Ensure that BaseViewer.#ensurePdfPageLoaded updates the PDFLinkService-pagesRefCache if necessary
The issue that this patch fixes has existed ever since the viewer was first re-factored into components, however it only really affects the `disableAutoFetch = true` mode.

By default we're fetching all pages in `BaseViewer.setDocument`, and as part of the parsing/initialization we're also populating the `PDFLinkService`-pagesRefCache. The purpose of that cache is to make navigating to any internal destinations faster, by not having to (asynchronously) lookup the pageNumber via the API when handling the destination.
In comparison, when the `disableAutoFetch = true` mode is being used we're instead *lazily* initializing the pages in the `BaseViewer.#ensurePdfPageLoaded`-method. For some reason, that I can only assume is a simple oversight, we're not attempting to update the `PDFLinkService`-pagesRefCache in that case.
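
A simplified sketch of the lazy path, with the missing cache update added (error handling omitted; not the exact viewer code):

```js
async function ensurePdfPageLoaded(pageView, pdfDocument, linkService) {
  if (pageView.pdfPage) {
    return pageView.pdfPage;
  }
  const pdfPage = await pdfDocument.getPage(pageView.id);
  pageView.setPdfPage(pdfPage);
  // Also populate the PDFLinkService pages-reference cache, so that internal
  // destinations resolve quickly even in the disableAutoFetch = true mode.
  linkService.cachePageRef(pageView.id, pdfPage.ref);
  return pdfPage;
}
```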
2021-11-21 11:53:19 +01:00
Jonas Jenwald
0ebac67a9f Remove the {BaseViewer, PDFThumbnailViewer}._pagesRequests caches
In the `BaseViewer` this cache is mostly relevant in the `disableAutoFetch = true` mode, since the pages are being initialized *lazily* in that case.
In the `PDFThumbnailViewer` this cache is mostly used for thumbnails that are actually being rendered, as opposed to those created directly from the "regular" pages.

Please note that I'm not suggesting that we remove these caches because they're only used in some situations, but rather because they're for all intents and purposes actually *redundant*. In the API itself, we're already caching both the page-promises and the actual pages themselves on the `WorkerTransport`-instance.
Hence these viewer-caches aren't really necessary in practice, and add what to me mostly seems like an unnecessary level of indirection.[1]

Given that the viewer now relies on caching in the API itself, this patch also adds a new unit-test to ensure that page-caching works (and keeps working) as expected.

---
[1] In the `WorkerTransport.getPage`-method the parameter is being validated on every call, but that's hardly enough code to warrant keeping the "duplicate" caches in the viewer in my opinion.
2021-11-21 11:40:45 +01:00
Tim van der Meij
aabd4e5092
Merge pull request #14294 from Snuffleupagus/getStats-refactor
[api-minor] Replace `PDFDocumentProxy.getStats` with a synchronous `PDFDocumentProxy.stats` getter
2021-11-20 15:42:46 +01:00
Jonas Jenwald
6da0944fc7 [api-minor] Replace PDFDocumentProxy.getStats with a synchronous PDFDocumentProxy.stats getter
*Please note:* These changes will primarily benefit longer documents, somewhat at the expense of e.g. one-page documents.

The existing `PDFDocumentProxy.getStats` function, which in the default viewer is called for each rendered page, requires a round-trip to the worker-thread in order to obtain the current document stats. In the default viewer, we currently make one such API-call for *every rendered* page.
This patch proposes replacing that method with a *synchronous* `PDFDocumentProxy.stats` getter instead, combined with re-factoring the worker-thread code by adding a `DocStats`-class to track Stream/Font-types and *only send* them to the main-thread *the first time* that a type is encountered.

Note that in practice most PDF documents only use a fairly limited number of Stream/Font-types, which means that in longer documents most of the `PDFDocumentProxy.getStats`-calls will return the same data.[1]
This re-factoring will obviously benefit longer documents the most[2], and could actually be seen as a regression for one-page documents, since in practice there'll usually be a couple of "DocStats" messages sent during the parsing of the first page. However, if the user zooms/rotates the document (which causes re-rendering), note that even a one-page document would start to benefit from these changes.

Another benefit of having the data available/cached in the API is that unless the document stats change during parsing, repeated `PDFDocumentProxy.stats`-calls will return *the same identical* object.
This is something that we can easily take advantage of in the default viewer, by now *only* reporting "documentStats" telemetry[3] when the data actually have changed rather than once per rendered page (again beneficial in longer documents).

---
[1] Furthermore, the maximum number of `StreamType`/`FontType` values is `10` and `12` respectively, which means that regardless of the complexity and page count of a PDF document there'll never be more than twenty-two "DocStats" messages sent; see 41ac3f0c07/src/shared/util.js (L206-L232)

[2] One example is the `pdf.pdf` document in the test-suite, where rendering all of its 1310 pages only results in a total of seven "DocStats" messages being sent from the worker-thread.

[3] Reporting telemetry, in Firefox, includes using `JSON.stringify` on the data and then sending an event to the `PdfStreamConverter.jsm`-code.
In that code the event is handled and `JSON.parse` is used to retrieve the data, and in the "documentStats"-case we'll then iterate through the data to avoid double-reporting telemetry; see https://searchfox.org/mozilla-central/rev/8f4c180b87e52f3345ef8a3432d6e54bd1eb18dc/toolkit/components/pdfjs/content/PdfStreamConverter.jsm#515-549
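
For API consumers the difference looks roughly like this (a usage sketch; the property names on the stats object are assumptions):

```js
// Before: an asynchronous round-trip to the worker-thread, once per call.
const oldStats = await pdfDocument.getStats();

// After: a synchronous getter, backed by data that the worker-thread only
// sends the first time a new Stream/Font-type is encountered. Repeated calls
// return the same object unless the stats actually changed.
const stats = pdfDocument.stats;
console.log(stats.streamTypes, stats.fontTypes);
```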
2021-11-20 12:20:55 +01:00
Tim van der Meij
41ac3f0c07
Merge pull request #14291 from Snuffleupagus/force-postMessageTransfers
[api-minor] Only use Workers when `postMessage` transfers are supported (PR 11123 follow-up)
2021-11-19 20:02:51 +01:00
Tim van der Meij
b1e9e214bf
Merge pull request #14229 from brendandahl/term-log
Add an easy way to log to the terminal during browser tests.
2021-11-19 19:48:59 +01:00
Brendan Dahl
c6cb39ef30
Merge pull request #14262 from Snuffleupagus/issue-14261
Include the /Lang-property, when it exists, in the StructTree-data (issue 14261)
2021-11-19 07:51:21 -08:00
Jonas Jenwald
6f22327e61 [api-minor] Only use Workers when postMessage transfers are supported (PR 11123 follow-up)
Given that all modern browsers now support `postMessage` transfers, and have for years, it no longer seems necessary for the PDF.js library to support using Workers unless the `postMessage` transfers functionality is available.
This patch is a follow-up to PR 11123, which made it impossible to *manually* disable `postMessage` transfers for performance reasons (since it increases memory usage), which hasn't caused any bug reports as far as I know.[1]

Hence we'll now only support *proper* Worker implementations, with fully working `postMessage` transfers, and fallback to using "fake" Workers otherwise.

---
[1] At the time of that PR we still "supported" IE, which is why this code was left intact.
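
The kind of feature-test involved, as a sketch (not the exact PDF.js code):

```js
function supportsPostMessageTransfers() {
  try {
    const buffer = new ArrayBuffer(1);
    const { port1 } = new MessageChannel();
    // If transfers are supported, the ArrayBuffer is detached (its byteLength
    // becomes 0) after being posted with a transfer list.
    port1.postMessage({ buffer }, [buffer]);
    return buffer.byteLength === 0;
  } catch {
    return false;
  }
}
```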
2021-11-19 16:47:58 +01:00
Brendan Dahl
052db56a2e Add an easy way to log to the terminal during browser tests.
On the main thread call `driver.log` and the message will be output in
the terminal together with the id of the PDF being tested.

I've been using this a lot when trying to find certain PDFs or logging
stats.
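
A hedged usage sketch, assuming `driver.log` takes a single message string (the exact signature isn't shown here):

```js
// Printed in the terminal, together with the id of the PDF currently being
// tested (per the description above); `fontCount` is a hypothetical value.
driver.log(`font count: ${fontCount}`);
```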
2021-11-18 15:38:56 -08:00