From 90472e5130378cd443c05c5a8cbaaeff5b82b843 Mon Sep 17 00:00:00 2001
From: Jonas Jenwald
Date: Fri, 10 Dec 2021 17:14:58 +0100
Subject: [PATCH] Avoid overloading the worker-thread during eager page
 initialization in the viewer (PR 11263 follow-up)

This patch is essentially *another* continuation of PR 11263, which tried to
improve loading/initialization performance of *very* large/long documents.

For most documents, unless they're *very* long, we'll eagerly initialize all
of the pages in the viewer. For shorter documents, having all pages
loaded/initialized early provides overall better performance/UX in the
viewer; however, there are cases where it can instead *hurt* performance.
For documents with a couple of thousand pages[1], the parsing and
pre-rendering of the *second* page of the document can be delayed (quite a
bit). The reason is that we trigger `PDFDocumentProxy.getPage` for *all
pages* early during the viewer initialization, which causes the
worker-thread to be swamped with handling (potentially) thousands of
`getPage`-calls, leaving very little time for other parsing (such as e.g. of
operatorLists).

To address this situation, this patch proposes temporarily "pausing" the
eager `PDFDocumentProxy.getPage`-calls once a threshold has been reached, to
give the worker-thread a chance to handle other requests.[2] Obviously this
may *slightly* delay the "pagesloaded" event in longer documents, but
considering that it's already the result of asynchronous parsing, that'll
hopefully not be seen as a blocker for these changes.[3]

---
[1] A particularly problematic example is
https://github.com/mozilla/pdf.js/files/876321/kjv.pdf (16 MB large), which
is a document with 2236 pages and a /Pages-tree that's only *one* level
deep.

[2] Please note that I initially considered simply chaining the
`PDFDocumentProxy.getPage`-calls, however that would have slowed things down
for all documents, which didn't seem appropriate.

[3] This patch will *hopefully* also make it possible to re-visit PR 11312,
since it seems that changing `Catalog.getPageDict` to an `async` method
wasn't the problem in itself. Rather it appears that it leads to slightly
different timings, thus exacerbating the already existing issues with the
worker-thread being overloaded by `getPage`-calls. Having recently worked
with that method, there are a couple of (very old) issues that I'd also like
to address, and having `Catalog.getPageDict` be `async` would simplify
things a great deal.
---
 web/base_viewer.js | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/web/base_viewer.js b/web/base_viewer.js
index 931958ced..8fac5d0ce 100644
--- a/web/base_viewer.js
+++ b/web/base_viewer.js
@@ -57,6 +57,7 @@ const DEFAULT_CACHE_SIZE = 10;
 const PagesCountLimit = {
   FORCE_SCROLL_MODE_PAGE: 15000,
   FORCE_LAZY_PAGE_INIT: 7500,
+  PAUSE_EAGER_PAGE_INIT: 500,
 };
 
 /**
@@ -625,7 +626,7 @@ class BaseViewer {
     // Fetch all the pages since the viewport is needed before printing
     // starts to create the correct size canvas. Wait until one page is
     // rendered so we don't tie up too many resources early on.
-    this._onePageRenderedOrForceFetch().then(() => {
+    this._onePageRenderedOrForceFetch().then(async () => {
       if (this.findController) {
         this.findController.setDocument(pdfDocument); // Enable searching.
       }
@@ -650,7 +651,7 @@
         return;
       }
       for (let pageNum = 2; pageNum <= pagesCount; ++pageNum) {
-        pdfDocument.getPage(pageNum).then(
+        const promise = pdfDocument.getPage(pageNum).then(
           pdfPage => {
             const pageView = this._pages[pageNum - 1];
             if (!pageView.pdfPage) {
@@ -671,6 +672,10 @@
             }
           }
         );
+
+        if (pageNum % PagesCountLimit.PAUSE_EAGER_PAGE_INIT === 0) {
+          await promise;
+        }
       }
     });
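
For reference, the "pausing" pattern applied by the diff can be sketched in
isolation as follows. This is only a rough illustration, not actual viewer
code: the `fetchAllPagesEagerly` function and the `onPage` callback are
hypothetical names introduced here, while `PDFDocumentProxy.numPages` and
`PDFDocumentProxy.getPage` are the existing API and the threshold of 500
matches the value added in the patch.

  // Rough sketch: eagerly request every page, but await the pending
  // `getPage` promise once per PAUSE_EAGER_PAGE_INIT pages, so that the
  // worker-thread periodically gets time to handle other requests
  // (e.g. operatorList parsing).
  const PAUSE_EAGER_PAGE_INIT = 500;

  async function fetchAllPagesEagerly(pdfDocument, onPage) {
    const pagesCount = pdfDocument.numPages;

    // The first page is assumed to already be loaded, hence starting at 2.
    for (let pageNum = 2; pageNum <= pagesCount; ++pageNum) {
      // Intentionally *not* awaited unconditionally, since chaining every
      // call would slow things down for all documents.
      const promise = pdfDocument.getPage(pageNum).then(onPage, reason => {
        console.error(`Unable to get page ${pageNum}`, reason);
      });

      if (pageNum % PAUSE_EAGER_PAGE_INIT === 0) {
        await promise; // Briefly "pause", letting the worker-thread catch up.
      }
    }
  }

In the actual patch the same logic lives inside `BaseViewer._setDocument`
and uses `PagesCountLimit.PAUSE_EAGER_PAGE_INIT` rather than a local
constant.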