Trying to shadow a non-existent property is always an implementation mistake, since it leads to the `shadow`-call not having any effect.
In PR 14152 I overlooked the fact that it's fairly easy to enforce this during development/testing, since that can help catch e.g. simple spelling bugs.
Currently the `Catalog.metadata` getter only handles errors during parsing, however in a *corrupt* PDF document fetching of the raw /Metadata can obviously fail as well.
Without this patch the `PDFDocumentProxy.getMetadata` method, in the API, can thus fail which it *never* should and this will cause the viewer to not initialize all state as expected.
Fixes one of the documents in issue 14305.
*Please note:* This is similar to the method that existed prior to PR 3848, but the new method will *only* be used as a fallback when parsing of corrupt PDF documents.
The implementation in PR 14311 unfortunately turned out to be *way* too simplistic, as evident by the recently added test-files in issue 14303, since it may *cause* infinite loops in `PDFDocument.checkLastPage` for some corrupt PDF documents.[1]
To avoid this, the easiest solution that I could come up with was to fallback to eagerly parsing the *entire* /Pages-tree when the /Count-entry validation fails during document initialization.
Fixes *at least* two of the issues listed in issue 14303, namely the `poppler-395-0.pdf...` and `GHOSTSCRIPT-698804-1.pdf...` documents.
---
[1] The whole point of PR 14311 was obviously to *get rid of* infinte loops during document initialization, not to introduce any more of those.
This was added in PR 14311, but given that I completely missed to update the `PDFDocument.getPage` signature accordingly it's completely unused.
Given that things work just as fine as-is, let's simply remove that optional parameter for now; sorry about the churn here!
This only applies to severely corrupt documents, where it's possible that the `Parser` throws when we try to access e.g. a /Kids-entry in the /Pages-tree.
Fixes two of the issues listed in issue 14303, namely the `poppler-742-0.pdf...` and `poppler-937-0.pdf...` documents.
*Please note:* While this patch on its own is sufficient to prevent the worker-thread from hanging, however in combination with PR 14311 these PDF documents will both load *and* render correctly.
Rather than focusing on the particular structure of these PDF documents, it seemed (at least to me) to make sense to try and prevent all circular references when fetching/looking-up data using the XRef table.
To avoid a solution that required tracking the references manually everywhere, the implementation settled on here instead handles that internally in the `XRef.fetch`-method. This should work, since that method *and* the `Parser`/`Lexer`-implementations are completely synchronous.
Note also that the existing `XRef`-caching, used for all data-types *except* Streams, should hopefully help to lessen the performance impact of these changes.
One *potential* problem with these changes could be certain *browser* exceptions, since those are generally not catchable in JavaScript code, however those would most likely "stop" worker-thread parsing anyway (at least I hope so).
Finally, note that I settled on returning dummy-data rather than throwing an exception. This was done to allow parsing, for the rest of the document, to continue such that *one* bad reference doesn't prevent an entire document from loading.
Fixes two of the issues listed in issue 14303, namely the `poppler-91414-0.zip-2.gz-53.pdf` and `poppler-91414-0.zip-2.gz-54.pdf` documents.
*This patch basically extends the approach from PR 10392, by also checking the last page.*
Currently, in e.g. the `Catalog.numPages`-getter, we're simply assuming that if the /Pages-tree has an *integer* /Count entry it must also be correct/valid.
As can be seen in the referenced PDF documents, that entry may be completely bogus which causes general parsing to breaking down elsewhere in the worker-thread (and hanging the browser).
Rather than hoping that the /Count entry is correct, similar to all other data found in PDF documents, we obviously need to validate it. This turns out to be a little less straightforward than one would like, since the only way to do this (as far as I know) is to parse the *entire* /Pages-tree and essentially counting the pages.
To avoid doing that for all documents, this patch tries to take a short-cut by checking if the last page (based on the /Count entry) can be successfully fetched. If so, we assume that the /Count entry is correct and use it as-is, otherwise we'll iterate through (potentially) the *entire* /Pages-tree to determine the number of pages.
Unfortunately these changes will have a number of *somewhat* negative side-effects, please see a possibly incomplete list below, however I cannot see a better way to address this bug.
- This will slow down initial loading/rendering of all documents, at least by some amount, since we now need to fetch/parse more of the /Pages-tree in order to be able to access the *last* page of the PDF documents.
- For poorly generated PDF documents, where the entire /Pages-tree only has *one* level, we'll unfortunately need to fetch/parse the *entire* /Pages-tree to get to the last page. While there's a cache to help reduce repeated data lookups, this will affect initial loading/rendering of *some* long PDF documents,
- This will affect the `disableAutoFetch = true` mode negatively, since we now need to fetch/parse more data during document initialization. While the `disableAutoFetch = true` mode should still be helpful in larger/longer PDF documents, for smaller ones the effect/usefulness may unfortunately be lost.
As one *small* additional bonus, we should now also be able to support opening PDF documents where the /Pages-tree /Count entry is completely invalid (e.g. contains a non-integer value).
Fixes two of the issues listed in issue 14303, namely the `poppler-67295-0.pdf` and `poppler-85140-0.pdf` documents.
Given that not all pages necessarily are being accessed, or that the pages may be accessed out of order, using a `Map` seems like a more appropriate data-structure here.
Furthermore, this patch also adds (currently missing) caching for XFA-documents. Loading a couple of such documents in the viewer, with logging added, shows that we're currently re-creating `Page`-instances unnecessarily for XFA-documents.
For this particular PDF document, we have `/W [1 2 166666666666666666666666666]` which obviously makes no sense.
While this patch makes no attempt at actually validating the entries in the /W-array, we'll now simply abort all processing when the end of the PDF document has been reached (thus preventing hanging the browser).
Please note that this patch doesn't enable the PDF document to be loaded/rendered, but at least it fails "correctly" now.
Fixes one of the issues listed in issue 14303, namely the `REDHAT-1531897-0.pdf`document.
This bug was surprisingly difficult to track down, since it didn't just depend on range-requests being used but also on how quickly the document was loaded. To even be able to reproduce this locally, I had to use a very small `rangeChunkSize`-value (note the unit-test).
The cause of this bug is a bogus entry in the XRef-table, causing us to attempt to request data from *beyond* the actual document size and thus getting into an infinite loop.
Fixes *one* of the issues listed in issue 14303, namely the `PDFBOX-4352-0.pdf` document.
*Please note:* These changes will primarily benefit longer documents, somewhat at the expense of e.g. one-page documents.
The existing `PDFDocumentProxy.getStats` function, which in the default viewer is called for each rendered page, requires a round-trip to the worker-thread in order to obtain the current document stats. In the default viewer, we currently make one such API-call for *every rendered* page.
This patch proposes replacing that method with a *synchronous* `PDFDocumentProxy.stats` getter instead, combined with re-factoring the worker-thread code by adding a `DocStats`-class to track Stream/Font-types and *only send* them to the main-thread *the first time* that a type is encountered.
Note that in practice most PDF documents only use a fairly limited number of Stream/Font-types, which means that in longer documents most of the `PDFDocumentProxy.getStats`-calls will return the same data.[1]
This re-factoring will obviously benefit longer document the most[2], and could actually be seen as a regression for one-page documents, since in practice there'll usually be a couple of "DocStats" messages sent during the parsing of the first page. However, if the user zooms/rotates the document (which causes re-rendering), note that even a one-page document would start to benefit from these changes.
Another benefit of having the data available/cached in the API is that unless the document stats change during parsing, repeated `PDFDocumentProxy.stats`-calls will return *the same identical* object.
This is something that we can easily take advantage of in the default viewer, by now *only* reporting "documentStats" telemetry[3] when the data actually have changed rather than once per rendered page (again beneficial in longer documents).
---
[1] Furthermore, the maximium number of `StreamType`/`FontType` are `10` respectively `12`, which means that regardless of the complexity and page count in a PDF document there'll never be more than twenty-two "DocStats" messages sent; see 41ac3f0c07/src/shared/util.js (L206-L232)
[2] One example is the `pdf.pdf` document in the test-suite, where rendering all of its 1310 pages only result in a total of seven "DocStats" messages being sent from the worker-thread.
[3] Reporting telemetry, in Firefox, includes using `JSON.stringify` on the data and then sending an event to the `PdfStreamConverter.jsm`-code.
In that code the event is handled and `JSON.parse` is used to retrieve the data, and in the "documentStats"-case we'll then iterate through the data to avoid double-reporting telemetry; see https://searchfox.org/mozilla-central/rev/8f4c180b87e52f3345ef8a3432d6e54bd1eb18dc/toolkit/components/pdfjs/content/PdfStreamConverter.jsm#515-549
Given that all modern browsers now support `postMessage` transfers, and have for years, it no longer seems necessary for the PDF.js library to support using Workers unless the `postMessage` transfers functionality is available.
This patch is a follow-up to PR 11123, which made it impossible to *manually* disable `postMessage` transfers for performance reasons (since it increases memory usage), which hasn't caused any bug reports as far as I know.[1]
Hence we'll now only support *proper* Worker implementations, with fully working `postMessage` transfers, and fallback to using "fake" Workers otherwise.
---
[1] At the time of that PR we still "supported" IE, which is why this code was left intact.
There's obviously no guarantee that this will work in general, if the document is sufficiently corrupt, but it should hopefully be better than just throwing `InvalidPDFException` as currently happens.
Please note that, as is often the case with corrupt documents, it's somewhat difficult to know if we're rendering the document "correctly" with this patch[1]. In this case even Adobe Reader cannot open the document, which is always a good sign that it's *really* corrupt, however we're at least able to render *something* with this patch.
---
[1] Whatever "correct" even means when dealing with corrupt PDF documents, where often times different PDF viewers won't agree completely.
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=931481;
- real space chars are pushed in the chunk but when there is an extra spacing, the next char position must be compared with the previous one;
- for example, an extra spacing can cancel a space so visually there are no space.
- First step to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1737260;
- several interactive pdfs use the possibility to hide/show buttons to show different icons;
- render pushbuttons on their own canvas and then insert it the annotation_layer;
- update test/driver.js in order to convert canvases for pushbuttons into images.
Previously, when we created a shading pattern canvas we created it
as the same size as the page. This was good for caching if the same
pattern was used over and over again, but when lots of different
shadings are created that caused us to create many full page
canvases.
Instead of creating the full page canvses, create the canvas
as the same size as the current path bounding box. This reduces memory
consumption by a lot since most paths are pretty small. Also, in real world
PDFs it's rare for a shading (non shading fill) to be reused over and over again.
Bug 1721949 is an example where the same pattern is reused and it will be slightly
slower than before.
In `beginGroup` we create a new canvas that is the size of the
bounding box and we translate it to the offset. This means we don't need to
also apply the bounding box during `paintFormXObjectBegin`.
This improves #6961 quite a bit, but it still is missing the indention
in the ruler.
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1739502;
- when the target area was the current content area, everything was pushed in it instead of creating a new one (and consequently a new pageArea is created).
- the pdf shows an alignment issue on page 4:
- the hAlign is "center" but the subform was the width of its parent, so compute the real width of the subform with tb layout;
- there is an extra empty page at the end of the pdf:
- there is a subform with some hidden elements which are not rendered for now (since there is no plugged JS engine it isn't possible to draw them in changing their visibility).
- so in case a subform is empty and has no real dimensions (at least one is 0), we just consider it as empty.
We were incorrectly using the transform in the pattern before it had been
adjusted causing the pattern to be misplaced relative to the page.
Fixes: ShowText-ShadingPattern.pdf (already in corpus)
Fixes: #8111Fixes: #9243
Subfrom nomin displays even though it's subform is set to <occur max=-1 min=0>
If we look through specs of XFA 3.3 : https://www.pdfa.org/norm-refs/XFA-3_3.pdf
- The min attribute is used when processing a form that contains data. Regardless of the data at least this number of instances is included. It is permissible to set this value to zero, in which case the container is entirely excluded if there is no data for it.
However, in our case it doesn't happen, because we let our empty dataNode get through. Though by setting a clause:
- eliminate unmatched data with occur min=0
we are checking our empty data and sending it to uselessNode array where at the end it gets removed;
Very short strings can narrowly miss the existing Bidi-detection threshold, leading to incorrect text-selection and copying behaviour.
In my testing, neither Adobe Reader or PDFium seem to handle copying "correctly" for this document. Hence it's not entirely clear to me that we actually want to fix this, since tweaking these heuristics can *obviously* cause regressions elsewhere (and our test coverage for RTL-text isn't exactly great).
Starting a new path will wipe out any of the current subpaths in the
current graphics state, so we should reset the min/maxes.
This makes a number of the bounding boxes smaller and reduces the number
of composed pixels. For the smask tests in the corpus, the number of
composed pixesl goes from 19,872,109 to 19,676,905. The difference is much
larger on other PDFs though.