The only reason, as far as I can tell, for parsing the Metadata on the main-thread is how it was originally implemented. When Metadata support was first implemented, it utilized the [`DOMParser`](https://developer.mozilla.org/en-US/docs/Web/API/DOMParser) which isn't available in workers.
Today, with the custom XML-parser being used, that's no longer an issue and it seems reasonable to move the Metadata parsing to the worker-thread[1], since that's where all parsing should happen (for performance reasons).
Based on these changes, we'll be able to reduce the now unnecessary duplication of the XML-parser (and related code) in both of the *built* `pdf.js`/`pdf.worker.js` files.
Finally, this patch changes the `_repair` method to use "Array + join" rather than string concatenation.
---
[1] This needed the previous patch, to enable sending of `Map`s between threads with workers disabled.
* don't set a value in annotationStorage by default:
- having an undefined when the annotation is rendered for saving/printing means nothing has changed so use normal appearance
- aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1681687
* change the way to compute font size when this one is null in DA:
- make fontSize proportional to line height
- in multiline case, take into account the number of lines for text entered to adapt the font size
*As far as I can tell, this has been broken ever since PR 3289 (back in 2013) without anyone noticing.*
For any non-`MissingDataException` errors encountered in `ObjectLoader._walk`, we're simply throwing immediately which thus has the potential to *completely* break rendering of an entire page.
In practice this is obviously only an issue for PDF documents which are in one way or another corrupt, since that's the only way that `XRef.fetch` will throw non-`MissingDataException` errors. To make matters worse these errors are *intermittent*, since they can only occur if the document is still loading when the `ObjectLoader`-code runs (note the early return in `ObjectLoader.load`).
Please note that we cannot simply catch the error and let "normal" parsing continue in `ObjectLoader._walk`, since that could lead to errors elsewhere given that resources "below" the current one (in the graph) might not be checked as intended then.
All-in-all, the only way to make absolutely sure that we won't cause *unexpected* `MissingDataException`s somewhere else in the code-base is to fallback to fetching the *entire* document in this edge-case.
- the parser is base on a class extending XMLParserBase
- it handle xml namespaces:
* each namespace is assocated with a builder
* builder builds nodes belonging to the namespace
* when a node is inserted in the parent namespace compatibility is checked (if required)
- to avoid name collision between xml names and object properties, use Symbol.
Note how, in the `if (this.stateManager.stateStack.length !== 0) {` branch, we're attempting to access the not yet defined variable[1] `args`. If this code-path is ever hit, an Error will be thrown and parsing will thus be aborted immediately (likely leading to e.g. rendering bugs).
Note that I found this purely by accident, since I happened to glance at the LGTM report. However, I've since found that the error is also present during the unit-test[2] and with this patch we're actually testing the *intended* thing here.
As part of fixing this, and to avoid re-introducing a similar bug in the future, we'll now instead always reset `args.length` *before* attempting to read the next operator.
Also, we can use the existing `EvaluatorPreprocessor.savedStatesDepth` getter to simplify the save/restore detection a tiny bit.
---
[1] The ESLint rule `no-use-before-define` would have helped catch this problem, but unfortunately we cannot enable that without quite a bit of refactoring all over the code-base.
[2] The unit-test was updated such that it would fail in the `master`-branch.
- For wrapped scrolling, we unfortunately need to do a fair bit of parsing of the *current* page layout. Compared to e.g. the spread-modes, where we can easily tell how the pages are laid out, with wrapped scrolling we cannot tell without actually checking. In particular documents with varying page sizes require some care, since we need to check all pages on the "row" of the current page are visible and that there aren't any "holes" present. Otherwise, in the general case, there's a risk that we'd skip over pages if we'd simply always advance to the previous/next "row" in wrapped scrolling.
- For horizontal scrolling, this patch simply maintains the current behaviour of advancing *one* page at a time. The reason for this is to prevent inconsistent behaviour for the next and previous cases, since those cannot be handled identically. For the next-case, it'd obviously be simple to advance to the first not completely visible page. However for the previous-case, we'd only be able to go back *one* page since it's not possible to (easily) determine the page layout of non-visible pages (documents with varying page sizes being a particular issue).
- For vertical scrolling, this patch maintains the current behaviour by default. When spread-modes are being used, we'll now attempt to advance to the next *spread*, rather than just the next page, whenever possible. To prevent skipping over a page, this two-page advance will only apply when both pages of the current spread are visible (to avoid breaking documents with varying page sizes) and when the second page in the current spread is fully visible *horizontally* (to handle larger zoom values).
In order to reduce the performance impact of these changes, note that the previous/next-functionality will only call `getVisibleElements` for the scroll/spread-modes where that's necessary and that "normal" vertical scrolling is thus unaffected by these changes.
To support these changes, the `getVisibleElements` helper function will now also include the `widthPercent` in addition to the existing `percent` property.
The `PDFViewer._updateHelper` method is changed slightly w.r.t. updating the `currentPageNumber` for the non-vertical/spread modes, i.e. won't affect "normal" vertical scrolling, since that helped simplify the overall calculation of the page advance.
Finally, these new `BaseViewer` methods also allow (some) simplification of previous/next-page functionality in various viewer components.
*Please note:* There's one thing that this patch does not attempt to change, namely disabling of the previous/next toolbarButtons respectively the firstPage/lastPage secondaryToolbarButtons. The reason for this is that doing so would add quite a bit of complexity in general, and if for some reason `BaseViewer._getPageAdvance` would get things wrong we could end up incorrectly disabling the buttons. Hence it seemed overall safer to *not* touch this, and accept that the buttons won't be `disabled` despite in some edge-cases no further scrolling being possible.
* Add a parser to get font data from the default appearance
- pdfium & poppler use a special parser too to get these info.
* Update src/core/default_appearance.js
Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>
Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>
- Remove a *duplicated* reference test, see "issue12810", from the manifest.
- Improve the spelling in a couple of comments in `src/core/canvas.js`, most notable of the word "parallelogram".
- Update a comment, also in `src/core/canvas.js`, to actually agree with the value used to reduce confusion when reading the code.
While PR 12725 fixed bug 1671312 as reported, i.e. the "In the upper right corner "Purposes' has bad kerning."-part, it however broke other parts of the text rendering.
Note in particular the tables, e.g. on page 2 and beyond, where the glyphs are now rendered too close together. The reason for this is that the fonts in question are non-embedded ArialNarrow, which we just replace with Helvetica which obviously is not narrow. Given that the font replacement isn't a perfect fit for non-embedded ArialNarrow, we still need to re-measure the glyph widths in this case.
This patch is a rebased *and* refactored version of PR 9448, such that it applies cleanly given that `PDFFindController` has changed since that PR was opened; obviously keeping the original author information intact.
This patch will thus ensure that e.g. fractions, and other things that we normalize before searching, will still be highlighted correctly in the textLayer.
Furthermore, this patch also adds basic unit-tests for this functionality.
*Note:* The `[api-minor]` tag is added, since third-party implementations of the `PDFFindController` must now always use the `pageMatchesLength` property to get accurate length information (see the `web/text_layer_builder.js` changes).
Co-authored-by: Ross Johnson <ross@mazira.com>
Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>
* add a comment to explain how minimal linewidth is computed.
* when context.linewidth < 1 after transform, firefox and chrome
don't render in the same way (issue #12810).
* set lineWidth to 1 after transform and before stroking
- aims fix issue #12295
- a pixel can be transformed into a rectangle with both heights < 1.
A single rescale leads to a rectangle with dim equals to 1 and
the other to something greater than 1.
* change the way to render rectangle with null dimensions:
- right now we rely on the lineWidth set before "re" but
it can be set after "re" and before "S" and in this case the rendering
will be wrong.
- render such rectangles as a single line.
Note that these changes were done automatically, using `gulp lint --fix`.
With this rule, we'll thus enforce a *consistent* formatting of zero-lengths in our CSS files.
Please find additional details about the Stylelint rule at https://stylelint.io/user-guide/rules/length-zero-no-unit
There's built-in ESLint rule, see `sort-imports`, to ensure that all `import`-statements are sorted alphabetically, since that often helps with readability.
Unfortunately there's no corresponding rule to sort `export`-statements alphabetically, however there's an ESLint plugin which does this; please see https://www.npmjs.com/package/eslint-plugin-sort-exports
The only downside here is that it's not automatically fixable, but the re-ordering is a one-time "cost" and the plugin will help maintain a *consistent* ordering of `export`-statements in the future.
*Note:* To reduce the possibility of introducing any errors here, the re-ordering was done by simply selecting the relevant lines and then using the built-in sort-functionality of my editor.
Given that the PDF document in the issue contains the same very large JPEG image *three* times, this patch includes a test-case where only the first page has been extracted from it.
Given that the API will now, after PR 12039, automatically pick the correct factories to use depending on the environment (browser vs. Node.js), we can utilize that in the unit-tests as well. This way we don't have to manually repeat the same initialization code in *multiple* unit-tests.
*Note:* The *official* PDF.js API is defined in `src/pdf.js`, hence the new exports in `src/display/api.js` will not affect that.
Also, updates the unit-test `FileReaderFactory` helpers similarily.
*Drive-by change:* Fix the `CMapReaderFactory` usage in the annotation unit-tests, since the cache should only contain raw data and not a Promise. While this obviously works as-is, having unit-tests that "abuse" the intended data format can easily lead to unnecessary failures if changes are made to the relevant `src/core/` code.
Currently any errors thrown in `preEvaluateFont`, which is a *synchronous* method, will not be handled at all in the `loadFont` method and we were thus failing to return an `ErrorFont`-instance as intended here.
Also, add an *explicit* check in `PartialEvaluator.preEvaluateFont` to ensure that Type0-fonts always have a *valid* dictionary.
This follows the same principle as the `once` option that exists in the native `addEventListener` method, and will thus automatically remove an `EventBus` listener when it's invoked; see https://developer.mozilla.org/en-US/docs/Web/API/EventTarget/addEventListener#Parameters
Finally, this patch also tweaks some the existing `EventBus`-code to use modern features such as optional chaining and logical assignment operators.
Similar to other markers that we currently skip, by ignoring unsupported Coding style default (COD) options we'll at least render *something* here (although some JPEG 2000 images may look slightly wrong).
Note that if the unsupported COD options lead to additional errors, during parsing, we'll still abort parsing of the JPEG 2000 image.
It seems that the timeout is way too short in practice, since this new integration-test failed *intermittently* already in PR 12702 (which is where the test was added).
The ideal solution here would be to simply await an event, dispatched by the viewer, however that unfortunately doesn't appear to be supported by Puppeteer.
Instead, the solution implemented here is to add a new method in `PDFViewerApplication` which Puppeteer can query to check if the scripting/sandbox has been fully initialized.
* the goal is to execute actions like Open or OpenAction
* can be tested with issue6106.pdf (auto-print)
* once #12701 is merged, we can add page actions
Similar to other markers that we currently skip, by ignoring the Coding style component (COC) marker we'll at least prevent outright errors (although some JPEG 2000 images may look slightly wrong).
* move set/clear|Timeout/Interval and crackURL code in pdf.js
* remove the "backdoor" in the proxy (used to dispatch event) and so return the dispatch function in the initializer
* remove listeners if an error occured during sandbox initialization
* add support for alert and prompt in the sandbox
* add a function to eval in the global scope
It appears that the PDF document in [bug 1292316](https://bugzilla.mozilla.org/show_bug.cgi?id=1292316) now renders "correctly"[1] when compared to e.g. Adobe Reader and PDFium. Most likely this bug was fixed by a *somewhat* recent patch, or patches, to the `XRef.indexObjects` method.
Before just closing [bug 1292316](https://bugzilla.mozilla.org/show_bug.cgi?id=1292316) as WFM, I figured that it probably can't hurt to add it as a new test-case to avoid accidentally regressing this document in the future.
---
[1] Given that the XRef table is corrupt, and that we're forced to recover, there's generally speaking probably some question as to what actually constitutes "correct" in this case.
There doesn't seem to be anything definitive about this in
the spec, but from experimenting, it seems acrobat lets
PDFs override the widths of the standard fonts.
This simplifies not just this code, but the unit-tests as well, and should be sufficient as far as I can tell.
Note also that currently, in the *built* `pdf.sandbox.js` file, there's even a line reading `testMode = testMode && false;` because of an accidentally flipped pre-processor statement.
Finally, in the `scripting_spec.js` unit-test, defines `sandboxBundleSrc` at the top of the file to make it easier to find and/or change it when necessary.
This seems like a very minor issue, since in general we can't really help if domains are blocked from certain networks, however in this particular case I suppose that using the Internet Archive should work.