[api-minor] Remove the `normalizeWhitespace` option in the `PDFPageProxy.{getTextContent, streamTextContent}` methods (issue 14519, PR 14428 follow-up)
With these changes, we'll now *always* replace all whitespace with standard spaces (0x20). This has already been the default behaviour, for many years, in both the viewer and the browser-tests.
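For API consumers, the change boils down to dropping the option from existing calls; a minimal sketch, assuming a loaded `pdfDocument`:

```js
const page = await pdfDocument.getPage(1);
const { items } = await page.getTextContent();
for (const item of items) {
  // All whitespace in `item.str` now consists of standard spaces (0x20).
  console.log(item.str);
}
```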
Either the latest Chromium update, the latest Puppeteer update, or a combination of both is now causing the Windows bot to time out during the browser-tests; please see PR 14392.
- it aims to fix #14502 and bug 1721335;
- Acrobat and PDFium do the same;
- it'll avoid having truncated data when printing;
- change the factor used to compute the font size from the field height: lineHeight = 1.35 * fontSize (see the sketch after this list);
- this is the value used by Acrobat;
- in order to not have strings truncated at the bottom, add a few basic metrics for standard fonts.
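A hedged sketch of the described computation (hypothetical helper name; the factor 1.35 is the lineHeight/fontSize ratio used by Acrobat):

```js
const LINE_FACTOR = 1.35; // lineHeight = 1.35 * fontSize

function computeFontSizeFromHeight(fieldHeight, numberOfLines = 1) {
  // Invert the lineHeight relation so the text fits the field vertically.
  return fieldHeight / numberOfLines / LINE_FACTOR;
}
```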
This `disableCreateObjectURL` option was originally introduced all the way back in PR 4103 (eight years ago), in order to work around `URL.createObjectURL()`-bugs specific to Internet Explorer.
In PR 8081 (five years ago) the `disableCreateObjectURL` option was extended to cover Google Chrome on iOS-devices as well, since that configuration apparently also suffered from `URL.createObjectURL()`-bugs.[1]
At this point in time, I thus think that it makes sense to re-evaluate if we should still keep the `disableCreateObjectURL` option.
- For Internet Explorer, support was explicitly removed in PDF.js version `2.7.570`, which was released one year ago, and all IE-specific compatibility code (and polyfills) have since been removed.
- For Google Chrome on iOS-devices, while we still "support" such configurations, it's *not* the focus of any development and platform-specific bugs are thus often closed as WONTFIX.
Note here that, at this point in time, the `disableCreateObjectURL` option is *only* being used in the viewer, and any browser/platform-specific `URL.createObjectURL()`-bugs will thus not affect the main PDF.js library. Furthermore, given where the `disableCreateObjectURL` option is being used in the viewer, the basic functionality should also remain unaffected by these changes.[2]
Furthermore, it's also possible that the `URL.createObjectURL()`-bugs have been fixed in the *browser* itself since PR 8081 was submitted.[3]
Obviously you could argue that this isn't a lot of code, w.r.t. the number of lines, and you'd be technically correct. However, it does add complexity to a few different viewer components, which thus adds overhead when reading and working with this code.
Finally, assuming the `URL.createObjectURL()`-bugs are still present in Google Chrome on iOS-devices, I think that we should ask ourselves if it's reasonable for the PDF.js project (and its contributors) to keep attempting to support a configuration if the *browser* developers still haven't fixed these kinds of bugs!?
---
[1] According to https://groups.google.com/a/chromium.org/forum/#!topic/chromium-html5/RKQ0ZJIj7c4, which is linked in PR 8081, that bug was mentioned/reported as early as 2014 (eight years ago).
[2] Viewer functionality such as e.g. downloading and printing may be affected.
[3] I don't have access to any iOS-devices to test with.
Either the latest Chromium update, the latest Puppeteer update, or a combination of both is now causing the Windows bot to time out during the browser-tests.
To unblock both the updates and other improvements (i.e. the `structuredClone` polyfill), let's simply disable the problematic configuration for now, since this is a Mozilla project after all.
This allows us to remove the manually implemented `structuredClone` polyfill, thus reducing the maintenance burden for the `LoopbackPort` class; refer to https://github.com/zloirock/core-js#structuredclone
*Please note:* While `structuredClone` support already landed in Firefox 94, Google Chrome only added it in version 98 (currently in Beta). However, given that the `LoopbackPort` will only be used together with *fake workers* in browsers, this shouldn't be too much of a problem.[1]
For Node.js environments, where *fake workers* are unfortunately necessary, using a `legacy/`-build is already required which thus guarantees that the `structuredClone` polyfill is available.
Also, the patch updates core-js to the latest version since that one includes `structuredClone` improvements; please see https://github.com/zloirock/core-js/releases/tag/v3.20.3
---
[1] Given that we only support browsers with proper worker support, if *fake workers* are being used that essentially indicates a configuration problem/error.
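For context, a minimal sketch of what the native `structuredClone` (or the core-js polyfill in `legacy/`-builds) handles, which the manually implemented polyfill previously had to approximate:

```js
// Deep-clone structured data, including typed arrays and Maps, in one call.
const data = { buffer: new Uint8Array([1, 2, 3]), meta: new Map([["k", 1]]) };
const cloned = structuredClone(data);

console.log(cloned.buffer instanceof Uint8Array); // true
console.log(cloned.meta.get("k")); // 1
console.log(cloned.buffer !== data.buffer); // true, i.e. an actual copy
```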
- it aims to fix #14497;
- previously, only rotations with an angle of 0, 90, 180, or 270 degrees were taken into account;
- so generalize to any angle, but keep the fast path for 0, 90, 180, and 270 since they're likely more common than anything else (see the sketch after this list).
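A minimal sketch of the generalization (hypothetical helper name), keeping the fast paths while falling back to a general rotation matrix:

```js
function getRotationMatrix(degrees) {
  switch (((degrees % 360) + 360) % 360) {
    case 0:
      return [1, 0, 0, 1];
    case 90:
      return [0, 1, -1, 0];
    case 180:
      return [-1, 0, 0, -1];
    case 270:
      return [0, -1, 1, 0];
    default: {
      // General case, for arbitrary rotation angles.
      const rad = (degrees * Math.PI) / 180;
      const cos = Math.cos(rad);
      const sin = Math.sin(rad);
      return [cos, sin, -sin, cos];
    }
  }
}
```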
So that Firefox doesn't switch to pixel mode for compat with other
browsers.
This should fix https://github.com/mozilla/pdf.js/issues/14476, in terms
of restoring the previous behavior.
We probably want to change the pixel-based scrolling code to not scroll
so much (the deltaMode stuff normalizes to +/-1 tick for each wheel
event, perhaps the pixel-based value should do the same).
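Something along these lines, perhaps (purely illustrative, not the current viewer code; the pixels-per-tick value is an assumption):

```js
// Convert a wheel event into whole scroll "ticks", regardless of deltaMode.
const PIXELS_PER_TICK = 53; // assumed value, for illustration only

function wheelEventTicks(evt) {
  switch (evt.deltaMode) {
    case WheelEvent.DOM_DELTA_LINE:
    case WheelEvent.DOM_DELTA_PAGE:
      return Math.sign(evt.deltaY); // already normalized to +/-1 per tick
    default: // DOM_DELTA_PIXEL
      return evt.deltaY / PIXELS_PER_TICK;
  }
}
```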
This commit fixes Bug 1743245 ("Grided PDF file lines rendered too thick"), which was introduced by a fix for #12868.
The lineWidth was set to round(1 * this._combinedScaleFactor) when a pixel is drawn as a parallelogram with a height < 1. This fix changes this to floor(1 * this._combinedScaleFactor).
This change produces a visual result comparable to Chrome and Acrobat.
Regarding the last PR: three statements in canvas.js are affected and will change with this commit (`stroke` and `paintChar`).
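A hedged sketch of the adjustment (per the commit message; the actual code in canvas.js is more involved):

```js
// When a pixel is drawn as a parallelogram with height < 1, compute the
// stroke width with floor instead of round, so that sub-pixel lines aren't
// rounded up to a visibly thicker stroke.
function hairlineWidth(combinedScaleFactor) {
  // Previously: Math.round(1 * combinedScaleFactor)
  return Math.floor(1 * combinedScaleFactor);
}
```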
Rename the reference files to follow the naming convention.
Given that the regular expression has already become more complex (after the initial patch adding it), it seems to me that it probably cannot hurt to add a global cache to reduce unnecessary re-parsing.
Obviously the `Glyph`-instances are being cached *per* font, however in most documents multiple fonts are being used and in practice there's very often a fair amount of overlap between the /ToUnicode-data in different fonts[1].
Consider for example loading and rendering the entire `tracemonkey.pdf` document (from the test-suite), which isn't a particularly large document. In that case the `getCharUnicodeCategory` function is being called a total of `601` times, however there are only `106` *unique* unicode-chars being checked.
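A minimal sketch of the idea, with a deliberately simplified pattern (the real regular expression and category flags in the font code are more involved):

```js
const charCategoryCache = new Map();
const diacriticRegExp = /^\p{Mn}$/u; // stand-in for the actual, larger RegExp

function getCharUnicodeCategory(char) {
  // Re-use a previously computed category, to avoid re-running the RegExp.
  const cached = charCategoryCache.get(char);
  if (cached) {
    return cached;
  }
  const category = { isDiacritic: diacriticRegExp.test(char) };
  charCategoryCache.set(char, category);
  return category;
}
```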
*Please note:* In practice I suppose that this won't have a *huge* effect on overall performance, however given the relative simplicity of this patch I figured that it'd not hurt to submit it for review.
---
[1] Consider e.g. how there's usually different fonts used for regular, bold, respectively italic text.
- it aims to fix issue #14307;
- this event was recently added in Firefox, and we can now use it;
- fix a few bugs in aform.js and in annotation_layer.js;
- add some integration tests to test keystroke events (see `AFSpecial_Keystroke`, and the sketch after this list);
- make dispatchEvent in the quickjs sandbox async.
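For context, a hedged sketch of the kind of field script these tests exercise (`AFSpecial_Keystroke` is a standard Acrobat helper; the argument 0 selects the zip-code format):

```js
// Set as the "Keystroke" action of a text field in the test document.
AFSpecial_Keystroke(0);
```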
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1749563;
- use some helper functions to get (u|i)int** values in a buffer: it helps to have clearer code (see the sketch after this list);
- in composite glyphs, the translation values used with a transformation are signed, so read them as int8 instead of uint8;
- add a few TODOs.
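A hedged sketch of such helpers (hypothetical names; the real ones live in the glyph-parsing code), reading big-endian values from a `Uint8Array`:

```js
function getUint16(data, offset) {
  return (data[offset] << 8) | data[offset + 1];
}

function getInt8(data, offset) {
  // Sign-extend a single byte, e.g. for the signed translation values
  // in composite glyphs.
  return (data[offset] << 24) >> 24;
}
```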
Please refer to https://www.pdfa.org/norm-refs/Type1Fonts.pdf#page=15 for the expected format for the /CharStrings entries.
In the referenced PDF document the /CharStrings are missing the expected end-token, which causes us to swallow the start of the next glyph name.
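For reference, an entry is expected to look along these lines (per the specification above; the `-|`/`|-` spellings of `RD`/`ND` also occur), where `ND` is the end-token that's missing in the referenced document:

```
/CharStrings 190 dict dup begin
/A 186 RD ~186 binary bytes~ ND
end
```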