Given that this helper function is only used on the worker-thread, there's no reason to duplicate it in both of the *built* `pdf.js` and `pdf.worker.js` files.
Currently these classes take a bunch of parameters (somewhat randomly ordered), probably because this is very old code that's been extended over the years.
Hence this patch changes the constructors to use parameter-objects instead, which improves consistency and (slightly) reduces the amount of code as well.
*Please note:* Also removes the `msgHandler`-property on these classes, since I cannot find a single call-site that accesses it.
Currently this helper function only has two call-sites, and both of them only pass in `ArrayBuffer` data. Given how it's implemented there's a couple of code-paths that are completely unused (e.g. the "string" one), and in particular the intended fast-paths don't actually work.
This patch re-factors and simplifies the helper function, and it'll no longer accept anything except `ArrayBuffer` data (hence why it's also re-named).
Note that at the time when `arraysToBytes` was added we still supported browsers without TypedArray functionality, and we'd then simulate them using regular Arrays.
When printing the pdf in #12233 in Acrobat, we can see that the combo for country
is empty: it's because the V entry doesn't have to be one of the options.
We're using this helper function when reading data from the [`PDFWorkerStreamReader.read`](a49d1d1615/src/core/worker_stream.js (L90-L98)) and [`PDFWorkerStreamRangeReader.read`](a49d1d1615/src/core/worker_stream.js (L122-L128)) methods, and as can be seen they always return `ArrayBuffer` data. Hence we can simply get the `byteLength` directly, and don't need to use the helper function.
Note that at the time when `arrayByteLength` was added we still supported browsers without TypedArray functionality, and we'd then simulate them using regular Arrays.
*Please note:* I cannot reproduce the problem reported in bug 1811668, regarding the context menu, and in any case it's not clear that that part is even a PDF Viewer bug.
Looking at bug 1811668 I couldn't help but noticing that the textLayer isn't correct, and it's unfortunately once again a problem with the `adjustType1ToUnicode` function. That's intended to help improve text-selection for fonts without a /ToUnicode-entry, and in many cases it does help (the original PR fixed lots of issues) however it's also caused some problems.
In order to improve text-selection in bug 1811668, we'll now properly ignore fonts that have a predefined *named* encoding specified since that's really the intention with PR 14050.
The JBIG2 images in this PDF document are corrupt enough that even Adobe Reader warns about it when opening the file.
*Please note:* I don't really know the JBIG2 image format at all, however from a very brief look at the specification it seems that integers should be 32-bit.
The relevant TrueType font is missing both /ToUnicode *and* /Encoding entires, either of which would have prevented the (current) broken textLayer rendering.
My first idea was that we could use the `post` table in the TrueType font, see https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6post.html, to get the actual glyphNames and amend the fallback ToUnicode-map that way. Unfortunately that didn't work, since the `post` table only contained ".notdef" and "" (i.e. empty string) entries.
Instead we try to use the `name` table in the TrueType font, see https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6name.html, to determine if the platform is Windows and thus fallback to generate a ToUnicode-map from the `WinAnsiEncoding`.
Note how all over the `src/core/annotation.js`-code we're assuming that if an `appearance`-entry exists it's also a Stream. However, we're not actually checking that thoroughly enough which causes issues in some badly generated PDF documents.
*Please note:* The reduced test-case is *not* a perfect reproduction of the original PDF document, since this one fails to open in e.g. Adobe Reader, but I do believe that it captures the most important points here.
For corrupt *and* encrypted PDF documents, it's possible that only some trailer dictionaries actually contain an /Encrypt-entry. Previously we'd could easily miss that, since we generally pick the first not obviously corrupt trailer dictionary, and the solution implemented here is to simply pre-parse all trailer dictionaries to see if there's any /Encrypt-entries.
This fixes a warning reported by CodeQL, and should also make general sense given that we parse the font-data to determine the *actual* `type`/`subtype` rather than trusting the PDF document.
This was deprecated in PR 15758 but it's unfortunately quite difficult to tell if third-party users are depending on this, e.g. to implement custom error reporting, and if so to what extent.
However, thanks to the pre-processor we can limit *most* of this code to GENERIC builds which still seem like a worthwhile change.
These changes reduce the bundle size of the Firefox PDF Viewer by 3.8 kB in total.
This was deprecated in PR 15758 and given that it's quite unlikely that any third-party users are relying on this functionality, since it was only ever added to support telemetry reporting in the Firefox PDF Viewer, it should hopefully be fine to remove this fairly quickly.
These changes reduce the bundle size of the Firefox PDF Viewer by 4.5 kB in total.
When trying to find incomplete objects, i.e. those missing the "endobj"-string at the end, there's unfortunately a number of possible operators that we need to check for. Otherwise we could miss e.g. the "trailer" at the end of a corrupt PDF document, which is why the referenced document didn't work.
Currently we do all searching on the "raw" bytes of the PDF document, for efficiency, however this doesn't really work when we need to check for *multiple* potential command-strings. To keep the complexity manageable we'll instead use regular expressions here, but we can at least avoid creating lots of substrings thanks to the `RegExp.lastIndex` property; which is well supported across browsers according to https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/lastIndex#browser_compatibility
Note that this repeated regular expression usage could perhaps be slightly less efficient than the old code, however this method is only invoked for corrupt PDF documents.
Previously we'd abort all parsing if an Error was encountered, despite the fact that multiple `startXRefQueue`-entries may be available and that continued parsing could thus eventually be able to find usable data.
Note that in the referenced PDF document the `startxref`-operator, at the end of the file, points to a position in the middle of an arbitrary `stream` which is why things break.
*Please note:* I don't really expect that this is will be an observable change, since virtually all PDF documents already order e.g. /MediaBox and /CropBox entries correctly.
By normalizing boundingBoxes already on the worker-thread, we can be sure that even a corrupt document won't cause issues.
Note how we're passing the `view`-getter to the `PartialEvaluator.getTextContent` method, in order to detect textContent which is outside of the page, hence it makes sense to ensure that it's formatted as expected.
Furthermore, by normalizing this once on the worker-tread we should no longer have to worry about a possibly negative width/height in the `PageViewport` constructor.
Finally, the patch also simplifies the `view`-getter a little bit.
The use of `Array.prototype.reduce()` is, in my opinion, hurting overall readability since it's not particularly easy to look at the relevant code and immediately understand what's going on here. Furthermore this code leads to strictly speaking unnecessary allocations and parsing, since we could just track the min/max values directly in the relevant loop instead.
Currently *some* functions in this file have names while others don't, and in a few cases the names are no longer entirely accurate.
For the relevant functions there should really be no need to name them, and if memory serves this was originally done since browsers (many years ago) didn't always handle anonymous functions correctly in stack traces.
Given that this helper function is only used on the worker-thread, there's no reason to duplicate it in both of the `pdf.js` and `pdf.worker.js` files.
Given that this helper function is only used on the worker-thread, there's no reason to duplicate it in both of the `pdf.js` and `pdf.worker.js` files.
Given that these functions are virtually identical, with the latter only adding a BOM, we can combine the two. Furthermore, since both functions were only used on the worker-thread, there's no reason to duplicate this functionality in both of the `pdf.js` and `pdf.worker.js` files.
Having just played around with adding FreeText-annotations and then trying to print, there were `FreeTextAnnotation: OffscreenCanvas is not supported, annotation may not render correctly.` messages printed in the console.
The reason for this is that `FreeTextAnnotation` inherits from `MarkupAnnotation`, however only `WidgetAnnotation` actually defines the `_isOffscreenCanvasSupported` property.
Given that this `assert` is only intended to catch any implementation bugs in our code, and not actually to validate the PDF data directly[1], we can avoid making this function call unconditionally.
---
[1] In those cases, for example a `FormatError` should have been thrown instead.
- For text fields
* when printing, we generate a fake font which contains some widths computed thanks to
an OffscreenCanvas and its method measureText.
In order to avoid to have to layout the glyphs ourselves, we just render all of them
in one call in the showText method in using the system sans-serif/monospace fonts.
* when saving, we continue to create the appearance streams if the fonts contain the char
but when a char is missing, we just set, in the AcroForm dict, the flag /NeedAppearances
to true and remove the appearance stream. This way, we let the different readers handle
the rendering of the strings.
- For FreeText annotations
* when printing, we use the same trick as for text fields.
* there is no need to save an appearance since Acrobat is able to infer one from the
Content entry.