- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1716838.
- some fonts in the pdf in the bug where bold when they shouldn't so write the font properties in the html to avoid to use some wrong inherited ones.
- partial fix for https://bugzilla.mozilla.org/show_bug.cgi?id=1716980;
- some pdf can contain an invalid font family (e.g. 'Windings 3') so in this case remove the space;
- the font family in typeface attribute doesn't always match the one defined in the FontDescriptor dictionary.
- PR #13554 is buggy, so this patch aims to fix bugs.
- check if a component fits into its parent in taking into account the parent layout.
- introduce method isSplittable for template nodes to know if a component can be splitted in case of overflow.
- some containers doesn't always have their 2 dimensions and those dimensions re based on contents;
- so in order to measure text, we must get the glyph widths (for the xfa fonts) before starting the layout;
- implement a word-wrap algorithm;
- handle font change during text layout.
*This implementation is basically a copy of the pre-existing `builtInCMapCache` implementation.*
For some, badly generated, PDF documents it's possible that we'll end up having to fetch the *same* standard font data over and over (which is obviously inefficient).
While not common, it's certainly possible that a PDF document uses *custom* font names where the actual font then references one of the standard fonts; see e.g. issue 11399 for one such example.
Note that I did suggest adding worker-thread caching of standard font data in PR 12726, however it wasn't deemed necessary at the time. Now that we have a real-world example that benefit from caching, I think that we should simply implement this now.
- Some js files contain scale factors for each glyph in order to rescale Liberation to have a final font with the correct width.
- A lot of XFA have some containers where their dimensions are based on their text content, so using default font from browser can lead to an almost unreadable pdf.
- some elements weren't displayed because their rotation angle was not taken into account;
- fix box model (XFA concept):
- remove use of outline;
- position correctly border which isn't part of box dimensions;
- fix margins issues (see issue #13474).
- move border on button instead of having it on wrapping div;
- attribute 'use' was already implemented but not usehref
- in general, usehref should make reference to current document
- add support for SOM expressions in use and usehref to search a node.
- get prototype for all nodes if any.
Given that the same `PartialEvaluator`-instance is used for a lot of these unit-tests, manually changing the options in any one test-case could lead to intermittently failing unit-tests since they're run in a random order.
To fix this, we simply have to use the existing method to clone the `PartialEvaluator`-instance but with the custom options.
- the only goal of this patch is to be able to get synchronously the fake html when printing from firefox:
- in order to print we need to inject some html in beforeprint callback but we cannot block in waiting for all the pages.
- from a memory point of view: it doesn't change anything since the fake HTML is deleted in the worker;
- this way we don't break any assumptions.
- I thought it was possible to rely on browser layout engine to handle layout stuff but it isn't possible
- mainly because when a contentArea overflows, we must continue to layout in the next contentArea
- when no more contentArea is available then we must go to the next page...
- we must handle breakBefore and breakAfter which allows to "break" the layout to go to the next container
- Sometimes some containers don't provide their dimensions so we must compute them in order to know where to put
them in their parents but to compute those dimensions we need to layout the container itself...
- See top of file layout.js for more explanations about layout.
- fix few bugs in other places I met during my work on layout.
According to the specification, see https://web.archive.org/web/20210404042322if_/https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G6.2384179, the keys of a NameTree/NumberTree should be ordered.
For corrupt PDF files, which violate this assumption, it's thus possible that trying to lookup a single entry fails.
Previously, in PR 10274, we implemented a fallback that only applies to the "bottom" node of a NameTree/NumberTree, which in general might not actually help for sufficiently corrupt NameTree/NumberTree data.
Instead we remove the current *limited* fallback from `NameOrNumberTree.get`, and defer to the call-site to handle this case explicitly e.g. by using `NameOrNumberTree.getAll` for data where that makes sense. For well-formed documents, these changes should *not* lead to any additional data fetching/parsing.
Finally, as part of these changes, the validation of named destination data is improved in the `Catalog` and a new unit-test is also added.
To get the maximum benefit from something like Prettier, you obviously don't want to disable the automatic formatting unless absolutely necessary. When we added Prettier there were a number of cases, mostly involving larger Arrays, which required disabling of the automatic formatting for overall readability and/or to not break inline comments.
With changes in Prettier version `2.3.0`, see [the release notes](https://prettier.io/blog/2021/05/09/2.3.0.html#concise-formatting-of-number-only-arrays-10106httpsgithubcomprettierprettierpull10106-10160httpsgithubcomprettierprettierpull10160-by-thorn0httpsgithubcomthorn0), there's now better formatting support for Arrays containing only numbers. Hence we can now remove a number of `// prettier-ignore` comments, and thus get the benefit of automatic formatting in (slightly) more of the code-base.
This patch was tested using the PDF file from issue 2618, i.e. https://bug570667.bugzilla-attachments.gnome.org/attachment.cgi?id=226471, with the following manifest file:
```
[
{ "id": "issue2618",
"file": "../web/pdfs/issue2618.pdf",
"md5": "",
"rounds": 50,
"type": "eq"
}
]
```
which gave the following results when comparing this patch against the `master` branch:
```
-- Grouped By browser, stat --
browser | stat | Count | Baseline(ms) | Current(ms) | +/- | % | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | --- | ---- | -------------
firefox | Overall | 50 | 3417 | 3426 | 9 | 0.27 |
firefox | Page Request | 50 | 1 | 1 | 0 | 5.41 |
firefox | Rendering | 50 | 3416 | 3426 | 9 | 0.27 |
```
Based on these results, there's no significant performance regression from using standard classes and this patch should thus be OK.
- `FontFlags`, is used in both `src/core/fonts.js` and `src/core/evaluator.js`.
- `getFontType`, same as the above.
- `MacStandardGlyphOrdering`, is a fairly large data-structure and `src/core/fonts.js` is already a *very* large file.
- `recoverGlyphName`, a dependency of `type1FontGlyphMapping`; please see below.
- `SEAC_ANALYSIS_ENABLED`, is used by both `Type1Font`, `CFFFont`, and unit-tests; please see below.
- `type1FontGlyphMapping`, is used by both `Type1Font` and `CFFFont` which a later patch will move to their own files.
- Improve chunking in order to fix some bugs where the spaces aren't here:
* track the last position where a glyph has been drawn;
* when a new glyph (first glyph in a chunk) is added then compare its position with the last saved one and add a space or break:
- there are multiple ways to move the glyphs and to avoid to have to deal with all the different possibilities it's a way easier to just compare positions;
- and so there is now one function (i.e. "compareWithLastPosition") where all the job is done.
- Add some breaks in order to get lines;
- Remove the multiple whites spaces:
* some spaces were filled with several whites spaces and so it makes harder to find some sequences of words using the search tool;
* other pdf readers replace spaces by one white space.
Update src/core/evaluator.js
Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>
Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>
First of all, while it should be very unlikely that the /ID-entry is an *indirect* object, note how we're using `Dict.get` when parsing it e.g. in `PDFDocument.fingerprint`. Hence we definitely should be consistent here, since if the /ID-entry is an *indirect* object the existing code in `src/core/writer.js` would already fail.
Secondly, to fix the referenced issue, we also need to check that the /ID-entry actually is an Array before attempting to access its contents in `src/core/writer.js`.
*Drive-by change:* In the `xrefInfo` object passed to the `incrementalUpdate` function, re-name the `encrypt` property to `encryptRef` since its data is fetched using `Dict.getRaw` (given the names of the other properties fetched similarly).
- Use `PDFManager.ensureDoc`, rather than `PDFManager.ensure`, in a couple of spots in the code. If there exists a short-hand format, we should obviously use it whenever possible.
- Fix a unit-test helper, to account for the previous changes. (Also, converts a function to be `async` instead.)
- Add one more exists-check in `PDFDocument.loadXfaFonts`, which I missed to suggest in PR 13146, to prevent any possible errors if the method is ever called in a situation where it shouldn't be.
Also, print a warning if the actual font-loading fails since that could help future debugging. (Finally, reduce overall indentation in the loop.)
- Slightly unrelated, but make a small tweak of a comment in `src/core/fonts.js` to reduce possible confusion.
- Different fonts can be used in xfa and some of them are embedded in the pdf.
- Load all the fonts in window.document.
Update src/core/document.js
Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>
Update src/core/worker.js
Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>
There is no asynchronous code involved here, so we can get rid of all
done callbacks here and simply use the fact that if the function call
ends without failed assertion that the test passed.
There is no asynchronous code involved here, so we can get rid of all
done callbacks here and simply use the fact that if the function call
ends without failed assertion that the test passed.
With the current implementation of `PDFDocument.hasJSActions`, in the worker-thread, we're not actually handling not-yet-loaded data correctly. This can thus fail in *two* different ways:
- The `PDFDocument.fieldObjects` getter (and its helper method), while it may *return* a Promise, still fetches all of its data synchronously and it can thus throw a `MissingDataException` during parsing.
- The `Catalog.jsActions` getter, which is completely synchronous, can obviously throw a `MissingDataException` during parsing.
If either of these cases occur currently, the `PDFDocumentProxy.hasJSActions` method in the API can either return a *rejected* Promise (which it never should) or possibly "hang" and never resolve.
*Please note:* While I've not *yet* seen this error in an actual PDF document, it can happen during loading if you're unlucky enough with e.g. the structure of the PDF document and/or the download speed offered by the server.
This patch is thus based on code-inspection *and* on manually throwing a `MissingDataException` on the first access of `Catalog.jsActions` to simulate this situation.
Finally, this patch adds a couple of *API* unit-tests for this (since none existed).
This is first of all consistent with existing API-methods, where we return `null` when the data in question doesn't exist. Secondly, it should also be (slightly) more efficient since there's less dummy-data that we need to transfer between threads.
Finally, this prevents us from adding an empty/unnecessary span to *every* single page even in documents without any structure tree data.
The done callbacks are an outdated mechanism to signal Jasmine that a
unit test is done, mostly in cases where a unit test needed to wait for
an asynchronous operation to complete before doing its assertions.
Nowadays a much better mechanism is in place for that, namely simply
passing an asynchronous function to Jasmine, so we don't need callbacks
anymore (which require more code and may be more difficult to reason
about).
In these particular cases though the done callbacks never had any real
use since nothing asynchronous happens in these places. Synchronous
functions don't need to use done callbacks since Jasmine simply knows
it's done when the function reaches its normal end, so we can safely get
rid of these callbacks. The telltale sign is if the done callback is
used unconditionally at the end of the function.
This is all done in an effort to over time get rid of all callbacks in
the unit test code.
- but don't validate them for now;
- Firefox will display a bar to warn that the signature validation is not supported (see https://bugzilla.mozilla.org/show_bug.cgi?id=854315)
- almost all (all ?) pdf readers display signatures;
- validation is done in edge but for now it's behind a pref.
When a PDF is "marked" we now generate a separate DOM that represents
the structure tree from the PDF. This DOM is inserted into the <canvas>
element and allows screen readers to walk the tree and have more
information about headings, images, links, etc. To link the structure
tree DOM (which is empty) to the text layer aria-owns is used. This
required modifying the text layer creation so that marked items are
now tracked.
Currently the `fontName`-property contains an actual /Name-instance, which is a problem given that its fallback value is an empty string; see ca7f546828/src/core/default_appearance.js (L35)
The reason that this is a problem can be seen in ca7f546828/src/core/primitives.js (L30-L34), since an empty string short-circuits the cache. Essentially, in PDF documents, a /Name-instance cannot be empty and the way that the `DefaultAppearanceEvaluator` does things is unfortunately not entirely correct.
Hence the `fontName`-property is changed to instead contain a string, rather than a /Name-instance, which simplifies the code overall.
*Please note:* I'm tagging this patch with "[api-minor]", since PR 12831 is included in the current pre-release (although we're not using the `fontName`-property in the display-layer).
Currently only URL-strings are officially supported by `getDocument`, however at this point in time I cannot really see any compelling reason to not support `URL`-objects as well.
Most likely the reason that we've don't already support `URL`-objects, in `getDocument`, is that historically `URL` wasn't fully implemented across browsers and our old polyfill wasn't perfect; see https://developer.mozilla.org/en-US/docs/Web/API/URL/URL#browser_compatibility
*Please note:* Because of how the `url` parameter is currently handled, there's actually *some* cases where passing a `URL`-object to `getDocument` already works. That, in my opinion, provides additional motivation for supporting `URL`-objects officially, since it makes the API more consistent.
The following is an attempt to summarize the *current* situation, based on the actual code rather than the JSDocs:
- `getDocument("url string")` works and is documented.[1]
- `getDocument({ url: "url string", })` works and is documented.[1]
- `getDocument(new URL(...))` throws immediately, since no supported parameters are found.
- `getDocument({ url: new URL(...), })` actually works even though it's not documented.[1] Originally, when data was fetched on the worker-thread, this would likely have thrown since `URL` isn't clonable.[2]
- `getDocument({ url: { abc: 123, }, })`, or some similarily meaningless input, will be "accepted" by `getDocument` and then throw a `MissingPDFException` when attempting to fetch the bogus data.
With the changes in this patch, not only is `URL`-objects now officially supported and documented when calling `getDocument`, but we'll also do a much better job at actually validating any URL-data passed to `getDocument` (and instead fail early).
---
[1] In *browsers*, we create a valid URL thus indirectly validating the input. In Node.js environments, on the other hand, no validation is done since obtaining a baseUrl is more difficult (and PDF.js is primarily written for browsers anyway).
[2] https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Structured_clone_algorithm#supported_types
- implement few positioning properties: position, width, height, anchor;
- implement font element;
- implement fill element (used by font) and its children (linear, radial, ...);
- font property is inherited from ancestor container (see https://www.pdfa.org/wp-content/uploads/2020/07/XFA-3_3.pdf#page=43) so let CSS handles that stuff;
- in order to reduce the number of properties to set, only set non default properties and put the default in CSS;
- set a background to some containers to be able to see them (will be removed in a future commit).
- add an option to enable XFA rendering if any;
- for now, let the canvas layer: it could be useful to implement XFAF forms (embedded pdf in xml stream for the background and xfa form for the foreground);
- ui elements in template DOM are pretty close to their html counterpart so we generate a fake html DOM from template one:
- it makes easier to translate template properties to html ones;
- it makes faster the creation of the html element in the main thread.
It seems reasonable to place this alongside the *similar* `getFilenameFromUrl` helper function. This way, with the changes in the next patch, we also avoid having to expose the `isDataScheme` function in the API itself and we instead expose `getPdfFilenameFromUrl` in the API (which feels overall more appropriate).
A significant portion of the code-base has now been converted to use `let`/`const`, rather than `var`, hence it should be possible to simply enable the ESLint `no-var` rule globally.
This way we can ensure that new code won't accidentally use `var`, and it also removes the need to manually enable the rule in various folders.
Obviously it makes sense to continue the efforts to replace `var`, but that should probably happen on a file and/or folder basis.
Please note that this patch excludes the following code:
- The `extensions/` folder, since that seemed easiest for now (and I don't know exactly what the support situation is for the Chromium-extension).
- The entire `external/` folder is ignored, since most of it's currently excluded from linting.
For the code that isn't imported from elsewhere (and should be ignored), we should probably (at some point) bring the code up to the same linting/formatting standard as the rest of the code-base.
- Various files in the `test/` folder are ignored, as necessary, since the way that a lot of this code is loaded will require some care (or perhaps larger re-factoring) when removing `var` usage.
- strokeColor corresponds to borderColor;
- support fillColor and textColor;
- support colors on the different annotations;
- fix typo in aforms (+test).
Rather than converting the `AnnotationStorage`-data to an Object, before sending it to the worker-thread, we should be able to simply send the internal `Map` directly.
The "structured clone algorithm" doesn't have a problem with `Map`s, however the `LoopbackPort` used when workers are *disabled* (e.g. in Node.js environments) didn't use to support them. With PR 12997 having lifted that restriction, we should now be able to simply send the `AnnotationStorage`-data as-is rather than having to iterate through it to first create an Object.
*Please note:* The changes in `src/core/annotation.js` could have been a lot more compact if we were able to use optional chaining in the `src/core` folder. Unfortunately that's still not possible, since SystemJS is being used in the development viewer (i.g. `gulp server`) and fixing that is *still* blocked by [bug 1247687](https://bugzilla.mozilla.org/show_bug.cgi?id=1247687).
With the previous patch this function is now *only* accessed on the worker-thread, hence it's no longer necessary to include it in the *built* `pdf.js` file.
With the previous patch this functionality is now *only* accessed on the worker-thread, hence it's no longer necessary to include it in the *built* `pdf.js` file.
The only reason, as far as I can tell, for parsing the Metadata on the main-thread is how it was originally implemented. When Metadata support was first implemented, it utilized the [`DOMParser`](https://developer.mozilla.org/en-US/docs/Web/API/DOMParser) which isn't available in workers.
Today, with the custom XML-parser being used, that's no longer an issue and it seems reasonable to move the Metadata parsing to the worker-thread[1], since that's where all parsing should happen (for performance reasons).
Based on these changes, we'll be able to reduce the now unnecessary duplication of the XML-parser (and related code) in both of the *built* `pdf.js`/`pdf.worker.js` files.
Finally, this patch changes the `_repair` method to use "Array + join" rather than string concatenation.
---
[1] This needed the previous patch, to enable sending of `Map`s between threads with workers disabled.
* don't set a value in annotationStorage by default:
- having an undefined when the annotation is rendered for saving/printing means nothing has changed so use normal appearance
- aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1681687
* change the way to compute font size when this one is null in DA:
- make fontSize proportional to line height
- in multiline case, take into account the number of lines for text entered to adapt the font size
- the parser is base on a class extending XMLParserBase
- it handle xml namespaces:
* each namespace is assocated with a builder
* builder builds nodes belonging to the namespace
* when a node is inserted in the parent namespace compatibility is checked (if required)
- to avoid name collision between xml names and object properties, use Symbol.
Note how, in the `if (this.stateManager.stateStack.length !== 0) {` branch, we're attempting to access the not yet defined variable[1] `args`. If this code-path is ever hit, an Error will be thrown and parsing will thus be aborted immediately (likely leading to e.g. rendering bugs).
Note that I found this purely by accident, since I happened to glance at the LGTM report. However, I've since found that the error is also present during the unit-test[2] and with this patch we're actually testing the *intended* thing here.
As part of fixing this, and to avoid re-introducing a similar bug in the future, we'll now instead always reset `args.length` *before* attempting to read the next operator.
Also, we can use the existing `EvaluatorPreprocessor.savedStatesDepth` getter to simplify the save/restore detection a tiny bit.
---
[1] The ESLint rule `no-use-before-define` would have helped catch this problem, but unfortunately we cannot enable that without quite a bit of refactoring all over the code-base.
[2] The unit-test was updated such that it would fail in the `master`-branch.