In this particular case, the `CMap` data that we create contains only numbers and no strings, which causes `PartialEvaluator.readToUnicode` to create a ToUnicode map consisting only of empty strings.
*Please note:* This is yet another case where I don't know if it's necessarily the best and most correct solution, but it does fix the referenced issue.
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1722036.
- AS and V should share the same value for checkboxes: at least that's what the specs say;
- the PDF in the above bug opens correctly in Acrobat, so it likely means that AS is chosen over V.
*Please note:* All of this feels very handwavy, but at least it passes all tests locally. Hopefully we have enough tests for this part of the font code.
For non-embedded composite standard fonts with an "incomplete" /CIDToGIDMap, we'll now fall back to an *explicitly defined* /ToUnicode map even when that one happens to be an /Identity-H or /Identity-V map.
The `Font.fallbackToSystemFont` method is unfortunately accumulating more and more special cases, however that might be unavoidable given all the weird non-embedded fonts found in the wild :-(
While some of the output looks worse to my eye, this behavior more
closely matches what I see when I open the PDFs in Adobe Acrobat.
Fixes: #4706, #9713, #8245, #1344
Currently it's possible to accidentally, e.g. by simply copy-and-pasting from an existing test-case, add an unnecessary `"link": true`-entry for locally available PDF files.
This leads to inconsistencies in the manifest file, and doesn't feel like a great developer experience. However, we can easily fix it by having `verifyManifestFiles` fail in this situation, and doing so actually turned up a couple of existing cases.
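A minimal sketch of the kind of check involved (the manifest shape and helper name here are assumptions, not the actual `verifyManifestFiles` code):
```
const fs = require("fs");
const path = require("path");

// Flag manifest entries that mark a locally available PDF file as linked.
function findUnnecessaryLinkEntries(manifest, pdfDir) {
  const errors = [];
  for (const entry of manifest) {
    const localFile = path.join(pdfDir, entry.file);
    if (entry.link === true && fs.existsSync(localFile)) {
      errors.push(`Unnecessary "link": true entry for "${entry.file}".`);
    }
  }
  return errors; // a non-empty array should fail the verification step
}
```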
In the referenced PDF document the /Contents stream contains MarkedContent-operators, however no optional content dictionary exists; according to [the specification](https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G7.3883825):
> Null values or references to deleted objects shall be ignored. If this entry is
> not present, is an empty array, or contains references only to null or deleted
> objects, the membership dictionary shall have no effect on the visibility of
> any content.
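A small sketch of that behaviour (the names are illustrative, not the actual `PartialEvaluator` code): marked content whose optional-content reference can't be resolved simply stays visible.
```
// Decide visibility for a BDC /OC marked-content section.
function isContentVisible(ocRef, ocGroups) {
  if (!ocRef || !ocGroups || !ocGroups.has(ocRef)) {
    // Missing or unresolvable optional-content data "shall have no
    // effect on the visibility of any content", so keep it visible.
    return true;
  }
  return ocGroups.get(ocRef).visible;
}

const groups = new Map([["oc1", { visible: false }]]);
console.log(isContentVisible("oc1", groups)); // false
console.log(isContentVisible("missing", groups)); // true (no usable OC data)
```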
In the PDF document some of the glyphs have bogus `differences`-entries[1] that cannot be resolved to valid glyph names, thus causing the glyph mapping to fail.
My initial idea was to use an approach similar to the one in the `PartialEvaluator._simpleFontToUnicode` method, i.e. to extract the charCodes from those entries; however it turned out that that didn't actually help in this case (the mapping was still wrong).
To fix this I'm thus proposing that we fall back to the /ToUnicode map when no other usable data exists (e.g. no post-table), since it *hopefully* shouldn't make things any worse than leaving parts of the glyph map empty (which currently happens); a sketch follows the footnote below.
---
[1] As can be seen below, some of the entries are completely normal while others are non-standard:
```
Differences (array)
0 = 65
1 = /g5167
2 = /space
3 = /g11927
4 = /g17737
5 = /g11540
6 = /g2180
7 = /K
8 = /P
9 = /two
10 = /zero
11 = /one
12 = /five
13 = /four
14 = /g6932
15 = /g7246
16 = /g1691
17 = /g2343
18 = /g14792
19 = /g3325
20 = /g4280
21 = /g20383
22 = /g18166
23 = /g16988
24 = /g17943
25 = /g19223
26 = /g10830
27 = 97
28 = /g982
29 = /g1226
30 = /g5059
31 = /g2677
32 = /g1042
33 = /g11568
34 = /L
35 = /three
36 = /seven
37 = /g2364
38 = /g12063
39 = /g5356
40 = /g2173
41 = /g17877
42 = /g7273
43 = /g7647
44 = /g7224
45 = /g19327
46 = /g5054
47 = /g2342
48 = /g10136
49 = /g6856
50 = /g13381
51 = /g7257
52 = /g12093
53 = /g2359
```
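A sketch of the proposed fallback, under assumed data shapes (charCode → string in the /ToUnicode map, code point → glyphId in the font's cmap); the real glyph-mapping code is more involved:
```
function fillGlyphMapFromToUnicode(glyphMap, toUnicode, cmap) {
  for (const [charCode, str] of toUnicode) {
    if (glyphMap[charCode] !== undefined || !str) {
      continue; // keep entries that resolved via /Differences et al.
    }
    const glyphId = cmap.get(str.codePointAt(0));
    if (glyphId !== undefined) {
      glyphMap[charCode] = glyphId; // better than leaving the slot empty
    }
  }
  return glyphMap;
}
```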
Despite its name, the fonts in the ItcSymbol family are "regular" fonts and not Symbol ones. However, given that the font name contains the word "Symbol" we ended up picking the wrong code-path in the `Font.fallbackToSystemFont` method.
*Please note:* While this patch ensures that the text becomes readable, by falling back to a standard font, the rendering will obviously not be perfect. However, that's the PDF generator's "fault", since non-embedded fonts cannot be guaranteed to render correctly in all environments.
Currently, in the `PartialEvaluator`, we only support Optional Content in Form-/XObjects. Hence this patch adds support for Image-/XObjects as well, which looks like a simple oversight in PR 12095 since the canvas-implementation already contains the necessary code to support this.
- All the scale factors for the substitution font were wrong because of different glyph positions between Liberation and the other ones:
- regenerate all the factors
- Text may contain Polish characters, for example, and in this case the glyph widths were wrong:
- treat substitution font as a composite one
- add a glyph-index-to-Unicode map for Liberation in order to generate the width array for the CID font
- In order to better compute text field sizes, use line height with no gaps (consequently, guessed heights for text are slightly better in general).
- Fix default background color in fields.
- an Image element was created and attached to its parent, but the $globalData property was not set, and that led to an error.
- the PDF in bug 1722029 has 27 rendered rows (checked in Acrobat) when only one was displayed: this patch fixes some binding issues around the occur element.
This patch makes use of the existing `ignoreErrors` option, thus allowing a page to continue parsing/rendering even if (some of) its sub-streams are corrupt. Obviously this may cause *part* of a page to be broken/missing, however it should be better than (potentially) rendering nothing.
Also, to the best of my knowledge, this is the first bug of its kind that we've encountered.
To avoid having to pass in a bunch of parameters that are mostly unrelated to a `BaseStream`-instance when initializing a `StreamsSequenceStream`-instance, I settled on utilizing a callback function instead to allow conditional Error-suppression.
Note that the `StreamsSequenceStream`-class is a *special* stream-implementation that we only use when the `/Contents`-entry, in the `/Page`-dictionary, consists of an Array with streams.
In cases where even the very *first* attempt at reading from an object will throw, simply ignoring such objects will help improve rendering of *some* corrupt documents.
Note that this will lead to more parsing in some cases, but considering that this only applies to *corrupt* documents that shouldn't be a big deal.
Bug 1721218 has a shading pattern that was used thousands of times.
To improve performance of this PDF:
- add a cache for patterns in the evaluator and only send the IR form once
to the main thread (this also makes caching in canvas easier)
- cache the created canvas radial/axial patterns
- for shading fill radial/axial use the pattern directly instead of creating temporary
canvas
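An illustrative sketch of the caching idea from the first bullet (hypothetical names, not the actual evaluator code): remember which pattern Refs have already been sent, so the IR form crosses the thread boundary only once.
```
class PatternIRCache {
  constructor(sendToMainThread) {
    this._send = sendToMainThread;
    this._ids = new Map(); // pattern Ref -> id of the IR already sent
  }

  getId(patternRef, buildIR) {
    let id = this._ids.get(patternRef);
    if (id === undefined) {
      id = `pattern_${this._ids.size}`;
      this._send(id, buildIR()); // the IR is transferred exactly once
      this._ids.set(patternRef, id);
    }
    return id; // later fill operations just reference the cached id
  }
}

const cache = new PatternIRCache((id, ir) => console.log("sent", id));
cache.getId("42R", () => ["RadialAxial" /* IR data */]);
cache.getId("42R", () => ["RadialAxial"]); // cache hit: nothing is sent
```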
- in real life some XFA documents contain XML like <xfa:data><xfa:Foo><xfa:Bar>...</xfa:data>;
since there are no Foo or Bar elements in the xfa namespace, the JS representations are empty
and that leads to errors.
- so the idea is to make all nodes under xfa:data namespace-agnostic, which means
that namespaces are removed from nodes in the parser, but only for xfa:data descendants.
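A sketch of the parser idea (simplified; the real code tracks the xfa:data subtree while parsing):
```
// Once inside <xfa:data>, drop the namespace prefix from element names so
// e.g. <xfa:Foo> matches the prefix-less "Foo" the binder expects.
function normalizeNodeName(rawName, insideXfaData) {
  if (!insideXfaData) {
    return rawName;
  }
  const colon = rawName.indexOf(":");
  return colon === -1 ? rawName : rawName.substring(colon + 1);
}

console.log(normalizeNodeName("xfa:Foo", true)); // "Foo"
console.log(normalizeNodeName("xfa:template", false)); // "xfa:template"
```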
*Please note:* The PDF document in issue 13751 is *dynamically* created (in e.g. Adobe Reader), with pages added when certain buttons are clicked, hence this patch simply fixes the breaking error and nothing more.
It looks like the current code contains a little bit too much copy-and-paste from the *similar* `index` branch above, since we cannot set the `startIndex` to a negative value. Note how it's being used to initialize the loop variable, which is then used to look up values in an Array, and accessing the `-1`th element of an Array obviously makes no sense.
*This is yet another case where I've got no idea if the patch is correct, but it does at least fix a breaking error :-)*
Note how in the [`Binder._bindValue` method](683ce66a48/src/core/xfa/bind.js (L92-L93)), we're assuming that if a `data`-value exists then it'll also be possible to actually access it. For the `XFAAttribute`-implementation however, the second method is missing and that's what causes the breaking errors in issue 13748.
Please note that another possible way of "fixing" the error would've been to simply change the exists-check to return `false`, and I could see that being a preferred solution.
However, the reason for submitting the current patch is that we get *fewer* warnings about Nodes with mis-matched types this way.
As can be seen in the code (see below), the `searchNode` helper function will return `null` in some cases and all of its call-sites should protect against that before attempting to access the returned data.
While only one of these changes was necessary to fix the breaking errors in issue 13756, in order to prevent future bugs I've added similar defensive code throughout this file.
- 07955fa1d3/src/core/xfa/som.js (L169)
- 07955fa1d3/src/core/xfa/som.js (L239)
- 07955fa1d3/src/core/xfa/som.js (L254)
- font line height is taken into account by Acrobat when it isn't by MasterPDFEditor: I extracted a font from a PDF, modified some ascent/descent properties thanks to ttx and then reinjected the font into the PDF: only Acrobat takes it into account. So in this patch, line heights for some substituted fonts are added.
- it seems that Acrobat is using a line height of 1.2 when the line height in the font is not enough (it's the only way I found to correctly fix bug 1718741).
- don't use flex in the wrapper container (which was causing a horizontal overflow in the above bug).
- consequently, the above fixes introduced a lot of small regressions, so in order to see real improvements on reftests, I fixed the regressions in this patch:
- replace margin by padding in some cases where padding is part of a container's dimensions;
- remove some flex displays: some containers are wrongly sized when rendered;
- set letter-spacing to 0.01px: it helps ensure that text is not broken in Firefox because of insufficient width.
Previously, when we filled image masks we didn't copy over the current transformation,
this caused patterns to be misaligned when painted. Now we create a temporary
canvas with the mask and have the transform copied over and offset it relative to
where the mask would be painted. We also weren't properly offsetting tiling patterns.
This isn't usually noticeable since patterns repeat, but in the case of #13561 the pattern
is only drawn once and has to be in the correct position to line up with the mask image.
These fixes broke #11473, but highlighted that we were drawing that correctly by
accident and not correctly handling negative bounding boxes on tiling patterns.
Fixes #6297, #13561, #13441
Partially fixes #1344 (still blurry, but boxes are in the correct position now)
- Fix a typo in order to open the PDF in issue #13679
- After fixing the default fill color there were some regressions because of z-index,
and when fixing z-index there were some regressions because of borders.
- So fix the border rendering.
- it aims to fix #13583;
- fix the switch to breakBefore target;
- force the layout of an unsplittable element on an empty page;
- don't fail when there is a horizontal overflow (except in lr-tb);
- correctly handle overflow in the same content area (bug 1717805, bug 1717668);
- fix a typo in the radial gradient's first argument.
- the break element has been deprecated in XFA 2.4, but some old documents can use it, so replace it with one (or more) of its possible substitutions:
- breakBefore;
- breakAfter;
- overflow.
- when binding (after parsing) we get a map between some template nodes and some data nodes;
- so set user data in input handlers using data node uids in the annotation storage;
- to save the form, just put the values we have in the storage into the correct data nodes, serialize the XML as a string and then write the string at the end of the PDF using src/core/writer.js;
- fix a few bugs around data bindings:
- the "Off" issue in Bug 1716980.
According to a comment in `readCmapTable`, we're assuming that the cmap tables (when more than one exist) are sorted in ascending order. If that's not the case, keep checking the following cmap tables in order to fix the referenced issue.
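A sketch of that idea (illustrative, not the actual `readCmapTable` code): instead of stopping at the first candidate, rank every subtable and keep the best one.
```
function pickCmapTable(tables) {
  let best = null;
  for (const table of tables) {
    const score = rankCmapTable(table.platformId, table.encodingId);
    if (score !== -1 && (best === null || score > best.score)) {
      best = { table, score };
    }
  }
  return best && best.table;
}

function rankCmapTable(platformId, encodingId) {
  if (platformId === 3 && encodingId === 1) {
    return 2; // (3, 1): Windows, Unicode BMP -- the preferred table
  }
  if (platformId === 0) {
    return 1; // (0, x): Unicode
  }
  return -1; // not a table we can use here
}
```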
- when the CSS line-height property is set to 'normal', the value depends on the user agent. So use a line height based on the font itself, and if for any reason this value is not available, use 1.2 as the default.
- it's a partial fix for https://bugzilla.mozilla.org/show_bug.cgi?id=1717681.
Apparently some really bad PDF software can create documents with *empty* `Name`-entries, which we thus need to somehow deal with.
While I don't know if this patch is necessarily the best solution, it should at least ensure that the *empty* `Name`-instance cannot accidentally match a proper `Name`-instance (and it doesn't require changes to a lot of existing code).[1]
---
[1] I briefly considered using a `Symbol` rather than an Object, but quickly decided against that since the former one [is not clonable](https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Structured_clone_algorithm#supported_types) and `Name`-instances may be sent to the API.
- it aims to fix https://bugzilla.mozilla.org/show_bug.cgi?id=1716838.
- some fonts in the PDF in the bug were bold when they shouldn't be, so write the font properties in the HTML to avoid using some wrong inherited ones.
- partial fix for https://bugzilla.mozilla.org/show_bug.cgi?id=1716980;
- some PDFs can contain an invalid font family (e.g. 'Windings 3'), so in this case remove the space;
- the font family in typeface attribute doesn't always match the one defined in the FontDescriptor dictionary.
- PR #13554 is buggy, so this patch aims to fix those bugs.
- check if a component fits into its parent, taking the parent layout into account.
- introduce the method isSplittable for template nodes, to know if a component can be split in case of overflow.
- some containers don't always have both of their dimensions set, and those dimensions are based on their contents;
- so in order to measure text, we must get the glyph widths (for the xfa fonts) before starting the layout;
- implement a word-wrap algorithm;
- handle font change during text layout.
Note, this only really fixes Radial/Axial shading patterns with masks.
I'm guessing tiling patterns and mesh patterns would also be broken
if applied like the test pdf. Hopefully I'll have some time to make
test cases for the other shadings.
Fixes #13372
- and fix a few bugs:
- avoid an infinite loop when laying out the document;
- avoid confusion between break and layout failure;
- don't add margin width in tb layout when getting available space.
- it aims to avoid looping forever when opening the PDF in #13213;
- the idea is to consider a subformSet as non-existent when walking the tree. So if we have subformA > subformSet > subformB, then subformB will be visited as a direct child of subformA.
- Some js files contain scale factors for each glyph in order to rescale Liberation to have a final font with the correct width.
- A lot of XFA documents have containers whose dimensions are based on their text content, so using the default font from the browser can lead to an almost unreadable PDF.
This patch uses the new option added in PR 12726 to *also* allow fetching binary CMap data directly in the worker-thread in browsers.
Given that these changes remove the need to transfer data between threads for the default (browser) use-case, we can also revert the changes in PR 11118 since that simplifies the overall implementation.
For Type3 fonts where the /CharProcs-streams of the individual glyphs start with a `d1` operator, we can use that to build a fallback bounding box for the font and thus improve text-selection in some cases.
For HighlightAnnotations with a built-in appearance stream, we still rely on it to specify the opacity correctly via a suitable blend mode. However, if the Annotation-drawing operators are placed *within* a /XObject of the /Form-type, the /ExtGState won't apply to the final rendering and the result is that the highlighting obscures the underlying text.
The more *correct* and general solution would likely be to somehow modify the implementation in `src/display/canvas.js`, to special-case the handling of /Form-type /XObjects when rendering Annotations. Since we can very easily work around this problem for now by using the "no appearance stream" code-path, doing *something* here ought to be preferable.
This patch is (obviously) merely a work-around, but given that the referenced issue is (as far as I know) the first case we've seen of this problem a simple solution will hopefully suffice for now.
This fixes the colours, by respecting the strokeAlpha/fillAlpha-values, for a couple of Annotations in the PDF document from issue 13447.[1]
---
[1] Some of the annotations still won't render at all, when compared with Adobe Reader, but that could/should probably be handled separately.
According to the specification, see https://web.archive.org/web/20210404042322if_/https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G6.2384179, the keys of a NameTree/NumberTree should be ordered.
For corrupt PDF files, which violate this assumption, it's thus possible that trying to look up a single entry fails.
Previously, in PR 10274, we implemented a fallback that only applies to the "bottom" node of a NameTree/NumberTree, which in general might not actually help for sufficiently corrupt NameTree/NumberTree data.
Instead we remove the current *limited* fallback from `NameOrNumberTree.get`, and defer to the call-site to handle this case explicitly e.g. by using `NameOrNumberTree.getAll` for data where that makes sense. For well-formed documents, these changes should *not* lead to any additional data fetching/parsing.
Finally, as part of these changes, the validation of named destination data is improved in the `Catalog` and a new unit-test is also added.
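A sketch of the new call-site pattern (hypothetical names; `getAll` is assumed to return a Map-like object here):
```
function lookupNamedDestination(nameTree, name) {
  // Fast path: binary search, which assumes the keys are ordered.
  let dest = nameTree.get(name);
  if (dest === undefined) {
    // Corrupt/unordered tree: parse all of the leaves once instead.
    dest = nameTree.getAll().get(name);
  }
  return dest;
}
```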
- Right now, a glyph with an erroneous outline is replaced by an empty glyph.
If the error is far enough from the start there's likely something to render,
so the idea is to replace a command expecting arguments with an endchar when no args are
on the stack: this way OTS is likely happy (no remaining args on the stack) and we
can draw something, which is likely better than nothing.
Previously, we set the base transformation and pattern matrix
directly to the main rendering ctx of the page, however doing this
caused the current transform to be lost. This would cause issues
with things like shear missing so the pattern was misaligned or when
stroke was used the scale of the line width or dash would be wrong.
Instead we should leave the current transform and use setTransform
on the pattern so it is applied correctly. For axial and radial shadings I had
to create a temporary canvas to draw the shading so I could in turn
use setTransform.
Fixes: #13325, #6769, #7847, #11018, #11597, #11473
The following tests, already in the corpus, are improved:
issue8078-page1
issue1877-page1
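A sketch of the approach using the plain canvas API (which supports exactly this via `CanvasPattern.setTransform`): the pattern-space matrix is applied to the pattern itself, so the context's own transform survives.
```
const canvas = document.createElement("canvas");
const ctx = canvas.getContext("2d");

// One tile of the pattern, drawn on a small temporary canvas.
const tile = document.createElement("canvas");
tile.width = tile.height = 16;
tile.getContext("2d").fillRect(0, 0, 8, 8);

const pattern = ctx.createPattern(tile, "repeat");
// Position the pattern, not the context: shear/scale on ctx (and the
// line width/dash space) are left intact.
pattern.setTransform(new DOMMatrix([2, 0, 0.5, 2, 10, 10]));
ctx.fillStyle = pattern;
ctx.fillRect(0, 0, canvas.width, canvas.height);
```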
Without this some *composite* fonts may incorrectly end up with matching `hash`es, thus breaking rendering since we'll not actually try to load/parse some of the fonts.
*Please note:* Given that the document, in the referenced issue, doesn't embed *any* of its fonts there's no guarantee that it renders correctly in all configurations even with this patch.
- but don't validate them for now;
- Firefox will display a bar to warn that the signature validation is not supported (see https://bugzilla.mozilla.org/show_bug.cgi?id=854315)
- almost all (all?) PDF readers display signatures;
- validation is done in edge but for now it's behind a pref.
The fontName, as defined in the PDF document, cannot be found in *any* of the "name"-tables in the TrueType Collection font. To work-around that, this patch adds a *fallback* code-path to allow using an approximately matching fontName rather than outright failing.
Fixes #13107
In the issue, some TrueType glyph names have the format `uniXXXX`.
The font's `Encoding` dictionary has the entry `Differences` but no
`BaseEncoding`. `uniXXXX` names are converted to glyph indices
using the font's `post` table, but currently that is done only when
`BaseEncoding` exists. We must enable the conversion also when only
`Differences` exists.
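A sketch of the name-to-code-point step (illustrative): once the code point is recovered, it can be looked up via the font's `post` table regardless of whether a `BaseEncoding` exists.
```
function parseUniGlyphName(glyphName) {
  const match = /^uni([0-9A-Fa-f]{4})$/.exec(glyphName);
  return match ? parseInt(match[1], 16) : -1;
}

console.log(parseUniGlyphName("uni0041")); // 65, i.e. "A"
console.log(parseUniGlyphName("space")); // -1 (not a uniXXXX name)
```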
* JS - Correctly handle the hierarchy of fields
- it aims to fix #13132;
- annotations can inherit their actions from the parent field;
- there are some fields which act as a container for other fields:
- they can be accessed through JS, so we need to add them with an empty type (nothing in the spec about that, but checked in Acrobat);
- the calculation order list (CO) can reference them, so we need to make them reachable through this.getField;
- getArray method must return kids.
- field values are numbers, strings, ... depending on their type, but there's nothing in the spec on how to know what the type is:
- according to the comment for Canonical Format: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#page=461
- it seems that this "type" can be guessed from the JS action Format (when setting a type in Acrobat DC, the only affected thing is this action); a sketch of such a guess follows this list.
- util.scand with an empty string returns the current date.
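A sketch of such a guess, assuming the Format action's JavaScript source is available as a string (the `AF*` names are the standard AcroForm formatting helpers):
```
function guessFieldType(formatSource) {
  if (!formatSource) {
    return "string"; // no Format action: leave the value untyped
  }
  if (formatSource.includes("AFNumber_Format")) {
    return "number";
  }
  if (formatSource.includes("AFPercent_Format")) {
    return "percent";
  }
  if (formatSource.includes("AFDate_Format") ||
      formatSource.includes("AFTime_Format")) {
    return "date"; // also matches the AFDate_FormatEx variant
  }
  return "string";
}
```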
This extends PR 13033 slightly, with a heuristic to support corrupt PDF documents where the `LineAnnotation`s have an empty /Rect-entry. Please note that while I have no idea if this is "correct", this patch at least makes us output the same /BBox as re-saving in Adobe Reader does.
*As far as I can tell, this has been broken ever since PR 3289 (back in 2013) without anyone noticing.*
For any non-`MissingDataException` errors encountered in `ObjectLoader._walk`, we're simply throwing immediately which thus has the potential to *completely* break rendering of an entire page.
In practice this is obviously only an issue for PDF documents which are in one way or another corrupt, since that's the only way that `XRef.fetch` will throw non-`MissingDataException` errors. To make matters worse these errors are *intermittent*, since they can only occur if the document is still loading when the `ObjectLoader`-code runs (note the early return in `ObjectLoader.load`).
Please note that we cannot simply catch the error and let "normal" parsing continue in `ObjectLoader._walk`, since that could lead to errors elsewhere given that resources "below" the current one (in the graph) might not be checked as intended then.
All-in-all, the only way to make absolutely sure that we won't cause *unexpected* `MissingDataException`s somewhere else in the code-base is to fallback to fetching the *entire* document in this edge-case.
- Remove a *duplicated* reference test, see "issue12810", from the manifest.
- Improve the spelling in a couple of comments in `src/core/canvas.js`, most notably of the word "parallelogram".
- Update a comment, also in `src/core/canvas.js`, to actually agree with the value used to reduce confusion when reading the code.
While PR 12725 fixed bug 1671312 as reported, i.e. the "In the upper right corner "Purposes' has bad kerning."-part, it however broke other parts of the text rendering.
Note in particular the tables, e.g. on page 2 and beyond, where the glyphs are now rendered too close together. The reason for this is that the fonts in question are non-embedded ArialNarrow, which we just replace with Helvetica which obviously is not narrow. Given that the font replacement isn't a perfect fit for non-embedded ArialNarrow, we still need to re-measure the glyph widths in this case.
* add a comment to explain how minimal linewidth is computed.
* when context.lineWidth < 1 after the transform, Firefox and Chrome
don't render in the same way (issue #12810).
* set lineWidth to 1 after transform and before stroking
- aims to fix issue #12295
- a pixel can be transformed into a rectangle with both dimensions < 1.
A single rescale leads to a rectangle with one dimension equal to 1 and
the other greater than 1.
* change the way we render rectangles with null dimensions:
- right now we rely on the lineWidth set before "re" but
it can be set after "re" and before "S" and in this case the rendering
will be wrong.
- render such rectangles as a single line.
Given that the PDF document in the issue contains the same very large JPEG image *three* times, this patch includes a test-case where only the first page has been extracted from it.
Currently any errors thrown in `preEvaluateFont`, which is a *synchronous* method, will not be handled at all in the `loadFont` method and we were thus failing to return an `ErrorFont`-instance as intended here.
Also, add an *explicit* check in `PartialEvaluator.preEvaluateFont` to ensure that Type0-fonts always have a *valid* dictionary.
Similar to other markers that we currently skip, by ignoring unsupported Coding style default (COD) options we'll at least render *something* here (although some JPEG 2000 images may look slightly wrong).
Note that if the unsupported COD options lead to additional errors, during parsing, we'll still abort parsing of the JPEG 2000 image.
* the goal is to execute actions like Open or OpenAction
* can be tested with issue6106.pdf (auto-print)
* once #12701 is merged, we can add page actions
Similar to other markers that we currently skip, by ignoring the Coding style component (COC) marker we'll at least prevent outright errors (although some JPEG 2000 images may look slightly wrong).
It appears that the PDF document in [bug 1292316](https://bugzilla.mozilla.org/show_bug.cgi?id=1292316) now renders "correctly"[1] when compared to e.g. Adobe Reader and PDFium. Most likely this bug was fixed by a *somewhat* recent patch, or patches, to the `XRef.indexObjects` method.
Before just closing [bug 1292316](https://bugzilla.mozilla.org/show_bug.cgi?id=1292316) as WFM, I figured that it probably can't hurt to add it as a new test-case to avoid accidentally regressing this document in the future.
---
[1] Given that the XRef table is corrupt, and that we're forced to recover, there's generally speaking probably some question as to what actually constitutes "correct" in this case.
There doesn't seem to be anything definitive about this in
the spec, but from experimenting, it seems Acrobat lets
PDFs override the widths of the standard fonts.
In addition to the existing /Root and /Pages validation, also check that the /Pages-entry actually is a dictionary and that it has a valid /Count-entry.
This way we can avoid picking a trailer candidate which e.g. the `Catalog.numPages` getter will just end up rejecting, thus breaking PDF document loading completely.
* remove the 1st param of _createPopup (almost useless for a method)
* prepend the popup div to avoid having popups on top of some highlights (and thus partially "disabling" mouse events)
* add a ref test for issue #12504
* in some PDFs, there are actions with "event.source.hidden = ..."
* in order to handle visibility when printing, annotationStorage is extended to store multiple properties (value, hidden, editable, ...)
Different fonts incorrectly end up with *identical* hashes, despite having different /ToUnicode data.
The issue, and it's very interesting that we've apparently not seen it before, appears to be caused by the fact that different /ToUnicode entries share the *same* underlying `ArrayBuffer`, which thus becomes problematic at the `const dataUint32 = new Uint32Array(data.buffer, 0, blockCounts);` line. The simplest solution thus seems to be to just *copy* the input, when it's an `ArrayBuffer`, rather than using it as-is. (Note that if we'd stringified the input, when calling `MurmurHash3_64.update`, the issue would also have been fixed. In this case, we're already creating a unique TypedArray.)
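A sketch of that copy (illustrative, not the actual `MurmurHash3_64.update` code):
```
// Take a private copy when the input is an ArrayBuffer, since several
// /ToUnicode streams may share one underlying buffer, and a view created
// with `new Uint32Array(data.buffer, 0, blockCounts)` would then hash
// the wrong bytes.
function toHashableData(input) {
  if (input instanceof ArrayBuffer) {
    return input.slice(0); // copy; the original buffer stays shared
  }
  return input;
}
```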
This changes the `transformOrigin` calculations in `AnnotationElement._createContainer` and `PopupAnnotationElement.render`, to ensure that e.g. the clickable area of annotations and/or popups are both positioned correctly.
The problem occurs for *negative* values, since they're not negated correctly because of how the `transformOrigin` strings were built; see issue 12406 for a more in-depth explanation. Previously, for negative values, the `transformOrigin` strings would thus be ignored since they're not valid.
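A sketch of the corrected string building: negate the numeric values before formatting, so a negative offset can never produce an invalid "--10px".
```
function makeTransformOrigin(x, y) {
  return `${-x}px ${-y}px`;
}

console.log(makeTransformOrigin(12.5, 30)); // "-12.5px -30px"
console.log(makeTransformOrigin(-12.5, 30)); // "12.5px -30px" (was "--12.5px -30px")
```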
This patch contains a possible approach for fixing issue 12294, which compared to other PRs is purposely limited to the affected `WidgetAnnotation` code.
As mentioned elsewhere, considering that we're (at least for now) trying to fix *one specific* case, I think that we should avoid modifying the `Dict` primitive[1] and/or avoid a solution that (indirectly) modifies an existing `Dict`-instance[2].
This patch simply fixes the issue at hand, since that seems easiest for now, and I'd suggest that we worry about a more general approach if/when that actually becomes necessary.
Hence the solution implemented here, for `WidgetAnnotation`, is to simply use a combination of the local *and* AcroForm /DR resources during OperatorList-parsing to ensure that things work correctly regardless of where a particular /Font resource is found.
For saving of form-data, on the other hand, we want to avoid increasing the file-size unnecessarily and need to be smarter than just merging all of the available resources. To achieve this, a new `WidgetAnnotation._getSaveFieldResources` method will, when necessary, produce a combined resources `Dict` with only the minimum amount of data from the AcroForm /DR resources included.
---
[1] You want to avoid anything that could cause the general `Dict` implementation to become slower, or more complex, just for handling an edge-case in my opinion.
[2] If an existing `Dict`-instance is modified unexpectedly, that could very easily lead to problems elsewhere since e.g. `Dict`-instances created during parsing are not expected to be changed.
In issue 12120, the font has a 1,0 cmap and is marked symbolic which
according to the spec means we should directly use the cmap instead of
the extra steps that are defined in 9.6.6.4.
However, just fixing that caused bug 1057544 to break. The font in bug
1057544 has a 0,1 cmap (Unicode 1.1) which we were not using, but is
easy to support. We're also easily able to support some of the other
unicode cmaps, so I added those as well.
There was also a second issue with bug 1057544, the cmap doesn't have
a mapping for the "quoteright" glyph, but it is defined in the post
table. To handle this, I've made the post table lookup a fallback for any
font that has an encoding.
In addition to the unit tests these reference tests make sure that this
document, that triggered some edge cases in our code, can be rendered
and printed successfully now.
This is *similar* to the existing transfer function support for SMasks, but extended to simple image data.
Please note that the extra amount of data now being sent to the worker-thread, for affected /ExtGState entries, is limited to *at most* 4 `Uint8Array`s each with a length of 256 elements.
Refer to https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G9.1658137 for additional details.
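A sketch of how such 256-entry maps apply to packed 8-bit RGB data (illustrative; the real code also handles other color spaces):
```
function applyTransferMaps(rgbData, [rMap, gMap, bMap]) {
  for (let i = 0; i < rgbData.length; i += 3) {
    rgbData[i] = rMap[rgbData[i]];
    rgbData[i + 1] = gMap[rgbData[i + 1]];
    rgbData[i + 2] = bMap[rgbData[i + 2]];
  }
  return rgbData;
}

// Example: an inverting transfer function for every channel.
const invert = new Uint8Array(256).map((_, index) => 255 - index);
applyTransferMaps(new Uint8Array([0, 128, 255]), [invert, invert, invert]);
// -> Uint8Array [255, 127, 0]
```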
Some fonts have loca tables that aren't sorted or use 0 as an offset to
signal a missing glyph. This fixes the bad loca tables by sorting them
and then rewriting the loca table and potentially re-ordering the glyf
table to match.
Fixes #11131 and bug 1650302.
Issue 4398 was fixed by PR 4437, however a test-case wasn't included as far as I can tell. Given that PR 12186 is now in the process of re-factoring that code, adding a test-case cannot hurt as far as I'm concerned.
Add a new method to the API to get the optional content configuration. Add
a new render task param that accepts the above configuration.
For now, the optional content is not controllable by the user in
the viewer, but renders with the default configuration in the PDF.
All of the test files added exhibit different uses of optional content.
Fixes#269.
Fix test to work with optional content.
- Change the stopAtErrors test to ensure the operator list has something,
instead of asserting the exact number of operators.
The default viewer, and thus Firefox, depends on the `RenderTask.onContinue` functionality to pause/continue rendering (such that the most visible page always renders first).
Despite this functionality thus being very important, it has however never actually been tested *at all* as far as I can tell. Hence this patch which adds a new boolean `renderTaskOnContinue` parameter (`false` by default), that can be used to force a reference-test to use the `RenderTask.onContinue` code-path in the `InternalRenderTask` class.
Note that I purposely made this new reference-test behaviour *optional*, since I didn't want to negatively affect the general runtime of the tests (given that there's a slight delay added to the rendering). Also, for e.g. benchmarking you'd most likely want to stay away from the `RenderTask.onContinue` functionality for similar reasons.
This should reduce the possibility of accidentally truncating some inline images, while *not* causing the "EI" detection to become significantly slower.[1]
There's obviously a possibility that these added checks are not sufficient to catch *every* single case of "EI" sequences within the actual inline image data, but without specific test-cases I decided against over-engineering the solution here; a sketch of the boundary check follows the footnote below.
*Please note:* The interpolation issues are somewhat orthogonal to the main issue here, which is the truncated image, and it's already tracked elsewhere.
---
[1] I've looked at the issue a few times, and this is the first approach that I was able to come up with that didn't cause *unacceptable* performance regressions in e.g. issue 2618.
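A sketch of the kind of boundary check meant above (illustrative; the actual heuristics are more involved):
```
// Only treat "EI" as the end of the inline image when it sits on a
// whitespace/EOF boundary, making a match inside raw image data less likely.
function looksLikeEIMarker(bytes, pos) {
  if (bytes[pos] !== 0x45 /* E */ || bytes[pos + 1] !== 0x49 /* I */) {
    return false;
  }
  const next = bytes[pos + 2];
  return (
    next === undefined || // end of data
    next === 0x20 || next === 0x0a || next === 0x0d || next === 0x09
  );
}
```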
*First of all, I should mention that my understanding of the finer details of the `QueueOptimizer` (and its related `CanvasGraphics` methods) is somewhat limited.*
Hence I'm not sure if there's actually a very good reason for *only* considering ImageMasks where the "skew" transformation matrix elements are zero as *repeated*, however simply looking at the code I just don't see why these elements cannot be non-zero as long as they are *all identical* for the ImageMasks.
Furthermore, looking at the *group* case (which is what we're currently falling back to), there's no particular limitation placed upon the transformation matrix elements.
While this patch obviously isn't enough to *completely* fix the issue, since there should be a visible Pattern rendered as well[1], it seems (at least to me) like enough of an improvement that submitting this is justified.
With these changes the referenced PDF document will no longer hang the *entire* browser, and rendering also finishes in a *reasonable* time (< 10 seconds for me) which seems fine given the *huge* number of identical inline images present.[2]
---
[1] Temporarily changing the Pattern to a solid color *does* render the correct/expected area, which suggests that the remaining problem is a pre-existing issue related to the Pattern-handling itself rather than the `QueueOptimizer` functionality.
[2] The document isn't exactly rendered immediately in e.g. Adobe Reader either.
Because of a really stupid `Promise`-related mistake on my part, when re-factoring `PDFImage.buildImage` during the `NativeImageDecoder` removal, we're no longer re-throwing errors occurring during image parsing/decoding as intended.
The result is that some (fairly) corrupt documents will never finish loading, and unfortunately there were apparently no sufficiently corrupt images in the test-suite to catch this.
On ISO/IEC 10918-6:2013 (E), section 6.1: (http://www.itu.int/rec/T-REC-T.872-201206-I/en)
"Images encoded with three components are assumed to be RGB data encoded as YCbCr unless the image contains an APP14 marker segment as specified in 6.5.3, in which case the colour encoding is considered either RGB or YCbCr according to the application data of the APP14 marker segment"
But common JPEG libraries consider RGB too if the component IDs are ASCII R (0x52), G (0x47) and B (0x42): https://stackoverflow.com/questions/50798014/determining-color-space-for-jpeg/50861048
Issue #11931
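A sketch of that heuristic:
```
// With no APP14 marker present, treat a three-component JPEG as RGB when
// the component identifiers are the ASCII bytes "R", "G" and "B".
function isRgbByComponentIds(components) {
  return (
    components.length === 3 &&
    components[0].id === 0x52 && // "R"
    components[1].id === 0x47 && // "G"
    components[2].id === 0x42 // "B"
  );
}

console.log(isRgbByComponentIds([{ id: 0x52 }, { id: 0x47 }, { id: 0x42 }])); // true
console.log(isRgbByComponentIds([{ id: 1 }, { id: 2 }, { id: 3 }])); // false -> YCbCr
```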
Currently some JPEG images are decoded by the built-in PDF.js decoder in `src/core/jpg.js`, while others attempt to use the browser JPEG decoder. This inconsistency seems unfortunate for a number of reasons:
- It adds, compared to the other image formats supported in the PDF specification, a fair amount of code/complexity to the image handling in the PDF.js library.
- The PDF specification supports JPEG images with features, e.g. certain ColorSpaces, that browsers are unable to decode natively. Hence, determining if a JPEG image is possible to decode natively in the browser requires a non-trivial amount of parsing. In particular, we're parsing (part of) the raw JPEG data to extract certain marker data and we also need to parse the ColorSpace for the JPEG image.
- While some JPEG images may, for all intents and purposes, appear to be natively supported there's still cases where the browser may fail to decode some JPEG images. In order to support those cases, we've had to implement a fallback to the PDF.js JPEG decoder if there's any issues during the native decoding. This also means that it's no longer possible to simply send the JPEG image to the main-thread and continue parsing, but you now need to actually wait for the main-thread to indicate success/failure first.
In practice this means that there's a code-path where the worker-thread is forced to wait for the main-thread, while the reverse should *always* be the case.
- The native decoding, for anything except the *simplest* of JPEG images, results in increased peak memory usage because there's a handful of short-lived copies of the JPEG data (see PR 11707).
Furthermore this also leads to data being *parsed* on the main-thread, rather than the worker-thread, which you usually want to avoid for e.g. performance and UI-responsiveness reasons.
- Not all environments, e.g. Node.js, fully support native JPEG decoding. This has, historically, led to some issues and support requests.
- Different browsers may use different JPEG decoders, possibly leading to images being rendered slightly differently depending on the platform/browser where the PDF.js library is used.
Originally the implementation in `src/core/jpg.js` was unable to handle all of the JPEG images in the test-suite, but over the last couple of years I've fixed (hopefully) all of those issues.
At this point in time, there are two kinds of failures with this patch:
- Changes which are basically imperceivable to the naked eye, where some pixels in the images are essentially off-by-one (in all components), which could probably be attributed to things such as different rounding behaviour in the browser/PDF.js JPEG decoder.
This type of "failure" accounts for the *vast* majority of the total number of changes in the reference tests.
- Changes where the JPEG images now looks *ever so slightly* blurrier than with the native browser decoder. For quite some time I've just assumed that this pointed to a general deficiency in the `src/core/jpg.js` implementation, however I've discovered when comparing two viewers side-by-side that the differences vanish at higher zoom levels (usually around 200% is enough).
Basically if you disable [this downscaling in canvas.js](8fb82e939c/src/display/canvas.js (L2356-L2395)), which is what happens when zooming in, the differences simply vanish!
Hence I'm pretty satisfied that there's no significant problems with the `src/core/jpg.js` implementation, and the problems are rather tied to the general quality of the downscaling algorithm used. It could even be seen as a positive that *all* images now share the same downscaling behaviour, since this actually fixes one old bug; see issue 7041.
This should ensure that a page will always render successfully, even if there's errors during the Annotation fetching/parsing.
Additionally the `OperatorList.addOpList` method is also adjusted to ignore invalid data, to make it slightly more robust.
- Add a reduced test-case for issue 11768, to prevent future regressions.
(Given that PR 11769 is only a work-around, rather than a proper solution, it may not be entirely accurate for the issue to be closed as fixed.)
- Add more validation of the charCode, as found by the heuristics, in `PartialEvaluator._buildSimpleFontToUnicode` to prevent future issues.
At this point in time, compared to when the "ignore single-char" code was added, we *should* generally be doing a much better job of combining text into as few chunks as possible.
However, there are still bad cases where we're not able to combine text as much as one would like, which is why I'm *not* proposing to simply measure/scale all text. Instead this patch will only measure/scale single-char text in cases where the horizontal/vertical scale is off significantly, since that's where you'd expect bad text-selection behaviour otherwise.
Note that most of the movement caused by this patch is with Type3 fonts, which is a somewhat special font type and one where our current text-selection behaviour is probably the least good.
Fixes #11718, in which the `ff` ligature glyph is at index zero in a CFF font. Because this is a CIDFont, glyph names are CIDs, which are integers. Thus the string `".notdef"` is not correct. The rest of the charset data is already parsed correctly as integers when the boolean argument `cid` is true.
The /Differences array of the problematic font contains a `/c.1` entry, which is consequently detected as a *possible* Cdd{d}/cdd{d} glyphName by the existing heuristics.
Because of how the base 10 conversion is implemented, which is necessary for the base 16 special case, the parsed charCode becomes `0.1` thus causing `String.fromCodePoint` to throw since that obviously isn't a valid code point.
To fix the referenced issue, and to hopefully prevent similar ones in the future, the patch adds *additional* validation of the charCode found by the heuristics.
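A sketch of the added validation (illustrative; the real heuristics handle more name formats):
```
// Only accept the parsed charCode when it's a valid integer code point,
// so a name like "/c.1" can no longer crash String.fromCodePoint.
function cddGlyphNameToUnicode(glyphName) {
  const code = parseFloat(glyphName.substring(1)); // strip the "c"/"C" prefix
  if (!Number.isInteger(code) || code < 0 || code > 0x10ffff) {
    return ""; // not a usable Cdd{d}/cdd{d}-style name
  }
  return String.fromCodePoint(code);
}

console.log(cddGlyphNameToUnicode("c65")); // "A"
console.log(cddGlyphNameToUnicode("c.1")); // "" (previously threw)
```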
The PDF document in question is *corrupt*, since it contains an XObject with a truncated dictionary and where the stream contents start without a "stream" operator.
Fixes #11477
The PDF draws many space characters but the embedded fonts don't have a glyph named `space`, so `.notdef` should be drawn instead. PDF.js assumed that Type1 fonts define `.notdef` as the first glyph (index 0). However, now the fonts have the glyph `A` at index 0 and `.notdef` is the last one, so `A` appears where spaces are expected.
Because the rest of the font machinery in `core/fonts.js` assumes `.notdef` is at index zero, it's easiest to modify `core/type1_parser.js` so that it "repairs" fonts and makes sure `.notdef` is at index 0.
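A sketch of the repair, assuming an ordered list of glyph records (the actual `core/type1_parser.js` change also has to keep the charstrings and encoding consistent):
```
function ensureNotdefFirst(glyphs) {
  const index = glyphs.findIndex(glyph => glyph.name === ".notdef");
  if (index > 0) {
    const [notdef] = glyphs.splice(index, 1);
    // Everything that referenced the shifted indices must be remapped
    // by the caller as well.
    glyphs.unshift(notdef);
  }
  return glyphs;
}

console.log(
  ensureNotdefFirst([{ name: "A" }, { name: ".notdef" }]).map(g => g.name)
); // [".notdef", "A"]
```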
The PDF document in question is *corrupt*, since it contains multiple instances of incorrect operators.
We obviously don't want to slow down parsing of *all* documents (since most are valid), just to accommodate a particular bad PDF generator, hence the reason for the inline check before calling the `ensureStateFont` method.
*This whole patch feels somewhat arbitrary, and I'd be slightly worried about possibly breaking something else.*
To limit the impact of these changes, we only re-parse JPEG images using a reduced `scanLines` value if and only if: An unexpected EOI (End of Image) marker was encountered during decoding of Scan data *and* the "actual" `scanLines` value is at least one order of magnitude smaller than expected.
In the current `AnnotationLayer` implementation, Popup annotations require that the parent annotation have already been rendered (otherwise they're simply ignored).
Usually the annotations are ordered, in the `/Annots` array, in such a way that this isn't a problem, however there's obviously no guarantee that all PDF generators actually do so. Hence we simply ensure, when rendering the `AnnotationLayer`, that the Popup annotations are handled last.
- Re-factor the "incorrect encoding" check, since this can be easily achieved using the general `findNextFileMarker` helper function (with a suitable `startPos` argument).
- Tweak a condition, to make it easier to see that the end of the data has been reached.
- Add a reference test for issue 1877, since it's what prompted the "incorrect encoding" check.
Fixes #11403
The PDF uses the non-embedded Type1 font Helvetica. Character codes 194 and 160 (`Â` and `NBSP`) are encoded as `.notdef`. We shouldn't show those glyphs because it seems that Acrobat Reader doesn't draw glyphs that are named `.notdef` in fonts like this.
In addition to testing `glyphName === ".notdef"`, we must test also `glyphName === ""` because the name `""` is used in `core/encodings.js` for undefined glyphs in encodings like `WinAnsiEncoding`.
The solution above hides the `Â` characters, but now the replacement character (space) appears to be too wide. I found out that PDF.js ignores the font's `Widths` array if the font has no `FontDescriptor` entry. That happens in #11403, so the default widths of Helvetica were used as specified in `core/metrics.js` and `.notdef` got a width of 333. The correct width is 0, as specified by the `Widths` array in the PDF. Thus we must never ignore `Widths`.
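A sketch combining both fixes (illustrative names):
```
// Never draw glyphs whose resolved name is ".notdef", or "" (the name used
// in core/encodings.js for undefined slots).
function shouldDrawGlyph(glyphName) {
  return glyphName !== ".notdef" && glyphName !== "";
}

// Always prefer the /Widths array from the PDF, even when the font has no
// /FontDescriptor; only fall back to the metrics defaults when it's absent.
function getGlyphWidth(charCode, widths, defaultWidths) {
  return widths && widths[charCode] !== undefined
    ? widths[charCode] // 0 is a perfectly valid width here
    : defaultWidths[charCode];
}
```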
In the PDF document in question, there's an ASCII85Decode inline image where the '>' part of EOD (end-of-data) marker is missing; hence the PDF document is corrupt.
For documents with a Linearization dictionary the computed `startXRef` position will be relative to the raw file, rather than the actual PDF document itself (which begins with `%PDF-`).
Hence it's necessary to subtract `stream.start` in this case, since otherwise the `XRef.readXRef` method will increment the position too far resulting in parsing errors.
This will allow us to attempt to recover as much as possible of a page, rather than immediately failing, when a broken/unsupported ColorSpace is encountered. This patch thus extends the framework added in PRs such as e.g. 8240 and 8922, to also cover parsing of ColorSpaces.
Obviously this won't look exactly right, but considering that the PDF file doesn't bother embedding non-standard fonts this is the best that we can do here.
Originally only `skipPages` existed, but given that `firstPage`/`lastPage` has existed for a long time now using them whenever possible looks simpler overall.
- In the `ibwa-bad` case the sixteenth page contains corrupt/incomplete commands, but given that we're suppressing `Error`s by default now skipping hardly seems warranted any more.
- In the `geothermal.pdf` case the first page contains an unsupported ColourSpace, but again we're suppressing `Error`s by default now and skipping hardly seems warranted any more.
This patch is making me somewhat worried about future regressions, since it's certainly easy to imagine this completely breaking certain kinds of corrupt/edited PDF documents while fixing others.[1]
Obviously it passes all existing reference tests (and even improves one), however compared to many other patches there's no telling how much it could break.
The only reason that I'm even submitting this patch, is because of the number of open issues that it would address.
Generally speaking though, the best course of action would probably be if `XRef.indexObjects` was re-written to be much more robust (since it currently feels somewhat hand-wavy in parts). E.g. by actually checking/validating more of the objects before committing to them.
---
[1] Especially given that it's reverting part of PR 5910, however in the case of issue 5909 it seems that other (more recent) changes have actually made that PR redundant.
As part of attempting to fix a number of issues containing PDF documents with corrupt XRef tables, I'd like to improve the reference test-coverage slightly *first*.
Obviously this will increase the runtime of the tests a bit, however I'd rather "waste" resources on the bots instead of developer time fixing regressions which could have been avoided.
*Please note:* I've been thinking about possible ways of addressing this issue for a while now, but all of the solutions I came up with became too complicated and thus hurt readability of the code.
However, it occurred to me that we're essentially trying to add a heuristic *on top* of another heuristic, and that it shouldn't matter how efficient the code is as long as it works.
In the PDF file in the issue the Encoding contains glyphNames of the `Cdd` format, which our existing heuristics will treat as base 10 values. However, in this particular file they actually contain base 16 values, which we thus attempt to detect and fix such that text-selection works.
Hopefully this patch makes sense, and in order to reduce the regression risk the implementation ensures that only completely missing widths are being replaced.
This is based on a real-world PDF file I encountered very recently[1], although I'm currently unable to recall where I saw it.
Note that different PDF viewers handle these sort of errors differently, with Adobe Reader outright failing to render the attached PDF file whereas PDFium mostly handles it "correctly".
The patch makes the following notable changes:
- Refactor the `cropBox` and `mediaBox` getters, on the `Page`, to reduce unnecessary duplication. (This will also help in the future, if support for extracting additional page bounding boxes are added to the API.)
- Ensure that the page bounding boxes, i.e. `cropBox` and `mediaBox`, are never empty to prevent issues/weirdness in the viewer.
- Ensure that the `view` getter on the `Page` will never return an empty intersection of the `cropBox` and `mediaBox`.
- Add an *optional* parameter to `Util.intersect`, to allow checking that the computed intersection isn't actually empty.
- Change `Util.intersect` to have consistent return types, since Arrays are of type `Object` and falling back to returning a `Boolean` thus seem strange.
---
[1] In that case I believe that only the `cropBox` was empty, but it seemed like a good idea to attempt to fix a bunch of related cases all at once.
This patch will not incur any (measurable) overhead, since the glyphlist is already quite long and one more entry won't really matter, which is important given that this sort of PDF corruption ought to be very rare.
Furthermore, this patch purposely does *not* add a bunch of similarly modified ligature names on pure speculation. Any similar additions, for other ligatures, should only be made if there's real-world examples of PDF files where that's actually necessary.
The border `width` will instead fall back to the default value of `1`, rather than being ignored altogether, to also ensure that e.g. `LinkAnnotation`s become clickable as intended.
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1552113