This patch refactors the `CropBox` code to combine fetching and
validation code in a getter, like we already did for the `MediaBox`
property. Combined with variable name changes, this improves readability
of the code and makes the `view` getter simpler as well.
I've not actually, thus far, come across a PDF file that this patch fixes. However, given the string of recent patches that has fixed issues with indirect objects in arrays, I think that it makes sense to proactively avoid any issues in this code.
This patch avoids the creation of extra arrays when initializing an
array with default (zero) values. Doing this additionally makes the code
more readable by allocating enough space for the number of color
components.
For Arabic characters, the Unicode character codes are mapped to Unicode
character types using the character codes for indexing. However, the
character code 0x061D is undefined (and therefore invalid) in the
Unicode standard. The imported list does not contain this entry, but not
having it in the list breaks indexing for items after it. Therefore, put
an empty string on its position to make indexing work properly and issue
a warning in the unlikely event that we encounter this character.
*This patch fixes something that I noticed while debugging https://bugzilla.mozilla.org/show_bug.cgi?id=1308536.*
The PDF file contains a font called "NuptialScript", which unfortunately is not embedded. Since that is a non-standard font we will not be able to render it entirely correct. However, by adding "NuptialScript" to the `getNonStdFontMap`, we can at least improve the rendering slightly by using an italic (serif) fallback font.
This patch adds support for non-embedded Arial Black fonts, that use a `Arial-Black...` format for the font names.
Also, this patch changes `canvas.js` such that we always render Arial Black fonts with the maximum weight, which actually improves a number of existing test-cases. This should thus explain the test "failures", which are clear improvements compared with e.g. Adobe Reader.
Fixes 7835.
For `PartialEvaluator_getTextContent`, the same `args` Array should be re-used for every `EvaluatorPreprocessor_read` call. Hence we want to ensure that it's not accidentally replaced with `null` in `EvaluatorPreprocessor_read`, since otherwise corrupt PDF files (with too few arguments for certain commands) will cause errors in `PartialEvaluator_getTextContent`.
Perhaps a micro-optimization, but this patch also changes two `!args` comparisons to `args === null`, since that should be a tiny bit more efficient.
By only allowing very specific type of `JavaScript` actions, and also utilizing the existing `URL` validation, this patch shouldn't pose too much risk.
Fixes one of the points in issue 3897 (with the PDF file taken from issue 3438).
Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=843699 (probably, since that bug doesn't contain a test-case).
It seems that certain bad PDF generators can create badly encoded "Prefix" entries for Page Labels, one example being http://ukjewishfilm.org/wp-content/uploads/2015/09/Jewish-Film-Festival-Programme-ONLINE.pdf.
Unfortunately I didn't come across such a PDF file while adding the API support for Page Labels, but with them now being used in the viewer I just found this issue. With this patch, we now display the Page Labels in the same way as Adobe Reader.
The original code is difficult to read and, more importantly, performs
actions that are not described in the specification. It replaces empty
names with a backtick and an index, but this behavior is not described
in the specification. While the specification is not entirely clear
about what should happen in this case, it does specify that the `T`
field is optional and that multiple field dictionaries may have the same
fully qualified name, so to achieve this it makes the most sense to
ignore missing `T` fields during construction of the field name. This is
the most specification-compliant solution and, judging by opened issue #6623, also the required and expected behavior.
In general we neither want, nor can, support arbitrary `Launch` actions. But in practice, all the cases we've seen so far just contains relative URLs to other PDF files. Building on PR 7689, we can thus at least support basic `Launch` actions.
Note that in `FIREFOX/MOZCENTRAL/CHROME` builds of the standard viewer the `docBaseUrl` parameter will be set by default, since in that case it makes sense to use the current URL as a base.
For the `GENERIC` viewer, or the API itself, it doesn't make sense to try and set the `docBaseUrl` by default. However, custom deployments/implementations may still find the parameter useful.
Note that this will automatically reject any relative URL.
To make the API more useful to consumers, URLs that are rejected will be available via the `unsafeUrl` property in the data object returned by `PDFPageProxy_getAnnotations`.
The patch also adds a bit more validation of the data for `Named` actions.
This not only reduces code duplication, but it also allow us to easily support the same kind of URLs we currently do for Link annotations in the Outline as well.
Each well-formed SVG image has the following structure:
SVG element
- Definitions element
- Root group
- Other group 1
- ...
- Other group n
This patch factors out initialization code into a private method in such
as way that the creation of this structure is clear from the code. The
root group is the replacement for the parent group from before. We need
this group as we cannot apply the viewport transform on the SVG element
itself (this caused issues in Chrome). If other code appends groups to
the SVG image, in reality it is appending those groups to the root
group, but this detail is abstracted away by this patch.
This patch ensures that we only create transformation groups when it is
actually required and that we re-use transform groups as much as possible.
It reduces the number of transform groups for the Tracemonkey paper from
2790 to 1271, thereby making the DOM much lighter and rendering/scrolling
smoother. Moreover, it simplifies the code and prevents duplication.
Finally, we issue a warning when an unimplemented graphic state is
encountered. Before, this was ignored silently, making debugging harder.
Let `Parser_makeFilter` pass in the `DecodeParms` data to various image `Stream`s, instead of re-fetching it in various `[...]Stream.prototype.ensureBuffer` methods
In `Parser_filter` the `DecodeParms` data is fetched and passed to `Parser_makeFilter`, where we also make sure that a `Ref` is resolved to a direct object.
We can thus pass this along to the various image `Stream` constructors, to avoid the current situation where we lookup/resolve data that is already available.
Note also that we currently do *not* handle the case where `DecodeParms` is an Array entirely correct in the various image `Stream`s, and this patch fixes that for free.
While the array argument to TJ should only contain strings and numbers, other
unfortunate items are found in PDFs in the wild, e.g.:
[(Grandes) 0.0 Tc
-250.0 (Client\350les,) 0.0 Tc
-250.0 (Financements) 0.0 Tc
-250.0 (et) 0.0 Tc
-250.0 (March\351s) ] TJ
getOperatorList already properly ignores any non-string, non-numeric values in
TJ arrays; without this patch to getTextContent, returned text items can have
NaN widths due to calculations being applied to those non-numeric values.
For PDF files with multiple `/Filter`s, where the `/Length` entry is zero, we fail to render the file correctly. The reason is that `maybeLength` is `null` for the every filter except the first, and `!maybeLength` is thus truthy.
Hence it seems that we should completely ignore the `/Length` entry and also explicitly check `maybeLength === 0`.
Note that I've not (yet) come across a PDF file with this issue in the wild, but given all the stupid things PDF generators do I wouldn't be surprised if such a file actually exists. In order to prevent a possible future bug, I'm submitting this patch which includes a hand-edited PDF file that we currently cannot render correctly (but e.g. Adobe Reader can).
Directly use the hexadecimal representation, just like the
`AnnotationFlags`, to avoid calculations and to improve readability.
This allows us to simplify the unit tests for text widget annotations as
well.
When debugging issue 7643, I noticed that the `forms` tests currently doesn't look like the rendering in the viewer (with `renderInteractiveForms = true` set).
After scratching my head for a little while, I realized that PR 7633 make the implicit assumption that `Page_getOperatorList` (in `core/document.js`) is called *before* fetching the annotation with `PDFPageProxy_getAnnotations` (in `display/api.js`).
Hence this patch, that changes it so that we instead pass in the `renderInteractiveForms` parameter to `Annotation_appendToOperatorList` to ensure that it's always correctly set.
If interactive forms are enabled, then the display layer takes care of
rendering the form elements. There is no need to draw them on the canvas
as well. This also leads to issues when values are prefilled, because
the text fields are transparent, so the contents that have been rendered
onto the canvas will be visible too.
We address this issue by passing the `renderInteractiveForms` parameter
to the render task and handling it when the page is rendered (i.e., when
the canvas is rendered).
This patch improves the unit tests by testing the support for read-only
and multiline fields. Moreover, we add a reference test to ensure that
the text widgets are not only rendered, but also that their contents are
styled properly.
Finally, we perform minor improvements in `src/core/annotation.js`, for
example adding missing comments.
For embedded Type1 fonts without included `ToUnicode`/`Encoding` data, attempt to improve text selection by using the `builtInEncoding` to amend the `toUnicode` map (issue 6901, issue 7182, issue 7217, bug 917796, bug 1242142)
I intended to provide proper benchmarking results here, as outlined in https://github.com/mozilla/pdf.js/wiki/Benchmarking-your-changes, but after wasting a couple of hours over the weekend getting weird results I gave up.
It appears that there's a lot of, i.e. way too much, variance between subsequent runs of `text` tests for the results to be meaningful.
(Previously I've only benchmarked `eq` tests, so I don't know if the `text` tests has never worked well or if it's a newer problem. For reference, please see the results of back-to-back benchmark runs on the current `master` with a *very* simple manifest file: [link here].)
Instead I used `console.time/timeEnd` in `appendText` and `expandTextDivs` to be able to compare the performance with/without the patch. The entire viewer was (skip-cache) reloaded between measurements, and the result are available here: [link here].
Given the troubles I've had with benchmarking, I've not yet computed any statistics on the results (e.g. mean, variance, confidence intervals, and so on).
However, just by looking at the data I think it's safe to say that this patch first of all doesn't seem to regress the current performance. Secondly it certainly looks *very* likely that this patch actually improves the performance, especially for the one-glyph-per-text-div case (cf. issue 7224).
Re: issue 7584.
Even though this patch passes all tests (unit/font/reference) locally, including the new ones that I added in PR 7621, I'm still a bit nervous about modifying the code that choose the fallback encoding for fonts without an `/Encoding` entry.
Note that over the years this code has been changed on a number of occasions, see a possibly incomplete [list here], to deal with various cases of incorrect font data.
According to the PDF specification, see http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1904184, it seems that we should fallback to the `StandardEncoding` for Nonsymbolic fonts.
There's obviously a risk that fixing this particular issue *could* break other PDF files for which we don't have tests. However I've tried to change the logic as little as possible in this patch, to hopefully reduce possible breakage.
Based on debugging numerous font issue, it seems that a lot of fonts actually set the Symbolic flag, even when they are in fact *not* Symbolic. Fonts actually marked as Nonsymbolic seem to be somewhat less common, which I hope should reduce the risk of the patch somewhat.
Fixes 7580.
Note that in `parseCodestream` I purposly left the `throw new Error` instances inside of the `try` block, since we don't want to throw any `Errors` while in recovery mode.
Finally somewhat unrelated to the rest of the patch, but I moved the `doNotRecover` variable declaration outside of the `try` block to avoid variable hoisting given that it's accessed inside the `catch` block.
Note that in order to prevent any possible issues, this patch does *not* try to amend the `toUnicode` data for Type1 fonts that contain either `ToUnicode` or `Encoding` entries in the font dictionary.
Fixes, or at least improves, issues/bugs such as e.g. 6658, 6901, 7182, 7217, bug 917796, bug 1242142.
Moreover, we refactor the code a bit to extract code that is shared
between the two branches and we only apply text alignment (and create
the array) when it is actually defined, since it's optional and left is
already the default.
This patch is the first step towards implementing support for
interactive forms (AcroForms). It makes it possible to render text
widget annotations exactly like Adobe Reader/Acrobat.
Everything we implement for AcroForms is disabled by default using a
preference, mainly because it is not ready to use yet, but has to
implemented in many steps to avoid complexity. The preference allows us
to work with the code while not exposing the behavior by default. Mainly
storing entered values and printing them is still absent, which would be
minimal requirements for enabling this by default.
This patch is yet another instalment in the (never ending) series of patches for PDF files that specify completely incorrect Type/Subtype for its fonts. In this case Type1/Type1C, when in fact OpenType would have been correct.
Fixes 7598.
Currently, we only support text widget annotations (field type 'Tx')
partially. However, the current code does not make this entirely clear
and does not provide a warning when an unsupported field type is
encountered, making it harder to determine why rendering fails.
Moreover, in the display layer we make no distinction between the
various types of widget annotations, causing the code for text widget
annotations to also be executed for other types of widget annotations in
a fallback situation.
This patch improves the structure of the widget annotation code. In the
core layer, we use the same structure we use for non-widget annotations
in the factory and provide a clear warning when an unsupported type is
encountered. In the display layer, we do the same and split the
`WidgetAnnotationElement` class into two classes, namely
`TextWidgetAnnotationElement` for text widget annotations and
`WidgetAnnotationElement` for other unsupported annotations as a
fallback. From this it clear that we only support text widget
annotations and nothing else.
In the case where the document was destroyed, we were rejecting the `Promise` in `JpegDecode` with a string instead of an `Error`. The patch also brings the wording more inline with other such rejections.
Use the `isInt` utility function when validating the `pageNumber` parameter in `WorkerTransport_getPage`, to make it more obvious what's actually happening. There's also a couple more unit-tests added, to ensure that we always fail in the expected way.
Finally, we can simplify the rejection handling in `WorkerTransport_getPageIndexByRef` somewhat. (Note that the only reason for using `catch` here is that since the promise is rejected on the worker side, the `reason` becomes a string instead of an `Error` which is why we "re-reject" on the display side.)
This patch avoids having to calculate the scale twice by saving it in
the properties object.
Moreover, we remove a temporary variable and place parentheses around a
calculation inside a string concatenation.
This allows us to remove the `try/catch` statements used in `src/core/stream.js` when parsing JPEG images.
As far as I can tell, the only reason for the current usage of plain `throw` is that `jpg.js` originally was external code. Given that this code now lives in our repo, this patch brings the JPEG code more in line with e.g. `src/core/jpx.js` and `src/core/jbig2.js`.
Assign the `quantizationTables` after parsing the entire JPEG image, to prevent issues when the DQT (Define Quantization Tables) marker is encountered after SOF{n} (Start of Frame) markers (issue 7406)
This patch improves performance by avoiding unnecessary type
conversions, which also help the JIT for optimizations.
Moreover, this patch fixes issues with the div expansion code where
`textScale` would be undefined in a division. Because of the `dataset`
usage, other comparisons evaluated to `true` while `false` would have
been correct. This makes the expansion mode now work correctly for cases
with, for example, each glyph in one div.
The polyfill for `WeakMap` has been provided by @yurydelendik.
We pass many parameters to `appendText` while we might as well pass the
`task` object that contains them. This saves a few lines of code and
makes the signature of `appendText` more clear. We do the same for
`expand`, which is useful for the next commit in which we replace
`div.dataset` with a `WeakMap`.
Furthermore, this patch adds a missing parameter to a comment block to
make it clear which parameters remain.