Commit Graph

1071 Commits

Author SHA1 Message Date
Brendan Dahl
2cba290361
Merge pull request #12836 from calixteman/update_buttons
JS -- update radio/checkbox values even if there are no actions
2021-01-21 14:00:26 -08:00
Calixte Denizet
0d1b19632d Enforce linewidth to 1px when at least one of scale factor is lower than 1 2021-01-15 13:18:24 +01:00
Jonas Jenwald
cf7eb87934 Remove a duplicated reference test (PR 12812 follow-up)
- Remove a *duplicated* reference test, see "issue12810", from the manifest.

 - Improve the spelling in a couple of comments in `src/core/canvas.js`, most notable of the word "parallelogram".

 - Update a comment, also in `src/core/canvas.js`, to actually agree with the value used to reduce confusion when reading the code.
2021-01-15 10:57:15 +01:00
Brendan Dahl
6619f1f3f2
Merge pull request #12812 from calixteman/too_thin
Enforce line width to be at least 1px after applied transform
2021-01-14 15:21:44 -08:00
Jonas Jenwald
2600e59acb Always re-measure non-embedded ArialNarrow fonts (bug 1671312, PR 12725 follow-up)
While PR 12725 fixed bug 1671312 as reported, i.e. the "In the upper right corner "Purposes' has bad kerning."-part, it however broke other parts of the text rendering.
Note in particular the tables, e.g. on page 2 and beyond, where the glyphs are now rendered too close together. The reason for this is that the fonts in question are non-embedded ArialNarrow, which we just replace with Helvetica which obviously is not narrow. Given that the font replacement isn't a perfect fit for non-embedded ArialNarrow, we still need to re-measure the glyph widths in this case.
2021-01-14 15:51:48 +01:00
Ross Johnson
6dae2677d5 [api-minor] Highlight search results correctly for normalized text (PR 9448)
This patch is a rebased *and* refactored version of PR 9448, such that it applies cleanly given that `PDFFindController` has changed since that PR was opened; obviously keeping the original author information intact.

This patch will thus ensure that e.g. fractions, and other things that we normalize before searching, will still be highlighted correctly in the textLayer.

Furthermore, this patch also adds basic unit-tests for this functionality.

*Note:* The `[api-minor]` tag is added, since third-party implementations of the `PDFFindController` must now always use the `pageMatchesLength` property to get accurate length information (see the `web/text_layer_builder.js` changes).

Co-authored-by: Ross Johnson <ross@mazira.com>
Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>
2021-01-12 18:08:08 +01:00
calixteman
1de1ae0be6
Merge pull request #12838 from calixteman/authors
[api-minor] Change the "dc:creator" Metadata field to an Array
2021-01-12 02:44:58 -08:00
Calixte Denizet
43d5512f5c [api-minor] Change the "dc:creator" Metadata field to an Array
- add scripting support for doc.info.authors
 - doc.info.metadata is the raw string with xml code
2021-01-11 21:34:07 +01:00
Calixte Denizet
b3dccd66ab Enforce line width to be at least 1px after applied transform
* add a comment to explain how minimal linewidth is computed.
 * when context.linewidth < 1 after transform, firefox and chrome
   don't render in the same way (issue #12810).
 * set lineWidth to 1 after transform and before stroking
   - aims fix issue #12295
   - a pixel can be transformed into a rectangle with both heights < 1.
     A single rescale leads to a rectangle with dim equals to 1 and
     the other to something greater than 1.
 * change the way to render rectangle with null dimensions:
   - right now we rely on the lineWidth set before "re" but
     it can be set after "re" and before "S" and in this case the rendering
     will be wrong.
   - render such rectangles as a single line.
2021-01-10 18:02:12 +01:00
Jonas Jenwald
cd9422a075 Improve handling of JPEG images without an EOI marker (issue 12841)
Given that the PDF document in the issue contains the same very large JPEG image *three* times, this patch includes a test-case where only the first page has been extracted from it.
2021-01-09 20:19:39 +01:00
Calixte Denizet
7172f0a928 JS -- update radio/checkbox values even if there are no actions 2021-01-08 16:43:16 +01:00
Jonas Jenwald
78c32c2697 Improve the handling of errors, in PartialEvaluator.loadFont, occuring in PartialEvaluator.preEvaluateFont (issue 12823)
Currently any errors thrown in `preEvaluateFont`, which is a *synchronous* method, will not be handled at all in the `loadFont` method and we were thus failing to return an `ErrorFont`-instance as intended here.

Also, add an *explicit* check in `PartialEvaluator.preEvaluateFont` to ensure that Type0-fonts always have a *valid* dictionary.
2021-01-07 11:38:38 +01:00
Tim van der Meij
ca18af6af3
Merge pull request #12774 from calixteman/doc_action_test
JS -- Add tests for print/save actions
2021-01-03 18:46:37 +01:00
Tim van der Meij
50303fc8f4
Merge pull request #12766 from Snuffleupagus/issue-11004
Ignore, rather than throwing on, unsupported Coding style default (COD) options in JPEG 2000 images (issue 11004)
2020-12-28 20:26:10 +01:00
Calixte Denizet
ffd4bc790c JS -- Add tests for print/save actions
* change PDFDocument::hasJSActions to return true when there are JS actions in catalog.
2020-12-24 18:51:00 +01:00
Calixte Denizet
7c3facb174 JS -- Add support for buttons
* radio buttons
 * checkboxes
2020-12-22 16:41:51 +01:00
Jonas Jenwald
cffb7af3b0 Ignore, rather than throwing on, unsupported Coding style default (COD) options in JPEG 2000 images (issue 11004)
Similar to other markers that we currently skip, by ignoring unsupported Coding style default (COD) options we'll at least render *something* here (although some JPEG 2000 images may look slightly wrong).
Note that if the unsupported COD options lead to additional errors, during parsing, we'll still abort parsing of the JPEG 2000 image.
2020-12-21 20:35:52 +01:00
Brendan Dahl
3ea1c43b15
Merge pull request #12751 from calixteman/da_not_a_string
Add a default DA for textfield to avoid issues when printing or saving
2020-12-21 09:44:08 -08:00
Calixte Denizet
a7c682c600 Add a default DA for textfield to avoid issues when printing or saving
* it aims to fix issue #12750
2020-12-19 23:38:45 +01:00
calixteman
e6e2809825
Merge pull request #12702 from calixteman/doc_actions
JS - Collect and execute actions at doc level
2020-12-18 21:33:32 +01:00
Calixte Denizet
1e2173f038 JS - Collect and execute actions at doc and pages level
* the goal is to execute actions like Open or OpenAction
 * can be tested with issue6106.pdf (auto-print)
 * once #12701 is merged, we can add page actions
2020-12-18 20:03:59 +01:00
Jonas Jenwald
48a76aea2b Ignore, rather than throwing on, Coding style component (COC) markers in JPEG 2000 images (issue 12752)
Similar to other markers that we currently skip, by ignoring the Coding style component (COC) marker we'll at least prevent outright errors (although some JPEG 2000 images may look slightly wrong).
2020-12-18 18:18:32 +01:00
Calixte Denizet
c6c9cc96bf Follow-up of #12707: Add an integration test for checkboxes as radio buttons
* Integration tests: Add a function to load a pdf and wait for a selected element
 * Integration tests: Add a function to close all the open pages
2020-12-15 00:00:04 +01:00
Jonas Jenwald
7f89be5dbf Add a test-case for bug 1292316
It appears that the PDF document in [bug 1292316](https://bugzilla.mozilla.org/show_bug.cgi?id=1292316) now renders "correctly"[1] when compared to e.g. Adobe Reader and PDFium. Most likely this bug was fixed by a *somewhat* recent patch, or patches, to the `XRef.indexObjects` method.

Before just closing [bug 1292316](https://bugzilla.mozilla.org/show_bug.cgi?id=1292316) as WFM, I figured that it probably can't hurt to add it as a new test-case to avoid accidentally regressing this document in the future.

---
[1] Given that the XRef table is corrupt, and that we're forced to recover, there's generally speaking probably some question as to what actually constitutes "correct" in this case.
2020-12-12 13:24:31 +01:00
Tim van der Meij
d1848f5022
Merge pull request #12725 from brendandahl/remeasure-std
Use widths defined by font for standard fonts.
2020-12-11 20:36:19 +01:00
Jonas Jenwald
67e5db75d8 Ignore color-operators in Type3 glyphs beginning with a d1 operator (issue 12705)
Please refer to the PDF specification at https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1977497 and https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G7.3998470

This patch removes the color-operators in the evaluator, since that should be more efficient than doing it repeatedly in the main-thread when rendering the Type3 glyphs.
2020-12-11 15:49:13 +01:00
Brendan Dahl
45d9ab6e45 Use widths defined by font for standard fonts.
There doesn't seem to be anything definitive about this in
the spec, but from experimenting, it seems acrobat lets
PDFs override the widths of the standard fonts.
2020-12-10 15:30:39 -08:00
Jonas Jenwald
796a0d3155 Ensure that the /Annots-entry, on /Page-instances, is actually an Array (issue 12714)
In the referenced PDF document, the second and third page has *corrupt* /Annots-entries which contain /Dict-data rather than the intended Arrays.
2020-12-10 11:42:00 +01:00
Tim van der Meij
012e15f7a3
Fix non-standard quadpoints orders for annotations
This change requires us to use valid quadpoints arrays in the existing
unit tests too due to the normalization.
2020-12-06 16:02:41 +01:00
Jonas Jenwald
dc84c8a02a Update the link for the "pr8808" test-case (issue 12680)
This seems like a very minor issue, since in general we can't really help if domains are blocked from certain networks, however in this particular case I suppose that using the Internet Archive should work.
2020-12-02 15:06:09 +01:00
Jonas Jenwald
8a132f584d Check the top-level /Pages dictionary when finding the trailer in XRef.indexObjects (issue 12402)
In addition to the existing /Root and /Pages validation, also check that the /Pages-entry actually is a dictionary and that it has a valid /Count-entry.
This way we can avoid picking a trailer candidate which e.g. the `Catalog.numPages` getter will just end up rejecting, thus breaking PDF document loading completely.
2020-11-25 15:14:53 +01:00
Brendan Dahl
4ba28de260
Merge pull request #12567 from calixteman/hidden
[api-minor] JS -- hidden annotations must be built in case a script show them
2020-11-19 08:35:47 -08:00
Calixte Denizet
611207d2c9 Fix popup for highlights without popup (follow-up of #12505)
* remove 1st param of _createPopup (almost useless for a method)
 * prepend popup div to avoid to have them on top of some highlights (and so "disable" partially mouse events)
 * add a ref test for issue #12504
2020-11-10 17:33:54 +01:00
Calixte Denizet
b11592a756 JS -- hidden annotations must be built in case a script show them
* in some pdf, there are actions with "event.source.hidden = ..."
 * in order to handle visibility when printing, annotationStorage is extended to store multiple properties (value, hidden, editable, ...)
2020-11-10 12:48:34 +01:00
Jakub Olek
b642d49108 Make sure that Popup is rendered next to trigger for textAnnotation 2020-11-05 06:45:17 +01:00
Tim van der Meij
ea4d88a330
Merge pull request #12395 from calixteman/checks
Render not displayed annotations in using normal appearance when printing
2020-10-28 00:11:10 +01:00
Calixte Denizet
6be2f84b4e Render not displayed annotations in using normal appearance when printing 2020-10-27 19:00:31 +01:00
Jonas Jenwald
f2fa053c51 Ensure that MurmurHash3_64.update handles ArrayBuffer input correctly, to avoid hash-collisions (issue 12533)
Different fonts incorrectly end up with *identical* hashes, despite having different /ToUnicode data.
The issue, and it's very interesting that we've apparently not seen it before, appears to be caused by the fact that different /ToUnicode entries share the *same* underlying `ArrayBuffer`, which thus becomes problematic at the `const dataUint32 = new Uint32Array(data.buffer, 0, blockCounts);` line. The simplest solution thus seem to be to just *copy* the input, when it's an `ArrayBuffer`, rather than using it as-is. (Note that if we'd stringified the input, when calling `MurmurHash3_64.update`, the issue would also have been fixed. In this case, we're already creating an unique TypedArray.)
2020-10-26 16:27:33 +01:00
Calixte Denizet
37c86b2daa Fallback font for buttons must be ZapfDingbats.
Fix bug https://bugzilla.mozilla.org/show_bug.cgi?id=1669099.
2020-10-24 12:00:03 +02:00
Jani Pehkonen
935568c2f1 Fix invalid XUID entries in CFF fonts
In CFF fonts, entry `XUID` should be an array that has no more than
16 elements. In the issue, the length is 20, which causes the fonts to fail.
See Appendix B, "Implementation Limits" in PostScript Language Reference Manual
https://web.archive.org/web/20170218093716/https://www.adobe.com/products/postscript/pdfs/PLRM.pdf
Actually entries `XUID` and `UniqueID` are obsolete altogether.
https://blogs.adobe.com/CCJKType/2016/06/no-more-xuid-arrays.html
2020-10-05 17:38:01 +03:00
Jonas Jenwald
bd3b15b897 Use the cidToGidMap, if it exists, when building the glyph mapping for non-embedded composite fonts (issue 12418) 2020-09-28 14:40:43 +02:00
Jonas Jenwald
fca53a8eb0 Compute the transformOrigin correctly, for negative values, when rendering AnnotationElements (bug 1627030)
This changes the `transformOrigin` calculations in `AnnotationElement._createContainer` and `PopupAnnotationElement.render`, to ensure that e.g. the clickable area of annotations and/or popups are both positioned correctly.

The problem occurs for *negative* values, since they're not negated correctly because of how the `transformOrigin` strings were build; see issue 12406 for a more in-depth explanation. Previously, for negative values, the `transformOrigin` strings would thus be ignored since they're not valid.
2020-09-24 10:28:29 +02:00
Jonas Jenwald
2497e8eab9 Prevent errors if the InkList property, in InkAnnotations, is missing and/or not an Array (issue 12392)
To prevent a future bug, the `Vertices` property in PolylineAnnotations are handled the same way.
2020-09-19 15:34:32 +02:00
Tim van der Meij
b0c7a74a0c
Merge pull request #12361 from Snuffleupagus/_getSaveFieldResources
Ensure that all necessary /Font resources are included when saving a `WidgetAnnotation`-instance (issue 12294)
2020-09-15 00:09:31 +02:00
Tim van der Meij
3ecd984758
Implement resetting of created streams for annotations 2020-09-14 23:08:50 +02:00
Jonas Jenwald
c992b8e460 Ensure that all necessary /Font resources are included when saving a WidgetAnnotation-instance (issue 12294)
This patch contains a possible approach for fixing issue 12294, which compared to other PRs is purposely limited to the affected `WidgetAnnotation` code.

As mentioned elsewhere, considering that we're (at least for now) trying to fix *one specific* case, I think that we should avoid modifying the `Dict` primitive[1] and/or avoid a solution that (indirectly) modifies an existing `Dict`-instance[2].
This patch simply fixes the issue at hand, since that seems easiest for now, and I'd suggest that we worry about a more general approach if/when that actually becomes necessary.

Hence the solution implemented here, for `WidgetAnnotation`, is to simply use a combination of the local *and* AcroForm /DR resources during OperatorList-parsing to ensure that things work correctly regardless of where a particular /Font resource is found.
For saving of form-data, on the other hand, we want to avoid increasing the file-size unnecessarily and need to be smarter than just merging all of the available resources. To achive this, a new `WidgetAnnotation._getSaveFieldResources` method will when necessary produce a combined resources `Dict` with only the minimum amount of data from the AcroForm /DR resources included.

---
[1] You want to avoid anything that could cause the general `Dict` implementation to become slower, or more complex, just for handling an edge-case in my opinion.

[2] If an existing `Dict`-instance is modified unexpectedly, that could very easily lead to problems elsewhere since e.g. `Dict`-instances created during parsing are not expected to be changed.
2020-09-14 15:22:40 +02:00
Tim van der Meij
20c891542b
Merge pull request #12269 from calixteman/highlight
Add support for missing appearances for hightlights, strikeout, squiggly and underline annotations.
2020-09-06 22:25:36 +02:00
Calixte Denizet
65ecd981fe Add support for missing appearances for hightlights, strikeout, squiggly and underline annotations. 2020-09-06 15:40:15 +02:00
Tim van der Meij
06b53d770a
Merge pull request #12259 from brendandahl/cmap-fix
Fix handling of symbolic fonts and unicode cmaps.
2020-08-30 16:01:24 +02:00
Brendan Dahl
45e8a31cc0 Fix handling of symbolic fonts and unicode cmaps.
In issue 12120, the font has a 1,0 cmap and is marked symbolic which
according to the spec means we should directly use the cmap instead of
the extra steps that are defined in 9.6.6.4.

However, just fixing that caused bug 1057544 to break. The font in bug
1057544 has a 0,1 cmap (Unicode 1.1) which we were not using, but is
easy to support. We're also easily able to support some of the other
unicode cmaps, so I added those as well.

There was also a second issue with bug 1057544, the cmap doesn't have
a mapping for the "quoteright" glyph, but it is defined in the post
table. To handle this, I've moved post table as a  fallback for any
font that has an encoding.
2020-08-27 14:33:11 -07:00
Tim van der Meij
73bc8e1d9f
Include forms/print reference tests for the document from #12233
In addition to the unit tests these reference tests make sure that this
document, that triggered some edge cases in our code, can be rendered
and printed successfully now.
2020-08-23 13:00:02 +02:00
Jonas Jenwald
1058f16605 Add (basic) support for transfer functions to Images (issue 6931, bug 1149713)
This is *similar* to the existing transfer function support for SMasks, but extended to simple image data.
Please note that the extra amount of data now being sent to the worker-thread, for affected /ExtGState entries, is limited to *at most* 4 `Uint8Array`s each with a length of 256 elements.

Refer to https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G9.1658137 for additional details.
2020-08-17 10:34:12 +02:00
Brendan Dahl
7fb01f9f2a
Merge pull request #12186 from brendandahl/loca-2
Fix bad truetype loca tables.
2020-08-10 20:34:19 -07:00
Brendan Dahl
f6dff81223 Fix bad truetype loca tables.
Some fonts have loca tables that aren't sorted or use 0 as an offset to
signal a missing glyph. This fixes the bad loca tables by sorting them
and then rewriting the loca table and potentially re-ordering the glyf
table to match.

Fixes #11131 and bug 1650302.
2020-08-10 14:15:49 -07:00
Jonas Jenwald
128e41d6be Add a test-case for issue 4398
Issue 4398 was fixed by PR 4437, however a test-case wasn't included as far as I can tell. Given that PR 12186 is now in the process of re-factoring that code, adding a test-case cannot hurt as far as I'm concerned.
2020-08-08 11:50:19 +02:00
Brendan Dahl
ac494a2278 Add support for optional marked content.
Add a new method to the API to get the optional content configuration. Add
a new render task param that accepts the above configuration.
For now, the optional content is not controllable by the user in
the viewer, but renders with the default configuration in the PDF.

All of the test files added exhibit different uses of optional content.

Fixes #269.

Fix test to work with optional content.

- Change the stopAtErrors test to ensure the operator list has something,
  instead of asserting the exact number of operators.
2020-08-04 09:26:55 -07:00
Jonas Jenwald
1d66fce781 Tweak the heuristic, in src/core/jpg.js, that handles JPEG images with a wildly incorrect SOF (Start of Frame) scanLines parameter (issue 10989) 2020-07-06 13:06:49 +02:00
Jonas Jenwald
fef24658e7 Adjust the heuristics used when dealing with rectangles, i.e. re operators, with zero width/height (issue 12010) 2020-07-02 00:02:49 +02:00
Jonas Jenwald
e451cabe37 Add a reduced test-case for issue 4260 (PR 4521 follow-up) 2020-06-30 09:26:41 +02:00
Jonas Jenwald
28d2ada59c Attempt to detect inline images which contain "EI" sequence in the actual image data (issue 11124)
This should reduce the possibility of accidentally truncating some inline images, while *not* causing the "EI" detection to become significantly slower.[1]
There's obviously a possibility that these added checks are not sufficient to catch *every* single case of "EI" sequences within the actual inline image data, but without specific test-cases I decided against over-engineering the solution here.

*Please note:* The interpolation issues are somewhat orthogonal to the main issue here, which is the truncated image, and it's already tracked elsewhere.

---
[1] I've looked at the issue a few times, and this is the first approach that I was able to come up with that didn't cause *unacceptable* performance regressions in e.g. issue 2618.
2020-06-26 13:15:06 +02:00
Jonas Jenwald
e18fa3fc45 Tweak the QueueOptimizer to recognize OPS.paintImageMaskXObject operators as *repeated* when the "skew" transformation matrix elements are non-zero (issue 8078)
*First of all, I should mention that my understanding of the finer details of the `QueueOptimizer` (and its related `CanvasGraphics` methods) is somewhat limited.*
Hence I'm not sure if there's actually a very good reason for *only* considering ImageMasks where the "skew" transformation matrix elements are zero as *repeated*, however simply looking at the code I just don't see why these elements cannot be non-zero as long as they are *all identical* for the ImageMasks.
Furthermore, looking at the *group* case (which is what we're currently falling back to), there's no particular limitation placed upon the transformation matrix elements.

While this patch obviously isn't enough to *completely* fix the issue, since there should be a visible Pattern rendered as well[1], it seem (at least to me) like enough of an improvement that submitting this is justified.
With these changes the referenced PDF document will no longer hang the *entire* browser, and rendering also finishes in a *reasonable* time (< 10 seconds for me) which seem fine given the *huge* number of identical inline images present.[2]

---
[1] Temporarily changing the Pattern to a solid color *does* render the correct/expected area, which suggests that the remaining problem is a pre-existing issue related to the Pattern-handling itself rather than the `QueueOptimizer` functionality.

[2] The document isn't exactly rendered immediately in e.g. Adobe Reader either.
2020-06-20 12:18:48 +02:00
Jonas Jenwald
4b51bcc733 Ensure that PDFImage.buildImage won't accidentally swallow errors, e.g. from ColorSpace parsing (issue 6707, PR 11601 follow-up)
Because of a really stupid `Promise`-related mistake on my part, when re-factoring `PDFImage.buildImage` during the `NativeImageDecoder` removal, we're no longer re-throwing errors occuring during image parsing/decoding as intended.
The result is that some (fairly) corrupt documents will never finish loading, and unfortunately there were apparently no sufficiently corrupt images in the test-suite to catch this.
2020-06-13 15:02:37 +02:00
Carlos Rodríguez
802aa14a99 Jpeg encoded with RGB -instead of YCbCr- write the components index as "RGB" in ASCII to say it so
On ISO/IEC 10918-6:2013 (E), section 6.1: (http://www.itu.int/rec/T-REC-T.872-201206-I/en)

"Images encoded with three components are assumed to be RGB data encoded as YCbCr unless the image contains an APP14 marker segment as specified in 6.5.3, in which case the colour encoding is considered either RGB or YCbCr according to the application data of the APP14 marker segment"

But common jpeg libraries consider RGB too if components index are ASCII R (0x52), G (0x47) and B (0x42): https://stackoverflow.com/questions/50798014/determining-color-space-for-jpeg/50861048

Issue #11931
2020-06-04 15:08:47 +02:00
Jonas Jenwald
56ebf01ae0 Avoid hanging the worker-thread for CMap data with ridiculously large ranges (issue 11922)
This patch was inspired by ad2b64f124/xpdf/CharCodeToUnicode.cc (L480-L484)
2020-05-22 15:23:17 +02:00
Jonas Jenwald
dda6626f40 Attempt to cache repeated images at the document, rather than the page, level (issue 11878)
Currently image resources, as opposed to e.g. font resources, are handled exclusively on a page-specific basis. Generally speaking this makes sense, since pages are separate from each other, however there's PDF documents where many (or even all) pages actually references exactly the same image resources (through the XRef table). Hence, in some cases, we're decoding the *same* images over and over for every page which is obviously slow and wasting both CPU and memory resources better used elsewhere.[1]

Obviously we cannot simply treat all image resources as-if they're used throughout the entire PDF document, since that would end up increasing memory usage too much.[2]
However, by introducing a `GlobalImageCache` in the worker we can track image resources that appear on more than one page. Hence we can switch image resources from being page-specific to being document-specific, once the image resource has been seen on more than a certain number of pages.

In many cases, such as e.g. the referenced issue, this patch will thus lead to reduced memory usage for image resources. Scrolling through all pages of the document, there's now only a few main-thread copies of the same image data, as opposed to one for each rendered page (i.e. there could theoretically be *twenty* copies of the image data).
While this obviously benefit both CPU and memory usage in this case, for *very* large image data this patch *may* possibly increase persistent main-thread memory usage a tiny bit. Thus to avoid negatively affecting memory usage too much in general, particularly on the main-thread, the `GlobalImageCache` will *only* cache a certain number of image resources at the document level and simply fallback to the default behaviour.

Unfortunately the asynchronous nature of the code, with ranged/streamed loading of data, actually makes all of this much more complicated than if all data could be assumed to be immediately available.[3]

*Please note:* The patch will lead to *small* movement in some existing test-cases, since we're now using the built-in PDF.js JPEG decoder more. This was done in order to simplify the overall implementation, especially on the main-thread, by limiting it to only the `OPS.paintImageXObject` operator.

---
[1] There's e.g. PDF documents that use the same image as background on all pages.

[2] Given that data stored in the `commonObjs`, on the main-thread, are only cleared manually through `PDFDocumentProxy.cleanup`. This as opposed to data stored in the `objs` of each page, which is automatically removed when the page is cleaned-up e.g. by being evicted from the cache in the default viewer.

[3] If the latter case were true, we could simply check for repeat images *before* parsing started and thus avoid handling *any* duplicate image resources.
2020-05-21 18:13:45 +02:00
Jonas Jenwald
4aabd063fc Gracefully handle annotation parsing errors in Page.getOperatorList (issue 11871)
This should ensure that a page will always render successfully, even if there's errors during the Annotation fetching/parsing.
Additionally the `OperatorList.addOpList` method is also adjusted to ignore invalid data, to make it slightly more robust.
2020-05-04 17:09:48 +02:00
Tim van der Meij
96923eb2a6
Merge pull request #11805 from Snuffleupagus/issue-11794
Always skip over any additional, unexpected, RSTx (restart) markers in corrupt JPEG images (issue 11794)
2020-04-16 00:08:58 +02:00
Jonas Jenwald
44b4a74f48 A couple of small String.fromCodePoint improvements (PR 11698 and 11769 follow-up)
- Add a reduced test-case for issue 11768, to prevent future regressions.
   (Given that PR 11769 is only a work-around, rather than a proper solution, it may not be entirely accurate for the issue to be closed as fixed.)

 - Add more validation of the charCode, as found by the heuristics, in `PartialEvaluator._buildSimpleFontToUnicode` to prevent future issues.
2020-04-15 13:45:08 +02:00
Jonas Jenwald
06f6f8719f Always skip over any additional, unexpected, RSTx (restart) markers in corrupt JPEG images (issue 11794) 2020-04-14 23:27:08 +02:00
Jonas Jenwald
91efde5246 Add a heuristic to scale even single-char text, when the horizontal/vertical scaling differs significantly (issue 11713)
At this point in time, compared to when the "ignore single-char" code was added, we *should* generally be doing a much better job of combining text into as few chunks as possible.
However, there's still bad cases where we're not able to combine text as much as one would like, which is why I'm *not* proposing to simply measure/scale all text. Instead this patch will to only measure/scale single-char text in cases where the horizontal/vertical scale is off significantly, since that's were you'd expect bad text-selection behaviour otherwise.

Note that most of the movement caused by this patch is with Type3 fonts, which is a somewhat special font type and one where our current text-selection behaviour is probably the least good.
2020-04-07 00:36:23 +02:00
Jonas Jenwald
938d519192 Create the glyph mapping correctly for composite Type1, i.e. CIDFontType0, fonts (issue 11740)
This updates `Type1Font.getGlyphMapping` with a code-path "borrowed" from `CFFFont.getGlyphMapping`.
2020-04-06 11:21:02 +02:00
Jani Pehkonen
a22c0eab48 The first glyph in CFF CIDFonts must be named 0 instead of ".notdef"
Fixes #11718 in which the `ff` ligature glyph is at index zero in a CFF font. Beacuse this is a CIDFont, glyph names are CIDs, which are integers. Thus the string `".notdef"` is not correct. The rest of the charset data is already parsed correctly as integers when the boolean argument `cid` is true.
2020-03-24 15:56:50 +02:00
Jonas Jenwald
15e8692eff Don't accidentally accept invalid glyphNames which *appear* to follow the Cdd{d}/cdd{d} format in PartialEvaluator._buildSimpleFontToUnicode (issue 11697)
The /Differences array of the problematic font contains a `/c.1` entry, which is consequently detected as a *possible* Cdd{d}/cdd{d} glyphName by the existing heuristics.
Because of how the base 10 conversion is implemented, which is necessary for the base 16 special case, the parsed charCode becomes `0.1` thus causing `String.fromCodePoint` to throw since that obviously isn't a valid code point.

To fix the referenced issue, and to hopefully prevent similar ones in the future, the patch adds *additional* validation of the charCode found by the heuristics.
2020-03-13 23:35:47 +01:00
Jonas Jenwald
65e6ea2cb2 Prevent lookup errors in PartialEvaluator.hasBlendModes from breaking all parsing/rendering of a page (issue 11678)
The PDF document in question is *corrupt*, since it contains an XObject with a truncated dictionary and where the stream contents start without a "stream" operator.
2020-03-09 12:00:12 +01:00
Tim van der Meij
1a97c142b3
Merge pull request #11523 from Snuffleupagus/issue-10880
Add a heuristic, in `src/core/jpg.js`, to handle JPEG images with a wildly incorrect SOF (Start of Frame) `scanLines` parameter (issue 10880)
2020-03-06 23:03:09 +01:00
Tim van der Meij
c95b9b1e17
Merge pull request #11653 from Snuffleupagus/ensureStateFont
Ensure that there's always a setFont (Tf) operator before text rendering operators (issue 11651)
2020-03-03 23:33:13 +01:00
Jani Pehkonen
71e7686950 Fix Type1 font parsing when .notdef is not at index zero
Fixes #11477
The PDF draws many space characters but the embedded fonts don't have a glyph named `space`, so `.notdef` should be drawn instead. PDF.js assumed that Type1 fonts define `.notdef` as the first glyph (index 0). However, now the fonts have the glyph `A` at index 0 and `.notdef` is the last one, so `A` appears where spaces are expected.

Because the rest of the font machinery in `core/fonts.js` assumes `.notdef` is at index zero, it's easiest to modify `core/type1_parser.js` so that it "repairs" fonts and makes sure `.notdef` is at index 0.
2020-03-03 21:55:51 +02:00
Jonas Jenwald
65e514e063 Ensure that there's always a setFont (Tf) operator before text rendering operators (issue 11651)
The PDF document in question is *corrupt*, since it contains multiple instances of incorrect operators.
We obviously don't want to slow down parsing of *all* documents (since most are valid), just to accommodate a particular bad PDF generator, hence the reason for the inline check before calling the `ensureStateFont` method.
2020-03-03 10:05:18 +01:00
Takashi Tamura
d8c9f119b0 Fix the vertical writing mode with horizontal scaling. #11555.
It is not valid to multiply textHScale when the writing mode is vertical.

See 9.4.4 Text Space Details, https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1694762
2020-02-29 07:48:29 +09:00
Tim van der Meij
e1586016c5
Merge pull request #11577 from Snuffleupagus/Pages-tree-refs
Prevent circular references in the /Pages tree
2020-02-27 23:36:11 +01:00
Jonas Jenwald
c3c3b8cd81 Add a heuristic, in src/core/jpg.js, to handle JPEG images with a wildly incorrect SOF (Start of Frame) scanLines parameter (issue 10880)
*This whole patch feels somewhat arbitrary, and I'd be slightly worried about possibly breaking something else.*

To limit the impact of these changes, we only re-parse JPEG images using a reduced `scanLines` value if and only if: An unexpected EOI (End of Image) marker was encountered during decoding of Scan data *and* the "actual" `scanLines` value is at least one order of magnitude smaller than expected.
2020-02-22 14:16:07 +01:00
Jonas Jenwald
3c7b7be100 Prevent circular references in the /Pages tree 2020-02-19 01:49:39 +01:00
Takashi Tamura
512dbe3060 Fix text spacing with vertical fonts. #7687 and #11526.
When the writing mode is vertical, we have to reverse
the sign of spacing since we are subtracting it from
current.y. We have to add it to current.y.
See 9.4.4 Text Space Details, https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1694762
2020-02-11 08:49:23 +09:00
Tim van der Meij
dced0a3821
Merge pull request #11579 from Snuffleupagus/issue-11578
Ignore spaces when normalizing the font name in `Font.fallbackToSystemFont` (issue 11578)
2020-02-09 17:33:09 +01:00
Tim van der Meij
61056a9238
Merge pull request #11551 from Snuffleupagus/issue-11549
Allow skipping of errors when reading broken/corrupt ToUnicode data (issue 11549)
2020-02-09 17:32:35 +01:00
Jonas Jenwald
7937165537 Ignore spaces when normalizing the font name in Font.fallbackToSystemFont (issue 11578) 2020-02-08 19:59:04 +01:00
Jonas Jenwald
88c35d872f Ensure that the PDF header contains an actual number (PR 11463 follow-up)
While it would be nice to change the `PDFFormatVersion` property, as returned through `PDFDocumentProxy.getMetadata`, to a number (rather than a string) that would unfortunately be a breaking API change.
However, it does seem like a good idea to at least *validate* the PDF header version on the worker-thread, rather than potentially returning an arbitrary string.
2020-02-07 12:25:07 +01:00
Brendan Dahl
09a6e17d22
Merge pull request #11528 from janpe2/type1-nonemb-notdef
Hide .notdef glyphs in non-embedded Type1 fonts and don't ignore Widths
2020-02-06 13:30:07 -08:00
Jonas Jenwald
4c54395ff6 Allow skipping of errors when reading broken/corrupt ToUnicode data (issue 11549)
This will allow font loading/parsing to continue, rather than immediately failing, when broken/corrupt CMap data is encountered.
2020-01-30 13:19:05 +01:00
Tim van der Meij
474fe1757e
Merge pull request #11508 from Snuffleupagus/jpg-default-marker
Simplify the handling of unsupported/incorrect markers in `src/core/jpg.js`
2020-01-26 21:32:13 +01:00
Jonas Jenwald
62b2b984cc Render Popup annotations last, once all other annotations have been rendered (issue 11362)
In the current `AnnotationLayer` implementation, Popup annotations require that the parent annotation have already been rendered (otherwise they're simply ignored).
Usually the annotations are ordered, in the `/Annots` array, in such a way that this isn't a problem, however there's obviously no guarantee that all PDF generators actually do so. Hence we simply ensure, when rendering the `AnnotationLayer`, that the Popup annotations are handled last.
2020-01-26 15:49:55 +01:00
Jonas Jenwald
13930e5202 Simplify the handling of unsupported/incorrect markers in src/core/jpg.js
- Re-factor the "incorrect encoding" check, since this can be easily achieved using the general `findNextFileMarker` helper function (with a suitable `startPos` argument).

 - Tweak a condition, to make it easier to see that the end of the data has been reached.

 - Add a reference test for issue 1877, since it's what prompted the "incorrect encoding" check.
2020-01-25 22:52:24 +01:00
Jani Pehkonen
809b96b40c Hide .notdef glyphs in non-embedded Type1 fonts and don't ignore Widths
Fixes #11403
The PDF uses the non-embedded Type1 font Helvetica. Character codes 194 and 160 (`Â` and `NBSP`) are encoded as `.notdef`. We shouldn't show those glyphs because it seems that Acrobat Reader doesn't draw glyphs that are named `.notdef` in fonts like this.

In addition to testing `glyphName === ".notdef"`, we must test also `glyphName === ""` because the name `""` is used in `core/encodings.js` for undefined glyphs in encodings like `WinAnsiEncoding`.

The solution above hides the `Â` characters but now the replacement character (space) appears to be too wide. I found out that PDF.js ignores font's `Widths` array if the font has no `FontDescriptor` entry. That happens in #11403, so the default widths of Helvetica were used as specified in `core/metrics.js` and `.nodef` got a width of 333. The correct width is 0 as specified by the `Widths` array in the PDF. Thus we must never ignore `Widths`.
2020-01-21 21:35:25 +02:00
Tim van der Meij
dfe42a5ca4
Include a unit test for OpenAction dictionaries without Type entries (PR 11443 follow-up)
The original issue did not contain a (reduced) test case that we could
include and linked test cases are not ideal for unit tests, so the
original PR could only be verified manually.

I found this a bit unfortunate considering that the print data is
exposed through the API, so I thought about how we could have an
automated test and managed to create a reduced test case with the
OpenAction dictionary from the file in the original issue.

Therefore, this commit includes a unit test for parsing OpenAction
dictionaries without `Type` entries. I verified that this PDF file
behaves the same as the original one, i.e., no print dialog is shown for
older viewers and the print dialog is shown for the most recent viewer.
2019-12-27 00:05:51 +01:00
Jonas Jenwald
5c0336872e Handle corrupt ASCII85Decode inline images with truncated EOD markers (issue 11385)
In the PDF document in question, there's an ASCII85Decode inline image where the '>' part of EOD (end-of-data) marker is missing; hence the PDF document is corrupt.
2019-12-05 15:53:18 +01:00
Jonas Jenwald
9199b02a42 Subtract stream.start when getting the startXRef property for documents with a Linearization dictionary (issue 11330)
For documents with a Linearization dictionary the computed `startXRef` position will be relative to the raw file, rather than the actual PDF document itself (which begins with `%PDF-`).
Hence it's necessary to subtract `stream.start` in this case, since otherwise the `XRef.readXRef` method will increment the position too far resulting in parsing errors.
2019-11-16 09:29:10 +01:00
Tim van der Meij
6972bbea74
Include a reduced test case for annotations without a Border/BS entry (PR 6180 follow-up) 2019-11-10 14:37:42 +01:00
Jonas Jenwald
835d8c2be5 Allow skipping of errors when parsing broken/unsupported ColorSpaces (issue 6707, issue 11287)
This will allow us to attempt to recover as much as possible of a page, rather than immediately failing, when a broken/unsupported ColorSpace is encountered. This patch thus extends the framework added in PRs such as e.g. 8240 and 8922, to also cover parsing of ColorSpaces.
2019-11-01 09:01:24 +01:00
Jonas Jenwald
5c266f0e8c Support Blend Modes which are specified in an Array of Names (issue 11279)
According to the specification, the first *supported* Blend Mode should be choosen in this case; please see https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G10.4848607
2019-10-26 14:24:31 +02:00
Tim van der Meij
11f3851a97
Merge pull request #11243 from Snuffleupagus/issue-11242
Add a fallback for non-embedded *composite* Verdana fonts (issue 11242)
2019-10-18 23:56:46 +02:00
Jonas Jenwald
2fcb5afc7b Add a fallback for non-embedded *composite* Verdana fonts (issue 11242)
Obviously this won't look exactly right, but considering that the PDF file doesn't bother embedding non-standard fonts this is the best that we can do here.
2019-10-17 17:00:55 +02:00
Jonas Jenwald
ffc847eaa5 Allow over-writing entries, in XRef.indexObjects, only when the generation number matches (issues 11230, 11139, 9552, 9129, 7303)
This patch is making me somewhat worried about future regressions, since it's certainly easy to imagine this completely breaking certain kinds of corrupt/edited PDF documents while fixing others.[1]

Obviously it passes all existing reference tests (and even improves one), however compared to many other patches there's no telling how much it could break.
The only reason that I'm even submitting this patch, is because of the number of open issues that it would address.

Generally speaking though, the best course of action would probably be if `XRef.indexObjects` was re-written to be much more robust (since it currently feels somewhat hand-wavy in parts). E.g. by actually checking/validating more of the objects before committing to them.

---
[1] Especially given that it's reverting part of PR 5910, however in the case of issue 5909 it seems that other (more recent) changes have actually made that PR redundant.
2019-10-14 22:10:04 +02:00
Jonas Jenwald
259551d144 Convert a number of reference tests, for documents with corrupt XRef tables, from load to eq
As part of attempting to fix a number issues containing PDF documents with corrupt XRef tables, I'd like to improve the reference test-coverage slightly *first*.
Obviously this will increase the runtime of the tests a bit, however I'd rather "waste" resources on the bots instead of developer time fixing regressions which could have been avoided.
2019-10-12 18:49:59 +02:00
Jonas Jenwald
f5be2d62a3 Improve the heuristics, in PartialEvaluator._buildSimpleFontToUnicode, for glyphNames of the Cdd{d}/cdd{d} format (issue 9655)
*Please note:* I've been thinking about possible ways of addressing this issue for a while now, but all of the solutions I came up with became too complicated and thus hurt readability of the code.
However, it occured to me that we're essentially trying to add a heuristic *on top* of another heuristic, and that it shouldn't matter how efficient the code is as long as it works.

In the PDF file in the issue the Encoding contains glyphNames of the `Cdd` format, which our existing heuristics will treat as base 10 values. However, in this particular file they actually contain base 16 values, which we thus attempt to detect and fix such that text-selection works.
2019-10-06 10:47:29 +02:00
Tim van der Meij
3da680cdfc
Merge pull request #11158 from janpe2/gradient-stops
Avoid floating point inaccuracy in gradient color stops
2019-09-19 13:15:11 +02:00
Jonas Jenwald
af22dc9b0c For Type1 fonts, replace missing font dictionary /Widths entries with ones from the font data (issue 11150)
Hopefully this patch makes sense, and in order to reduce the regression risk the implementation ensures that only completely missing widths are being replaced.
2019-09-18 10:15:09 +02:00
Jani Pehkonen
911df237f3 Avoid floating point inaccuracy in gradient color stops 2019-09-17 21:01:17 +03:00
Tim van der Meij
09df1ee0ce
Include a reduced, non-linked PDF file for the attachments API unit test 2019-08-25 15:14:57 +02:00
Tim van der Meij
fbe8c6127c
Merge pull request #11059 from Snuffleupagus/boundingBox-more-validation
Fallback gracefully when encountering corrupt PDF files with empty /MediaBox and /CropBox entries
2019-08-09 22:39:01 +02:00
Jonas Jenwald
d637b25e36 Fallback gracefully when encountering corrupt PDF files with empty /MediaBox and /CropBox entries
This is based on a real-world PDF file I encountered very recently[1], although I'm currently unable to recall where I saw it.
Note that different PDF viewers handle these sort of errors differently, with Adobe Reader outright failing to render the attached PDF file whereas PDFium mostly handles it "correctly".

The patch makes the following notable changes:
 - Refactor the `cropBox` and `mediaBox` getters, on the `Page`, to reduce unnecessary duplication. (This will also help in the future, if support for extracting additional page bounding boxes are added to the API.)
 - Ensure that the page bounding boxes, i.e. `cropBox` and `mediaBox`, are never empty to prevent issues/weirdness in the viewer.
 - Ensure that the `view` getter on the `Page` will never return an empty intersection of the `cropBox` and `mediaBox`.
 - Add an *optional* parameter to `Util.intersect`, to allow checking that the computed intersection isn't actually empty.
 - Change `Util.intersect` to have consistent return types, since Arrays are of type `Object` and falling back to returning a `Boolean` thus seem strange.

---

[1] In that case I believe that only the `cropBox` was empty, but it seemed like a good idea to attempt to fix a bunch of related cases all at once.
2019-08-09 10:18:13 +02:00
Jonas Jenwald
0f78fdb229 Handle some corrupt/truncated JPEG images that are missing the EOI (End of Image) marker (issue 11052)
Note that even Adobe Reader cannot render the PDF file completely, which is always a good indication that it's corrupt.
2019-08-08 10:37:41 +02:00
Jonas Jenwald
5ac9c7c384 Support corrupt PDF files with invalid/non-existent Group /CS entries (issue 11045)
The PDF file in question tries to reference a non-existent ColorSpace, which should be quite rare in practice.
2019-08-06 14:33:05 +02:00
Jonas Jenwald
9ad50521b1 Add a work-around, in glyphlist.js, for bad PDF generators which use a non-standard /f_f string in the Encoding dictionary when referring to the ff ligature (issue 11016)
This patch will not incur any (measurable) overhead, since the glyphlist is already quite long and one more entry won't really matter, which is important given that this sort of PDF corruption ought to be very rare.

Furthermore, this patch purposely does *not* add a bunch of similarly modified ligature names on pure speculation. Any similar additions, for other ligatures, should only be made if there's real-world examples of PDF files where that's actually necessary.
2019-07-30 17:06:58 +02:00
Tim van der Meij
ed3954fc7a
Merge pull request #10851 from brendandahl/shading-bbox
Apply bounding box before using shading patterns.
2019-07-12 22:52:07 +02:00
Tim van der Meij
87f36e3520
Merge pull request #10850 from brendandahl/scale-line-width
Scale stroking line width when using a tiling pattern.
2019-07-12 22:50:32 +02:00
Brendan Dahl
6fab0a0dac Apply bounding box before using shading patterns.
Fixes #8092
2019-07-08 14:05:48 -07:00
Brendan Dahl
446efab707 Scale stroking line width when using a tiling pattern. 2019-07-08 13:47:54 -07:00
Jonas Jenwald
876c962235 Ignore Annotations with too large border widths, to prevent the annotationLayer from rendering it over the surrounding document (bug 1552113)
The border `width` will instead fallback to the default value of `1`, rather than ignoring it altoghether, to also ensure that e.g. `LinkAnnotation`s become clickable as intended.

Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1552113
2019-06-01 15:51:22 +02:00
Jani Pehkonen
05c527f035 Fix glyph 0 in CIDFontType2 that has a CIDToGIDMap stream 2019-05-07 18:44:37 +03:00
Jonas Jenwald
5335285cda Attempt to handle corrupt PDF documents that contains path operators inside of text object (issue 10542)
First of all, while this simple approach appears to work OK in practice I'm not sure if it's the best way of addressing the problem (assuming that you even want to).
Second of all, while the solution implemented here only requires tracking/checking one new boolean in order for this to work, I'm nonetheless not entirely happy about this since it will add additional overhead (albeit *very* small) to the parsing of path operators in PDF documents just for a handful of *corrupt* ones.
2019-04-30 23:35:33 +02:00
Tim van der Meij
55d9b35d37
Merge pull request #10727 from Snuffleupagus/type3-image-resources
Support (rare) Type3 fonts which contains image resources (issue 10717)
2019-04-18 23:07:26 +02:00
Tim van der Meij
ae2a4dc3dd
Implement free text annotations 2019-04-13 18:45:22 +02:00
Jonas Jenwald
be604bd195 Support (rare) Type3 fonts which contains image resources (issue 10717)
The Type3 font type is not commonly used in PDF documents, as can be seen from telemetry data such as: https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2019-04-09&include_spill=0&keys=__none__!__none__!__none__&max_channel_version=nightly%252F68&measure=PDF_VIEWER_FONT_TYPES&min_channel_version=nightly%252F57&processType=*&product=Firefox&sanitize=1&sort_by_value=0&sort_keys=submissions&start_date=2019-03-18&table=0&trim=1&use_submission_date=0 (see also https://github.com/mozilla/pdf.js/wiki/Enumeration-Assignments-for-the-Telemetry-Histograms#pdf_viewer_font_types).

Type3 fonts containing image resources are *very* rare in practice, usually they only contain path rendering operators, but as the issue shows they unfortunately do exist.
Currently these Type3-related image resources are not handled in any special way, and given that fonts are document rather than page specific rendering breaks since the image resources are thus not available to the *entire* document.
Fortunately fixing this isn't too difficult, but it does require adding a couple of Type3-specific code-paths to the `PartialEvaluator`. In order to keep the implementation simple, particularily on the main-thread, these Type3 image resources are completely decoded on the worker-thread to avoid adding too many special cases. This should not cause any issues, only marginally less efficient code, but given how rare this kind of Type3 font is adding premature optimizations didn't seem at all warranted at this point.
2019-04-13 18:27:50 +02:00
Tim van der Meij
b4c3b94592
Merge pull request #6606 from Rob--W/pattern-scaling
Improve performance and correctness of Tiling Patterns
2019-03-29 00:01:38 +01:00
Tim van der Meij
f9c58115fc
Merge pull request #10683 from janpe2/type0-noncid-cmap
Use CMap in Type0 fonts when CFF is not a CID font
2019-03-28 00:07:08 +01:00
Rob Wu
d3dc8f16b5 TilingPattern: Reverse transform after painting
This transform resulted in an incorrectly positioned object when the
bounding box's upper-left corner did not start at (0,0), because
the translation was not reverted. This patch adds the missing transform.

The test file (tiling-pattern-box.pdf) is based on the PDF from #2825.
All but the first cube (including the PDF data) have been removed.
To trigger the bug that is fixed by this commit, I changed the BBox of
the first pattern from "[ 0 0 596 842]" to "[90 0 596 842]". Without
this patch, the dashed vertical line that intersects the corners at A
and E would disappear.
2019-03-27 17:50:35 +01:00
Rob Wu
a72a8e921f Avoid extreme sizing / scaling in tiling pattern
The new test file (tiling-pattern-large-steps.pdf) was manually created,
to have the following characteristics:
- Large xstep and ystep (90000)
- Page width is 4000 (which is larger than MAX_PATTERN_SIZE)
- Visually, the page consists of a red rectangle with a black border,
  surrounded by a 50 unit white padding.
- Before patch: blurry; After patch: sharp

Fixes #6496
Fixes #5698
Fixes #1434
Fixes #2825
2019-03-27 17:44:04 +01:00
Jonas Jenwald
9077abc263 Take the FirstChar/LastChar properties into account when computing the hash in PartialEvaluator.preEvaluateFont (issue 10665)
Without this some fonts may incorrectly end up with matching `hash`es, thus breaking rendering since we'll not actually try to load/parse some of the fonts.
2019-03-27 16:27:10 +01:00
Jani Pehkonen
49c6233fbc Use CMap in Type0 fonts when CFF is not a CID font 2019-03-26 19:38:44 +02:00
Jonas Jenwald
88f9e633dd Try to improve text-selection for Type3 fonts that utilize a non-default /FontMatrix (bug 1513120)
For Type3 fonts text-selection is often not that great, and there's a couple of heuristics used to try and improve things. This patch simple extends those heuristics a bit, and fixes a pre-existing "naive" array comparison, but this all feels a bit brittle to say the least.

The existing Type3 test-coverage isn't that great in general, and in particular Type3 `text` tests are few and far between, hence why this patch adds *two* different new `text` tests.
2019-03-12 10:32:08 +01:00
Jonas Jenwald
3ce8fe7927 Handle corrupt ASCII85Decode inline images with whitespace "inside" of the EOD marker (issue 10614)
There's a number of things wrong with the PDF document, since its inline images are first all *a lot* larger than the 4 KB limit (as mandated by the specification, see https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G7.1852045).

Furthermore the actual ASCII85Decode data is interspersed with *a lot* of needless whitespace, in particular also "inside" of the EOD (end-of-data) marker which thus completely breaks the detection.
Note that according to the specification, see https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G6.1940130, this patch should be safe since it explicitly mentions that *all* whitespace should be ignored.
2019-03-04 23:41:36 +01:00
Jonas Jenwald
fb774a65b0 Avoid truncating/breaking some Type3 glyphs in compileType3Glyph (bug 1245391, issue 10568)
*Hopefully this patch makes sense, since I cannot claim to fully understand this function.*

With the changes made in PR 3354 *some* Type3 glyph outlines are no longer rendering correctly, since the final paths were being accidentally ignored.
The fact that Type3 fonts are not very common in PDF documents, and that most Type3 glyphs are unaffected by this regression, probably explains why this has gone unnoticed since 2013.
2019-02-21 23:29:43 +01:00
Tsukasa OI
96ba6afd47 Fix copying on supplementary plane characters
pdf.js had a problem when copying characters on supplementary planes
(0xPPXXXX where PP is nonzero).  This is because certain methods of
PartialEvaluator use classic String.fromCharCode instead of ES6's
String.fromCodePoint.

Despite the fact that readToUnicode method *tried* to parse out-of-UCS2
code points by parsing UTF-16BE, it was inadequate because
String.fromCharCode only supports UCS-2 range of Unicode.
2019-02-10 18:14:53 +09:00
Jonas Jenwald
6f94a05a29 Do the final text scaling correctly in flushTextContentItem (issue 8276)
It's necessary to take into account whether or not the text is vertical, to avoid either the textContent `width` or `height` becoming incorrect.
2019-01-29 15:24:04 +01:00
Tim van der Meij
e2701d5422
Merge pull request #10482 from janpe2/indexed-decode
Implement Decode entry in Indexed images
2019-01-24 23:46:55 +01:00
Jonas Jenwald
41fbc71ef9 Ensure that XRef.indexObjects can handle object numbers with zero-padding (issue 10491)
All objects in the PDF document follow this pattern:
```
0000000001 0 obj
<<
% Some content here...
>>
endobj
0000000002 0 obj
<<
% More content here...
endobj

```
2019-01-24 22:37:18 +01:00
Jani Pehkonen
26121177ab Implement Decode entry in Indexed images 2019-01-22 22:51:04 +02:00
Jonas Jenwald
b531fc4106 Avoid truncating inline images, where the data and the "EI" marker is glued together (issue 10388) (#10436)
Thanks to the *excellent* debugging done by @janpe2, this was easy to fix!
2019-01-12 20:31:23 +01:00
Jonas Jenwald
d4a3858ed5 Handle more cases of corrupt PDF files with missing 'endobj' operators, where the "obj" string is immediately followed by the dictionary (PR 9288 follow-up) 2019-01-10 17:55:28 +01:00
Brendan Dahl
32eace043b Fix reading number of HTMX metrics.
The length of the HHEA table can be incorrect, so it is better to
read the number of metrics offset from beginning of table instead.
2019-01-04 15:13:13 -08:00
Brendan Dahl
e2686db49b
Merge pull request #10277 from janpe2/cff-stems
Repair CFF fonts if stem hints are in wrong order
2019-01-03 10:30:43 -08:00
Jonas Jenwald
60bcce184e Check that the first page can be successfully loaded, to try and ascertain the validity of the XRef table (issue 7496, issue 10326)
For PDF documents with sufficiently broken XRef tables, it's usually quite obvious when you need to fallback to indexing the entire file. However, for certain kinds of corrupted PDF documents the XRef table will, for all intents and purposes, appear to be valid. It's not until you actually try to fetch various objects that things will start to break, which is the case in the referenced issues[1].

Since there's generally a real effort being in made PDF.js to load even corrupt PDF documents, this patch contains a suggested approach to attempt to do a bit more validation of the XRef table during the initial document loading phase.

Here the choice is made to attempt to load the *first* page, as a basic sanity check of the validity of the XRef table. Please note that attempting to load a more-or-less arbitrarily chosen object without any context of what it's supposed to be isn't a very useful, which is why this particular choice was made.
Obviously, just because the first page can be loaded successfully that doesn't guarantee that the *entire* XRef table is valid, however if even the first page fails to load you can be reasonably sure that the document is *not* valid[2].

Even though this patch won't cause any significant increase in the amount of parsing required during initial loading of the document[3], it will require loading of more data upfront which thus delays the initial `getDocument` call.
Whether or not this is a problem depends very much on what you actually measure, please consider the following examples:

```javascript
console.time('first');
getDocument(...).promise.then((pdfDocument) => {
  console.timeEnd('first');
});

console.time('second');
getDocument(...).promise.then((pdfDocument) => {
  pdfDocument.getPage(1).then((pdfPage) => { // Note: the API uses `pageNumber >= 1`, the Worker uses `pageIndex >= 0`.
    console.timeEnd('second');
  });
});
```

The first case is pretty much guaranteed to show a small regression, however the second case won't be affected at all since the Worker caches the result of `getPage` calls. Again, please remember that the second case is what matters for the standard PDF.js use-case which is why I'm hoping that this patch is deemed acceptable.

---
[1] In issue 7496, the problem is that the document is edited without the XRef table being correctly updated.
In issue 10326, the generator was sorting the XRef table according to the offsets rather than the objects.

[2] The idea of checking the first page in particular came from the "standard" use-case for the PDF.js library, i.e. the default viewer, where a failure to load the first page basically means that nothing will work; note how `{BaseViewer, PDFThumbnailViewer}.setDocument` depends completely on being able to fetch the *first* page.

[3] The only extra parsing is caused by, potentially, having to traverse *part* of the `Pages` tree to find the first page.
2018-12-29 12:47:25 +01:00
Jani Pehkonen
9e990f6f3e Repair CFF fonts if stem hints are in wrong order 2018-11-20 18:50:37 +02:00
Simon Leblanc
b5806735d8 Add support of Ink annotation 2018-10-03 00:28:49 +02:00
Tim van der Meij
66422eb83e
Merge pull request #9340 from brendandahl/private-use
Map all glyphs to the private use area and duplicate the first glyph.
2018-09-08 17:51:04 +02:00
Brendan Dahl
b76cf665ec Map all glyphs to the private use area and duplicate the first glyph.
There have been lots of problems with trying to map glyphs to their unicode
values. It's more reliable to just use the private use areas so the browser's
font renderer doesn't mess with the glyphs.

Using the private use area for all glyphs did highlight other issues that this
patch also had to fix:

  * small private use area - Previously, only the BMP private use area was used
    which can't map many glyphs. Now, the (much bigger) PUP 16 area can also be
    used.

  * glyph zero not shown - Browsers will not use the glyph from a font if it is
    glyph id = 0. This issue was less prevalent when we mapped to unicode values
    since the fallback font would be used. However, when using the private use
    area, the glyph would not be drawn at all. This is illustrated in one of the
    current test cases (issue #8234) where there's an "ä" glyph at position
    zero. The PDF looked like it rendered correctly, but it was actually not
    using the glyph from the font. To properly show the first glyph it is always
    duplicated and appended to the glyphs and the maps are adjusted.

  * supplementary characters - The private use area PUP 16 is 4 bytes, so
    String.fromCodePoint must be used where we previously used
    String.fromCharCode. This is actually an issue that should have been fixed
    regardless of this patch.

  * charset - Freetype fails to load fonts when the charset size doesn't match
    number of glyphs in the font. We now write out a fake charset with the
    correct length. This also brought up the issue that glyphs with seac/endchar
    should only ever write a standard charset, but we now write a custom one.
    To get around this the seac analysis is permanently enabled so those glyphs
    are instead always drawn as two glyphs.
2018-09-05 14:04:54 -07:00
Jonas Jenwald
e5a6d892b4
Revert "Attempt to combine separate beginText/endText sequences in getTextContent (issue 9984)" 2018-09-05 18:01:33 +02:00
Tim van der Meij
959ed3705b
Implement a permissions API 2018-09-02 21:23:09 +02:00
Tim van der Meij
a096e0c647
Merge pull request #10032 from timvandermeij/test-link
Replace broken link for `pr8808.pdf.link`
2018-09-02 14:52:21 +02:00
Tim van der Meij
b62f14f3f5
Replace broken link for pr8808.pdf.link
The current link had an invalid certificate and was a redirect to this
new link anyway. The MD5 hash is equal.
2018-09-02 14:48:26 +02:00