Commit Graph

1538 Commits

Author SHA1 Message Date
Jonas Jenwald
ff90aa4323 Inline the isCmd check in the Parser.shift method
For very large and complex PDF files this will help performance slightly, since `Parser.shift` is called *a lot* during parsing.

This patch was tested using the PDF file from issue 2618, i.e. http://bugzilla-attachments.gnome.org/attachment.cgi?id=226471 (with well over *four million* `Parser.shift` calls for just the one page), using the following manifest file:
```
[
    {  "id": "issue2618",
       "file": "../web/pdfs/issue2618.pdf",
       "md5": "",
       "rounds": 100,
       "type": "eq"
    }
]
```

This gave the following results when comparing this patch against the `master` branch:
```
-- Grouped By browser, stat --
browser | stat         | Count | Baseline(ms) | Current(ms) | +/- |    %  | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | --- | ----- | -------------
Firefox | Overall      |   100 |         3386 |        3322 | -65 | -1.92 |        faster
Firefox | Page Request |   100 |            1 |           1 |   0 | -8.08 |
Firefox | Rendering    |   100 |         3385 |        3321 | -65 | -1.92 |        faster
```
2019-07-22 12:07:36 +02:00
Jonas Jenwald
b5254f2745 Attempt to significantly reduce the number of ChunkedStream.{ensureByte, ensureRange} calls by inlining the this.progressiveDataLength checks at the call-sites
The number of in particular `ChunkedStream.ensureByte` calls is often absolutely *huge* (on the order of million calls) when loading and rendering even moderately complicated PDF files, which isn't entirely surprising considering that the `getByte`/`getBytes`/`peekByte`/`peekBytes` methods are used for essentially all data reading/parsing.

The idea implemented in this patch is to inline an inverted `progressiveDataLength` check at all of the `ensureByte`/`ensureRange` call-sites, which in practice will often result in *several* orders of magnitude fewer function calls.
Obviously this patch will only help if the browser supports streaming, which all reasonably modern browsers now do (including the Firefox built-in PDF viewer), and assuming that the user didn't set the `disableStream` option (e.g. for using `disableAutoFetch`). However, I think we should be able to improve performance for the default out-of-the-box use case, without worrying about e.g. older browsers (where this patch will thus incur *one* additional check before calling `ensureByte`/`ensureRange`).

This patch was inspired by the *first* commit in PR 5005, which was subsequently backed out in PR 5145 for causing regressions. Since the general idea of avoiding unnecessary function calls was really nice, I figured that re-attempting this in one way or another wouldn't be a bad idea.
Given that streaming is now supported, which it wasn't back then, using `progressiveDataLength` seemed like an easier approach in general since it also allowed supporting both `ensureByte` and `ensureRange`.

This sort of patch obviously needs data to back it up, hence I've benchmarked the changes using the following manifest file (with the default `tracemonkey` file):
```
[
    {  "id": "tracemonkey-eq",
       "file": "pdfs/tracemonkey.pdf",
       "md5": "9a192d8b1a7dc652a19835f6f08098bd",
       "rounds": 250,
       "type": "eq"
    }
]
```

I get the following complete results when comparing this patch against the `master` branch:
```
-- Grouped By browser, stat --
browser | stat         | Count | Baseline(ms) | Current(ms) | +/- |    %  | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | --- | ----- | -------------
Firefox | Overall      |  3500 |          140 |         134 |  -6 | -4.46 |        faster
Firefox | Page Request |  3500 |            2 |           2 |   0 | -0.10 |
Firefox | Rendering    |  3500 |          138 |         131 |  -6 | -4.54 |        faster
```

Here it's pretty clear that the patch does have a positive net effect, even for a PDF file of fairly moderate size and complexity. However, in this case it's probably interesting to also look at the results per page:
```
-- Grouped By page, stat --
page | stat         | Count | Baseline(ms) | Current(ms) | +/- |     %  | Result(P<.05)
---- | ------------ | ----- | ------------ | ----------- | --- | ------ | -------------
0    | Overall      |   250 |           74 |          75 |   1 |   0.69 |
0    | Page Request |   250 |            1 |           1 |   0 |  33.20 |
0    | Rendering    |   250 |           73 |          74 |   0 |   0.25 |
1    | Overall      |   250 |          123 |         121 |  -2 |  -1.87 |        faster
1    | Page Request |   250 |            3 |           2 |   0 | -11.73 |
1    | Rendering    |   250 |          121 |         119 |  -2 |  -1.67 |
2    | Overall      |   250 |           64 |          63 |  -1 |  -1.91 |
2    | Page Request |   250 |            1 |           1 |   0 |   8.81 |
2    | Rendering    |   250 |           63 |          62 |  -1 |  -2.13 |        faster
3    | Overall      |   250 |           97 |          97 |   0 |  -0.06 |
3    | Page Request |   250 |            1 |           1 |   0 |  25.37 |
3    | Rendering    |   250 |           96 |          95 |   0 |  -0.34 |
4    | Overall      |   250 |           97 |          97 |   0 |  -0.38 |
4    | Page Request |   250 |            1 |           1 |   0 |  -5.97 |
4    | Rendering    |   250 |           96 |          96 |   0 |  -0.27 |
5    | Overall      |   250 |           99 |          97 |  -3 |  -2.92 |
5    | Page Request |   250 |            2 |           1 |   0 | -17.20 |
5    | Rendering    |   250 |           98 |          95 |  -3 |  -2.68 |
6    | Overall      |   250 |           99 |          99 |   0 |  -0.14 |
6    | Page Request |   250 |            2 |           2 |   0 | -16.49 |
6    | Rendering    |   250 |           97 |          98 |   0 |   0.16 |
7    | Overall      |   250 |           96 |          95 |  -1 |  -0.55 |
7    | Page Request |   250 |            1 |           2 |   1 |  66.67 |        slower
7    | Rendering    |   250 |           95 |          94 |  -1 |  -1.19 |
8    | Overall      |   250 |           92 |          92 |  -1 |  -0.69 |
8    | Page Request |   250 |            1 |           1 |   0 | -17.60 |
8    | Rendering    |   250 |           91 |          91 |   0 |  -0.52 |
9    | Overall      |   250 |          112 |         112 |   0 |   0.29 |
9    | Page Request |   250 |            2 |           1 |   0 |  -7.92 |
9    | Rendering    |   250 |          110 |         111 |   0 |   0.37 |
10   | Overall      |   250 |          589 |         522 | -67 | -11.38 |        faster
10   | Page Request |   250 |           14 |          13 |   0 |  -1.26 |
10   | Rendering    |   250 |          575 |         508 | -67 | -11.62 |        faster
11   | Overall      |   250 |           66 |          66 |  -1 |  -0.86 |
11   | Page Request |   250 |            1 |           1 |   0 | -16.48 |
11   | Rendering    |   250 |           65 |          65 |   0 |  -0.62 |
12   | Overall      |   250 |          303 |         291 | -12 |  -4.07 |        faster
12   | Page Request |   250 |            2 |           2 |   0 |  12.93 |
12   | Rendering    |   250 |          301 |         289 | -13 |  -4.19 |        faster
13   | Overall      |   250 |           48 |          47 |   0 |  -0.45 |
13   | Page Request |   250 |            1 |           1 |   0 |   1.59 |
13   | Rendering    |   250 |           47 |          46 |   0 |  -0.52 |
```

Here it's clear that this patch *significantly* improves the rendering performance of the slowest pages, while not causing any big regressions elsewhere. As expected, this patch thus helps larger and/or more complex pages the most (which is also where even small improvements will be most beneficial).
There's obviously the question if this is *slightly* regressing simpler pages, but given just how short the times are in most cases it's not inconceivable that the page results above are simply caused be e.g. limited `Date.now()` and/or limited numerical precision.
2019-07-18 17:30:22 +02:00
Tim van der Meij
6e96a158f4
Merge pull request #10820 from vlastimilmaca/annot-irt-rt-states
Annotations - Added parsing of IRT, RT, State and StateModel
2019-07-17 23:34:31 +02:00
vlastimilmaca
fe49f0f766 Annotations - Implement parsing of IRT, RT, State and StateModel 2019-07-16 23:33:07 +02:00
Jonas Jenwald
bea15b6ce5 Simplify the PDFDocument.fingerprint method slightly
The way that this method handles documents without an `ID` entry in the Trailer dictionary feels overly complicated to me. Hence this patch adds `getByteRange` methods to the various Stream implementations[1], and utilize that rather than manually calling `ensureRange` when computing a fallback `fingerprint`.

---
[1] Note that `PDFDocument` is only ever initialized with either a `Stream` or a `ChunkedStream`, hence why the `DecodeStream.getByteRange` method isn't implemented.
2019-07-15 13:26:08 +02:00
Tim van der Meij
ed3954fc7a
Merge pull request #10851 from brendandahl/shading-bbox
Apply bounding box before using shading patterns.
2019-07-12 22:52:07 +02:00
Brendan Dahl
6fab0a0dac Apply bounding box before using shading patterns.
Fixes #8092
2019-07-08 14:05:48 -07:00
Jonas Jenwald
bdc31f8b50 Make the find helper function, in src/core/document.js, more efficient by using peekBytes rather reading the stream one byte at a time
*Please note:* A a similar change was attempted in PR 5005, but it was subsequently backed out in PR 5069.

Unfortunately I don't think anyone ever tried to debug *exactly* why it didn't work, since it ought to have worked, and having re-tested this now I'm not able to reproduce the problem any more. However, given just how inefficient the current code is, with thousands of strictly unnecessary function calls for each `find` invocation, I'd really like to try fixing this again.
2019-07-06 11:44:17 +02:00
Jonas Jenwald
41745a5996 Reduce the number of isCmd calls slightly in the XRef class
This reduces the total number of function calls, when reading the XRef table respectively when fetching uncompressed XRef entries.
Note in particular the `XRef.readXRefTable` method, where there're *two* back-to-back `isCmd` checks rather than just one.
2019-06-29 16:28:45 +02:00
Tim van der Meij
f1867de492
Merge pull request #10925 from Snuffleupagus/eslint_no-unsanitized
Enable the `eslint-plugin-no-unsanitized` ESLint plugin to disallow unsafe usage of e.g. `innerHTML`
2019-06-27 20:32:24 +02:00
Jonas Jenwald
f710eb56e4 Change the signature of the Parser constructor to take a parameter object
A lot of the `new Parser()` call-sites look quite unwieldy/ugly as-is, with a bunch of somewhat randomly ordered arguments, which we can avoid by changing the constructor to accept an object instead. As an added bonus, this provides better documentation without having to add inline argument comments in the code.
2019-06-23 16:01:45 +02:00
Jonas Jenwald
5bb5e7741d Enable the eslint-plugin-no-unsanitized ESLint plugin to disallow unsafe usage of e.g. innerHTML
See https://github.com/mozilla/eslint-plugin-no-unsanitized

Since we've generally never allowed e.g. `innerHTML`, which is enforced during review, there's only one linting failure with this patch. (Which is white-listed, according to the existing comment and the fact that it's test-only code.)
2019-06-23 13:50:30 +02:00
Jonas Jenwald
021e5ffb88 Move PDFWorkerStream and related code to its own file
Since all other `IPDFStream` implementations live in their own files, it seems reasonable for these to do so as well.

Furthermore, converts all of the relevant code to ES6 classes and updates the interface definitions to mark a couple of methods `async`.
2019-06-15 13:05:25 +02:00
Jonas Jenwald
26bc630e19 Add support for outline items, in the default viewer, which default to collapsed when the outline is built
The PDF specification supports this feature, which is commonly used in large/long documents (such as the spec itself), and it seems reasonably straightforward to implement; see https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G11.2095911
2019-06-07 12:26:23 +02:00
Jonas Jenwald
876c962235 Ignore Annotations with too large border widths, to prevent the annotationLayer from rendering it over the surrounding document (bug 1552113)
The border `width` will instead fallback to the default value of `1`, rather than ignoring it altoghether, to also ensure that e.g. `LinkAnnotation`s become clickable as intended.

Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1552113
2019-06-01 15:51:22 +02:00
Jonas Jenwald
a3742a9f83 Ensure that the Cmd/Name/Ref caches are cleared when terminating the worker (PR 10863 follow-up)
Usually when the worker is terminated it will also be completely destroyed/removed, which means that any global caches (such as the ones in `src/core/primitive.js`) should be automatically cleared in the process.

However, for certain ways of loading the `pdf.worker.js` file, e.g. passing in a re-usable worker to `getDocument`, using the `workerPort` functionality, or even disabling workers completely  (even though this is never a good idea), the worker file may be kept in memory and these caches will not be cleared as expected.
2019-05-30 20:57:28 +02:00
Jonas Jenwald
5e045bcdba Ensure that the Cmd/Name/Ref caches are cleared when running other cleanup code
The purpose of these caches is to reduce peak memory usage, by only ever having *a single* instance of a particular object.
However, as-is these caches are never cleared and they will thus remain until the worker is destroyed. This could very well have a negative effect on total memory usage, particularly for large/long documents, hence it seems to make sense to clear out these caches together with various other ones.
2019-05-26 14:29:59 +02:00
Jonas Jenwald
2fe9f3ff8f Add caching to reduce the number of Ref objects
This is similar to the existing caching used to reduced the number of `Cmd` and `Name` objects.
With the `tracemonkey.pdf` file, this patch changes the number of `Ref` objects as follows (in the default viewer):

|          | Loading the first page | Loading *all* the pages |
|----------|------------------------|-------------------------|
| `master` | 332                    | 3265                    |
| `patch`  | 163                    | 996                     |
2019-05-26 12:23:37 +02:00
Tim van der Meij
bc1eb49a77
Implement creation date only for markup annotations
The specification states that `CreationDate` is only available for
markup annotations instead of for all annotation types.

Moreover, popup annotations are not markup annotations according to the
specification, so the creation date inheritance from the parent
annotation is also removed there (note that only the modification date
is used in e.g., the viewer).
2019-05-25 15:31:06 +02:00
Tim van der Meij
cf07918ccb
Implement contents for every annotation type
The specification states that `Contents` can be available for every
annotation types instead of only for markup annotations.
2019-05-18 15:52:17 +02:00
Jonas Jenwald
0784c98172 Remove unused ref property from the parameters object used when creating annotations in AnnotationFactory._create
The only use-cases for this property was removed in PRs 7570 and 7775, and it's been completely unused ever since the latter one.
2019-05-16 08:33:38 +02:00
Tim van der Meij
c8c937c257
Merge pull request #10794 from janpe2/cidtogidmap-zero
Fix glyph at index zero in CIDFontType2 that has a CIDToGIDMap stream
2019-05-15 00:04:39 +02:00
Jonas Jenwald
173fbef05b Enable the consistent-return ESLint rule
This rule is already enabled in mozilla-central, and helps ensure more consistent functions/methods, see https://searchfox.org/mozilla-central/rev/b9da45f63cb567244933c77b2c7e827a057d3f9b/tools/lint/eslint/eslint-plugin-mozilla/lib/configs/recommended.js#119-120

Please see https://eslint.org/docs/rules/consistent-return for additional information.
2019-05-11 14:27:21 +02:00
Jani Pehkonen
05c527f035 Fix glyph 0 in CIDFontType2 that has a CIDToGIDMap stream 2019-05-07 18:44:37 +03:00
Tim van der Meij
be1d6626a7
Implement creation/modification date for annotations
This includes the information in the core and display layers. The
date parsing logic from the document properties is rewritten according
to the specification and now includes unit tests.

Moreover, missing unit tests for the color of a popup annotation have
been added.

Finally the styling of the popup is changed slightly to make the text a
bit smaller (it's currently quite large in comparison to other viewers)
and to make the drop shadow a bit more subtle. The former is done to be
able to easily include the modification date in the popup similar to how
other viewers do this.
2019-05-05 14:51:03 +02:00
Jonas Jenwald
007fab6ab5 Change PartialEvaluator.handleColorN to throw when no valid pattern is found
Currently `handleColorN` will fallback to add a completely unparsed/unvalidated operator when no valid pattern was found. This is unfortunate, since it could very easily lead to a couple of different errors:
 - `DataCloneError`s when attempting to send the data to the main-thread, e.g. when `args` is `Dict`/`Stream`.
 - Errors in `getShadingPatternFromIR` on the main-thread, unless `args` just happens to have the expected format.
 - Errors when actually attempting to render the pattern on the main-thread, since the `args` will most likely not have the expected format.

Hence it probably makes sense to error in `PartialEvaluator.handleColorN`, and having invalid patterns fail gracefully via the existing `ignoreErrors` code-paths instead.
2019-05-04 12:53:18 +02:00
Tim van der Meij
155304a0c1
Merge pull request #10756 from Snuffleupagus/issue-10542
Attempt to handle corrupt PDF documents that contains path operators inside of text object (issue 10542)
2019-05-02 22:29:24 +02:00
Jonas Jenwald
96942d4f7f Ensure that the OperatorList constructor actually initializes a NullOptimizer when intended (PR 9089 follow-up)
It appears that this has been broken ever since PR 9089, which also introduced this code, since the `QueueOptimizer`/`NullOptimizer` choice was made based on the still undefined `this.intent` property.

Furthermore, fixing this also uncovered the fact that the `NullOptimizer.reset` method was missing.
2019-05-02 17:37:05 +02:00
Jonas Jenwald
5335285cda Attempt to handle corrupt PDF documents that contains path operators inside of text object (issue 10542)
First of all, while this simple approach appears to work OK in practice I'm not sure if it's the best way of addressing the problem (assuming that you even want to).
Second of all, while the solution implemented here only requires tracking/checking one new boolean in order for this to work, I'm nonetheless not entirely happy about this since it will add additional overhead (albeit *very* small) to the parsing of path operators in PDF documents just for a handful of *corrupt* ones.
2019-04-30 23:35:33 +02:00
Tim van der Meij
762c58e0fc
Merge pull request #10738 from Snuffleupagus/ViewerPreferences-api
[api-minor] Add support for ViewerPreferences in the API (issue 10736)
2019-04-20 18:39:32 +02:00
Jonas Jenwald
34952b732e Add a getDocId method to the idFactory, in Page instances, to avoid passing around PDFManager instances unnecessarily (PR 7941 follow-up)
This way we can avoid manually building a "document id" in multiple places in `evaluator.js`, and it also let's us avoid passing in an otherwise unnecessary `PDFManager` instance when creating a `PartialEvaluator`.
2019-04-20 13:11:17 +02:00
Tim van der Meij
55d9b35d37
Merge pull request #10727 from Snuffleupagus/type3-image-resources
Support (rare) Type3 fonts which contains image resources (issue 10717)
2019-04-18 23:07:26 +02:00
Jonas Jenwald
311bac3ebb [api-minor] Add support for ViewerPreferences in the API (issue 10736)
Please see the specification, https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#M11.9.12864.1Heading.71.Viewer.Preferences

Furthermore, note that this patch *only* adds API support and unit-tests but does not attempt to integrate e.g. the `ViewerPreferences -> Direction` property into the viewer (which would be necessary to address issue 10736).
The reason for this is that it's not entirely clear to me exactly if/how that could be implemented; e.g. would it be as simple as setting the `dir` attribute on the `viewerContainer` DOM element, or will it be more complicated?
There's also the question of how the `ViewerPreferences -> Direction` value interacts with the `PageMode`, and this will generally require a fair bit of manual testing. Since the direction of the *entire* viewer depends on the browser locale, there's also a somewhat open question regarding what default value to use for different locales.
Finally, if the viewer supports `ViewerPreferences -> Direction` then I'm assuming that it will be necessary to allow users to override the default value, which will require (most likely) new `SecondaryToolbar` buttons and icons for those etc.

Hence this patch only lays the necessary foundation for eventually addressing issue 10736, but defers the actual implementation until later. (Time permitting, I'll try to look into the viewer part later.)
2019-04-14 14:20:52 +02:00
Tim van der Meij
ae2a4dc3dd
Implement free text annotations 2019-04-13 18:45:22 +02:00
Jonas Jenwald
be604bd195 Support (rare) Type3 fonts which contains image resources (issue 10717)
The Type3 font type is not commonly used in PDF documents, as can be seen from telemetry data such as: https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2019-04-09&include_spill=0&keys=__none__!__none__!__none__&max_channel_version=nightly%252F68&measure=PDF_VIEWER_FONT_TYPES&min_channel_version=nightly%252F57&processType=*&product=Firefox&sanitize=1&sort_by_value=0&sort_keys=submissions&start_date=2019-03-18&table=0&trim=1&use_submission_date=0 (see also https://github.com/mozilla/pdf.js/wiki/Enumeration-Assignments-for-the-Telemetry-Histograms#pdf_viewer_font_types).

Type3 fonts containing image resources are *very* rare in practice, usually they only contain path rendering operators, but as the issue shows they unfortunately do exist.
Currently these Type3-related image resources are not handled in any special way, and given that fonts are document rather than page specific rendering breaks since the image resources are thus not available to the *entire* document.
Fortunately fixing this isn't too difficult, but it does require adding a couple of Type3-specific code-paths to the `PartialEvaluator`. In order to keep the implementation simple, particularily on the main-thread, these Type3 image resources are completely decoded on the worker-thread to avoid adding too many special cases. This should not cause any issues, only marginally less efficient code, but given how rare this kind of Type3 font is adding premature optimizations didn't seem at all warranted at this point.
2019-04-13 18:27:50 +02:00
Tim van der Meij
2d0c38d626
Merge pull request #10696 from Snuffleupagus/makeSubStream-ensureByte
Update `ChunkedStream.makeSubStream` to actually check if (some) data exists when the `length` parameter is undefined
2019-04-13 17:12:20 +02:00
Vlastimil Máca
d96267c30c Annotations - _preparePopup method replaced with MarkupAnnotation base class. This is just refactoring, so it shouldn't break anything. It should move annotation API closer to PDF spec and enable future expansion. 2019-04-12 11:24:21 +02:00
Tim van der Meij
4055d0a302
Implement caret annotations
The file `test/pdfs/annotation-caret-ink.pdf` is already available in
the repository as a reference test for this since I supplied it for
another patch that implemented ink annotations.
2019-04-09 23:39:56 +02:00
Jonas Jenwald
7a999d1d67 [api-minor] Add basic support for PageLayout in the API and the viewer
Please see the specification, https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G6.2393749, and refer to the inline comments for additional details.
2019-04-05 11:32:01 +02:00
Jonas Jenwald
c6ddbd55e2 Add a progressiveDataLength fast-path to ChunkedStream.ensureByte
This is *similar* to the existing check using in `ChunkedStream.ensureRange`.
2019-03-29 20:00:28 +01:00
Jonas Jenwald
49e8a270c4 Update ChunkedStream.makeSubStream to actually check if (some) data exists when the length parameter is undefined
Note how `XRef.fetchUncompressed`, which is used *a lot* for most PDF documents, is calling the `makeSubStream` method without providing a `length` argument.
In practice this results in the `makeSubStream` method, on the `ChunkedStream` instance, calling the `ensureRange` method with `NaN` as the end position,  thus resulting in no data being requested despite it possibly being necessary.

This may be quite bad, since in this particular case it will lead to a new `ChunkedStream` being created *and* also a new `Parser`/`Lexer` instance. Given that it's quite possible that even the very first `Parser.getObj` call could throw `MissingDataException`, this could thus lead to wasted time/resources (since re-parsing is necessary once the data finally arrives).

You obviously need to be very careful to not have `ChunkedStream.makeSubStream` accidentally requesting the *entire* file, hence its `this.end` property is of no use here, but it should be possible to at least check that the `start` of the data is present before any potentially expensive parsing occurs.
2019-03-29 17:20:31 +01:00
Tim van der Meij
f9c58115fc
Merge pull request #10683 from janpe2/type0-noncid-cmap
Use CMap in Type0 fonts when CFF is not a CID font
2019-03-28 00:07:08 +01:00
Jonas Jenwald
9077abc263 Take the FirstChar/LastChar properties into account when computing the hash in PartialEvaluator.preEvaluateFont (issue 10665)
Without this some fonts may incorrectly end up with matching `hash`es, thus breaking rendering since we'll not actually try to load/parse some of the fonts.
2019-03-27 16:27:10 +01:00
Jonas Jenwald
a2a824ed01 Don't accidentally use an empty hash value when comparing preEvaluatedFonts in PartialEvaluator.loadFont
Note that `PartialEvaluator.preEvaluateFont` will return an empty string when no hash was computed. This will complete short-circuit the `fontAlias` comparison in `PartialEvaluator.loadFont`, since fonts which are totally different will then match if their `hash`es are empty.
2019-03-27 00:54:39 +01:00
Jani Pehkonen
49c6233fbc Use CMap in Type0 fonts when CFF is not a CID font 2019-03-26 19:38:44 +02:00
Tim van der Meij
33bfbef6ba
Merge pull request #10635 from timvandermeij/lexer-parser
Convert `src/core/parser.js` to ES6 syntax and write more unit tests for the lexer and the parser
2019-03-19 23:17:34 +01:00
Tim van der Meij
7d3cb19571
Convert the Linearization class in src/core/parser.js to ES6 syntax
Moreover, disable `var` usage for this file.
2019-03-17 13:27:45 +01:00
Jonas Jenwald
56eeeea1dc Re-factor the getTransfers helper function into a "private" getter method on the OperatorList
This function is currently called with the `OperatorList` instance as its argument, hence I cannot think of any good reason for not just moving it into the `OperatorList` properly. (This will also help with other planned changes regarding the `ImageCache` functionality.)
2019-03-16 13:06:51 +01:00
Jonas Jenwald
7273795eb6 Actually transfer eligible ImageMask data, rather than always copying it
By transfering `ArrayBuffer`s you can avoid having two copies of the same data, i.e. one copy on each of the worker/main-thread, for data that's used only *once* on the worker-thread.

Note how the code in [`PDFImage.createMask`](80135378ca/src/core/image.js (L284-L285)) goes to great lengths to actually enable tranfering of the image data. However in [`PartialEvaluator.buildPaintImageXObject`](80135378ca/src/core/evaluator.js (L336)) the `cached` property is always set to `true`, which disqualifies the image data from being transfered; see [`getTransfers`](80135378ca/src/core/operator_list.js (L552-L554)).

For most ImageMask data this patch won't matter, since images found in the `/Resources -> /XObject` dictionary will always be indexed by name. However for *inline* images which contains ImageMask data, where only "small" images are cached (in both `parser.js` and `evaluator.js`), the current code will result in some unnecessary memory usage.
2019-03-16 13:06:32 +01:00
Jonas Jenwald
88f9e633dd Try to improve text-selection for Type3 fonts that utilize a non-default /FontMatrix (bug 1513120)
For Type3 fonts text-selection is often not that great, and there's a couple of heuristics used to try and improve things. This patch simple extends those heuristics a bit, and fixes a pre-existing "naive" array comparison, but this all feels a bit brittle to say the least.

The existing Type3 test-coverage isn't that great in general, and in particular Type3 `text` tests are few and far between, hence why this patch adds *two* different new `text` tests.
2019-03-12 10:32:08 +01:00
Tim van der Meij
8d4d7dbf58
Convert the Lexer class in src/core/parser.js to ES6 syntax 2019-03-10 19:04:36 +01:00
Tim van der Meij
7d0ecee771
Convert the Parser class in src/core/parser.js to ES6 syntax 2019-03-10 19:04:35 +01:00
Tim van der Meij
d587abbceb
Merge pull request #10633 from Snuffleupagus/murmurhash-class
Convert `MurmurHash3_64` to an ES6 class
2019-03-09 21:07:12 +01:00
Jonas Jenwald
6b1ac44aea Convert MurmurHash3_64 to an ES6 class
Notable changes:
 - Remove the `return this;` from the `MurmurHash3_64.update` method, since it's completely unused and doesn't make a lot of sense.
 - Remove the loop(s) from the `MurmurHash3_64.hexdigest` method, since creating a temporary array and then looping over it is wasteful given how simple this can be written with modern JavaScript.
2019-03-09 17:03:06 +01:00
Jonas Jenwald
2665502055 Move NativeImageDecoder into a separate file, and convert it to a class
Given the size of the `src/core/evaluator.js` file, it cannot hurt to move some of its (image related) helper functionality into a separate file.
2019-03-09 15:59:04 +01:00
Tim van der Meij
8b149b818e
Merge pull request #10615 from Snuffleupagus/corrupt-inline-ASCII85Decode
Handle corrupt ASCII85Decode inline images with whitespace "inside" of the EOD marker (issue 10614)
2019-03-08 23:06:01 +01:00
Jonas Jenwald
3ce8fe7927 Handle corrupt ASCII85Decode inline images with whitespace "inside" of the EOD marker (issue 10614)
There's a number of things wrong with the PDF document, since its inline images are first all *a lot* larger than the 4 KB limit (as mandated by the specification, see https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G7.1852045).

Furthermore the actual ASCII85Decode data is interspersed with *a lot* of needless whitespace, in particular also "inside" of the EOD (end-of-data) marker which thus completely breaks the detection.
Note that according to the specification, see https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G6.1940130, this patch should be safe since it explicitly mentions that *all* whitespace should be ignored.
2019-03-04 23:41:36 +01:00
Jonas Jenwald
4170c414fa Reduce usage of Date.now() in src/core/worker.js
Currently for every single parsed/rendered page there's no less than *four* `Date.now()` calls being made on the worker-side. This seems totally unnecessary, since the result of these calls are, by default, not used for anything *unless* the verbosity level is set to `INFO`.
2019-03-02 20:23:52 +01:00
Brendan Dahl
7d6ab081eb Put the string name of the glyph in the charset array.
Also, only warn once per font when missing a glyph name.
2019-03-01 18:03:51 -08:00
Brendan Dahl
34022d2fd1
Merge pull request #10591 from brendandahl/fix-charset
Add unique glyph names for CFF fonts.
2019-02-28 17:22:29 -08:00
Brendan Dahl
8a596ef5d5 Add unique glyph names for CFF fonts.
Printing on MacOS was broken with the previous approach of just mapping
all the glyphs to notdef.
2019-02-27 15:00:29 -08:00
Jonas Jenwald
db5dc14158 Move worker-thread only functions from src/shared/util.js and into a new src/core/core_utils.js file
The `src/shared/util.js` file is being bundled into both the `pdf.js` and `pdf.worker.js` files, meaning that its code is by definition duplicated.
Some main-thread only utility functions have already been moved to a separate `src/display/display_utils.js` file, and this patch simply extends that concept to utility functions which are used *only* on the worker-thread.

Note in particular the `getInheritableProperty` function, which expects a `Dict` as input and thus *cannot* possibly ever be used on the main-thread.
2019-02-24 00:35:39 +01:00
Jonas Jenwald
60f6d49ff7 [api-minor] Expose the existence of a Collection dictionary via the getMetadata API method (issue 10555)
Given the complexity of this functionality, and the fact that it doesn't seem widely used, I highly doubt that it'd ever make sense to support Collections; see also https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#M11.9.39646.2Heading.824.Collections
2019-02-15 15:40:31 +01:00
Jonas Jenwald
b6d090cc14 Fallback to the built-in font renderer when font loading fails
After PR 9340 all glyphs are now re-mapped to a Private Use Area (PUA) which means that if a font fails to load, for whatever reason[1], all glyphs in the font will now render as Unicode glyph outlines.
This obviously doesn't look good, to say the least, and might be seen as a "regression" since previously many glyphs were left in their original positions which provided a slightly better fallback[2].

Hence this patch, which implements a *general* fallback to the PDF.js built-in font renderer for fonts that fail to load (i.e. are rejected by the sanitizer). One caveat here is that this only works for the Font Loading API, since it's easy to handle errors in that case[3].

The solution implemented in this patch does *not* in any way delay the loading of valid fonts, which was the problem with my previous attempt at a solution, and will only require a bit of extra work/waiting for those fonts that actually fail to load.

*Please note:* This patch doesn't fix any of the underlying PDF.js font conversion bugs that's responsible for creating corrupt font files, however it does *improve* rendering in a number of cases; refer to this possibly incomplete list:

[Bug 1524888](https://bugzilla.mozilla.org/show_bug.cgi?id=1524888)
Issue 10175
Issue 10232

---
[1] Usually because the PDF.js font conversion code wasn't able to parse the font file correctly.

[2] Glyphs fell back to some default font, which while not accurate was more useful than the current state.

[3] Furthermore I'm not sure how to implement this generally, assuming that's even possible, and don't really have time/interest to look into it either.
2019-02-11 10:27:08 +01:00
Tsukasa OI
96ba6afd47 Fix copying on supplementary plane characters
pdf.js had a problem when copying characters on supplementary planes
(0xPPXXXX where PP is nonzero).  This is because certain methods of
PartialEvaluator use classic String.fromCharCode instead of ES6's
String.fromCodePoint.

Despite the fact that readToUnicode method *tried* to parse out-of-UCS2
code points by parsing UTF-16BE, it was inadequate because
String.fromCharCode only supports UCS-2 range of Unicode.
2019-02-10 18:14:53 +09:00
Jonas Jenwald
6f94a05a29 Do the final text scaling correctly in flushTextContentItem (issue 8276)
It's necessary to take into account whether or not the text is vertical, to avoid either the textContent `width` or `height` becoming incorrect.
2019-01-29 15:24:04 +01:00
Jonas Jenwald
29f36d7a1b Reduce unnecessary duplication of the isDefaultDecode methods on ColorSpace instances
The recent PR 10482 made me realize that I missed an opportunity for simplification when doing the class conversion of this code in PR 10007.
2019-01-25 08:53:08 +01:00
Tim van der Meij
e2701d5422
Merge pull request #10482 from janpe2/indexed-decode
Implement Decode entry in Indexed images
2019-01-24 23:46:55 +01:00
Jonas Jenwald
41fbc71ef9 Ensure that XRef.indexObjects can handle object numbers with zero-padding (issue 10491)
All objects in the PDF document follow this pattern:
```
0000000001 0 obj
<<
% Some content here...
>>
endobj
0000000002 0 obj
<<
% More content here...
endobj

```
2019-01-24 22:37:18 +01:00
Jani Pehkonen
26121177ab Implement Decode entry in Indexed images 2019-01-22 22:51:04 +02:00
Jonas Jenwald
24a688d6c6 Convert some usage of indexOf to startsWith/includes where applicable
In many cases in the code you don't actually care about the index itself, but rather just want to know if something exists in a String/Array or if a String starts in a particular way. With modern JavaScript functionality, it's thus possible to remove a number of existing `indexOf` cases.
2019-01-18 17:57:41 +01:00
Jonas Jenwald
b531fc4106 Avoid truncating inline images, where the data and the "EI" marker is glued together (issue 10388) (#10436)
Thanks to the *excellent* debugging done by @janpe2, this was easy to fix!
2019-01-12 20:31:23 +01:00
Jonas Jenwald
d4a3858ed5 Handle more cases of corrupt PDF files with missing 'endobj' operators, where the "obj" string is immediately followed by the dictionary (PR 9288 follow-up) 2019-01-10 17:55:28 +01:00
Tim van der Meij
f162fed6b9
Convert src/core/charsets.js and src/core/standard_fonts.js to ES6 syntax
Moreover, include the "no var" ESLint comment to
`src/core/annotation.js` and `src/core/ps_parser.js` since they are
already converted.
2019-01-06 15:04:01 +01:00
Tim van der Meij
3b637e71d4
Convert src/core/arithmetic_decoder.js to ES6 syntax 2019-01-06 15:04:01 +01:00
Brendan Dahl
32eace043b Fix reading number of HTMX metrics.
The length of the HHEA table can be incorrect, so it is better to
read the number of metrics offset from beginning of table instead.
2019-01-04 15:13:13 -08:00
Jonas Jenwald
66fccd860b Adjust how AnnotationBorderStyle.setWidth handles the input being a Name (issue 10385)
In order to be consistent with the behaviour in Adobe Reader, the width will now always be set to zero when the input is a `Name`.
2019-01-04 10:38:10 +01:00
Tim van der Meij
2d00bb098b
Merge pull request #10404 from Snuffleupagus/issue-10401
Remove the `for ... of` loop from the `PDFDocument.fingerprint` getter (issue 10401)
2019-01-03 22:46:51 +01:00
Brendan Dahl
e2686db49b
Merge pull request #10277 from janpe2/cff-stems
Repair CFF fonts if stem hints are in wrong order
2019-01-03 10:30:43 -08:00
Jonas Jenwald
8c278530dd Remove the for ... of loop from the PDFDocument.fingerprint getter (issue 10401)
It appears that the `Symbol` polyfill doesn't work well in conjunction with `TypedArray`s, and that part of PR 10393 is thus reverted.
2019-01-03 11:17:45 +01:00
Tim van der Meij
d8f201ea2a
Merge pull request #10397 from Snuffleupagus/issue-10385
Ensure that `AnnotationBorderStyle.setWidth` is able to handle the input being a `Name`, to correctly deal with corrupt PDF documents (issue 10385)
2018-12-31 12:58:28 +01:00
Jonas Jenwald
76a9580aeb Ensure that AnnotationBorderStyle.setWidth is able to handle the input being a Name, to correctly deal with corrupt PDF documents (issue 10385) 2018-12-31 12:21:28 +01:00
Jonas Jenwald
15b3806937 Actually validate the input in AnnotationBorderStyle.setStyle 2018-12-31 12:15:15 +01:00
Tim van der Meij
d5e5d18430
Convert the PDFDocument class in src/core/document.js to ES6 syntax 2018-12-30 13:54:43 +01:00
Tim van der Meij
612fc9fcc2
Convert the Page class in src/core/document.js to ES6 syntax 2018-12-30 13:54:43 +01:00
Tim van der Meij
aad27ff9a0
Optimize the Ref class in src/core/primitives.js
The `toString` method always creates two string objects (for the 'R'
character and for the `num` concatenation) and in the worst case
creates three string objects (one more for the `gen` concatenation).
For the Tracemonkey paper alone, this resulted in 12000 string
objects when scrolling from the top to the bottom of the document.
Since this is a hot function, it's worth minimizing the number of string
objects, especially for large documents, to reduce peak memory usage.

This commit refactors the `toString` method to always create only one
string object.
2018-12-29 17:48:41 +01:00
Jonas Jenwald
60bcce184e Check that the first page can be successfully loaded, to try and ascertain the validity of the XRef table (issue 7496, issue 10326)
For PDF documents with sufficiently broken XRef tables, it's usually quite obvious when you need to fallback to indexing the entire file. However, for certain kinds of corrupted PDF documents the XRef table will, for all intents and purposes, appear to be valid. It's not until you actually try to fetch various objects that things will start to break, which is the case in the referenced issues[1].

Since there's generally a real effort being in made PDF.js to load even corrupt PDF documents, this patch contains a suggested approach to attempt to do a bit more validation of the XRef table during the initial document loading phase.

Here the choice is made to attempt to load the *first* page, as a basic sanity check of the validity of the XRef table. Please note that attempting to load a more-or-less arbitrarily chosen object without any context of what it's supposed to be isn't a very useful, which is why this particular choice was made.
Obviously, just because the first page can be loaded successfully that doesn't guarantee that the *entire* XRef table is valid, however if even the first page fails to load you can be reasonably sure that the document is *not* valid[2].

Even though this patch won't cause any significant increase in the amount of parsing required during initial loading of the document[3], it will require loading of more data upfront which thus delays the initial `getDocument` call.
Whether or not this is a problem depends very much on what you actually measure, please consider the following examples:

```javascript
console.time('first');
getDocument(...).promise.then((pdfDocument) => {
  console.timeEnd('first');
});

console.time('second');
getDocument(...).promise.then((pdfDocument) => {
  pdfDocument.getPage(1).then((pdfPage) => { // Note: the API uses `pageNumber >= 1`, the Worker uses `pageIndex >= 0`.
    console.timeEnd('second');
  });
});
```

The first case is pretty much guaranteed to show a small regression, however the second case won't be affected at all since the Worker caches the result of `getPage` calls. Again, please remember that the second case is what matters for the standard PDF.js use-case which is why I'm hoping that this patch is deemed acceptable.

---
[1] In issue 7496, the problem is that the document is edited without the XRef table being correctly updated.
In issue 10326, the generator was sorting the XRef table according to the offsets rather than the objects.

[2] The idea of checking the first page in particular came from the "standard" use-case for the PDF.js library, i.e. the default viewer, where a failure to load the first page basically means that nothing will work; note how `{BaseViewer, PDFThumbnailViewer}.setDocument` depends completely on being able to fetch the *first* page.

[3] The only extra parsing is caused by, potentially, having to traverse *part* of the `Pages` tree to find the first page.
2018-12-29 12:47:25 +01:00
Tim van der Meij
360c3d3813
Remove the unused url argument for the ChunkedStreamManager class 2018-12-24 13:14:42 +01:00
Tim van der Meij
47344197f4
Convert src/core/chunked_stream.js to ES6 syntax 2018-12-24 13:14:42 +01:00
Jonas Jenwald
b05f053287 [api-minor] Add support for OpenAction destinations (issue 10332)
Note that the OpenAction dictionary may contain other information besides just a destination array, e.g. instructions for auto-printing[1].
Given first of all that an arbitrary `Dict` cannot be sent from the Worker (since cloning would fail), and second of all that the data obviously needs to be validated, this patch purposely only adds support for fetching a destination from the OpenAction entry[2].

---
[1] This information is, currently in PDF.js, being included through the `getJavaScript` API method.

[2] This significantly reduces the complexity of the implementation, which seems fine for now. If there's ever need for other kinds of OpenAction to be fetched, additional API methods could/should be implemented as necessary (could e.g. follow the `getOpenActionWhatever` naming scheme).
2018-12-19 11:45:16 +01:00
Jonas Jenwald
ba2edeae18 [api-minor] Add support, in getMetadata, for custom information dictionary entries (issue 5970, issue 10344) (#10346)
The custom entries, provided that they exist *and* that their types are safe to include, are exposed through a new `Custom` infoDict entry to clearly separate them from the standard ones.

Fixes 5970.
Fixes 10344.
2018-12-18 23:26:02 +01:00
Jonas Jenwald
437fb8a8a7 Ignore the fieldValue for Signature annotations, since they're currently unsupported (issue 10374)
Given that Signature (Widget) annotations are currently not supported, since they cannot be validated, simply ignoring the `fieldValue` seems OK for now considering that attempting to blindly include unparsed/unvalidated data isn't very useful.

Fixes 10347.
2018-12-12 18:01:43 +01:00
Jonas Jenwald
d0fec7c6fb Fix NameOrNumberTree.get to actually perform a binary search to find the requested key
The intent of the code, based on existing comments, is to perform a binary search. However, because of what appears to be a typo in the code responsible for computing the current search index, this code is always checking *every* entry (albeit only at the "final" node) starting from the last one.
2018-11-23 23:52:33 +01:00
Jonas Jenwald
fdad0a0b0b Fallback to an exhaustive search, in corrupt PDF files, for NameTrees/NumberTrees that are not correctly ordered (issue 10272)
According to the specification, see https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G6.2384179, the keys of NameTree/NumberTree should be ordered.
For corrupt PDF files, which violate this assumption, we thus need to fallback to an exhaustive search in order to e.g. find all destinations.

*Please note:* Given that this only implements a fallback for the "final" node of the Tree, there's obviously a risk that the patch isn't sufficient for dealing with all kinds of out-of-order corruption. However, this kind of problem should be rare in practice, and without a real-world test-case it's difficult to implement a completely general solution (and there's obviously a question if you'd even want to).
2018-11-20 17:50:47 +01:00
Jani Pehkonen
9e990f6f3e Repair CFF fonts if stem hints are in wrong order 2018-11-20 18:50:37 +02:00
Jonas Jenwald
4829f567c1 Move the interface definitions out of src/core/worker.js and into their own file
These interfaces are already used in different files, in both the `src/core/` and `src/display/` folders, and having them reside in their own file seems a lot clearer and is also similar to the existing viewer interfaces.

As part of moving the `interface` definitions, they're also converted to ES6 classes.
2018-11-08 13:21:37 +01:00
Jonas Jenwald
a963d139dc Convert src/core/ps_parser.js to use ES6 classes
Besides being a fairly small and self-contained file, this code also shows a possible way of defining static constants on classes.
2018-11-03 17:43:06 +01:00
Jani Pehkonen
9cd5f94f03 Normalize the BBox of form XObjects on the /core side 2018-10-22 14:17:05 +03:00
Simon Leblanc
b5806735d8 Add support of Ink annotation 2018-10-03 00:28:49 +02:00
Jonas Jenwald
ad3e937816 Replace the Font.loading property with, the already existing, Font.missingFile property
The `Font.loading` property is only ever used *once* in the code, whereas `Font.missingFile` is more widely used. Furthermore the name `loading` feels, at least to me, slight less clear than `missingFile`. Finally, note that these two properties are the inverse of each other.
2018-09-29 15:57:04 +02:00